[ https://issues.apache.org/jira/browse/SOLR-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697454#comment-13697454 ]
Terrance A. Snyder commented on SOLR-2242:
------------------------------------------

[~otis] I got the email. I'll give some background, as we've enhanced and combined several things, but I should be able to put together a patch in the following week. There is an old version on GitHub that I need to update to trunk, and I'll spend time doing this. Most of this work was enhancing two existing JIRA items, which are wonderful.

Core Work:
https://issues.apache.org/jira/browse/SOLR-2894
https://issues.apache.org/jira/browse/SOLR-3583

Newer features:
+ Some of the issues that have been discussed around distributed counting have already been dealt with in larger installations (counting billions of items). I work in the advertising space, where we count, slice, and dice things. Sending counts between shards for 90+ billion documents on highly unique facet fields such as session ID or cookie ID is hugely wasteful and doesn't scale.
+ The ad industry is great at counting stuff "at scale": sessions, web events, etc. We take the stance that counts can be "roughly" right once we get to billions; an error rate of + or - 0-1.5% is OK when the response time goes from minutes to milliseconds. As such, an optional "estimated count" parameter is added, which leverages a HyperLogLog implementation to give a response that is about 98.5% correct. By default this is turned on for us, on a large installation (multiple billions of POS transactions).

*Questions as I'd like to actually do this right*
+ Rather than reinvent the wheel, I use stream-lib (https://github.com/clearspring/stream-lib). It is Apache licensed and includes HyperLogLog, HyperLogLogPlus, BloomFilters, TopK, QDigest, etc. Is this an issue?
+ Test cases - I've got 82% code coverage - is this good enough?
+ Documentation - I've got markdown documents that cover the commands and syntax - is this the right format?
+ SOLR-2894, SOLR-3583 - It makes logical sense that these start to be joined together.
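
To illustrate the "estimated count" idea above: a HyperLogLog sketch keeps a fixed-size array of registers, and two sketches merge by taking the register-wise maximum, so each shard can send its sketch instead of its full list of terms. The following is a minimal from-scratch Python sketch of that technique. It is not the stream-lib API and not the code from the patch; the class name, the precision p=14, the SHA-1 hashing, and the "session-N" ids are all assumptions made for the example.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog (illustrative only, not stream-lib's implementation)."""

    def __init__(self, p=14):
        self.p = p                      # number of index bits (assumed value)
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # Take 64 bits of a SHA-1 digest as the hash of the value.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width                # top p bits select a register
        rest = h & ((1 << width) - 1)   # remaining bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # The union of two sketches is the register-wise maximum.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)          # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)        # small-range (linear counting) correction
        return raw


# Two hypothetical shards whose session ids overlap (80,000 ids are on both).
shard_a = HyperLogLog()
shard_b = HyperLogLog()
for i in range(0, 120_000):
    shard_a.add(f"session-{i}")
for i in range(80_000, 200_000):
    shard_b.add(f"session-{i}")

# The coordinator merges the sketches instead of exchanging term lists.
merged = HyperLogLog()
merged.merge(shard_a)
merged.merge(shard_b)
estimate = merged.count()   # approximates the 200,000 distinct ids
```

Because merging is a register-wise maximum, the merged sketch is identical to one built from all ids on a single node, and ids present on both shards are not double counted. The accuracy depends on the number of registers chosen.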

When using all of these together I sometimes start to see Solr as an analytics engine (and it's a very nice one when combined with probabilistic data structures).

If someone can answer the above questions while I sync to /trunk, please let me know.

Old version, for posterity, until I get around to updating to the latest trunk and including the HyperLogLog implementation. It doesn't include HyperLogLog sketching and has only minor updates:
https://github.com/terrancesnyder/solr-analytics/blob/master/solr/core/src/java/org/apache/solr/handler/component/PivotFacetHelper.java

> Get distinct count of names for a facet field
> ---------------------------------------------
>
> Key: SOLR-2242
> URL: https://issues.apache.org/jira/browse/SOLR-2242
> Project: Solr
> Issue Type: New Feature
> Components: Response Writers
> Affects Versions: 4.0-ALPHA
> Reporter: Bill Bell
> Priority: Minor
> Fix For: 4.4
>
> Attachments: SOLR-2242-3x_5_tests.patch, SOLR-2242-3x.patch,
> SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.patch,
> SOLR-2242.shard.withtests.patch, SOLR-2242.solr3.1-fix.patch,
> SOLR-2242.solr3.1.patch, SOLR.2242.solr3.1.patch, SOLR-2242-solr40-3.patch
>
>
> When returning facet.field=<name of field> you will get a list of matches for
> distinct values. This is normal behavior. This patch tells you how many
> distinct values you have (# of rows). Use with facet.limit=-1 and facet.mincount=1.
> The feature is called "namedistinct". Here is an example:
> Parameters:
> facet.numTerms or f.<field>.facet.numTerms = true (default is false) - turns
> on distinct counting of terms
> facet.field - the field whose terms are counted
> It creates a new section in the facet section...
> http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numTerms=true&facet.limit=-1&facet.field=price
> http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numTerms=false&facet.limit=-1&facet.field=price
> http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numTerms=true&facet.limit=-1&facet.field=price
> This currently only works on facet.field.
> {code}
> <lst name="facet_counts">
>   <lst name="facet_queries"/>
>   <lst name="facet_fields">...</lst>
>   <lst name="facet_numTerms">
>     <lst name="localhost:8983/solr/">
>       <int name="price">14</int>
>     </lst>
>     <lst name="localhost:8080/solr/">
>       <int name="price">14</int>
>     </lst>
>   </lst>
>   <lst name="facet_dates"/>
>   <lst name="facet_ranges"/>
> </lst>
> OR with no sharding-
> <lst name="facet_numTerms">
>   <int name="price">14</int>
> </lst>
> {code}
> Several people use this to get the group.field count (the # of groups).
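
For reference, the request and response described in the issue above could be exercised from a client roughly as follows. This is a hedged Python sketch using only the standard library: the parameter names and the response layout are taken from the issue description, while the helper names (build_num_terms_url, parse_num_terms) and the trimmed sample responses are assumptions made for the example. It only applies to a Solr build that includes this patch.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_num_terms_url(base_url, field, shards=None):
    """Build a select URL with the facet.numTerms parameters from the issue."""
    params = [
        ("q", "*:*"),
        ("facet", "true"),
        ("facet.field", field),
        ("facet.mincount", "1"),
        ("facet.limit", "-1"),
        ("facet.numTerms", "true"),
    ]
    if shards:
        params.append(("shards", ",".join(shards)))
    return base_url + "?" + urlencode(params)


def parse_num_terms(xml_text):
    """Extract the facet_numTerms section.

    Returns {field: count} for a non-sharded response and
    {shard: {field: count}} for a sharded one.
    """
    root = ET.fromstring(xml_text)
    if root.get("name") == "facet_numTerms":
        section = root
    else:
        section = root.find(".//lst[@name='facet_numTerms']")
    if section is None:
        return {}
    result = {}
    for child in section:
        if child.tag == "int":        # non-sharded: counts sit directly in the section
            result[child.get("name")] = int(child.text)
        elif child.tag == "lst":      # sharded: one nested list per shard
            result[child.get("name")] = {
                e.get("name"): int(e.text) for e in child.findall("int")
            }
    return result


# Sample fragments, trimmed from the response shown in the issue description.
SHARDED_RESPONSE = """<lst name="facet_counts">
  <lst name="facet_numTerms">
    <lst name="localhost:8983/solr/"><int name="price">14</int></lst>
    <lst name="localhost:8080/solr/"><int name="price">14</int></lst>
  </lst>
</lst>"""

UNSHARDED_RESPONSE = """<lst name="facet_numTerms">
  <int name="price">14</int>
</lst>"""

url = build_num_terms_url(
    "http://localhost:8983/solr/select",
    "price",
    shards=["localhost:8983/solr", "localhost:7574/solr"],
)
per_shard = parse_num_terms(SHARDED_RESPONSE)
single_node = parse_num_terms(UNSHARDED_RESPONSE)
```

Note that the sharded form reports one count per shard. Adding those numbers together would overcount whenever the same term exists on more than one shard, so they should not be read as a global distinct count.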