[jira] [Commented] (SOLR-2242) Get distinct count of names for a facet field

Terrance A. Snyder (JIRA) Mon, 01 Jul 2013 20:11:28 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697454#comment-13697454
 ]


Terrance A. Snyder commented on SOLR-2242:
------------------------------------------

[~otis] I got the email. I'll give some background, since we've enhanced and 
combined several pieces, but I should be able to put together a patch in the 
following week. There is an old version on GitHub that I need to update to 
trunk, and I'll spend time doing that. Most of this work was enhancing two 
existing JIRA items, which are wonderful.

Core Work:
https://issues.apache.org/jira/browse/SOLR-2894
https://issues.apache.org/jira/browse/SOLR-3583

Newer features:

+ Some of the issues that have been discussed around distributed counting have 
already been solved in larger installations (counting billions of items). I 
work in the advertising space, where we count, slice, and dice things. Sending 
facet counts between shards for 90+ billion documents on highly unique fields 
such as session ID or cookie ID is hugely wasteful and doesn't scale.

+ The ad industry is great at counting stuff "at scale": sessions, web events, 
etc. We take the stance that counts can be "roughly" right once we get to 
billions; an error rate of plus or minus 0-1.5% is OK when the response time 
goes from minutes to milliseconds. As such, optional parameters for "estimated 
count" have been added, which leverage a HyperLogLog implementation to give a 
roughly 98.5% accurate response. By default this is turned on for us, on a 
large installation (multiple billions of POS transactions).
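
To illustrate why a sketch helps with distributed counting, here is a minimal 
HyperLogLog in Python. It is only an illustration under stated assumptions, not 
the stream-lib implementation and not the code in the patch: the class name, 
method names, the SHA-1 based hash, and the choice of p=14 (16384 registers, 
standard error of roughly 1.04/sqrt(16384), about 0.8%) are all mine. The point 
is that each shard only has to ship a fixed-size register array, and the 
arrays merge with an element-wise max.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch (illustrative; names are hypothetical)."""

    def __init__(self, p=14):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def _hash(self, value):
        # Deterministic 64-bit hash taken from the first 8 bytes of SHA-1.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value):
        x = self._hash(value)
        width = 64 - self.p
        idx = x >> width                # top p bits select the register
        rest = x & ((1 << width) - 1)   # remaining bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Merging two sketches is an element-wise max over the registers.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        self.registers = [max(a, b)
                          for a, b in zip(self.registers, other.registers)]

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


def estimate_union(shards):
    """Build one sketch per shard, merge them, and return the estimate."""
    merged = HyperLogLog()
    for values in shards:
        sketch = HyperLogLog()
        for v in values:
            sketch.add(v)
        merged.merge(sketch)
    return merged.cardinality()
```

Calling estimate_union with one iterable of session IDs per shard returns an 
approximate distinct count for the union, even when the same ID shows up on 
more than one shard.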

*Questions as I'd like to actually do this right*

+ Rather than reinvent the wheel, I use stream-lib 
(https://github.com/clearspring/stream-lib). It is Apache licensed and includes 
HyperLogLog, HyperLogLogPlus, BloomFilters, TopK, QDigest, etc. Is depending 
on it an issue?

+ Test cases - I've got 82% code coverage - is this good enough?

+ Documentation - I've got markdown documents that cover the commands and 
syntax - is this the right format?

+ SOLR-2894, SOLR-3583 - It makes logical sense that these start to be joined 
together. When using all of these, I sometimes start smelling Solr as an 
analytics engine (and it's a very nice one when combined with probabilistic 
data structures).

If someone can answer the above questions while I sync to /trunk please let me 
know.

Old version, for posterity, until I get around to updating to the latest 
trunk. It contains only minor updates and does not include the HyperLogLog 
sketching.
https://github.com/terrancesnyder/solr-analytics/blob/master/solr/core/src/java/org/apache/solr/handler/component/PivotFacetHelper.java
                
> Get distinct count of names for a facet field
> ---------------------------------------------
>
>                 Key: SOLR-2242
>                 URL: https://issues.apache.org/jira/browse/SOLR-2242
>             Project: Solr
>          Issue Type: New Feature
>          Components: Response Writers
>    Affects Versions: 4.0-ALPHA
>            Reporter: Bill Bell
>            Priority: Minor
>             Fix For: 4.4
>
>         Attachments: SOLR-2242-3x_5_tests.patch, SOLR-2242-3x.patch, 
> SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.patch, 
> SOLR-2242.shard.withtests.patch, SOLR-2242.solr3.1-fix.patch, 
> SOLR-2242.solr3.1.patch, SOLR.2242.solr3.1.patch, SOLR-2242-solr40-3.patch
>
>
> When returning facet.field=<name of field> you will get a list of matches for 
> distinct values. This is normal behavior. This patch tells you how many 
> distinct values you have (# of rows). Use with limit=-1 and mincount=1.
> The feature is called "namedistinct". Here is an example:
> Parameters:
> facet.numTerms or f.<field>.facet.numTerms = true (default is false) - turn 
> on distinct counting of terms
> facet.field - the field to count the terms
> It creates a new section in the facet section...
> http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numTerms=true&facet.limit=-1&facet.field=price
> http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numTerms=false&facet.limit=-1&facet.field=price
> This currently only works on facet.field.
> {code}
> <lst name="facet_counts">
> <lst name="facet_queries"/>
> <lst name="facet_fields">...</lst>
> <lst name="facet_numTerms">
> <lst name="localhost:8983/solr/">
> <int name="price">14</int>
> </lst>
> <lst name="localhost:8080/solr/">
> <int name="price">14</int>
> </lst>
> </lst>
> <lst name="facet_dates"/>
> <lst name="facet_ranges"/>
> </lst>
> OR with no sharding-
> <lst name="facet_numTerms">
> <int name="price">14</int>
> </lst>
> {code} 
> Several people use this to get the group.field count (the # of groups).
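
For anyone consuming the response shape quoted above, here is a small Python 
sketch that reads the facet_numTerms section. It assumes the XML layout shown 
in the issue description (per-shard lst elements when sharded, plain int 
elements otherwise); the function name parse_num_terms and the sample string 
are hypothetical, and this is not part of the patch.

```python
import xml.etree.ElementTree as ET

# Sample modelled on the response in the issue description.
SHARDED_SAMPLE = """<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_numTerms">
    <lst name="localhost:8983/solr/">
      <int name="price">14</int>
    </lst>
    <lst name="localhost:8080/solr/">
      <int name="price">14</int>
    </lst>
  </lst>
</lst>"""


def parse_num_terms(xml_text):
    """Return {shard_name_or_None: {field: distinct_count}}.

    The key None is used for the non-sharded layout, where the int
    elements sit directly under facet_numTerms.
    """
    root = ET.fromstring(xml_text)
    section = None
    for node in root.iter("lst"):           # iter() also visits the root
        if node.get("name") == "facet_numTerms":
            section = node
            break
    if section is None:
        return {}

    result = {}
    direct = {i.get("name"): int(i.text) for i in section.findall("int")}
    if direct:
        result[None] = direct
    for shard in section.findall("lst"):
        result[shard.get("name")] = {
            i.get("name"): int(i.text) for i in shard.findall("int")
        }
    return result
```

Note that the per-shard numbers cannot simply be added to get a global distinct 
count, because the same term may exist on more than one shard.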

