[
https://issues.apache.org/jira/browse/SOLR-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697454#comment-13697454
]
Terrance A. Snyder commented on SOLR-2242:
------------------------------------------
[~otis] I got the email. I'll give some background, since we've enhanced and
combined several things, but I should be able to put together a patch in the
following week. There is an old version on GitHub that I need to update to
trunk, and I'll spend time doing that. Most of this work was enhancing two
existing JIRA items, which are wonderful.
Core Work:
https://issues.apache.org/jira/browse/SOLR-2894
https://issues.apache.org/jira/browse/SOLR-3583
Newer features:
+ Some of the issues that have been discussed around distributed counting have
already been solved in larger installations (counting billions of items). I
work in the advertising space, where we count, slice, and dice things. Sending
facet counts between shards across 90+ billion documents for highly unique
fields such as session ID or cookie ID is hugely wasteful and doesn't scale.
+ The ad industry is great at counting stuff "at scale": sessions, web events,
etc. We take the stance that counts can be "roughly" right once we get to
billions: an error rate of plus or minus 0-1.5% is OK when the response time
drops from minutes to milliseconds. As such, optional parameters for
"estimated count" are added, which leverage a HyperLogLog implementation to
return a response that is about 98.5% accurate. This is turned on by default
for us on a large installation (multiple billions of POS transactions).
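To illustrate why a sketch helps with distributed distinct counts, here is a minimal HyperLogLog written from scratch in Python. It is not stream-lib's implementation and not the code from the patch; the class name, the register size (p=14, roughly 0.8% standard error), and the session-ID data are all assumptions made for the example. The point it shows is that each shard only needs to send a fixed-size register array, and the coordinator merges them with an element-wise max.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch (hypothetical, not stream-lib's API)."""

    def __init__(self, p=14):
        self.p = p                     # number of bits used to pick a register
        self.m = 1 << p                # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # Derive a 64-bit hash from the first 8 bytes of SHA-1.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        bits = 64 - self.p
        idx = h >> bits                        # top p bits select the register
        rest = h & ((1 << bits) - 1)           # remaining bits
        rank = bits - rest.bit_length() + 1    # 1-based position of first 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Merging two sketches is an element-wise max of their registers.
        assert self.p == other.p
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


def estimate_union():
    """Two hypothetical shards whose session IDs overlap by 20,000 values."""
    shard_a_ids = range(0, 60_000)
    shard_b_ids = range(40_000, 100_000)
    sketch_a, sketch_b = HyperLogLog(), HyperLogLog()
    for sid in shard_a_ids:
        sketch_a.add(sid)
    for sid in shard_b_ids:
        sketch_b.add(sid)
    naive_sum = len(shard_a_ids) + len(shard_b_ids)  # double-counts the overlap
    sketch_a.merge(sketch_b)                         # what the coordinator would do
    return naive_sum, sketch_a.count()


if __name__ == "__main__":
    naive, estimate = estimate_union()
    print("sum of per-shard exact counts:", naive)
    print("merged HyperLogLog estimate:  ", round(estimate))
```

With p=14 each shard ships 16,384 small registers regardless of how many distinct values it holds, whereas an exact distributed count has to exchange the distinct values themselves.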
*Questions as I'd like to actually do this right*
+ Rather than reinvent the wheel, I use stream-lib
(https://github.com/clearspring/stream-lib). It is Apache-licensed and includes
HyperLogLog, HyperLogLogPlus, BloomFilters, TopK, QDigest, etc. Is this an
issue?
+ Test cases - I've got 82% code coverage - is this good enough?
+ Documentation - I've got markdown documents that cover the commands and
syntax - is this the right format?
+ SOLR-2894, SOLR-3583 - It makes logical sense for these to start being joined
together. When using all of them, Solr starts to smell like an analytics
engine (and a very nice one when combined with probabilistic data structures).
If someone can answer the above questions while I sync to /trunk please let me
know.
Old version, for posterity, until I get around to updating to the latest trunk
and including the HyperLogLog implementation. It has only minor updates and
does not include HyperLogLog sketching.
https://github.com/terrancesnyder/solr-analytics/blob/master/solr/core/src/java/org/apache/solr/handler/component/PivotFacetHelper.java
> Get distinct count of names for a facet field
> ---------------------------------------------
>
> Key: SOLR-2242
> URL: https://issues.apache.org/jira/browse/SOLR-2242
> Project: Solr
> Issue Type: New Feature
> Components: Response Writers
> Affects Versions: 4.0-ALPHA
> Reporter: Bill Bell
> Priority: Minor
> Fix For: 4.4
>
> Attachments: SOLR-2242-3x_5_tests.patch, SOLR-2242-3x.patch,
> SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.patch,
> SOLR-2242.shard.withtests.patch, SOLR-2242.solr3.1-fix.patch,
> SOLR-2242.solr3.1.patch, SOLR.2242.solr3.1.patch, SOLR-2242-solr40-3.patch
>
>
> When returning facet.field=<name of field> you will get a list of matches for
> distinct values. This is normal behavior. This patch tells you how many
> distinct values you have (# of rows). Use with limit=-1 and mincount=1.
> The feature is called "namedistinct". Here is an example:
> Parameters:
> facet.numTerms or f.<field>.facet.numTerms = true (default is false) - turn
> on distinct counting of terms
> facet.field - the field to count the terms
> It creates a new section in the facet section...
> http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numTerms=true&facet.limit=-1&facet.field=price
> http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numTerms=false&facet.limit=-1&facet.field=price
> This currently only works on facet.field.
> {code}
> <lst name="facet_counts">
> <lst name="facet_queries"/>
> <lst name="facet_fields">...</lst>
> <lst name="facet_numTerms">
> <lst name="localhost:8983/solr/">
> <int name="price">14</int>
> </lst>
> <lst name="localhost:8080/solr/">
> <int name="price">14</int>
> </lst>
> </lst>
> <lst name="facet_dates"/>
> <lst name="facet_ranges"/>
> </lst>
> OR with no sharding-
> <lst name="facet_numTerms">
> <int name="price">14</int>
> </lst>
> {code}
> Several people use this to get the group.field count (the # of groups).
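As a rough sketch of how a client might use the parameters quoted above, the following Python builds a request URL with facet.numTerms and reads the facet_numTerms section out of an XML response. The parameter names and the response layout are taken from the issue description, so they reflect the proposed patch rather than any released Solr API; the helper names num_terms_url and parse_num_terms are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def num_terms_url(base, field, shards=None):
    """Build a select URL that asks for the distinct-term count of one field."""
    params = [
        ("q", "*:*"),
        ("facet", "true"),
        ("facet.field", field),
        ("facet.numTerms", "true"),   # parameter proposed in SOLR-2242
        ("facet.mincount", "1"),
        ("facet.limit", "-1"),
    ]
    if shards:
        params.append(("shards", ",".join(shards)))
    return base + "?" + urlencode(params)


def parse_num_terms(xml_text):
    """Return {shard: {field: count}}; the key "" is used when not sharded."""
    root = ET.fromstring(xml_text)
    if root.get("name") == "facet_numTerms":
        section = root
    else:
        section = root.find(".//lst[@name='facet_numTerms']")
    if section is None:
        return {}
    result = {}
    for child in section:
        if child.tag == "int":       # non-sharded: counts sit directly in the section
            result.setdefault("", {})[child.get("name")] = int(child.text)
        elif child.tag == "lst":     # sharded: one nested list per shard
            result[child.get("name")] = {
                item.get("name"): int(item.text) for item in child.findall("int")
            }
    return result


# Sample response copied from the issue description above.
SHARDED = """
<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_numTerms">
    <lst name="localhost:8983/solr/"><int name="price">14</int></lst>
    <lst name="localhost:8080/solr/"><int name="price">14</int></lst>
  </lst>
</lst>
"""

if __name__ == "__main__":
    print(num_terms_url("http://localhost:8983/solr/select", "price",
                        shards=["localhost:8983/solr", "localhost:7574/solr"]))
    print(parse_num_terms(SHARDED))
```

Note that the sharded response reports one count per shard. Those numbers cannot simply be added to get a global distinct count, because the same term may appear on more than one shard.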