[
https://issues.apache.org/jira/browse/SOLR-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hoss Man updated SOLR-6968:
---------------------------
Attachment: SOLR-6968.patch
really simple straw man implementation using java-hll...
https://github.com/aggregateknowledge/java-hll
The bulk of the current patch is in test refactoring because all the special
case conditionals in StatsComponentTest.testIndividualStatLocalParams were
driving me insane.
Currently only cardinality of numeric fields is supported (and even then, only
long fields really work "correctly"). Current syntax is...
{noformat}
/select?q=*:*&stats=true&stats.field={!cardinality=true}fieldname_l
{noformat}
...but i'm thinking that should change ... there are at least two types of knobs
we should support, i'm just not sure which is more important, or if either
should be mandatory:
* An indication of whether or not the input is already hashed
** reading up more on HLL i'm realizing how important it is that the values be
hashed (into longs).
** We should certainly support on the fly hashing, but for people who plan to
compute cardinalities a lot, particularly over large sets or strings, we should
also have both:
*** an easy way for them to compute those long hashes at index time (simple
UpdateProcessor)
*** a stats localparam to indicate that the field they are computing cardinality
over is already hashed
* precision / size tuning
** similar to how we have an optional "tdigestCompression" param we could have
an "hllOptions" param for overriding the "log2m" and "regwidth" options
** or we could require that the value of the "cardinality" param be a value
indicating how much the user cares about accuracy vs ram (ie: a float between 0
and 1 indicating min ram vs max accuracy) and compute log2m+regwidth from that
("false" or negative values could disable it completely, while "true" could be
shorthand for some default)
*** this would have the benefit of being something we could continue to support
even if a better cardinality algorithm comes along in the future
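To make the second precision option more concrete, here is a rough sketch of how a 0 to 1 "accuracy" value might be turned into log2m + regwidth. Everything in it is hypothetical and not part of the patch: the class and method names, the linear interpolation, and the parameter bounds (which are assumed to match the ranges java-hll documents for its constructor and should be checked against the library). The error estimate is the usual HyperLogLog approximation of 1.04/sqrt(m) for m = 2^log2m registers.

```java
/**
 * Hypothetical sketch only: maps a user-facing accuracy value in [0, 1]
 * onto HLL tuning options. Not taken from the SOLR-6968 patch.
 */
final class HllOptionsSketch {
    // Assumed bounds; believed to match java-hll's documented limits.
    static final int MIN_LOG2M = 4;
    static final int MAX_LOG2M = 30;
    static final int MIN_REGWIDTH = 1;
    static final int MAX_REGWIDTH = 8;

    private HllOptionsSketch() {}

    /** 0.0 means "minimum ram", 1.0 means "maximum accuracy". */
    static int log2m(double accuracy) {
        return interpolate(accuracy, MIN_LOG2M, MAX_LOG2M);
    }

    static int regwidth(double accuracy) {
        return interpolate(accuracy, MIN_REGWIDTH, MAX_REGWIDTH);
    }

    /** Standard HyperLogLog relative error estimate: 1.04 / sqrt(2^log2m). */
    static double expectedRelativeError(int log2m) {
        return 1.04 / Math.sqrt(Math.pow(2, log2m));
    }

    // Linear interpolation between min and max, rounded to the nearest int.
    private static int interpolate(double accuracy, int min, int max) {
        // Written this way so that NaN is rejected as well.
        if (!(accuracy >= 0.0 && accuracy <= 1.0)) {
            throw new IllegalArgumentException("accuracy must be in [0, 1]: " + accuracy);
        }
        return min + (int) Math.round(accuracy * (max - min));
    }
}
```

With this mapping an accuracy of 0.5 would give log2m=17 and regwidth=5. A straight linear scale is probably too naive at the top end (log2m=30 implies roughly a billion registers), so a real implementation would likely cap or curve it.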
My next steps are to focus on more concrete tests & then refactoring to work
with other field types, and think about the knobs/configuration as i go.
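For the non-long field types, the on-the-fly hashing discussed above could look roughly like the sketch below. It is an illustration under stated assumptions, not code from the patch: the class and method names are made up, and it uses 64-bit FNV-1a only because that fits in a few self-contained lines. A stronger hash (java-hll's docs suggest something like MurmurHash3) would be the better choice in practice. The "preHashed" flag stands in for the proposed localparam saying the field already contains long hashes.

```java
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical sketch only: turns field values into the long hashes
 * an HLL expects. FNV-1a is a stand-in for a stronger 64-bit hash.
 */
final class CardinalityHashSketch {
    private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    private CardinalityHashSketch() {}

    /** 64-bit FNV-1a over raw bytes: xor in each byte, then multiply by the prime. */
    static long hashBytes(byte[] bytes) {
        long h = FNV_OFFSET_BASIS;
        for (byte b : bytes) {
            h ^= (b & 0xffL);
            h *= FNV_PRIME; // overflow wraps mod 2^64, as FNV-1a requires
        }
        return h;
    }

    static long hashString(String s) {
        return hashBytes(s.getBytes(StandardCharsets.UTF_8));
    }

    /**
     * Returns the long that would be fed to the HLL for one field value.
     * If the field was hashed at index time, the stored long is used as is;
     * otherwise the value's string form is hashed on the fly.
     */
    static long toHllInput(Object value, boolean preHashed) {
        if (preHashed) {
            if (!(value instanceof Long)) {
                throw new IllegalArgumentException("pre-hashed fields must contain longs");
            }
            return (Long) value;
        }
        return hashString(String.valueOf(value));
    }
}
```

An index-time UpdateProcessor could call the same hashString logic once per document and store the result in a long field, so that query-time stats only ever take the preHashed path.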
> add hyperloglog in statscomponent as an approximate count
> ---------------------------------------------------------
>
> Key: SOLR-6968
> URL: https://issues.apache.org/jira/browse/SOLR-6968
> Project: Solr
> Issue Type: Sub-task
> Reporter: Hoss Man
> Attachments: SOLR-6968.patch
>
>
> stats component currently supports "calcDistinct" but it's terribly
> inefficient -- especially in distributed mode.
> we should add support for using hyperloglog to compute an approximate count
> of distinct values (using localparams via SOLR-6349 to control the precision
> of the approximation)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)