[
https://issues.apache.org/jira/browse/LUCENE-7994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16206903#comment-16206903
]
Robert Muir commented on LUCENE-7994:
-------------------------------------
I am confused about the heuristic. Can you explain it?
{code}
return taxoReaderSize < 1024 || sumTotalHits < taxoReaderSize/10;
{code}
For the first condition, isn't taxoReaderSize essentially the cardinality? Why
would we want a sparse hashtable in this low-cardinality case? I would think
the opposite: a simple array should be best, since it will be small.
And the second condition confuses me too, because we seem to be comparing
apples and oranges. Wouldn't we instead only look at sumTotalHits/maxDoc (what
% of the docs the query matches) when taxoReaderSize > 1024k? If it's only 10%
of the docs in the collection, we infer that an array could be very wasteful.
Of course we don't know the distribution, but it's just a heuristic.
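A rough sketch of the alternative described above, to make the comparison concrete. This is not code from the patch: the class name, method name, and the maxDoc parameter are hypothetical, and the 1024 cutoff is simply borrowed from the snippet quoted earlier. It returns true when counting should go into the sparse int/int hash map and false when a dense int[] should be used.

```java
/** Hypothetical helper for illustration only; not part of the LUCENE-7994 patch. */
final class SparseCountHeuristic {

    // Assumed cutoff, borrowed from the patch's first condition.
    // A dense int[] below this size costs at most about 4 KB.
    static final int SMALL_TAXONOMY = 1024;

    /**
     * @param taxoReaderSize number of ordinals in the taxonomy (the cardinality)
     * @param sumTotalHits   number of documents the query matched
     * @param maxDoc         number of documents in the index
     * @return true to count sparsely (hash map), false to count densely (int[])
     */
    static boolean useSparseCounts(int taxoReaderSize, long sumTotalHits, int maxDoc) {
        if (taxoReaderSize < SMALL_TAXONOMY) {
            // Low cardinality: the dense array is small, so a hash map buys nothing.
            return false;
        }
        // High cardinality: go sparse only when the query matches under 10% of
        // the docs. Multiplying instead of dividing avoids integer truncation.
        return sumTotalHits * 10L < (long) maxDoc;
    }
}
```

With these assumptions, a 500-ordinal taxonomy always counts densely, while a 1,000,000-ordinal taxonomy counts sparsely only when the query matches fewer than 10% of the documents. The distribution of hits across ordinals is still unknown, so this remains a heuristic.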
> Use int/int hash map for int taxonomy facet counts
> --------------------------------------------------
>
> Key: LUCENE-7994
> URL: https://issues.apache.org/jira/browse/LUCENE-7994
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: master (8.0), 7.2
>
> Attachments: LUCENE-7994.patch
>
>
> Int taxonomy facets today always count into a dense {{int[]}}, which is
> wasteful in cases where the number of unique facet labels is high and the
> size of the current result set is small.
> I factored the native hash map from LUCENE-7927 and use a simple heuristic
> (customizable by the user by subclassing) to decide up front whether to count
> sparse or dense. I also made loading of the large children and siblings
> {{int[]}} lazy, so that they are only instantiated if you really need them.