[ 
https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456048#comment-17456048
 ] 

Greg Miller commented on LUCENE-10281:
--------------------------------------

Yeah, +1 to not considering this a bug (but I'm a little biased I suppose since 
I wrote this). As you point out, the heuristic would be better if we knew how 
many of the hits actually had values in the SSDV field, but it's expensive to 
determine that up-front. So the current heuristic (which is just a heuristic 
and could be flawed in a number of ways) assumes all the hits have a value.
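
For reference, here is a rough sketch of the decision as the issue describes 
it: dense counts when cardinality < 1024, a sparse map when totalHits < 
totalDocs / 10, and dense counts otherwise. The class, enum and method names 
are made up for illustration and are not the actual Lucene source; only the 
two thresholds come from the issue text.

```java
// Illustrative sketch only; names are hypothetical, not Lucene's code.
final class CountStrategySketch {
  enum Strategy { DENSE_ARRAY, SPARSE_MAP }

  /**
   * Picks a counting structure the way the issue describes the heuristic.
   * It implicitly assumes every hit has a value in the SSDV field.
   */
  static Strategy choose(int cardinality, int totalHits, int totalDocs) {
    if (cardinality < 1024) {
      // Small ordinal space: a dense int[] is cheap regardless of hit count.
      return Strategy.DENSE_ARRAY;
    }
    if (totalHits < totalDocs / 10) {
      // Fewer than ~10% of docs matched: treat the hits as sparse.
      return Strategy.SPARSE_MAP;
    }
    return Strategy.DENSE_ARRAY;
  }
}
```

Note that the comparison uses totalHits and totalDocs as-is, so hits or docs 
without a value in the field are counted on both sides.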

> Error condition used to judge whether hits are sparse in 
> StringValueFacetCounts
> -------------------------------------------------------------------------------
>
>                 Key: LUCENE-10281
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10281
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>    Affects Versions: 8.11
>            Reporter: Lu Xugang
>            Priority: Minor
>         Attachments: 1.jpg
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Description:
> In the constructor StringValueFacetCounts(StringDocValuesReaderState 
> state, FacetsCollector facetsCollector), if a facetsCollector is provided, 
> the condition *(totalHits < totalDocs / 10)* is used to decide whether to 
> use an IntIntHashMap (the sparse representation) to store term ords and 
> counts.
> But a hit counted in totalHits does not necessarily have an SSDV value, and 
> neither does a doc counted in totalDocs. So the right calculation should be 
> *(totalHits has SSDV) / (totalDocs has SSDV)*. *(totalDocs has SSDV)* is 
> easy to get via SortedSetDocValues#getValueCount(). *(totalHits has SSDV)* 
> is hard to get because we can only read the index by the docIds provided by 
> FacetsCollector, and computing it that way is slow and redundant.
> Solution:
> If we don't want to break the old logic (use denseCounts when cardinality 
> < 1024, use IntIntHashMap below the 10% threshold, and use denseCounts in 
> all other cases), then we could still use denseCounts if cardinality < 
> 1024 and otherwise start with IntIntHashMap, switching to denseCounts once 
> 10% of the unique terms have been collected.
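
The solution proposed in the quoted description could look roughly like the 
sketch below: start sparse when cardinality >= 1024 and migrate to a dense 
array once the number of distinct ords seen reaches 10% of the cardinality. 
This is only one reading of the proposal. The class name is hypothetical, and 
a java.util.HashMap stands in for IntIntHashMap to keep the example 
self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the proposal; not Lucene code.
final class AdaptiveOrdCounterSketch {
  private final int cardinality;
  private int[] dense;                  // non-null once in dense mode
  private Map<Integer, Integer> sparse; // non-null while in sparse mode

  AdaptiveOrdCounterSketch(int cardinality) {
    this.cardinality = cardinality;
    if (cardinality < 1024) {
      dense = new int[cardinality];     // keep the existing small-field rule
    } else {
      sparse = new HashMap<>();
    }
  }

  void increment(int ord) {
    if (dense != null) {
      dense[ord]++;
      return;
    }
    sparse.merge(ord, 1, Integer::sum);
    // Switch once 10% of the unique terms have been collected.
    if (sparse.size() >= cardinality / 10) {
      dense = new int[cardinality];
      for (Map.Entry<Integer, Integer> e : sparse.entrySet()) {
        dense[e.getKey()] = e.getValue();
      }
      sparse = null;
    }
  }

  int count(int ord) {
    return dense != null ? dense[ord] : sparse.getOrDefault(ord, 0);
  }

  boolean isDense() {
    return dense != null;
  }
}
```

The trade-off in this reading is a one-time copy from the map into the array 
at the switch point, in exchange for not needing to know up-front how many 
hits have a value.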



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
