[ https://issues.apache.org/jira/browse/SOLR-12820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636921#comment-16636921 ]

Yonik Seeley commented on SOLR-12820:
-------------------------------------

bq. // Trying to find the cardinality for the matchingDocs would be expensive.

The heuristic I had in mind would just use the cardinality of the whole field 
in conjunction with fcontext.base.size().
For example, if one is faceting on US states (50 values), you're pretty much 
always going to want to use the array approach; comparing to maxDoc isn't too 
meaningful here.

Even though it may not be implemented yet, we should also keep multi-valued 
fields in mind when thinking about the API access/control for this.
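To make the shape of that heuristic concrete, here is a minimal sketch. The class name, method names, and both thresholds are illustrative placeholders, not values or code from Solr; the real decision would live in FacetField's processor selection, with the whole-field cardinality available cheaply from doc values (e.g. Lucene's SortedSetDocValues#getValueCount).

```java
// Hypothetical sketch of the auto-pick heuristic discussed above.
// fieldCardinality: unique term count for the whole field (cheap via doc values);
// baseSize: fcontext.base.size(), i.e. the number of matching docs.
class FacetMethodPicker {
    enum Method { ARRAY_UIF, DVHASH }

    // Thresholds are illustrative only, not tuned values from Solr.
    static final long LOW_CARDINALITY_CUTOFF = 1024;   // e.g. US states (50) -> array
    static final long CARDINALITY_TO_BASE_RATIO = 4;

    static Method pick(long fieldCardinality, int baseSize) {
        // A small global cardinality (e.g. 50 US states) always favors the
        // array approach, regardless of how many docs matched.
        if (fieldCardinality <= LOW_CARDINALITY_CUTOFF) {
            return Method.ARRAY_UIF;
        }
        // When the field has far more unique values than matching docs, hashing
        // only the values actually seen beats allocating a count slot for
        // every term in the field.
        if (fieldCardinality > (long) baseSize * CARDINALITY_TO_BASE_RATIO) {
            return Method.DVHASH;
        }
        return Method.ARRAY_UIF;
    }

    public static void main(String[] args) {
        System.out.println(pick(50, 1_000_000));      // low cardinality -> ARRAY_UIF
        System.out.println(pick(10_000_000, 5_000));  // sparse match -> DVHASH
    }
}
```

Note the first check is what makes this different from comparing against maxDoc: the 50-state case picks the array approach even when the base DocSet is tiny.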

> Auto pick method:dvhash based on thresholds
> -------------------------------------------
>
>                 Key: SOLR-12820
>                 URL: https://issues.apache.org/jira/browse/SOLR-12820
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Facet Module
>            Reporter: Varun Thacker
>            Priority: Major
>
> I worked with two users last week where explicitly using method:dvhash 
> improved the faceting speeds drastically.
> The common theme in both use-cases was: one collection hosting data for 
> multiple users. We always filter documents for one user (thereby limiting 
> the number of documents drastically) and then perform a complex nested 
> JSON facet.
> Both use-cases fit perfectly in this criteria that [[email protected]] 
> mentioned on SOLR-9142:
> {quote}faceting on a string field with a high cardinality compared to its 
> domain is less efficient than it could be.
> {quote}
> And DVHASH was the perfect optimization for these use-cases.
> We are using the facet stream expression in one of the use-cases, which 
> doesn't expose the method param. We could expose the method param to facet 
> stream, but I feel the better approach to solve this problem would be to 
> address this TODO in the code within the JSON Facet Module:
> {code:java}
>       if (mincount > 0 && prefix == null && (ntype != null || method == FacetMethod.DVHASH)) {
>         // TODO can we auto-pick for strings when term cardinality is much greater than DocSet cardinality?
>         //   or if we don't know cardinality but DocSet size is very small
>         return new FacetFieldProcessorByHashDV(fcontext, this, sf);
> {code}
> I thought about this a little, and this is the approach I'm currently 
> thinking of to tackle the problem:
> {code:java}
> int matchingDocs = fcontext.base.size();
> int totalDocs = fcontext.searcher.getIndexReader().maxDoc();
> // If matchingDocs is close to totalDocs then we aren't filtering many documents,
> // which means the array approach would probably be better than the dvhash approach.
> // Trying to find the cardinality for the matchingDocs would be expensive.
> // Also, for totalDocs we don't have a global cardinality present at index time,
> // only a per-segment cardinality.
> // So using the number of matches as an alternate heuristic would do the job here?
> {code}
> Any thoughts on whether this approach makes sense? It could be that I'm 
> thinking of this approach just because both the users I worked with last 
> week fell in this category.
>  
> cc [~dsmiley] [~joel.bernstein]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
