[jira] [Commented] (SOLR-12820) Auto pick method:dvhash based on thresholds

David Smiley (JIRA) Tue, 02 Oct 2018 22:10:21 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-12820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636441#comment-16636441
 ]


David Smiley commented on SOLR-12820:
-------------------------------------

Makes sense to me.  It'd be nice to consider FacetMethod here as well, so that 
a user who sets the FacetMethod to "DV" gets the current ordinal array 
algorithm.  Or maybe the ratio could be configurable.  Looking back... 
hmm... I suppose if the ratio were configurable, then there would be no need 
for the DVHASH enum.

What ratio of docSet to numDocs?  Perhaps use hash at 1/16th or smaller?
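
A minimal sketch of the selection logic discussed above, not the actual Solr code. It assumes an explicit "DV" setting forces the array algorithm, an explicit "DVHASH" forces hashing, and otherwise hashing is picked when the DocSet is at most 1/16th of the index. The class name, the `Method` enum, `useHash`, and the ratio parameter are all hypothetical names introduced for illustration; 1/16 is just the tentative figure floated here.

```java
public class FacetMethodHeuristic {

    // Hypothetical stand-in for the facet method setting; not Solr's real enum.
    public enum Method { DEFAULT, DV, DVHASH }

    // Assumed threshold: hash when matchingDocs / totalDocs <= 1/16.
    public static final int DEFAULT_RATIO_DENOMINATOR = 16;

    /**
     * Returns true if the hash-based processor should be used.
     *
     * @param method           what the user asked for (DEFAULT means "auto")
     * @param matchingDocs     size of the base DocSet (e.g. fcontext.base.size())
     * @param totalDocs        size of the index (e.g. maxDoc())
     * @param ratioDenominator N in the threshold 1/N; assumed configurable
     */
    public static boolean useHash(Method method, int matchingDocs,
                                  int totalDocs, int ratioDenominator) {
        if (method == Method.DVHASH) {
            return true;   // explicit request for hashing
        }
        if (method == Method.DV) {
            return false;  // explicit request for the ordinal array algorithm
        }
        if (totalDocs <= 0 || ratioDenominator <= 0) {
            return false;  // nothing sensible to compare against; keep current behavior
        }
        // Compare in long arithmetic to avoid int overflow on large indexes.
        return (long) matchingDocs * ratioDenominator <= (long) totalDocs;
    }

    public static void main(String[] args) {
        int n = DEFAULT_RATIO_DENOMINATOR;
        // 50,000 of 1,000,000 docs is below 1/16th, so auto-pick chooses hash.
        System.out.println(useHash(Method.DEFAULT, 50_000, 1_000_000, n));
        // 100,000 of 1,000,000 docs is above 1/16th, so the array approach stays.
        System.out.println(useHash(Method.DEFAULT, 100_000, 1_000_000, n));
        // An explicit DV setting wins regardless of the ratio.
        System.out.println(useHash(Method.DV, 10, 1_000_000, n));
    }
}
```

With a configurable denominator, the explicit DVHASH value becomes mostly redundant, which is the point made above: a user could get the same effect by tuning the ratio.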

> Auto pick method:dvhash based on thresholds
> -------------------------------------------
>
>                 Key: SOLR-12820
>                 URL: https://issues.apache.org/jira/browse/SOLR-12820
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Facet Module
>            Reporter: Varun Thacker
>            Priority: Major
>
> I've worked with two users last week where explicitly using method:dvhash 
> improved the faceting speeds drastically.
> The common theme in both use-cases was:  One collection hosting data for 
> multiple users.  We always filter documents for one user (thereby limiting 
> the number of documents drastically) and then perform a complex nested 
> JSON facet.
> Both use-cases fit perfectly the criterion that [[email protected]] 
> mentioned on SOLR-9142
> {quote}faceting on a string field with a high cardinality compared to its 
> domain is less efficient than it could be.
> {quote}
> And DVHASH was the perfect optimization for these use-cases.
> We are using the facet stream expression in one of the use-cases which 
> doesn't expose the method param. We could expose the method param to facet 
> stream but I feel the better approach to solve this problem would be to 
> address this TODO in the code within the JSON Facet Module
> {code:java}
>       if (mincount > 0 && prefix == null && (ntype != null || method == FacetMethod.DVHASH)) {
>         // TODO can we auto-pick for strings when term cardinality is much greater than DocSet cardinality?
>         //   or if we don't know cardinality but DocSet size is very small
>         return new FacetFieldProcessorByHashDV(fcontext, this, sf);
> {code}
> I thought about this a little and this is the approach I am currently 
> considering to tackle this problem
> {code:java}
> int matchingDocs = fcontext.base.size();
> int totalDocs = fcontext.searcher.getIndexReader().maxDoc();
> // If matchingDocs is close to totalDocs then we aren't filtering many documents,
> // which means the array approach would probably be better than the dvhash approach.
> // Trying to find the cardinality for the matchingDocs would be expensive.
> // Also, for totalDocs we don't have a global cardinality present at index time,
> // but we have a per-segment cardinality.
> // So using the number of matches as an alternate heuristic would do the job here?
> {code}
> Any thoughts on whether this approach makes sense? It could be that I'm 
> thinking of this approach just because both the users I worked with last 
> week fell into this category.
>  
> cc [~dsmiley] [~joel.bernstein]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

