Varun Thacker created SOLR-12820:
------------------------------------
Summary: Auto pick method:dvhash based on thresholds
Key: SOLR-12820
URL: https://issues.apache.org/jira/browse/SOLR-12820
Project: Solr
Issue Type: Improvement
Security Level: Public (Default Security Level. Issues are Public)
Components: Facet Module
Reporter: Varun Thacker
I've worked with two users last week where explicitly setting method:dvhash
improved faceting speeds drastically.
The common theme in both use-cases: one collection hosting data for multiple
users. We always filter documents down to a single user ( thereby limiting the
number of documents drastically ) and then perform a complex nested JSON
facet.
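For illustration, the requests have roughly this shape (the field names user_id, category, and price are hypothetical, not from either user's schema; method:dvhash is what we ended up setting explicitly):
{code:json}
{
  "query": "*:*",
  "filter": "user_id:u12345",
  "facet": {
    "categories": {
      "type": "terms",
      "field": "category",
      "method": "dvhash",
      "facet": {
        "avg_price": "avg(price)"
      }
    }
  }
}
{code}
The filter shrinks the domain to one user's documents, so the terms facet sees far fewer distinct values than the field holds index-wide.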
Both use-cases fit perfectly the criteria that [[email protected]] mentioned
on SOLR-9142
{quote}faceting on a string field with a high cardinality compared to it's
domain is less efficient than it could be.
{quote}
And DVHASH was the perfect optimization for these use-cases.
We are using the facet stream expression in one of the use-cases, which doesn't
expose the method param. We could expose the method param in facet stream, but I
feel the better approach to solve this problem would be to address this TODO in
the code within the JSON Facet Module
{code:java}
if (mincount > 0 && prefix == null && (ntype != null || method == FacetMethod.DVHASH)) {
  // TODO can we auto-pick for strings when term cardinality is much greater than DocSet cardinality?
  // or if we don't know cardinality but DocSet size is very small
  return new FacetFieldProcessorByHashDV(fcontext, this, sf);{code}
I thought about this a little; here is the approach I am currently considering
to tackle the problem
{code:java}
int matchingDocs = fcontext.base.size();
int totalDocs = fcontext.searcher.getIndexReader().maxDoc();
// If matchingDocs is close to totalDocs then we aren't filtering many documents,
// which means the array approach would probably be better than the dvhash approach.
// Trying to find the cardinality for the matchingDocs would be expensive.
// Also for totalDocs we don't have a global cardinality present at index time,
// only a per-segment cardinality.
// So would using the number of matches as an alternate heuristic do the job here?{code}
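To make the heuristic concrete, here is a minimal standalone sketch of the match-ratio check. The class name, helper name, and the 1/16 ratio are all assumptions for illustration, not anything decided on this issue:
{code:java}
// Sketch only: the threshold and names below are hypothetical.
public class FacetMethodHeuristic {

    // If the filtered DocSet is this small relative to the whole index,
    // guess that dvhash would beat the array-based approach.
    static final double DVHASH_RATIO = 1.0 / 16.0;

    // matchingDocs stands in for fcontext.base.size(),
    // totalDocs for fcontext.searcher.getIndexReader().maxDoc()
    static boolean preferDvHash(int matchingDocs, int totalDocs) {
        return totalDocs > 0 && (double) matchingDocs / totalDocs < DVHASH_RATIO;
    }

    public static void main(String[] args) {
        System.out.println(preferDvHash(1_000, 1_000_000));   // heavily filtered: true
        System.out.println(preferDvHash(900_000, 1_000_000)); // most docs match: false
    }
}
{code}
Tuning the ratio would presumably need benchmarking across field cardinalities, since the number of matches is only a proxy for the DocSet's term cardinality.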
Any thoughts on whether this approach makes sense? It could be that I'm
thinking of this approach just because both the users I worked with last week
fell into this category.
cc [~dsmiley] [~joel.bernstein]
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)