Re: [PR] Optimize counts on two clause term disjunctions [lucene]

via GitHub Mon, 29 Jan 2024 01:48:40 -0800


jpountz commented on PR #13036:
URL: https://github.com/apache/lucene/pull/13036#issuecomment-1914327764


   This is a great speedup on `CountOrHighMed`! Too bad it's not faster all the 
time, though I'm not too surprised as conjunctions have more overhead than 
disjunctions when all clauses have a high cost.
   
   As a first step, maybe we can add a simple heuristic that only enables this 
optimization when it's almost guaranteed to yield a speedup? I'm not sure what 
makes the most sense. One option is a threshold on the minimum count across 
both clauses, enabling the optimization only when that minimum is below the 
threshold. You'll probably need to play with various disjunctions to figure out 
a threshold that works.
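
To make the heuristic concrete, here is a minimal sketch in plain Java. It assumes the optimization derives the disjunction count from the two per-term counts and the conjunction count via inclusion-exclusion. The class name, method names, and threshold value are all hypothetical, and the threshold would need to be tuned by benchmarking.

```java
// Hypothetical sketch: names and the threshold value are illustrative only,
// they are not taken from the PR or from Lucene.
final class DisjunctionCountHeuristic {

    // Placeholder value; a real threshold would come from benchmarking.
    static final int MIN_COUNT_THRESHOLD = 1_000;

    /**
     * Enable the rewrite only when the rarer clause is cheap enough that the
     * conjunction can be evaluated quickly by leading with it.
     */
    static boolean useConjunctionRewrite(int countA, int countB, int threshold) {
        return Math.min(countA, countB) <= threshold;
    }

    /** Inclusion-exclusion: |A OR B| = |A| + |B| - |A AND B|. */
    static int disjunctionCount(int countA, int countB, int conjunctionCount) {
        return countA + countB - conjunctionCount;
    }
}
```

With this shape, a query whose two clauses both match a large number of documents would fall back to the regular disjunction path.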
   
   One inefficiency that your PR introduces is that it requires more lookups in 
terms dictionaries. We could avoid this by caching the `TermStates` for each 
term query, which you could do in your 
`rewriteTwoClauseDisjunctionWithTermsForCount()` utility method: if 
`TermQuery#getTermStates()` returns null for a `TermQuery`, you could rewrite 
it to a `TermQuery` that carries a non-null `TermStates` object.
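
The caching idea can be sketched without Lucene at all. The classes below are stand-ins, not Lucene API: `Dictionary` plays the role of a terms dictionary, `TermQ` plays the role of a term query, and `cachedDocFreq` plays the role of the cached term state. The point is only that resolving the state once and carrying it on the rewritten query avoids repeated lookups.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in classes only; none of this is Lucene API.
final class TermStatesCachingSketch {

    /** Stand-in for a terms dictionary that counts how many lookups it serves. */
    static final class Dictionary {
        private final Map<String, Integer> docFreqs = new HashMap<>();
        int lookups = 0;

        void add(String term, int docFreq) {
            docFreqs.put(term, docFreq);
        }

        int lookup(String term) {
            lookups++;
            return docFreqs.getOrDefault(term, 0);
        }
    }

    /** Stand-in for a term query; cachedDocFreq is null until the term is resolved. */
    static final class TermQ {
        final String term;
        final Integer cachedDocFreq;

        TermQ(String term, Integer cachedDocFreq) {
            this.term = term;
            this.cachedDocFreq = cachedDocFreq;
        }
    }

    /** Returns the query unchanged if it already carries state, else a copy that does. */
    static TermQ withCachedState(TermQ q, Dictionary dict) {
        if (q.cachedDocFreq != null) {
            return q;
        }
        return new TermQ(q.term, dict.lookup(q.term));
    }

    /** Uses the cached state when present and only hits the dictionary otherwise. */
    static int docFreq(TermQ q, Dictionary dict) {
        return q.cachedDocFreq != null ? q.cachedDocFreq : dict.lookup(q.term);
    }
}
```

After `withCachedState`, both the per-term count and the conjunction built from the same query could reuse the resolved state, so the dictionary would be consulted once per term rather than once per use.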
   
   And maybe as a follow-up we could look again into the old idea of using 
bitsets to evaluate dense conjunctions, just like `BooleanScorer` does for 
disjunctions.
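
As a rough illustration of the bitset idea, here is a sketch that counts a two-clause conjunction window by window using `java.util.BitSet`. It is not how Lucene implements anything today: the class name, the window size, and the representation of postings as sorted `int[]` arrays are assumptions made for the example.

```java
import java.util.BitSet;

// Hypothetical sketch; postings are modeled as sorted arrays of doc IDs.
final class WindowedConjunctionCount {

    // Illustrative window size, chosen for the example.
    static final int WINDOW_SIZE = 2048;

    /**
     * Counts doc IDs present in both sorted arrays. All doc IDs must be
     * in [0, maxDoc) and each array must be strictly increasing.
     */
    static int count(int[] a, int[] b, int maxDoc) {
        BitSet bitsA = new BitSet(WINDOW_SIZE);
        BitSet bitsB = new BitSet(WINDOW_SIZE);
        int ia = 0;
        int ib = 0;
        int total = 0;
        for (int base = 0; base < maxDoc; base += WINDOW_SIZE) {
            int end = Math.min(base + WINDOW_SIZE, maxDoc);
            bitsA.clear();
            bitsB.clear();
            // Mark every doc of each clause that falls inside this window.
            while (ia < a.length && a[ia] < end) {
                bitsA.set(a[ia++] - base);
            }
            while (ib < b.length && b[ib] < end) {
                bitsB.set(b[ib++] - base);
            }
            // Intersect the two windows and count the surviving bits.
            bitsA.and(bitsB);
            total += bitsA.cardinality();
        }
        return total;
    }
}
```

For dense clauses this trades per-document advancing for bulk bit operations, which is where a speedup would be expected if there is one.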


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


