magibney commented on PR #12207:
URL: https://github.com/apache/lucene/pull/12207#issuecomment-1478015886

   The performance of this proposal comes from the fact that once you know 
the exact term at the limit threshold, you can determine a single byte index 
that suffices to compare for every candidate term in the source TermsEnum. So 
beyond the cost of an extra terms dictionary seek (or two), you're guaranteed 
to compare exactly one byte per term when filtering.
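
   To illustrate the single-byte comparison, here is a minimal, self-contained 
sketch. It is not the code from the PR and does not use Lucene: a sorted 
`List<byte[]>` stands in for the terms dictionary, and `PrefixScanSketch`, 
`seekCeil`, `upperBound`, and `matchPrefix` are hypothetical names. It assumes 
the limit term is the first term that sorts after every term carrying the 
prefix. That term must differ from the prefix at some index inside the prefix, 
and every term before it in the scan agrees with the prefix at that index, so 
checking that one byte is enough to detect the end of the range.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PrefixScanSketch {

  /** Index of the first term >= key in a sorted list (stand-in for a terms dictionary seek). */
  static int seekCeil(List<byte[]> sortedTerms, byte[] key) {
    int lo = 0, hi = sortedTerms.size();
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (Arrays.compareUnsigned(sortedTerms.get(mid), key) < 0) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  /** Smallest byte string that sorts after every string starting with prefix, or null if none. */
  static byte[] upperBound(byte[] prefix) {
    int end = prefix.length;
    while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
      end--; // trailing 0xFF bytes cannot be incremented, so drop them
    }
    if (end == 0) {
      return null; // empty or all-0xFF prefix: nothing sorts past the range
    }
    byte[] bound = Arrays.copyOf(prefix, end);
    bound[end - 1]++;
    return bound;
  }

  /** Linear scan over the prefix range, comparing one byte per candidate term. */
  static List<String> matchPrefix(List<byte[]> sortedTerms, byte[] prefix) {
    List<String> out = new ArrayList<>();
    int checkIndex = -1; // -1 means no limit term exists: scan to the end
    byte[] bound = upperBound(prefix);
    if (bound != null) {
      int limitPos = seekCeil(sortedTerms, bound); // the extra seek
      if (limitPos < sortedTerms.size()) {
        // First index at which the limit term departs from the prefix.
        checkIndex = Arrays.mismatch(prefix, sortedTerms.get(limitPos));
      }
    }
    for (int pos = seekCeil(sortedTerms, prefix); pos < sortedTerms.size(); pos++) {
      byte[] term = sortedTerms.get(pos);
      if (checkIndex >= 0 && term[checkIndex] != prefix[checkIndex]) {
        break; // reached the limit term
      }
      out.add(new String(term, StandardCharsets.UTF_8));
    }
    return out;
  }
}
```

   With the sorted terms "app", "apple", "apply", "apt", "banana" and the 
prefix "app", the limit term is "apt", the check index is 2, and the scan 
returns "app", "apple", "apply" after comparing only byte 2 of each candidate.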
   
   This proposed implementation is very simple, but for PrefixQuery, simple is 
appropriate, given that we know this is always going to be a linear scan of 
terms.
   
   The benefit is seen for cases where the cost of terms iteration is 
relatively large. One such case is "smaller indexes", but the motivating case 
is actually "longer prefixes matching larger numbers of terms" (e.g., URLs, 
taxonomies), which is hard to demonstrate with the consistent fanout of the 
standard benchmarking data.
   
   Not easily reproducible for now (sorry!), but for an index with 33M docs, 
faceting on a field of cardinality 2.3M, a prefix covering 212k unique values 
(~10% of terms) is consistently ~30% faster with the new approach than with an 
automaton-based approach. When the prefix covers 2.1M values (~90% of terms; 
crazy I know, but it happens), the new approach is consistently ~40% faster. 
(For transparency, I'm set up to easily test this on Lucene 8.8, so that's 
where these numbers come from.) And as large as these speedups are 
percentage-wise, the absolute difference matters even more, given that the 
largest impact is on the slower queries: request latency (new approach vs. 
automaton-based) is 120ms vs. 180ms at 10% prefix coverage and 670ms vs. 
1150ms at 90% prefix coverage.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

