[GitHub] [lucene] zacharymorn commented on pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI

via GitHub Fri, 17 Mar 2023 23:17:01 -0700


zacharymorn commented on PR #12194:
URL: https://github.com/apache/lucene/pull/12194#issuecomment-1474743257


   > Maybe we could try to leverage the geonames dataset (there's a few 
benchmarks for it in lucene-util), which has a few low-cardinality fields like 
the time zone or country. Then enable index sorting on these fields. And make 
sure we're getting performance when using the time zone in required or excluded 
clauses?
   
   Thanks @jpountz for the suggestion! I actually did something similar by 
modifying the benchmarking code to support sorting index by `month` field, 
which is a `KeywordField`, and creating corresponding query with the following 
syntax to exercise the `ReqExclScorer - Posting - DocValue` path that I have 
changed:
   
   `ReqExclWithDocValues: newbenchmarktest//names docvalues:-month:[Jan TO Aug]`
   
   However, while I was able to run this query, the test actually failed when 
verifying top matches & scores between baseline and modified code. Upon further 
digging, I realized a potential issue to this API idea. As Lucene does a lot of 
two phase iterations, and two phase iterator's approximation may provide a 
superset of the actual matches. If we were to use this API to find and ignore / 
skip over a bunch of doc ids from approximation, wouldn't the result be 
inaccurate? 
   
   For example, in the below `skippingReqApproximation` disi I put inside 
`ReqExclScorer`, `exclApproximation.peekNextNonMatchingDocID()` may provide an 
inaccurate, longer run of matches as approximation provides superset matches. 
If we were to further confirm the boundary by actually checking matches on its 
iterator, then we basically resort to linear scan on the iterator, which 
defeats the purpose of this new API?
   
   ```
   final DocIdSetIterator skippingReqApproximation =
           new DocIdSetIterator() {
             @Override
             public int docID() {}
   
             @Override
             public int nextDoc() throws IOException {}
   
             @Override
             public int advance(int target) throws IOException {
               // this exclNonMatchingDoc could be inaccruate, as 
exclApproximation provides a superset of matches than its underlying iterator
               int exclNonMatchingDoc = 
exclApproximation.peekNextNonMatchingDocID(); 
   
               if (exclApproximation.docID() < target
                   && target < exclNonMatchingDoc) {
                 return reqApproximation.advance(exclNonMatchingDoc);
               } else {
                 return reqApproximation.advance(target);
               }
             }
   
             @Override
             public long cost() {}
           };
   ```
   
   Please let me know if you have any suggestion on this.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] zacharymorn commented on pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI

Reply via email to