zacharymorn commented on PR #12194: URL: https://github.com/apache/lucene/pull/12194#issuecomment-1474743257
> Maybe we could try to leverage the geonames dataset (there's a few benchmarks for it in lucene-util), which has a few low-cardinality fields like the time zone or country. Then enable index sorting on these fields. And make sure we're getting performance when using the time zone in required or excluded clauses? Thanks @jpountz for the suggestion! I actually did something similar by modifying the benchmarking code to support sorting index by `month` field, which is a `KeywordField`, and creating corresponding query with the following syntax to exercise the `ReqExclScorer - Posting - DocValue` path that I have changed: `ReqExclWithDocValues: newbenchmarktest//names docvalues:-month:[Jan TO Aug]` However, while I was able to run this query, the test actually failed when verifying top matches & scores between baseline and modified code. Upon further digging, I realized a potential issue to this API idea. As Lucene does a lot of two phase iterations, and two phase iterator's approximation may provide a superset of the actual matches. If we were to use this API to find and ignore / skip over a bunch of doc ids from approximation, wouldn't the result be inaccurate? For example, in the below `skippingReqApproximation` disi I put inside `ReqExclScorer`, `exclApproximation.peekNextNonMatchingDocID()` may provide an inaccurate, longer run of matches as approximation provides superset matches. If we were to further confirm the boundary by actually checking matches on its iterator, then we basically resort to linear scan on the iterator, which defeats the purpose of this new API? ``` final DocIdSetIterator skippingReqApproximation = new DocIdSetIterator() { @Override public int docID() {} @Override public int nextDoc() throws IOException {} @Override public int advance(int target) throws IOException { // this exclNonMatchingDoc could be inaccruate, as exclApproximation provides a superset of matches than its underlying iterator int exclNonMatchingDoc = exclApproximation.peekNextNonMatchingDocID(); if (exclApproximation.docID() < target && target < exclNonMatchingDoc) { return reqApproximation.advance(exclNonMatchingDoc); } else { return reqApproximation.advance(target); } } @Override public long cost() {} }; ``` Please let me know if you have any suggestion on this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org