jtibshirani opened a new pull request #715: LUCENE-7714 Add a range query that takes advantage of index sorting. URL: https://github.com/apache/lucene-solr/pull/715 I’m opening this draft PR to get feedback on an approach to [LUCENE-7714](https://issues.apache.org/jira/browse/LUCENE-7714). The PR adds the new query type `IndexSortDocValuesRangeQuery`, a range query that takes advantage of the fact that the index is sorted on the same field as the query. It performs binary search on the field's doc values to find the doc IDs at the lower and upper ends of the range. The query can only be used if all of these conditions hold: - The index is sorted, and its primary sort is on the same field as the query. - The query field has {@link SortedNumericDocValues}. - The segments must have at most one field value per document (otherwise we cannot easily determine the matching document IDs through a binary search). I was hoping for feedback on the overall approach, and also had a few open questions: - I wasn’t sure on the best way to structure the query. As it stands, it requires that segments are sorted correctly and contain at most one value per document. Perhaps we could introduce a wrapper query that can make the decision on a segment-by-segment basis: it would only use `IndexSortDocValueRangeQuery` if the right conditions are met, and otherwise fall back to a standard range query. This wrapper query would have similarities to `IndexOrDocValuesQuery`. - Because doc values only support forward iteration, we need to recreate the comparators every time we backtrack in the binary search. I assumed this recreation would be expensive and experimented with some strategies to avoid it, such as starting with a shared binary search that checks both `lowerValue` and `upperValue`, then moving on to the two individual binary searches. However, these efforts didn’t show any performance improvements in my benchmarks. I plan to do more research around sparse + blocked doc values to understand the circumstances in which backtracking can be costly. **Benchmarking results** I ingested part of the [http logs](https://github.com/elastic/rally-tracks/tree/master/http_logs) dataset (123M total documents) into an index sorted by `@timestamp`. Every document has a single `@timestamp`, although they are not unique across documents. The following date ranges were tested: - range with single point [897303051, 897303051], 124 docs - small range (897633930, 897655999], ~41K docs - medium range (897623930, 897655999], ~5M docs - large range (897259801, 897503930], ~21M docs ``` | 50th percentile service time | range-single | 7.50658 | ms | | 90th percentile service time | range-single | 7.81589 | ms | | 50th percentile service time | range-optimized-single | 8.02532 | ms | | 90th percentile service time | range-optimized-single | 8.38834 | ms | | 50th percentile service time | range-small | 12.7203 | ms | | 90th percentile service time | range-small | 13.3798 | ms | | 50th percentile service time | range-optimized-small | 7.43383 | ms | | 90th percentile service time | range-optimized-small | 7.67655 | ms | | 50th percentile service time | range-medium | 20.8554 | ms | | 90th percentile service time | range-medium | 22.1637 | ms | | 50th percentile service time | range-optimized-medium | 8.38864 | ms | | 90th percentile service time | range-optimized-medium | 8.66427 | ms | | 50th percentile service time | range-large | 53.8697 | ms | | 90th percentile service time | range-large | 60.2573 | ms | | 50th percentile service time | range-optimized-large | 7.81584 | ms | | 90th percentile service time | range-optimized-large | 8.02755 | ms | ```
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
