jtibshirani opened a new pull request #715: LUCENE-7714 Add a range query that 
takes advantage of index sorting.
URL: https://github.com/apache/lucene-solr/pull/715
 
 
   I’m opening this draft PR to get feedback on an approach to 
[LUCENE-7714](https://issues.apache.org/jira/browse/LUCENE-7714).
   
   The PR adds the new query type `IndexSortDocValuesRangeQuery`, a range query 
that takes advantage of the fact that the index is sorted on the same field as 
the query. It performs binary search on the field's doc values to find the doc 
IDs at the lower and upper ends of the range.
   
   The query can only be used if all of these conditions hold:
   - The index is sorted, and its primary sort is on the same field as the 
query.
   - The query field has {@link SortedNumericDocValues}.
   - The segments must have at most one field value per document (otherwise we 
cannot easily
     determine the matching document IDs through a binary search).
   
   I was hoping for feedback on the overall approach, and also had a few open 
questions:
   - I wasn’t sure on the best way to structure the query. As it stands, it 
requires that segments are sorted correctly and contain at most one value per 
document. Perhaps we could introduce a wrapper query that can make the decision 
on a segment-by-segment basis: it would only use `IndexSortDocValueRangeQuery` 
if the right conditions are met, and otherwise fall back to a standard range 
query. This wrapper query would have similarities to `IndexOrDocValuesQuery`.
   - Because doc values only support forward iteration, we need to recreate the 
comparators every time we backtrack in the binary search. I assumed this 
recreation would be expensive and experimented with some strategies to avoid 
it, such as starting with a shared binary search that checks both `lowerValue` 
and `upperValue`, then moving on to the two individual binary searches. 
However, these efforts didn’t show any performance improvements in my 
benchmarks. I plan to do more research around sparse + blocked doc values to 
understand the circumstances in which backtracking can be costly.
   
   **Benchmarking results**
   I ingested part of the [http 
logs](https://github.com/elastic/rally-tracks/tree/master/http_logs) dataset 
(123M total documents) into an index sorted by `@timestamp`. Every document has 
a single `@timestamp`, although they are not unique across documents.
   
   The following date ranges were tested:
   - range with single point [897303051, 897303051], 124 docs
   - small range (897633930, 897655999], ~41K docs
   - medium range (897623930, 897655999],  ~5M docs
   - large range (897259801, 897503930], ~21M docs
   
   ```
   | 50th percentile service time |            range-single |   7.50658 |     
ms |
   | 90th percentile service time |            range-single |   7.81589 |     
ms |
   | 50th percentile service time |  range-optimized-single |   8.02532 |     
ms |
   | 90th percentile service time |  range-optimized-single |   8.38834 |     
ms |
   
   | 50th percentile service time |               range-small |   12.7203 |     
ms |
   | 90th percentile service time |               range-small |   13.3798 |     
ms |
   | 50th percentile service time |     range-optimized-small |   7.43383 |     
ms |
   | 90th percentile service time |     range-optimized-small |   7.67655 |     
ms |
   
   | 50th percentile service time |               range-medium |   20.8554 |    
 ms |
   | 90th percentile service time |               range-medium |   22.1637 |    
 ms |
   | 50th percentile service time |     range-optimized-medium |   8.38864 |    
 ms |
   | 90th percentile service time |     range-optimized-medium |   8.66427 |    
 ms |
   
   | 50th percentile service time |           range-large |   53.8697 |     ms |
   | 90th percentile service time |           range-large |   60.2573 |     ms |
   | 50th percentile service time | range-optimized-large |   7.81584 |     ms |
   | 90th percentile service time | range-optimized-large |   8.02755 |     ms |
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to