[GitHub] [druid] clintropolis opened a new pull request, #12830: add NumericRangeIndex interface and BoundFilter support

GitBox Wed, 27 Jul 2022 21:04:38 -0700


clintropolis opened a new pull request, #12830:
URL: https://github.com/apache/druid/pull/12830


   ### Description
   This PR adds a `NumericRangeIndex` interface, which is like 
`LexicographicalRangeIndex` but for numbers. `BoundFilter` now uses 
`NumericRangeIndex` if comparator is numeric and there is no `extractionFn` 
defined, allowing number columns with indexes to take the same shortcut we use 
for string columns.
   
   This has a pretty decent performance boost compared to using the predicate 
based index for numeric ranges.
   
   no index (regular numeric columns):
   ```
   Benchmark                        (query)  (rowsPerSegment)  (vectorize)  
Mode  Cnt    Score   Error  Units
   SqlNestedDataBenchmark.querySql       14           5000000        false  
avgt    5   97.164 ± 3.767  ms/op
   SqlNestedDataBenchmark.querySql       14           5000000        force  
avgt    5   53.835 ± 0.783  ms/op
   SqlNestedDataBenchmark.querySql       26           5000000        false  
avgt    5   75.090 ± 2.488  ms/op
   SqlNestedDataBenchmark.querySql       26           5000000        force  
avgt    5   48.634 ± 1.097  ms/op
   SqlNestedDataBenchmark.querySql       28           5000000        false  
avgt    5   86.636 ± 3.613  ms/op
   SqlNestedDataBenchmark.querySql       28           5000000        force  
avgt    5   52.267 ± 0.978  ms/op
   ```
   
   predicate index (equivalent nested numeric columns with identical data):
   ```
   Benchmark                        (query)  (rowsPerSegment)  (vectorize)  
Mode  Cnt    Score   Error  Units
   SqlNestedDataBenchmark.querySql       15           5000000        false  
avgt    5  552.875 ± 6.711  ms/op
   SqlNestedDataBenchmark.querySql       15           5000000        force  
avgt    5  543.700 ± 6.557  ms/op
   SqlNestedDataBenchmark.querySql       27           5000000        false  
avgt    5   83.298 ± 1.470  ms/op
   SqlNestedDataBenchmark.querySql       27           5000000        force  
avgt    5   83.296 ± 1.646  ms/op
   SqlNestedDataBenchmark.querySql       29           5000000        false  
avgt    5  129.794 ± 2.851  ms/op
   SqlNestedDataBenchmark.querySql       29           5000000        force  
avgt    5  108.845 ± 2.076  ms/op
   ```
   
   range index (equivalent nested numeric columns with identical data):
   ```
   Benchmark                        (query)  (rowsPerSegment)  (vectorize)  
Mode  Cnt    Score    Error  Units
   SqlNestedDataBenchmark.querySql       15           5000000        false  
avgt    5  562.009 ±  5.874  ms/op
   SqlNestedDataBenchmark.querySql       15           5000000        force  
avgt    5  559.338 ± 10.342  ms/op
   SqlNestedDataBenchmark.querySql       27           5000000        false  
avgt    5   13.442 ±  0.654  ms/op
   SqlNestedDataBenchmark.querySql       27           5000000        force  
avgt    5   13.652 ±  0.548  ms/op
   SqlNestedDataBenchmark.querySql       29           5000000        false  
avgt    5   88.355 ±  2.357  ms/op
   SqlNestedDataBenchmark.querySql       29           5000000        force  
avgt    5   87.215 ±  1.282  ms/op
   ```
   
   Revisiting the initial benchmarks I ran for nested columns, 
https://github.com/apache/druid/pull/12753#issuecomment-1183571908, query '15' 
was significantly slower than '14' which had no indexes, and this is still 
true. Digging a bit deeper, the reason for this is that 2.6 million values 
match which means merging that many bitmaps. Query 27 has ~1500 matches, while 
query 29 has 270k matching values.
   
   This indicates that there is some threshold of "too many bitmaps" at which 
we should skip using indexes and fall back to a full scan and value matchers. I 
plan to introduce a mechanism for this in a future PR.
   
   I also added some direct testing for 
`NestedFieldLiteralColumnIndexSupplier`, which was previously only indirectly 
tested via queries. This was a bit tedious, but also fixed a couple of minor 
bugs.
   
   <hr>
   
   This PR has:
   - [x] been self-reviewed.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [x] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [x] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] clintropolis opened a new pull request, #12830: add NumericRangeIndex interface and BoundFilter support

Reply via email to