clintropolis opened a new pull request, #12830: URL: https://github.com/apache/druid/pull/12830
### Description This PR adds a `NumericRangeIndex` interface, which is like `LexicographicalRangeIndex` but for numbers. `BoundFilter` now uses `NumericRangeIndex` if comparator is numeric and there is no `extractionFn` defined, allowing number columns with indexes to take the same shortcut we use for string columns. This has a pretty decent performance boost compared to using the predicate based index for numeric ranges. no index (regular numeric columns): ``` Benchmark (query) (rowsPerSegment) (vectorize) Mode Cnt Score Error Units SqlNestedDataBenchmark.querySql 14 5000000 false avgt 5 97.164 ± 3.767 ms/op SqlNestedDataBenchmark.querySql 14 5000000 force avgt 5 53.835 ± 0.783 ms/op SqlNestedDataBenchmark.querySql 26 5000000 false avgt 5 75.090 ± 2.488 ms/op SqlNestedDataBenchmark.querySql 26 5000000 force avgt 5 48.634 ± 1.097 ms/op SqlNestedDataBenchmark.querySql 28 5000000 false avgt 5 86.636 ± 3.613 ms/op SqlNestedDataBenchmark.querySql 28 5000000 force avgt 5 52.267 ± 0.978 ms/op ``` predicate index (equivalent nested numeric columns with identical data): ``` Benchmark (query) (rowsPerSegment) (vectorize) Mode Cnt Score Error Units SqlNestedDataBenchmark.querySql 15 5000000 false avgt 5 552.875 ± 6.711 ms/op SqlNestedDataBenchmark.querySql 15 5000000 force avgt 5 543.700 ± 6.557 ms/op SqlNestedDataBenchmark.querySql 27 5000000 false avgt 5 83.298 ± 1.470 ms/op SqlNestedDataBenchmark.querySql 27 5000000 force avgt 5 83.296 ± 1.646 ms/op SqlNestedDataBenchmark.querySql 29 5000000 false avgt 5 129.794 ± 2.851 ms/op SqlNestedDataBenchmark.querySql 29 5000000 force avgt 5 108.845 ± 2.076 ms/op ``` range index (equivalent nested numeric columns with identical data): ``` Benchmark (query) (rowsPerSegment) (vectorize) Mode Cnt Score Error Units SqlNestedDataBenchmark.querySql 15 5000000 false avgt 5 562.009 ± 5.874 ms/op SqlNestedDataBenchmark.querySql 15 5000000 force avgt 5 559.338 ± 10.342 ms/op SqlNestedDataBenchmark.querySql 27 5000000 false avgt 5 13.442 ± 0.654 ms/op SqlNestedDataBenchmark.querySql 27 5000000 force avgt 5 13.652 ± 0.548 ms/op SqlNestedDataBenchmark.querySql 29 5000000 false avgt 5 88.355 ± 2.357 ms/op SqlNestedDataBenchmark.querySql 29 5000000 force avgt 5 87.215 ± 1.282 ms/op ``` Revisiting the initial benchmarks I ran for nested columns, https://github.com/apache/druid/pull/12753#issuecomment-1183571908, query '15' was significantly slower than '14' which had no indexes, and this is still true. Digging a bit deeper, the reason for this is that 2.6 million values match which means merging that many bitmaps. Query 27 has ~1500 matches, while query 29 has 270k matching values. This indicates that there is some threshold of "too many bitmaps" at which we should skip using indexes and fall back to a full scan and value matchers. I plan to introduce a mechanism for this in a future PR. I also added some direct testing for `NestedFieldLiteralColumnIndexSupplier`, which was previously only indirectly tested via queries. This was a bit tedious, but also fixed a couple of minor bugs. <hr> This PR has: - [x] been self-reviewed. - [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links. - [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader. - [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met. - [x] been tested in a test Druid cluster. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
