clintropolis opened a new pull request, #13977:
URL: https://github.com/apache/druid/pull/13977
### Description
A follow-up of sorts to #12830, which added `NumericRangeIndex` to improve
the performance of bound filters on numeric nested columns, but noted that
there were still cases where the cost of the bitmap operations exceeded the
cost of simply doing a full scan with value matchers. This is also more or
less the same problem described in #3878.
This PR attempts to improve the situation by adding cardinality-based
thresholds that allow range and predicate indexes to choose to skip the
bitmap indexes entirely.
changes:
* adds `skipValueRangeIndexScale` and `skipValuePredicateIndexScale` to
`ColumnConfig` (e.g. `DruidProcessingConfig`), available as system config via
`druid.processing.indexes.skipValueRangeIndexScale` and
`druid.processing.indexes.skipValuePredicateIndexScale`
* `NestedColumnIndexSupplier` multiplies `skipValueRangeIndexScale` and
`skipValuePredicateIndexScale` by the total number of rows to be processed to
determine the threshold above which we should no longer consider using bitmap
indexes, because doing so would require too many bitmap operations
* the default value for both `skipValueRangeIndexScale` and
`skipValuePredicateIndexScale` has initially been set to 0.08, but they are
separate settings to allow independent tuning
* these are intentionally undocumented for now because they are hard to
explain; they mainly exist to support larger-scale experiments than the JMH
benchmarks used to derive the initial values
* these changes provide a pretty sweet performance boost for filter
processing on nested columns
We might want to consider making these same changes to
`DictionaryEncodedStringIndexSupplier`, but I have not done so at this time.
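The threshold check described above can be sketched roughly as follows. This is a minimal illustration, not the actual Druid code: the class and method names are hypothetical, and it assumes the decision is simply a comparison of the number of dictionary values the filter touches against `scale * totalRows`.

```java
// Hypothetical sketch of the cardinality-based skip check; names and the
// exact comparison are illustrative, not the real Druid implementation.
public class SkipIndexSketch
{
    // Mirrors the initial default for druid.processing.indexes.skipValueRangeIndexScale
    // (skipValuePredicateIndexScale has the same default but is tuned separately).
    static final double SKIP_VALUE_RANGE_INDEX_SCALE = 0.08;

    /**
     * Decide whether to skip the bitmap indexes for a range filter: when the
     * number of dictionary values the filter would need to union exceeds
     * scale * totalRows, falling back to a scan with value matchers is
     * expected to be cheaper than performing that many bitmap operations.
     */
    static boolean shouldSkipRangeIndex(final int valuesInRange, final int totalRows)
    {
        return valuesInRange > SKIP_VALUE_RANGE_INDEX_SCALE * totalRows;
    }

    public static void main(String[] args)
    {
        // With 5M rows and the default 0.08 scale, the threshold is 400,000 values.
        System.out.println(shouldSkipRangeIndex(500_000, 5_000_000)); // above threshold: skip indexes
        System.out.println(shouldSkipRangeIndex(10_000, 5_000_000));  // below threshold: use indexes
    }
}
```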
#### Benchmarks comparing to regular columns
```
SELECT SUM(long1) FROM foo WHERE double3 < 10000.0 AND double3 > 1000.0
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE
JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 10000.0 AND
JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0
Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score   Error  Units
SqlNestedDataBenchmark.querySql       14           5000000              none        false  avgt    5  102.947 ± 0.875  ms/op
SqlNestedDataBenchmark.querySql       14           5000000              none        force  avgt    5   63.557 ± 0.477  ms/op
SqlNestedDataBenchmark.querySql       15           5000000              none        false  avgt    5  122.576 ± 0.919  ms/op
SqlNestedDataBenchmark.querySql       15           5000000              none        force  avgt    5   66.158 ± 0.667  ms/op
```
```
SELECT SUM(long1) FROM foo WHERE double3 < 1005.0 AND double3 > 1000.0
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE
JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 1005.0 AND
JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0
Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score   Error  Units
SqlNestedDataBenchmark.querySql       26           5000000              none        false  avgt    5   89.212 ± 4.634  ms/op
SqlNestedDataBenchmark.querySql       26           5000000              none        force  avgt    5   60.594 ± 3.423  ms/op
SqlNestedDataBenchmark.querySql       27           5000000              none        false  avgt    5   19.526 ± 0.543  ms/op
SqlNestedDataBenchmark.querySql       27           5000000              none        force  avgt    5   23.740 ± 0.404  ms/op
```
```
SELECT SUM(long1) FROM foo WHERE double3 < 2000.0 AND double3 > 1000.0
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE
JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 2000.0 AND
JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0
Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       28           5000000              none        false  avgt    5  113.113 ± 14.400  ms/op
SqlNestedDataBenchmark.querySql       28           5000000              none        force  avgt    5   62.150 ±  0.211  ms/op
SqlNestedDataBenchmark.querySql       29           5000000              none        false  avgt    5   60.687 ±  0.549  ms/op
SqlNestedDataBenchmark.querySql       29           5000000              none        force  avgt    5   50.789 ±  0.935  ms/op
```
```
SELECT SUM(long1) FROM foo WHERE double3 < 3000.0 AND double3 > 1000.0
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE
JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 3000.0 AND
JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0
Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score   Error  Units
SqlNestedDataBenchmark.querySql       30           5000000              none        false  avgt    5  104.038 ± 1.316  ms/op
SqlNestedDataBenchmark.querySql       30           5000000              none        force  avgt    5   62.930 ± 0.327  ms/op
SqlNestedDataBenchmark.querySql       31           5000000              none        false  avgt    5  126.925 ± 1.966  ms/op
SqlNestedDataBenchmark.querySql       31           5000000              none        force  avgt    5   67.254 ± 2.807  ms/op
```
```
SELECT SUM(long1) FROM foo WHERE double3 < 5000.0 AND double3 > 1000.0
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE
JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 5000.0 AND
JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0
Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score   Error  Units
SqlNestedDataBenchmark.querySql       32           5000000              none        false  avgt    5  111.034 ± 1.047  ms/op
SqlNestedDataBenchmark.querySql       32           5000000              none        force  avgt    5   73.748 ± 0.749  ms/op
SqlNestedDataBenchmark.querySql       33           5000000              none        false  avgt    5  116.868 ± 2.659  ms/op
SqlNestedDataBenchmark.querySql       33           5000000              none        force  avgt    5   75.269 ± 0.208  ms/op
```
```
34,35 smaller cardinality like range filter
SELECT SUM(long1) FROM foo WHERE string1 LIKE '1%'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE
JSON_VALUE(nested, '$.nesteder.string1') LIKE '1%'
Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt   Score   Error  Units
SqlNestedDataBenchmark.querySql       34           5000000              none        false  avgt    5  34.077 ± 0.365  ms/op
SqlNestedDataBenchmark.querySql       34           5000000              none        force  avgt    5  25.754 ± 0.384  ms/op
SqlNestedDataBenchmark.querySql       35           5000000              none        false  avgt    5  59.226 ± 0.300  ms/op
SqlNestedDataBenchmark.querySql       35           5000000              none        force  avgt    5  29.900 ± 0.301  ms/op
```
```
36,37 smaller cardinality like predicate filter
SELECT SUM(long1) FROM foo WHERE string1 LIKE '%1%'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE
JSON_VALUE(nested, '$.nesteder.string1') LIKE '%1%'
Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt   Score   Error  Units
SqlNestedDataBenchmark.querySql       36           5000000              none        false  avgt    5  81.823 ± 0.609  ms/op
SqlNestedDataBenchmark.querySql       36           5000000              none        force  avgt    5  40.584 ± 0.703  ms/op
SqlNestedDataBenchmark.querySql       37           5000000              none        false  avgt    5  80.930 ± 4.364  ms/op
SqlNestedDataBenchmark.querySql       37           5000000              none        force  avgt    5  37.533 ± 0.464  ms/op
```
```
38, 39 moderate cardinality like range
SELECT SUM(long1) FROM foo WHERE string5 LIKE '1%'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE
JSON_VALUE(nested, '$.nesteder.string5') LIKE '1%'
Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt   Score   Error  Units
SqlNestedDataBenchmark.querySql       38           5000000              none        false  avgt    5  77.782 ± 0.898  ms/op
SqlNestedDataBenchmark.querySql       38           5000000              none        force  avgt    5  44.718 ± 0.882  ms/op
SqlNestedDataBenchmark.querySql       39           5000000              none        false  avgt    5  68.806 ± 0.952  ms/op
SqlNestedDataBenchmark.querySql       39           5000000              none        force  avgt    5  50.848 ± 0.975  ms/op
```
```
40, 41 big cardinality lex range
SELECT SUM(long1) FROM foo WHERE string5 > '1'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE
JSON_VALUE(nested, '$.nesteder.string5') > '1'
Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score   Error  Units
SqlNestedDataBenchmark.querySql       40           5000000              none        false  avgt    5  245.087 ± 1.706  ms/op
SqlNestedDataBenchmark.querySql       40           5000000              none        force  avgt    5  222.619 ± 3.206  ms/op
SqlNestedDataBenchmark.querySql       41           5000000              none        false  avgt    5  354.551 ± 8.115  ms/op
SqlNestedDataBenchmark.querySql       41           5000000              none        force  avgt    5  211.717 ± 9.943  ms/op
```
```
42, 43 big cardinality like predicate filter
SELECT SUM(long1) FROM foo WHERE string5 LIKE '%1%'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE
JSON_VALUE(nested, '$.nesteder.string5') LIKE '%1%'
Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
SqlNestedDataBenchmark.querySql       42           5000000              none        false  avgt    5  796.396 ± 119.324  ms/op
SqlNestedDataBenchmark.querySql       42           5000000              none        force  avgt    5  648.279 ±   4.008  ms/op
SqlNestedDataBenchmark.querySql       43           5000000              none        false  avgt    5  504.493 ±  92.658  ms/op
SqlNestedDataBenchmark.querySql       43           5000000              none        force  avgt    5  383.268 ±  14.283  ms/op
```
```
44, 45 big cardinality like filter + selector filter
SELECT SUM(long1) FROM foo WHERE string5 LIKE '%1%' AND string1 = '1000'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE
JSON_VALUE(nested, '$.nesteder.string5') LIKE '%1%' AND JSON_VALUE(nested,
'$.nesteder.string1') = '1000'
Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       44           5000000              none        false  avgt    5  641.740 ± 13.625  ms/op
SqlNestedDataBenchmark.querySql       44           5000000              none        force  avgt    5  598.289 ±  9.617  ms/op
SqlNestedDataBenchmark.querySql       45           5000000              none        false  avgt    5   21.523 ±  0.254  ms/op
SqlNestedDataBenchmark.querySql       45           5000000              none        force  avgt    5   19.543 ±  0.231  ms/op
```
#### Release note
<hr>
##### Key changed/added classes in this PR
* `ColumnConfig`
* `NestedFieldLiteralColumnIndexSupplier`
<hr>
This PR has:
- [ ] been self-reviewed.
- [ ] using the [concurrency
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
(Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] a release note entry in the PR description.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]