clintropolis opened a new pull request, #13977:
URL: https://github.com/apache/druid/pull/13977
### Description
A follow-up of sorts to #12830, which added `NumericRangeIndex` to improve
the performance of bound filters on numeric nested columns, but noted that
there were still cases where the cost of the bitmap operations exceeded the
cost of simply doing a full scan with value matchers. This is also more or
less the same problem described in #3878.
This PR attempts to improve the situation by adding cardinality-based
thresholds that allow range and predicate indexes to choose to skip the
bitmap indexes entirely.
changes:
* adds `skipValueRangeIndexScale` and `skipValuePredicateIndexScale` to
`ColumnConfig` (e.g. `DruidProcessingConfig`), available as system config via
`druid.processing.indexes.skipValueRangeIndexScale` and
`druid.processing.indexes.skipValuePredicateIndexScale`
* `NestedColumnIndexSupplier` multiplies `skipValueRangeIndexScale` and
`skipValuePredicateIndexScale` by the total number of rows to be processed to
determine the threshold above which we should no longer consider using bitmap
indexes, because doing so would require too many bitmap operations
* the default value for both `skipValueRangeIndexScale` and
`skipValuePredicateIndexScale` has initially been set to 0.08, but they are
separate settings to allow independent tuning
* these are intentionally undocumented for now because they are hard to
explain; they mainly exist to support larger-scale experiments than the JMH
benchmarks used to derive the initial values
* these changes provide a pretty sweet performance boost for filter
processing on nested columns
We might want to consider making these same changes to
`DictionaryEncodedStringIndexSupplier`, but I have not done so at this time.
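The threshold check described above can be sketched roughly as follows. This is a minimal illustration, not the actual Druid code: the class and method names are hypothetical, and it assumes the decision is simply a comparison of the number of dictionary values the filter touches against `scale * totalRows`.

```java
// Hypothetical sketch of the cardinality-based skip check; names and the
// exact comparison are illustrative, not the real Druid implementation.
public class SkipIndexSketch
{
    // Mirrors the initial default for druid.processing.indexes.skipValueRangeIndexScale
    // (skipValuePredicateIndexScale has the same default but is tuned separately).
    static final double SKIP_VALUE_RANGE_INDEX_SCALE = 0.08;

    /**
     * Decide whether to skip the bitmap indexes for a range filter: when the
     * number of dictionary values the filter would need to union exceeds
     * scale * totalRows, falling back to a scan with value matchers is
     * expected to be cheaper than performing that many bitmap operations.
     */
    static boolean shouldSkipRangeIndex(final int valuesInRange, final int totalRows)
    {
        return valuesInRange > SKIP_VALUE_RANGE_INDEX_SCALE * totalRows;
    }

    public static void main(String[] args)
    {
        // With 5M rows and the default 0.08 scale, the threshold is 400,000 values.
        System.out.println(shouldSkipRangeIndex(500_000, 5_000_000)); // above threshold: skip indexes
        System.out.println(shouldSkipRangeIndex(10_000, 5_000_000));  // below threshold: use indexes
    }
}
```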
#### Benchmarks comparing to regular columns
```
SELECT SUM(long1) FROM foo WHERE double3 < 10000.0 AND double3 > 1000.0
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE
JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 10000.0 AND
JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0
Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score   Error  Units
SqlNestedDataBenchmark.querySql       14           5000000              none        false  avgt    5  102.947 ± 0.875  ms/op
SqlNestedDataBenchmark.querySql       14           5000000              none        force  avgt    5   63.557 ± 0.477  ms/op
SqlNestedDataBenchmark.querySql       15           5000000              none        false  avgt    5  122.576 ± 0.919  ms/op
SqlNestedDataBenchmark.querySql       15           5000000              none        force  avgt    5   66.158 ± 0.667  ms/op
```
```
SELECT SUM(long1) FROM foo WHERE double3 < 1005.0 AND double3 > 1000.0
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE
JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 1005.0 AND
JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0
Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score   Error  Units
SqlNestedDataBenchmark.querySql       26           5000000              none        false  avgt    5   89.212 ± 4.634  ms/op
SqlNestedDataBenchmark.querySql       26           5000000              none        force  avgt    5   60.594 ± 3.423  ms/op
SqlNestedDataBenchmark.querySql       27           5000000              none        false  avgt    5   19.526 ± 0.543  ms/op
SqlNestedDataBenchmark.querySql       27           5000000              none        force  avgt    5   23.740 ± 0.404  ms/op
```
```
SELECT SUM(long1) FROM foo WHERE double3 < 2000.0 AND double3 > 1000.0
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE
JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 2000.0 AND
JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0
Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       28           5000000              none        false  avgt    5  113.113 ± 14.400  ms/op
SqlNestedDataBenchmark.querySql       28           5000000              none        force  avgt    5   62.150 ±  0.211  ms/op
SqlNestedDataBenchmark.querySql       29           5000000              none        false  avgt    5   60.687 ±  0.549  ms/op
SqlNestedDataBenchmark.querySql       29           5000000              none        force  avgt    5   50.789 ±  0.935  ms/op
```
```
SELECT SUM(long1) FROM foo WHERE double3 < 3000.0 AND double3 > 1000.0
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE
JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 3000.0 AND
JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0
Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score   Error  Units
SqlNestedDataBenchmark.querySql       30           5000000              none        false  avgt    5  104.038 ± 1.316  ms/op
SqlNestedDataBenchmark.querySql       30           5000000              none        force  avgt    5   62.930 ± 0.327  ms/op
SqlNestedDataBenchmark.querySql       31           5000000              none        false  avgt    5  126.925 ± 1.966  ms/op
SqlNestedDataBenchmark.querySql       31           5000000              none        force  avgt    5   67.254 ± 2.807  ms/op
```
```
SELECT SUM(long1) FROM foo WHERE double3 < 5000.0 AND double3 > 1000.0
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE
JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 5000.0 AND
JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0
Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score   Error  Units
SqlNestedDataBenchmark.querySql       32           5000000              none        false  avgt    5  111.034 ± 1.047  ms/op
SqlNestedDataBenchmark.querySql       32           5000000              none        force  avgt    5   73.748 ± 0.749  ms/op
SqlNestedDataBenchmark.querySql       33           5000000              none        false  avgt    5  116.868 ± 2.659  ms/op
SqlNestedDataBenchmark.querySql       33           5000000              none        force  avgt    5   75.269 ± 0.208  ms/op
```
```
34,35 smaller cardinality like range filter
SELECT SUM(long1) FROM foo WHERE string1 LIKE '1%'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE
JSON_VALUE(nested, '$.nesteder.string1') LIKE '1%'
Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt   Score   Error  Units
SqlNestedDataBenchmark.querySql       34           5000000              none        false  avgt    5  34.077 ± 0.365  ms/op
SqlNestedDataBenchmark.querySql       34           5000000              none        force  avgt    5  25.754 ± 0.384  ms/op
SqlNestedDataBenchmark.querySql       35           5000000              none        false  avgt    5  59.226 ± 0.300  ms/op
SqlNestedDataBenchmark.querySql       35           5000000              none        force  avgt    5  29.900 ± 0.301  ms/op
```
```
36,37 smaller cardinality like predicate filter
SELECT SUM(long1) FROM foo WHERE string1 LIKE '%1%'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE
JSON_VALUE(nested, '$.nesteder.string1') LIKE '%1%'
Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt   Score   Error  Units
SqlNestedDataBenchmark.querySql       36           5000000              none        false  avgt    5  81.823 ± 0.609  ms/op
SqlNestedDataBenchmark.querySql       36           5000000              none        force  avgt    5  40.584 ± 0.703  ms/op
SqlNestedDataBenchmark.querySql       37           5000000              none        false  avgt    5  80.930 ± 4.364  ms/op
SqlNestedDataBenchmark.querySql       37           5000000              none        force  avgt    5  37.533 ± 0.464  ms/op
```
```
38, 39 moderate cardinality like range
SELECT SUM(long1) FROM foo WHERE string5 LIKE '1%'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE
JSON_VALUE(nested, '$.nesteder.string5') LIKE '1%'
Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt   Score   Error  Units
SqlNestedDataBenchmark.querySql       38           5000000              none        false  avgt    5  77.782 ± 0.898  ms/op
SqlNestedDataBenchmark.querySql       38           5000000              none        force  avgt    5  44.718 ± 0.882  ms/op
SqlNestedDataBenchmark.querySql       39           5000000              none        false  avgt    5  68.806 ± 0.952  ms/op
SqlNestedDataBenchmark.querySql       39           5000000              none        force  avgt    5  50.848 ± 0.975  ms/op
```
```
40, 41 big cardinality lex range
SELECT SUM(long1) FROM foo WHERE string5 > '1'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE
JSON_VALUE(nested, '$.nesteder.string5') > '1'
Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score   Error  Units
SqlNestedDataBenchmark.querySql       40           5000000              none        false  avgt    5  245.087 ± 1.706  ms/op
SqlNestedDataBenchmark.querySql       40           5000000              none        force  avgt    5  222.619 ± 3.206  ms/op
SqlNestedDataBenchmark.querySql       41           5000000              none        false  avgt    5  354.551 ± 8.115  ms/op
SqlNestedDataBenchmark.querySql       41           5000000              none        force  avgt    5  211.717 ± 9.943  ms/op
```
```
42, 43 big cardinality like predicate filter
SELECT SUM(long1) FROM foo WHERE string5 LIKE '%1%'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE
JSON_VALUE(nested, '$.nesteder.string5') LIKE '%1%'
Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
SqlNestedDataBenchmark.querySql       42           5000000              none        false  avgt    5  796.396 ± 119.324  ms/op
SqlNestedDataBenchmark.querySql       42           5000000              none        force  avgt    5  648.279 ±   4.008  ms/op
SqlNestedDataBenchmark.querySql       43           5000000              none        false  avgt    5  504.493 ±  92.658  ms/op
SqlNestedDataBenchmark.querySql       43           5000000              none        force  avgt    5  383.268 ±  14.283  ms/op
```
```
44, 45 big cardinality like filter + selector filter
SELECT SUM(long1) FROM foo WHERE string5 LIKE '%1%' AND string1 = '1000'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE
JSON_VALUE(nested, '$.nesteder.string5') LIKE '%1%' AND JSON_VALUE(nested,
'$.nesteder.string1') = '1000'
Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       44           5000000              none        false  avgt    5  641.740 ± 13.625  ms/op
SqlNestedDataBenchmark.querySql       44           5000000              none        force  avgt    5  598.289 ±  9.617  ms/op
SqlNestedDataBenchmark.querySql       45           5000000              none        false  avgt    5   21.523 ±  0.254  ms/op
SqlNestedDataBenchmark.querySql       45           5000000              none        force  avgt    5   19.543 ±  0.231  ms/op
```
#### Release note
<hr>
##### Key changed/added classes in this PR
* `ColumnConfig`
* `NestedFieldLiteralColumnIndexSupplier`
<hr>
This PR has:
- [ ] been self-reviewed.
- [ ] using the [concurrency
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
(Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] a release note entry in the PR description.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]