clintropolis opened a new pull request, #19023:
URL: https://github.com/apache/druid/pull/19023
changes:
* Added `getValueIterator` method to `DictionaryEncodedValueIndex` to give
an easy way for consumers to iterate the dictionary values in order
* `ExpressionPredicateIndexSupplier` now uses `getValueIterator` to scan the
dictionary values, offering a performance improvement, particularly when using
front-coding
* fixed a few other places that were iterating the dictionary using `get` so
that they use the iterator instead
Credit to #19004 for the added benchmark query and for bringing this issue to
attention: when using front-coding, computing the indexes was slower than just
doing a full scan (at least in some cases, such as this query)
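To illustrate why iterating helps with front-coding, here is a minimal sketch. It is not Druid's `FrontCodedIndexed`; `FrontCodedSketch` and everything inside it are hypothetical, and it assumes a layout where each value in a bucket stores only the suffix that differs from the previous value. Under that assumption a random-access `get` has to re-decode from the start of the bucket, while an iterator decodes each value once.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Toy front-coded dictionary (hypothetical, for illustration only).
final class FrontCodedSketch implements Iterable<String> {
    private final int bucketSize;
    private final int[] prefixLengths; // prefix shared with the previous value; 0 at bucket start
    private final String[] suffixes;   // remainder of the value after the shared prefix
    long decodeSteps = 0;              // number of entries decoded so far

    FrontCodedSketch(List<String> sortedValues, int bucketSize) {
        this.bucketSize = bucketSize;
        int n = sortedValues.size();
        prefixLengths = new int[n];
        suffixes = new String[n];
        for (int i = 0; i < n; i++) {
            String v = sortedValues.get(i);
            // The first value of every bucket is stored in full.
            int p = (i % bucketSize == 0) ? 0 : commonPrefix(sortedValues.get(i - 1), v);
            prefixLengths[i] = p;
            suffixes[i] = v.substring(p);
        }
    }

    private static int commonPrefix(String a, String b) {
        int max = Math.min(a.length(), b.length());
        int i = 0;
        while (i < max && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    int size() {
        return suffixes.length;
    }

    // Random access: decode forward from the first value of the bucket.
    String get(int index) {
        int start = (index / bucketSize) * bucketSize;
        String current = suffixes[start];
        decodeSteps++;
        for (int i = start + 1; i <= index; i++) {
            current = current.substring(0, prefixLengths[i]) + suffixes[i];
            decodeSteps++;
        }
        return current;
    }

    // Sequential access: each value is decoded once from the previous one.
    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int next = 0;
            private String previous = "";

            @Override
            public boolean hasNext() {
                return next < suffixes.length;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                previous = previous.substring(0, prefixLengths[next]) + suffixes[next];
                decodeSteps++;
                next++;
                return previous;
            }
        };
    }

    public static void main(String[] args) {
        List<String> values = new ArrayList<>();
        for (int i = 0; i < 64; i++) {
            values.add(String.format("value-%04d", i));
        }
        FrontCodedSketch dict = new FrontCodedSketch(values, 16);
        for (int i = 0; i < dict.size(); i++) {
            dict.get(i);
        }
        System.out.println("get(i) scan decode steps: " + dict.decodeSteps); // 544
        dict.decodeSteps = 0;
        for (String ignored : dict) {
            // just consume the values
        }
        System.out.println("iterator scan decode steps: " + dict.decodeSteps); // 64
    }
}
```

With 64 values in buckets of 16, the `get` loop decodes 544 entries and the iterator decodes 64. The real encoding and its costs differ in detail, so treat the numbers as a picture of the access pattern, not a prediction of the benchmark results.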
before:
```
Benchmark                        (complexCompression)  (deferExpressionDimensions)  (jsonObjectStorageEncoding)  (query)  (rowsPerSegment)  (schemaType)  (storageType)   (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
SqlExpressionBenchmark.querySql                  NONE                 singleString                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  522.387 ±  22.942  ms/op
SqlExpressionBenchmark.querySql                  NONE                 singleString                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  501.122 ±  17.559  ms/op
SqlExpressionBenchmark.querySql                  NONE                   fixedWidth                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  547.506 ±  15.055  ms/op
SqlExpressionBenchmark.querySql                  NONE                   fixedWidth                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  446.650 ±   5.308  ms/op
SqlExpressionBenchmark.querySql                  NONE         fixedWidthNonNumeric                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  572.099 ±  67.823  ms/op
SqlExpressionBenchmark.querySql                  NONE         fixedWidthNonNumeric                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  499.534 ±  19.926  ms/op
SqlExpressionBenchmark.querySql                  NONE                       always                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  549.607 ±  25.846  ms/op
SqlExpressionBenchmark.querySql                  NONE                       always                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  496.660 ±  16.439  ms/op
```
after:
```
Benchmark                        (complexCompression)  (deferExpressionDimensions)  (jsonObjectStorageEncoding)  (query)  (rowsPerSegment)  (schemaType)  (storageType)   (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
SqlExpressionBenchmark.querySql                  NONE                 singleString                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  428.333 ±  14.320  ms/op
SqlExpressionBenchmark.querySql                  NONE                 singleString                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  364.073 ±   5.671  ms/op
SqlExpressionBenchmark.querySql                  NONE                   fixedWidth                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  423.951 ±  12.710  ms/op
SqlExpressionBenchmark.querySql                  NONE                   fixedWidth                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  371.926 ±   5.133  ms/op
SqlExpressionBenchmark.querySql                  NONE         fixedWidthNonNumeric                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  424.357 ±  10.445  ms/op
SqlExpressionBenchmark.querySql                  NONE         fixedWidthNonNumeric                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  419.708 ±  71.678  ms/op
SqlExpressionBenchmark.querySql                  NONE                       always                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  444.724 ± 112.962  ms/op
SqlExpressionBenchmark.querySql                  NONE                       always                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  373.843 ±   8.409  ms/op
```
I also considered adding a `getBitmapsIterator` to
`DictionaryEncodedValueIndex`, but ultimately decided against it: most of the
bitmap `get` methods coerce null values to empty bitmaps, so they can't just
use the underlying `Indexed` iterator directly, which sounded a bit more
tedious than I wanted to deal with. This could be done as a follow-up, so that
the places that iterate the dictionary and collect the corresponding bitmaps
can use iterators for both instead of keeping a counter, or use some
convenient structure that iterates both at the same time so they never need
to be kept in sync.
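As a rough sketch of what such a "convenient structure" might look like (purely hypothetical, not part of this PR): the `DictionaryBitmapPairs` class below pairs a value iterator with a bitmap iterator and applies the null-to-empty coercion in one place. `java.util.BitSet` stands in for Druid's bitmap type, and every name here is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical helper that walks dictionary values and bitmaps in lockstep.
final class DictionaryBitmapPairs {
    // One dictionary value together with its (never null) bitmap.
    static final class Entry {
        final String value;
        final BitSet bitmap;

        Entry(String value, BitSet bitmap) {
            this.value = value;
            this.bitmap = bitmap;
        }
    }

    // Pairs the two iterators and coerces a null bitmap to an empty one,
    // so callers need neither a counter nor their own null handling.
    static Iterator<Entry> zip(Iterator<String> values, Iterator<BitSet> bitmaps) {
        return new Iterator<Entry>() {
            @Override
            public boolean hasNext() {
                return values.hasNext() && bitmaps.hasNext();
            }

            @Override
            public Entry next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String value = values.next();
                BitSet bitmap = bitmaps.next();
                return new Entry(value, bitmap == null ? new BitSet() : bitmap);
            }
        };
    }

    public static void main(String[] args) {
        BitSet rowsForA = new BitSet();
        rowsForA.set(0);
        rowsForA.set(3);
        List<String> values = Arrays.asList("a", "b");
        List<BitSet> bitmaps = Arrays.asList(rowsForA, null); // "b" has no stored bitmap

        Iterator<Entry> it = zip(values.iterator(), bitmaps.iterator());
        while (it.hasNext()) {
            Entry e = it.next();
            System.out.println(e.value + " -> " + e.bitmap);
        }
    }
}
```

Whether this shape fits the existing `Indexed` and bitmap interfaces is an open question; the sketch only shows that the coercion can live inside the paired iterator.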