gianm opened a new pull request, #12517:
URL: https://github.com/apache/druid/pull/12517
Directly related:
1) InDimFilter: Store sorted Strings (in ValuesSet) plus sorted UTF-8
ByteBuffers (in valuesUtf8). Use valuesUtf8 whenever possible. If
necessary, the input set is copied into a ValuesSet. Much logic is
simplified, because we always know what type the values set will be.
I think that there won't even be an efficiency loss in most cases.
InDimFilter is most frequently created by deserialization, and this
patch updates the JsonCreator constructor to deserialize
directly into a ValuesSet.
2) Add Utf8ValueSetIndex, which InDimFilter uses to avoid UTF-8 decodes
during index lookups.
3) Add unsigned comparator to ByteBufferUtils and use it in
GenericIndexed.BYTE_BUFFER_STRATEGY. This is important because UTF-8
bytes can be compared as bytes if, and only if, the comparison
is unsigned.
4) Add specialization to GenericIndexed.singleThreaded().indexOf that
avoids needless ByteBuffer allocations.
5) Clarify that objects returned by ColumnIndexSupplier.as are not
thread-safe. DictionaryEncodedStringIndexSupplier now calls
singleThreaded() on all relevant GenericIndexed objects, saving
a ByteBuffer allocation per access.
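
As a minimal sketch of why point 3 matters, the example below compares UTF-8 bytes as unsigned values and uses that ordering for a binary search over a sorted dictionary, with no String decoding. The class and method names (`Utf8UnsignedCompareSketch`, `compareUnsigned`, `indexOf`) are hypothetical and chosen for illustration; this is not the code added to ByteBufferUtils or GenericIndexed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class Utf8UnsignedCompareSketch {
  /** Hypothetical helper: lexicographic comparison treating each byte as unsigned (0..255). */
  static int compareUnsigned(ByteBuffer a, ByteBuffer b) {
    int n = Math.min(a.remaining(), b.remaining());
    for (int i = 0; i < n; i++) {
      // Absolute gets leave both buffers' positions untouched.
      int cmp = Integer.compare(a.get(a.position() + i) & 0xFF, b.get(b.position() + i) & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    // All shared bytes are equal, so the shorter buffer sorts first.
    return Integer.compare(a.remaining(), b.remaining());
  }

  static ByteBuffer utf8(String s) {
    return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
  }

  /** Binary search over a dictionary sorted by compareUnsigned; returns the index, or -1 if absent. */
  static int indexOf(List<ByteBuffer> sortedDictionary, ByteBuffer key) {
    int lo = 0;
    int hi = sortedDictionary.size() - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      int cmp = compareUnsigned(sortedDictionary.get(mid), key);
      if (cmp < 0) {
        lo = mid + 1;
      } else if (cmp > 0) {
        hi = mid - 1;
      } else {
        return mid;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    ByteBuffer z = utf8("z");           // single byte 0x7A
    ByteBuffer eAcute = utf8("\u00e9"); // two bytes 0xC3 0xA9

    // Unsigned comparison agrees with code point order: 'z' sorts before U+00E9.
    System.out.println("unsigned: " + compareUnsigned(z, eAcute));
    // ByteBuffer.compareTo compares bytes as signed, so 0xC3 is negative and the order flips.
    System.out.println("signed:   " + z.compareTo(eAcute));

    List<ByteBuffer> dictionary = new ArrayList<>();
    dictionary.add(utf8("apple"));
    dictionary.add(z);
    dictionary.add(eAcute);
    System.out.println("index:    " + indexOf(dictionary, utf8("\u00e9")));
  }
}
```

With a signed comparator, any value whose first byte is 0x80 or above would sort before plain ASCII values, so a dictionary sorted in code point order could not be searched correctly by raw bytes.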
Also:
1) Fix performance regression in LikeFilter: since #12315, it applied
the suffix matcher to all values in range even for type MATCH_ALL.
2) Add ObjectStrategy.canCompare() method. This fixes LikeFilterBenchmark,
which was broken due to calls to strategy.compare in
GenericIndexed.fromIterable.
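
To illustrate the LikeFilter fix in spirit, here is a rough sketch of a prefix match over a sorted dictionary that skips the per-value suffix check when the suffix matches everything. All names (`LikePrefixSketch`, `SuffixKind`, `matchingIds`, `firstTrue`) are hypothetical and the structure is an assumption for illustration; it does not reproduce Druid's LikeFilter code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class LikePrefixSketch {
  /** Hypothetical suffix classification: either any suffix matches, or a pattern must be tested. */
  enum SuffixKind { MATCH_ALL, MATCH_PATTERN }

  /** First index in [from, size) where a monotone (false...true) predicate turns true, else size. */
  static int firstTrue(List<String> sorted, int from, Predicate<String> p) {
    int lo = from;
    int hi = sorted.size();
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (p.test(sorted.get(mid))) {
        hi = mid;
      } else {
        lo = mid + 1;
      }
    }
    return lo;
  }

  static List<Integer> matchingIds(
      List<String> sortedDictionary,
      String prefix,
      SuffixKind kind,
      Predicate<String> suffixMatcher
  ) {
    // Values sharing a prefix are contiguous in a sorted dictionary.
    int start = firstTrue(sortedDictionary, 0, v -> v.compareTo(prefix) >= 0);
    int end = firstTrue(sortedDictionary, start, v -> !v.startsWith(prefix));

    List<Integer> ids = new ArrayList<>();
    for (int id = start; id < end; id++) {
      // For MATCH_ALL the short-circuit avoids both the dictionary read and the matcher call.
      if (kind == SuffixKind.MATCH_ALL
          || suffixMatcher.test(sortedDictionary.get(id).substring(prefix.length()))) {
        ids.add(id);
      }
    }
    return ids;
  }

  public static void main(String[] args) {
    List<String> dictionary = List.of("apple", "banana", "bandana", "bank", "cherry");
    System.out.println(matchingIds(dictionary, "ban", SuffixKind.MATCH_ALL, s -> true));
    System.out.println(matchingIds(dictionary, "ban", SuffixKind.MATCH_PATTERN, s -> s.endsWith("a")));
  }
}
```

In the match-everything case the whole dictionary range is accepted, so the cost depends only on locating the range and not on evaluating a matcher for each value in it.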
Benchmarks:
Improvements to "in" filters (due to using UTF-8 comparisons) and "like"
filters (due to fixing a perf regression).
**patch**
```
Benchmark                            (dictionarySize)  (filterSize)  Mode  Cnt     Score    Error  Units
InFilterBenchmark.doFilter                    1000000         10000  avgt   10  3522.824 ± 35.642  us/op

Benchmark                            (cardinality)  Mode  Cnt     Score    Error  Units
LikeFilterBenchmark.matchLikePrefix        1000000  avgt   10   660.164 ± 19.946  us/op
```
**9177515be224269dc0299eae809591cb373d83a0 (master)**
```
Benchmark                            (dictionarySize)  (filterSize)  Mode  Cnt     Score    Error  Units
InFilterBenchmark.doFilter                    1000000         10000  avgt   10  6723.385 ± 59.267  us/op

Benchmark                            (cardinality)  Mode  Cnt     Score    Error  Units
LikeFilterBenchmark.matchLikePrefix        1000000  avgt   10  1348.478 ± 12.959  us/op
```
**deb69d1bc03aef784a563260bed1d21505495439 (before #12388)**
```
Benchmark                            (dictionarySize)  (filterSize)  Mode  Cnt     Score    Error  Units
InFilterBenchmark.doFilter                    1000000         10000  avgt   10  6803.080 ± 87.903  us/op

Benchmark                            (cardinality)  Mode  Cnt     Score    Error  Units
LikeFilterBenchmark.matchLikePrefix        1000000  avgt   10   717.091 ±  3.451  us/op
```