[GitHub] [druid] gianm opened a new pull request, #12517: Direct UTF-8 access for "in" filters.

GitBox Thu, 12 May 2022 19:03:19 -0700


gianm opened a new pull request, #12517:
URL: https://github.com/apache/druid/pull/12517


   Directly related:
   
   1) InDimFilter: Store sorted Strings (in ValuesSet) plus sorted UTF-8
      ByteBuffers (in valuesUtf8). Use valuesUtf8 whenever possible. If
      necessary, the input set is copied into a ValuesSet. Much logic is
      simplified, because we always know what type the values set will be.
      I think that there won't even be an efficiency loss in most cases.
      InDimFilter is most frequently created by deserialization, and this
      patch updates the JsonCreator constructor to deserialize
      directly into a ValuesSet.
   
   2) Add Utf8ValueSetIndex, which InDimFilter uses to avoid UTF-8 decodes
      during index lookups.
   
   3) Add unsigned comparator to ByteBufferUtils and use it in
      GenericIndexed.BYTE_BUFFER_STRATEGY. This is important because UTF-8
      bytes can be compared as bytes if, and only if, the comparison
      is unsigned.
   
   4) Add specialization to GenericIndexed.singleThreaded().indexOf that
      avoids needless ByteBuffer allocations.
   
   5) Clarify that objects returned by ColumnIndexSupplier.as are not
      thread-safe. DictionaryEncodedStringIndexSupplier now calls
      singleThreaded() on all relevant GenericIndexed objects, saving
      a ByteBuffer allocation per access.
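
To illustrate items 2 and 3, here is a minimal sketch of why the comparison must be unsigned, and how a sorted dictionary of UTF-8 buffers can be searched without decoding to String. The class `Utf8CompareSketch` and its methods `compareUnsigned`, `utf8`, and `indexOf` are hypothetical names made up for this example. They are not the methods added to ByteBufferUtils or GenericIndexed by this patch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the code added by this patch.
public class Utf8CompareSketch {
    // Lexicographic comparison that treats every byte as unsigned (0..255).
    // Uses absolute gets, so buffer positions are left untouched.
    public static int compareUnsigned(ByteBuffer a, ByteBuffer b) {
        int n = Math.min(a.remaining(), b.remaining());
        for (int i = 0; i < n; i++) {
            int x = a.get(a.position() + i) & 0xFF;
            int y = b.get(b.position() + i) & 0xFF;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        // Equal common prefix: the shorter buffer sorts first.
        return Integer.compare(a.remaining(), b.remaining());
    }

    public static ByteBuffer utf8(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    // Binary search over a dictionary sorted by compareUnsigned.
    // Returns the index if found, else -(insertionPoint + 1).
    public static int indexOf(List<ByteBuffer> sortedDictionary, ByteBuffer value) {
        int lo = 0;
        int hi = sortedDictionary.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compareUnsigned(sortedDictionary.get(mid), value);
            if (cmp < 0) {
                lo = mid + 1;
            } else if (cmp > 0) {
                hi = mid - 1;
            } else {
                return mid;
            }
        }
        return -(lo + 1);
    }

    public static void main(String[] args) {
        ByteBuffer a = utf8("a");       // single byte 0x61
        ByteBuffer e = utf8("\u00e9");  // two bytes 0xC3 0xA9; 0xC3 is negative as a signed byte

        // Unsigned order agrees with code point order: "a" sorts before U+00E9.
        System.out.println(compareUnsigned(a, e) < 0);
        // ByteBuffer.compareTo compares bytes as signed values, so it disagrees.
        System.out.println(a.compareTo(e) < 0);

        List<ByteBuffer> dictionary = Arrays.asList(utf8("a"), utf8("b"), utf8("z"), utf8("\u00e9"));
        System.out.println(indexOf(dictionary, utf8("\u00e9")));
    }
}
```

The lookup never constructs a String from the dictionary entries, which is the kind of decode the Utf8ValueSetIndex is meant to avoid.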
   
   Also:
   
   1) Fix performance regression in LikeFilter: since #12315, it applied
      the suffix matcher to all values in range even for type MATCH_ALL.
   
   2) Add ObjectStrategy.canCompare() method. This fixes LikeFilterBenchmark,
      which was broken due to calls to strategy.compare in
      GenericIndexed.fromIterable.
   
   Benchmarks:
   
   Improvements to "in" filters (due to using UTF-8 comparisons) and "like"
   filters (due to fixing a perf regression).
   
   **patch**
   
   ```
   Benchmark                         (dictionarySize)  (filterSize)  Mode  Cnt     Score    Error  Units
   InFilterBenchmark.doFilter                 1000000         10000  avgt   10  3522.824 ± 35.642  us/op
   
   Benchmark                            (cardinality)                Mode  Cnt     Score    Error  Units
   LikeFilterBenchmark.matchLikePrefix        1000000                avgt   10   660.164 ± 19.946  us/op
   ```
   
   **9177515be224269dc0299eae809591cb373d83a0 (master)**
   
   ```
   Benchmark                         (dictionarySize)  (filterSize)  Mode  Cnt     Score    Error  Units
   InFilterBenchmark.doFilter                 1000000         10000  avgt   10  6723.385 ± 59.267  us/op
   
   Benchmark                            (cardinality)                Mode  Cnt     Score    Error  Units
   LikeFilterBenchmark.matchLikePrefix        1000000                avgt   10  1348.478 ± 12.959  us/op
   ```
   
   **deb69d1bc03aef784a563260bed1d21505495439 (before #12388)**
   
   ```
   Benchmark                         (dictionarySize)  (filterSize)  Mode  Cnt     Score    Error  Units
   InFilterBenchmark.doFilter                 1000000         10000  avgt   10  6803.080 ± 87.903  us/op
   
   Benchmark                            (cardinality)                Mode  Cnt     Score   Error  Units
   LikeFilterBenchmark.matchLikePrefix        1000000                avgt   10   717.091 ± 3.451  us/op
   ```
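
For reading the three tables together, the ratios can be computed directly from the reported scores. The sketch below is only arithmetic on the numbers above. The class name `SpeedupSketch` and the method `speedup` are hypothetical and are not part of the patch or of the benchmark code. It ignores the reported error bounds.

```java
import java.util.Locale;

// Hypothetical helper for reading the benchmark tables; not part of the patch.
public class SpeedupSketch {
    // Ratio of baseline time to patched time; values above 1.0 mean the patch is faster.
    public static double speedup(double baselineUsPerOp, double patchedUsPerOp) {
        return baselineUsPerOp / patchedUsPerOp;
    }

    public static void main(String[] args) {
        // Scores in us/op, copied from the tables above.
        double inPatch = 3522.824, inMaster = 6723.385, inBefore12388 = 6803.080;
        double likePatch = 660.164, likeMaster = 1348.478, likeBefore12388 = 717.091;

        System.out.println(String.format(Locale.ROOT, "in,   patch vs master:        %.2fx", speedup(inMaster, inPatch)));
        System.out.println(String.format(Locale.ROOT, "in,   patch vs before #12388: %.2fx", speedup(inBefore12388, inPatch)));
        System.out.println(String.format(Locale.ROOT, "like, patch vs master:        %.2fx", speedup(likeMaster, likePatch)));
        System.out.println(String.format(Locale.ROOT, "like, patch vs before #12388: %.2fx", speedup(likeBefore12388, likePatch)));
    }
}
```

On these numbers the "in" filter is a little over 1.9x faster than master, and the "like" prefix match is about 2x faster than master and slightly faster than the commit before #12388.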
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

