gianm opened a new pull request #11201:
URL: https://github.com/apache/druid/pull/11201


   Builds on the concept from #11172 and adds a way to feed HLL sketches
   with UTF-8 bytes.
   
   This must be an option rather than always-on, because prior to this
   patch, HLL sketches used UTF-16LE encoding when hashing strings. To
   remain compatible with sketch images created prior to this patch -- which
   matters during rolling updates and when reading sketches that have been
   written to segments -- we must keep UTF-16LE as the default.
   
   Not currently documented, because I'm not yet sure how best to expose
   this functionality to users. I think the first place would be in the SQL
   layer: we could have it automatically select UTF-8 or UTF-16LE when
   building sketches at query time. We need to be careful about this, though,
   because UTF-8 isn't always faster. Sometimes, like for the results of
   expressions, UTF-16LE is faster. I expect we will sort this out in
   future patches.
   
   **Benchmarks**
   
   These are benchmarks of `APPROX_COUNT_DISTINCT_DS_HLL(<string column>)` on a 
500k row segment. The values that appear are generally short (a few characters 
long).
   
   ```
   patch - utf8
   
   Benchmark              (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt   
Score   Error  Units
   SqlBenchmark.querySql        1            500000        false  avgt   15  
27.639 ± 1.221  ms/op
   SqlBenchmark.querySql        1            500000        force  avgt   15  
25.362 ± 2.406  ms/op
   
   patch - utf16le
   
   Benchmark              (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt   
Score   Error  Units
   SqlBenchmark.querySql        1            500000        false  avgt   15  
40.828 ± 2.855  ms/op
   SqlBenchmark.querySql        1            500000        force  avgt   15  
39.257 ± 5.548  ms/op
   
   master
   
   Benchmark              (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt   
Score   Error  Units
   SqlBenchmark.querySql        1            500000        false  avgt   15  
43.704 ± 3.875  ms/op
   SqlBenchmark.querySql        1            500000        force  avgt   15  
45.293 ± 3.045  ms/op
   ```
   
   **Testing strategy**
   
   - Extended HllSketchAggregatorTest to run on a matrix of (vectorized, 
nonvectorized) x (utf8, utf16le).
   - Added specific sketch image expectations to HllSketchAggregatorTest and 
manually verified them against master, to ensure consistency across versions.
   - Added serde tests to make sure the new field JSON-round-trips properly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to