gianm opened a new pull request #11201: URL: https://github.com/apache/druid/pull/11201
Builds on the concept from #11172 and adds a way to feed HLL sketches with UTF-8 bytes. This must be an option rather than always-on, because prior to this patch, HLL sketches used UTF-16LE encoding when hashing strings. To remain compatible with sketch images created prior to this patch -- which matters during rolling updates and when reading sketches that have been written to segments -- we must keep UTF-16LE as the default. Not currently documented, because I'm not yet sure how best to expose this functionality to users. I think the first place would be in the SQL layer: we could have it automatically select UTF-8 or UTF-16LE when building sketches at query time. We need to be careful about this, though, because UTF-8 isn't always faster. Sometimes, like for the results of expressions, UTF-16LE is faster. I expect we will sort this out in future patches. **Benchmarks** These are benchmarks of `APPROX_COUNT_DISTINCT_DS_HLL(<string column>)` on a 500k row segment. The values that appear are generally short (a few characters long). ``` patch - utf8 Benchmark (query) (rowsPerSegment) (vectorize) Mode Cnt Score Error Units SqlBenchmark.querySql 1 500000 false avgt 15 27.639 ± 1.221 ms/op SqlBenchmark.querySql 1 500000 force avgt 15 25.362 ± 2.406 ms/op patch - utf16le Benchmark (query) (rowsPerSegment) (vectorize) Mode Cnt Score Error Units SqlBenchmark.querySql 1 500000 false avgt 15 40.828 ± 2.855 ms/op SqlBenchmark.querySql 1 500000 force avgt 15 39.257 ± 5.548 ms/op master Benchmark (query) (rowsPerSegment) (vectorize) Mode Cnt Score Error Units SqlBenchmark.querySql 1 500000 false avgt 15 43.704 ± 3.875 ms/op SqlBenchmark.querySql 1 500000 force avgt 15 45.293 ± 3.045 ms/op ``` **Testing strategy** - Extended HllSketchAggregatorTest to run on a matrix of (vectorized, nonvectorized) x (utf8, utf16le). - Added specific sketch image expectations to HllSketchAggregatorTest and manually verified them against master, to ensure consistency across versions. - Added serde tests to make sure the new field JSON-round-trips properly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
