drcrallen opened a new pull request #6865: Densify swapped hll buffer
URL: https://github.com/apache/incubator-druid/pull/6865

We had an upstream data producer that was sampling data. The sampling algorithm seemed to be based on Murmur3_128, or at least on a related algorithm with similar hash collisions. When building an HLL sketch of the dimension values, we got really weird results: all of the HLL buckets ended up with values that were not good sketches of the input data (every bucket nibble holding a `1`, for example). `testCanFillUpOnMod` demonstrates such a scenario.

The unfortunate side effect is that the folding operation can easily produce corrupt buffers if the buffer being folded in is sparse. `testRegisterSwapWithSparse` fails against master at `folded.toByteBuffer()`, similar to how the Jackson serialization of the collector fails on historicals in the error mode we found.

```
java.nio.BufferOverflowException
	at java.nio.Buffer.nextPutIndex(Buffer.java:527)
	at java.nio.HeapByteBuffer.putShort(HeapByteBuffer.java:321)
	at org.apache.druid.hll.HyperLogLogCollector.toByteBuffer(HyperLogLogCollector.java:488)
```

With this PR applied, the query no longer crashes, but it does return a sketch that is useless, as demonstrated by the estimated-cardinality checks in the added unit tests.

A tangential long-term solution here would probably be to also seed the murmur hash with a custom value... but that would break historical compatibility in nasty ways.
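To illustrate how upstream sampling on a correlated hash can flatten the registers as described above, here is a minimal toy sketch. It is not Druid's `HyperLogLogCollector`. The class `SampledHllDemo`, the SplitMix64-style `mix` function standing in for Murmur3_128, the top-bit sampling rule, and the bit layout (low 11 bits select the register, leading zeros of the full hash give the rank) are all assumptions chosen for illustration.

```java
import java.util.Arrays;

// Toy model only: every name and bit layout here is hypothetical, not Druid's.
public class SampledHllDemo {
    static final int BUCKET_BITS = 11;                 // assumed: 2048 registers
    static final int NUM_BUCKETS = 1 << BUCKET_BITS;

    // Stand-in 64-bit mixer (SplitMix64 finalizer); NOT Murmur3_128.
    static long mix(long x) {
        long z = x + 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // Hypothetical upstream sampler: keeps a value only if its top hash bit is set.
    static boolean upstreamKeeps(long hash) {
        return (hash >>> 63) == 1L;
    }

    // Fills toy HLL registers from the values 0..n-1, optionally pre-sampled upstream.
    static int[] fillRegisters(int n, boolean sampled) {
        int[] registers = new int[NUM_BUCKETS];
        for (long i = 0; i < n; i++) {
            long h = mix(i);
            if (sampled && !upstreamKeeps(h)) {
                continue;                               // dropped by the upstream sampler
            }
            int bucket = (int) (h & (NUM_BUCKETS - 1)); // low bits pick the register
            int rank = Long.numberOfLeadingZeros(h) + 1; // high bits give the rank
            registers[bucket] = Math.max(registers[bucket], rank);
        }
        return registers;
    }

    static int maxRegister(int[] registers) {
        return Arrays.stream(registers).max().getAsInt();
    }

    public static void main(String[] args) {
        System.out.println("max register, unsampled: "
            + maxRegister(fillRegisters(200_000, false)));
        System.out.println("max register, sampled:   "
            + maxRegister(fillRegisters(200_000, true)));
    }
}
```

In this toy model every value that survives the sampler has its top bit set, so its rank is always 1 and no register can ever exceed 1, however many distinct values arrive. The register contents then carry no information about cardinality, which is the flavour of degenerate sketch described above. The actual bit layout and failure mode in Druid may differ.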
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
