drcrallen opened a new pull request #6865: Densify swapped hll buffer
URL: https://github.com/apache/incubator-druid/pull/6865
 
 
   We had an upstream data producer that was sampling data. The sampling 
algorithm seemed to be based on Murmur3_128, or at least on a related algorithm 
with similar hash collisions. When building an HLL sketch of the dimension 
values, we got degenerate results: the HLL buckets ended up with values that 
were not good sketches of the input data (for example, every bucket nibble 
holding a `1`). `testCanFillUpOnMod` demonstrates such a scenario.
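
   As a rough illustration of how a sampler that shares hash bits with the 
sketch can skew the registers, here is a toy sketch. It is not Druid's 
`HyperLogLogCollector`: the class name, the `mix` function (a SplitMix64-style 
finalizer standing in for Murmur3_128), the bucket count, and the 
"keep when the low bits are zero" sampler are all assumptions made for the 
example. It shows a different degenerate shape than the one described above, 
but a similar kind of correlation between sampling and bucketing.

```java
// Toy model only; every name and constant here is hypothetical.
final class CorrelatedSamplingDemo {
    static final int BUCKET_BITS = 11;
    static final int NUM_BUCKETS = 1 << BUCKET_BITS;

    // SplitMix64-style finalizer, used here as a stand-in for Murmur3_128.
    static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // Builds HLL-style registers over the values 0..count-1.
    // When sampled is true, a hypothetical upstream sampler keeps only the
    // values whose low hash bits are zero, i.e. the same bits the sketch
    // uses to pick a bucket.
    static byte[] sketch(long count, boolean sampled) {
        byte[] registers = new byte[NUM_BUCKETS];
        for (long i = 0; i < count; i++) {
            long h = mix(i);
            int bucket = (int) (h & (NUM_BUCKETS - 1));
            if (sampled && bucket != 0) {
                continue; // dropped by the sampler
            }
            int rank = Long.numberOfLeadingZeros(h) + 1;
            registers[bucket] = (byte) Math.max(registers[bucket], rank);
        }
        return registers;
    }

    static int nonZeroRegisters(byte[] registers) {
        int n = 0;
        for (byte r : registers) {
            if (r != 0) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("unsampled non-zero registers: "
            + nonZeroRegisters(sketch(100_000, false)));
        System.out.println("sampled non-zero registers:   "
            + nonZeroRegisters(sketch(100_000, true)));
    }
}
```

   In this toy, every sampled value lands in bucket 0, so the registers say 
almost nothing about how many distinct values were seen.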
   
   The unfortunate side effect is that the folding operation can easily 
produce a corrupt buffer when the buffer being folded in is sparse. 
`testRegisterSwapWithSparse` fails against master at `folded.toByteBuffer()`, 
similar to how the Jackson serialization of the collector fails on historicals 
in the error mode we found:
   
   ```
   java.nio.BufferOverflowException
        at java.nio.Buffer.nextPutIndex(Buffer.java:527)
        at java.nio.HeapByteBuffer.putShort(HeapByteBuffer.java:321)
        at org.apache.druid.hll.HyperLogLogCollector.toByteBuffer(HyperLogLogCollector.java:488)
   ```
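
   For readers unfamiliar with this exception: `ByteBuffer` throws it whenever 
a relative `put` runs past the buffer's limit. The sketch below reproduces 
that general failure mode under an assumed sparse layout (a 2-byte position 
plus a 1-byte value per entry). The class, the method, and the layout are 
hypothetical and are not taken from `HyperLogLogCollector`; the point is only 
what happens when a buffer is sized for fewer entries than are written.

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;

// Hypothetical illustration; not Druid's serialization code.
final class SparseOverflowDemo {
    static final int BYTES_PER_ENTRY = Short.BYTES + Byte.BYTES;

    // Sizes the output for expectedEntries sparse entries, then writes
    // actualEntries of them. Returns true if the write overflowed.
    static boolean overflows(int expectedEntries, int actualEntries) {
        ByteBuffer out = ByteBuffer.allocate(expectedEntries * BYTES_PER_ENTRY);
        try {
            for (int i = 0; i < actualEntries; i++) {
                out.putShort((short) i); // assumed: register position
                out.put((byte) 1);       // assumed: register value
            }
            return false;
        } catch (BufferOverflowException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("sized correctly: " + overflows(4, 4));
        System.out.println("undersized:      " + overflows(4, 5));
    }
}
```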
   
   
   With this PR applied, the query no longer crashes, but it does return a 
sketch that is useless, as demonstrated by the estimated-cardinality checks in 
the added unit tests.
   
   A tangential long-term solution here would probably be to also seed the 
Murmur hash with a custom value... but that will break historical compatibility 
in nasty ways.

