leerho commented on issue #6865: Densify swapped hll buffer URL: https://github.com/apache/incubator-druid/pull/6865#issuecomment-464555857 @drcrallen @clintropolis ...continued: - I'm not sure if you noticed but Line 388 in HyperLogLogCollector just above your inserted lines is a no-op: `storageBuffer.duplicate().put(other.storageBuffer.asReadOnlyBuffer());` The returned duplicated read-only buffer is never captured or used. - Inside the scope where you put your sparse check: ``` if (getRegisterOffset() < other.getRegisterOffset()) { // "Swap" the buffers so that we are folding into the one with the higher offset final ByteBuffer tmpBuffer = ByteBuffer.allocate(storageBuffer.remaining()); tmpBuffer.put(storageBuffer.asReadOnlyBuffer()); tmpBuffer.clear(); storageBuffer.duplicate().put(other.storageBuffer.asReadOnlyBuffer()); //Added by PR #6865 if (other.storageBuffer.remaining() != other.getNumBytesForDenseStorage()) { // The other buffer was sparse, densify it final int newLImit = storageBuffer.position() + other.storageBuffer.remaining(); storageBuffer.limit(newLImit); convertToDenseStorage(); } //end of PR #6865 other = HyperLogLogCollector.makeCollector(tmpBuffer); } ``` The `other` sketch can never be sparse! ... Once the sketch has increased the Register Offset bigger than zero, the sketch should **never** be placed back into the sparse state, ever! ... **Unless you fill the registers artificially with impossible values as you do in your test!**. The HLL sketch is a complex state machine, and if implemented correctly, should never allow itself to be placed in illegal states especially from public methods. It is hard enough to validate that the state machine operates correctly, but placing them in illegal states and then expecting meaningful results is asking for trouble. You are not the only one to fall into this trap, clearly the author did too. - The swap logic itself is rather risky. Because of the existence of the overflow register, there are a number of corner cases that complicate the merging and it is hard to confirm that they are all handled correctly. The test code doesn't seem to check for these corner cases either. (There are much safer ways to do this merging, but that is another topic.) - WRT your test called ` testCanFillUpOnMod()`. I sense that what you are trying to do is use this as an example of what NOT to do. However, it should be annotated with @Ignore so that it doesn't always log the message "Filled up registers after %s random numbers". Nonetheless, the real issue to warn users about is to never ever use `Hashing.murmur3_128()` without specifying a non-zero seed AND using that hash function to effectively select the hash values that are presented to the sketch. All that your test essentially does is select only odd hash values to be presented to the sketch, thus all the registers of the sketch will end up with 1's. The likelihood that this could ever occur using the sketch properly is probably longer than the age of the universe. This goes back to what I have said before, that the hashing operation should never have been exposed and delegated to the user. But that is history. In conclusion 1. The additional code added to `HyperLogLogCollector` is unnecessary. It would only ever be executed from tests that create illegal states of the sketch. What should be modified is the tests! 2. Both of the tests added to `HyperLogLogCollectorTest` use methods that place the sketch in illegal states. The first test could be rewritten to easily confirm proper swapping logic without doing this. The second test IMHO, is too complex and misses the bigger danger of misusing the hash function.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
