leerho commented on issue #6865: Densify swapped hll buffer
URL: https://github.com/apache/incubator-druid/pull/6865#issuecomment-464555857
 
 
   @drcrallen @clintropolis 
   ...continued:
   
   - I'm not sure if you noticed but Line 388 in HyperLogLogCollector just 
above your inserted lines is a no-op:
   
   `storageBuffer.duplicate().put(other.storageBuffer.asReadOnlyBuffer());`
   
   The returned duplicated read-only buffer is never captured or used.
   
   - Inside the scope where you put your sparse check:
   ```
       if (getRegisterOffset() < other.getRegisterOffset()) {
         // "Swap" the buffers so that we are folding into the one with the 
higher offset
         final ByteBuffer tmpBuffer = 
ByteBuffer.allocate(storageBuffer.remaining());
         tmpBuffer.put(storageBuffer.asReadOnlyBuffer());
         tmpBuffer.clear();
   
         storageBuffer.duplicate().put(other.storageBuffer.asReadOnlyBuffer());
   
         //Added by PR #6865
         if (other.storageBuffer.remaining() != 
other.getNumBytesForDenseStorage()) {
           // The other buffer was sparse, densify it
           final int newLImit = storageBuffer.position() + 
other.storageBuffer.remaining();
           storageBuffer.limit(newLImit);
           convertToDenseStorage();
         } //end of PR #6865
   
         other = HyperLogLogCollector.makeCollector(tmpBuffer);
       }
   ```
   The `other`  sketch can never be sparse! ... Once the sketch has increased 
the Register Offset bigger than zero, the sketch should **never** be placed 
back into the sparse state, ever!   ... **Unless you fill the registers 
artificially with impossible values as you do in your test!**.    
   
   The HLL sketch is a complex state machine, and if implemented correctly, 
should never allow itself to be placed in illegal states especially from public 
methods.  It is hard enough to validate that the state machine operates 
correctly, but placing them in illegal states and then expecting meaningful 
results is asking for trouble.  You are not the only one to fall into this 
trap, clearly the author did too.
   
   - The swap logic itself is rather risky.  Because of the existence of the 
overflow register, there are a number of corner cases that complicate the 
merging and it is hard to confirm that they are all handled correctly.  The 
test code doesn't seem to check for these corner cases either.  (There are much 
safer ways to do this merging, but that is another topic.) 
   
   - WRT your test called ` testCanFillUpOnMod()`.    I sense that what you are 
trying to do is use this as an example of what NOT to do.  However, it should 
be annotated with @Ignore so that it doesn't always log the message "Filled up 
registers after %s random numbers". 
   
   Nonetheless, the real issue to warn users about is to never ever use 
`Hashing.murmur3_128()` without specifying a non-zero seed AND using that hash 
function to effectively select the hash values that are presented to the 
sketch.  All that your test essentially does is select only odd hash values to 
be presented to the sketch, thus all the registers of the sketch will end up 
with 1's.  The likelihood that this could ever occur using the sketch properly 
is probably longer than the age of the universe.  
   
   This goes back to what I have said before, that the hashing operation should 
never have been exposed and delegated to the user.  But that is history.
   
   In conclusion
   1. The additional code added to `HyperLogLogCollector` is unnecessary.  It 
would only ever be executed from tests that create illegal states of the 
sketch.  What should be modified is the tests!
   
   2. Both of the tests added to `HyperLogLogCollectorTest` use methods that 
place the sketch in illegal states.  The first test could be rewritten to 
easily confirm proper swapping logic without doing this. The second test IMHO, 
is too complex and misses the bigger danger of misusing the hash function.
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to