leerho edited a comment on issue #6865: Densify swapped hll buffer
URL: https://github.com/apache/incubator-druid/pull/6865#issuecomment-461680212
 
 
   @drcrallen 
   
   I have several comments on your situation:
   
   - First of all, you are managing your own hash function that feeds the HLLC 
sketch.  This is a no-no!  I mentioned this to @gianm in #6814 recently and 
he assured me:
   
   > This isn't true in practice. The druid-hll library lets callers use any 
hash function, but Druid doesn't expose that to end users. It always uses 
`Hashing.murmur3_128()`
   
   The sketch must do its own hashing, preferably with its own hash function 
and a private seed. Users should not peek inside and reuse the same hash 
function with the same seed to perform upstream modulo sampling, as you do in 
`testCanFillUpOnMod()`.
   
   HLL sketches are stochastic functions that rely on the hash function having 
good randomness properties that are **independent** of the incoming data!  
By using the exact same hash function and the same seed in your mod function, 
you are violating this independence property and all bets are off! 
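
   To make the independence point concrete, here is a toy sketch. It is not 
the Druid or DataSketches code: the SplitMix64-style `hash`, the 2048-bucket 
count, the mod-256 filter and all names are assumptions chosen only for 
illustration. It counts how many buckets can ever be reached when the upstream 
modulo filter and the bucket selection share one hash and seed, versus when 
they use different seeds.

   ```java
   import java.util.HashSet;
   import java.util.Set;

   final class HashIndependenceDemo {
       // Hypothetical register count for this toy example.
       static final int NUM_BUCKETS = 2048;

       // SplitMix64-style finalizer used as a stand-in hash (NOT murmur3_128).
       static long hash(long value, long seed) {
           long z = value + seed + 0x9E3779B97F4A7C15L;
           z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
           z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
           return z ^ (z >>> 31);
       }

       // Feeds 0..n-1 through an upstream "hash mod 256 == 0" filter, then
       // records which bucket each surviving value would select.
       static int bucketsTouched(long filterSeed, long sketchSeed, int n) {
           Set<Integer> touched = new HashSet<>();
           for (long v = 0; v < n; v++) {
               // Upstream modulo sampling on the filter hash.
               if (Math.floorMod(hash(v, filterSeed), 256L) != 0) {
                   continue;
               }
               // Bucket selection from the low bits of the sketch hash.
               long h = hash(v, sketchSeed);
               touched.add((int) Math.floorMod(h, (long) NUM_BUCKETS));
           }
           return touched.size();
       }

       public static void main(String[] args) {
           System.out.println("same seed:      " + bucketsTouched(1L, 1L, 1_000_000));
           System.out.println("different seed: " + bucketsTouched(1L, 2L, 1_000_000));
       }
   }
   ```

   With a shared seed, every surviving hash has its low 8 bits equal to zero, 
so in this toy at most 8 of the 2048 buckets can ever be selected. With 
different seeds the survivors spread over most of the buckets. A real sketch 
uses the hash bits differently, so the exact failure will look different, but 
the loss of independence is the same.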
   
   - Nonetheless, what you have uncovered is also likely a bug.  I took your 
test and added the DataSketches HLL sketch in parallel inside it.  I also 
added a few more outputs at the end and got these results:
   
   ```
   Filled up registers after 3,918,870 random numbers
   Count: 19590
   HLLc Uniques: 0
   DS-HLL Uniques: 19169.299781
   ```
   
   The Count is the number of times a value is added to the sketch (at the 
bottom of the do loop). This is not the true number of uniques, as there may 
be a few collisions among those roughly 4M random numbers, but it is adequate 
for this experiment. 
   
   The Druid HLL shows a count of zero.  I did not debug this, but perhaps by 
checking the NumNonZeroRegisters variable just when it hits zero, you are 
catching the sketch just before it transitions.  I am not sure; I am kinda 
surprised by this.
   
   The DS-HLL shows a count of 19169, which is well within the error bounds of 
a sketch of that size.
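
   As a rough check of the numbers above, the sketch below computes the 
relative difference between the DS-HLL estimate and Count, and compares it to 
the classic HyperLogLog standard-error approximation of 1.04/sqrt(m). The 
sketch size is not stated in this comment, so the `lgK` values are 
assumptions, and the DataSketches estimator may use a different constant. 
Count is also only an approximation of the true number of uniques.

   ```java
   final class RelativeErrorCheck {
       // Signed relative error of an estimate against a reference count.
       static double relativeError(double estimate, double reference) {
           return (estimate - reference) / reference;
       }

       // Classic HyperLogLog approximation of one standard error for m registers.
       // Treat this as a rough yardstick, not the DataSketches error bound.
       static double approxStdError(int registers) {
           return 1.04 / Math.sqrt(registers);
       }

       public static void main(String[] args) {
           // Values copied from the test output above.
           double err = relativeError(19169.299781, 19590);
           System.out.printf("relative error vs Count: %.2f%%%n", 100 * err);

           // Hypothetical sketch sizes; the actual lgK is not given here.
           for (int lgK : new int[] {11, 12}) {
               double se = approxStdError(1 << lgK);
               System.out.printf("lgK=%d: ~%.2f%% per std error, |err| = %.2f std errors%n",
                   lgK, 100 * se, Math.abs(err) / se);
           }
       }
   }
   ```

   The estimate is about 2.1% below Count, which under these assumed sizes is 
on the order of one standard error.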
   
   ***
   As for your suggested change, I'm really not sure what ultimate effect it 
will have. 
   
   Cheers
   
   
   
   
   
