[GitHub] leerho commented on issue #6865: Densify swapped hll buffer

GitBox Thu, 07 Feb 2019 19:39:10 -0800

leerho commented on issue #6865: Densify swapped hll buffer
URL: https://github.com/apache/incubator-druid/pull/6865#issuecomment-461680212
 
 
   @drcrallen 
   
   I have several comments on your situation:
   
   - First of all, you are managing your own hash function that feeds the HLLC 
sketch.  This is a No No!  I had mentioned this to @gianm in #6814 recently and 
he assured me:
   
   > This isn't true in practice. The druid-hll library lets callers use any 
hash function, but Druid doesn't expose that to end users. It always uses 
`Hashing.murmur3_128()`
   
   The sketch must do its own hashing, preferably with its own hash function 
and a private seed. Users should not peek inside and use the same hash 
function with the same seed for other purposes, like performing an upstream 
modulo sampling as you do in `testCanFillUpOnMod()`.
   
   HLL sketches are stochastic functions that rely on good randomness 
properties of the hash function that are **independent** of the incoming data!  
So by using the same exact hash function and the same seed in your mod function 
you are violating this independence property and all bets are off! 
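
   To make the independence point concrete, here is a toy sketch in plain Java. Everything in it is an assumption for illustration: `mix64` is a generic 64-bit mixer standing in for the sketch's hash (it is not Murmur3 and not what Druid or DataSketches use), the register count of 2^11 is hypothetical, and the upstream filter is an invented "hash mod register count equals zero" rule rather than the actual logic of `testCanFillUpOnMod()`. It only counts how many distinct registers receive items when the filter and the sketch share a seed versus when they do not.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.SplittableRandom;

// Toy demonstration only: not Druid's or DataSketches' implementation.
class HashIndependenceDemo {
    static final int LG_REGISTERS = 11;               // hypothetical register count: 2^11
    static final int REGISTERS = 1 << LG_REGISTERS;

    // SplitMix64-style finalizer with the seed added to the input (illustrative stand-in hash).
    static long mix64(long x, long seed) {
        long z = x + seed;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // HLL-style register selection: the low LG_REGISTERS bits of the hash.
    static int registerIndex(long hash) {
        return (int) (hash & (REGISTERS - 1));
    }

    // Feed `samples` random items through a hypothetical upstream filter
    // (keep only items whose filter hash is 0 modulo REGISTERS), then report
    // how many distinct sketch registers the surviving items would touch.
    static int registersTouched(long filterSeed, long sketchSeed, int samples) {
        SplittableRandom rnd = new SplittableRandom(42);
        Set<Integer> touched = new HashSet<>();
        for (int i = 0; i < samples; i++) {
            long item = rnd.nextLong();
            if (Long.remainderUnsigned(mix64(item, filterSeed), REGISTERS) != 0) {
                continue; // rejected by the upstream modulo filter
            }
            touched.add(registerIndex(mix64(item, sketchSeed)));
        }
        return touched.size();
    }

    public static void main(String[] args) {
        System.out.println("same seed:      " + registersTouched(1L, 1L, 2_000_000));
        System.out.println("different seed: " + registersTouched(1L, 2L, 2_000_000));
    }
}
```

   With a shared seed, every surviving item has its low 11 hash bits equal to zero, so all of them land in a single register; with an independent seed the survivors spread across many registers. The real interaction in the Druid test may differ in detail, but this is the kind of correlation the independence requirement rules out.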
   
   - Nonetheless, what you also have uncovered is likely a big YAFU (Yet 
Another F*** Up) by the designers of the Druid HLL sketch.  I took your test 
and added the DataSketches HLL sketch in parallel inside the test.  I also 
added a few more outputs at the end and got these results:
   
   ```
   Filled up registers after 3,918,870 random numbers
   Count: 19590
   HLLc Uniques: 0
   DS-HLL Uniques: 19169.299781
   ```
   
   The Count is the number of times a value is added to the sketch (at the 
bottom of the do loop). This is not the true number of uniques as there may be 
a few collisions amongst those 4M random numbers, but it was adequate for this 
experiment. 
   
   The Druid HLL shows a count of zero!  I did not debug this (and don't have 
the time to), but perhaps by checking the NumNonZeroRegisters variable just 
when it hits zero, you are catching the sketch with its pants down, just before 
it transitions to shifting the nibble registers by one.  I am not sure; I am 
kinda surprised it fails this badly!
   
   The DS-HLL shows a count of 19169, which is well within the error bounds 
for a sketch of that size.
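
   As a rough check of that claim, the arithmetic below compares the observed deviation with the textbook HyperLogLog relative standard error. Two assumptions are made here that the comment does not state: the sketch is taken to have m = 2048 registers, and the generic 1.04/sqrt(m) formula is used rather than the DataSketches-specific error tables. The Count of 19590 is also only an approximation of the true number of uniques.

```latex
\[
\text{relative error} = \frac{\lvert 19169.30 - 19590 \rvert}{19590} \approx 0.0215
\]
\[
\text{RSE}_{\text{HLL}} \approx \frac{1.04}{\sqrt{m}} = \frac{1.04}{\sqrt{2048}} \approx 0.023
\]
```

   Under those assumptions the observed deviation of about 2.1% is within one standard error.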
   
   I suggest you abandon the Druid HLL sketch and use the DS-HLL sketch instead.
   
   ***
   As for your suggested change, I'm really not sure what ultimate effect it 
will have.  It would require tons of characterization studies to understand its 
effect.  The Druid HLL sketch has so many other problems, and there are users 
that still rely on it.  I would be cautious about any changes like this, as 
they may make the existing binary compatibility with historically stored 
sketches even worse than it already is.
   
   Cheers


