leerho commented on issue #6865: Densify swapped hll buffer
URL: https://github.com/apache/incubator-druid/pull/6865#issuecomment-461680212

@drcrallen I have several comments on your situation:

- First of all, you are managing your own hash function that feeds the HLLC sketch. This is a no-no! I had mentioned this to @gianm in #6814 recently and he assured me:

  > This isn't true in practice. The druid-hll library lets callers use any hash function, but Druid doesn't expose that to end users. It always uses `Hashing.murmur3_128()`

  The sketch must do its own hashing, preferably with its own hash function and a private seed. Users should not peek inside and use the same hash function with the same seed for other purposes, such as the upstream modulo sampling you perform in `testCanFillUpOnMod()`. HLL sketches are stochastic functions that rely on good randomness properties of the hash function that are **independent** of the incoming data! By using the exact same hash function and the same seed in your mod function, you are violating this independence property and all bets are off!

- Nonetheless, what you have also uncovered is likely a big YAFU (Yet Another F*** Up) by the designers of the Druid HLL sketch. I took your test and added the DataSketches HLL sketch in parallel inside the test. I also added a few more outputs at the end and got these results:

  ```
  Filled up registers after 3,918,870 random numbers
  Count: 19590
  HLLc Uniques: 0
  DS-HLL Uniques: 19169.299781
  ```

  The Count is the number of times a value is added to the sketch (at the bottom of the do loop). This is not the true number of uniques, as there may be a few collisions among those 4M random numbers, but it was adequate for this experiment. The Druid HLL shows a count of zero! I have not debugged this (and don't have the time to), but perhaps by checking the NumNonZeroRegisters variable just when it hits zero, you are catching the sketch with its pants down, just before it transitions to shifting the nibble registers by one. I am not sure; I am kinda surprised it fails this badly! The DS-HLL shows a count of 19169, which is well within the error bounds of a sketch of that size. I suggest you abandon the Druid HLL sketch and use the DS-HLL sketch instead.

***

As for your suggested change, I'm really not sure what ultimate effect it will have. It would require tons of characterization studies to understand its effect. The Druid HLL sketch has so many other problems, and there are users who still rely on it. I would be cautious about any changes like this, as they may make binary compatibility with historically stored sketches even worse than it already is.

Cheers
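
To make the independence point in the first bullet concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `ToyHll` is a hypothetical toy HyperLogLog, not the Druid or DataSketches implementation; a keyed `blake2b` stands in for murmur3; the register index is taken from the low hash bits; and the modulus of 16 is deliberately chosen to make the effect extreme. It is not a reproduction of `testCanFillUpOnMod()`.

```python
import hashlib
import math

P = 10            # toy sketch with 2**10 = 1024 registers (assumption)
M = 1 << P
MOD = 16          # hypothetical upstream "keep 1 in 16" modulo sampling


def hash64(value: int, seed: bytes) -> int:
    """64-bit keyed hash; blake2b is only a stand-in for a real sketch hash."""
    digest = hashlib.blake2b(
        value.to_bytes(8, "little"), digest_size=8, key=seed
    ).digest()
    return int.from_bytes(digest, "little")


class ToyHll:
    """Textbook HyperLogLog with linear counting for small estimates."""

    def __init__(self, seed: bytes) -> None:
        self.seed = seed
        self.registers = [0] * M

    def update(self, value: int) -> None:
        h = hash64(value, self.seed)
        idx = h & (M - 1)                         # low P bits pick the register
        rest = h >> P                             # remaining 64 - P bits
        rank = (64 - P) - rest.bit_length() + 1   # leading zeros + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / M)
        raw = alpha * M * M / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * M and zeros:
            return M * math.log(M / zeros)        # small-range correction
        return raw


def run(sample_seed: bytes, sketch_seed: bytes, n: int = 100_000):
    """Feed the sketch only the values that survive the modulo sampling."""
    sketch = ToyHll(sketch_seed)
    kept = 0
    for v in range(n):
        if hash64(v, sample_seed) % MOD == 0:
            sketch.update(v)
            kept += 1
    return kept, sketch.estimate()


# Same hash and seed for sampling and for the sketch: independence violated.
shared_kept, shared_est = run(b"seed-A", b"seed-A")
# Sampling uses a different seed from the sketch: independence preserved.
indep_kept, indep_est = run(b"seed-B", b"seed-A")

print(f"shared seed: kept={shared_kept} estimate={shared_est:.1f}")
print(f"indep. seed: kept={indep_kept} estimate={indep_est:.1f}")
```

With the shared seed, every surviving value has its low four hash bits equal to zero, so in this toy layout at most 1024 / 16 = 64 registers can ever be touched and the estimate is capped at roughly 66 no matter how many distinct values are fed in. With an independent sampling seed, the estimate should land within the usual few percent of the number of values kept.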
