leerho commented on issue #6814: [Discuss] Replacing hyperUnique as 'default' 
distinct count sketch
URL: 
https://github.com/apache/incubator-druid/issues/6814#issuecomment-452943951
 
 
   > This isn't true in practice. The Druid-HLL library lets callers use any 
hash function, but Druid doesn't expose that to end users. It always uses 
`Hashing.murmur3_128()` to avoid the problems you mentioned.
   
   Well, that is a relief!  This reveals my lack of understanding of the 
internals of Druid, as I was just looking at the design of the HLL classes, and 
I shudder at what I see!  :)   
   
   > Out of curiosity what is the missing information? I didn't see details on 
that in the link you provided.
   
   It is complicated and requires a deep understanding of the HLL stochastic 
process and how it evolves as a function of *n*.  Nonetheless, I will attempt 
to explain it.
   
   The HLL algorithm is usually implemented with at least 2 distinct phases: 
sparse and dense.  The sparse phase is used for very low counts and saves space 
by not needing to allocate the full HLL dense array.  At some point, the sketch 
will graduate to dense mode and allocate the final full array of *k* slots.  
(e.g., if logK=11, k=2048 slots).  For practical reasons a slot must be at 
least 6 bits.  Simple implementations of HLL assign a byte (8 bits) per slot so 
the space required for the sketch in dense mode (logK=11) would be 2048 bytes 
plus some header bytes.  More sophisticated implementations, like the Druid-HLL 
and the DataSketches-HLL, perform a form of compression so that each slot only 
requires 4-bits.  So now the total space required for the sketch would be 1024 
bytes plus some header bytes.  However, this compression scheme only works 
"most of the time," and occasionally 4 bits is not enough.  When this happens, a 
special "exception value" is placed in the relevant slot, which defers the 
lookup of the true value for that slot to a small hash-map of exception values 
keyed off the slot address.  How often these "exceptions" occur is a function 
of the size of the sketch, *n*, and the probabilistic occurrence of certain hash 
values.  So it is more likely to happen when merging lots of relatively large 
sketches together, which is what we do in our test.
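   To make the 4-bit scheme concrete, here is a minimal sketch of packing two 
slots per byte with a hash-map for values that overflow a nibble.  This is 
illustrative only, not the Druid or DataSketches code: logK=11, the 
sentinel-nibble convention, and the class/method names are my assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative 4-bit slot packing with an exception map (hypothetical class,
// not the Druid-HLL or DataSketches-HLL implementation).  Assumes logK = 11
// and ignores the per-sketch base offset that real HLL-4 sketches track.
class PackedSlots {
    static final int LOG_K = 11;
    static final int K = 1 << LOG_K;        // 2048 slots
    static final int MAX_NIBBLE = 15;       // largest value a 4-bit slot holds

    final byte[] packed = new byte[K / 2];  // two 4-bit slots per byte = 1024 bytes
    final Map<Integer, Integer> exceptions = new HashMap<>();

    void set(int slot, int value) {
        if (value > MAX_NIBBLE) {
            // Value does not fit in 4 bits: store the sentinel nibble in the
            // slot and defer the true value to the map, keyed by slot address.
            writeNibble(slot, MAX_NIBBLE);
            exceptions.put(slot, value);
        } else {
            writeNibble(slot, value);
            exceptions.remove(slot);
        }
    }

    int get(int slot) {
        int nib = readNibble(slot);
        return (nib == MAX_NIBBLE && exceptions.containsKey(slot))
            ? exceptions.get(slot) : nib;
    }

    private void writeNibble(int slot, int v) {
        int i = slot >>> 1;
        if ((slot & 1) == 0) {
            packed[i] = (byte) ((packed[i] & 0xF0) | v);        // low nibble
        } else {
            packed[i] = (byte) ((packed[i] & 0x0F) | (v << 4)); // high nibble
        }
    }

    private int readNibble(int slot) {
        int b = packed[slot >>> 1] & 0xFF;
        return ((slot & 1) == 0) ? (b & 0x0F) : (b >>> 4);
    }
}
```

   Note the space win this buys: the packed array is 1024 bytes instead of the 
2048 a byte-per-slot layout would need, at the cost of the occasional map entry.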
   
   We know from our own extensive characterization studies (that sometimes run 
for days!) that an HLL 4-bit sketch implementation where logK=11 needs at least 
4 exception registers, and could require more, although it would be extremely 
rare; nevertheless, it can never be ruled out.  In theory, for a logK=11 
sketch, each exception register would need only 24 bits.  Since the DS-HLL 
sketch allows up to logK=21, we use 32 bits per exception register. 
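   As a back-of-the-envelope check on those register sizes (the 6-bit value 
width and the byte rounding below are my assumptions for illustration, not the 
actual layouts):

```java
// Rough sizing of one exception register: the slot address needs logK bits,
// the stored value needs about 6 bits (an HLL register value fits in 6 bits),
// and the total is rounded up to whole bytes.  Assumed widths, for illustration.
class ExceptionRegisterSizing {
    static int registerBits(int logK) {
        final int valueBits = 6;        // assumed width of the exception value
        int raw = logK + valueBits;     // address bits + value bits
        return ((raw + 7) / 8) * 8;     // round up to a byte boundary
    }
}
```

   Under these assumptions, logK=11 gives 11 + 6 = 17 bits, rounding to 24, and 
logK=21 gives 27 bits, rounding to 32, consistent with the sizes above.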
   
   The mistake that the developers of the Druid-HLL sketch made was that they 
only provided space for one exception register of 24 bits.  Due to insufficient 
understanding of the probabilities of exceptions and insufficient testing, they 
assumed only one register was enough and did not think through what the 
consequences would be.  You can see this at the top of the HyperLogLogCollector 
class: the header has only 7 bytes.  Byte 4 holds the exception value, and bytes 
5 and 6 hold the slot address.  The total space allocated for the sketch is 
1024 + 7 = 1031 bytes.
   
   If more than one slot (of the 2048 slots) needs an exception register, there 
is no place to record the exception, so the prior exception's record is 
overwritten.  Thus, information is lost.  The state of the HLL array is now 
corrupted and accurate recovery of the estimate is no longer possible.  The 
consequence is severe undercounting, which is revealed in our testing.
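   A toy model of the failure mode (illustrative only; the class and method 
names are hypothetical, not the actual HyperLogLogCollector code): with 
capacity for exactly one exception record, the second exception silently 
clobbers the first.

```java
import java.util.Optional;

// Hypothetical model of a single exception register: one (slot, value) pair
// with no capacity check, so a second exception overwrites the first and the
// earlier slot's true value is unrecoverable.
class SingleExceptionRegister {
    private int slot = -1;   // -1 means "no exception recorded"
    private int value;

    void recordException(int slotAddress, int trueValue) {
        // No overflow handling: whatever was stored here before is lost.
        this.slot = slotAddress;
        this.value = trueValue;
    }

    Optional<Integer> lookup(int slotAddress) {
        return (slot == slotAddress) ? Optional.of(value) : Optional.empty();
    }
}
```

   Once the first record is gone, the sentinel in that slot can no longer be 
resolved to its true value, which is exactly the information loss that shows up 
as undercounting.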
   
   I fully understand what the benefits would be, if it were possible to 
correct for this flaw.  But what you are asking for is the ability to **merge** 
the flawed HLL sketches with some new fixed HLL sketch.  The 4-bit compression 
algorithm is rather complex and would have to be completely reviewed, 
redesigned, and tested.  I don't have the time or motivation to attempt this 
type of redesign.  
   
   Although we also use the MurmurHash3_64_128 hash function, we use a 
specific static-final non-zero seed, so our hash functions are incompatible. 
 I'm not even sure that the Guava hash implementation would produce the same 
bits even if the seed issue didn't exist, but this could be verified.
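   To illustrate why a seed mismatch alone is fatal, here is a toy seeded 
64-bit mix (a splitmix64-style finalizer, emphatically NOT MurmurHash3): two 
otherwise identical hash pipelines stop agreeing the moment their seeds differ, 
because the seed perturbs the input before an invertible mixing step.

```java
// Toy seeded 64-bit mix for illustration only.  Because the finalizer is a
// bijection, distinct seeds on the same input always yield distinct outputs,
// so sketches built with different seeds index different slots and cannot be
// merged.
class SeededHashDemo {
    static long seededMix(long seed, long x) {
        long z = x + seed + 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }
}
```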
   
   Even producing a variation of our HLL sketch that could merge with the 
Druid-HLL sketch images would require considerable work, as we would have to 
extend our design to allow for variable seeds.
   Then we would have to design an adaptor that could merge from the Druid-HLL 
sketch, which is also not trivial.
   
   I will keep thinking about this, but don't get your hopes up :)
   