leerho commented on issue #6814: [Discuss] Replacing hyperUnique as 'default' 
distinct count sketch
URL: 
https://github.com/apache/incubator-druid/issues/6814#issuecomment-452567631
 
 
   > The on-disk format is not compatible, and cannot be, due to the difference 
in hash functions used. 
   
   Let me clarify the problem with the Druid-HLL sketch with respect to its use 
of hash functions.
   
   **The Druid-HLL sketch delegates the responsibility of hashing the input 
items to the user.**  
   
   The only requirement is that the resulting hash be at least 10 bytes long.  
Although this may seem advantageous because it gives users flexibility, it has 
very serious drawbacks.
   
   When sketches are serialized and stored, the resulting data is a highly 
compressed jumble of bits, similar to a picture image.  Unless you understand 
how to decode the sequence of bits and know exactly how it was encoded, the 
sketch image is not very useful. The chosen hash function (including any 
seeding of the hash) determines how the bits are encoded. You do not need to 
know which hash function (and seed) was used to encode a sketch in order to 
decode it and obtain an estimate, but any attempt to **merge** two sketches 
will result in garbage unless exactly the same hash function (and seed) was 
used to create both sketches. 
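   
   The effect described above can be reproduced with a toy HyperLogLog. This 
is only a sketch for illustration: it is neither the Druid-HLL nor the 
DataSketches-HLL code, and the names `hash64`, `build`, `merge`, and 
`estimate` are hypothetical. SHA-256 with a seed prefix stands in for two 
"slightly different" user-chosen hash functions, and the small-range 
corrections of a real HLL are omitted.

```python
import hashlib

P = 10          # number of index bits (assumption for this toy example)
M = 1 << P      # number of registers


def hash64(item: str, seed: int) -> int:
    """Stand-in for a user-supplied hash: 64 bits of SHA-256 over seed + item."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def build(items, seed):
    """Build a toy HLL register array from items hashed with the given seed."""
    regs = [0] * M
    for item in items:
        h = hash64(item, seed)
        idx = h >> (64 - P)                       # top P bits pick the register
        rest = h & ((1 << (64 - P)) - 1)          # remaining 54 bits
        rank = (64 - P) - rest.bit_length() + 1   # position of the leftmost 1 bit
        regs[idx] = max(regs[idx], rank)
    return regs


def merge(a, b):
    """HLL merge: register-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


def estimate(regs):
    """Raw HLL estimate (no small-range or large-range correction)."""
    alpha = 0.7213 / (1 + 1.079 / M)
    return alpha * M * M / sum(2.0 ** -r for r in regs)


items = [f"user-{i}" for i in range(20000)]   # the same 20,000 users on both sides

dept_a = build(items, seed=1)
dept_b_same = build(items, seed=1)   # department B used the same hash
dept_b_diff = build(items, seed=2)   # department B used a different hash

same_est = estimate(merge(dept_a, dept_b_same))
diff_est = estimate(merge(dept_a, dept_b_diff))

print(f"same hash:      {same_est:,.0f}")
print(f"different hash: {diff_est:,.0f}")
```

With the same hash on both sides, the merged estimate stays near the true 
count of 20,000. With different hashes, every item looks like a new one to the 
merged sketch, so the estimate roughly doubles, and nothing in the register 
arrays signals that anything went wrong.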
   
   Suppose a user in department A creates a history of sketches using one hash 
function, and a user in department B creates a history of sketches using a 
slightly different hash function.  These two histories can never be merged 
together.   The Druid-HLL sketch therefore must rely on **human** coordination 
between the users that created and stored the sketches, and the users that wish 
to merge these sketches later on in some query process.  In fact these two 
groups of users could be separated not only by department, but by company 
and/or long periods of time. 
   
   The Druid-HLL sketch has no means of detecting that two sketches were 
created using different hash functions (or seeds). In other words, this is a 
silent failure mechanism.  
   
   **The DataSketches-HLL sketch always uses the same hash function (and seed) 
and it cannot be changed by the user.**
   
   This was an intentional design decision to prevent the above disaster that 
is almost guaranteed to happen in large corporations.  
   
   **The DataSketches-Theta (and Tuple and CPC) sketch families always use the 
same hash function, but do allow users to optionally specify a different seed.**
   
   This was a specific design decision to provide certain groups that must 
sketch sensitive data with a bit more security in their stored data.  The 
chosen seed is never stored in the sketch image for this reason. 
   
   However, the Theta sketches include a safety mechanism that detects whether 
different seeds were used to create the sketch images.  All merge operations 
first check that the sketches are, in fact, compatible, and throw an exception 
if they are not. 
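   
   One way such a check could work, without ever storing the seed itself, is 
to keep only a short checksum of the seed in the sketch and compare checksums 
before merging. The following is a minimal sketch under that assumption, not 
the DataSketches implementation; `ToySketch`, `seed_fingerprint`, and the 
choice of a 16-bit SHA-256 prefix are all hypothetical.

```python
import hashlib


def seed_fingerprint(seed: int) -> int:
    """Short (16-bit) checksum of the seed; the seed cannot be read back from it."""
    digest = hashlib.sha256(str(seed).encode()).digest()
    return int.from_bytes(digest[:2], "big")


class ToySketch:
    """Hypothetical stand-in for a seeded sketch; it keeps raw hashes for clarity."""

    def __init__(self, seed: int):
        self._seed = seed                       # used while hashing, never serialized
        self.seed_fp = seed_fingerprint(seed)   # the only seed-derived value stored
        self.hashes = set()

    def update(self, item: str) -> None:
        digest = hashlib.sha256(f"{self._seed}:{item}".encode()).digest()
        self.hashes.add(int.from_bytes(digest[:8], "big"))

    def merge(self, other: "ToySketch") -> "ToySketch":
        # Refuse to combine sketches whose seed fingerprints differ.
        if self.seed_fp != other.seed_fp:
            raise ValueError("sketches were built with different seeds")
        merged = ToySketch(self._seed)
        merged.hashes = self.hashes | other.hashes
        return merged


a, b, c = ToySketch(seed=1), ToySketch(seed=1), ToySketch(seed=2)
for s in (a, b, c):
    s.update("alice")
b.update("bob")

print(len(a.merge(b).hashes))   # same seed: the merge succeeds
try:
    a.merge(c)                  # different seed: the merge is rejected
except ValueError as err:
    print("rejected:", err)
```

A checksum this short can collide for two different seeds, so a check of this 
kind catches most mismatches rather than all of them. The point is that the 
failure becomes loud instead of silent.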
   
   > Patching hyperUnique's implementation to improve its error characteristics.
   
   The design flaw that causes the [poor error on 
merging](https://datasketches.github.io/docs/HLL/HllSketchVsDruidHyperLogLogCollector.html)
 is fundamental. The Druid-HLL sketch does not keep sufficient information in 
its stored image to decode it accurately, so no algorithm, however clever, 
could reconstruct the missing data.  The Druid-HLL sketch also provides no 
warning that these estimation errors will occur.  
   
   Even if it were fixed with a new format that retained the missing data, 
merging old sketches with new ones would still have the same error problems.  
   
   The flaws in the design of the Druid-HLL sketch don't stop here; it has many 
other design problems.  I see no benefit in attempting to "fix" the Druid-HLL 
sketch, as that would require a complete redesign from scratch.  We have 
already designed a far superior HLL sketch as part of the DataSketches library. 
 I strongly suggest we document and communicate these issues to the user 
community, and suggest they migrate to the DataSketches library.
   
