leerho commented on issue #6814: [Discuss] Replacing hyperUnique as 'default' distinct count sketch
URL: https://github.com/apache/incubator-druid/issues/6814#issuecomment-452567631

> The on-disk format is not compatible, and cannot be, due to the difference in hash functions used.

Let me clarify the problem with the Druid-HLL sketch with respect to its use of Hash Functions.

**The Druid-HLL sketch delegates the responsibility of hashing the input items to the user.** The only requirement is that the resulting hash be at least 10 bytes long. Although this may seem to be advantageous by providing user flexibility, it has very serious drawbacks.

When sketches are serialized and stored, the resulting data is a highly compressed jumble of bits similar to a picture image. Unless you understand how to decode the sequence of bits and know exactly how it was encoded, the sketch image is not very useful. The chosen hashing function (including any seeding of the hash) determines how the bits are encoded. Although it is not required to know what hash function (and seed) was used to encode the sketch in order to decode it and obtain an estimate, any attempt to **merge** two sketches will result in garbage unless the same exact hash function (and seed) was used to create the two sketches.

Suppose a user in department A creates a history of sketches using one hash function, and a user in department B creates a history of sketches using a slightly different hash function. These two histories can never be merged together.

The Druid-HLL sketch therefore must rely on **human** coordination between the users that created and stored the sketches, and the users that wish to merge these sketches later on in some query process. In fact these two groups of users could be separated not only by department, but by company and/or long periods of time. The Druid-HLL sketch has no means of detecting that two sketches may have been created using different hash functions (or seed).
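
To make the merge problem concrete, here is a toy HyperLogLog in Python. It is only a sketch under stated assumptions: it is not Druid's HyperLogLogCollector, the register layout is a textbook one, and the `salt` argument is a made-up stand-in for "a different hash function or seed". Two sketches of the same 20,000 items are merged, once with matching hashes and once with mismatched ones.

```python
import hashlib
import math

P = 10          # index bits (assumed for this toy example)
M = 1 << P      # 1024 registers

def hash64(item: str, salt: str) -> int:
    """64-bit hash; `salt` is a hypothetical stand-in for the user's hash choice."""
    digest = hashlib.sha256((salt + ":" + item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def build(items, salt):
    """Build the register array of a toy HLL sketch."""
    regs = [0] * M
    for item in items:
        h = hash64(item, salt)
        idx = h >> (64 - P)                      # top P bits pick the register
        rest = h & ((1 << (64 - P)) - 1)         # remaining 54 bits
        rank = (64 - P) - rest.bit_length() + 1  # leading zeros in `rest`, plus one
        regs[idx] = max(regs[idx], rank)
    return regs

def merge(a, b):
    """Register-wise max. Nothing here can tell which hash produced a or b."""
    return [max(x, y) for x, y in zip(a, b)]

def estimate(regs):
    """Textbook HLL estimate with the small-range (linear counting) correction."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)
    return raw

items = [f"user-{i}" for i in range(20_000)]   # the SAME items on both sides

same = estimate(merge(build(items, "hash-A"), build(items, "hash-A")))
mixed = estimate(merge(build(items, "hash-A"), build(items, "hash-B")))

print(f"same hash on both sides: {same:,.0f}")
print(f"different hashes:        {mixed:,.0f}")
```

Both merges run without raising anything. With matching hashes the estimate should land near the true count of 20,000; with mismatched hashes it should come out at roughly double, because the merged registers look like those of 40,000 distinct items, and nothing in the stored data reveals the mistake.
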
In other words, this is a silent failure mechanism.

**The DataSketches-HLL sketch always uses the same hash function (and seed), and it cannot be changed by the user.** This was an intentional design decision to prevent the above disaster, which is almost guaranteed to happen in large corporations.

**The DataSketches-Theta (and Tuple and CPC) sketch families always use the same hash function, but do allow users to optionally specify a different seed.** This was a specific design decision to give groups that must sketch sensitive data a bit more security in their stored data. The chosen seed is never stored in the sketch image for this reason. However, the Theta sketches have a safety mechanism that detects whether different seeds were used to create the sketch images. All merge operations first check that the sketches are, in fact, compatible, and throw an exception if not.

> Patching hyperUnique's implementation to improve its error characteristics.

The design flaw that is causing the [poor error on merging](https://datasketches.github.io/docs/HLL/HllSketchVsDruidHyperLogLogCollector.html) is fundamental. The Druid-HLL sketch does not keep sufficient information in its stored image to decode it accurately. Therefore, no algorithm, however clever, could reconstruct this missing data. And the Druid-HLL sketch provides no warning mechanism that these estimation errors will occur. Even if it were fixed with a new format that retained the missing data, merging old sketches with new ones would still have the same error problems.

The flaws in the design of the Druid-HLL sketch don't stop there; it has many other design problems. I see no benefit in attempting to "fix" the Druid-HLL sketch, as it would require a complete redesign from scratch. But we have already designed a far superior HLL sketch as part of the DataSketches library.
I strongly suggest we document these issues, communicate them to the user community, and recommend that users migrate to the DataSketches library.
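
For readers wondering how a sketch can detect a seed mismatch without storing the seed, here is one possible approach, sketched in Python. It rests on an assumption for illustration only: the image carries a short fingerprint of the seed rather than the seed itself. This is not necessarily how the DataSketches library does it, `ToySketch` keeps an exact set of hashes rather than a real Theta sketch, and every name below is hypothetical.

```python
import hashlib
from dataclasses import dataclass, field

def seed_fingerprint(seed: int) -> int:
    """Short one-way fingerprint of the seed (assumed scheme, 32 bits)."""
    digest = hashlib.sha256(seed.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:4], "big")

def hash_item(item: str, seed: int) -> int:
    """Seeded 64-bit hash of an item."""
    data = seed.to_bytes(8, "big") + item.encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

@dataclass
class ToySketch:
    seed_fp: int                                # fingerprint only, never the seed
    hashes: set = field(default_factory=set)    # stand-in for the sketch contents

    @classmethod
    def build(cls, items, seed: int) -> "ToySketch":
        return cls(seed_fingerprint(seed), {hash_item(i, seed) for i in items})

    def union(self, other: "ToySketch") -> "ToySketch":
        # Check compatibility before touching any data, and fail loudly.
        if self.seed_fp != other.seed_fp:
            raise ValueError("incompatible sketches: seed fingerprints differ")
        return ToySketch(self.seed_fp, self.hashes | other.hashes)

a = ToySketch.build(["x", "y"], seed=9001)
b = ToySketch.build(["y", "z"], seed=9001)
c = ToySketch.build(["y", "z"], seed=1234)

print(len(a.union(b).hashes))   # same seed: the union proceeds
try:
    a.union(c)                  # different seed: rejected up front
except ValueError as err:
    print("merge refused:", err)
```

The point of the design is that the failure is loud: a mismatched merge raises an exception instead of returning a plausible-looking but wrong estimate.
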
