[GitHub] leerho commented on issue #6814: [Discuss] Replacing hyperUnique as 'default' distinct count sketch

GitBox Tue, 08 Jan 2019 20:25:29 -0800

leerho commented on issue #6814: [Discuss] Replacing hyperUnique as 'default'
distinct count sketch
URL:
https://github.com/apache/incubator-druid/issues/6814#issuecomment-452567631

> The on-disk format is not compatible, and cannot be, due to the difference
in hash functions used.

Let me clarify the problem with the Druid-HLL sketch with respect to its use
of hash functions.

**The Druid-HLL sketch delegates the responsibility of hashing the input
items to the user.**

The only requirement is that the resulting hash be at least 10 bytes long.
Although this may seem to be advantageous by providing user flexibility, it has
very serious drawbacks.

When sketches are serialized and stored, the resulting data is a highly
compressed jumble of bits similar to a picture image. Unless you understand
how to decode the sequence of bits and know exactly how it was encoded, the
sketch image is not very useful. The chosen hashing function (including any
seeding of the hash) determines how the bits are encoded. Although it is not
required to know what hash function (and seed) was used to encode the sketch in
order to decode it and obtain an estimate, any attempt to **merge** two
sketches will result in garbage unless the same exact hash function (and seed)
was used to create the two sketches.
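The merge problem can be illustrated with a toy HLL. This is only a minimal sketch under stated assumptions: it is not the Druid-HLL or DataSketches implementation, keyed BLAKE2b stands in for a seeded hash function, and the names `hash64`, `build`, `merge`, and `estimate` are hypothetical.

```python
import hashlib

P = 10          # number of index bits (assumed for this toy example)
M = 1 << P      # 1024 registers


def hash64(item: str, seed: int) -> int:
    # Assumption: keyed BLAKE2b plays the role of a seeded 64-bit hash.
    key = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode(), digest_size=8, key=key).digest()
    return int.from_bytes(digest, "little")


def build(items, seed):
    # Classic HLL update: low P bits pick a register, the remaining
    # 64 - P bits supply the "leading zeros + 1" rank.
    regs = [0] * M
    for item in items:
        h = hash64(item, seed)
        idx = h & (M - 1)
        rest = h >> P
        rank = (64 - P) - rest.bit_length() + 1
        regs[idx] = max(regs[idx], rank)
    return regs


def merge(a, b):
    # Merging is a register-wise max. Nothing here can tell whether
    # the two inputs were built with the same hash function and seed.
    return [max(x, y) for x, y in zip(a, b)]


def estimate(regs):
    # Raw HLL estimator, without small- or large-range corrections.
    alpha = 0.7213 / (1 + 1.079 / M)
    return alpha * M * M / sum(2.0 ** -r for r in regs)


if __name__ == "__main__":
    items = [f"user-{i}" for i in range(10_000)]
    dept_a = build(items, seed=1)
    dept_b_same = build(items, seed=1)
    dept_b_other = build(items, seed=2)
    print("same seed merge :", round(estimate(merge(dept_a, dept_b_same))))
    print("mixed seed merge:", round(estimate(merge(dept_a, dept_b_other))))
```

Both departments sketched the same 10,000 items, so a correct merge should report roughly 10,000. With mismatched seeds the merge treats the two inputs as unrelated streams and the estimate inflates, yet no error is raised.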

Suppose a user in department A creates a history of sketches using one hash
function, and a user in department B creates a history of sketches using a
slightly different hash function. These two histories can never be merged
together. The Druid-HLL sketch therefore must rely on **human** coordination
between the users that created and stored the sketches, and the users that wish
to merge these sketches later on in some query process. In fact these two
groups of users could be separated not only by department, but by company
and/or long periods of time.

The Druid-HLL sketch has no means of detecting that two sketches may have
been created using different hash functions (or seeds). In other words, this is
a silent failure mode.

**The DataSketches-HLL sketch always uses the same hash function (and seed)
and it cannot be changed by the user.**

This was an intentional design decision to prevent the above disaster that
is almost guaranteed to happen in large corporations.

**The DataSketches-Theta (and Tuple and CPC) sketch families always use the
same hash function, but do allow users to optionally specify a different seed.**

This was a specific design decision to provide certain groups that must
sketch sensitive data with a bit more security in their stored data. The
chosen seed is never stored in the sketch image for this reason.

However, the Theta sketches have a detection mechanism as a safety measure
to detect if different seeds were used in the creation of the sketch images.
All merge operations first check if the sketches are, in fact, compatible, and
throw an exception if not.
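One way such a check could work is to store a short fingerprint of the seed rather than the seed itself, and to compare fingerprints before merging. The following is a minimal sketch of that idea only. It is not the DataSketches code, and `seed_fingerprint`, `ToySketch`, and the 2-byte fingerprint size are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


def seed_fingerprint(seed: int) -> int:
    # Assumption: a 2-byte digest of the seed. The seed itself is not
    # stored in the sketch, only this short fingerprint.
    digest = hashlib.blake2b(seed.to_bytes(8, "little"), digest_size=2).digest()
    return int.from_bytes(digest, "little")


def hash64(item: str, seed: int) -> int:
    # Assumption: keyed BLAKE2b stands in for the seeded hash function.
    key = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode(), digest_size=8, key=key).digest()
    return int.from_bytes(digest, "little")


@dataclass
class ToySketch:
    # Toy stand-in that retains every hash; a real sketch keeps only a sample.
    seed_fp: int
    hashes: set = field(default_factory=set)

    @classmethod
    def from_items(cls, items, seed):
        return cls(seed_fingerprint(seed), {hash64(i, seed) for i in items})

    def merge(self, other: "ToySketch") -> "ToySketch":
        # Refuse to merge when the fingerprints disagree, so the failure
        # is loud instead of silent.
        if self.seed_fp != other.seed_fp:
            raise ValueError("incompatible sketches: seed fingerprints differ")
        return ToySketch(self.seed_fp, self.hashes | other.hashes)
```

A short fingerprint cannot rule out collisions between two different seeds, so a check like this is a safety net against accidents rather than a guarantee.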

> Patching hyperUnique's implementation to improve its error characteristics.

The design flaw that is causing the [poor error on
merging](https://datasketches.github.io/docs/HLL/HllSketchVsDruidHyperLogLogCollector.html)
is fundamental. The Druid-HLL sketch does not keep sufficient information in
its stored image to decode it accurately, so no algorithm, however clever,
can reconstruct the missing data. The Druid-HLL sketch also provides no
warning that these estimation errors will occur.

Even if it were fixed with a new format that retained the missing data,
merging old sketches with new ones would still have the same error problems.

The flaws in the design of the Druid-HLL sketch don't stop there; it has many
other design problems. I see no benefit in attempting to "fix" the Druid-HLL
sketch, as it would require a complete redesign from scratch. But we have
already designed a far superior HLL sketch as part of the DataSketches library.
I strongly suggest we document and communicate these issues to the user
community, and suggest they migrate to the DataSketches library.


