Re: Question about Datasketches HLL as replacement of Clearspring in Apache Cassandra

Štefan Miklošovič Thu, 13 Mar 2025 01:29:02 -0700

Hi everybody,

Thank you for your in-depth explanations!


We are trying to figure out how to live with this fact. We will most
probably choose the Apache Datasketches HLL implementation as the successor
of Clearspring HLL in Apache Cassandra. Older Cassandra deployments work
with Clearspring while newer ones would work with Datasketches, so there
will be a period where both implementations are necessary. If a node has a
mix of old and new data, we will still need to merge Clearspring HLLs.
Merging Datasketches HLLs will be possible only when no SSTables in the old
format, which contain Clearspring HLLs, are present.
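
A minimal sketch of that rule, just to illustrate the decision. All names
here (MergePathChooser, SketchFormat, chooseMergePath) are hypothetical and
exist neither in Cassandra nor in either library, and how new SSTables stay
mergeable with legacy ones during the transition is left open:

```java
import java.util.List;

// Hypothetical names throughout; nothing here is an existing Cassandra,
// Clearspring stream-lib, or DataSketches API.
final class MergePathChooser {

    /** Which HLL implementation produced the sketch stored with an SSTable. */
    enum SketchFormat { CLEARSPRING, DATASKETCHES }

    /**
     * Picks the implementation to merge with. A single legacy SSTable forces
     * the Clearspring path, because the two sketch types cannot be merged
     * with each other. An empty list contains no legacy SSTable, so it falls
     * through to DATASKETCHES (an assumption of this sketch).
     */
    static SketchFormat chooseMergePath(List<SketchFormat> sstableFormats) {
        boolean anyLegacy = sstableFormats.contains(SketchFormat.CLEARSPRING);
        return anyLegacy ? SketchFormat.CLEARSPRING : SketchFormat.DATASKETCHES;
    }

    public static void main(String[] args) {
        // Mixed node: old and new SSTables side by side.
        System.out.println(chooseMergePath(
                List.of(SketchFormat.DATASKETCHES, SketchFormat.CLEARSPRING)));
        // Fully migrated node: only new SSTables.
        System.out.println(chooseMergePath(
                List.of(SketchFormat.DATASKETCHES, SketchFormat.DATASKETCHES)));
    }
}
```

The choice is made per set of SSTables being merged, so a node would move to
the Datasketches path on its own once its last legacy SSTable is gone.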

Anyway, that is a completely different problem the Datasketches community
has nothing to do with.

Regards

On Thu, Mar 13, 2025 at 12:23 AM Lee Rhodes <lee...@gmail.com> wrote:

> Štefan,
> I noticed that I didn't directly answer your question whether any kind of
> merge is possible between the CS sketch and our HLL sketch. The answer is
> no.
>
> There are several reasons "merge" is impossible:
>
>    - The CS HyperLogLogPlus sketch uses MurmurHash3 (like we do) but they
>    implemented the 64-bit version, and our HLL sketch implemented the 128-bit
>    version of MurmurHash3, which is a totally different animal; which means
>    totally incompatible.  This alone means you are dead-in-the-water.
>    - In addition, even if the hash functions were compatible, the CS
>    HyperLogLogPlus attempted to implement the Google HLL++ sparse compression
>    algorithm but didn't do it correctly. This means the CS sketches in sparse
>    mode cannot be trusted.  And given the natural power-law distribution of
>    stream sizes with most big data sources, this means the vast majority of
>    your sketches (the smaller cardinality ones) are likely corrupted and
>    attempts to decompress these sketches to pure sparse mode would likely be
>    futile.
>
> Attempting data recovery is also impossible:
>
> One of the key properties of the HyperLogLog algorithm is that it is
> intentionally lossy.  The good part about this is that it reduces the size
> of the sketch considerably for a given accuracy.  The unfortunate
> consequence of this, in your case, is that any attempt to reverse engineer
> the bits of the sketch back to the original hashes is impossible: 1) there
> can be many millions of hashes that generate the same pattern of bits
> retained in the HLL array, and 2) even if you had the hashes, reversing
> them back to the original item data that created the sketch is also nearly
> impossible.
>
> I sympathize with your predicament, but I think your only recourse is to
> try to find the original data and recreate the sketches using our
> DataSketches HLL sketch.
>
>
> On Tue, Mar 11, 2025 at 5:56 PM Lee Rhodes <lee...@gmail.com> wrote:
>
>> Hello Štefan,
>>
>> We did a major study and comparison
>> <https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html> of the
>> DataSketches HLL sketch to the Clearspring implementation of the HLL++
>> sketch back in 2017 and found that the Clearspring sketch had serious error
>> problems, did not implement the Google HLL++ paper correctly, and is slow.
>>
>> To answer your question as to whether any of your CS HLL sketch data can
>> be recovered, I would say no. And even if it could be recovered, with the
>> serious problems of the CS implementation, I wouldn't trust it.
>>
>>
>> On Tue, Mar 11, 2025 at 6:24 AM Štefan Miklošovič <smikloso...@apache.org>
>> wrote:
>>
>>> Hello Datasketches community,
>>>
>>> I am from Apache Cassandra where we use Clearspring (1) for estimating
>>> the cardinalities for rows in Cassandra's SSTables. We serialize the whole
>>> HyperLogLog from (1) (more or less) to the disk and then we deserialize it
>>> back and we merge all logs together to know the final result across the
>>> whole data.
>>>
>>> (1) is, as you probably know, archived / not actively contributed
>>> anymore. Hence, we are looking for replacements.
>>>
>>> Datasketches is quite an obvious choice, but I would like answers to a
>>> few questions before the transition.
>>>
>>> We need to work with old data as well. If there is an SSTable on disk
>>> with an HLL from Clearspring, then we cannot merge it into a Datasketches
>>> HLL, right? In other words, this is not possible:
>>>
>>>     @Test
>>>     public void testMerging() throws Throwable
>>>     {
>>>         // wrapper around Clearspring
>>>         LegacyCardinality clearspringCardinality = new LegacyCardinality(new HyperLogLogPlus(13, 25));
>>>         clearspringCardinality.offerHashed(12345);
>>>
>>>         // wrapper around Datasketches HLL
>>>         DefaultCardinality datasketchesCardinality = new DefaultCardinality();
>>>         datasketchesCardinality.offerHashed(23456);
>>>
>>>         // this fails, as well as similar variations of that
>>>         clearspringCardinality.merge(new LegacyCardinality(HyperLogLogPlus.Builder.build(datasketchesCardinality.getBytes())).getCardinality());
>>>     }
>>>
>>> It would be great if you could confirm (or deny) that there is no way to
>>> merge these two together. How would you approach this problem in general?
>>> If they are not mergeable, then we would need to find another way to deal
>>> with this, but that is another story.
>>>
>>> I see that there is (2), which is a great in-depth description of the
>>> differences between the two, but to my knowledge it does not say whether
>>> one is convertible to the other.
>>>
>>> Thank you and regards
>>>
>>> Stefan Miklosovic
>>>
>>> (1) https://github.com/addthis/stream-lib/tree/master
>>> (2) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
>>>
>>
