In general, unless explicit care was taken to make things mergeable (bit-level compatibility between hash functions, consistent handling of string inputs, etc.), the safe assumption is that different implementations of the same algorithm will not be mergeable.
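To make the hash-function point concrete, here is a minimal toy sketch. It is not the DataSketches or Clearspring code: the class `HashMismatchDemo`, the two mixers standing in for "implementation A" and "implementation B", and the parameters (`lgK = 12`, 100,000 items) are all assumptions chosen for illustration. It feeds the same items into two textbook HyperLogLog register arrays and merges them by register-wise max.

```java
import java.util.function.LongUnaryOperator;

// Toy HyperLogLog for illustration only; NOT the DataSketches or Clearspring code.
class HashMismatchDemo {

    // Hypothetical "implementation A" hash: the SplitMix64 mixer.
    static long hashA(long x) {
        long z = x + 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // Hypothetical "implementation B" hash: the MurmurHash3 64-bit finalizer.
    static long hashB(long x) {
        long z = x;
        z ^= z >>> 33;
        z *= 0xff51afd7ed558ccdL;
        z ^= z >>> 33;
        z *= 0xc4ceb9fe1a85ec53L;
        z ^= z >>> 33;
        return z;
    }

    // Build 2^lgK registers from the items 1..n using the given hash.
    static byte[] build(int lgK, LongUnaryOperator hash, int n) {
        byte[] regs = new byte[1 << lgK];
        for (long item = 1; item <= n; item++) {
            long h = hash.applyAsLong(item);
            int idx = (int) (h >>> (64 - lgK));   // top lgK bits select the register
            long rest = h << lgK;                 // remaining bits, left-aligned
            // 1-based position of the first 1-bit, capped when rest is all zeros
            int rho = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - lgK + 1);
            if (rho > regs[idx]) {
                regs[idx] = (byte) rho;
            }
        }
        return regs;
    }

    // Merge two register arrays of equal length by taking the max per register.
    static byte[] merge(byte[] a, byte[] b) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = (byte) Math.max(a[i], b[i]);
        }
        return out;
    }

    // Textbook HLL estimate with the small-range (linear counting) correction.
    static double estimate(byte[] regs) {
        int m = regs.length;
        double sum = 0.0;
        int zeros = 0;
        for (byte r : regs) {
            sum += Math.pow(2.0, -r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m);   // constant valid for m >= 128
        double raw = alpha * m * (double) m / sum;
        if (raw <= 2.5 * m && zeros > 0) {
            return m * Math.log((double) m / zeros);
        }
        return raw;
    }

    public static void main(String[] args) {
        int n = 100_000;
        byte[] a = build(12, HashMismatchDemo::hashA, n);
        byte[] b = build(12, HashMismatchDemo::hashB, n);
        // Both sketches saw the SAME n items, so a correct union is about n.
        System.out.println("same hash:  " + estimate(merge(a, a)));
        System.out.println("mixed hash: " + estimate(merge(a, b)));
    }
}
```

With matching hashes the merged registers equal the originals and the estimate stays near the true count. With mismatched hashes the same item lands in unrelated registers, so the merge behaves as if it had seen two disjoint sets and the estimate comes out at roughly double. Real implementations can also differ in register layout, seeds, and serialization format, which this toy does not model.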
Within our library we designed things to be compatible across languages and have tests to confirm that. And once we got the tests working it turned out we had some minor bugs causing issues, despite making an explicit compatibility effort.

jon

On Tue, Mar 11, 2025 at 5:56 PM Lee Rhodes <lee...@gmail.com> wrote:
>
> Hello Štefan,
>
> We did a major study and comparison of the DataSketches HLL sketch to the
> Clearspring implementation of the HLL++ sketch back in 2017 and found that
> the Clearspring sketch had serious error problems, did not implement the
> Google HLL++ paper correctly, and is slow.
>
> To answer your question as to whether any of your CS HLL sketch data can be
> recovered, I would say no. And even if it could be recovered, with the
> serious problems of the CS implementation, I wouldn't trust it.
>
>
> On Tue, Mar 11, 2025 at 6:24 AM Štefan Miklošovič <smikloso...@apache.org>
> wrote:
>>
>> Hello Datasketches community,
>>
>> I am from Apache Cassandra where we use Clearspring (1) for estimating the
>> cardinalities for rows in Cassandra's SSTables. We serialize the whole
>> HyperLogLog from (1) (more or less) to the disk and then we deserialize it
>> back and we merge all logs together to know the final result across the
>> whole data.
>>
>> (1) is, as you probably know, archived / not actively contributed anymore.
>> Hence, we are looking for replacements.
>>
>> Datasketches are quite an obvious choice but I would like to know some
>> answers to the questions before the transition.
>>
>> We need to work with old data as well. If there is an SSTable on a disk with
>> HLL from Clearspring, then we can not merge this to Datasketches, right?
>> In other words, this is not possible:
>>
>> @Test
>> public void testMerging() throws Throwable
>> {
>>     // wrapper around Clearspring
>>     LegacyCardinality clearspringCardinality = new LegacyCardinality(new HyperLogLogPlus(13, 25));
>>     clearspringCardinality.offerHashed(12345);
>>
>>     // wrapper around Datasketches HLL
>>     DefaultCardinality datasketchesCardinality = new DefaultCardinality();
>>     datasketchesCardinality.offerHashed(23456);
>>
>>     // this fails, as well as similar variations of that
>>     clearspringCardinality.merge(new LegacyCardinality(HyperLogLogPlus.Builder.build(datasketchesCardinality.getBytes())).getCardinality());
>> }
>>
>> It would be great if you confirmed (or denied) that there is no way to merge
>> these two together. How would you go around this problem in general? If they
>> are not mergeable, then we would need to find another way to deal with this
>> but that is another story.
>>
>> I see that there is (2) which is a great in-depth description of differences
>> between two but there is no information to my knowledge which would say if
>> one is convertible to another.
>>
>> Thank you and regards
>>
>> Stefan Miklosovic
>>
>> (1) https://github.com/addthis/stream-lib/tree/master
>> (2) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html