Thank you, Brandon! I have moved the page to https://cwiki.apache.org/confluence/display/CASSANDRA/SSTable%27s+partition+cardinality+implementation
On Fri, 3 Jan 2025 at 14:45, Brandon Williams <dri...@gmail.com> wrote:

I've granted access to the account "Dmitry Konstantinov (netudima)".

Kind Regards,
Brandon

On Fri, Jan 3, 2025 at 8:18 AM Dmitry Konstantinov <netud...@gmail.com> wrote:

Hi Brian, I wanted it to be created under https://cwiki.apache.org/confluence/display/CASSANDRA/Discussion but it looks like I do not have the grants to add a page there, and Confluence automatically selected this space to store the page. I do not have permission to move it either :-(
Can I get grants to create pages under https://cwiki.apache.org/confluence/display/CASSANDRA/ ?

Thank you,
Dmitry

On Fri, 3 Jan 2025 at 14:12, Brian Proffitt <b...@apache.org> wrote:

Dmitry:

You are using a section of the Confluence wiki that is dedicated to Community Over Code, the Apache conference. Please move that page to a more appropriate part of the Apache wiki as soon as you can.

Thanks!
BKP

On 2025/01/03 13:55:49 Dmitry Konstantinov wrote:

I have summarized the information from this mail thread at https://cwiki.apache.org/confluence/display/COC/SSTable%27s+partition+cardinality+implementation
Probably later it can be transformed into a CEP.
Regarding the experience of the DataSketches library's authors and their publications, there is a good summary in the Background section here: https://cwiki.apache.org/confluence/display/INCUBATOR/DataSketchesProposal. It looks good.

On Fri, 3 Jan 2025 at 13:06, Štefan Miklošovič <smikloso...@apache.org> wrote:

Right ... that sounds reasonable. Let's "sleep on it" for a while. It is not something which is urgent to deal with right now, but I quite often find myself identifying functionality where we go to the disk more often than necessary, and this was next on the list to take a look at after reading CASSANDRA-13338. So I took a look ... and here we are.

If you guys are going to bump the SSTable version in 5.1 / 6.0, this change might just be shipped with that too.

On Fri, Jan 3, 2025 at 1:47 PM Benedict <bened...@apache.org> wrote:

I've had a quick skim of the data sketches library, and it does seem to have made some more efficient decisions in its design than clearspring, appears to maybe support off-heap representations, and has reasonably good documentation about the theoretical properties of the sketches. The chair of the project is a published author on the topic, and the library has newer algorithms for cardinality estimation than HLL.

So, honestly, it might not be a bad idea to (carefully) consider a migration, even if the current library isn't broken for our needs.

It would not be high up my priority list for the project, but I would support it if it scratches someone's itch.

On 3 Jan 2025, at 12:16, Štefan Miklošovič <smikloso...@apache.org> wrote:

Okay ... first problems.

These 2000 bytes I mentioned in my response to Chris were indeed correct, but that was with Datasketches, and the main parameter for the HLL sketch (DEFAULT_LG_K) was 12.
When I changed that to 13 to match what we currently have in Cassandra with Clearspring, that doubled the size to ~4000 bytes.

When we do not use Datasketches, what Clearspring generates is about ~5000 bytes for the array itself, but that array is wrapped in an ICardinality object of Clearspring and we need that object in order to merge another ICardinality into it. So, we would need to cache this ICardinality object instead of just the array itself. If we don't cache the whole ICardinality, we would then need to do basically what CompactionMetadata.CompactionMetadataSerializer.deserialize is doing, which would allocate a lot / often (ICardinality cardinality = HyperLogLogPlus.Builder.build(that_cached_array)).

To avoid the allocations every time we compute, we would just cache that whole ICardinality of Clearspring, but that whole object measures something like 11/12 KB. So even 10k tables would occupy around 100 MB; 50k tables, 500 MB. That is becoming quite a problem.

On the other hand, HllSketch of Datasketches, array included, adds minimal overhead: the array has 5000 bytes and the whole object around 5500. You get the idea ...

If we are still OK with these sizes, sure ... I am just being transparent about the consequences here.

A user would just opt in to this (by default it would be turned off).

On the other hand, if we have 10k SSTables, reading those 10+ KB from disk takes around 2-3 ms per SSTable, so we would spend 20/30 seconds reading from disk every time we hit that metric (and we haven't even started to merge the hyperloglogs).

If this is still not something which would sell Datasketches as a viable alternative, then I guess we need to stick to these numbers and cache it all with Clearspring, occupying way more memory.
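For illustration, here is a minimal, hypothetical Java sketch of the two Clearspring options being weighed above: rebuilding the estimator from cached serialized bytes on every metric read (small cache entries, allocation-heavy, roughly the shape of what CompactionMetadata.CompactionMetadataSerializer.deserialize does), versus caching the whole ICardinality per SSTable (no per-read deserialization, but ~11-12 KB each). The Clearspring calls are the library's public API as mentioned in this thread; the surrounding class and method names are invented for this sketch.

import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;
import com.clearspring.analytics.stream.cardinality.ICardinality;

import java.io.IOException;
import java.util.Collection;

public final class ClearspringCachingSketch
{
    // Option A: cache only the serialized registers (~5000 bytes per SSTable) and
    // rebuild each estimator on every metric read. Every Builder.build() call and
    // every merge() call allocates a fresh estimator, which is the allocation
    // churn described above.
    static long estimateFromCachedBytes(Collection<byte[]> perSSTableBytes)
            throws IOException, CardinalityMergeException
    {
        ICardinality merged = null;
        for (byte[] bytes : perSSTableBytes)
        {
            ICardinality current = HyperLogLogPlus.Builder.build(bytes);
            merged = (merged == null) ? current : merged.merge(current);
        }
        return merged == null ? 0 : merged.cardinality();
    }

    // Option B: cache the whole ICardinality per SSTable (~11-12 KB each) so that
    // metric reads only pay for the merge, not for deserialization. merge()
    // returns a new object, so the cached estimators are never mutated.
    static long estimateFromCachedEstimators(Collection<ICardinality> perSSTableEstimators)
            throws CardinalityMergeException
    {
        ICardinality merged = null;
        for (ICardinality estimator : perSSTableEstimators)
            merged = (merged == null) ? estimator : merged.merge(estimator);
        return merged == null ? 0 : merged.cardinality();
    }
}

Either way the merge itself allocates; the difference is whether the per-SSTable estimator also has to be rebuilt from bytes on every read, which is exactly the memory-versus-allocation trade-off described above.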
On Thu, Jan 2, 2025 at 10:15 PM Benedict <bened...@apache.org> wrote:

I would like to see somebody who has some experience writing data structures, preferably someone we trust as a community to be competent at this (i.e. having some experience within the project contributing at this level), look at the code as if they were at least lightly reviewing the feature as a contribution to this project.

This should be the bar for any new library really, but triply so for replacing a library that works fine.

On 2 Jan 2025, at 21:02, Štefan Miklošovič <smikloso...@apache.org> wrote:

Point 2) is pretty hard to fulfil; I can not imagine what would be "enough" for you to be persuaded. What should concretely happen? Because whoever comes and says "yeah, this is a good lib, it works" is probably not going to be enough given the vague requirements you put under 2). You would like to see exactly what?

The way it looks to me is to just shut it down because of the perceived churn it would cause, and there will always be some argument against that.

Based on (1) I don't think what we have is bug free.

Jeff:

Thank you for that answer, I think we are on the same page that caching it is just fine; that's what I got from your last two paragraphs.

So the path from here is

1) add datasketches and cache
2) don't add datasketches and cache it anyway

The introduction of the datasketches lib is not an absolute must in order to achieve that; we can cache / compute it in parallel with Clearspring as well, it is just a bitter-sweet solution which doesn't feel right.

(1) https://github.com/addthis/stream-lib/issues

On Thu, Jan 2, 2025 at 9:26 PM Benedict <bened...@apache.org> wrote:

Your message seemed to be all about the caching proposal, which I have proposed we separate, hence my confusion.

To restate my answer to your question, I think that unless the new library actually offers us concrete benefits we can point to that we actually care about, then yes, it's a bad idea to incur the churn of migration.

I'm not inherently opposed to a migration, but simply "new is better" is just plain wrong. Nothing you've presented yet convinces me this library is worth the effort of vetting given our current solution works fine.

My position is that for any new library we should:

1) Point to something it solves that we actually want and is worth the time investment
2) Solicit folk in the community competent in the relevant data structures to vet the library for the proposed functionality

The existing solution never went through (2) because it dates from the dark ages where we just threw dependencies in willy-nilly. But it has the benefit of having been used for a very long time without incident.

On 2 Jan 2025, at 20:12, Štefan Miklošovič <smikloso...@apache.org> wrote:

Hi Benedict,

you wrote:

"I am strongly opposed to updating libraries simply for the sake of it. Something like HLL does not need much ongoing maintenance if it works. We're simply asking for extra work and bugs by switching, and some risk without understanding the quality control for the new library project's releases."

I understand this. But really, do you think that it is a bad idea to switch to a well maintained library which is already used quite widely (the website mentions extensions for sketches in Apache Druid, Hive, Pig, Pinot and PostgreSQL), while we are using a library which has been abandoned for 6 years?

As I mentioned, there is also an extensive comparison with Clearspring (1) where all the performance benefits / speedups etc. are presented in detail with charts attached.

I think this is a mature project, under Apache, so if we think that a 6-year-old, abandoned library is better than what Apache Datasketches provides, then the question is what are we doing here?
Are we not believing what Apache itself offers, and do we need to rely on a 6-year-old, dead library instead? Huh? That lib has 3k commits, releases often, it's a pretty active project ...

I don't say that we should not test more deeply how it behaves; we might even re-consider the parameters of hyperloglog as we do that. But I don't think that introducing this library would cause some kind of widespread / systemic risk.

(1) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html

On Thu, Jan 2, 2025 at 5:03 PM Benedict <bened...@apache.org> wrote:

I am strongly opposed to updating libraries simply for the sake of it. Something like HLL does not need much ongoing maintenance if it works. We're simply asking for extra work and bugs by switching, and some risk without understanding the quality control for the new library project's releases.

That said, I was not very impressed with the clearspring library when I looked at it, so I would be open to a stronger argument about data sketches being superior otherwise in a way that matters to us.

If we are to replace the library, we should at the very least do proper due diligence by reviewing the new library's implementation(s) ourselves. We cannot simply assume the new library behaves well for our use cases, or is well maintained.

We should also not use the fallback intersection method, as this would represent a regression to compaction on upgrade. We should really convert from one HLL to another.

The proposal to reduce allocations appears to be orthogonal to this library, so let's separate out that discussion? If there's evidence this library alone improves the memory profile, let's discuss that.

On 2 Jan 2025, at 15:26, Chris Lohfink <clohfin...@gmail.com> wrote:

I think switching to datasketches is a good idea, first off simply because of the lack of maintenance and improvements from clearspring. I am, however, not sold that it will actually improve anything significantly. Caches might help in small cases, but those small cases probably are not actually impacted. In the large cases the caches cost more in complexity and memory, and ultimately won't matter when there are 50k sstables and the cache holds 1k, so everything's hitting disk anyway.

The 5% is missing some relevant information, like what the allocation rate was, how many tables there are, etc. On an idle system that's meaningless; if there were 5 GB/s of allocations from reads/writes happening at the time, that's huge.

On Thu, Jan 2, 2025 at 8:42 AM Štefan Miklošovič <smikloso...@apache.org> wrote:

Interesting, thanks for this. Well ... 5% here, 5% there ... it compounds. I think it is worth trying to do something with this. Would be great if you were part of this effort!
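For readers who want to see the DataSketches side discussed above in code, here is a minimal, self-contained sketch against the org.apache.datasketches.hll API (HllSketch plus Union), using the lgK value quoted earlier in the thread; the per-SSTable framing and the synthetic keys are assumptions for illustration only, not the actual Cassandra patch.

import org.apache.datasketches.hll.HllSketch;
import org.apache.datasketches.hll.Union;

import java.util.ArrayList;
import java.util.List;

public final class DatasketchesHllDemo
{
    public static void main(String[] args)
    {
        int lgK = 13; // 2^13 registers, matching the precision Cassandra currently uses with Clearspring

        // Stand-in for the per-SSTable estimators: one sketch per "SSTable",
        // serialized the way it would be stored alongside the SSTable's stats.
        List<byte[]> serializedPerSSTable = new ArrayList<>();
        for (int sstable = 0; sstable < 3; sstable++)
        {
            HllSketch sketch = new HllSketch(lgK);
            // Synthetic, overlapping key ranges; in Cassandra the input would be
            // the hashed partition key.
            for (long key = sstable * 50_000L; key < sstable * 50_000L + 100_000L; key++)
                sketch.update(key);
            serializedPerSSTable.add(sketch.toCompactByteArray());
            System.out.println("sstable " + sstable + ": ~" + (long) sketch.getEstimate()
                               + " keys, " + sketch.toCompactByteArray().length + " serialized bytes");
        }

        // What a metric read would do: heapify each serialized sketch and fold it into a Union.
        Union union = new Union(lgK);
        for (byte[] bytes : serializedPerSSTable)
            union.update(HllSketch.heapify(bytes));

        System.out.println("estimated distinct partitions across all sstables: "
                           + (long) union.getResult().getEstimate());
    }
}

The overlapping key ranges are there to show why the merged estimate is the interesting number: the union de-duplicates keys shared between sketches rather than summing the per-sketch estimates.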
On Thu, Jan 2, 2025 at 3:38 PM Dmitry Konstantinov <netud...@gmail.com> wrote:

I saw this place in an async-profiler memory allocation profile on one of our production environments some time ago; it was visible but not dramatic, about 5% of allocations:
<image.png>

The amount of overhead also depends on the metric collection frequency (in my case it was once per 60 seconds).

Regards,
Dmitry

On Thu, 2 Jan 2025 at 14:21, Štefan Miklošovič <smikloso...@apache.org> wrote:

Indeed, I plan to measure it and compare; maybe some bench test would be cool to add ..

I strongly suspect that the primary reason for the slowness (if it is verified to be true) is us going to the disk every time and reading the stats for every SSTable all over again.

While Datasketches says that it is way faster to update (1), we are living in a realm of nanoseconds here and I don't think that by itself would make any meaningful difference when merging one hyperloglog with another as part of the partition estimation computation. The only place we are updating is SortedTableWriter#endPartition, which calls metadatacollector.addKey(key.getKey()), which eventually updates the estimator via cardinality#offerHashed.

In other words, I think that going to the disk and reading it repeatedly is disproportionately more IO / time intensive than switching the hyperloglog implementation.

However, I still consider the replacement of the library important. I feel uneasy about staying with an abandoned library when there is clearly a well-maintained replacement.

What we could do is cache all the cardinality estimators and just merge them all when asked, upon metric resolution. That is different from going to disk to deserialize the Stats components all over again.

Then on SSTable removal, we would remove it from the cache too. I think there is some kind of observer for when an SSTable is removed ...

However, I am not sure I can just hold it all in memory; it works for laptop testing, but if we have thousands of SSTables with a non-trivial number of rows, things start to get interesting.

(1) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html - section "HllSketch vs. HyperLogLogPlus Update Speed Behavior"

On Thu, Jan 2, 2025 at 2:46 PM Jon Haddad <j...@rustyrazorblade.com> wrote:

Sounds interesting. I took a look at the issue but I'm not seeing any data to back up "expensive". Can this be quantified a bit more?

Anytime we have a performance related issue, there should be some data to back it up, even if it seems obvious.

Jon
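A rough, entirely hypothetical sketch of the cache-and-merge-on-read idea described above (class, method and key names are invented here; only the HllSketch/Union calls belong to DataSketches): keep one sketch per live SSTable, merge them only when the metric is resolved, and drop entries on SSTable removal.

import org.apache.datasketches.hll.HllSketch;
import org.apache.datasketches.hll.Union;

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public final class PartitionCountEstimatorCache
{
    private static final int LG_K = 13;

    // Keyed by whatever identifies an SSTable in real code (descriptor / generation);
    // a String is used here purely for illustration.
    private final ConcurrentMap<String, HllSketch> perSSTable = new ConcurrentHashMap<>();

    // Called once when an SSTable becomes live (e.g. after flush or compaction),
    // from the serialized sketch already written out with the SSTable's stats.
    public void onSSTableAdded(String sstableId, byte[] serializedSketch)
    {
        perSSTable.put(sstableId, HllSketch.heapify(serializedSketch));
    }

    // Called from whatever notification fires when an SSTable is removed,
    // so the cache never has to "unmerge" anything.
    public void onSSTableRemoved(String sstableId)
    {
        perSSTable.remove(sstableId);
    }

    // Metric resolution: merge the cached sketches in memory instead of
    // re-reading and deserializing stats from disk for every SSTable.
    public long estimatedPartitionCount()
    {
        Union union = new Union(LG_K);
        for (HllSketch sketch : perSSTable.values())
            union.update(sketch);
        return (long) union.getResult().getEstimate();
    }
}

The memory cost of this shape is one heapified sketch per SSTable (the few-KB figures discussed earlier in the thread), which is exactly the trade-off Chris raises for the 50k-SSTable case.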
On Thu, Jan 2, 2025 at 8:20 AM Štefan Miklošovič <smikloso...@apache.org> wrote:

Hello,

I just stumbled upon this library we are using for getting estimations of the number of partitions in an SSTable, which are used e.g. in the EstimatedPartitionCount metric. (1)

A user reported in (1) that it is an expensive operation. When one looks into what it is doing, it calls SSTableReader.getApproximateKeyCount() (6), which basically goes to disk every single time: it loads all the Stats components and looks into CompactionMetadata, where the cardinality estimator is located.

We serialize the hyperloglog to disk as part of an SSTable, and at runtime we deserialize it back for every SSTable in a CF and merge them all into one cardinality again.

I do not think there is a way around this because of the nature of how a cardinality estimator (hyperloglog) works. We can not "cache it"; that would work only if we were adding SSTables only - then we would just merge again - but if we remove an SSTable as part of compaction, we can not "unmerge" it.

That being said, we are currently using this library for hyperloglog (1), which was archived in summer 2020; nothing has been contributed to it for 6 years. That lib is dead.

There is a very nice replacement for it (2), directly from Apache (!!!), and they even provide a detailed and in-depth comparison with the hyperloglog implementation found in stream-lib, which we happen to use (3) (stream-lib = Clearspring), where they show that updating is way faster and it also gives better estimations in general.

I have implemented the usage of both cardinality estimators (4), (5). The reason we need to keep the old one around is that we may have old SSTables around and we need to work with them too. That translates to a new SSTable version (ob) which uses the new implementation; for versions < ob, the old one is used. When SSTables are upgraded from oa to ob, the old estimator will not be used anymore.

There is also the case of a user not upgrading their oa SSTables, turning a node on and creating new SSTables with the ob version. When this happens and we ask what the cardinality is (e.g. via nodetool tablestats), I check whether all SSTables are on the same version or not. If they are, they will use either the old or the new estimator (we can not merge estimations from two different hyperloglog implementations). If they are not, it will compute that from index summaries.
(The computation from index summaries was already in place (6) as a fail-over in case the estimation computation failed / was not present.)

Does this all make sense to drive further to completion and eventually merge this work to trunk?

Worth adding that Apache Datasketches means just two dependencies for us, and it has zero external dependencies.

(1) https://github.com/addthis/stream-lib
(2) https://datasketches.apache.org/
(3) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
(4) https://issues.apache.org/jira/browse/CASSANDRA-13338
(5) https://github.com/apache/cassandra/pull/3767
(6) https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/SSTableReader.java#L284-L338

Regards

--
Dmitry Konstantinov