Re: [DISCUSS] Replacement of SSTable's partition cardinality implementation from stream-lib to Apache Datasketches

Dmitry Konstantinov Fri, 03 Jan 2025 06:18:50 -0800

Hi Brain, I wanted it to be created under
https://cwiki.apache.org/confluence/display/CASSANDRA/Discussion but it
looks like I do not have grants to add a page there and Confluence
automatically selected this space to store the page.
I do not have permission to move it too :-(
Can I get grants to create pages under
https://cwiki.apache.org/confluence/display/CASSANDRA/ ?


Thank you,
Dmitry

On Fri, 3 Jan 2025 at 14:12, Brian Proffitt <[email protected]> wrote:

> Dmitry:
>
> You are using a section of the Confluence wiki that is dedicated to
> Community Over Code, the Apache Conference. Please move that page to a more
> appropriate part of the Apache wiki as soon as you can.
>
> Thanks!
> BKP
>
> On 2025/01/03 13:55:49 Dmitry Konstantinov wrote:
> > I have summarized information from this mail thread to
> >
> https://cwiki.apache.org/confluence/display/COC/SSTable%27s+partition+cardinality+implementation
> > Probably later it can be transformed to a CEP..
> > Regarding experience of DataSketches library's authors and publications
> > here there is a good summary in Background section:
> >
> https://cwiki.apache.org/confluence/display/INCUBATOR/DataSketchesProposal
> > . It looks good..
> >
> > On Fri, 3 Jan 2025 at 13:06, Štefan Miklošovič <[email protected]>
> > wrote:
> >
> > > Right ... that sounds reasonable. Let's "sleep on it" for a while. It
> is
> > > not something which is urgent to deal with right now but I find myself
> > > quite often to identify the functionality where we go to the disk more
> > > often than necessary and this was next on the list to take a look at
> > > reading CASSANDRA-13338. So I took a look ... and here we are.
> > >
> > > If you guys go to bump SSTable version in 5.1 / 6.0, this change might
> be
> > > just shipped with that too.
> > >
> > > On Fri, Jan 3, 2025 at 1:47 PM Benedict <[email protected]> wrote:
> > >
> > >> I’ve had a quick skim of the data sketches library, and it does seem
> to
> > >> have made some more efficient decisions in its design than
> clearspring,
> > >> appears to maybe support off-heap representations, and has reasonably
> good
> > >> documentation about the theoretical properties of the sketches. The
> chair
> > >> of the project is a published author on the topic, and the library has
> > >> newer algorithms for cardinality estimation than HLL.
> > >>
> > >> So, honestly, it might not be a bad idea to (carefully) consider a
> > >> migration, even if the current library isn’t broken for our needs.
> > >>
> > >> It would not be high up my priority list for the project, but I would
> > >> support it if it scratches someone’s itch.
> > >>
> > >> On 3 Jan 2025, at 12:16, Štefan Miklošovič <[email protected]>
> > >> wrote:
> > >>
> > >> 
> > >> Okay ... first problems.
> > >>
> > >> These 2000 bytes I have mentioned in my response to Chris were indeed
> > >> correct, but that was with Datasketches and the main parameter for
> Hall
> > >> Sketch (DEFAULT_LG_K) was 12. When I changed that to 13 to match what
> we
> > >> currently have in Cassandra with Clearspring, that doubled the size to
> > >> ~4000 bytes.
> > >>
> > >> When we do not use Datasketches, what Clearspring generates is about
> > >> ~5000 bytes for the array itself but that array is wrapped into an
> > >> ICardinality object of Clearspring and we need that object in order to
> > >> merge another ICardinality into that. So, we would need to cache this
> > >> ICardinality object instead of just an array itself. If we don't cache
> > >> whole ICardinality, we would then need to do basically what
> > >> CompactionMetadata.CompactionMetadataSerializer.deserialize is doing
> which
> > >> would allocate a lot / often (ICardinality cardinality =
> > >> HyperLogLogPlus.Builder.build(that_cached_array)).
> > >>
> > >> To avoid the allocations every time we compute, we would just cache
> that
> > >> whole ICardinality of Clearspring, but that whole object measures like
> > >> 11/12 KB. So even 10k tables would occupy like 100MB. 50k tables
> 500MB.
> > >> That is becoming quite a problem.
> > >>
> > >> On the other hand, HllSketch of Datasketches, array included, adds
> > >> minimal overhead. Like an array has 5000 bytes and the whole object
> like
> > >> 5500. You got the idea ...
> > >>
> > >> If we are still OK with these sizes, sure ... I am just being
> transparent
> > >> about the consequences here.
> > >>
> > >> A user would just opt-in into this (by default it would be turned
> off).
> > >>
> > >> On the other hand, if we have 10k SSTables, reading that 10+KB from
> disk
> > >> takes around 2-3ms so we would read the disk 20/30 seconds every time
> we
> > >> would hit that metric (and we haven't even started to merge the logs).
> > >>
> > >> If this is still not something which would sell Datasketches as a
> viable
> > >> alternative then I guess we need to stick to these numbers and cache
> it all
> > >> with Clearspring, occupying way more memory.
> > >>
> > >> On Thu, Jan 2, 2025 at 10:15 PM Benedict <[email protected]> wrote:
> > >>
> > >>> I would like to see somebody who has some experience writing data
> > >>> structures, preferably someone we trust as a community to be
> competent at
> > >>> this (ie having some experience within the project contributing at
> this
> > >>> level), look at the code like they were at least lightly reviewing
> the
> > >>> feature as a contribution to this project.
> > >>>
> > >>> This should be the bar for any new library really, but triply so for
> > >>> replacing a library that works fine.
> > >>>
> > >>> On 2 Jan 2025, at 21:02, Štefan Miklošovič <[email protected]>
> > >>> wrote:
> > >>>
> > >>> 
> > >>> Point 2) is pretty hard to fulfil, I can not imagine what would be
> > >>> "enough" for you to be persuaded. What should concretely happen?
> Because
> > >>> whoever comes and says "yeah this is a good lib, it works" is
> probably not
> > >>> going to be enough given the vague requirements you put under 2) You
> would
> > >>> like to see exactly what?
> > >>>
> > >>> The way it looks to me is to just shut it down because of perceived
> > >>> churn caused by that and there will always be some argument against
> that.
> > >>>
> > >>> Based on (1) I don't think what we have is bug free.
> > >>>
> > >>> Jeff:
> > >>>
> > >>> Thank you for that answer, I think we are on the same page that
> caching
> > >>> it is just fine, that's what I got from your last two paragraphs.
> > >>>
> > >>> So the path from here is
> > >>>
> > >>> 1) add datasketches and cache
> > >>> 2) don't add datasketches and cache it anyway
> > >>>
> > >>> The introduction of datasketches lib is not the absolute must in
> order
> > >>> to achieve that, we can cache / compute it parallel with Clearspring
> as
> > >>> well, it is just a bitter-sweet solution which just doesn't feel
> right.
> > >>>
> > >>> (1) https://github.com/addthis/stream-lib/issues
> > >>>
> > >>> On Thu, Jan 2, 2025 at 9:26 PM Benedict <[email protected]> wrote:
> > >>>
> > >>>> Your message seemed to be all about the caching proposal, which I
> have
> > >>>> proposed we separate, hence my confusion.
> > >>>>
> > >>>> To restate my answer to your question, I think that unless the new
> > >>>> library actually offers us concrete benefits we can point to that we
> > >>>> actually care about then yes it’s a bad idea to incur the churn of
> > >>>> migration.
> > >>>>
> > >>>> I’m not inherently opposed to a migration but simply “new is
> better” is
> > >>>> just plain wrong. Nothing you’ve presented yet convinces me this
> library is
> > >>>> worth the effort of vetting given our current solution works fine.
> > >>>>
> > >>>> My position is that for any new library we should:
> > >>>>
> > >>>> 1) Point to something it solves that we actually want and is worth
> the
> > >>>> time investment
> > >>>> 2) Solicit folk in the community competent in the relevant data
> > >>>> structures to vet the library for the proposed functionality
> > >>>>
> > >>>> The existing solution never went through (2) because it dates from
> the
> > >>>> dark ages where we just threw dependencies in willynilly. But it
> has the
> > >>>> benefit of having been used for a very long time without incident.
> > >>>>
> > >>>>
> > >>>>
> > >>>> On 2 Jan 2025, at 20:12, Štefan Miklošovič <[email protected]>
> > >>>> wrote:
> > >>>>
> > >>>> 
> > >>>> Hi Benedict,
> > >>>>
> > >>>> you wrote:
> > >>>>
> > >>>> I am strongly opposed to updating libraries simply for the sake of
> it.
> > >>>> Something like HLL does not need much ongoing maintenance if it
> works.
> > >>>> We’re simply asking for extra work and bugs by switching, and some
> risk
> > >>>> without understanding the quality control for the new library
> project’s
> > >>>> releases.
> > >>>>
> > >>>> I understand this. But really, do you think that it is a bad idea to
> > >>>> switch to a well maintained library which is already used quite
> widely (the
> > >>>> website mentions extensions for sketches in Apache Druid, Hive,
> Pig, Pinot
> > >>>> and PostgreSQL) and using the library which was abandoned for 6
> years?
> > >>>>
> > >>>> As I mentioned there is also extensive comparison with Clearspring
> (1)
> > >>>> where all performance benefits / speedups etc present in detail
> with charts
> > >>>> attached.
> > >>>>
> > >>>> I think this is a mature project, under Apache, so when we think
> that a
> > >>>> 6 years old and abandoned library is better than what Apache
> Datasketches
> > >>>> provides, then the question is what are we doing here? Are we not
> believing
> > >>>> what Apache itself offers and we need to rely on a 6 years old and
> dead
> > >>>> library instead of that? Huh? That lib has 3k commits, releases
> often, it's
> > >>>> a pretty active project ...
> > >>>>
> > >>>> I don't say that we should not test more deeply how it behaves, we
> > >>>> might even re-consider the parameters of hyperloglog as we do that.
> But I
> > >>>> don't think that having this library introduced would cause some
> kind of a
> > >>>> widespread / systemic risk.
> > >>>>
> > >>>> (1) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
> > >>>>
> > >>>> On Thu, Jan 2, 2025 at 5:03 PM Benedict <[email protected]>
> wrote:
> > >>>>
> > >>>>> I am strongly opposed to updating libraries simply for the sake of
> it.
> > >>>>> Something like HLL does not need much ongoing maintenance if it
> works.
> > >>>>> We’re simply asking for extra work and bugs by switching, and some
> risk
> > >>>>> without understanding the quality control for the new library
> project’s
> > >>>>> releases.
> > >>>>>
> > >>>>> That said, I was not very impressed with the clear spring library
> when
> > >>>>> I looked at it, so I would be open to a stronger argument about
> data
> > >>>>> sketches being superior otherwise in a way that matters to us.
> > >>>>>
> > >>>>> If we are to replace the library, we should at the very least do
> > >>>>> proper due diligence by reviewing the new library’s
> implementation(s)
> > >>>>> ourselves. We cannot simply assume the new library behaves well
> for our use
> > >>>>> cases, or is well maintained.
> > >>>>>
> > >>>>> We should also not use the fallback intersection method, as this
> would
> > >>>>> represent a regression to compaction on upgrade. We should really
> convert
> > >>>>> from one HLL to another.
> > >>>>>
> > >>>>> The proposal to reduce allocations appears to be orthogonal to this
> > >>>>> library, so let’s separate out that discussion? If there’s
> evidence this
> > >>>>> library alone improves the memory profile let’s discuss that.
> > >>>>>
> > >>>>>
> > >>>>> On 2 Jan 2025, at 15:26, Chris Lohfink <[email protected]>
> wrote:
> > >>>>>
> > >>>>> 
> > >>>>> I think switching to datasketches is a good idea first off simply
> > >>>>> because of the lack of maintenance and improvements from
> clearspring. I am
> > >>>>> however, am not sold that it will actually improve anything
> significantly.
> > >>>>> Caches might help on small cases, but those small cases probably
> are not
> > >>>>> actually impacted. In the large cases the caches cost more in
> complexity,
> > >>>>> memory, and ultimately wont matter when theres 50k sstables and
> the cache
> > >>>>> holds 1k so everythings hitting disk anyway.
> > >>>>>
> > >>>>> The 5% is missing some relevant information like what the
> allocation
> > >>>>> rate was, how many tables there are etc. On an idle system thats
> > >>>>> meaningless, if there were 5gb/s allocations of reads/writes
> happening at
> > >>>>> the time thats huge.
> > >>>>>
> > >>>>> On Thu, Jan 2, 2025 at 8:42 AM Štefan Miklošovič <
> > >>>>> [email protected]> wrote:
> > >>>>>
> > >>>>>> Interesting, thanks for this. Well ... 5% here, 5% there ... it
> > >>>>>> compounds. I think it is worth trying to do something with this.
> Would be
> > >>>>>> great if you were part of this effort!
> > >>>>>>
> > >>>>>> On Thu, Jan 2, 2025 at 3:38 PM Dmitry Konstantinov <
> > >>>>>> [email protected]> wrote:
> > >>>>>>
> > >>>>>>> I have seen this place in async profiler memory allocation
> profile
> > >>>>>>> on one of production environments some time ago, it was visible
> but not
> > >>>>>>> dramatic, about 5% of allocations:
> > >>>>>>> <image.png>
> > >>>>>>>
> > >>>>>>> The amount of overhead also depends on a metric collection
> frequency
> > >>>>>>> (in my case it was once per 60 seconds)
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Dmitry
> > >>>>>>>
> > >>>>>>> On Thu, 2 Jan 2025 at 14:21, Štefan Miklošovič <
> > >>>>>>> [email protected]> wrote:
> > >>>>>>>
> > >>>>>>>> Indeed, I plan to measure it and compare, maybe some bench test
> > >>>>>>>> would be cool to add ..
> > >>>>>>>>
> > >>>>>>>> I strongly suspect that the primary reason for the slowness (if
> it
> > >>>>>>>> is verified to be true) is us going to the disk every time and
> reading
> > >>>>>>>> stats for every SSTable all over again.
> > >>>>>>>>
> > >>>>>>>> While datasketches say that it is way faster to update (1), we
> are
> > >>>>>>>> living in a realm of nanoseconds here and I don't think that
> itself would
> > >>>>>>>> make any meaningful difference when merging one hyperloglog
> with another as
> > >>>>>>>> part of partition rows estimation computation. The only place
> we are
> > >>>>>>>> updating is SortableTableWriter#endParition which calls
> > >>>>>>>> metadatacollector.addKey(key.getKey()) which eventually updates
> the
> > >>>>>>>> estimator via cardinality#offeredHashed.
> > >>>>>>>>
> > >>>>>>>> In other words, I think that going to the disk and reading it
> > >>>>>>>> repeatedly is disproportionally more IO / time intensive than
> switching the
> > >>>>>>>> hyperloglog implementation.
> > >>>>>>>>
> > >>>>>>>> However, I consider the replacement of the library still
> important.
> > >>>>>>>> I feel uneasy about staying with an abandoned library where
> there is
> > >>>>>>>> clearly a well-maintained replacement.
> > >>>>>>>>
> > >>>>>>>> What we could do is to cache all cardinality estimators and just
> > >>>>>>>> merge it all when asked upon metric resolution. That is
> different from
> > >>>>>>>> going to disk to deserialize StatsComponent's all over again.
> > >>>>>>>>
> > >>>>>>>> Then on SSTable removal, we would remove that from cache too. I
> > >>>>>>>> think there is some kind of an observer when SSTable is removed
> ...
> > >>>>>>>>
> > >>>>>>>> However, I am not sure I can just hold it all in the memory, it
> > >>>>>>>> works for laptop testing but if we have thousands of SSTables
> with
> > >>>>>>>> non-trivial number of rows things start to get interesting.
> > >>>>>>>>
> > >>>>>>>> (1)
> https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
> > >>>>>>>> - section HllSketch vs. HyperLogLogPlus Update Speed Behavior
> > >>>>>>>>
> > >>>>>>>> On Thu, Jan 2, 2025 at 2:46 PM Jon Haddad <
> [email protected]>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Sounds interesting.  I took a look at the issue but I'm not
> seeing
> > >>>>>>>>> any data to back up "expensive".  Can this be quantified a bit
> more?
> > >>>>>>>>>
> > >>>>>>>>> Anytime we have a performance related issue, there should be
> some
> > >>>>>>>>> data to back it up, even if it seems obvious.
> > >>>>>>>>>
> > >>>>>>>>> Jon
> > >>>>>>>>>
> > >>>>>>>>> On Thu, Jan 2, 2025 at 8:20 AM Štefan Miklošovič <
> > >>>>>>>>> [email protected]> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hello,
> > >>>>>>>>>>
> > >>>>>>>>>> I just stumbled upon this library we are using for getting
> > >>>>>>>>>> estimations of the number of partitions in a SSTable which
> are used e.g. in
> > >>>>>>>>>> EstimatedPartitionCount metric. (1)
> > >>>>>>>>>>
> > >>>>>>>>>> A user reported in (1) that it is an expensive operation. When
> > >>>>>>>>>> one looks into what it is doing, it calls
> > >>>>>>>>>> SSTableReader.getApproximateKeyCount() (6) which basically
> goes to disk
> > >>>>>>>>>> every single time, it loads all Stats components and it looks
> into
> > >>>>>>>>>> CompactionMetadata where the cardinality estimator is located.
> > >>>>>>>>>>
> > >>>>>>>>>> We are serializing the hyperloglog to disk as part of a
> SSTable
> > >>>>>>>>>> and we deserialize it back in runtime for every SSTable in a
> CF and we
> > >>>>>>>>>> merge them all to one cardinality again.
> > >>>>>>>>>>
> > >>>>>>>>>> I do not think there is a way around this because of the
> nature
> > >>>>>>>>>> of how a cardinality estimator works (hyperloglog). We can
> not "cache it",
> > >>>>>>>>>> it would work only in case we are adding SSTables only -
> hence we would
> > >>>>>>>>>> just merge again - but if we remove an SSTable as part of the
> compaction,
> > >>>>>>>>>> we can not "unmerge" it.
> > >>>>>>>>>>
> > >>>>>>>>>> That being said, we are currently using this library for
> > >>>>>>>>>> hyperloglog (1) which was archived in summer 2020 and nothing
> was
> > >>>>>>>>>> contributed to that for 6 years. That lib is dead.
> > >>>>>>>>>>
> > >>>>>>>>>> There is very nice replacement of that (2) directly from
> Apache
> > >>>>>>>>>> (!!!) and they are even giving the detailed and in-depth
> comparison of
> > >>>>>>>>>> hyperloglog implementation found in stream-lib we happen to
> use (3)
> > >>>>>>>>>> (stream-lib = Clearspring) where they are saying that
> updating is way
> > >>>>>>>>>> faster and it is also giving better estimations in general.
> > >>>>>>>>>>
> > >>>>>>>>>> I have implemented the usage of both cardinality estimators
> (4),
> > >>>>>>>>>> (5). The reason we need to keep the old one around is that we
> may have old
> > >>>>>>>>>> SSTables around and we need to work with them too. That
> translates to a new
> > >>>>>>>>>> SSTable version (ob) which uses new implementation and for
> versions < ob,
> > >>>>>>>>>> it uses the old one. When SSTables are upgraded from oa to
> ob, the old
> > >>>>>>>>>> estimator will not be used anymore.
> > >>>>>>>>>>
> > >>>>>>>>>> There is also a case of a user not upgrading his oa SSTables,
> > >>>>>>>>>> turning a node on and creating new SSTables with ob version.
> When this
> > >>>>>>>>>> happens and we ask what is the cardinality (e.g via nodetool
> tablestats), I
> > >>>>>>>>>> am checking if all SSTables are on the same version or not.
> If they are,
> > >>>>>>>>>> they will use either an old or new estimator. (we can not
> merge estimations
> > >>>>>>>>>> from two different hyperloglog implementations). If they are
> not, it will
> > >>>>>>>>>> compute that from index summaries. (The computation for index
> summaries was
> > >>>>>>>>>> already in place (6) as a fail-over in case the estimation
> computation
> > >>>>>>>>>> failed / was not present).
> > >>>>>>>>>>
> > >>>>>>>>>> Does this all make sense to drive further to the completion
> and
> > >>>>>>>>>> eventually merge this work to trunk?
> > >>>>>>>>>>
> > >>>>>>>>>> Worth to add that Apache Datasketches are just two
> dependencies
> > >>>>>>>>>> for us, it has zero external dependencies.
> > >>>>>>>>>>
> > >>>>>>>>>> (1) https://github.com/addthis/stream-lib
> > >>>>>>>>>> (2) https://datasketches.apache.org/
> > >>>>>>>>>> (3)
> https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
> > >>>>>>>>>> (4) https://issues.apache.org/jira/browse/CASSANDRA-13338
> > >>>>>>>>>> (5) https://github.com/apache/cassandra/pull/3767
> > >>>>>>>>>> (6)
> > >>>>>>>>>>
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/SSTableReader.java#L284-L338
> > >>>>>>>>>>
> > >>>>>>>>>> Regards
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> Dmitry Konstantinov
> > >>>>>>>
> > >>>>>>
> >
> > --
> > Dmitry Konstantinov
> >
>


-- 
Dmitry Konstantinov

Re: [DISCUSS] Replacement of SSTable's partition cardinality implementation from stream-lib to Apache Datasketches

Reply via email to