Thank you, Brandon!

I have moved the page to
https://cwiki.apache.org/confluence/display/CASSANDRA/SSTable%27s+partition+cardinality+implementation

On Fri, 3 Jan 2025 at 14:45, Brandon Williams <dri...@gmail.com> wrote:

> I've granted access to the account "Dmitry Konstantinov (netudima)"
>
> Kind Regards,
> Brandon
>
> On Fri, Jan 3, 2025 at 8:18 AM Dmitry Konstantinov <netud...@gmail.com>
> wrote:
> >
> > Hi Brian, I wanted it to be created under
> https://cwiki.apache.org/confluence/display/CASSANDRA/Discussion but it
> looks like I do not have grants to add a page there and Confluence
> automatically selected this space to store the page.
> > I do not have permission to move it either :-(
> > Can I get grants to create pages under
> https://cwiki.apache.org/confluence/display/CASSANDRA/ ?
> >
> > Thank you,
> > Dmitry
> >
> > On Fri, 3 Jan 2025 at 14:12, Brian Proffitt <b...@apache.org> wrote:
> >>
> >> Dmitry:
> >>
> >> You are using a section of the Confluence wiki that is dedicated to
> Community Over Code, the Apache Conference. Please move that page to a more
> appropriate part of the Apache wiki as soon as you can.
> >>
> >> Thanks!
> >> BKP
> >>
> >> On 2025/01/03 13:55:49 Dmitry Konstantinov wrote:
> >> > I have summarized information from this mail thread to
> >> >
> https://cwiki.apache.org/confluence/display/COC/SSTable%27s+partition+cardinality+implementation
> >> > Probably it can later be transformed into a CEP.
> >> > Regarding the experience of the DataSketches library's authors and
> their publications, there is a good summary in the Background section:
> >> >
> https://cwiki.apache.org/confluence/display/INCUBATOR/DataSketchesProposal
> >> > It looks good.
> >> >
> >> > On Fri, 3 Jan 2025 at 13:06, Štefan Miklošovič <
> smikloso...@apache.org>
> >> > wrote:
> >> >
> >> > > Right ... that sounds reasonable. Let's "sleep on it" for a while.
> >> > > It is not something urgent to deal with right now, but I quite often
> >> > > find myself identifying functionality where we go to the disk more
> >> > > often than necessary, and while reading CASSANDRA-13338 this was next
> >> > > on the list to take a look at. So I took a look ... and here we are.
> >> > >
> >> > > If you guys are going to bump the SSTable version in 5.1 / 6.0, this
> >> > > change might just be shipped with that too.
> >> > >
> >> > > On Fri, Jan 3, 2025 at 1:47 PM Benedict <bened...@apache.org>
> wrote:
> >> > >
> >> > >> I’ve had a quick skim of the data sketches library, and it does
> seem to
> >> > >> have made some more efficient decisions in its design than
> clearspring,
> >> > >> appears to maybe support off-heap representations, and has
> reasonably good
> >> > >> documentation about the theoretical properties of the sketches.
> The chair
> >> > >> of the project is a published author on the topic, and the library
> has
> >> > >> newer algorithms for cardinality estimation than HLL.
> >> > >>
> >> > >> So, honestly, it might not be a bad idea to (carefully) consider a
> >> > >> migration, even if the current library isn’t broken for our needs.
> >> > >>
> >> > >> It would not be high up my priority list for the project, but I
> would
> >> > >> support it if it scratches someone’s itch.
> >> > >>
> >> > >> On 3 Jan 2025, at 12:16, Štefan Miklošovič <smikloso...@apache.org
> >
> >> > >> wrote:
> >> > >>
> >> > >>
> >> > >> Okay ... first problems.
> >> > >>
> >> > >> The ~2000 bytes I mentioned in my response to Chris were indeed
> >> > >> correct, but that was with Datasketches and the main parameter for
> >> > >> HllSketch (DEFAULT_LG_K) set to 12. When I changed that to 13, to
> >> > >> match what we currently have in Cassandra with Clearspring, the size
> >> > >> doubled to ~4000 bytes.
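> >> > >>
> >> > >> (For reference, a minimal way to reproduce those numbers yourself,
> >> > >> assuming the org.apache.datasketches:datasketches-java artifact and
> >> > >> the 4-bit HLL target type; treat the exact byte counts as
> >> > >> illustrative rather than authoritative:)
> >> > >>
> >> > >> import org.apache.datasketches.hll.HllSketch;
> >> > >> import org.apache.datasketches.hll.TgtHllType;
> >> > >>
> >> > >> public class HllSizeCheck
> >> > >> {
> >> > >>     public static void main(String[] args)
> >> > >>     {
> >> > >>         for (int lgK : new int[] { 12, 13 })
> >> > >>         {
> >> > >>             HllSketch sketch = new HllSketch(lgK, TgtHllType.HLL_4);
> >> > >>             // push enough distinct values to force dense HLL mode
> >> > >>             for (long i = 0; i < 1_000_000; i++)
> >> > >>                 sketch.update(i);
> >> > >>             System.out.printf("lgK=%d -> %d serialized bytes, estimate=%.0f%n",
> >> > >>                               lgK,
> >> > >>                               sketch.toCompactByteArray().length,
> >> > >>                               sketch.getEstimate());
> >> > >>         }
> >> > >>     }
> >> > >> }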
> >> > >>
> >> > >> When we do not use Datasketches, what Clearspring generates is about
> >> > >> ~5000 bytes for the array itself, but that array is wrapped into a
> >> > >> Clearspring ICardinality object and we need that object in order to
> >> > >> merge another ICardinality into it. So we would need to cache the
> >> > >> ICardinality object instead of just the array itself. If we don't
> >> > >> cache the whole ICardinality, we would then need to do basically what
> >> > >> CompactionMetadata.CompactionMetadataSerializer.deserialize is doing,
> >> > >> which would allocate a lot / often (ICardinality cardinality =
> >> > >> HyperLogLogPlus.Builder.build(that_cached_array)).
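> >> > >>
> >> > >> To make the trade-off concrete, the two options look roughly like
> >> > >> this (just a sketch against the Clearspring API mentioned above; the
> >> > >> surrounding class and method names are made up for illustration):
> >> > >>
> >> > >> import java.io.IOException;
> >> > >> import java.util.Collection;
> >> > >>
> >> > >> import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
> >> > >> import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;
> >> > >> import com.clearspring.analytics.stream.cardinality.ICardinality;
> >> > >>
> >> > >> final class CardinalityCacheOptions
> >> > >> {
> >> > >>     // Option A: cache only the serialized array (~5000 bytes); every
> >> > >>     // metric read then rebuilds a HyperLogLogPlus per SSTable, which
> >> > >>     // is exactly the allocation described above.
> >> > >>     static ICardinality rebuild(byte[] cachedBytes) throws IOException
> >> > >>     {
> >> > >>         return HyperLogLogPlus.Builder.build(cachedBytes);
> >> > >>     }
> >> > >>
> >> > >>     // Option B: cache the ICardinality objects themselves; no per-read
> >> > >>     // allocation, but each cached object keeps ~11-12 KB on heap.
> >> > >>     static long mergedEstimate(Collection<ICardinality> cached)
> >> > >>             throws CardinalityMergeException
> >> > >>     {
> >> > >>         ICardinality merged = new HyperLogLogPlus(13); // p=13 as above
> >> > >>         merged = merged.merge(cached.toArray(new ICardinality[0]));
> >> > >>         return merged.cardinality();
> >> > >>     }
> >> > >> }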
> >> > >>
> >> > >> To avoid the allocations every time we compute, we would just
> cache that
> >> > >> whole ICardinality of Clearspring, but that whole object measures
> like
> >> > >> 11/12 KB. So even 10k tables would occupy like 100MB. 50k tables
> 500MB.
> >> > >> That is becoming quite a problem.
> >> > >>
> >> > >> On the other hand, HllSketch from Datasketches adds minimal overhead
> >> > >> on top of its array: the array is ~5000 bytes and the whole object is
> >> > >> about 5500. You get the idea ...
> >> > >>
> >> > >> If we are still OK with these sizes, sure ... I am just being
> transparent
> >> > >> about the consequences here.
> >> > >>
> >> > >> A user would just opt in to this (by default it would be turned off).
> >> > >>
> >> > >> On the other hand, if we have 10k SSTables, reading that 10+ KB from
> >> > >> disk takes around 2-3 ms per SSTable, so we would spend 20-30 seconds
> >> > >> reading from disk every time we hit that metric (and we haven't even
> >> > >> started to merge the hyperloglogs).
> >> > >>
> >> > >> If this is still not something which would sell Datasketches as a
> viable
> >> > >> alternative then I guess we need to stick to these numbers and
> cache it all
> >> > >> with Clearspring, occupying way more memory.
> >> > >>
> >> > >> On Thu, Jan 2, 2025 at 10:15 PM Benedict <bened...@apache.org>
> wrote:
> >> > >>
> >> > >>> I would like to see somebody who has some experience writing data
> >> > >>> structures, preferably someone we trust as a community to be
> competent at
> >> > >>> this (ie having some experience within the project contributing
> at this
> >> > >>> level), look at the code like they were at least lightly
> reviewing the
> >> > >>> feature as a contribution to this project.
> >> > >>>
> >> > >>> This should be the bar for any new library really, but triply so
> for
> >> > >>> replacing a library that works fine.
> >> > >>>
> >> > >>> On 2 Jan 2025, at 21:02, Štefan Miklošovič <
> smikloso...@apache.org>
> >> > >>> wrote:
> >> > >>>
> >> > >>>
> >> > >>> Point 2) is pretty hard to fulfil; I cannot imagine what would be
> >> > >>> "enough" for you to be persuaded. What should concretely happen?
> >> > >>> Whoever comes and says "yeah this is a good lib, it works" is
> >> > >>> probably not going to be enough given the vague requirements you put
> >> > >>> under 2). What exactly would you like to see?
> >> > >>>
> >> > >>> The way it looks to me, this is just being shut down because of the
> >> > >>> perceived churn it would cause, and there will always be some
> >> > >>> argument against it.
> >> > >>>
> >> > >>> Based on (1) I don't think what we have is bug free.
> >> > >>>
> >> > >>> Jeff:
> >> > >>>
> >> > >>> Thank you for that answer, I think we are on the same page that
> caching
> >> > >>> it is just fine, that's what I got from your last two paragraphs.
> >> > >>>
> >> > >>> So the path from here is
> >> > >>>
> >> > >>> 1) add datasketches and cache
> >> > >>> 2) don't add datasketches and cache it anyway
> >> > >>>
> >> > >>> The introduction of the datasketches lib is not an absolute must in
> >> > >>> order to achieve that; we can cache / compute it in parallel with
> >> > >>> Clearspring as well. It is just a bittersweet solution which doesn't
> >> > >>> feel right.
> >> > >>>
> >> > >>> (1) https://github.com/addthis/stream-lib/issues
> >> > >>>
> >> > >>> On Thu, Jan 2, 2025 at 9:26 PM Benedict <bened...@apache.org>
> wrote:
> >> > >>>
> >> > >>>> Your message seemed to be all about the caching proposal, which
> I have
> >> > >>>> proposed we separate, hence my confusion.
> >> > >>>>
> >> > >>>> To restate my answer to your question, I think that unless the
> new
> >> > >>>> library actually offers us concrete benefits we can point to
> that we
> >> > >>>> actually care about then yes it’s a bad idea to incur the churn
> of
> >> > >>>> migration.
> >> > >>>>
> >> > >>>> I’m not inherently opposed to a migration but simply “new is
> better” is
> >> > >>>> just plain wrong. Nothing you’ve presented yet convinces me this
> library is
> >> > >>>> worth the effort of vetting given our current solution works
> fine.
> >> > >>>>
> >> > >>>> My position is that for any new library we should:
> >> > >>>>
> >> > >>>> 1) Point to something it solves that we actually want and is
> worth the
> >> > >>>> time investment
> >> > >>>> 2) Solicit folk in the community competent in the relevant data
> >> > >>>> structures to vet the library for the proposed functionality
> >> > >>>>
> >> > >>>> The existing solution never went through (2) because it dates
> from the
> >> > >>>> dark ages where we just threw dependencies in willy-nilly. But it
> has the
> >> > >>>> benefit of having been used for a very long time without
> incident.
> >> > >>>>
> >> > >>>>
> >> > >>>>
> >> > >>>> On 2 Jan 2025, at 20:12, Štefan Miklošovič <
> smikloso...@apache.org>
> >> > >>>> wrote:
> >> > >>>>
> >> > >>>>
> >> > >>>> Hi Benedict,
> >> > >>>>
> >> > >>>> you wrote:
> >> > >>>>
> >> > >>>> I am strongly opposed to updating libraries simply for the sake
> of it.
> >> > >>>> Something like HLL does not need much ongoing maintenance if it
> works.
> >> > >>>> We’re simply asking for extra work and bugs by switching, and
> some risk
> >> > >>>> without understanding the quality control for the new library
> project’s
> >> > >>>> releases.
> >> > >>>>
> >> > >>>> I understand this. But really, do you think it is a bad idea to
> >> > >>>> switch to a well-maintained library which is already used quite
> >> > >>>> widely (the website mentions extensions for sketches in Apache
> >> > >>>> Druid, Hive, Pig, Pinot and PostgreSQL), rather than keep using a
> >> > >>>> library which has been abandoned for 6 years?
> >> > >>>>
> >> > >>>> As I mentioned, there is also an extensive comparison with
> >> > >>>> Clearspring (1) where all the performance benefits / speedups etc.
> >> > >>>> are presented in detail, with charts attached.
> >> > >>>>
> >> > >>>> I think this is a mature project, under Apache, so if we think that
> >> > >>>> a 6-year-old, abandoned library is better than what Apache
> >> > >>>> Datasketches provides, then the question is: what are we doing
> >> > >>>> here? Do we not trust what Apache itself offers, such that we need
> >> > >>>> to rely on a 6-year-old, dead library instead? That lib has 3k
> >> > >>>> commits, releases often, it's a pretty active project ...
> >> > >>>>
> >> > >>>> I'm not saying we should not test more deeply how it behaves; we
> >> > >>>> might even reconsider the hyperloglog parameters as we do that. But
> >> > >>>> I don't think introducing this library would cause some kind of
> >> > >>>> widespread / systemic risk.
> >> > >>>>
> >> > >>>> (1)
> https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
> >> > >>>>
> >> > >>>> On Thu, Jan 2, 2025 at 5:03 PM Benedict <bened...@apache.org>
> wrote:
> >> > >>>>
> >> > >>>>> I am strongly opposed to updating libraries simply for the sake
> of it.
> >> > >>>>> Something like HLL does not need much ongoing maintenance if it
> works.
> >> > >>>>> We’re simply asking for extra work and bugs by switching, and
> some risk
> >> > >>>>> without understanding the quality control for the new library
> project’s
> >> > >>>>> releases.
> >> > >>>>>
> >> > >>>>> That said, I was not very impressed with the Clearspring library
> >> > >>>>> when I looked at it, so I would be open to a stronger argument
> >> > >>>>> about DataSketches being superior otherwise, in a way that matters
> >> > >>>>> to us.
> >> > >>>>>
> >> > >>>>> If we are to replace the library, we should at the very least do
> >> > >>>>> proper due diligence by reviewing the new library’s
> implementation(s)
> >> > >>>>> ourselves. We cannot simply assume the new library behaves well
> for our use
> >> > >>>>> cases, or is well maintained.
> >> > >>>>>
> >> > >>>>> We should also not use the fallback intersection method, as
> this would
> >> > >>>>> represent a regression to compaction on upgrade. We should
> really convert
> >> > >>>>> from one HLL to another.
> >> > >>>>>
> >> > >>>>> The proposal to reduce allocations appears to be orthogonal to
> this
> >> > >>>>> library, so let’s separate out that discussion? If there’s
> evidence this
> >> > >>>>> library alone improves the memory profile let’s discuss that.
> >> > >>>>>
> >> > >>>>>
> >> > >>>>> On 2 Jan 2025, at 15:26, Chris Lohfink <clohfin...@gmail.com>
> wrote:
> >> > >>>>>
> >> > >>>>>
> >> > >>>>> I think switching to datasketches is a good idea, first off simply
> >> > >>>>> because of the lack of maintenance and improvements from
> >> > >>>>> clearspring. I am, however, not sold that it will actually improve
> >> > >>>>> anything significantly. Caches might help in small cases, but
> >> > >>>>> those small cases are probably not actually impacted. In the large
> >> > >>>>> cases the caches cost more in complexity and memory, and
> >> > >>>>> ultimately won't matter when there are 50k sstables and the cache
> >> > >>>>> holds 1k, so everything's hitting disk anyway.
> >> > >>>>>
> >> > >>>>> The 5% is missing some relevant information, like what the
> >> > >>>>> allocation rate was, how many tables there are, etc. On an idle
> >> > >>>>> system that's meaningless; if there were 5 GB/s of allocations
> >> > >>>>> from reads/writes happening at the time, that's huge.
> >> > >>>>>
> >> > >>>>> On Thu, Jan 2, 2025 at 8:42 AM Štefan Miklošovič <
> >> > >>>>> smikloso...@apache.org> wrote:
> >> > >>>>>
> >> > >>>>>> Interesting, thanks for this. Well ... 5% here, 5% there ... it
> >> > >>>>>> compounds. I think it is worth trying to do something with
> this. Would be
> >> > >>>>>> great if you were part of this effort!
> >> > >>>>>>
> >> > >>>>>> On Thu, Jan 2, 2025 at 3:38 PM Dmitry Konstantinov <
> >> > >>>>>> netud...@gmail.com> wrote:
> >> > >>>>>>
> >> > >>>>>>> I have seen this place in an async-profiler memory allocation
> >> > >>>>>>> profile on one of our production environments some time ago; it
> >> > >>>>>>> was visible but not dramatic, about 5% of allocations:
> >> > >>>>>>> <image.png>
> >> > >>>>>>>
> >> > >>>>>>> The amount of overhead also depends on the metric collection
> >> > >>>>>>> frequency (in my case it was once per 60 seconds).
> >> > >>>>>>>
> >> > >>>>>>> Regards,
> >> > >>>>>>> Dmitry
> >> > >>>>>>>
> >> > >>>>>>> On Thu, 2 Jan 2025 at 14:21, Štefan Miklošovič <
> >> > >>>>>>> smikloso...@apache.org> wrote:
> >> > >>>>>>>
> >> > >>>>>>>> Indeed, I plan to measure it and compare, maybe some bench
> test
> >> > >>>>>>>> would be cool to add ..
> >> > >>>>>>>>
> >> > >>>>>>>> I strongly suspect that the primary reason for the slowness
> (if it
> >> > >>>>>>>> is verified to be true) is us going to the disk every time
> and reading
> >> > >>>>>>>> stats for every SSTable all over again.
> >> > >>>>>>>>
> >> > >>>>>>>> While datasketches says that it is way faster to update (1), we
> >> > >>>>>>>> are living in a realm of nanoseconds here and I don't think
> >> > >>>>>>>> that by itself would make any meaningful difference when
> >> > >>>>>>>> merging one hyperloglog with another as part of the partition
> >> > >>>>>>>> count estimation. The only place we are updating is
> >> > >>>>>>>> SortableTableWriter#endPartition, which calls
> >> > >>>>>>>> metadatacollector.addKey(key.getKey()), which eventually
> >> > >>>>>>>> updates the estimator via cardinality#offerHashed.
> >> > >>>>>>>>
> >> > >>>>>>>> In other words, I think that going to the disk and reading it
> >> > >>>>>>>> repeatedly is disproportionately more IO / time intensive than
> >> > >>>>>>>> anything we would gain from switching the hyperloglog
> >> > >>>>>>>> implementation.
> >> > >>>>>>>>
> >> > >>>>>>>> However, I still consider the replacement of the library
> >> > >>>>>>>> important. I feel uneasy about staying with an abandoned
> >> > >>>>>>>> library when there is clearly a well-maintained replacement.
> >> > >>>>>>>>
> >> > >>>>>>>> What we could do is cache all the cardinality estimators and
> >> > >>>>>>>> just merge them when asked, upon metric resolution. That is
> >> > >>>>>>>> different from going to disk to deserialize the StatsComponents
> >> > >>>>>>>> all over again.
> >> > >>>>>>>>
> >> > >>>>>>>> Then on SSTable removal, we would remove it from the cache too.
> >> > >>>>>>>> I think there is some kind of observer notification when an
> >> > >>>>>>>> SSTable is removed ...
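> >> > >>>>>>>>
> >> > >>>>>>>> Roughly what I have in mind, as a sketch only (the cache class,
> >> > >>>>>>>> the String key standing in for an SSTable id, and the listener
> >> > >>>>>>>> hooks are all hypothetical; it assumes the DataSketches Union
> >> > >>>>>>>> API):
> >> > >>>>>>>>
> >> > >>>>>>>> import java.util.concurrent.ConcurrentHashMap;
> >> > >>>>>>>> import java.util.concurrent.ConcurrentMap;
> >> > >>>>>>>>
> >> > >>>>>>>> import org.apache.datasketches.hll.HllSketch;
> >> > >>>>>>>> import org.apache.datasketches.hll.Union;
> >> > >>>>>>>>
> >> > >>>>>>>> final class PartitionCountCache
> >> > >>>>>>>> {
> >> > >>>>>>>>     private static final int LG_K = 13;
> >> > >>>>>>>>
> >> > >>>>>>>>     // one sketch per live SSTable, loaded once from its Stats component
> >> > >>>>>>>>     private final ConcurrentMap<String, HllSketch> perSSTable = new ConcurrentHashMap<>();
> >> > >>>>>>>>
> >> > >>>>>>>>     void onSSTableAdded(String sstableId, HllSketch sketch)
> >> > >>>>>>>>     {
> >> > >>>>>>>>         perSSTable.put(sstableId, sketch);
> >> > >>>>>>>>     }
> >> > >>>>>>>>
> >> > >>>>>>>>     // called from the SSTable removal notification/observer
> >> > >>>>>>>>     void onSSTableRemoved(String sstableId)
> >> > >>>>>>>>     {
> >> > >>>>>>>>         perSSTable.remove(sstableId);
> >> > >>>>>>>>     }
> >> > >>>>>>>>
> >> > >>>>>>>>     // metric resolution: merge cached sketches instead of re-reading disk
> >> > >>>>>>>>     long estimatedPartitionCount()
> >> > >>>>>>>>     {
> >> > >>>>>>>>         Union union = new Union(LG_K);
> >> > >>>>>>>>         for (HllSketch sketch : perSSTable.values())
> >> > >>>>>>>>             union.update(sketch);
> >> > >>>>>>>>         return (long) union.getResult().getEstimate();
> >> > >>>>>>>>     }
> >> > >>>>>>>> }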
> >> > >>>>>>>>
> >> > >>>>>>>> However, I am not sure I can just hold it all in memory; it
> >> > >>>>>>>> works for laptop testing, but if we have thousands of SSTables
> >> > >>>>>>>> with a non-trivial number of rows, things start to get
> >> > >>>>>>>> interesting.
> >> > >>>>>>>>
> >> > >>>>>>>> (1)
> https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
> >> > >>>>>>>> - section HllSketch vs. HyperLogLogPlus Update Speed Behavior
> >> > >>>>>>>>
> >> > >>>>>>>> On Thu, Jan 2, 2025 at 2:46 PM Jon Haddad <
> j...@rustyrazorblade.com>
> >> > >>>>>>>> wrote:
> >> > >>>>>>>>
> >> > >>>>>>>>> Sounds interesting.  I took a look at the issue but I'm not
> seeing
> >> > >>>>>>>>> any data to back up "expensive".  Can this be quantified a
> bit more?
> >> > >>>>>>>>>
> >> > >>>>>>>>> Anytime we have a performance related issue, there should
> be some
> >> > >>>>>>>>> data to back it up, even if it seems obvious.
> >> > >>>>>>>>>
> >> > >>>>>>>>> Jon
> >> > >>>>>>>>>
> >> > >>>>>>>>> On Thu, Jan 2, 2025 at 8:20 AM Štefan Miklošovič <
> >> > >>>>>>>>> smikloso...@apache.org> wrote:
> >> > >>>>>>>>>
> >> > >>>>>>>>>> Hello,
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> I just stumbled upon the library we are using for estimating
> >> > >>>>>>>>>> the number of partitions in an SSTable, which is used e.g. in
> >> > >>>>>>>>>> the EstimatedPartitionCount metric (1).
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> A user reported in (1) that it is an expensive operation.
> >> > >>>>>>>>>> When one looks into what it is doing, it calls
> >> > >>>>>>>>>> SSTableReader.getApproximateKeyCount() (6), which basically
> >> > >>>>>>>>>> goes to disk every single time: it loads all Stats components
> >> > >>>>>>>>>> and looks into the CompactionMetadata where the cardinality
> >> > >>>>>>>>>> estimator is located.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> We serialize the hyperloglog to disk as part of an SSTable,
> >> > >>>>>>>>>> and at runtime we deserialize it back for every SSTable in a
> >> > >>>>>>>>>> CF and merge them all into one cardinality again.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> I do not think there is a way around this because of the
> >> > >>>>>>>>>> nature of how a cardinality estimator (hyperloglog) works. We
> >> > >>>>>>>>>> cannot "cache it"; caching the merged result would only work
> >> > >>>>>>>>>> if we were only ever adding SSTables - we would just merge
> >> > >>>>>>>>>> again - but if we remove an SSTable as part of compaction, we
> >> > >>>>>>>>>> cannot "unmerge" it.
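> >> > >>>>>>>>>>
> >> > >>>>>>>>>> For clarity, the per-read work today is roughly the following
> >> > >>>>>>>>>> (heavily simplified; the StatsReader interface below just
> >> > >>>>>>>>>> stands in for reading the CompactionMetadata bytes of one
> >> > >>>>>>>>>> SSTable from disk, it is not the real Cassandra API):
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> import java.io.IOException;
> >> > >>>>>>>>>> import java.util.Collection;
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
> >> > >>>>>>>>>> import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;
> >> > >>>>>>>>>> import com.clearspring.analytics.stream.cardinality.ICardinality;
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> final class ApproximateKeyCount
> >> > >>>>>>>>>> {
> >> > >>>>>>>>>>     interface StatsReader
> >> > >>>>>>>>>>     {
> >> > >>>>>>>>>>         byte[] readCardinalityBytes(String sstable) throws IOException;
> >> > >>>>>>>>>>     }
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>     static long estimate(Collection<String> sstables, StatsReader reader)
> >> > >>>>>>>>>>             throws IOException, CardinalityMergeException
> >> > >>>>>>>>>>     {
> >> > >>>>>>>>>>         ICardinality merged = new HyperLogLogPlus(13);
> >> > >>>>>>>>>>         for (String sstable : sstables)
> >> > >>>>>>>>>>             // a disk read plus an allocation for every SSTable, on
> >> > >>>>>>>>>>             // every metric read; and since HLL merges are one-way,
> >> > >>>>>>>>>>             // dropping an SSTable later cannot be expressed as an
> >> > >>>>>>>>>>             // "unmerge" of the combined result
> >> > >>>>>>>>>>             merged = merged.merge(
> >> > >>>>>>>>>>                 HyperLogLogPlus.Builder.build(reader.readCardinalityBytes(sstable)));
> >> > >>>>>>>>>>         return merged.cardinality();
> >> > >>>>>>>>>>     }
> >> > >>>>>>>>>> }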
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> That being said, we are currently using this library for
> >> > >>>>>>>>>> hyperloglog (1), which was archived in summer 2020; nothing
> >> > >>>>>>>>>> has been contributed to it for 6 years. That lib is dead.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> There is a very nice replacement for it (2), directly from
> >> > >>>>>>>>>> Apache (!!!), and they even give a detailed and in-depth
> >> > >>>>>>>>>> comparison (3) with the hyperloglog implementation found in
> >> > >>>>>>>>>> stream-lib, which we happen to use (stream-lib = Clearspring),
> >> > >>>>>>>>>> where they show that updating is way faster and it also gives
> >> > >>>>>>>>>> better estimations in general.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> I have implemented the usage of both cardinality estimators
> >> > >>>>>>>>>> (4), (5). The reason we need to keep the old one around is
> >> > >>>>>>>>>> that we may have old SSTables around and we need to work with
> >> > >>>>>>>>>> them too. That translates to a new SSTable version (ob) which
> >> > >>>>>>>>>> uses the new implementation, while versions < ob use the old
> >> > >>>>>>>>>> one. Once SSTables are upgraded from oa to ob, the old
> >> > >>>>>>>>>> estimator will not be used anymore.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> There is also the case of a user not upgrading their oa
> >> > >>>>>>>>>> SSTables, turning a node on and creating new SSTables with
> >> > >>>>>>>>>> the ob version. When this happens and we ask for the
> >> > >>>>>>>>>> cardinality (e.g. via nodetool tablestats), I check whether
> >> > >>>>>>>>>> all SSTables are on the same version or not. If they are,
> >> > >>>>>>>>>> they will use either the old or the new estimator (we cannot
> >> > >>>>>>>>>> merge estimations from two different hyperloglog
> >> > >>>>>>>>>> implementations). If they are not, it will compute the
> >> > >>>>>>>>>> estimate from index summaries. (The index summary computation
> >> > >>>>>>>>>> was already in place (6) as a fail-over in case the estimation
> >> > >>>>>>>>>> computation failed / was not present.)
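> >> > >>>>>>>>>>
> >> > >>>>>>>>>> The shape of that decision is roughly the following (names and
> >> > >>>>>>>>>> the version check are illustrative only, not the actual code
> >> > >>>>>>>>>> in the PR):
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> final class EstimatorSelection
> >> > >>>>>>>>>> {
> >> > >>>>>>>>>>     enum Choice { CLEARSPRING, DATASKETCHES, INDEX_SUMMARY_FALLBACK }
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>     static Choice choose(Iterable<String> sstableVersions)
> >> > >>>>>>>>>>     {
> >> > >>>>>>>>>>         String first = null;
> >> > >>>>>>>>>>         for (String version : sstableVersions)
> >> > >>>>>>>>>>         {
> >> > >>>>>>>>>>             if (first == null)
> >> > >>>>>>>>>>                 first = version;
> >> > >>>>>>>>>>             else if (!first.equals(version))
> >> > >>>>>>>>>>                 // mixed versions: the two HLL implementations cannot
> >> > >>>>>>>>>>                 // be merged together, so fall back to index summaries
> >> > >>>>>>>>>>                 return Choice.INDEX_SUMMARY_FALLBACK;
> >> > >>>>>>>>>>         }
> >> > >>>>>>>>>>         // all SSTables share one version: "ob" and newer carry the
> >> > >>>>>>>>>>         // Datasketches estimator, older versions the Clearspring one
> >> > >>>>>>>>>>         return first != null && first.compareTo("ob") >= 0
> >> > >>>>>>>>>>                ? Choice.DATASKETCHES
> >> > >>>>>>>>>>                : Choice.CLEARSPRING;
> >> > >>>>>>>>>>     }
> >> > >>>>>>>>>> }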
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Does it make sense to drive this further to completion and
> >> > >>>>>>>>>> eventually merge this work into trunk?
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> It is worth adding that Apache Datasketches means just two new
> >> > >>>>>>>>>> dependencies for us, and it has zero external dependencies
> >> > >>>>>>>>>> itself.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> (1) https://github.com/addthis/stream-lib
> >> > >>>>>>>>>> (2) https://datasketches.apache.org/
> >> > >>>>>>>>>> (3)
> https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
> >> > >>>>>>>>>> (4) https://issues.apache.org/jira/browse/CASSANDRA-13338
> >> > >>>>>>>>>> (5) https://github.com/apache/cassandra/pull/3767
> >> > >>>>>>>>>> (6)
> >> > >>>>>>>>>>
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/SSTableReader.java#L284-L338
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Regards
> >> > >>>>>>>>>>
> >> > >>>>>>>>>
> >> > >>>>>>>
> >> > >>>>>>> --
> >> > >>>>>>> Dmitry Konstantinov
> >> > >>>>>>>
> >> > >>>>>>
> >> >
> >> > --
> >> > Dmitry Konstantinov
> >> >
> >
> >
> >
> > --
> > Dmitry Konstantinov
>


-- 
Dmitry Konstantinov
