I'd love to chat about this on a call. I think it's valuable and could
unlock 4KiB block sizes at ~all times without sacrificing ratio.

-Joey

On Fri, Aug 1, 2025 at 9:53 AM Dinesh Joshi <djo...@apache.org> wrote:

> We have explored compressing using trained dictionaries at various levels
> - component, table, and keyspace. Component-level dictionary compression
> is obviously best but results in a _lot_ of dictionaries. Anyway, this
> really needs a bit of thought. Since there is a lot of interest and prior
> work that each of us may have done, I would suggest we discuss the various
> approaches in this thread or get on a quick call and bring the summary
> back to this list. Happy to organize a call if y'all are interested.
>
>
> On Fri, Aug 1, 2025 at 9:07 AM Štefan Miklošovič <smikloso...@apache.org>
> wrote:
>
>> Looking at my prototype (I don't think it does anything useful yet, just
>> WIP), I am training on flush, so that is in line with what Jon is trying
>> to do as well / what he suggests would be optimal.
>>
>> I do not have a dedicated dictionary component; what I tried was to put
>> the dict directly into COMPRESSION_INFO and then bump the SSTable version
>> with a boolean saying whether it supports a dictionary. So that is at
>> least one component fewer.
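To make the "flag plus inline dictionary" idea concrete: this is not the actual CompressionInfo.db layout, just a hypothetical sketch of serializing a dictionary-presence boolean, an optional length-prefixed dictionary, and the usual chunk offset list into one blob.

```python
import struct

def serialize_compression_info(has_dict: bool, dictionary: bytes,
                               chunk_offsets: list) -> bytes:
    """Hypothetical layout: one-byte dictionary flag, a length-prefixed
    dictionary when present, then the chunk offset list."""
    out = bytearray()
    out += struct.pack(">?", has_dict)
    if has_dict:
        out += struct.pack(">I", len(dictionary))
        out += dictionary
    out += struct.pack(">I", len(chunk_offsets))
    for off in chunk_offsets:
        out += struct.pack(">Q", off)
    return bytes(out)

def deserialize_compression_info(blob: bytes):
    (has_dict,) = struct.unpack_from(">?", blob, 0)
    pos = 1
    dictionary = b""
    if has_dict:
        (dlen,) = struct.unpack_from(">I", blob, pos)
        pos += 4
        dictionary = blob[pos:pos + dlen]
        pos += dlen
    (n,) = struct.unpack_from(">I", blob, pos)
    pos += 4
    offsets = [struct.unpack_from(">Q", blob, pos + 8 * i)[0] for i in range(n)]
    return has_dict, dictionary, offsets
```

The version-bump boolean means old readers never see the extra bytes, and new readers fall back to the plain layout when the flag is false.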
>>
>> On Fri, Aug 1, 2025 at 5:59 PM Yifan Cai <yc25c...@gmail.com> wrote:
>>
>>> Yeah. I have built two POCs and have initial benchmark data comparing
>>> w/ and w/o dictionary. Unfortunately, the work went to the backlog. I can
>>> pick it up again if there is demand for the feature.
>>> There are some discussions in the Jira that Stefan linked. (thanks
>>> Stefan!)
>>>
>>> - Yifan
>>>
>>> ------------------------------
>>> *From:* Štefan Miklošovič <smikloso...@apache.org>
>>> *Sent:* Friday, August 1, 2025 8:54:07 AM
>>> *To:* dev@cassandra.apache.org <dev@cassandra.apache.org>
>>> *Subject:* Re: zstd dictionaries
>>>
>>> There is already a ticket for this
>>> https://issues.apache.org/jira/browse/CASSANDRA-17021
>>>
>>> I would love to see this in action. I was investigating this a few years
>>> ago when zstd first landed, in 4.0 I think, and I was discussing it with
>>> Yifan, if my memory serves me well, but, as with other things, it just
>>> went nowhere and was probably forgotten. I think there might already be
>>> some POC around. I started working on this a few years ago and abandoned
>>> it because ... I still have a branch around, and it would be great to
>>> compare with what you have etc.
>>>
>>> On Fri, Aug 1, 2025 at 5:12 PM Jon Haddad <j...@rustyrazorblade.com>
>>> wrote:
>>>
>>> Hi folks,
>>>
>>> I'm working with a team that's interested in seeing zstd dictionaries
>>> for SSTable compression implemented due to the potential space and cost
>>> savings. I wanted to share my initial thoughts and get the dev list's
>>> thoughts as well.
>>>
>>> According to the zstd documentation [1], dictionaries can provide
>>> approximately 3x improvement in space savings compared to non-dictionary
>>> compression, along with roughly 4x faster compression and decompression
>>> performance. The site notes that "training works if there is some
>>> correlation in a family of small data samples. The more data-specific a
>>> dictionary is, the more efficient it is (there is no universal dictionary).
>>> Hence, deploying one dictionary per type of data will provide the greatest
>>> benefits."
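A toy illustration of that "small, similar samples" point. zstd trains its dictionaries automatically; here Python's zlib preset-dictionary support stands in, with a hand-built byte string of shared patterns, which is enough to show the effect on many small records.

```python
import zlib

# Toy corpus: many small, similar records -- the case where dictionaries help most.
samples = [f'{{"user_id": {i}, "event": "login", "region": "us-east-1"}}'.encode()
           for i in range(100)]

# Hand-built stand-in for a trained dictionary: byte patterns shared by the records.
dictionary = b'"user_id": , "event": "login", "region": "us-east-1"}'

def compressed_size(data: bytes, zdict: bytes = b"") -> int:
    comp = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return len(comp.compress(data) + comp.flush())

plain = sum(compressed_size(s) for s in samples)
with_dict = sum(compressed_size(s, dictionary) for s in samples)
print(f"no dict: {plain} bytes, with dict: {with_dict} bytes")
```

Each record on its own is too small for the compressor to find repetition, but with the shared dictionary most of every record becomes a back-reference.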
>>>
>>> The implementation appears straightforward from a code perspective, but
>>> there are some architectural considerations I'd like to discuss:
>>>
>>> *Dictionary Management* One critical aspect is that the dictionary
>>> becomes essential for data recovery - if you lose the dictionary, you lose
>>> access to the compressed data, similar to losing an encryption key. (Please
>>> correct me if I'm misunderstanding this dependency.)
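That dependency is easy to demonstrate with the same zlib stand-in: a stream compressed against a dictionary round-trips with it and is unreadable without it.

```python
import zlib

dictionary = b"shared byte patterns trained from the table's data"
record = b"shared byte patterns show up in this record as well"

comp = zlib.compressobj(zdict=dictionary)
compressed = comp.compress(record) + comp.flush()

# With the dictionary in hand, the data round-trips.
assert zlib.decompressobj(zdict=dictionary).decompress(compressed) == record

# Without it, the stream is effectively lost -- much like a missing encryption key.
recovered = None
try:
    recovered = zlib.decompressobj().decompress(compressed)
except zlib.error:
    pass
print("recoverable without dictionary:", recovered is not None)
```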
>>>
>>> *Storage Approach* I'm considering two options for storing the
>>> dictionary:
>>>
>>>    1.
>>>
>>>    *SSTable Component*: Save the dictionary as a separate SSTable
>>>    component alongside the existing files. My hesitation here is that we've
>>>    traditionally maintained that Data.db is the only essential component.
>>>    2.
>>>
>>>    *Data.db Header*: Embed the dictionary directly in the Data.db file
>>>    header.
>>>
>>> I'm strongly leaning toward the component approach because it avoids
>>> modifications to the Data.db file format and can leverage our existing
>>> streaming infrastructure.  I spoke with Blake about this, and it sounds
>>> like some of the newer features already depend on components other than
>>> Data.db, so I think this is acceptable.
>>>
>>> Dictionary Generation
>>>
>>> We currently default to flushing using LZ4, although I think that's only
>>> an optimization to avoid the higher overhead of zstd.  Using the memtable
>>> data to create a dictionary prior to flush could remove the need for this
>>> optimization entirely.
>>>
>>> During compaction, my plan is to generate dictionaries by either
>>> sampling chunks from existing files (similar overhead to reading random
>>> rows) or using just the first pages of data from each SSTable.  I'd need to
>>> do some testing to see what the optimal setup is here.
>>>
>>> Opt-in: I think the initial version of this should be opt-in via a flag
>>> on compression, but assuming it delivers on the performance and space
>>> gains, I think we'd want to remove the flag and make it the default.
>>> Assuming this feature lands in 6.0, I'd be looking to make it on by
>>> default in 7.0 when using zstd.  The zstd performance table still lists
>>> LZ4 as faster, so I think we'd probably leave LZ4 as the default
>>> compression strategy, although performance benchmarks should be our
>>> guide here.
>>>
>>> Questions for the Community
>>>
>>>    - Has anyone already explored zstd dictionaries for Cassandra?
>>>    - If so, are there existing performance tests or benchmarks?
>>>    - Any thoughts on the storage approach or dictionary generation
>>>    strategy?
>>>    - Other considerations I might be missing?
>>>
>>> It seems like this would be a fairly easy win for improving density in
>>> clusters that are limited by disk space per node.  It should also improve
>>> overall performance by reducing compression and decompression overhead.
>>> For the team I'm working with, we'd be reducing node count in AWS by
>>> several hundred nodes.  We started with about 1K nodes at 4TB / node, and
>>> were able to remove roughly 700 with the introduction of CASSANDRA-15452
>>> (now at approximately 13TB / node), and are looking to cut the number at
>>> least in half again.
>>>
>>> Looking forward to hearing your thoughts.
>>>
>>> Thanks,
>>>
>>> Jon
>>> [1] https://facebook.github.io/zstd/
>>>
>>>
