We have explored compressing using trained dictionaries at various levels -
component, table, and keyspace level. Component-level dictionary compression
gives the best ratios but results in a _lot_ of dictionaries. Anyway, this
really needs a bit of thought. Since there is a lot of interest and prior
work that each of us may have done, I would suggest we discuss the various
approaches in this thread, or get on a quick call and bring a summary back
to this list. Happy to organize a call if y'all are interested.


On Fri, Aug 1, 2025 at 9:07 AM Štefan Miklošovič <smikloso...@apache.org>
wrote:

> Looking into my prototype (I think it is not doing anything yet, just
> WIP), I am training it on flush, so that is in line with what Jon is
> trying to do as well / what he suggests would be optimal.
>
> I do not have a dedicated dictionary component; what I tried to do was to
> just put the dict directly into COMPRESSION_INFO and then bump the
> SSTable version with a boolean saying whether it supports dictionaries.
> So there is at least one component fewer.
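
A toy sketch of that layout, with hypothetical framing - none of these field
names or the flag encoding come from Cassandra's actual CompressionInfo
serializer; they just illustrate "optional dictionary blob, gated by a
version flag":

```python
# Hypothetical sketch: an optional trained dictionary embedded in the
# CompressionInfo component, with a flag standing in for the bumped
# SSTable version. Framing and field names are illustrative only.
import struct
from typing import Optional, Tuple

_HEADER = ">?I"  # has_dict flag + length of the compression params blob

def serialize_compression_info(params: bytes,
                               dictionary: Optional[bytes]) -> bytes:
    out = struct.pack(_HEADER, dictionary is not None, len(params)) + params
    if dictionary is not None:
        # Length-prefixed dictionary bytes follow the params.
        out += struct.pack(">I", len(dictionary)) + dictionary
    return out

def deserialize_compression_info(buf: bytes) -> Tuple[bytes, Optional[bytes]]:
    has_dict, params_len = struct.unpack_from(_HEADER, buf, 0)
    offset = struct.calcsize(_HEADER)
    params = buf[offset:offset + params_len]
    offset += params_len
    if not has_dict:
        # Older "version": no dictionary present.
        return params, None
    (dict_len,) = struct.unpack_from(">I", buf, offset)
    offset += struct.calcsize(">I")
    return params, buf[offset:offset + dict_len]
```

The upside of this shape is exactly what Stefan notes: no extra component,
at the cost of changing an existing component's format.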
>
> On Fri, Aug 1, 2025 at 5:59 PM Yifan Cai <yc25c...@gmail.com> wrote:
>
>> Yeah. I have built 2 POCs and have initial benchmark data comparing w/
>> and w/o a dictionary. Unfortunately, the work went to the backlog. I can
>> pick it up again if there is demand for the feature.
>> There are some discussions in the Jira that Stefan linked. (thanks
>> Stefan!)
>>
>> - Yifan
>>
>> ------------------------------
>> *From:* Štefan Miklošovič <smikloso...@apache.org>
>> *Sent:* Friday, August 1, 2025 8:54:07 AM
>> *To:* dev@cassandra.apache.org <dev@cassandra.apache.org>
>> *Subject:* Re: zstd dictionaries
>>
>> There is already a ticket for this
>> https://issues.apache.org/jira/browse/CASSANDRA-17021
>>
>> I would love to see this in action. I was investigating this a few years
>> ago, when zstd landed for the first time in 4.0 I think, and I was
>> discussing it with Yifan, if my memory serves me well, but, as with other
>> things, it just went nowhere and was probably forgotten. I think that
>> there might be some POC around already. I started to work on this a few
>> years ago and abandoned it because ... I still have a branch around and
>> it would be great to compare what you have etc.
>>
>> On Fri, Aug 1, 2025 at 5:12 PM Jon Haddad <j...@rustyrazorblade.com>
>> wrote:
>>
>> Hi folks,
>>
>> I'm working with a team that's interested in seeing zstd dictionaries for
>> SSTable compression implemented due to the potential space and cost
>> savings. I wanted to share my initial thoughts and get the dev list's
>> thoughts as well.
>>
>> According to the zstd documentation [1], dictionaries can provide
>> approximately 3x improvement in space savings compared to non-dictionary
>> compression, along with roughly 4x faster compression and decompression
>> performance. The site notes that "training works if there is some
>> correlation in a family of small data samples. The more data-specific a
>> dictionary is, the more efficient it is (there is no universal dictionary).
>> Hence, deploying one dictionary per type of data will provide the greatest
>> benefits."
>>
>> The implementation appears straightforward from a code perspective, but
>> there are some architectural considerations I'd like to discuss:
>>
>> *Dictionary Management* One critical aspect is that the dictionary
>> becomes essential for data recovery - if you lose the dictionary, you lose
>> access to the compressed data, similar to losing an encryption key. (Please
>> correct me if I'm misunderstanding this dependency.)
>>
>> *Storage Approach* I'm considering two options for storing the
>> dictionary:
>>
>>    1. *SSTable Component*: Save the dictionary as a separate SSTable
>>       component alongside the existing files. My hesitation here is that
>>       we've traditionally maintained that Data.db is the only essential
>>       component.
>>
>>    2. *Data.db Header*: Embed the dictionary directly in the Data.db file
>>       header.
>>
>> I'm strongly leaning toward the component approach because it avoids
>> modifications to the Data.db file format and can leverage our existing
>> streaming infrastructure.  I spoke with Blake about this and it sounds like
>> some of the newer features are more dependent on the components other than
>> Data, so I think this is acceptable.
>>
>> Dictionary Generation
>>
>> We currently default to flushing using LZ4, although I think that's only
>> an optimization to avoid the higher CPU overhead of zstd.  Using the
>> memtable data to create a dictionary prior to flush could remove the need
>> for this optimization entirely.
>>
>> During compaction, my plan is to generate dictionaries by either sampling
>> chunks from existing files (similar overhead to reading random rows) or
>> using just the first pages of data from each SSTable.  I'd need to do some
>> testing to see what the optimal setup is here.
>>
>> Opt-in: I think the initial version of this should be opt-in via a flag
>> on compression, but assuming it delivers on the performance and space
>> gains, I think we'd want to remove the flag and make it the default.
>> Assuming this feature lands in 6.0, I'd be looking to make it on by
>> default in 7.0 when using zstd.  The performance table still lists lz4
>> as faster, so I think we'd probably leave lz4 as the default compression
>> strategy overall, although performance benchmarks should be our guide
>> here.
>>
>> Questions for the Community
>>
>>    - Has anyone already explored zstd dictionaries for Cassandra?
>>    - If so, are there existing performance tests or benchmarks?
>>    - Any thoughts on the storage approach or dictionary generation
>>    strategy?
>>    - Other considerations I might be missing?
>>
>> It seems like this would be a fairly easy win for improving density in
>> clusters that are limited by disk space per node.  It should also improve
>> overall performance by reducing compression and decompression overhead.
>> For the team I'm working with, we'd be reducing node count in AWS by
>> several hundred nodes.  We started with about 1K nodes at 4TB/node, were
>> able to remove roughly 700 with the introduction of CASSANDRA-15452 (now
>> at approximately 13TB/node), and are looking to cut the number at least
>> in half again.
>>
>> Looking forward to hearing your thoughts.
>>
>> Thanks,
>>
>> Jon
>> [1] https://facebook.github.io/zstd/
>>
>>
