Here are the meeting notes. https://docs.google.com/document/d/1Pnirz6sSYNrStlN3k90yUo-MIj9pp6lxM_RdiyJbcPA/edit?usp=sharing
We shared context about the ZSTD-with-dictionary prototypes and findings, and discussed the implementations with a focus on both SSTable compression and potential client-side compression benefits.

- Yifan

On Fri, Aug 1, 2025 at 1:33 PM Yifan Cai <yc25c...@gmail.com> wrote:

> I'm excited to hear about the interest in this feature! I'm scheduling a Google Meet for Tuesday at 9 AM PST for one hour to discuss ZSTD with dictionary compression in Cassandra. I will send the meeting details closer to the time of the meeting. Please send me an email if you would like to participate.
>
> - Yifan
>
> On Fri, Aug 1, 2025 at 12:58 PM Štefan Miklošovič <smikloso...@apache.org> wrote:
>
>> Sure! Please share the link to the call if possible. I will be glad to participate in this in whatever way I can.
>>
>> Regards
>>
>> On Fri, Aug 1, 2025 at 6:53 PM Dinesh Joshi <djo...@apache.org> wrote:
>>
>>> We have explored compressing using trained dictionaries at various levels - component, table, and keyspace. Component-level dictionary compression is obviously the most effective, but it results in a _lot_ of dictionaries. Anyway, this really needs a bit of thought. Since there is a lot of interest and prior work that each of us may have done, I would suggest we discuss the various approaches in this thread, or get on a quick call and bring the summary back to this list. Happy to organize a call if y'all are interested.
>>>
>>> On Fri, Aug 1, 2025 at 9:07 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>
>>>> Looking into my prototype (I think it is not doing anything yet, just WIP), I am training it on flush, so that is in line with what Jon is trying to do as well / what he suggests would be optimal.
>>>>
>>>> I do not have a dedicated dictionary component. What I tried to do was to just put the dict directly into COMPRESSION_INFO and then bump the SSTable version with a boolean saying whether it supports dictionaries or not. So there is at least one component fewer.
>>>>
>>>> On Fri, Aug 1, 2025 at 5:59 PM Yifan Cai <yc25c...@gmail.com> wrote:
>>>>
>>>>> Yeah. I have built 2 POCs and have initial benchmark data comparing w/ and w/o dictionary. Unfortunately, the work went to the backlog. I can pick it up again if there is demand for the feature.
>>>>>
>>>>> There are some discussions in the Jira that Stefan linked. (Thanks, Stefan!)
>>>>>
>>>>> - Yifan
>>>>>
>>>>> ------------------------------
>>>>> *From:* Štefan Miklošovič <smikloso...@apache.org>
>>>>> *Sent:* Friday, August 1, 2025 8:54:07 AM
>>>>> *To:* dev@cassandra.apache.org <dev@cassandra.apache.org>
>>>>> *Subject:* Re: zstd dictionaries
>>>>>
>>>>> There is already a ticket for this: https://issues.apache.org/jira/browse/CASSANDRA-17021
>>>>>
>>>>> I would love to see this in action. I was investigating this a few years ago, when ZSTD first landed in 4.0 I think, and I was discussing it with Yifan, if my memory serves me well, but, as with other things, it went nowhere and was probably forgotten. I think there might already be a POC around. I started to work on this a few years ago and abandoned it because ... I still have a branch around and it would be great to compare what you have etc.
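For anyone following along without one of these branches handy, the mechanics behind the prototypes come down to the dictionary API in zstd-jni, the binding Cassandra already bundles for its ZstdCompressor. The sketch below is purely illustrative: the sample data, sizes, and class name are made up, and a real prototype would train on memtable or SSTable chunk bytes and persist the resulting dictionary (in COMPRESSION_INFO, as in Štefan's branch, or as a separate component).

```java
import com.github.luben.zstd.ZstdCompressCtx;
import com.github.luben.zstd.ZstdDecompressCtx;
import com.github.luben.zstd.ZstdDictTrainer;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ZstdDictionaryRoundTrip
{
    public static void main(String[] args) throws Exception
    {
        // Stand-in for sampled, uncompressed chunks. A real prototype would feed
        // memtable partitions or SSTable chunk bytes; training needs a decent
        // number of representative samples or zstd will refuse to build a dictionary.
        List<byte[]> samples = new ArrayList<>();
        for (int i = 0; i < 2000; i++)
            samples.add(String.format("{\"user_id\":%d,\"event\":\"login\",\"region\":\"us-west-2\",\"ts\":%d}",
                                      i, 1_722_500_000L + i)
                              .getBytes(StandardCharsets.UTF_8));

        // Train a small dictionary (buffer and dictionary sizes are illustrative).
        ZstdDictTrainer trainer = new ZstdDictTrainer(1 << 20, 1024);
        for (byte[] sample : samples)
            trainer.addSample(sample);
        byte[] dictionary = trainer.trainSamples();

        byte[] chunk = "{\"user_id\":9999,\"event\":\"logout\",\"region\":\"us-west-2\",\"ts\":1722509999}"
                       .getBytes(StandardCharsets.UTF_8);

        // Compress one chunk with the dictionary, then decompress it again. Anyone
        // who needs to read the data back must also hold the dictionary, which is
        // why losing it is comparable to losing an encryption key.
        ZstdCompressCtx cctx = new ZstdCompressCtx();
        cctx.setLevel(3);
        cctx.loadDict(dictionary);
        byte[] compressed = cctx.compress(chunk);
        cctx.close();

        ZstdDecompressCtx dctx = new ZstdDecompressCtx();
        dctx.loadDict(dictionary);
        byte[] restored = dctx.decompress(compressed, chunk.length);
        dctx.close();

        System.out.printf("dict=%d bytes, chunk=%d -> %d bytes, round-trip ok=%b%n",
                          dictionary.length, chunk.length, compressed.length,
                          Arrays.equals(chunk, restored));
    }
}
```

The round trip also makes the recovery concern raised later in the thread concrete: without the exact dictionary bytes, the compressed chunks cannot be read back.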
>>>>> On Fri, Aug 1, 2025 at 5:12 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>
>>>>> Hi folks,
>>>>>
>>>>> I'm working with a team that's interested in seeing zstd dictionaries for SSTable compression implemented due to the potential space and cost savings. I wanted to share my initial thoughts and get the dev list's thoughts as well.
>>>>>
>>>>> According to the zstd documentation [1], dictionaries can provide approximately a 3x improvement in space savings compared to non-dictionary compression, along with roughly 4x faster compression and decompression performance. The site notes that "training works if there is some correlation in a family of small data samples. The more data-specific a dictionary is, the more efficient it is (there is no universal dictionary). Hence, deploying one dictionary per type of data will provide the greatest benefits."
>>>>>
>>>>> The implementation appears straightforward from a code perspective, but there are some architectural considerations I'd like to discuss:
>>>>>
>>>>> *Dictionary Management*
>>>>>
>>>>> One critical aspect is that the dictionary becomes essential for data recovery - if you lose the dictionary, you lose access to the compressed data, similar to losing an encryption key. (Please correct me if I'm misunderstanding this dependency.)
>>>>>
>>>>> *Storage Approach*
>>>>>
>>>>> I'm considering two options for storing the dictionary:
>>>>>
>>>>> 1. *SSTable Component*: Save the dictionary as a separate SSTable component alongside the existing files. My hesitation here is that we've traditionally maintained that Data.db is the only essential component.
>>>>>
>>>>> 2. *Data.db Header*: Embed the dictionary directly in the Data.db file header.
>>>>>
>>>>> I'm strongly leaning toward the component approach because it avoids modifications to the Data.db file format and can leverage our existing streaming infrastructure. I spoke with Blake about this, and it sounds like some of the newer features are more dependent on components other than Data, so I think this is acceptable.
>>>>>
>>>>> Dictionary Generation
>>>>>
>>>>> We currently default to flushing with LZ4, although I think that's only an optimization to avoid the higher overhead of zstd. Using the memtable data to create a dictionary prior to flush could remove the need for this optimization entirely.
>>>>>
>>>>> During compaction, my plan is to generate dictionaries by either sampling chunks from existing files (similar overhead to reading random rows) or using just the first pages of data from each SSTable. I'd need to do some testing to see what the optimal setup is here.
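As a rough illustration of that compaction-time sampling idea (a sketch only: the class, its chunk source, and the sizes are hypothetical, not existing Cassandra code), a bounded reservoir of uncompressed chunks read from the input SSTables, fed into zstd-jni's trainer before the output is written, would keep the extra read cost fixed regardless of how much data is being compacted:

```java
import com.github.luben.zstd.ZstdDictTrainer;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, not existing Cassandra code: collect a bounded, uniform
// sample of uncompressed chunks read from a compaction's input SSTables, then
// train a dictionary before the output SSTable is written.
public class CompactionDictionarySampler
{
    private final int maxSamples;
    private final List<byte[]> reservoir = new ArrayList<>();
    private long seen = 0;

    public CompactionDictionarySampler(int maxSamples)
    {
        this.maxSamples = maxSamples;
    }

    /** Offer every candidate chunk; only about maxSamples of them are retained. */
    public void offer(byte[] uncompressedChunk)
    {
        seen++;
        if (reservoir.size() < maxSamples)
        {
            reservoir.add(uncompressedChunk);
        }
        else
        {
            // Classic reservoir sampling: replace an existing entry with
            // probability maxSamples / seen, keeping the selection uniform
            // without knowing the total number of chunks up front.
            long slot = ThreadLocalRandom.current().nextLong(seen);
            if (slot < maxSamples)
                reservoir.set((int) slot, uncompressedChunk);
        }
    }

    /** Train a dictionary from the sampled chunks (sizes are illustrative). */
    public byte[] train(int dictSizeBytes)
    {
        ZstdDictTrainer trainer = new ZstdDictTrainer(16 * 1024 * 1024, dictSizeBytes);
        for (byte[] sample : reservoir)
            trainer.addSample(sample);
        return trainer.trainSamples(); // throws if the samples are too few or too uniform
    }
}
```

Reservoir sampling is just one option; using only the first pages of each input SSTable, as suggested above, would be cheaper still at the cost of a potentially less representative dictionary.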
>>>>> Opt-in: I think the initial version of this should be opt-in via a flag on compression, but assuming it delivers on the performance and space gains, I think we'd want to remove the flag and make it the default. Assuming this feature lands in 6.0, I'd be looking to make it on by default in 7.0 when using zstd. The performance table still lists LZ4 as more performant, so I think we'd probably leave it as the default compression strategy, although performance benchmarks should be our guide here.
>>>>>
>>>>> Questions for the Community
>>>>>
>>>>> - Has anyone already explored zstd dictionaries for Cassandra?
>>>>> - If so, are there existing performance tests or benchmarks?
>>>>> - Any thoughts on the storage approach or dictionary generation strategy?
>>>>> - Other considerations I might be missing?
>>>>>
>>>>> It seems like this would be a fairly easy win for improving density in clusters that are limited by disk space per node. It should also improve overall performance by reducing compression and decompression overhead. For the team I'm working with, we'd be reducing node count in AWS by several hundred nodes. We started with about 1K nodes at 4 TB / node, were able to remove roughly 700 of them with the introduction of CASSANDRA-15452 (now at approximately 13 TB / node), and are looking to cut the number at least in half again.
>>>>>
>>>>> Looking forward to hearing your thoughts.
>>>>>
>>>>> Thanks,
>>>>> Jon
>>>>>
>>>>> [1] https://facebook.github.io/zstd/