Re: zstd dictionaries

Yifan Cai Fri, 01 Aug 2025 08:59:47 -0700

Yeah. I have built 2 POCs and have initial benchmark data comparing results 
with and without a dictionary. Unfortunately, the work went to the backlog. I 
can pick it up again if there is demand for the feature.
There are some discussions in the Jira that Stefan linked. (thanks Stefan!)

- Yifan

________________________________
From: Štefan Miklošovič <[email protected]>
Sent: Friday, August 1, 2025 8:54:07 AM
To: [email protected] <[email protected]>
Subject: Re: zstd dictionaries

There is already a ticket for this: 
https://issues.apache.org/jira/browse/CASSANDRA-17021

I would love to see this in action. I investigated this a few years ago, when 
zstd first landed in 4.0, I think, and if my memory serves me well I discussed 
it with Yifan, but, like other things, it went nowhere and was probably 
forgotten. I think there might be a POC around already. I started to work on 
this a few years ago and I abandoned it because ... 
I still have a branch around, and it would be great to compare it with what 
you have.

On Fri, Aug 1, 2025 at 5:12 PM Jon Haddad <[email protected]> wrote:

Hi folks,

I'm working with a team that's interested in seeing zstd dictionaries for 
SSTable compression implemented due to the potential space and cost savings. I 
wanted to share my initial thoughts and get the dev list's thoughts as well.

According to the zstd documentation [1], dictionaries can provide approximately 
3x improvement in space savings compared to non-dictionary compression, along 
with roughly 4x faster compression and decompression performance. The site 
notes that "training works if there is some correlation in a family of small 
data samples. The more data-specific a dictionary is, the more efficient it is 
(there is no universal dictionary). Hence, deploying one dictionary per type of 
data will provide the greatest benefits."
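
To illustrate why a shared dictionary helps with small, similar samples, here is a minimal sketch. It is an assumption-laden stand-in: it uses the preset-dictionary feature of zlib from the Python standard library rather than zstd, and the dictionary is written by hand instead of being trained. The record layout and names are hypothetical.

```python
import zlib

# Hypothetical "dictionary": bytes that are common to many small records.
# A real zstd dictionary would be trained from samples, not written by hand.
shared = b'{"user_id": 0, "event": "page_view", "region": "us-east-1"}'

# One small record that resembles the dictionary content.
record = b'{"user_id": 1842, "event": "page_view", "region": "us-east-1"}'


def compressed_size(data, dictionary=None):
    """Return the compressed size of data, optionally using a preset dictionary."""
    if dictionary is None:
        compressor = zlib.compressobj()
    else:
        compressor = zlib.compressobj(zdict=dictionary)
    return len(compressor.compress(data) + compressor.flush())


size_plain = compressed_size(record)
size_with_dict = compressed_size(record, shared)
print(f"raw={len(record)} plain={size_plain} with_dict={size_with_dict}")
```

A single short record has little internal redundancy, so the compressor gains most of its savings from matches against the dictionary. The actual ratios for zstd on SSTable chunks would have to be measured.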

The implementation appears straightforward from a code perspective, but there 
are some architectural considerations I'd like to discuss:

Dictionary Management

One critical aspect is that the dictionary becomes essential for data 
recovery - if you lose the dictionary, you lose access to the compressed data, 
similar to losing an encryption key. (Please correct me if I'm misunderstanding 
this dependency.)
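
The dependency can be demonstrated with a small sketch. This is an illustration under assumptions, not Cassandra or zstd code: it uses zlib's preset dictionary from the Python standard library as a stand-in, and the sample bytes and function name are hypothetical. The behaviour it shows (decompression fails when the dictionary is missing) is the same kind of dependency described above.

```python
import zlib

# Hypothetical dictionary and record; stand-ins for a trained zstd dictionary
# and a compressed SSTable chunk.
dictionary = b'{"user_id": 0, "event": "page_view", "region": "us-east-1"}'
record = b'{"user_id": 1842, "event": "page_view", "region": "us-east-1"}'

# Compress with the dictionary.
compressor = zlib.compressobj(zdict=dictionary)
blob = compressor.compress(record) + compressor.flush()

# Decompressing with the same dictionary recovers the record.
with_dict = zlib.decompressobj(zdict=dictionary)
restored = with_dict.decompress(blob) + with_dict.flush()


def decompress_without_dictionary(data):
    """Try to decompress without the dictionary; return None on failure."""
    try:
        plain = zlib.decompressobj()
        return plain.decompress(data) + plain.flush()
    except zlib.error:
        return None


lost = decompress_without_dictionary(blob)
print("restored ok:", restored == record, "| without dictionary:", lost)
```

Whatever storage approach is chosen, the sketch suggests the dictionary needs the same durability, backup, and streaming guarantees as the data it was used to compress.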

Storage Approach

I'm considering two options for storing the dictionary:

  1.  SSTable Component: Save the dictionary as a separate SSTable component 
alongside the existing files. My hesitation here is that we've traditionally 
maintained that Data.db is the only essential component.

  2.  Data.db Header: Embed the dictionary directly in the Data.db file header.

I'm strongly leaning toward the component approach because it avoids 
modifications to the Data.db file format and can leverage our existing 
streaming infrastructure.  I spoke with Blake about this and it sounds like 
some of the newer features are more dependent on the components other than 
Data, so I think this is acceptable.

Dictionary Generation

We currently default to flushing using LZ4, although I think that's only an 
optimization to avoid high overhead from zstd.  Using the memtable data to 
create a dictionary prior to flush could remove the need for this optimization 
entirely.

During compaction, my plan is to generate dictionaries by either sampling 
chunks from existing files (similar overhead to reading random rows) or using 
just the first pages of data from each SSTable.  I'd need to do some testing to 
see what the optimal setup is here.
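
The chunk-sampling option could be sketched roughly as follows. This is a hypothetical illustration, not Cassandra code: the function name, the chunk size, and the byte budget are all assumptions, and it reads raw bytes from arbitrary files rather than decompressed SSTable chunks. The collected samples would then be handed to a dictionary trainer such as zstd's `ZDICT_trainFromBuffer` or the `zstd --train` command.

```python
import os
import random


def sample_chunks(paths, chunk_size=4096, budget=1 << 20, seed=0):
    """Collect chunk_size-byte samples at random offsets until budget is spent.

    Sampling is with replacement, so the same region may be read twice.
    """
    rng = random.Random(seed)
    candidates = [p for p in paths if os.path.getsize(p) > 0]
    samples, total = [], 0
    while candidates and total + chunk_size <= budget:
        path = rng.choice(candidates)
        size = os.path.getsize(path)
        # Pick an offset that leaves room for a full chunk when the file
        # is large enough; small files are read from the start.
        offset = rng.randrange(max(1, size - chunk_size + 1))
        with open(path, "rb") as f:
            f.seek(offset)
            chunk = f.read(chunk_size)
        samples.append(chunk)
        total += len(chunk)
    return samples
```

The alternative mentioned above, taking only the first pages of each SSTable, would replace the random offset with zero. It is cheaper to read, but it may bias the dictionary toward whatever partitions sort first.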

Opt-in

I think the initial version of this should be opt-in via a flag on 
compression, but assuming it delivers on the performance and space gains I 
think we'd want to remove the flag and make it the default.  Assuming this 
feature lands in 6.0, I'd be looking to make it on by default in 7.0 when using 
zstd.  The performance table lists LZ4 as still more performant so I think we'd 
probably leave it as the default compression strategy, although performance 
benchmarks should be our guide here.

Questions for the Community

  *   Has anyone already explored zstd dictionaries for Cassandra?
  *   If so, are there existing performance tests or benchmarks?
  *   Any thoughts on the storage approach or dictionary generation strategy?
  *   Other considerations I might be missing?

It seems like this would be a fairly easy win for improving density in clusters 
that are limited by disk space per node.  It should also improve overall 
performance by reducing compression and decompression overhead.  For the team 
I'm working with, we'd be reducing node count in AWS by several hundred nodes.  
We started with about 1K nodes at 4TB / node, and were able to remove roughly 
700 with the introduction of CASSANDRA-15452 (now at approximately 13TB / node), 
and are looking to cut the number at least in half again.
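
As a rough check on these figures, the arithmetic can be worked through as below. The inputs are the approximate numbers quoted above. The assumption that the total data volume stays constant is mine and is only for illustration.

```python
# Approximate figures from the paragraph above.
nodes_before, tb_per_node_before = 1000, 4
nodes_after = nodes_before - 700          # roughly 300 nodes remain
tb_per_node_after = 13

total_before_tb = nodes_before * tb_per_node_before   # 4000 TB
total_after_tb = nodes_after * tb_per_node_after      # 3900 TB, about the same

# "Cut the number at least in half again": at most 150 nodes.
target_nodes = nodes_after // 2

# Assumption: total data stays constant. Logical data per node then doubles,
# so on-disk size must roughly halve to keep 13 TB of physical data per node.
logical_tb_per_node = total_after_tb / target_nodes   # 26.0 TB
required_extra_ratio = logical_tb_per_node / tb_per_node_after  # 2.0

print(total_before_tb, total_after_tb, target_nodes,
      logical_tb_per_node, required_extra_ratio)
```

Under that assumption, halving the node count at the current physical density would need dictionary compression to roughly halve the on-disk size relative to today.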

Looking forward to hearing your thoughts.

Thanks,

Jon

[1] https://facebook.github.io/zstd/
