Yeah. I have built 2 POCs and have initial benchmark data comparing compression with and without a dictionary. Unfortunately, the work went to the backlog. I can pick it up again if there is demand for the feature. There are some discussions in the Jira that Stefan linked. (Thanks, Stefan!)
- Yifan

________________________________
From: Štefan Miklošovič <smikloso...@apache.org>
Sent: Friday, August 1, 2025 8:54:07 AM
To: dev@cassandra.apache.org <dev@cassandra.apache.org>
Subject: Re: zstd dictionaries

There is already a ticket for this: https://issues.apache.org/jira/browse/CASSANDRA-17021

I would love to see this in action. I was investigating this a few years ago, when ZSTD first landed in 4.0, I think. If my memory serves me well, I was discussing it with Yifan, but, like other things, it went nowhere and was probably forgotten. I think there might be some POC around already. I started to work on this a few years ago and abandoned it because ... I still have a branch around, and it would be great to compare it with what you have.

On Fri, Aug 1, 2025 at 5:12 PM Jon Haddad <j...@rustyrazorblade.com> wrote:

Hi folks,

I'm working with a team that's interested in seeing zstd dictionaries for SSTable compression implemented due to the potential space and cost savings. I wanted to share my initial thoughts and get the dev list's thoughts as well.

According to the zstd documentation [1], dictionaries can provide approximately 3x improvement in space savings compared to non-dictionary compression, along with roughly 4x faster compression and decompression performance. The site notes that "training works if there is some correlation in a family of small data samples. The more data-specific a dictionary is, the more efficient it is (there is no universal dictionary). Hence, deploying one dictionary per type of data will provide the greatest benefits."

The implementation appears straightforward from a code perspective, but there are some architectural considerations I'd like to discuss:

Dictionary Management

One critical aspect is that the dictionary becomes essential for data recovery - if you lose the dictionary, you lose access to the compressed data, similar to losing an encryption key.
(Please correct me if I'm misunderstanding this dependency.)

Storage Approach

I'm considering two options for storing the dictionary:

1. SSTable Component: Save the dictionary as a separate SSTable component alongside the existing files. My hesitation here is that we've traditionally maintained that Data.db is the only essential component.
2. Data.db Header: Embed the dictionary directly in the Data.db file header.

I'm strongly leaning toward the component approach because it avoids modifications to the Data.db file format and can leverage our existing streaming infrastructure. I spoke with Blake about this, and it sounds like some of the newer features depend more on components other than Data.db, so I think this is acceptable.

Dictionary Generation

We currently default to flushing using LZ4, although I think that's only an optimization to avoid the high overhead of zstd. Using the memtable data to create a dictionary prior to flush could remove the need for this optimization entirely.

During compaction, my plan is to generate dictionaries either by sampling chunks from existing files (similar overhead to reading random rows) or by using just the first pages of data from each SSTable. I'd need to do some testing to see what the optimal setup is here.

Opt-in: I think the initial version of this should be opt-in via a flag on compression, but assuming it delivers on the performance and space gains, I think we'd want to remove the flag and make it the default. Assuming this feature lands in 6.0, I'd be looking to make it on by default in 7.0 when using zstd. The performance table lists LZ4 as still more performant, so I think we'd probably leave it as the default compression strategy, although performance benchmarks should be our guide here.

Questions for the Community

* Has anyone already explored zstd dictionaries for Cassandra?
* If so, are there existing performance tests or benchmarks?
* Any thoughts on the storage approach or dictionary generation strategy?
* Other considerations I might be missing?

It seems like this would be a fairly easy win for improving density in clusters that are limited by disk space per node. It should also improve overall performance by reducing compression and decompression overhead.

For the team I'm working with, we'd be reducing node count in AWS by several hundred nodes. We started with about 1K nodes at 4TB/node and were able to remove roughly 700 with the introduction of CASSANDRA-15452 (now at approximately 13TB/node), and we are looking to cut the number at least in half again.

Looking forward to hearing your thoughts.

Thanks,
Jon

[1] https://facebook.github.io/zstd/
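
For anyone who wants to poke at the dictionary dependency raised under "Dictionary Management", here is a rough sketch. To be clear about the assumptions: it uses the preset-dictionary support in Python's standard zlib module as a stand-in, not zstd and not anything from Cassandra. The row format, the build_dictionary helper, and the "training" step (plain concatenation of sample rows) are all made up for illustration. zstd's trained dictionaries are more sophisticated, but the dependency has the same shape: a small row compressed against a dictionary comes out smaller, and the result cannot be decoded without that same dictionary.

```python
import zlib


def build_dictionary(samples):
    """Hypothetical stand-in for dictionary training: concatenate sample rows.

    zlib has no training step; it simply uses the given bytes as a preset
    dictionary. zstd's trainer does real work that is not modeled here.
    """
    return b"".join(samples)


def compress(data, dictionary=None):
    """Compress data, optionally against a preset dictionary."""
    if dictionary is None:
        compressor = zlib.compressobj()
    else:
        compressor = zlib.compressobj(zdict=dictionary)
    return compressor.compress(data) + compressor.flush()


def decompress(blob, dictionary=None):
    """Decompress blob, optionally supplying the preset dictionary."""
    if dictionary is None:
        decompressor = zlib.decompressobj()
    else:
        decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(blob) + decompressor.flush()


# Made-up rows that share most of their structure, as rows of one table might.
samples = [
    b'{"user_id": %d, "status": "active", "region": "us-east-1"}' % i
    for i in range(100, 120)
]
dictionary = build_dictionary(samples)
row = b'{"user_id": 4242, "status": "active", "region": "us-east-1"}'

plain = compress(row)
with_dict = compress(row, dictionary)

# The round trip works only when the same dictionary is supplied.
assert decompress(with_dict, dictionary) == row

# Without the dictionary, the compressed row cannot be decoded.
try:
    decompress(with_dict)
    recovered_without_dictionary = True
except zlib.error:
    recovered_without_dictionary = False

print("raw bytes:", len(row))
print("compressed, no dictionary:", len(plain))
print("compressed, with dictionary:", len(with_dict))
print("recovered without dictionary:", recovered_without_dictionary)
```

Whichever storage option is chosen, the dictionary has to travel with the data through streaming, backup, and restore, which is the point the sketch is meant to make concrete.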