We have explored compression using trained dictionaries at various levels - component, table, and keyspace. Component-level dictionaries compress best, but they result in a _lot_ of dictionaries. Either way, this really needs a bit of thought. Since there is a lot of interest, and each of us may have prior work, I would suggest we discuss the various approaches in this thread, or get on a quick call and bring the summary back to this list. Happy to organize a call if y'all are interested.
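If anyone wants to poke at the mechanics before we talk, below is a minimal, self-contained round trip using the zstd-jni bindings we already bundle. The samples are synthetic and every size is a placeholder, not a recommendation:

    import com.github.luben.zstd.ZstdCompressCtx;
    import com.github.luben.zstd.ZstdDecompressCtx;
    import com.github.luben.zstd.ZstdDictTrainer;

    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class DictRoundTrip
    {
        public static void main(String[] args)
        {
            // Synthetic stand-ins for the small, similar chunks we'd sample from real data.
            List<byte[]> samples = new ArrayList<>();
            for (int i = 0; i < 10_000; i++)
                samples.add(("{\"id\":" + i + ",\"status\":\"active\",\"region\":\"us-west-2\"}")
                            .getBytes(StandardCharsets.UTF_8));

            // Train: budget 4 MiB of sample data, target a 4 KiB dictionary.
            ZstdDictTrainer trainer = new ZstdDictTrainer(4 * 1024 * 1024, 4 * 1024);
            for (byte[] s : samples)
                trainer.addSample(s);
            byte[] dict = trainer.trainSamples(); // throws ZstdException if the samples don't train well

            // Compress and decompress one chunk with the trained dictionary. Losing
            // `dict` makes the compressed bytes unreadable - hence the recovery concern below.
            try (ZstdCompressCtx cctx = new ZstdCompressCtx().setLevel(3).loadDict(dict);
                 ZstdDecompressCtx dctx = new ZstdDecompressCtx().loadDict(dict))
            {
                byte[] chunk = samples.get(0);
                byte[] compressed = cctx.compress(chunk);
                byte[] restored = dctx.decompress(compressed, chunk.length);
                assert Arrays.equals(chunk, restored);
            }
        }
    }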
On Fri, Aug 1, 2025 at 9:07 AM Štefan Miklošovič <smikloso...@apache.org> wrote:

> Looking into my prototype (I think it is not doing anything yet, just a WIP), I am training it at flush time, so that is in line with what Jon is trying to do as well / what he suggests would be optimal.
>
> I do not have a dedicated dictionary component; what I tried was to just put the dict directly into COMPRESSION_INFO and then bump the SSTable version with a boolean saying whether it supports a dictionary or not. So there is at least one component fewer.
>
> On Fri, Aug 1, 2025 at 5:59 PM Yifan Cai <yc25c...@gmail.com> wrote:
>
>> Yeah. I have built 2 POCs and have initial benchmark data comparing w/ and w/o a dictionary. Unfortunately, the work went to the backlog. I can pick it up again if there is demand for the feature.
>>
>> There are some discussions in the Jira that Stefan linked. (thanks, Stefan!)
>>
>> - Yifan
>>
>> ------------------------------
>> *From:* Štefan Miklošovič <smikloso...@apache.org>
>> *Sent:* Friday, August 1, 2025 8:54:07 AM
>> *To:* dev@cassandra.apache.org <dev@cassandra.apache.org>
>> *Subject:* Re: zstd dictionaries
>>
>> There is already a ticket for this:
>> https://issues.apache.org/jira/browse/CASSANDRA-17021
>>
>> I would love to see this in action. I was investigating this a few years ago, when zstd first landed in 4.0 I think, and I was discussing it with Yifan, if my memory serves me well, but, like other things, it just went nowhere and was probably forgotten. There might be some POC around already. I started to work on this a few years ago and abandoned it because ... I still have a branch around, and it would be great to compare it with what you have.
>>
>> On Fri, Aug 1, 2025 at 5:12 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>
>> Hi folks,
>>
>> I'm working with a team that's interested in seeing zstd dictionaries implemented for SSTable compression, due to the potential space and cost savings. I wanted to share my initial thoughts and get the dev list's thoughts as well.
>>
>> According to the zstd documentation [1], dictionaries can provide approximately a 3x improvement in compression ratio on small data compared to non-dictionary compression, along with roughly 4x faster compression and decompression. The site notes that "training works if there is some correlation in a family of small data samples. The more data-specific a dictionary is, the more efficient it is (there is no universal dictionary). Hence, deploying one dictionary per type of data will provide the greatest benefits."
>>
>> The implementation appears straightforward from a code perspective, but there are some architectural considerations I'd like to discuss:
>>
>> *Dictionary Management*
>>
>> One critical aspect is that the dictionary becomes essential for data recovery - if you lose the dictionary, you lose access to the compressed data, much like losing an encryption key. (Please correct me if I'm misunderstanding this dependency.)
>>
>> *Storage Approach*
>>
>> I'm considering two options for storing the dictionary (a rough illustration of option 1 follows the list):
>>
>> 1. *SSTable Component*: Save the dictionary as a separate SSTable component alongside the existing files. My hesitation here is that we've traditionally maintained that Data.db is the only essential component.
>>
>> 2. *Data.db Header*: Embed the dictionary directly in the Data.db file header.
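>> To make option 1 concrete, the on-disk layout could look something like this - the "Dictionary" component name is made up for illustration, not a naming proposal:
>>
>>     nb-1-big-Data.db
>>     nb-1-big-CompressionInfo.db
>>     nb-1-big-Dictionary.db       <- hypothetical new component holding the trained dictionary
>>     nb-1-big-Filter.db
>>     nb-1-big-Statistics.db
>>     nb-1-big-TOC.txt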
>>
>> I'm strongly leaning toward the component approach because it avoids modifications to the Data.db file format and can leverage our existing streaming infrastructure. I spoke with Blake about this, and it sounds like some of the newer features already depend on components other than Data, so I think this is acceptable.
>>
>> *Dictionary Generation*
>>
>> We currently default to flushing with LZ4, although I think that's only an optimization to avoid the higher overhead of zstd. Using the memtable data to train a dictionary prior to flush could remove the need for this optimization entirely.
>>
>> During compaction, my plan is to generate dictionaries by either sampling chunks from the existing files (similar overhead to reading random rows) or using just the first pages of data from each input SSTable. I'd need to do some testing to find the optimal setup here. (A rough sketch of the sampling idea is in the P.S. at the bottom of this mail.)
>>
>> *Opt-in*: I think the initial version of this should be opt-in via a flag on compression, but assuming it delivers the performance and space gains, I think we'd want to remove the flag and make it the default. Assuming this feature lands in 6.0, I'd be looking to make it on by default in 7.0 when using zstd. The performance table still lists LZ4 as more performant, so I think we'd probably leave it as the default compression strategy, although performance benchmarks should be our guide here.
>>
>> *Questions for the Community*
>>
>> - Has anyone already explored zstd dictionaries for Cassandra?
>> - If so, are there existing performance tests or benchmarks?
>> - Any thoughts on the storage approach or dictionary generation strategy?
>> - Other considerations I might be missing?
>>
>> It seems like this would be a fairly easy win for improving density in clusters that are limited by disk space per node. It should also improve overall performance by reducing compression and decompression overhead. For the team I'm working with, we'd be reducing the node count in AWS by several hundred nodes. We started with about 1K nodes at 4 TB/node, were able to remove roughly 700 of them with the introduction of CASSANDRA-15452 (now at approximately 13 TB/node), and are looking to cut the number at least in half again.
>>
>> Looking forward to hearing your thoughts.
>>
>> Thanks,
>> Jon
>>
>> [1] https://facebook.github.io/zstd/
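>>
>> P.S. A rough, self-contained sketch of the chunk-sampling approach, using the zstd-jni bindings we already bundle. Plain files stand in for Data components here, and the constants are placeholders rather than proposed defaults; real code would read decompressed chunks through the existing compression metadata instead of raw file bytes:
>>
>>     import com.github.luben.zstd.ZstdDictTrainer;
>>
>>     import java.io.IOException;
>>     import java.nio.ByteBuffer;
>>     import java.nio.channels.FileChannel;
>>     import java.nio.file.Path;
>>     import java.nio.file.StandardOpenOption;
>>     import java.util.List;
>>     import java.util.concurrent.ThreadLocalRandom;
>>
>>     final class CompactionDictSampler
>>     {
>>         private static final int CHUNK = 16 * 1024;     // mirrors a 16 KiB compression chunk
>>         private static final int SAMPLES_PER_FILE = 64; // how much of each input to look at
>>
>>         // Train a dictionary from randomly sampled chunks of the compaction inputs.
>>         static byte[] train(List<Path> inputs, int dictSize) throws IOException
>>         {
>>             ZstdDictTrainer trainer = new ZstdDictTrainer(inputs.size() * SAMPLES_PER_FILE * CHUNK, dictSize);
>>             for (Path p : inputs)
>>             {
>>                 try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ))
>>                 {
>>                     long chunks = Math.max(1, ch.size() / CHUNK);
>>                     for (int i = 0; i < SAMPLES_PER_FILE; i++)
>>                     {
>>                         long offset = ThreadLocalRandom.current().nextLong(chunks) * CHUNK;
>>                         ByteBuffer buf = ByteBuffer.allocate((int) Math.min(CHUNK, ch.size() - offset));
>>                         ch.read(buf, offset);
>>                         buf.flip();
>>                         byte[] sample = new byte[buf.remaining()];
>>                         buf.get(sample);
>>                         trainer.addSample(sample); // returns false once the sample budget is full
>>                     }
>>                 }
>>             }
>>             return trainer.trainSamples(); // throws ZstdException if the samples don't train well
>>         }
>>     }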