I'd love to chat about this on a call; I think it's valuable and could unlock 4KiB block sizes at ~all times without sacrificing ratio.
-Joey

On Fri, Aug 1, 2025 at 9:53 AM Dinesh Joshi <djo...@apache.org> wrote:

> We have explored compressing using trained dictionaries at various
> levels - component, table, and keyspace level. Obviously,
> component-level dictionary compression is best but results in a _lot_
> of dictionaries. Anyway, this really needs a bit of thought. Since
> there is a lot of interest and prior work that each of us may have
> done, I would suggest we discuss the various approaches in this thread
> or get on a quick call and bring the summary back to this list. Happy
> to organize a call if y'all are interested.
>
>
> On Fri, Aug 1, 2025 at 9:07 AM Štefan Miklošovič <smikloso...@apache.org>
> wrote:
>
>> Looking into my prototype (I think it is not doing anything yet, just
>> WIP), I am training it on flush, so that is in line with what Jon is
>> trying to do as well / what he suggests would be optimal.
>>
>> I do not have a dedicated dictionary component; what I tried to do
>> was to just put the dict directly into COMPRESSION_INFO and then bump
>> the SSTable version with a boolean saying whether it supports a
>> dictionary or not. So there is at least one component fewer.
>>
>> On Fri, Aug 1, 2025 at 5:59 PM Yifan Cai <yc25c...@gmail.com> wrote:
>>
>>> Yeah. I have built 2 POCs and have initial benchmark data comparing
>>> w/ and w/o dictionary. Unfortunately, the work went to the backlog.
>>> I can pick it up again if there is demand for the feature.
>>> There are some discussions in the Jira that Stefan linked. (thanks
>>> Stefan!)
>>>
>>> - Yifan
>>>
>>> ------------------------------
>>> *From:* Štefan Miklošovič <smikloso...@apache.org>
>>> *Sent:* Friday, August 1, 2025 8:54:07 AM
>>> *To:* dev@cassandra.apache.org <dev@cassandra.apache.org>
>>> *Subject:* Re: zstd dictionaries
>>>
>>> There is already a ticket for this:
>>> https://issues.apache.org/jira/browse/CASSANDRA-17021
>>>
>>> I would love to see this in action. I was investigating this a few
>>> years ago when zstd landed for the first time, in 4.0 I think. I was
>>> discussing it with Yifan, if my memory serves me well, but, like
>>> other things, it just went nowhere and was probably forgotten. I
>>> think there might be some POC around already. I started to work on
>>> this a few years ago and abandoned it because ... I still have a
>>> branch around and it would be great to compare what you have, etc.
>>>
>>> On Fri, Aug 1, 2025 at 5:12 PM Jon Haddad <j...@rustyrazorblade.com>
>>> wrote:
>>>
>>> Hi folks,
>>>
>>> I'm working with a team that's interested in seeing zstd
>>> dictionaries for SSTable compression implemented due to the
>>> potential space and cost savings. I wanted to share my initial
>>> thoughts and get the dev list's thoughts as well.
>>>
>>> According to the zstd documentation [1], dictionaries can provide
>>> approximately a 3x improvement in compression ratio on small data
>>> compared to non-dictionary compression, along with roughly 4x faster
>>> compression and decompression performance. The site notes that
>>> "training works if there is some correlation in a family of small
>>> data samples. The more data-specific a dictionary is, the more
>>> efficient it is (there is no universal dictionary). Hence, deploying
>>> one dictionary per type of data will provide the greatest benefits."
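The dictionary effect described in that quote is easy to demonstrate without zstd itself. The sketch below uses zlib's preset-dictionary support (`zdict`), which is in the Python standard library, as a stand-in: zstd's real trainer (`zstd --train` / `ZDICT_trainFromBuffer`) selects high-value substrings automatically, whereas here the "dictionary" is just hand-picked shared boilerplate, and the sample payloads are invented for illustration.

```python
import zlib

# Hypothetical small, similar payloads, like rows sharing one schema.
samples = [
    b'{"user_id": %d, "status": "active", "region": "us-east-1"}' % i
    for i in range(100)
]

# Crude stand-in for dictionary training: the boilerplate shared by all
# samples. A real zstd trainer would derive this from the samples.
dictionary = b'{"user_id": , "status": "active", "region": "us-east-1"}'

def compressed_size(data: bytes, zdict: bytes = b"") -> int:
    # With a preset dictionary, matches can reference dictionary bytes,
    # so short payloads compress far better than they do standalone.
    comp = zlib.compressobj(level=9, zdict=zdict) if zdict \
        else zlib.compressobj(level=9)
    return len(comp.compress(data) + comp.flush())

plain = sum(compressed_size(s) for s in samples)
with_dict = sum(compressed_size(s, dictionary) for s in samples)
print(plain, with_dict)  # dictionary compression wins on small payloads
```

The same mechanism is why per-chunk SSTable compression (small blocks, correlated content) is a good fit for trained dictionaries.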
>>>
>>> The implementation appears straightforward from a code perspective,
>>> but there are some architectural considerations I'd like to discuss:
>>>
>>> *Dictionary Management* One critical aspect is that the dictionary
>>> becomes essential for data recovery - if you lose the dictionary,
>>> you lose access to the compressed data, similar to losing an
>>> encryption key. (Please correct me if I'm misunderstanding this
>>> dependency.)
>>>
>>> *Storage Approach* I'm considering two options for storing the
>>> dictionary:
>>>
>>> 1. *SSTable Component*: Save the dictionary as a separate SSTable
>>> component alongside the existing files. My hesitation here is that
>>> we've traditionally maintained that Data.db is the only essential
>>> component.
>>>
>>> 2. *Data.db Header*: Embed the dictionary directly in the Data.db
>>> file header.
>>>
>>> I'm strongly leaning toward the component approach because it avoids
>>> modifications to the Data.db file format and can leverage our
>>> existing streaming infrastructure. I spoke with Blake about this,
>>> and it sounds like some of the newer features are more dependent on
>>> components other than Data, so I think this is acceptable.
>>>
>>> *Dictionary Generation*
>>>
>>> We currently default to flushing using LZ4, although I think that's
>>> only an optimization to avoid the high overhead of zstd. Using the
>>> memtable data to create a dictionary prior to flush could remove the
>>> need for this optimization entirely.
>>>
>>> During compaction, my plan is to generate dictionaries by either
>>> sampling chunks from existing files (similar overhead to reading
>>> random rows) or using just the first pages of data from each
>>> SSTable. I'd need to do some testing to see what the optimal setup
>>> is here.
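The compaction-time sampling described above could look roughly like the following. This is a hypothetical sketch, not Cassandra code: the chunk size, sample count, and the idea of seeking at fixed offsets are assumptions for illustration; a real implementation would read actual chunk boundaries from CompressionInfo.db and pass the samples to a zstd trainer.

```python
import os
import random

def sample_chunks(path: str, chunk_size: int = 4096,
                  max_samples: int = 64, seed: int = 42) -> list[bytes]:
    """Pull random fixed-size chunks from a file as dictionary-training
    samples. I/O overhead is comparable to reading max_samples random
    rows, matching the cost estimate in the email."""
    rng = random.Random(seed)
    size = os.path.getsize(path)
    n_chunks = max(1, size // chunk_size)
    # Sorted offsets keep the reads sequential on disk.
    picks = sorted(rng.sample(range(n_chunks), min(max_samples, n_chunks)))
    samples = []
    with open(path, "rb") as f:
        for idx in picks:
            f.seek(idx * chunk_size)
            samples.append(f.read(chunk_size))
    return samples
```

The "first pages of each SSTable" alternative Jon mentions is the same loop with `picks = range(k)` instead of a random sample; either way, the resulting list would be fed to a trainer such as `ZDICT_trainFromBuffer` (or the equivalent in zstd's Java binding).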
>>>
>>> *Opt-in:* I think the initial version of this should be opt-in via a
>>> flag on compression, but assuming it delivers on the performance and
>>> space gains, I think we'd want to remove the flag and make it the
>>> default. Assuming this feature lands in 6.0, I'd be looking to make
>>> it on by default in 7.0 when using zstd. The performance table lists
>>> LZ4 as still more performant, so I think we'd probably leave it as
>>> the default compression strategy, although performance benchmarks
>>> should be our guide here.
>>>
>>> *Questions for the Community*
>>>
>>> - Has anyone already explored zstd dictionaries for Cassandra?
>>> - If so, are there existing performance tests or benchmarks?
>>> - Any thoughts on the storage approach or dictionary generation
>>> strategy?
>>> - Other considerations I might be missing?
>>>
>>> It seems like this would be a fairly easy win for improving density
>>> in clusters that are limited by disk space per node. It should also
>>> improve overall performance by reducing compression and
>>> decompression overhead. For the team I'm working with, we'd be
>>> reducing node count in AWS by several hundred nodes. We started with
>>> about 1K nodes at 4TB / node, were able to remove roughly 700 with
>>> the introduction of CASSANDRA-15452 (now at approximately 13TB /
>>> node), and are looking to cut the number at least in half again.
>>>
>>> Looking forward to hearing your thoughts.
>>>
>>> Thanks,
>>>
>>> Jon
>>>
>>> [1] https://facebook.github.io/zstd/
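As a sanity check on the capacity math in Jon's last paragraph (all figures are the approximate ones from the email; the "half again" target is his stated goal):

```python
# ~1K nodes at 4 TB/node, reduced by ~700 nodes after CASSANDRA-15452
# raised density to ~13 TB/node; goal is to halve the count once more.
before_tb = 1000 * 4          # ~4000 TB of raw capacity originally
after_tb = (1000 - 700) * 13  # ~3900 TB on ~300 denser nodes
target_nodes = (1000 - 700) // 2  # "cut the number at least in half"
print(before_tb, after_tb, target_nodes)  # 4000 3900 150
```

So total capacity stayed roughly flat while node count dropped ~70%, and the dictionary work is aimed at reaching ~150 nodes.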