Hi folks,

I'm working with a team that's interested in seeing zstd dictionaries for
SSTable compression implemented due to the potential space and cost
savings. I wanted to share my initial thinking and get feedback from the
dev list.

According to the zstd documentation [1], dictionaries can improve
compression ratios on small data by roughly 3x compared to non-dictionary
compression, while also making compression and decompression roughly 4x
faster. The site notes that "training works if there is some
correlation in a family of small data samples. The more data-specific a
dictionary is, the more efficient it is (there is no universal dictionary).
Hence, deploying one dictionary per type of data will provide the greatest
benefits."

The implementation appears straightforward from a code perspective, but
there are some architectural considerations I'd like to discuss:

*Dictionary Management* One critical aspect is that the dictionary becomes
essential for data recovery - if you lose the dictionary, you lose access
to the compressed data, similar to losing an encryption key. (Please
correct me if I'm misunderstanding this dependency.)

*Storage Approach* I'm considering two options for storing the dictionary:

   1. *SSTable Component*: Save the dictionary as a separate SSTable
      component alongside the existing files. My hesitation here is that
      we've traditionally maintained that Data.db is the only essential
      component.
   2. *Data.db Header*: Embed the dictionary directly in the Data.db file
      header.

I'm strongly leaning toward the component approach because it avoids
modifications to the Data.db file format and can leverage our existing
streaming infrastructure. I spoke with Blake about this, and it sounds like
some of the newer features already depend on components other than Data.db,
so I think this is acceptable.
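
As a rough illustration of the component approach, the dictionary could
live as a sidecar file next to Data.db. The Dictionary.db name and layout
below are hypothetical, and this uses plain file I/O rather than the actual
Component machinery:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DictionaryComponent
{
    // Hypothetical sidecar component living next to Data.db, e.g.
    // nb-1-big-Dictionary.db alongside nb-1-big-Data.db.
    static Path dictionaryPath(Path dataFile)
    {
        String name = dataFile.getFileName().toString()
                              .replace("-Data.db", "-Dictionary.db");
        return dataFile.resolveSibling(name);
    }

    static void write(Path dataFile, byte[] dict) throws IOException
    {
        Files.write(dictionaryPath(dataFile), dict);
    }

    static byte[] read(Path dataFile) throws IOException
    {
        return Files.readAllBytes(dictionaryPath(dataFile));
    }
}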

*Dictionary Generation*

We currently default to flushing with LZ4, though I believe that's only an
optimization to avoid zstd's higher overhead on the flush path. Using the
memtable data to train a dictionary prior to flush could remove the need
for this optimization entirely.
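
A sketch of what training at flush time might look like, assuming the flush
path hands us serialized rows (the List<byte[]> input here is a stand-in
for whatever the flush path would actually expose, not a real API):

import com.github.luben.zstd.ZstdDictTrainer;
import java.util.List;

public class FlushDictionary
{
    // Train a per-table dictionary from serialized memtable rows just
    // before flush; the trained dictionary would then be handed to the
    // flush writer's compressor. Sizes are illustrative.
    static byte[] trainFromMemtable(List<byte[]> serializedRows)
    {
        ZstdDictTrainer trainer = new ZstdDictTrainer(8 * 1024 * 1024,
                                                      64 * 1024);
        for (byte[] row : serializedRows)
            if (!trainer.addSample(row))
                break; // sample buffer full; train on what we have
        return trainer.trainSamples();
    }
}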

During compaction, my plan is to generate dictionaries by either sampling
chunks from existing files (similar overhead to reading random rows) or
using just the first pages of data from each SSTable.  I'd need to do some
testing to see what the optimal setup is here.
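
Roughly, the two sampling strategies could look like the sketch below. Note
that it reads raw file regions for brevity; in practice we'd want to sample
decompressed chunk contents through the reader, since training on
already-compressed bytes wouldn't produce a useful dictionary. Chunk size
and sample counts are made up:

import com.github.luben.zstd.ZstdDictTrainer;

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class CompactionDictionary
{
    // Strategy A (randomSample = true): sample random chunks from each
    // input file. Strategy B (randomSample = false): take the first N
    // chunks of each file.
    static byte[] trainFromInputs(List<Path> dataFiles, boolean randomSample)
            throws IOException
    {
        int chunkSize = 16 * 1024;
        int chunksPerFile = 64;
        ZstdDictTrainer trainer = new ZstdDictTrainer(32 * 1024 * 1024,
                                                      64 * 1024);
        for (Path p : dataFiles)
        {
            try (RandomAccessFile raf = new RandomAccessFile(p.toFile(), "r"))
            {
                long chunks = Math.max(1, raf.length() / chunkSize);
                for (int i = 0; i < chunksPerFile && i < chunks; i++)
                {
                    long chunk = randomSample
                               ? ThreadLocalRandom.current().nextLong(chunks)
                               : i;
                    byte[] buf = new byte[chunkSize];
                    raf.seek(chunk * chunkSize);
                    int n = raf.read(buf);
                    if (n <= 0)
                        continue;
                    if (n < chunkSize)
                    {
                        byte[] tail = new byte[n];
                        System.arraycopy(buf, 0, tail, 0, n);
                        buf = tail;
                    }
                    trainer.addSample(buf);
                }
            }
        }
        return trainer.trainSamples();
    }
}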

*Opt-in*: I think the initial version of this should be opt-in via a flag
on the compression options, but assuming it delivers on the performance and
space gains, I think we'd want to remove the flag and make it the default.
Assuming this feature lands in 6.0, I'd be looking to make it on by default
in 7.0 when using zstd. The zstd benchmark table still lists LZ4 as more
performant, so I think we'd probably leave LZ4 as the default compression
strategy, although performance benchmarks should be our guide here.

*Questions for the Community*

   - Has anyone already explored zstd dictionaries for Cassandra?
   - If so, are there existing performance tests or benchmarks?
   - Any thoughts on the storage approach or dictionary generation strategy?
   - Other considerations I might be missing?

It seems like this would be a fairly easy win for improving density in
clusters that are limited by disk space per node. It should also improve
overall performance by reducing compression and decompression overhead.
For the team I'm working with, we'd be reducing node count in AWS by
several hundred nodes. We started with about 1K nodes at 4 TB/node, removed
roughly 700 of them with the introduction of CASSANDRA-15452 (now at
approximately 13 TB/node), and are looking to cut the count at least in
half again.

Looking forward to hearing your thoughts.

Thanks,

Jon
[1] https://facebook.github.io/zstd/
