[
https://issues.apache.org/jira/browse/CASSANDRA-17021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687634#comment-17687634
]
Stefan Miklosovic commented on CASSANDRA-17021:
-----------------------------------------------
To create a dictionary per SSTable, we would (probably) need to create the
dictionary when the SSTable is flushed. The process of creating a dictionary is
called "training". To train, we would need to take some of the mutations about
to be written to disk as samples, build the dictionary from them, and then
compress the data with that dictionary before writing it. Put differently, we
would need to train on part of the SSTable we want to compress (and write), and
then store the information about which dictionary was used so that we know how
to decompress it.
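
The flow described above can be sketched roughly as follows. This is only an
illustration under loud assumptions: it uses the preset dictionary (zdict)
support in Python's standard zlib module as a stand-in for a Zstd dictionary,
the "training" is a naive cut of the sample bytes rather than real Zstd
training, and train_dictionary, flush_sstable, read_sstable and the metadata
layout are hypothetical names, not Cassandra APIs.

```python
import hashlib
import zlib


def train_dictionary(samples, max_size=4096):
    """Naive stand-in for training: keep the tail of the concatenated samples."""
    blob = b"".join(samples)
    return blob[-max_size:]


def flush_sstable(rows, training_rows=100):
    """Train on the first rows, compress all rows, record which dictionary was used."""
    dictionary = train_dictionary(rows[:training_rows])
    compressor = zlib.compressobj(zdict=dictionary)
    data = compressor.compress(b"".join(rows)) + compressor.flush()
    # Hypothetical metadata: a hash identifying the dictionary needed to read the data.
    metadata = {"dictionary_id": hashlib.sha256(dictionary).hexdigest()}
    return data, metadata, dictionary


def read_sstable(data, dictionary):
    """Decompress data that was written with the given dictionary."""
    decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(data) + decompressor.flush()


rows = [b'{"user": %d, "status": "active"}\n' % i for i in range(1000)]
data, metadata, dictionary = flush_sstable(rows)
assert read_sstable(data, dictionary) == b"".join(rows)
```

The point of the sketch is the ordering: samples first, dictionary second,
compressed write third, and the dictionary identifier stored next to the data.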
The amount of data in an SSTable is not known in advance (we might flush
manually, or tables might be compacted, etc.), so it is unclear how much of the
SSTable data we would need to use for training.
Another inconvenience is that CompressionParams creates compressors well before
they are actually used. If we had a ZstdCompressor per SSTable (because we are
using a different dictionary per SSTable), we would need to know which concrete
SSTable we are decompressing in order to use the right dictionary, and I do not
see that kind of API in the current ICompressor interface. The current code is
pretty much hard-wired and was not designed to be very extensible in this
regard.
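
To make the missing piece concrete, here is a minimal sketch of the kind of
lookup that would be needed: a compressor resolved per SSTable rather than
created once up front. Everything here is hypothetical (DictionaryCompressor,
CompressorRegistry, the SSTable identifiers), zlib preset dictionaries stand in
for Zstd dictionaries, and none of it reflects the actual ICompressor or
CompressionParams code.

```python
import zlib


class DictionaryCompressor:
    """Hypothetical compressor bound to exactly one dictionary."""

    def __init__(self, dictionary):
        self.dictionary = dictionary

    def compress(self, chunk):
        c = zlib.compressobj(zdict=self.dictionary)
        return c.compress(chunk) + c.flush()

    def decompress(self, data):
        d = zlib.decompressobj(zdict=self.dictionary)
        return d.decompress(data) + d.flush()


class CompressorRegistry:
    """Hypothetical lookup from an SSTable identifier to its compressor."""

    def __init__(self):
        self._by_sstable = {}

    def register(self, sstable_id, dictionary):
        self._by_sstable[sstable_id] = DictionaryCompressor(dictionary)

    def for_sstable(self, sstable_id):
        # Raises KeyError if no dictionary is known for this SSTable.
        return self._by_sstable[sstable_id]


registry = CompressorRegistry()
registry.register("nb-1-big", b'{"user": , "status": "active"}\n')
registry.register("nb-2-big", b'{"sensor": , "reading": }\n')

chunk = b'{"sensor": 7, "reading": 21.5}\n'
stored = registry.for_sstable("nb-2-big").compress(chunk)
restored = registry.for_sstable("nb-2-big").decompress(stored)
assert restored == chunk
```

The read path has to carry the SSTable identity down to the point where the
compressor is chosen, which is the part that does not seem to exist today.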
This all gets more complicated when compaction is involved.
Another approach is to train on data which is already written (on the first few
SSTables) to create one dictionary which would be used for all following
SSTables. The more SSTables we train on, the better the dictionary we get,
although I do not think that having huge dictionaries is a good idea. This
probably needs to be tested empirically.
The training on existing SSTables can be done offline (by some CLI tool in
Cassandra) to produce a dictionary. Table parameters would then be modified to
point to that dictionary.
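
A rough sketch of how such an offline step, and the empirical test of
dictionary size, might look. Again this is an assumption-heavy illustration:
zlib preset dictionaries stand in for Zstd, build_dictionary is a naive cut of
existing data rather than real training, the SSTables are just lists of
in-memory chunks, and all function names are hypothetical.

```python
import zlib


def build_dictionary(sstables, max_size):
    """Naive stand-in for training: concatenate chunks, keep the last max_size bytes."""
    blob = b"".join(chunk for sstable in sstables for chunk in sstable)
    return blob[-max_size:]


def compressed_size(chunks, dictionary=None):
    """Total size of the chunks when each one is compressed independently."""
    total = 0
    for chunk in chunks:
        c = zlib.compressobj(zdict=dictionary) if dictionary else zlib.compressobj()
        total += len(c.compress(chunk) + c.flush())
    return total


def make_sstable(start, count):
    """Fake SSTable: a list of small, similar-looking chunks."""
    return [b'{"user": %d, "status": "active", "region": "eu-west-1"}\n' % i
            for i in range(start, start + count)]


existing = [make_sstable(0, 200), make_sstable(200, 200)]
new_sstable = make_sstable(400, 200)

# Compare no dictionary against dictionaries of several sizes.
baseline = compressed_size(new_sstable)
results = {size: compressed_size(new_sstable, build_dictionary(existing, size))
           for size in (256, 1024, 4096)}

print("no dictionary:", baseline)
for size, total in results.items():
    print("dictionary of", size, "bytes:", total)
```

Running the same comparison on real data with real Zstd dictionaries would be
the way to decide how many SSTables to train on and how large the dictionary
should be allowed to grow.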
I think the second approach is much easier, and I will try to go with it here
first.
Questions:
What happens when the dictionary gets lost or is corrupted? Is the data then
impossible to decompress forever? How does decompressing data without its
dictionary work?
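
As a point of comparison only, this is how the same situation plays out with
preset dictionaries in Python's standard zlib module: a stream compressed with
a dictionary cannot be decoded without it, or with an altered copy. The
try_decompress helper is hypothetical, and whether Zstd inside Cassandra would
behave the same way is exactly what the questions above still need answered.

```python
import zlib

dictionary = b'{"user": , "status": "active"}\n' * 8
payload = b'{"user": 42, "status": "active"}\n' * 50

compressor = zlib.compressobj(zdict=dictionary)
compressed = compressor.compress(payload) + compressor.flush()


def try_decompress(data, zdict=None):
    """Return the decompressed bytes, or None if the stream cannot be decoded."""
    try:
        if zdict is not None:
            d = zlib.decompressobj(zdict=zdict)
        else:
            d = zlib.decompressobj()
        return d.decompress(data) + d.flush()
    except zlib.error:
        return None


with_dictionary = try_decompress(compressed, dictionary)
without_dictionary = try_decompress(compressed)
# Simulate corruption by changing the last byte of the dictionary.
with_corrupted = try_decompress(compressed, dictionary[:-1] + b"X")

assert with_dictionary == payload
assert without_dictionary is None
assert with_corrupted is None
```

With zlib, the stream records a checksum of the dictionary it was written
with, so a missing or modified dictionary is detected rather than silently
producing wrong data.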
> Enhance Zstd support in Cassandra with dictionaries
> ---------------------------------------------------
>
> Key: CASSANDRA-17021
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17021
> Project: Cassandra
> Issue Type: Improvement
> Components: Feature/Compression
> Reporter: Dinesh Joshi
> Assignee: Stefan Miklosovic
> Priority: Normal
>
> Currently Cassandra supports Zstd compression. However, Zstd also supports
> dictionaries to enhance not only the compression ratio but also the speed.
> Dictionaries can show 3-4x savings. We should add support to train
> dictionaries, ideally per SSTable, as this will yield the maximum gains.