[jira] [Commented] (CASSANDRA-17021) Enhance Zstd support in Cassandra with dictionaries

Stefan Miklosovic (Jira) Sun, 12 Feb 2023 13:20:40 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-17021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687634#comment-17687634
 ]


Stefan Miklosovic commented on CASSANDRA-17021:
-----------------------------------------------

To create a dictionary per SSTable, we would (probably) need to create it when 
the SSTable is flushed. The process of creating a dictionary is called 
"training". To train, we would need to use some of the mutations about to be 
written to disk as samples, and then compress the data with the resulting 
dictionary before writing it. Put differently, we would need to train on part 
of the SSTable we want to compress (and write), and then record which 
dictionary was used so that we know how to decompress it.
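
The per-SSTable flow described above (take a sample, build a dictionary, 
compress with it, record which dictionary was used) can be sketched roughly as 
follows. This is an illustration under assumptions, not Cassandra code: it uses 
Python's standard zlib module and its preset-dictionary support (zdict) as a 
stand-in for Zstd. zlib has no training step, so the raw sample bytes serve 
directly as the dictionary, whereas Zstd would run its own trainer over the 
samples. All function names and the metadata layout are hypothetical.

```python
import zlib


def build_dictionary(chunks, max_bytes=4096):
    """Concatenate leading chunks into a raw preset dictionary.

    Stand-in for Zstd training: zlib simply uses these bytes as-is.
    """
    buf = bytearray()
    for chunk in chunks:
        if len(buf) + len(chunk) > max_bytes:
            break
        buf += chunk
    return bytes(buf)


def compress_chunk(chunk, dictionary=None):
    """Compress one chunk, optionally primed with a preset dictionary."""
    if dictionary:
        comp = zlib.compressobj(zdict=dictionary)
    else:
        comp = zlib.compressobj()
    return comp.compress(chunk) + comp.flush()


def decompress_chunk(data, dictionary=None):
    """Decompress one chunk with the same dictionary used to compress it."""
    if dictionary:
        decomp = zlib.decompressobj(zdict=dictionary)
    else:
        decomp = zlib.decompressobj()
    return decomp.decompress(data) + decomp.flush()


def write_sstable(chunks, sample_count):
    """Hypothetical flush: sample the first chunks, then compress everything.

    Returns (metadata, dictionary, compressed_blocks). The metadata records
    an identifier of the dictionary so a reader knows which one to load.
    """
    dictionary = build_dictionary(chunks[:sample_count])
    metadata = {"dictionary_id": zlib.adler32(dictionary)}
    blocks = [compress_chunk(c, dictionary) for c in chunks]
    return metadata, dictionary, blocks
```

For small, similar chunks the dictionary-primed output is typically much 
smaller than the plain one, which is the effect the ticket is after. How many 
chunks to sample is exactly the open question raised below.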

The amount of data in an SSTable is not known in advance (we might flush 
manually, tables might be compacted, etc.), so it is unclear how much of the 
SSTable data we would need to use for training.

Another inconvenience is that CompressionParams creates compressors well 
before they are actually used. If we had a ZstdCompressor per SSTable (because 
each SSTable uses a different dictionary), we would need to know which 
concrete SSTable we are decompressing in order to use the right dictionary, 
and I do not see that kind of API in the current ICompressor interface. The 
current code is pretty much hard-wired and was not designed to be extensible 
in this regard.
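
To make the missing piece concrete, here is a minimal sketch of what a 
dictionary-aware lookup might look like: the compressor receives per-SSTable 
metadata on every call and resolves the dictionary from it, instead of being 
fully configured up front for the whole table. Everything here is hypothetical 
(SSTableMeta, DictionaryStore, DictionaryAwareCompressor); it does not mirror 
ICompressor, and Python's zlib preset dictionaries again stand in for Zstd.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTableMeta:
    """Hypothetical per-SSTable metadata carrying a dictionary identifier."""
    name: str
    dictionary_id: int


class DictionaryStore:
    """Hypothetical registry mapping dictionary identifiers to their bytes."""

    def __init__(self):
        self._by_id = {}

    def add(self, dictionary: bytes) -> int:
        dict_id = zlib.adler32(dictionary)
        self._by_id[dict_id] = dictionary
        return dict_id

    def get(self, dict_id: int) -> bytes:
        return self._by_id[dict_id]


class DictionaryAwareCompressor:
    """Resolves the dictionary per call from the SSTable being processed."""

    def __init__(self, store: DictionaryStore):
        self._store = store

    def compress(self, meta: SSTableMeta, data: bytes) -> bytes:
        comp = zlib.compressobj(zdict=self._store.get(meta.dictionary_id))
        return comp.compress(data) + comp.flush()

    def uncompress(self, meta: SSTableMeta, data: bytes) -> bytes:
        decomp = zlib.decompressobj(zdict=self._store.get(meta.dictionary_id))
        return decomp.decompress(data) + decomp.flush()
```

The design point is only that the SSTable identity has to reach the 
compressor somehow; whether that happens through an extra argument, a 
per-SSTable compressor instance, or something else is left open here.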

This all gets more complicated when compaction is involved.

Another approach is to train on data which is already written (on the first 
few SSTables) to create one dictionary which would be used for all subsequent 
SSTables. The more SSTables we train on, the better the dictionary we get, 
although I do not think that having huge dictionaries is a good idea. This 
probably needs to be tested empirically.

The training on existing SSTables can be done offline (by some CLI tool in 
Cassandra) to produce a dictionary. The table parameters would then be 
modified to point to that dictionary.

I think the second approach is much easier, and I will try to go with it 
first.

Questions: 

What happens when the dictionary gets lost or is corrupted? Is the data then 
impossible to decompress forever? How does decompressing data without its 
dictionary work?
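
A small experiment can show the failure mode in principle. The sketch below 
is an assumption-laden stand-in: it uses Python's zlib preset dictionaries 
rather than Zstd, and the helper names are hypothetical. With zlib, data 
compressed against a dictionary cannot be decompressed without it, or with a 
different one. Zstd documents the same requirement (decompression needs the 
same dictionary that was used for compression), but its exact error behaviour 
would have to be checked against the Zstd bindings Cassandra uses.

```python
import zlib


def compress_with(dictionary, data):
    """Compress data primed with a preset dictionary."""
    comp = zlib.compressobj(zdict=dictionary)
    return comp.compress(data) + comp.flush()


def try_decompress(data, dictionary=None):
    """Return (ok, payload_or_error_message) instead of raising."""
    try:
        if dictionary is not None:
            decomp = zlib.decompressobj(zdict=dictionary)
        else:
            decomp = zlib.decompressobj()
        return True, decomp.decompress(data) + decomp.flush()
    except zlib.error as exc:
        return False, str(exc)


dictionary = b'{"user_id": 0, "status": "active", "region": "eu-west-1"}' * 4
row = b'{"user_id": 42, "status": "active", "region": "eu-west-1"}'
blob = compress_with(dictionary, row)

print(try_decompress(blob, dictionary)[0])          # right dictionary
print(try_decompress(blob)[0])                      # dictionary lost
print(try_decompress(blob, b"another dictionary")[0])  # wrong dictionary
```

Only the first call succeeds. If Zstd dictionaries behave the same way in 
practice, a lost dictionary would make the affected SSTables unreadable, so 
the dictionary would need to be treated as part of the data it protects.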

> Enhance Zstd support in Cassandra with dictionaries
> ---------------------------------------------------
>
>                 Key: CASSANDRA-17021
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17021
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Feature/Compression
>            Reporter: Dinesh Joshi
>            Assignee: Stefan Miklosovic
>            Priority: Normal
>
> Currently Cassandra supports zstd compression. However, Zstd also supports 
> dictionaries to enhance not only the compression ratio but also the speed. 
> Dictionaries can show 3-4x savings. We should add support to train 
> dictionaries, ideally per SSTable, as this will yield the maximum gains.



