[ 
https://issues.apache.org/jira/browse/CASSANDRA-17021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18028401#comment-18028401
 ] 

Jon Haddad commented on CASSANDRA-17021:
----------------------------------------

[~smiklosovic] Yes, exactly.  I don't think there's anything wrong with reading 
flushed data as an optimization, or to generate the initial dictionaries on the 
first SSTable flush, but it's not a reliable mechanism that covers all use 
cases.  For example, I have a bunch of tables sitting around that were written 
as part of an analytics job.  They're only read from.  This is personalization 
data that's calculated in a Spark job, similar to movie recommendations.  I'd 
really like to compress it.  The current design doesn't allow me to go back 
and train a dictionary and re-compress the old data.  There's not really a good 
explanation as to why we can't; it's just a gap in the design.

Sampling from the SSTables sitting on disk has multiple benefits:

1. It gives us training data from the entire dataset, which should be more 
representative than a single flush and hopefully give us a better compression 
ratio overall.
2. It decouples the write path from training
3. If done right, we can train a dictionary out of process.  This can apply to 
Spark jobs, but also backups.
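
To make the idea concrete, here is a rough sketch of what out-of-process 
sampling could look like.  It is only an illustration, not the proposed 
implementation: the function names are hypothetical, the files are treated as 
opaque uncompressed bytes (real SSTables would need to be read through the 
SSTable reader), and zlib's preset-dictionary support from the Python standard 
library stands in for Zstd dictionary training, which would replace 
build_dictionary() in practice.

```python
import random
import zlib


def sample_chunks(paths, chunk_size=4096, max_samples=1000, seed=0):
    """Reservoir-sample up to max_samples fixed-size chunks across all files.

    Every chunk in every file has the same chance of being kept, so the
    sample reflects the whole on-disk dataset rather than the latest flush.
    """
    rng = random.Random(seed)
    reservoir = []
    seen = 0
    for path in paths:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                seen += 1
                if len(reservoir) < max_samples:
                    reservoir.append(chunk)
                else:
                    # Replace an existing sample with probability max_samples / seen.
                    j = rng.randrange(seen)
                    if j < max_samples:
                        reservoir[j] = chunk
    return reservoir


def build_dictionary(samples, max_size=32 * 1024):
    """Stand-in for real training: concatenate samples and keep the tail.

    ASSUMPTION: a real implementation would call a Zstd dictionary trainer
    here instead of this naive concatenation.
    """
    return b"".join(samples)[-max_size:]


def compress_with_dict(data, dictionary):
    """Compress data using a zlib preset dictionary (stand-in for Zstd)."""
    c = zlib.compressobj(zdict=dictionary)
    return c.compress(data) + c.flush()


def decompress_with_dict(blob, dictionary):
    """Decompress data that was compressed with the same preset dictionary."""
    d = zlib.decompressobj(zdict=dictionary)
    return d.decompress(blob) + d.flush()
```

Nothing here touches the write path: the sampler only needs read access to 
files on disk, so the same approach could run against a backup or inside a 
Spark job, and the resulting dictionary could then be used to re-compress 
existing data.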

> Enhance Zstd support in Cassandra with dictionaries
> ---------------------------------------------------
>
>                 Key: CASSANDRA-17021
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17021
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Feature/Compression
>            Reporter: Dinesh Joshi
>            Assignee: Yifan Cai
>            Priority: Normal
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Currently Cassandra supports zstd compression. However, Zstd also supports 
> dictionaries to enhance not only the compression ratio but also the speed. 
> Dictionaries can show 3-4x savings. We should add support to train 
> dictionaries, ideally per SSTable, as this will yield the maximum gains.


