[
https://issues.apache.org/jira/browse/CASSANDRA-17021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18028401#comment-18028401
]
Jon Haddad commented on CASSANDRA-17021:
----------------------------------------
[~smiklosovic] Yes, exactly. I don't think there's anything wrong with reading
flushed data as an optimization, or to generate the initial dictionaries on the
first SSTable flush, but it's not a reliable mechanism that covers all use
cases. For example, I have a bunch of tables sitting around that were written
as part of an analytics job. They're only read from. This is personalization
data that's calculated in a Spark job, similar to movie recommendations. I'd
really like to compress it. The current design doesn't allow me to go back
and train a dictionary and re-compress the old data. There's not really a good
explanation as to why we can't - it's just a gap in the design.
Sampling from the SSTables sitting on disk has multiple benefits:
1. It gives us training data from the entire dataset, which should better
represent the data as a whole and hopefully give us a better overall
compression ratio.
2. It decouples the write path from training.
3. If done right, we can train a dictionary out of process. This can apply to
Spark jobs, but also backups.
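To make the sampling idea concrete, here is a rough sketch of how training samples could be drawn uniformly from files already on disk, independent of the write path. It is only an illustration under stated assumptions, not how Cassandra does or would do it. The function names (`iter_chunks`, `sample_chunks`) and parameters are hypothetical, the files are treated as opaque byte streams, and a real implementation would have to read decompressed partition data through Cassandra's own SSTable readers. The resulting samples would then be handed to a Zstd dictionary trainer, which is not shown here.

```python
import random
from pathlib import Path
from typing import Iterable, Iterator, List


def iter_chunks(path: Path, chunk_size: int) -> Iterator[bytes]:
    """Yield consecutive fixed-size chunks of a file (the last may be shorter)."""
    with path.open("rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield chunk


def sample_chunks(paths: Iterable[Path], chunk_size: int, max_samples: int,
                  seed: int = 0) -> List[bytes]:
    """Reservoir-sample up to max_samples chunks uniformly across all files.

    Hypothetical helper: the files stand in for on-disk data and are read
    as raw bytes. Memory use is bounded by max_samples * chunk_size.
    """
    rng = random.Random(seed)
    reservoir: List[bytes] = []
    seen = 0
    for path in paths:
        for chunk in iter_chunks(path, chunk_size):
            seen += 1
            if len(reservoir) < max_samples:
                # Fill the reservoir with the first max_samples chunks.
                reservoir.append(chunk)
            else:
                # Keep the new chunk with probability max_samples / seen,
                # evicting a uniformly chosen existing sample.
                j = rng.randrange(seen)
                if j < max_samples:
                    reservoir[j] = chunk
    return reservoir
```

Because the sampler only needs read access to files, it could run in a separate process, or against a backup or a Spark job's output, which is the decoupling described in points 2 and 3.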
> Enhance Zstd support in Cassandra with dictionaries
> ---------------------------------------------------
>
> Key: CASSANDRA-17021
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17021
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: Feature/Compression
> Reporter: Dinesh Joshi
> Assignee: Yifan Cai
> Priority: Normal
> Time Spent: 2h 50m
> Remaining Estimate: 0h
>
> Currently Cassandra supports zstd compression. However, Zstd also supports
> dictionaries to enhance not only the compression ratio but also the speed.
> Dictionaries can show 3-4x savings. We should add support for training
> dictionaries, ideally per SSTable, as this will yield the maximum gains.