[ https://issues.apache.org/jira/browse/CASSANDRA-17021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18028242#comment-18028242 ]
Jon Haddad commented on CASSANDRA-17021:
----------------------------------------
I've provided some docs for this patch, which Yifan has merged into his branch.
Some minor usability things:
It would be nice if the trainer could report some stats about what has been
sampled so far. Do we have anything available? Something like this would be useful:
{noformat}
nodetool traincompressiondictionary --status chunk_4kb keyvalue
Trainer is collecting sample data for chunk_4kb.keyvalue. 150,000 rows sampled, 9 minutes remaining.
{noformat}
Can we check once a minute by default for new dictionaries?
I'm finding that training can fail if memtables haven't been flushed to disk,
which can happen on systems with a lot of memory. In my test no flush occurred
despite 10 million writes, so training failed with zero samples:
{noformat}
Caused by: java.lang.IllegalStateException: Insufficient samples for training: 0 (minimum required: 10)
	at org.apache.cassandra.db.compression.ZstdDictionaryTrainer.trainDictionary(ZstdDictionaryTrainer.java:118)
	at org.apache.cassandra.db.compression.ICompressionDictionaryTrainer.lambda$trainDictionaryAsync$0(ICompressionDictionaryTrainer.java:81)
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
	... 9 common frames omitted
{noformat}
I don't think this is a deal breaker, as it'll still be useful for a lot of
people, but I think this will bite some folks.
> Enhance Zstd support in Cassandra with dictionaries
> ---------------------------------------------------
>
> Key: CASSANDRA-17021
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17021
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: Feature/Compression
> Reporter: Dinesh Joshi
> Assignee: Yifan Cai
> Priority: Normal
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> Currently Cassandra supports Zstd compression. However, Zstd also supports
> dictionaries, which improve not only the compression ratio but also the speed.
> Dictionaries can show 3-4x savings. We should add support for training
> dictionaries, ideally per SSTable, as this will yield the maximum gains.