[
https://issues.apache.org/jira/browse/CASSANDRA-17021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18028370#comment-18028370
]
Jon Haddad commented on CASSANDRA-17021:
----------------------------------------
bq. I am a bit concerned about the behavior, mainly from the perspective of
single responsibility.
Agreed. I share this concern. The issue stems from tight coupling to the write
path. This is something we could have avoided with the on-disk sampling
approach I originally proposed.
bq. Maybe we can have a command option that enables auto-flush with a specified
interval. Default is off. So operators have the flexibility of running flush
manually or automatically along with the command. How does it sound to you?
Why off by default? I've run through the training process about 20 times now,
and roughly 25% of the time something goes wrong. From a user perspective, it
feels broken. If we're going to tightly couple training to the write path, it
still needs to work reliably, which means addressing edge cases such as no
flush happening, or data that's been sitting around for a while and isn't
rewritten. Making users opt in to correct behavior is a pattern I'd like us to
move away from as a project. I'm a hard -1 on off by default.
The on-disk sampling approach would have eliminated this entire class of
problems. For my use cases, it's actually a requirement: write-once workloads
are going to be unnecessarily difficult to orchestrate with training tied to
writes.
This is a step forward, but we're adding workarounds for architectural
decisions that create operational headaches. I'd still like to see us
reconsider sampling on-disk data, which would make this "just work" without the
coupling issues we're now trying to patch around.
> Enhance Zstd support in Cassandra with dictionaries
> ---------------------------------------------------
>
> Key: CASSANDRA-17021
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17021
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: Feature/Compression
> Reporter: Dinesh Joshi
> Assignee: Yifan Cai
> Priority: Normal
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> Currently Cassandra supports zstd compression. However, Zstd also supports
> dictionaries to enhance not only the compression ratio but also the speed.
> Dictionaries can show 3-4x savings. We should add support to train
> dictionaries, ideally per SSTable, as this will yield the maximum gains.