[ 
https://issues.apache.org/jira/browse/CASSANDRA-17021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18028370#comment-18028370
 ] 

Jon Haddad commented on CASSANDRA-17021:
----------------------------------------

bq. I am a bit concerned about the behavior, mainly from the perspective of 
single responsibility. 

Agreed.  I share this concern. The issue stems from tight coupling to the write 
path.  This is something we could have avoided with the on-disk sampling 
approach I originally proposed.

bq. Maybe we can have a command option that enables auto-flush with a specified 
interval. Default is off. So operators have the flexibility of running flush 
manually or automatically along with the command. How does it sound to you? 

Why off by default? I've run through the training process about 20 times now, 
and roughly 25% of the time something goes wrong. From a user perspective, it 
feels broken. If we're going to tightly couple training to the write path, it 
still needs to work reliably, which means handling the edge cases: no flush 
happening at all, or data that has been sitting around for a while and isn't 
rewritten. Making users opt in to correct behavior is a pattern I'd like us to 
move away from as a project. I'm a hard -1 on off by default.

The on-disk sampling approach would have eliminated this entire class of 
problems. For my use cases, it's actually a requirement—write-once workloads 
are going to be unnecessarily difficult to orchestrate with training tied to 
writes.

This is a step forward, but we're adding workarounds for architectural 
decisions that create operational headaches. I'd still like to see us 
reconsider sampling on-disk data, which would make this "just work" without the 
coupling issues we're now trying to patch around.

> Enhance Zstd support in Cassandra with dictionaries
> ---------------------------------------------------
>
>                 Key: CASSANDRA-17021
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17021
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Feature/Compression
>            Reporter: Dinesh Joshi
>            Assignee: Yifan Cai
>            Priority: Normal
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Currently Cassandra supports zstd compression. However, Zstd also supports 
> dictionaries to enhance not only the compression ratio but also the speed. 
> Dictionaries can show 3-4x savings. We should add support to train 
> dictionaries, ideally per SSTable, as this will yield the maximum gains.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
