> As a user, I'd rather have a WARN in my logs than to be unable to start the 
> database without changing cluster-wide configuration like schema / compaction 
> parameters.
Strong +1 here.

While on the one hand we expect homogeneous hardware environments for clusters, 
to Scott's point that's not always going to hold true in containerized and 
cloud-based environments. I definitely think we need to let operators know, 
but graceful degradation of the database (in a step-wise, plateau-based fashion 
like this, not a death-spiral scenario, to be clear) is much preferred IMO.
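
Roughly what I have in mind, just to make the tradeoff concrete. This is a
made-up sketch, not the actual CEP-49 patch; the interface and names here are
purely illustrative:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.atomic.AtomicLong;

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Hypothetical wrapper: try the hardware-backed compressor per chunk and
    // fall back to the configured software compressor on failure, emitting a
    // one-time WARN and counting fallbacks so they can be exposed as a metric.
    public final class FallbackCompressor
    {
        // Minimal stand-in for a chunk compressor; not Cassandra's ICompressor.
        public interface ChunkCompressor
        {
            void compress(ByteBuffer input, ByteBuffer output) throws IOException;
        }

        private static final Logger logger = LoggerFactory.getLogger(FallbackCompressor.class);

        private final ChunkCompressor hardware;  // e.g. a QAT-backed compressor
        private final ChunkCompressor software;  // e.g. the plain LZ4/Zstd compressor
        private final AtomicLong fallbacks = new AtomicLong();  // would feed a metric
        private final AtomicBoolean warned = new AtomicBoolean();

        public FallbackCompressor(ChunkCompressor hardware, ChunkCompressor software)
        {
            this.hardware = hardware;
            this.software = software;
        }

        public void compress(ByteBuffer input, ByteBuffer output) throws IOException
        {
            int inputPosition = input.position();
            int outputPosition = output.position();
            try
            {
                hardware.compress(input, output);
            }
            catch (IOException e)
            {
                fallbacks.incrementAndGet();
                if (warned.compareAndSet(false, true))
                    logger.warn("Hardware compression failed; falling back to software compressor", e);
                // Reset both buffers so the software compressor sees the untouched chunk.
                input.position(inputPosition);
                output.position(outputPosition);
                software.compress(input, output);
            }
        }

        public long fallbackCount()
        {
            return fallbacks.get();
        }
    }

The counter is the important part: wired into the metrics registry it gives
operators exactly the "how often does this happen" chart Štefan is asking for
below, while the one-time WARN keeps it visible in the logs without spamming
them on every chunk.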

On Tue, Dec 16, 2025, at 10:32 AM, Štefan Miklošovič wrote:
> Okay, I guess that is a good compromise to make here. So a warning in the
> logs + metrics? I think metrics would be cool to have so we could
> chart how often it happens, etc.
> 
> On Tue, Dec 16, 2025 at 4:27 PM C. Scott Andreas <[email protected]> wrote:
> >
> > One example where lack of a fallback would be problematic is:
> >
> > – User provisions AWS metal-class instances that expose hardware QAT and 
> > adopts it.
> > – User needs to expand the cluster or replace failed hardware.
> > – Insufficient hardware-QAT-capable machines are available from AWS.
> > – Cassandra is unable to start on replacement/expanded machines due to lack of 
> > a fallback.
> >
> > There are a handful of cases where the database performs similar fallbacks 
> > today, such as attempting mlockall on startup for improved memory locality 
> > and to avoid allocation stalls.
> >
> > As a user, I'd rather have a WARN in my logs than to be unable to start the 
> > database without changing cluster-wide configuration like schema / 
> > compaction parameters.
> >
> > – Scott
> >
> > On Dec 16, 2025, at 5:18 AM, Štefan Miklošovič <[email protected]> 
> > wrote:
> >
> >
> > I am open to adding some kind of metrics for when it falls back, to track
> > if / how often the hardware failed, etc. I am wondering what others think
> > about falling back just like that. I feel like something is not
> > transparent to a user who relies on hardware compression in the first
> > place.
> >
> > On Tue, Dec 16, 2025 at 1:52 PM Štefan Miklošovič
> > <[email protected]> wrote:
> >
> >
> > My personal preference is not to do any falling back. The reason for
> > that is that failures should be transparent, and if it is meant to fail,
> > so be it.
> >
> > If we wrap it in a try-catch and fall back, then a user thinks that
> > everything is just fine, right? There is no visibility into whether
> > and how often it failed that a user could act on. By falling back, a
> > user is kind of misled: they think all is just fine, yet they cannot
> > wrap their head around the fact that they bought hardware which
> > promises accelerated compression while their dashboards every now and
> > then show the same performance as if they were compressing in
> > software.
> >
> > If they see that it is failing, then they can reach out to the vendor
> > of that hardware and raise complaints and issues, so the vendor's
> > engineers can look into why it failed and how to fix it, instead of us
> > just wrapping it in one try-catch and acting like all is actually
> > fine. A user bought hardware to accelerate compression; I do not think
> > they are interested in "best-effort" here. If that hardware fails, or
> > the software which manages it is erroneous, then it should be either
> > fixed or replaced.
> >
> > On Tue, Dec 16, 2025 at 2:29 AM Kokoori, Shylaja
> > <[email protected]> wrote:
> > >
> > > Hi Stefan,
> > > Thank you very much for the feedback.
> > > You are correct, QAT is available on-die and not hot-plugged, and under 
> > > normal circumstances we shouldn't encounter this exception. However, I 
> > > wanted to add reverting to the base compressor to make it fault-tolerant.
> > >
> > > While the QAT software stack includes built-in retries and software 
> > > fallbacks for scenarios where devices end up busy, etc., I didn't 
> > > want operations to fail due to transient hardware issues which otherwise 
> > > would have succeeded. An example would be an unrecoverable error 
> > > occurring during a compress/decompress operation, whether due to a 
> > > hardware issue or to related software libraries; in that case the system 
> > > can gracefully revert to the base compressor rather than failing the 
> > > operation entirely.
> > >
> > > I am open to other suggestions.
> > > Thanks,
> > > Shylaja
> > > ________________________________
> > > From: Štefan Miklošovič <[email protected]>
> > > Sent: Monday, December 15, 2025 2:50 PM
> > > To: [email protected] <[email protected]>
> > > Subject: Re: [VOTE] CEP-49: Hardware-accelerated compression
> > >
> > > Hi Shylaja,
> > >
> > > I am going through the CEP so I can make a decision when voting, and I
> > > want to clarify a few things.
> > >
> > > You say there:
> > >
> > > Both the default compressor instance and a plugin compressor instance
> > > (obtained from the provider), will be maintained by Cassandra. For
> > > subsequent read/write operations, the plugin compressor will be used.
> > > However, if the plugin version encounters an error, the default
> > > compressor will handle the operation.
> > >
> > > Why are we doing this kind of "fallback"? Under what circumstances would
> > > "the plugin version encounter an error"? Why would it? It might be
> > > understandable to do it like that if the compression accelerator
> > > were some "plug and play" device, or if we could just remove it from a
> > > running machine. But this does not seem to be the case? The QAT you are
> > > mentioning is baked into the CPU, right? It is not like we would
> > > suddenly decide to turn it off at runtime so that the database would
> > > need to deal with it.
> > >
> > > The reason I am asking is that I just briefly went over the PR, and the
> > > way it works there is that if plugin de/compression is not possible
> > > (it throws an IOException), then it defaults to the software solution.
> > > This is done for every single de/compression of a chunk.
> > >
> > > Is this design absolutely necessary?
> > >
> > >
> > > On Mon, Dec 15, 2025 at 8:14 PM Josh McKenzie <[email protected]> 
> > > wrote:
> > > >
> > > > Yes but it's in reply to the discussion thread and so it threads that 
> > > > way in clients
> > > >
> > > > Apparently not in fastmail's client because it shows up as its own 
> > > > thread for me. /sigh
> > > >
> > > > Hence the confusion. Makes sense now.
> > > >
> > > > On Mon, Dec 15, 2025, at 1:18 PM, Kokoori, Shylaja wrote:
> > > >
> > > > Thank you for your feedback, Patrick & Brandon. I have created a new 
> > > > email thread like you suggested. Hopefully, this works.
> > > >
> > > > -Shylaja
> > > >
> > > > ________________________________
> > > > From: Patrick McFadin <[email protected]>
> > > > Sent: Monday, December 15, 2025 9:26 AM
> > > > To: [email protected] <[email protected]>
> > > > Subject: Re: [VOTE] CEP-49: Hardware-accelerated compression
> > > >
> > > > That was my point. It's a [DISCUSS] thread.
> > > >
> > > > On Mon, Dec 15, 2025 at 9:22 AM Brandon Williams <[email protected]> 
> > > > wrote:
> > > >
> > > > On Mon, Dec 15, 2025 at 11:13 AM Josh McKenzie <[email protected]> 
> > > > wrote:
> > > > >
> > > > > Can you put this into a [VOTE] thread?
> > > > >
> > > > > I'm confused - isn't the subject of this email [VOTE]?
> > > >
> > > > Yes but it's in reply to the discussion thread and so it threads that
> > > > way in clients, making it easy to overlook.
> > > >
> > > > Kind Regards,
> > > > Brandon
> > > >
> > > >
> >
> >
> >
> 
