Thank you for bringing this up. There was also a re-prepare storm issue (15252) 
that we have since fixed. I think in 17401 the re-prepares are transient, while 
in 15252 they'd be permanent, as you're describing.
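
To make the amplification concrete, here is a minimal, self-contained Java 
sketch. It is purely hypothetical: the names (statementCache, 
prepareOnCoordinator, execute) are made up for illustration and are not 
Cassandra or driver code. It only shows how an eviction that races with a hot 
statement under high QPS turns the one expected re-prepare into a storm of 
prepare calls:

// Hypothetical, self-contained sketch; not Cassandra or driver code. All names
// (statementCache, prepareOnCoordinator, execute) are made up for illustration.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class RePrepareStormSketch {
    // Stand-in for the coordinator's cache of prepared statements.
    static final ConcurrentHashMap<String, String> statementCache = new ConcurrentHashMap<>();
    static final AtomicLong prepareCount = new AtomicLong();

    // Stand-in for the expensive server-side prepare step.
    static void prepareOnCoordinator(String id, String query) {
        prepareCount.incrementAndGet();        // each call costs coordinator CPU
        statementCache.put(id, query);
    }

    // Every request checks the cache; a miss forces a re-prepare before executing.
    static void execute(String id, String query) {
        if (!statementCache.containsKey(id))   // UNPREPARED-style miss
            prepareOnCoordinator(id, query);
        // ... execute the statement ...
    }

    public static void main(String[] args) throws InterruptedException {
        String id = "stmt-1";
        String query = "SELECT * FROM ks.tbl WHERE pk = ?";
        prepareOnCoordinator(id, query);       // the one prepare you expect

        ExecutorService pool = Executors.newFixedThreadPool(32);
        for (int i = 0; i < 100_000; i++) {
            final int n = i;
            pool.submit(() -> {
                // Model the race from the ticket: the hot statement keeps getting
                // evicted while requests for it are still in flight.
                if (n % 100 == 0)
                    statementCache.remove(id); // spurious eviction
                execute(id, query);
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        // With the spurious evictions the prepare count lands in the thousands
        // instead of staying at 1; that amplification is the re-prepare storm.
        System.out.println("prepare calls: " + prepareCount.get());
    }
}

Whether those extra prepares are transient (as in 17401) or effectively 
permanent (as in 15252), the coordinator ends up spending its CPU on 
re-preparing instead of executing.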

On Mon, Dec 15, 2025, at 9:27 PM, Evan Jones via dev wrote:
> This isn't as helpful as I would like, but in case it helps: the description 
> of this problem sounds similar to an incident we had at Datadog at some point 
> in the past year. I can't remember the details and I can't find it quickly 
> right now, so it might not be identical. IIRC we observed a Cassandra cluster 
> spending ~100% of its time preparing statements, according to our Java continuous 
> profiling. We weren't sure if the bug was in Cassandra or in the gocql driver 
> we use, which auto-prepares statements. IIRC we ended up scaling the cluster 
> and/or turning off the source application and ramping it back up slowly 
> again, and we weren't able to reproduce the issue again.
> 
> Evan Jones
> 
> 
> 
> 
> On Mon, Dec 15, 2025 at 1:33 PM Jaydeep Chovatia <[email protected]> 
> wrote:
>> No problem, Alex. I'm also sorry for not pinging you a couple more times, as 
>> I assumed this was a corner case only I was seeing. It is now clear that a 
>> few other folks in the industry have faced it as well.
>> Please let Runtian or me know if you need any additional information on our 
>> end. Thank you!
>> 
>> Jaydeep
>> 
>> On Mon, Dec 15, 2025 at 9:47 AM Alex Petrov <[email protected]> wrote:
>>> Thank you for explaining. I'll dig through the code to try to remember why 
>>> we introduced eviction, just to make sure we aren't going to introduce a 
>>> correctness issue in place of a perf/operational issue (which I am not 
>>> claiming is the case, btw; I am just not fully certain yet).
>>> 
>>> Also, Jaydeep, sorry for dropping the ball on this: I was under the 
>>> impression this had lost importance and hadn't realized it was pending all 
>>> that time.
>>> 
>>> On Mon, Dec 15, 2025, at 6:41 PM, Runtian Liu wrote:
>>>> Alex, you're absolutely right that this isn’t a correctness issue—the 
>>>> system will eventually re-prepare the statement. The problem, however, 
>>>> shows up in real production environments under high QPS.
>>>> 
>>>> When a node is serving a heavy workload, the race condition described in 
>>>> the ticket causes repeated evictions followed by repeated re-prepare 
>>>> attempts. Instead of a single re-prepare, we see a *storm* of re-prepare 
>>>> requests hitting the coordinator. This quickly becomes expensive: it 
>>>> increases CPU usage, adds latency, and in our case escalated into a 
>>>> cluster-wide performance degradation. We actually experienced an outage 
>>>> triggered by this behavior.
>>>> 
>>>> So while correctness is preserved, the operational impact is severe. 
>>>> Preventing the unnecessary eviction avoids the re-prepare storm entirely, 
>>>> which is why we believe this patch is important for stability in real 
>>>> clusters.
>>>> 
>>>> 
>>>> On Mon, Dec 15, 2025 at 8:00 AM Paulo Motta <[email protected]> wrote:
>>>>> I wanted to note that I recently faced the issue described in this ticket 
>>>>> in a real cluster. I'm not familiar enough with this area to know whether 
>>>>> there are any negative implications of this patch.
>>>>> 
>>>>> So even if it's not a correctness issue per se, if it fixes a practical 
>>>>> issue faced by users without negative consequences, I don't see why it 
>>>>> should not be accepted, especially since it has been validated in 
>>>>> production.
>>>>> 
>>>>> On Mon, 15 Dec 2025 at 07:28 Alex Petrov <[email protected]> wrote:
>>>>>> IIRC I reviewed it and mentioned this is not a correctness issue, since 
>>>>>> we would simply re-prepare. I can't recall why we needed to evict, but I 
>>>>>> think it was for correctness reasons.
>>>>>> 
>>>>>> Would you mind elaborating on why simply letting the statement get 
>>>>>> re-prepared is harmful behavior? Or am I missing something and this has 
>>>>>> larger implications?
>>>>>> 
>>>>>> To be clear, I am not opposed to this patch, just want to understand 
>>>>>> implications better.
>>>>>> 
>>>>>> On Sun, Dec 14, 2025, at 9:03 PM, Jaydeep Chovatia wrote:
>>>>>>> Hi
>>>>>>> 
>>>>>>> I reported this bug (CASSANDRA-17401 
>>>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-17401>) in 2022, along 
>>>>>>> with the fix (PR#3059 <https://github.com/apache/cassandra/pull/3059>) 
>>>>>>> and a reproducer (PR#3058 
>>>>>>> <https://github.com/apache/cassandra/pull/3058>). I have already applied 
>>>>>>> this fix internally, and it has been working fine for many years. Now we 
>>>>>>> can see that another Cassandra user has been facing the exact same 
>>>>>>> problem; I have told them to go with the private fix for now.
>>>>>>> Paulo and Alex have partially reviewed it. Could you (or someone) please 
>>>>>>> complete the review so I can land it in the official repo?
>>>>>>> 
>>>>>>> Jaydeep
>>>>>> 
>>> 
