Re: [PR] CASSANALYTICS-126: Flush event consumer before persisting CDC state to prevent data loss on failure [cassandra-analytics]

via GitHub Mon, 09 Mar 2026 12:25:39 -0700


jmckenzie-dev commented on code in PR #178:
URL: 
https://github.com/apache/cassandra-analytics/pull/178#discussion_r2907478233



##########
cassandra-analytics-cdc/src/main/java/org/apache/cassandra/cdc/Cdc.java:
##########


Review Comment:
   > EX: if there is an out of memory exception during run method, do we stop 
consumers completely? what if the memory usage goes down, how does the sidecar 
start consumers again in that case?
   
   In my experience, once you've hit an OOM you can no longer trust the java 
process you're operating in to be in a good state. Any and all manner of 
problems come up once you have a non-deterministic "blast" of items that failed 
to allocate. Similar to my comment above about ordering and greedy resetting of 
the `this.timerId` to -1 before we've actually tried to stop things and see if 
they shut down gracefully or explode on execution, I think we face the same 
risk here. Catching all Throwables in this way is "fine" if you have something 
where you can reasonably recover from a failure state, but what do we do here 
if we're in a hybrid failure state (i.e. some persisting to disk failed, some 
didn't)? The logic around `persistToCassandra` will remove an activeState _even 
if it's in a catastrophically failing state_ and simply re-enqueue it for the 
next attempt.
   
   This whole thing seems engineered to try and always keep the 
`persistToCassandra` futures around until we know they're durably done, so we 
move to just catch everything and absorb them on failure. I guess my concern is 
the degenerate degradation case where we end up with a half-dead zombie process 
because we over-allocated and then end up strewing OOM's everywhere. For 
instance - if we're mid OOM, the following code's not going to re-enqueue the 
calls to persist:
   ```
           states.stream()
                 .map(this::persistToCassandra)
                 .filter(Objects::nonNull)
                 .forEach(activeFlush::add);
   ```
   The design here of the handoff between `activeFlush` and `states` plus our 
greedy "catch everything" exposes us to some nasty problems. =/



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] CASSANALYTICS-126: Flush event consumer before persisting CDC state to prevent data loss on failure [cassandra-analytics]

Reply via email to