bdeggleston commented on code in PR #4508:
URL: https://github.com/apache/cassandra/pull/4508#discussion_r2632056160
##########
src/java/org/apache/cassandra/db/AbstractMutationVerbHandler.java:
##########
@@ -51,7 +53,15 @@ public abstract class AbstractMutationVerbHandler<T extends IMutation> implement
     public void doVerb(Message<T> message) throws IOException
     {
-        processMessage(message, message.respondTo());
+        try
+        {
+            processMessage(message, message.respondTo());
+        }
+        catch (IllegalReplicationTypeException e)
+        {
+            // retry, write may be recoverable if TCM is behind locally
+            doVerb(message);
Review Comment:
Right, I'd added this to recover in cases where we're behind locally. Here's
a possible (though unlikely) scenario:
You have a log with 3 changes:
Epoch 1 - tracked
Epoch 2 - untracked
Epoch 3 - tracked.
The coordinator sends a tracked mutation with epoch 3. The replica is on
epoch 1, but the migration router agrees, so the verb handler check passes.
Between the check and the apply, the local TCM log advances to epoch 2, and
the check in keyspace#apply fails. However, we're still behind, so the verb
handler retries, notices we should be on epoch 3, advances the local log to
epoch 3, and the write succeeds.
I could see this happening if an operator updates the replication type and
quickly changes it back while TCM replication is having trouble somewhere.
What's less clear is whether avoiding dropped writes in this edge case is
worth the complexity, and which outcome is more likely to cause wasted effort
during production investigations. WDYT?
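
To make the interleaving concrete, here is a rough, self-contained simulation of the scenario above. It is only a sketch and not the PR's code: `EpochRetrySketch`, `Replica`, `ReplicationTypeMismatch` and the `maxAttempts` bound are all hypothetical names for illustration (the diff itself retries by calling `doVerb` recursively without a bound), and the race is injected by hand rather than by a real TCM log.

```java
import java.util.ArrayList;
import java.util.List;

final class EpochRetrySketch
{
    /** Hypothetical stand-in for IllegalReplicationTypeException. */
    static final class ReplicationTypeMismatch extends RuntimeException
    {
        ReplicationTypeMismatch(String message) { super(message); }
    }

    /** The log from the scenario: TRACKED_AT_EPOCH[epoch - 1] is the replication type at that epoch. */
    static final boolean[] TRACKED_AT_EPOCH = { true, false, true };

    static final class Replica
    {
        int localEpoch = 1;            // the replica starts behind, on epoch 1
        boolean raceInjected = false;  // lets the simulated race fire only once
        final List<String> events = new ArrayList<>();

        boolean trackedLocally()
        {
            return TRACKED_AT_EPOCH[localEpoch - 1];
        }

        void checkReplicationType(boolean messageTracked, String where)
        {
            if (trackedLocally() != messageTracked)
            {
                events.add(where + " check failed at epoch " + localEpoch);
                throw new ReplicationTypeMismatch(where + " check failed");
            }
            events.add(where + " check passed at epoch " + localEpoch);
        }

        void processMessage(int messageEpoch, boolean messageTracked)
        {
            // If we disagree with the message and it carries a newer epoch, catch up first.
            if (trackedLocally() != messageTracked && messageEpoch > localEpoch)
            {
                localEpoch = messageEpoch;
                events.add("caught up to epoch " + localEpoch);
            }
            checkReplicationType(messageTracked, "handler");

            // Simulated race: the local log advances to epoch 2 between check and apply.
            if (!raceInjected)
            {
                raceInjected = true;
                localEpoch = 2;
                events.add("local log advanced to epoch " + localEpoch);
            }
            checkReplicationType(messageTracked, "apply");
            events.add("write applied at epoch " + localEpoch);
        }

        /** Retries on a mismatch, up to maxAttempts (the bound is an assumption, not in the diff). */
        void doVerb(int messageEpoch, boolean messageTracked, int maxAttempts)
        {
            for (int attempt = 1; ; attempt++)
            {
                try
                {
                    processMessage(messageEpoch, messageTracked);
                    return;
                }
                catch (ReplicationTypeMismatch e)
                {
                    if (attempt >= maxAttempts)
                        throw e;
                }
            }
        }
    }

    public static void main(String[] args)
    {
        Replica replica = new Replica();
        replica.doVerb(3, true, 2); // tracked mutation sent at epoch 3
        replica.events.forEach(System.out::println);
    }
}
```

In this sketch the first attempt fails in the apply check at epoch 2 and the second attempt catches up to epoch 3 and applies the write. With `maxAttempts` set to 1 the same write is dropped, which is the trade-off in question.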
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.