dcapwell commented on code in PR #2660:
URL: https://github.com/apache/cassandra/pull/2660#discussion_r1327792552
##########
src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java:
##########
@@ -172,14 +194,23 @@ public void doVerb(final Message<RepairMessage> message)
ColumnFamilyStore store =
ColumnFamilyStore.getIfExists(desc.keyspace, desc.columnFamily);
if (store == null)
{
- logger.error("Table {}.{} was dropped during
validation phase of repair {}",
- desc.keyspace, desc.columnFamily,
desc.parentSessionId);
- vState.phase.fail(String.format("Table %s.%s was
dropped", desc.keyspace, desc.columnFamily));
-
MessagingService.instance().send(Message.out(VALIDATION_RSP, new
ValidationResponse(desc)), message.from());
+ String msg = String.format("Table %s.% was dropped
during validation phase of repair %s", desc.keyspace, desc.columnFamily,
desc.parentSessionId);
+ vState.phase.fail(msg);
+ logErrorAndSendFailureResponse(msg, message);
+ ctx.messaging().send(Message.out(VALIDATION_RSP,
new ValidationResponse(desc)), message.from());
return;
}
-
ActiveRepairService.instance.consistent.local.maybeSetRepairing(desc.parentSessionId);
+ try
+ {
+
ctx.repair().consistent.local.maybeSetRepairing(desc.parentSessionId);
+ }
+ catch (Throwable t)
+ {
+ JVMStabilityInspector.inspectThrowable(t);
+ logErrorAndSendFailureResponse(t.toString(),
message);
+ ctx.messaging().send(Message.out(VALIDATION_RSP,
new ValidationResponse(desc)), message.from());
Review Comment:
I thought about this more and I think the following is an edge case to worry
about
* We get a `VALIDATION_REQ` and `participate.register(vState)` returns `true`
* some check fails, so we send a fail ack back
* fail ack is dropped in the messaging layer (maybe a GC causes us to think
the coordinator is down, so we drop all messages)
* we timeout and have more retry attempts, so send another attempt
* `participate.register(vState)` is now `false`, so send ACK... coordinator
now waits for the trees, which will never come because we rejected it!
This isn't hard to solve as all we need to do is update
```
ValidationState vState = new ValidationState(desc, message.from());
if (!participate.register(vState))
{
sendAck(message);
logger.debug("Duplicate validation message found for
parent={}, validation={}", participate.id, vState.id);
return;
}
```
to check for this case (aka the state is failed)...
Ill add a TODO for this and make sure we hit this in testing (I did see CI
hang in the preview/ir case but yet to get CI to tell me the seed as the JVM is
halted by ant first... maybe related)...
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]