dcapwell commented on code in PR #2660:
URL: https://github.com/apache/cassandra/pull/2660#discussion_r1327792552


##########
src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java:
##########
@@ -172,14 +194,23 @@ public void doVerb(final Message<RepairMessage> message)
                         ColumnFamilyStore store = 
ColumnFamilyStore.getIfExists(desc.keyspace, desc.columnFamily);
                         if (store == null)
                         {
-                            logger.error("Table {}.{} was dropped during 
validation phase of repair {}",
-                                         desc.keyspace, desc.columnFamily, 
desc.parentSessionId);
-                            vState.phase.fail(String.format("Table %s.%s was 
dropped", desc.keyspace, desc.columnFamily));
-                            
MessagingService.instance().send(Message.out(VALIDATION_RSP, new 
ValidationResponse(desc)), message.from());
+                            String msg = String.format("Table %s.% was dropped 
during validation phase of repair %s", desc.keyspace, desc.columnFamily, 
desc.parentSessionId);
+                            vState.phase.fail(msg);
+                            logErrorAndSendFailureResponse(msg, message);
+                            ctx.messaging().send(Message.out(VALIDATION_RSP, 
new ValidationResponse(desc)), message.from());
                             return;
                         }
 
-                        
ActiveRepairService.instance.consistent.local.maybeSetRepairing(desc.parentSessionId);
+                        try
+                        {
+                            
ctx.repair().consistent.local.maybeSetRepairing(desc.parentSessionId);
+                        }
+                        catch (Throwable t)
+                        {
+                            JVMStabilityInspector.inspectThrowable(t);
+                            logErrorAndSendFailureResponse(t.toString(), 
message);
+                            ctx.messaging().send(Message.out(VALIDATION_RSP, 
new ValidationResponse(desc)), message.from());

Review Comment:
   I thought about this more and I think the following is an edge case to worry 
about
   
   * We get a `VALIDATION_REQ` and `participate.register(vState)` returns `true`
   * some check fails, so we send a fail ack back
   * fail ack is dropped in the messaging layer (maybe a GC causes us to think 
the coordinator is down, so we drop all messages)
   * we timeout and have more retry attempts, so send another attempt
   * `participate.register(vState)` is now `false`, so send ACK... coordinator 
now waits for the trees, which will never come because we rejected it!
   
   This isn't hard to solve as all we need to do is update 
   
   ```
   ValidationState vState = new ValidationState(desc, message.from());
                       if (!participate.register(vState))
                       {
                           sendAck(message);
                           logger.debug("Duplicate validation message found for 
parent={}, validation={}", participate.id, vState.id);
                           return;
                       }
   ```
   
   to check for this case (aka the state is failed)...
   
   Ill add a TODO for this and make sure we hit this in testing (I did see CI 
hang in the preview/ir case but yet to get CI to tell me the seed as the JVM is 
halted by ant first... maybe related)...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to