[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767742#comment-17767742 ]
Cameron Zemek edited comment on CASSANDRA-18866 at 9/21/23 10:04 PM:
---------------------------------------------------------------------

Found some bugs:
{code:java}
if (inflightEcho.contains(addr)) {
    return;
}
inflightEcho.add(addr);
{code}
should be
{noformat}
if (!inflightEcho.add(addr)) {
    logger.info("Skip ECHO_REQ to {}", addr);
    return;
}
{noformat}
Otherwise there is a data race: the separate contains() and add() calls are not atomic together, so two threads can both pass the check and multiple ECHOs end up in flight.

and
{code:java}
@Override
public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason) {
    MessagingService.instance().sendWithCallback(echoMessage, addr, this);
}
{code}
should be
{code:java}
@Override
public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason) {
    logger.trace("Resending ECHO_REQ to {}", addr);
    Message<NoPayload> echoMessage = Message.out(ECHO_REQ, noPayload);
    MessagingService.instance().sendWithCallback(echoMessage, addr, this);
}
{code}
That is, a new message needs to be constructed for each retry rather than sending the same message again.
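The first fix relies on Set.add() reporting whether the element was newly inserted. The following is a minimal standalone sketch of that idea, not the Cassandra code: the class InflightEchoSketch, its methods, and the use of String addresses with ConcurrentHashMap.newKeySet() are assumptions made for illustration (the actual type of inflightEcho is not shown in this thread).

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the inflight ECHO tracking; not Cassandra's Gossiper.
public class InflightEchoSketch {
    // Assumption: a concurrent set, so add() is an atomic "insert if absent".
    private final Set<String> inflightEcho = ConcurrentHashMap.newKeySet();

    /** Returns true only for the single caller that actually inserted addr. */
    public boolean tryMarkInflight(String addr) {
        return inflightEcho.add(addr);
    }

    /** Clears the marker once a reply arrives (or the caller gives up). */
    public void markDone(String addr) {
        inflightEcho.remove(addr);
    }

    /** Races `threads` workers on one address and counts how many "send". */
    public static int countWinners(int threads) {
        InflightEchoSketch gate = new InflightEchoSketch();
        AtomicInteger winners = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (int i = 0; i < threads; i++) {
                pool.execute(() -> {
                    try {
                        start.await(); // release all workers at once
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    if (gate.tryMarkInflight("127.0.0.1:7000")) {
                        winners.incrementAndGet(); // would send ECHO_REQ here
                    }
                });
            }
            start.countDown();
            pool.shutdown();
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
        return winners.get();
    }

    public static void main(String[] args) {
        System.out.println("winners = " + countWinners(16)); // prints "winners = 1"
    }
}
```

With a separate contains() followed by add(), more than one worker could observe the address as absent and proceed; with the single add() call, exactly one caller sees true regardless of interleaving.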

> Node sends multiple inflight echos
> ----------------------------------
>
>                 Key: CASSANDRA-18866
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Cameron Zemek
>            Priority: Normal
>         Attachments: 18866-regression.patch, duplicates.log, echo.log
>
>
> CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular,
> 18845 changed the code to allow only one inflight ECHO request at a time.
> According to 18854, some tests have an error rate due to this change. I am
> creating this ticket to discuss this further. The current state also has no
> retry logic; it just allows multiple ECHO requests in flight at the same
> time, which makes it less likely that all ECHOs time out or get lost.
> With the change from 18845 applied, plus some extra logging to track what is
> going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO
> requests from a node, and I also see it retrying ECHOs when it doesn't get a
> reply.
> Therefore, I think the problem is more specific than the dropping of one ECHO
> request. Yes, there is no retry logic for failed ECHO requests, but this is
> the case both before and after 18845. ECHO requests are only sent via gossip
> verb handlers calling applyStateLocally. In these failed tests I am therefore
> assuming there are cases where it won't call markAlive when other nodes
> consider the node UP but it is marked DOWN by a node.