[
https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767742#comment-17767742
]
Cameron Zemek edited comment on CASSANDRA-18866 at 9/22/23 3:00 AM:
--------------------------------------------------------------------
Found some bugs:
{code:java}
if (inflightEcho.contains(addr))
{
    return;
}
inflightEcho.add(addr);
{code}
should be
{code:java}
if (!inflightEcho.add(addr))
{
    return;
}
{code}
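Otherwise the separate contains() check and add() call form a data race: two threads can both pass the check before either adds the address, which allows multiple inflight echos.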
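To illustrate why the single add() call closes the race, here is a minimal standalone sketch. It is not the Gossiper code: the class name, the method names and the String address are hypothetical, and it assumes inflightEcho is a thread-safe set (here one created with ConcurrentHashMap.newKeySet()). With a plain HashSet, add() on its own would not be safe either.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the inflight-echo bookkeeping; not Cassandra code.
public class InflightEchoSketch
{
    // Assumption: the real set is concurrent. newKeySet() gives a thread-safe Set.
    private final Set<String> inflightEcho = ConcurrentHashMap.newKeySet();

    // Returns true only for the caller that actually inserted addr.
    // add() tests for presence and inserts in one atomic step.
    public boolean tryMarkInflight(String addr)
    {
        return inflightEcho.add(addr);
    }

    // Called once the echo has completed, so a later echo is allowed again.
    public void markDone(String addr)
    {
        inflightEcho.remove(addr);
    }

    // Starts `threads` workers that all try to claim the same address at once
    // and returns how many of them were allowed to proceed.
    public static int countWinners(int threads) throws InterruptedException
    {
        InflightEchoSketch sketch = new InflightEchoSketch();
        AtomicInteger winners = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++)
        {
            pool.execute(() -> {
                try
                {
                    start.await(); // line all workers up before racing
                }
                catch (InterruptedException e)
                {
                    Thread.currentThread().interrupt();
                    return;
                }
                if (sketch.tryMarkInflight("127.0.0.2"))
                    winners.incrementAndGet();
            });
        }
        start.countDown();
        pool.shutdown();
        if (!pool.awaitTermination(10, TimeUnit.SECONDS))
            throw new IllegalStateException("workers did not finish");
        return winners.get();
    }

    public static void main(String[] args) throws InterruptedException
    {
        System.out.println("winners = " + countWinners(8));
    }
}
```

Under these assumptions exactly one worker gets true from tryMarkInflight, however the threads interleave. With the contains()-then-add() form, several workers could pass the check before any of them inserts.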
and
{code:java}
@Override
public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason)
{
    MessagingService.instance().sendWithCallback(echoMessage, addr, this);
}
{code}
should be
{code:java}
@Override
public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason)
{
    logger.trace("Resending ECHO_REQ to {}", addr);
    Message<NoPayload> echoMessage = Message.out(ECHO_REQ, noPayload);
    MessagingService.instance().sendWithCallback(echoMessage, addr, this);
}
{code}
That is, the retry needs to construct a new message rather than send the same message again.
was (Author: cam1982):
Found some bugs:
{code:java}
if (inflightEcho.contains(addr))
{
    return;
}
inflightEcho.add(addr);
{code}
should be
{noformat}
if (!inflightEcho.add(addr))
{
    logger.info("Skip ECHO_REQ to {}", addr);
    return;
}
{noformat}
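Otherwise the separate contains() check and add() call form a data race: two threads can both pass the check before either adds the address, which allows multiple inflight echos.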
and
{code:java}
@Override
public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason)
{
    MessagingService.instance().sendWithCallback(echoMessage, addr, this);
}
{code}
should be
{code:java}
@Override
public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason)
{
    logger.trace("Resending ECHO_REQ to {}", addr);
    Message<NoPayload> echoMessage = Message.out(ECHO_REQ, noPayload);
    MessagingService.instance().sendWithCallback(echoMessage, addr, this);
}
{code}
That is, the retry needs to construct a new message rather than send the same message again.
> Node sends multiple inflight echos
> ----------------------------------
>
> Key: CASSANDRA-18866
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Cameron Zemek
> Priority: Normal
> Attachments: 18866-regression.patch, duplicates.log, echo.log
>
>
> CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular,
> 18845 included a change to allow only 1 inflight ECHO request at a time. As per
> 18854, some tests have an error rate due to this change. I am creating this
> ticket to discuss this further. The current state also does not have retry
> logic; it just allows multiple ECHO requests inflight at the same time, so it
> is less likely that all ECHOs will time out or get lost.
> With the change from 18845 applied, plus some extra logging to track what is
> going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO
> requests from a node and also see it retrying ECHOs when it doesn't get a
> reply.
> Therefore, I think the problem is more specific than the dropping of one ECHO
> request. Yes, there is no retry logic for failed ECHO requests, but this is
> the case both before and after 18845. ECHO requests are only sent via gossip
> verb handlers calling applyStateLocally. For these failed tests I am therefore
> assuming there are cases where it won't call markAlive when other nodes
> consider the node UP but it is marked DOWN by a node.