grom358 opened a new pull request, #4868:
URL: https://github.com/apache/cassandra/pull/4868

   In Gossiper, echoHandler only implements onResponse. 
RequestCallback.onFailure defaults to a no-op, so when the ECHO_REQ times out 
or the remote node returns an error, inflightEcho.remove(addr) is never called 
and the stale entry persists. Any subsequent markAlive(addr, localState) call, 
where localState is the same in-place-mutated object already stored in 
inflightEcho, sees localState.equals(prevState) return true (identity 
equality on the same reference) and skips sending a new echo, every time. 
In a temporary-partition scenario (node briefly unreachable, echo times out, 
node recovers with the same generation), the node can stay marked dead 
indefinitely: the failure detector sees it as alive and keeps triggering 
markAlive, but every invocation is suppressed by the stale entry. 
The stale entry is only cleared by removeEndpoint() (explicit removal) or by 
silentlyMarkDead() via markDead() (failure detector conviction), and neither 
fires while the failure detector is reporting the node as healthy.
   
   Fix: override onFailure in echoHandler to call inflightEcho.remove(addr).
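
   To illustrate the failure mode, here is a minimal, self-contained sketch. It is a simplified model, not the actual Gossiper code: the EchoSketch class, its Callback and State stand-ins, the pending field, and the clearOnFailure switch are all hypothetical names introduced for the example. It only assumes what is described above: markAlive skips when the same state object is already in inflightEcho, and onFailure is a no-op unless overridden.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of the echo bookkeeping; NOT the real Gossiper implementation.
class EchoSketch {
    /** Stand-in for RequestCallback: onFailure defaults to a no-op. */
    interface Callback {
        void onResponse();
        default void onFailure() {}
    }

    /** Stand-in for EndpointState; inherits identity-based Object.equals. */
    static final class State {}

    final Map<String, State> inflightEcho = new ConcurrentHashMap<>();
    final boolean clearOnFailure; // hypothetical switch: true models the proposed fix
    Callback pending;             // hypothetical handle to the most recently sent echo
    int echoesSent;
    boolean alive;

    EchoSketch(boolean clearOnFailure) {
        this.clearOnFailure = clearOnFailure;
    }

    void markAlive(String addr, State localState) {
        State prevState = inflightEcho.put(addr, localState);
        if (localState.equals(prevState)) {
            return; // an echo for this exact state is assumed to be in flight
        }
        echoesSent++;
        pending = new Callback() {
            @Override public void onResponse() {
                inflightEcho.remove(addr);
                alive = true;
            }
            @Override public void onFailure() {
                // Without the fix this behaves like the default no-op.
                if (clearOnFailure) {
                    inflightEcho.remove(addr);
                }
            }
        };
    }

    /** Replays: echo sent, echo times out, markAlive fires again with the same state. */
    static boolean recoversAfterTimeout(boolean clearOnFailure) {
        EchoSketch g = new EchoSketch(clearOnFailure);
        State s = new State();
        g.markAlive("10.0.0.1", s);
        g.pending.onFailure();       // first echo times out
        g.markAlive("10.0.0.1", s);  // failure detector reports the node healthy again
        if (g.echoesSent == 2) {
            g.pending.onResponse();  // a second echo was sent and is answered
        }
        return g.alive;
    }

    public static void main(String[] args) {
        System.out.println("without onFailure cleanup: recovers=" + recoversAfterTimeout(false));
        System.out.println("with onFailure cleanup: recovers=" + recoversAfterTimeout(true));
    }
}
```

   In this model the second markAlive call is suppressed unless the failure path removes the entry, which matches the behaviour the fix is meant to address.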


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

