Abhishek Pal created RATIS-2462:
-----------------------------------

             Summary: Leader transfer can fail when StartLeaderElection is 
resent with the same in-flight call id
                 Key: RATIS-2462
                 URL: https://issues.apache.org/jira/browse/RATIS-2462
             Project: Ratis
          Issue Type: Bug
          Components: election
            Reporter: Abhishek Pal
            Assignee: Abhishek Pal


*Problem*
- [TransferLeadership.sendStartLeaderElection()|https://github.com/apache/ratis/blob/a38a3d1d14c8b02452926281238cb3afbc2d9f61/ratis-server/src/main/java/org/apache/ratis/server/impl/TransferLeadership.java#L201]
  creates the [StartLeaderElectionRequestProto|https://github.com/apache/ratis/blob/a38a3d1d14c8b02452926281238cb3afbc2d9f61/ratis-server/src/main/java/org/apache/ratis/server/impl/TransferLeadership.java#L213]
  and can be called again later from [onFollowerAppendEntriesReply()|https://github.com/apache/ratis/blob/a38a3d1d14c8b02452926281238cb3afbc2d9f61/ratis-server/src/main/java/org/apache/ratis/server/impl/TransferLeadership.java#L249]
  for the same transferee.
- *ServerProtoUtils.toStartLeaderElectionRequestProto()* does not assign a 
unique call id to the server-generated request.
- *NettyRpcProxy.Connection.offer()*
  [stores in-flight requests by callId|https://github.com/apache/ratis/blob/a38a3d1d14c8b02452926281238cb3afbc2d9f61/ratis-netty/src/main/java/org/apache/ratis/netty/NettyRpcProxy.java#L223]
  and asserts that no request with the same key is already pending.

The *StartLeaderElectionRequestProto* is built without a unique call id, but 
the Netty RPC layer uses the call id as the in-flight request key. As a result, 
a second send to the same transferee can collide with the first one while it is 
still pending, which surfaces as a *TransferLeadershipException* to the caller. 
Leader transfer therefore fails intermittently, depending on whether the resend 
happens before the first request has completed.
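
The collision can be illustrated with a small standalone sketch. It is not the Ratis code: the class *InFlightCollisionSketch* and its methods are hypothetical stand-ins for a connection that keys pending requests by call id, and the assumption that an unset call id reads as 0 is based on the usual protobuf default for numeric fields.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a connection that tracks in-flight requests by call id. */
public class InFlightCollisionSketch {
  private final Map<Long, CompletableFuture<String>> replies = new ConcurrentHashMap<>();

  /** Registers a pending request; rejects it if the call id is already in flight. */
  CompletableFuture<String> offer(long callId) {
    final CompletableFuture<String> future = new CompletableFuture<>();
    final CompletableFuture<String> previous = replies.putIfAbsent(callId, future);
    if (previous != null) {
      throw new IllegalStateException("call id " + callId + " is already in flight");
    }
    return future;
  }

  /** Completes the pending request for the call id and removes it from the map. */
  void onReply(long callId, String reply) {
    final CompletableFuture<String> future = replies.remove(callId);
    if (future != null) {
      future.complete(reply);
    }
  }

  public static void main(String[] args) {
    final InFlightCollisionSketch connection = new InFlightCollisionSketch();
    // Assumption: a request built without a call id carries the default value 0.
    final long unsetCallId = 0L;

    connection.offer(unsetCallId);      // first StartLeaderElection, still pending
    try {
      connection.offer(unsetCallId);    // resend for the same transferee
    } catch (IllegalStateException e) {
      System.out.println("Second send rejected: " + e.getMessage());
    }
  }
}
```

If the first reply arrives before the resend, the key has already been removed and the second send goes through, which is consistent with the failure being timing dependent.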

*Proposal*
We should assign a unique call id to each server-generated 
*StartLeaderElectionRequestProto*. Alternatively, we could serialize the 
TransferLeadership requests, but this would decrease throughput.
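
A minimal sketch of the first option, assuming a process-wide counter is an acceptable source of call ids. The class *CallIdSketch* and its *next()* method are hypothetical names used only for illustration; the actual fix may use a different mechanism.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical source of unique call ids for server-generated requests. */
public class CallIdSketch {
  private static final AtomicLong COUNTER = new AtomicLong();

  /** Returns a call id that no earlier call in this process has returned. */
  static long next() {
    return COUNTER.incrementAndGet();
  }

  public static void main(String[] args) {
    final long firstSend = next();
    final long resend = next();
    // The resend gets its own key, so it cannot collide with the pending first send.
    System.out.println("distinct call ids: " + (firstSend != resend));
  }
}
```

With a fresh id per request, a resend from onFollowerAppendEntriesReply() would be tracked under its own key instead of colliding with the request that is still pending.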


This was discovered because "testTransferLeader" is flaky when run as part of 
the Netty RPC tests.





