Abhishek Pal created RATIS-2462:
-----------------------------------
Summary: Leader transfer can fail when StartLeaderElection is
resent with the same in-flight call id
Key: RATIS-2462
URL: https://issues.apache.org/jira/browse/RATIS-2462
Project: Ratis
Issue Type: Bug
Components: election
Reporter: Abhishek Pal
Assignee: Abhishek Pal
*Problem*
- [TransferLeadership.sendStartLeaderElection()|https://github.com/apache/ratis/blob/a38a3d1d14c8b02452926281238cb3afbc2d9f61/ratis-server/src/main/java/org/apache/ratis/server/impl/TransferLeadership.java#L201]
creates the
[StartLeaderElectionRequestProto|https://github.com/apache/ratis/blob/a38a3d1d14c8b02452926281238cb3afbc2d9f61/ratis-server/src/main/java/org/apache/ratis/server/impl/TransferLeadership.java#L213]
and can be called again later from
[onFollowerAppendEntriesReply()|https://github.com/apache/ratis/blob/a38a3d1d14c8b02452926281238cb3afbc2d9f61/ratis-server/src/main/java/org/apache/ratis/server/impl/TransferLeadership.java#L249]
for the same transferee.
- *ServerProtoUtils.toStartLeaderElectionRequestProto()* does not assign a
unique call id to the server-generated request.
- *NettyRpcProxy.Connection.offer()* [stores in-flight requests by
callId|https://github.com/apache/ratis/blob/a38a3d1d14c8b02452926281238cb3afbc2d9f61/ratis-netty/src/main/java/org/apache/ratis/netty/NettyRpcProxy.java#L223]
and asserts that no request with the same key is already pending.
The *StartLeaderElectionRequestProto* is built without a unique call id, but
Netty uses the call id as the in-flight request key. As a result, a second send
issued while the first is still pending can collide with it, which surfaces as a
*TransferLeadershipException* to the caller. Leader transfer therefore fails
intermittently, depending on whether the resend happens before the first reply
arrives.
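
The collision described above can be illustrated with a minimal sketch. It is not the Ratis code: *InFlightCollisionSketch*, *offer()* and *onReply()* are hypothetical stand-ins for a connection that keys pending requests by call id, and the value 0 for an unset call id is an assumption made only for the example.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a connection that tracks in-flight requests by call id. */
class InFlightCollisionSketch {
  private final Map<Long, CompletableFuture<String>> replies = new ConcurrentHashMap<>();

  /** Registers a pending request; fails if the same call id is already in flight. */
  CompletableFuture<String> offer(long callId) {
    final CompletableFuture<String> future = new CompletableFuture<>();
    final CompletableFuture<String> previous = replies.putIfAbsent(callId, future);
    if (previous != null) {
      throw new IllegalStateException("callId " + callId + " is already pending");
    }
    return future;
  }

  /** Completes and removes the pending request, as a reply handler would. */
  void onReply(long callId, String reply) {
    final CompletableFuture<String> future = replies.remove(callId);
    if (future != null) {
      future.complete(reply);
    }
  }

  public static void main(String[] args) {
    final InFlightCollisionSketch connection = new InFlightCollisionSketch();
    final long unsetCallId = 0L; // assumption: every request carries the same default id

    connection.offer(unsetCallId); // first send, reply not yet received
    try {
      connection.offer(unsetCallId); // resend for the same transferee
    } catch (IllegalStateException e) {
      System.out.println("collision: " + e.getMessage());
    }

    connection.onReply(unsetCallId, "ok"); // once the reply arrives, the id is free again
    connection.offer(unsetCallId);
  }
}
```

In this sketch the second *offer()* fails only because the first request has not been answered yet, which matches the timing-dependent nature of the failure.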
*Proposal*
We should assign a unique call id to each server-generated
*StartLeaderElectionRequestProto*. Alternatively, we could serialize the
TransferLeadership requests, but that would reduce throughput.
This was discovered while investigating "testTransferLeader", which is flaky
when run as part of the NettyRPC tests.
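
The first option in the proposal could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual patch: *UniqueCallIdSketch*, *nextCallId()*, *newRequest()* and the simplified *StartLeaderElectionRequest* class are hypothetical names, and a process-wide *AtomicLong* counter is just one possible source of unique ids.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: give every server-generated request its own call id. */
class UniqueCallIdSketch {
  // Process-wide counter; incrementAndGet() is atomic, so concurrent callers get distinct ids.
  private static final AtomicLong CALL_ID_COUNTER = new AtomicLong();

  static long nextCallId() {
    return CALL_ID_COUNTER.incrementAndGet();
  }

  /** Simplified stand-in for the request proto; only the fields needed here. */
  static final class StartLeaderElectionRequest {
    final long callId;
    final String transferee;

    StartLeaderElectionRequest(long callId, String transferee) {
      this.callId = callId;
      this.transferee = transferee;
    }
  }

  /** Builds a request with a fresh call id, even when the transferee is the same. */
  static StartLeaderElectionRequest newRequest(String transferee) {
    return new StartLeaderElectionRequest(nextCallId(), transferee);
  }

  public static void main(String[] args) {
    final StartLeaderElectionRequest first = newRequest("s1");
    final StartLeaderElectionRequest resend = newRequest("s1");
    System.out.println("first callId=" + first.callId + ", resend callId=" + resend.callId);
  }
}
```

Because each request gets a fresh id, a resend to the same transferee no longer shares an in-flight key with the earlier request, and the sends do not need to be serialized.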