[
https://issues.apache.org/jira/browse/RATIS-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tsz-wo Sze resolved RATIS-1490.
-------------------------------
Fix Version/s: 2.3.0
Resolution: Fixed
I have merged the pull request. Thanks, [~adoroszlai]!
> Follower should not change to candidate if already stopped
> ----------------------------------------------------------
>
> Key: RATIS-1490
> URL: https://issues.apache.org/jira/browse/RATIS-1490
> Project: Ratis
> Issue Type: Bug
> Components: server
> Reporter: Attila Doroszlai
> Assignee: Attila Doroszlai
> Priority: Major
> Fix For: 2.3.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> {{RaftServerImpl}} creates a new {{FollowerState}} for each vote request
> received. The old {{FollowerState}} may trigger change to candidate role
> after it was shutdown by the server. Both {{RaftServerImpl#requestVote}} and
> the change to candidate by {{FollowerState}} requires lock on {{server}}
> ({{RaftServerImpl}} object). The order of events which triggers the problem:
> # {{RaftServerImpl}} executing {{requestVote}} [with lock on
> itself|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1099]
> # {{FollowerState}} [checks conditions, then waits for lock on
> {{server}}|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/FollowerState.java#L126-L130]
> # {{RaftServerImpl}} [shuts down old {{FollowerState}}, starts new
> one|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L475-L477]
> # {{RaftServerImpl}} [sets last RPC time on new
> {{FollowerState}}|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1116-L1117]
> # {{RaftServerImpl}} [releases
> lock|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1127]
> # *old* {{FollowerState}} [enters block with lock on {{server}}, triggers
> change to
> candidate|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/FollowerState.java#L130-L138]
> Log from Ozone DN where this was seen:
> {{c7514957}} votes for {{18b95dfb}}:
> {noformat}
> 2022-01-07 01:24:07,792 [grpc-default-executor-0] INFO
> server.RaftServer$Division (RaftServerImpl.java:requestVote(1046)) -
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075: receive
> requestVote(ELECTION, 18b95dfb-8ef7-4011-a842-0dc8d0be8521,
> group-9EECFC4A6075, 2, (t:0, i:0))
> 2022-01-07 01:24:07,792 [grpc-default-executor-0] INFO impl.VoteContext
> (VoteContext.java:log(48)) -
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FOLLOWER: accept
> ELECTION from 18b95dfb-8ef7-4011-a842-0dc8d0be8521: our priority 0 <=
> candidate's priority 1
> 2022-01-07 01:24:07,793 [grpc-default-executor-0] INFO
> server.RaftServer$Division (RaftServerImpl.java:setRole(290)) -
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075: changes role from
> FOLLOWER to FOLLOWER at term 2 for
> candidate:18b95dfb-8ef7-4011-a842-0dc8d0be8521
> 2022-01-07 01:24:07,793 [grpc-default-executor-0] INFO impl.RoleInfo
> (RoleInfo.java:shutdownFollowerState(108)) -
> c7514957-7287-4cfe-8c11-55f780daee7f: shutdown
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState
> 2022-01-07 01:24:07,793 [grpc-default-executor-0] INFO impl.RoleInfo
> (RoleInfo.java:updateAndGet(130)) - c7514957-7287-4cfe-8c11-55f780daee7f:
> start c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState
> 2022-01-07 01:24:07,794 [grpc-default-executor-0] INFO
> server.RaftServer$Division (RaftServerImpl.java:requestVote(1078)) -
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075 replies to ELECTION
> vote request:
> 18b95dfb-8ef7-4011-a842-0dc8d0be8521<-c7514957-7287-4cfe-8c11-55f780daee7f#0:OK-t2.
> Peer's state: c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075:t2,
> leader=null, voted=18b95dfb-8ef7-4011-a842-0dc8d0be8521,
> raftlog=c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-SegmentedRaftLog:OPENED:c-1,
> conf=-1:
> [18b95dfb-8ef7-4011-a842-0dc8d0be8521|rpc:10.1.0.7:39307|priority:1,
> c7514957-7287-4cfe-8c11-55f780daee7f|rpc:10.1.0.7:42065|priority:0,
> baceac6d-bbfe-42b4-9bde-396690ef56d9|rpc:10.1.0.7:38573|priority:0], old=null
> {noformat}
> but just after 3ms starts new election due to RPC timeout:
> {noformat}
> 2022-01-07 01:24:07,795
> [c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState] INFO
> impl.FollowerState (FollowerState.java:run(134)) -
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState: change
> to CANDIDATE, lastRpcElapsedTime:5172276476ns, electionTimeout:5172ms
> 2022-01-07 01:24:07,796
> [c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState] INFO
> impl.RoleInfo (RoleInfo.java:shutdownFollowerState(108)) -
> c7514957-7287-4cfe-8c11-55f780daee7f: shutdown
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState
> 2022-01-07 01:24:07,796
> [c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState] INFO
> server.RaftServer$Division (RaftServerImpl.java:setRole(290)) -
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075: changes role from
> FOLLOWER to CANDIDATE at term 2 for changeToCandidate
> 2022-01-07 01:24:07,796
> [c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState] INFO
> impl.RoleInfo (RoleInfo.java:updateAndGet(130)) -
> c7514957-7287-4cfe-8c11-55f780daee7f: start
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-LeaderElection7
> {noformat}
> Full log:
> https://github.com/elek/ozone-build-results/blob/master/2022/01/07/12409/it-client/hadoop-ozone/integration-test/org.apache.hadoop.ozone.client.rpc.TestDeleteWithSlowFollower.txt
--
This message was sent by Atlassian Jira
(v8.20.1#820001)