[ 
https://issues.apache.org/jira/browse/RATIS-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-1490.
-------------------------------
    Fix Version/s: 2.3.0
       Resolution: Fixed

I have merged the pull request.  Thanks, [~adoroszlai]!

> Follower should not change to candidate if already stopped
> ----------------------------------------------------------
>
>                 Key: RATIS-1490
>                 URL: https://issues.apache.org/jira/browse/RATIS-1490
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>            Reporter: Attila Doroszlai
>            Assignee: Attila Doroszlai
>            Priority: Major
>             Fix For: 2.3.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {{RaftServerImpl}} creates a new {{FollowerState}} for each vote request 
> received.  The old {{FollowerState}} may trigger change to candidate role 
> after it was shutdown by the server.  Both {{RaftServerImpl#requestVote}} and 
> the change to candidate by {{FollowerState}} requires lock on {{server}} 
> ({{RaftServerImpl}} object).  The order of events which triggers the problem:
> # {{RaftServerImpl}} executing {{requestVote}} [with lock on 
> itself|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1099]
> # {{FollowerState}} [checks conditions, then waits for lock on 
> {{server}}|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/FollowerState.java#L126-L130]
> # {{RaftServerImpl}} [shuts down old {{FollowerState}}, starts new 
> one|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L475-L477]
> # {{RaftServerImpl}} [sets last RPC time on new 
> {{FollowerState}}|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1116-L1117]
> # {{RaftServerImpl}} [releases 
> lock|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1127]
> # *old* {{FollowerState}} [enters block with lock on {{server}}, triggers 
> change to 
> candidate|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/FollowerState.java#L130-L138]
> Log from Ozone DN where this was seen:
> {{c7514957}} votes for {{18b95dfb}}:
> {noformat}
> 2022-01-07 01:24:07,792 [grpc-default-executor-0] INFO  
> server.RaftServer$Division (RaftServerImpl.java:requestVote(1046)) - 
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075: receive 
> requestVote(ELECTION, 18b95dfb-8ef7-4011-a842-0dc8d0be8521, 
> group-9EECFC4A6075, 2, (t:0, i:0))
> 2022-01-07 01:24:07,792 [grpc-default-executor-0] INFO  impl.VoteContext 
> (VoteContext.java:log(48)) - 
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FOLLOWER: accept 
> ELECTION from 18b95dfb-8ef7-4011-a842-0dc8d0be8521: our priority 0 <= 
> candidate's priority 1
> 2022-01-07 01:24:07,793 [grpc-default-executor-0] INFO  
> server.RaftServer$Division (RaftServerImpl.java:setRole(290)) - 
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075: changes role from  
> FOLLOWER to FOLLOWER at term 2 for 
> candidate:18b95dfb-8ef7-4011-a842-0dc8d0be8521
> 2022-01-07 01:24:07,793 [grpc-default-executor-0] INFO  impl.RoleInfo 
> (RoleInfo.java:shutdownFollowerState(108)) - 
> c7514957-7287-4cfe-8c11-55f780daee7f: shutdown 
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState
> 2022-01-07 01:24:07,793 [grpc-default-executor-0] INFO  impl.RoleInfo 
> (RoleInfo.java:updateAndGet(130)) - c7514957-7287-4cfe-8c11-55f780daee7f: 
> start c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState
> 2022-01-07 01:24:07,794 [grpc-default-executor-0] INFO  
> server.RaftServer$Division (RaftServerImpl.java:requestVote(1078)) - 
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075 replies to ELECTION 
> vote request: 
> 18b95dfb-8ef7-4011-a842-0dc8d0be8521<-c7514957-7287-4cfe-8c11-55f780daee7f#0:OK-t2.
>  Peer's state: c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075:t2, 
> leader=null, voted=18b95dfb-8ef7-4011-a842-0dc8d0be8521, 
> raftlog=c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-SegmentedRaftLog:OPENED:c-1,
>  conf=-1: 
> [18b95dfb-8ef7-4011-a842-0dc8d0be8521|rpc:10.1.0.7:39307|priority:1, 
> c7514957-7287-4cfe-8c11-55f780daee7f|rpc:10.1.0.7:42065|priority:0, 
> baceac6d-bbfe-42b4-9bde-396690ef56d9|rpc:10.1.0.7:38573|priority:0], old=null
> {noformat}
> but just after 3ms starts new election due to RPC timeout:
> {noformat}
> 2022-01-07 01:24:07,795 
> [c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState] INFO  
> impl.FollowerState (FollowerState.java:run(134)) - 
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState: change 
> to CANDIDATE, lastRpcElapsedTime:5172276476ns, electionTimeout:5172ms
> 2022-01-07 01:24:07,796 
> [c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState] INFO  
> impl.RoleInfo (RoleInfo.java:shutdownFollowerState(108)) - 
> c7514957-7287-4cfe-8c11-55f780daee7f: shutdown 
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState
> 2022-01-07 01:24:07,796 
> [c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState] INFO  
> server.RaftServer$Division (RaftServerImpl.java:setRole(290)) - 
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075: changes role from  
> FOLLOWER to CANDIDATE at term 2 for changeToCandidate
> 2022-01-07 01:24:07,796 
> [c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState] INFO  
> impl.RoleInfo (RoleInfo.java:updateAndGet(130)) - 
> c7514957-7287-4cfe-8c11-55f780daee7f: start 
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-LeaderElection7
> {noformat}
> Full log: 
> https://github.com/elek/ozone-build-results/blob/master/2022/01/07/12409/it-client/hadoop-ozone/integration-test/org.apache.hadoop.ozone.client.rpc.TestDeleteWithSlowFollower.txt



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to