[
https://issues.apache.org/jira/browse/RATIS-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Attila Doroszlai updated RATIS-1490:
------------------------------------
Description:
{{RaftServerImpl}} creates a new {{FollowerState}} for each vote request
received. The old {{FollowerState}} may trigger change to candidate role after
it was shutdown by the server. Both {{RaftServerImpl#requestVote}} and the
change to candidate by {{FollowerState}} requires lock on {{server}}
({{RaftServerImpl}} object). The order of events which triggers the problem:
# {{RaftServerImpl}} executing {{requestVote}} [with lock on
itself|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1099]
# {{FollowerState}} [checks conditions, then waits for lock on
{{server}}|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/FollowerState.java#L126-L130]
# {{RaftServerImpl}} [shuts down old {{FollowerState}}, starts new
one|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L475-L477]
# {{RaftServerImpl}} [sets last RPC time on new
{{FollowerState}}|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1116-L1117]
# {{RaftServerImpl}} [releases
lock|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1127]
# *old* {{FollowerState}} [enters block with lock on {{server}}, triggers
change to
candidate|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/FollowerState.java#L130-L138]
Log from Ozone DN where this was seen:
{{c7514957}} votes for {{18b95dfb}}:
{noformat}
2022-01-07 01:24:07,792 [grpc-default-executor-0] INFO
server.RaftServer$Division (RaftServerImpl.java:requestVote(1046)) -
c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075: receive
requestVote(ELECTION, 18b95dfb-8ef7-4011-a842-0dc8d0be8521, group-9EECFC4A6075,
2, (t:0, i:0))
2022-01-07 01:24:07,792 [grpc-default-executor-0] INFO impl.VoteContext
(VoteContext.java:log(48)) -
c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FOLLOWER: accept
ELECTION from 18b95dfb-8ef7-4011-a842-0dc8d0be8521: our priority 0 <=
candidate's priority 1
2022-01-07 01:24:07,793 [grpc-default-executor-0] INFO
server.RaftServer$Division (RaftServerImpl.java:setRole(290)) -
c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075: changes role from
FOLLOWER to FOLLOWER at term 2 for
candidate:18b95dfb-8ef7-4011-a842-0dc8d0be8521
2022-01-07 01:24:07,793 [grpc-default-executor-0] INFO impl.RoleInfo
(RoleInfo.java:shutdownFollowerState(108)) -
c7514957-7287-4cfe-8c11-55f780daee7f: shutdown
c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState
2022-01-07 01:24:07,793 [grpc-default-executor-0] INFO impl.RoleInfo
(RoleInfo.java:updateAndGet(130)) - c7514957-7287-4cfe-8c11-55f780daee7f: start
c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState
2022-01-07 01:24:07,794 [grpc-default-executor-0] INFO
server.RaftServer$Division (RaftServerImpl.java:requestVote(1078)) -
c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075 replies to ELECTION
vote request:
18b95dfb-8ef7-4011-a842-0dc8d0be8521<-c7514957-7287-4cfe-8c11-55f780daee7f#0:OK-t2.
Peer's state: c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075:t2,
leader=null, voted=18b95dfb-8ef7-4011-a842-0dc8d0be8521,
raftlog=c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-SegmentedRaftLog:OPENED:c-1,
conf=-1: [18b95dfb-8ef7-4011-a842-0dc8d0be8521|rpc:10.1.0.7:39307|priority:1,
c7514957-7287-4cfe-8c11-55f780daee7f|rpc:10.1.0.7:42065|priority:0,
baceac6d-bbfe-42b4-9bde-396690ef56d9|rpc:10.1.0.7:38573|priority:0], old=null
{noformat}
but just after 3ms starts new election due to RPC timeout:
{noformat}
2022-01-07 01:24:07,795
[c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState] INFO
impl.FollowerState (FollowerState.java:run(134)) -
c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState: change
to CANDIDATE, lastRpcElapsedTime:5172276476ns, electionTimeout:5172ms
2022-01-07 01:24:07,796
[c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState] INFO
impl.RoleInfo (RoleInfo.java:shutdownFollowerState(108)) -
c7514957-7287-4cfe-8c11-55f780daee7f: shutdown
c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState
2022-01-07 01:24:07,796
[c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState] INFO
server.RaftServer$Division (RaftServerImpl.java:setRole(290)) -
c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075: changes role from
FOLLOWER to CANDIDATE at term 2 for changeToCandidate
2022-01-07 01:24:07,796
[c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState] INFO
impl.RoleInfo (RoleInfo.java:updateAndGet(130)) -
c7514957-7287-4cfe-8c11-55f780daee7f: start
c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-LeaderElection7
{noformat}
Full log:
https://github.com/elek/ozone-build-results/blob/master/2022/01/07/12409/it-client/hadoop-ozone/integration-test/org.apache.hadoop.ozone.client.rpc.TestDeleteWithSlowFollower.txt
was:
{{RaftServerImpl}} creates a new {{FollowerState}} for each vote request
received. The old {{FollowerState}} may trigger change to candidate role after
it was shutdown by the server. Both {{RaftServerImpl#requestVote}} and the
change to candidate by {{FollowerState}} requires lock on {{server}}
({{RaftServerImpl}} object). The order of events which triggers the problem:
# {{RaftServerImpl}} executing {{requestVote}} [with lock on
itself|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1099]
# {{FollowerState}} [checks conditions, then waits for lock on
{{server}}|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/FollowerState.java#L126-L130]
# {{RaftServerImpl}} [shuts down old {{FollowerState}}, starts new
one|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L475-L477]
# {{RaftServerImpl}} [sets last RPC time on new
{{FollowerState}}|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1116-L1117]
# {{RaftServerImpl}} [releases
lock|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1127]
# *old* {{FollowerState}} [enters block with lock on {{server}}, triggers
change to
candidate|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/FollowerState.java#L130-L138]
> Follower should not change to candidate if already stopped
> ----------------------------------------------------------
>
> Key: RATIS-1490
> URL: https://issues.apache.org/jira/browse/RATIS-1490
> Project: Ratis
> Issue Type: Bug
> Components: server
> Reporter: Attila Doroszlai
> Assignee: Attila Doroszlai
> Priority: Major
>
> {{RaftServerImpl}} creates a new {{FollowerState}} for each vote request
> received. The old {{FollowerState}} may trigger change to candidate role
> after it was shutdown by the server. Both {{RaftServerImpl#requestVote}} and
> the change to candidate by {{FollowerState}} requires lock on {{server}}
> ({{RaftServerImpl}} object). The order of events which triggers the problem:
> # {{RaftServerImpl}} executing {{requestVote}} [with lock on
> itself|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1099]
> # {{FollowerState}} [checks conditions, then waits for lock on
> {{server}}|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/FollowerState.java#L126-L130]
> # {{RaftServerImpl}} [shuts down old {{FollowerState}}, starts new
> one|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L475-L477]
> # {{RaftServerImpl}} [sets last RPC time on new
> {{FollowerState}}|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1116-L1117]
> # {{RaftServerImpl}} [releases
> lock|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1127]
> # *old* {{FollowerState}} [enters block with lock on {{server}}, triggers
> change to
> candidate|https://github.com/apache/ratis/blob/b57be797fe97c00abc0e3d7353d30e1b2961714d/ratis-server/src/main/java/org/apache/ratis/server/impl/FollowerState.java#L130-L138]
> Log from Ozone DN where this was seen:
> {{c7514957}} votes for {{18b95dfb}}:
> {noformat}
> 2022-01-07 01:24:07,792 [grpc-default-executor-0] INFO
> server.RaftServer$Division (RaftServerImpl.java:requestVote(1046)) -
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075: receive
> requestVote(ELECTION, 18b95dfb-8ef7-4011-a842-0dc8d0be8521,
> group-9EECFC4A6075, 2, (t:0, i:0))
> 2022-01-07 01:24:07,792 [grpc-default-executor-0] INFO impl.VoteContext
> (VoteContext.java:log(48)) -
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FOLLOWER: accept
> ELECTION from 18b95dfb-8ef7-4011-a842-0dc8d0be8521: our priority 0 <=
> candidate's priority 1
> 2022-01-07 01:24:07,793 [grpc-default-executor-0] INFO
> server.RaftServer$Division (RaftServerImpl.java:setRole(290)) -
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075: changes role from
> FOLLOWER to FOLLOWER at term 2 for
> candidate:18b95dfb-8ef7-4011-a842-0dc8d0be8521
> 2022-01-07 01:24:07,793 [grpc-default-executor-0] INFO impl.RoleInfo
> (RoleInfo.java:shutdownFollowerState(108)) -
> c7514957-7287-4cfe-8c11-55f780daee7f: shutdown
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState
> 2022-01-07 01:24:07,793 [grpc-default-executor-0] INFO impl.RoleInfo
> (RoleInfo.java:updateAndGet(130)) - c7514957-7287-4cfe-8c11-55f780daee7f:
> start c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState
> 2022-01-07 01:24:07,794 [grpc-default-executor-0] INFO
> server.RaftServer$Division (RaftServerImpl.java:requestVote(1078)) -
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075 replies to ELECTION
> vote request:
> 18b95dfb-8ef7-4011-a842-0dc8d0be8521<-c7514957-7287-4cfe-8c11-55f780daee7f#0:OK-t2.
> Peer's state: c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075:t2,
> leader=null, voted=18b95dfb-8ef7-4011-a842-0dc8d0be8521,
> raftlog=c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-SegmentedRaftLog:OPENED:c-1,
> conf=-1:
> [18b95dfb-8ef7-4011-a842-0dc8d0be8521|rpc:10.1.0.7:39307|priority:1,
> c7514957-7287-4cfe-8c11-55f780daee7f|rpc:10.1.0.7:42065|priority:0,
> baceac6d-bbfe-42b4-9bde-396690ef56d9|rpc:10.1.0.7:38573|priority:0], old=null
> {noformat}
> but just after 3ms starts new election due to RPC timeout:
> {noformat}
> 2022-01-07 01:24:07,795
> [c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState] INFO
> impl.FollowerState (FollowerState.java:run(134)) -
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState: change
> to CANDIDATE, lastRpcElapsedTime:5172276476ns, electionTimeout:5172ms
> 2022-01-07 01:24:07,796
> [c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState] INFO
> impl.RoleInfo (RoleInfo.java:shutdownFollowerState(108)) -
> c7514957-7287-4cfe-8c11-55f780daee7f: shutdown
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState
> 2022-01-07 01:24:07,796
> [c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState] INFO
> server.RaftServer$Division (RaftServerImpl.java:setRole(290)) -
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075: changes role from
> FOLLOWER to CANDIDATE at term 2 for changeToCandidate
> 2022-01-07 01:24:07,796
> [c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState] INFO
> impl.RoleInfo (RoleInfo.java:updateAndGet(130)) -
> c7514957-7287-4cfe-8c11-55f780daee7f: start
> c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-LeaderElection7
> {noformat}
> Full log:
> https://github.com/elek/ozone-build-results/blob/master/2022/01/07/12409/it-client/hadoop-ozone/integration-test/org.apache.hadoop.ozone.client.rpc.TestDeleteWithSlowFollower.txt
--
This message was sent by Atlassian Jira
(v8.20.1#820001)