[ 
https://issues.apache.org/jira/browse/RATIS-788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Doroszlai updated RATIS-788:
-----------------------------------
    Attachment: 
org.apache.hadoop.ozone.scm.TestSCMContainerPlacementPolicyMetrics.txt
                
org.apache.hadoop.ozone.scm.TestSCMContainerPlacementPolicyMetrics-output.txt

> Server stuck due to exception while becoming leader
> ---------------------------------------------------
>
>                 Key: RATIS-788
>                 URL: https://issues.apache.org/jira/browse/RATIS-788
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>            Reporter: Attila Doroszlai
>            Priority: Major
>         Attachments: 
> org.apache.hadoop.ozone.scm.TestSCMContainerPlacementPolicyMetrics-output.txt,
>  org.apache.hadoop.ozone.scm.TestSCMContainerPlacementPolicyMetrics.txt
>
>
> It seems single-node Ratis can get stuck if exception happens while becoming 
> leader.  {{LeaderElection}} ignores the exception because it is already shut 
> down after successful vote.  I guess 3-node Ratis might be able to recover.
> {code}
> 2020-01-09 23:31:35,160 [Thread-95] INFO  impl.FollowerState 
> (FollowerState.java:run(108)) - 
> 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-FollowerState: change 
> to CANDIDATE, lastRpcTime:1117ms, electionTimeout:1103ms
> 2020-01-09 23:31:35,161 [Thread-95] INFO  impl.RoleInfo 
> (RoleInfo.java:shutdownFollowerState(121)) - 
> 6b60526e-eae6-4f33-854d-fa396187085c: shutdown FollowerState
> 2020-01-09 23:31:35,161 [Thread-95] INFO  impl.RaftServerImpl 
> (RaftServerImpl.java:setRole(173)) - 
> 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E: changes role from  
> FOLLOWER to CANDIDATE at term 0 for changeToCandidate
> 2020-01-09 23:31:35,165 [Thread-95] INFO  impl.RoleInfo 
> (RoleInfo.java:updateAndGet(143)) - 6b60526e-eae6-4f33-854d-fa396187085c: 
> start LeaderElection
> 2020-01-09 23:31:35,176 
> [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] 
> INFO  impl.LeaderElection (LeaderElection.java:askForVotes(206)) - 
> 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1: 
> begin an election at term 1 for -1: 
> [6b60526e-eae6-4f33-854d-fa396187085c:localhost:9872], old=null
> 2020-01-09 23:31:35,177 
> [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] 
> INFO  impl.RoleInfo (RoleInfo.java:shutdownLeaderElection(134)) - 
> 6b60526e-eae6-4f33-854d-fa396187085c: shutdown LeaderElection
> 2020-01-09 23:31:35,178 
> [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] 
> INFO  impl.RaftServerImpl (RaftServerImpl.java:setRole(173)) - 
> 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E: changes role from 
> CANDIDATE to LEADER at term 1 for changeToLeader
> 2020-01-09 23:31:35,178 
> [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] 
> INFO  impl.RaftServerImpl (ServerState.java:setLeader(255)) - 
> 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E: change Leader from 
> null to 6b60526e-eae6-4f33-854d-fa396187085c at term 1 for becomeLeader, 
> leader elected after 1269ms
> 2020-01-09 23:31:35,183 
> [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] 
> INFO  server.RaftServerConfigKeys (ConfUtils.java:logGet(43)) - 
> raft.server.staging.catchup.gap = 1000 (default)
> 2020-01-09 23:31:35,185 
> [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] 
> INFO  server.RaftServerConfigKeys (ConfUtils.java:logGet(43)) - 
> raft.server.rpc.sleep.time = 25ms (default)
> 2020-01-09 23:31:35,217 
> [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] 
> INFO  impl.LeaderElection (LeaderElection.java:run(165)) - 
> 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1: 
> ConcurrentModificationException is safely ignored since this is already 
> CLOSING
> java.util.ConcurrentModificationException
>       at java.util.ArrayList.forEach(ArrayList.java:1260)
>       at 
> org.apache.ratis.metrics.impl.MetricRegistriesImpl.lambda$create$1(MetricRegistriesImpl.java:66)
>       at 
> org.apache.ratis.metrics.impl.RefCountingMap.lambda$put$0(RefCountingMap.java:51)
>       at 
> java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
>       at 
> org.apache.ratis.metrics.impl.RefCountingMap.put(RefCountingMap.java:46)
>       at 
> org.apache.ratis.metrics.impl.MetricRegistriesImpl.create(MetricRegistriesImpl.java:59)
>       at 
> org.apache.ratis.server.metrics.RatisMetrics.create(RatisMetrics.java:45)
>       at 
> org.apache.ratis.server.metrics.RatisMetrics.getMetricRegistryForLogAppender(RatisMetrics.java:82)
>       at 
> org.apache.ratis.server.metrics.LogAppenderMetrics.<init>(LogAppenderMetrics.java:32)
>       at org.apache.ratis.server.impl.LeaderState.<init>(LeaderState.java:221)
>       at 
> org.apache.ratis.server.impl.RoleInfo.startLeaderState(RoleInfo.java:94)
>       at 
> org.apache.ratis.server.impl.RaftServerImpl.changeToLeader(RaftServerImpl.java:348)
>       at 
> org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:238)
>       at 
> org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:161)
>       at java.lang.Thread.run(Thread.java:748)
> ...
> 2020-01-09 23:31:48,567 ... 
> 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E is in LEADER state 
> but not ready yet.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to