[
https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406433#comment-16406433
]
Lokesh Jain commented on RATIS-209:
-----------------------------------
I think the issue is in Ratis. The appendEntries are async, therefore the
follower can append the entries in future and return the next index. In such a
case the leader will see that the entry has been added to the log for majority
of the followers and will commit it. But the follower may not have yet written
the state machine data, which leads to the null pointer exception.
I think we need to make sure that the state machine data has been written
before moving on to applyTransaction in state machine.
> StateMachine updater may miss writeLog Request after a new Leader is chosen.
> ----------------------------------------------------------------------------
>
> Key: RATIS-209
> URL: https://issues.apache.org/jira/browse/RATIS-209
> Project: Ratis
> Issue Type: Bug
> Affects Versions: 0.2.0-alpha
> Reporter: Mukul Kumar Singh
> Assignee: Mukul Kumar Singh
> Priority: Major
>
> This issue is happening on ozone for write chunk request.
> 1) currently write chunk request is processed in two phases, in the first
> phase the user data is written to the follower as part of
> {{writeStateMachineData}} and then the entry is committed to the follower as
> part of {{commit}}.
> 2) The issue which is hit right now is the case where a)
> {{writeStateMachineData}} didn't happen for a particular chunk however b) the
> commit entry is still processed. this leads to a case where a corresponding
> stateMachineFuture is not present in the hashmap.
> {code}
> 2018-02-12 00:26:30,097 INFO org.apache.ratis.server.impl.FollowerState:
> 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:1773,
> electionTimeout:873ms
> 2018-02-12 00:26:30,098 INFO org.apache.ratis.server.impl.RaftServerImpl:
> 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 3 for
> changeToCandidate
> 2018-02-12 00:26:30,100 INFO org.apache.ratis.server.impl.RaftServerImpl:
> 172.26.32.228_9858: change Leader from 172.26.32.232_9858 to null at term 3
> for initElection
> 2018-02-12 00:26:32,869 INFO org.apache.ratis.server.impl.LeaderElection:
> 172.26.32.228_9858: begin an election in Term 4
> 2018-02-12 00:26:32,901 INFO
> org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.228_9858:
> appendEntries completed
> 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.LeaderElection:
> 172.26.32.228_9858: Election REJECTED; received 2 response(s)
> [172.26.32.228_9858<-172.26.32.230_9858#0:FAIL-t4,
> 172.26.32.228_9858<-172.26.32.232_9858#0:FAIL-t4] and 0 exception(s);
> 172.26.32.228_9858:t4, leader=null, voted=172.26.32.228_9858, raftlog=[(t:3,
> i:10711)], conf=[172.26.32.228_9858:172.26.32.228:9858,
> 172.26.32.230_9858:172.26.32.230:9858,
> 172.26.32.232_9858:172.26.32.232:9858], old=null
> 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.RaftServerImpl:
> 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 4 for
> changeToFollower
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.FollowerState:
> 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:5624,
> electionTimeout:975ms
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl:
> 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 5 for
> changeToCandidate
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl:
> 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 5 for
> changeToFollower
> 2018-02-12 00:26:39,520 INFO org.apache.ratis.server.impl.RaftServerImpl:
> 172.26.32.228_9858: change Leader from null to 172.26.32.232_9858 at term 5
> for appendEntries
> {code}
> {code}
> 2018-02-12 00:31:12,400 ERROR
> org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit
> status 2: StateMachineUpdater-172.26.32.228_9858: the StateMachineUpdater
> hits Throwable
> java.lang.NullPointerException
> at
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:254)
> at
> org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1001)
> at
> org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:151)
> at java.lang.Thread.run(Thread.java:745)
> 2018-02-12 00:31:12,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down DataNode at
> y128.l42scl.hortonworks.com/172.26.32.228
> ************************************************************/
> *** shutting down gRPC server since JVM is shutting down
> *** server shut down
> ***
> {code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)