[ 
https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411840#comment-16411840
 ] 

Mukul Kumar Singh commented on RATIS-209:
-----------------------------------------

The reason for this bug is following, Assume a 3 node cluster as in the case of 
Ozone.

1) Leader accepts a request and forwards it to the followers.
2) one of the follower, is not able to add the request completely. Please note 
that the requests are sent through appendEntriesAsync, each individual entry is 
now inserted through SegmentedRaftLog#appendEntry. 
3) SegmentedRaftLog#appendEntry first inserts entry to the cache and then to 
the statemachine. However because of any error, if the write to stateMachine 
data fails then the there will be a spurious entry in the cache.
4) Follower now receives a commit info from the leader about the entry it 
missed in step(3). As the cache entry is present, the statemachine updater 
process the request even when the statemachine write was not successful.

Adding logs from a failed instance,
Leader adds the request to its log
{code}
2018-03-23 12:51:18,583 INFO  storage.RaftLog 
(SegmentedRaftLog.java:appendEntry(275)) - added log entry for index 158 
12:51:18,598 INFO  ratis.ContainerStateMachine 
(ContainerStateMachine.java:handleWriteChunk(207)) - adding entry:158 
{code}

Follower receive the request
{code}
2018-03-23 12:51:19,294 INFO  storage.RaftLog 
(SegmentedRaftLog.java:appendEntry(275)) - added log entry for index 158
2018-03-23 12:51:19,498 INFO  storage.RaftLog 
(SegmentedRaftLog.java:appendEntry(275)) - added log entry for index 158
{code}

One follower succeeds in adding entry to state machine while the other fails.
{code}
2018-03-23 12:51:19,895 INFO  ratis.ContainerStateMachine (ContainerSt
ateMachine.java:handleWriteChunk(207)) - adding entry:158
2018-03-23 12:51:20,219 WARN  impl.LogAppender 
(GRpcLogAppender.java:onError(226)) - 127.0.0.1_63413 got error when appending 
entries to 127.0.0.1_63403, exception: {}.
java.io.IOException: org.apache.ratis.shaded.io.grpc.StatusRuntimeException: 
INTERNAL: Java heap space
        at 
org.apache.ratis.grpc.RaftGrpcUtil.unwrapException(RaftGrpcUtil.java:65)
        at 
org.apache.ratis.grpc.RaftGrpcUtil.unwrapThrowable(RaftGrpcUtil.java:48)
        at 
org.apache.ratis.grpc.server.GRpcLogAppender$AppendLogResponseHandler.onError(GRpcLogAppender.java:227)
        at 
org.apache.ratis.shaded.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:395)
        at 
org.apache.ratis.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:481)
        at 
org.apache.ratis.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:398)
        at 
org.apache.ratis.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:513)
        at 
org.apache.ratis.shaded.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:52)
        at 
org.apache.ratis.shaded.io.grpc.internal.SerializingExecutor$TaskRunner.run(SerializingExecutor.java:154)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.ratis.shaded.io.grpc.StatusRuntimeException: INTERNAL: 
Java heap space
        at 
org.apache.ratis.shaded.io.grpc.Status.asRuntimeException(Status.java:545)
        ... 9 more
{code}

Leader now receives failure info.
{code}
2018-03-23 12:51:20,220 INFO  impl.LogAppender 
(GRpcLogAppender.java:onError(247)) - 127.0.0.1_63413 failed 
peer:127.0.0.1_63403 next index:158
2018-03-23 12:51:20,264 INFO  storage.RaftLog 
(RaftLog.java:updateLastCommitted(92)) - 127.0.0.1_63413: Updating 
lastCommitted to 158
{code}

Failed follower not trying to commit the transaction on receiving the commit 
info from leader. This fails with NullPointerException
{code}
2018-03-23 12:51:20,966 INFO  ratis.ContainerStateMachine 
(ContainerStateMachine.java:applyTransaction(263)) - removing entry:158
2018-03-23 12:51:20,972 ERROR impl.StateMachineUpdater 
(ExitUtils.java:terminate(86)) - Terminating with exit status 2: 
StateMachineUpdater-127.0.0.1_63403: the StateMachineUpdater hits Throwable

java.lang.NullPointerException
        at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:267)
        at 
org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1051)
        at 
org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:154)
        at java.lang.Thread.run(Thread.java:745)
*** shutting down gRPC server since JVM is shutting down
*** shutting down gRPC server since JVM is shutting down
{code}

> StateMachine updater may miss writeLog Request after a new Leader is chosen.
> ----------------------------------------------------------------------------
>
>                 Key: RATIS-209
>                 URL: https://issues.apache.org/jira/browse/RATIS-209
>             Project: Ratis
>          Issue Type: Bug
>    Affects Versions: 0.2.0-alpha
>            Reporter: Mukul Kumar Singh
>            Assignee: Mukul Kumar Singh
>            Priority: Major
>
> This issue is happening on ozone for write chunk request.
> 1) currently write chunk request is processed in two phases, in the first 
> phase the user data is written to the follower as part of 
> {{writeStateMachineData}} and then the entry is committed to the follower as 
> part of {{commit}}.
> 2) The issue which is hit right now is the case where a) 
> {{writeStateMachineData}} didn't happen for a particular chunk however b) the 
> commit entry is still processed. this leads to a case where a corresponding 
> stateMachineFuture is not present in the hashmap.
> {code}
> 2018-02-12 00:26:30,097 INFO org.apache.ratis.server.impl.FollowerState: 
> 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:1773, 
> electionTimeout:873ms
> 2018-02-12 00:26:30,098 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 3 for 
> changeToCandidate
> 2018-02-12 00:26:30,100 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858: change Leader from 172.26.32.232_9858 to null at term 3 
> for initElection
> 2018-02-12 00:26:32,869 INFO org.apache.ratis.server.impl.LeaderElection: 
> 172.26.32.228_9858: begin an election in Term 4
> 2018-02-12 00:26:32,901 INFO 
> org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.228_9858: 
> appendEntries completed
> 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.LeaderElection: 
> 172.26.32.228_9858: Election REJECTED; received 2 response(s) 
> [172.26.32.228_9858<-172.26.32.230_9858#0:FAIL-t4, 
> 172.26.32.228_9858<-172.26.32.232_9858#0:FAIL-t4] and 0 exception(s); 
> 172.26.32.228_9858:t4, leader=null, voted=172.26.32.228_9858, raftlog=[(t:3, 
> i:10711)], conf=[172.26.32.228_9858:172.26.32.228:9858, 
> 172.26.32.230_9858:172.26.32.230:9858, 
> 172.26.32.232_9858:172.26.32.232:9858], old=null
> 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 4 for 
> changeToFollower
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.FollowerState: 
> 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:5624, 
> electionTimeout:975ms
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 5 for 
> changeToCandidate
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 5 for 
> changeToFollower
> 2018-02-12 00:26:39,520 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858: change Leader from null to 172.26.32.232_9858 at term 5 
> for appendEntries
> {code}
> {code}
> 2018-02-12 00:31:12,400 ERROR 
> org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit 
> status 2: StateMachineUpdater-172.26.32.228_9858: the StateMachineUpdater 
> hits Throwable
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:254)
>         at 
> org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1001)
>         at 
> org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:151)
>         at java.lang.Thread.run(Thread.java:745)
> 2018-02-12 00:31:12,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> SHUTDOWN_MSG: 
> /************************************************************
> SHUTDOWN_MSG: Shutting down DataNode at 
> y128.l42scl.hortonworks.com/172.26.32.228
> ************************************************************/
> *** shutting down gRPC server since JVM is shutting down
> *** server shut down
> ***
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to