[ 
https://issues.apache.org/jira/browse/RATIS-216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mukul Kumar Singh moved HDFS-13327 to RATIS-216:
------------------------------------------------

        Key: RATIS-216  (was: HDFS-13327)
    Project: Ratis  (was: Hadoop HDFS)

> Ratis should handle failure conditions where append request to followers can 
> fail
> ---------------------------------------------------------------------------------
>
>                 Key: RATIS-216
>                 URL: https://issues.apache.org/jira/browse/RATIS-216
>             Project: Ratis
>          Issue Type: Bug
>            Reporter: Mukul Kumar Singh
>            Assignee: Mukul Kumar Singh
>            Priority: Major
>
> This was observed on a 3 node ratis config with ozone, this was simulated 
> using TestFreon unit test.
> The problem is append to {{127.0.0.1_63765}} failed because the node ran out 
> of memory.
> A new leader is elected which tries to send the append requests to the same 
> follower, but the request is out of order.
> {code}
> 2018-03-21 22:05:01,495 INFO  ratis.ContainerStateMachine 
> (ContainerStateMachine.java:handleWriteChunk(207)) - adding  index 185
> 2018-03-21 22:05:02,054 WARN  impl.LogAppender 
> (GRpcLogAppender.java:onError(226)) - 127.0.0.1_63790 got error when 
> appending entries to 127.0.0.1_63765, exception: {}.
> java.io.IOException: org.apache.ratis.shaded.io.grpc.StatusRuntimeException: 
> INTERNAL: Java heap space
>         at 
> org.apache.ratis.grpc.RaftGrpcUtil.unwrapException(RaftGrpcUtil.java:65)
>         at 
> org.apache.ratis.grpc.RaftGrpcUtil.unwrapThrowable(RaftGrpcUtil.java:48)
>         at 
> org.apache.ratis.grpc.server.GRpcLogAppender$AppendLogResponseHandler.onError(GRpcLogAppender.java:227)
>         at 
> org.apache.ratis.shaded.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:395)
>         at 
> org.apache.ratis.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:481)
>         at 
> org.apache.ratis.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:398)
>         at 
> org.apache.ratis.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:513)
>         at 
> org.apache.ratis.shaded.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:52)
>         at 
> org.apache.ratis.shaded.io.grpc.internal.SerializingExecutor$TaskRunner.run(SerializingExecutor.java:154)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.ratis.shaded.io.grpc.StatusRuntimeException: INTERNAL: 
> Java heap space
>         at 
> org.apache.ratis.shaded.io.grpc.Status.asRuntimeException(Status.java:545)
>         ... 9 more
> 2018-03-21 22:05:02,088 INFO  ratis.ContainerStateMachine 
> (ContainerStateMachine.java:applyTransaction(263)) - removing index 185
> ^M 90.00% 
> |███████████████████████████████████████████████████████████████████████████████████████████
>           |  18/20 Time: 0:01:282018-03-21 22:05:02,622 INFO  
> impl.FollowerState (FollowerState.java:r
> un(72)) - 127.0.0.1_63776 changes to CANDIDATE, lastRpcTime:534, 
> electionTimeout:414ms
> 2018-03-21 22:05:02,622 INFO  impl.RaftServerImpl 
> (RaftServerImpl.java:setRole(159)) - 127.0.0.1_63776 changes role from 
> FOLLOWER to CANDIDATE at term 31 for changeToCandidate
> 2018-03-21 22:05:02,622 INFO  impl.FollowerState (FollowerState.java:run(72)) 
> - 127.0.0.1_63765 changes to CANDIDATE, lastRpcTime:576, electionTimeout:411ms
> 2018-03-21 22:05:02,623 INFO  impl.RaftServerImpl 
> (RaftServerImpl.java:setRole(159)) - 127.0.0.1_63765 changes role from 
> FOLLOWER to CANDIDATE at term 31 for changeToCandidate
> 2018-03-21 22:05:02,623 INFO  impl.RaftServerImpl 
> (ServerState.java:setLeader(215)) - 127.0.0.1_63776: change Leader from 
> 127.0.0.1_63790 to null at term 31 for initElection
> 2018-03-21 22:05:02,628 INFO  impl.RaftServerImpl 
> (ServerState.java:setLeader(215)) - 127.0.0.1_63765: change Leader from 
> 127.0.0.1_63790 to null at term 31 for initElection
> 2018-03-21 22:05:02,634 INFO  impl.LeaderElection 
> (LeaderElection.java:askForVotes(127)) - 127.0.0.1_63765: begin an election 
> in Term 32
> 2018-03-21 22:05:02,634 INFO  impl.LeaderElection 
> (LeaderElection.java:askForVotes(127)) - 127.0.0.1_63776: begin an election 
> in Term 32
> 2018-03-21 22:05:02,635 INFO  impl.RaftServerImpl 
> (ServerState.java:setLeader(215)) - 127.0.0.1_63790: change Leader from 
> 127.0.0.1_63790 to null at term 32 for updateCurrentTerm
> 2018-03-21 22:05:02,635 INFO  impl.RaftServerImpl 
> (ServerState.java:setLeader(215)) - 127.0.0.1_63790: change Leader from 
> 127.0.0.1_63790 to null at term 32 for updateCurrentTerm
> 2018-03-21 22:05:02,636 INFO  impl.RaftServerImpl 
> (RaftServerImpl.java:setRole(159)) - 127.0.0.1_63790 changes role from LEADER 
> to FOLLOWER at term 32 for changeToFollower
> 2018-03-21 22:05:02,636 INFO  impl.PendingRequests 
> (PendingRequests.java:sendNotLeaderResponses(147)) - 127.0.0.1_63790 sends 
> responses before shutting down PendingRequestsHandler
> 2018-03-21 22:05:02,637 INFO  server.RaftServerProtocolService 
> (RaftServerProtocolService.java:onCompleted(104)) - 127.0.0.1_63776: 
> appendEntries completed
> 2018-03-21 22:05:02,640 INFO  impl.LogAppender 
> (GRpcLogAppender.java:onCompleted(254)) - 127.0.0.1_63790 stops appending log 
> entries to follower 127.0.0.1_63776(next=186, match=185, attendVote=true, 
> lastR
> pcSendTime=86068ms, lastRpcResponseTime=86080ms)
> 2018-03-21 22:05:02,646 INFO  impl.LeaderElection 
> (LeaderElection.java:logAndReturn(46)) - 127.0.0.1_63776: Election PASSED; 
> received 2 response(s) [127.0.0.1_63776<-127.0.0.1_63765#0:FAIL-t32, 
> 127.0.0.1_
> 63776<-127.0.0.1_63790#0:OK-t32] and 0 exception(s); 127.0.0.1_63776:t32, 
> leader=null, voted=127.0.0.1_63776, raftlog=[(t:31, i:185)], 
> conf=[127.0.0.1_63765:127.0.0.1:63765, 127.0.0.1_63776:127.0.0.1:6377
> 6, 127.0.0.1_63790:127.0.0.1:63790], old=null
> 2018-03-21 22:05:02,647 INFO  impl.RaftServerImpl 
> (RaftServerImpl.java:setRole(159)) - 127.0.0.1_63776 changes role from 
> CANDIDATE to LEADER at term 32 for changeToLeader
> 2018-03-21 22:05:02,647 INFO  impl.RaftServerImpl 
> (ServerState.java:setLeader(215)) - 127.0.0.1_63776: change Leader from null 
> to 127.0.0.1_63776 at term 32 for becomeLeader
> 2018-03-21 22:05:02,647 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
> raft.server.staging.catchup.gap = 1000 (default)
> 2018-03-21 22:05:02,647 INFO  impl.LeaderElection 
> (LeaderElection.java:logAndReturn(46)) - 127.0.0.1_63765: Election REJECTED; 
> received 2 response(s) [127.0.0.1_63765<-127.0.0.1_63776#0:FAIL-t32, 
> 127.0.0.1_63765<-127.0.0.1_63790#0:FAIL-t32] and 0 exception(s); 
> 127.0.0.1_63765:t32, leader=null, voted=127.0.0.1_63765, raftlog=[(t:31, 
> i:185)], conf=[127.0.0.1_63765:127.0.0.1:63765, 
> 127.0.0.1_63776:127.0.0.1:63776, 127.0.0.1_63790:127.0.0.1:63790], old=null
> 2018-03-21 22:05:02,647 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
> raft.server.rpc.sleep.time = 25 ms (default)
> 2018-03-21 22:05:02,650 INFO  impl.RaftServerImpl 
> (RaftServerImpl.java:setRole(159)) - 127.0.0.1_63765 changes role from 
> CANDIDATE to FOLLOWER at term 32 for changeToFollower
> 2018-03-21 22:05:02,650 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
> raft.server.log.appender.buffer.capacity = 134217728 (custom)
> 2018-03-21 22:05:02,650 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
> raft.server.log.appender.batch.enabled = true (custom)
> 2018-03-21 22:05:02,652 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
> raft.server.log.appender.snapshot.chunk.size.max = 16MB (=16777216) (default)
> 2018-03-21 22:05:02,653 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
> raft.grpc.server.leader.outstanding.appends.max = 128 (default)
> 2018-03-21 22:05:02,653 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
> raft.server.log.appender.buffer.capacity = 134217728 (custom)
> 2018-03-21 22:05:02,653 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
> raft.server.log.appender.batch.enabled = true (custom)
> 2018-03-21 22:05:02,653 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
> raft.server.log.appender.snapshot.chunk.size.max = 16MB (=16777216) (default)
> 2018-03-21 22:05:02,653 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
> raft.grpc.server.leader.outstanding.appends.max = 128 (default)
> 2018-03-21 22:05:02,654 INFO  storage.RaftLogWorker 
> (RaftLogWorker.java:rollLogSegment(256)) - Rolling 
> segment:127.0.0.1_63776-RaftLogWorker index to:186
> 2018-03-21 22:05:02,673 INFO  impl.RaftServerImpl 
> (ServerState.java:setLeader(215)) - 127.0.0.1_63765: change Leader from null 
> to 127.0.0.1_63776 at term 32 for appendEntries
> 2018-03-21 22:05:02,673 INFO  impl.RaftServerImpl 
> (ServerState.java:setLeader(215)) - 127.0.0.1_63790: change Leader from null 
> to 127.0.0.1_63776 at term 32 for appendEntries
> 2018-03-21 22:05:02,674 INFO  storage.RaftLogWorker 
> (RaftLogWorker.java:rollLogSegment(256)) - Rolling 
> segment:127.0.0.1_63790-RaftLogWorker index to:186
> 2018-03-21 22:05:02,674 INFO  storage.RaftLogWorker 
> (RaftLogWorker.java:rollLogSegment(256)) - Rolling 
> segment:127.0.0.1_63765-RaftLogWorker index to:186
> 2018-03-21 22:05:02,712 INFO  server.RaftServerProtocolService 
> (RaftServerProtocolService.java:onCompleted(104)) - 127.0.0.1_63765: 
> appendEntries completed
> 2018-03-21 22:05:02,713 INFO  impl.LogAppender 
> (GRpcLogAppender.java:onCompleted(254)) - 127.0.0.1_63790 stops appending log 
> entries to follower 127.0.0.1_63765(next=186, match=184, attendVote=true, 
> lastRpcSendTime=86081ms, lastRpcResponseTime=86157ms)
> Mar 21, 2018 10:05:02 PM 
> org.apache.ratis.shaded.io.grpc.internal.ManagedChannelImpl 
> maybeTerminateChannel
> INFO: [ManagedChannelImpl@643d117e] Terminated
> 2018-03-21 22:05:03,016 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
> raft.grpc.flow.control.window = 1MB (=1048576) (default)
> 2018-03-21 22:05:03,017 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
> raft.grpc.message.size.max = 33554432 (custom)
> Mar 21, 2018 10:05:03 PM 
> org.apache.ratis.shaded.io.grpc.internal.ManagedChannelImpl <init>
> INFO: [ManagedChannelImpl@4a23944d] Created with target 127.0.0.1:63776
> 2018-03-21 22:05:03,479 INFO  ratis.ContainerStateMachine 
> (ContainerStateMachine.java:applyTransaction(263)) - removing index 185
> 2018-03-21 22:05:03,562 ERROR storage.RaftLogWorker 
> (ExitUtils.java:terminate(86)) - Terminating with exit status 1: 
> 127.0.0.1_63765-RaftLogWorker failed.
> java.lang.IllegalStateException: lastWrittenIndex == 184, entry == term: 32
> index: 186
> noOp {
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to