[
https://issues.apache.org/jira/browse/RATIS-216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mukul Kumar Singh moved HDFS-13327 to RATIS-216:
------------------------------------------------
Key: RATIS-216 (was: HDFS-13327)
Project: Ratis (was: Hadoop HDFS)
> Ratis should handle failure conditions where append request to followers can
> fail
> ---------------------------------------------------------------------------------
>
> Key: RATIS-216
> URL: https://issues.apache.org/jira/browse/RATIS-216
> Project: Ratis
> Issue Type: Bug
> Reporter: Mukul Kumar Singh
> Assignee: Mukul Kumar Singh
> Priority: Major
>
> This was observed on a 3 node ratis config with ozone, this was simulated
> using TestFreon unit test.
> The problem is append to {{127.0.0.1_63765}} failed because the node ran out
> of memory.
> A new leader is elected which tries to send the append requests to the same
> follower, but the request is out of order.
> {code}
> 2018-03-21 22:05:01,495 INFO ratis.ContainerStateMachine
> (ContainerStateMachine.java:handleWriteChunk(207)) - adding index 185
> 2018-03-21 22:05:02,054 WARN impl.LogAppender
> (GRpcLogAppender.java:onError(226)) - 127.0.0.1_63790 got error when
> appending entries to 127.0.0.1_63765, exception: {}.
> java.io.IOException: org.apache.ratis.shaded.io.grpc.StatusRuntimeException:
> INTERNAL: Java heap space
> at
> org.apache.ratis.grpc.RaftGrpcUtil.unwrapException(RaftGrpcUtil.java:65)
> at
> org.apache.ratis.grpc.RaftGrpcUtil.unwrapThrowable(RaftGrpcUtil.java:48)
> at
> org.apache.ratis.grpc.server.GRpcLogAppender$AppendLogResponseHandler.onError(GRpcLogAppender.java:227)
> at
> org.apache.ratis.shaded.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:395)
> at
> org.apache.ratis.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:481)
> at
> org.apache.ratis.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:398)
> at
> org.apache.ratis.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:513)
> at
> org.apache.ratis.shaded.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:52)
> at
> org.apache.ratis.shaded.io.grpc.internal.SerializingExecutor$TaskRunner.run(SerializingExecutor.java:154)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.ratis.shaded.io.grpc.StatusRuntimeException: INTERNAL:
> Java heap space
> at
> org.apache.ratis.shaded.io.grpc.Status.asRuntimeException(Status.java:545)
> ... 9 more
> 2018-03-21 22:05:02,088 INFO ratis.ContainerStateMachine
> (ContainerStateMachine.java:applyTransaction(263)) - removing index 185
> ^M 90.00%
> |███████████████████████████████████████████████████████████████████████████████████████████
> | 18/20 Time: 0:01:282018-03-21 22:05:02,622 INFO
> impl.FollowerState (FollowerState.java:r
> un(72)) - 127.0.0.1_63776 changes to CANDIDATE, lastRpcTime:534,
> electionTimeout:414ms
> 2018-03-21 22:05:02,622 INFO impl.RaftServerImpl
> (RaftServerImpl.java:setRole(159)) - 127.0.0.1_63776 changes role from
> FOLLOWER to CANDIDATE at term 31 for changeToCandidate
> 2018-03-21 22:05:02,622 INFO impl.FollowerState (FollowerState.java:run(72))
> - 127.0.0.1_63765 changes to CANDIDATE, lastRpcTime:576, electionTimeout:411ms
> 2018-03-21 22:05:02,623 INFO impl.RaftServerImpl
> (RaftServerImpl.java:setRole(159)) - 127.0.0.1_63765 changes role from
> FOLLOWER to CANDIDATE at term 31 for changeToCandidate
> 2018-03-21 22:05:02,623 INFO impl.RaftServerImpl
> (ServerState.java:setLeader(215)) - 127.0.0.1_63776: change Leader from
> 127.0.0.1_63790 to null at term 31 for initElection
> 2018-03-21 22:05:02,628 INFO impl.RaftServerImpl
> (ServerState.java:setLeader(215)) - 127.0.0.1_63765: change Leader from
> 127.0.0.1_63790 to null at term 31 for initElection
> 2018-03-21 22:05:02,634 INFO impl.LeaderElection
> (LeaderElection.java:askForVotes(127)) - 127.0.0.1_63765: begin an election
> in Term 32
> 2018-03-21 22:05:02,634 INFO impl.LeaderElection
> (LeaderElection.java:askForVotes(127)) - 127.0.0.1_63776: begin an election
> in Term 32
> 2018-03-21 22:05:02,635 INFO impl.RaftServerImpl
> (ServerState.java:setLeader(215)) - 127.0.0.1_63790: change Leader from
> 127.0.0.1_63790 to null at term 32 for updateCurrentTerm
> 2018-03-21 22:05:02,635 INFO impl.RaftServerImpl
> (ServerState.java:setLeader(215)) - 127.0.0.1_63790: change Leader from
> 127.0.0.1_63790 to null at term 32 for updateCurrentTerm
> 2018-03-21 22:05:02,636 INFO impl.RaftServerImpl
> (RaftServerImpl.java:setRole(159)) - 127.0.0.1_63790 changes role from LEADER
> to FOLLOWER at term 32 for changeToFollower
> 2018-03-21 22:05:02,636 INFO impl.PendingRequests
> (PendingRequests.java:sendNotLeaderResponses(147)) - 127.0.0.1_63790 sends
> responses before shutting down PendingRequestsHandler
> 2018-03-21 22:05:02,637 INFO server.RaftServerProtocolService
> (RaftServerProtocolService.java:onCompleted(104)) - 127.0.0.1_63776:
> appendEntries completed
> 2018-03-21 22:05:02,640 INFO impl.LogAppender
> (GRpcLogAppender.java:onCompleted(254)) - 127.0.0.1_63790 stops appending log
> entries to follower 127.0.0.1_63776(next=186, match=185, attendVote=true,
> lastR
> pcSendTime=86068ms, lastRpcResponseTime=86080ms)
> 2018-03-21 22:05:02,646 INFO impl.LeaderElection
> (LeaderElection.java:logAndReturn(46)) - 127.0.0.1_63776: Election PASSED;
> received 2 response(s) [127.0.0.1_63776<-127.0.0.1_63765#0:FAIL-t32,
> 127.0.0.1_
> 63776<-127.0.0.1_63790#0:OK-t32] and 0 exception(s); 127.0.0.1_63776:t32,
> leader=null, voted=127.0.0.1_63776, raftlog=[(t:31, i:185)],
> conf=[127.0.0.1_63765:127.0.0.1:63765, 127.0.0.1_63776:127.0.0.1:6377
> 6, 127.0.0.1_63790:127.0.0.1:63790], old=null
> 2018-03-21 22:05:02,647 INFO impl.RaftServerImpl
> (RaftServerImpl.java:setRole(159)) - 127.0.0.1_63776 changes role from
> CANDIDATE to LEADER at term 32 for changeToLeader
> 2018-03-21 22:05:02,647 INFO impl.RaftServerImpl
> (ServerState.java:setLeader(215)) - 127.0.0.1_63776: change Leader from null
> to 127.0.0.1_63776 at term 32 for becomeLeader
> 2018-03-21 22:05:02,647 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) -
> raft.server.staging.catchup.gap = 1000 (default)
> 2018-03-21 22:05:02,647 INFO impl.LeaderElection
> (LeaderElection.java:logAndReturn(46)) - 127.0.0.1_63765: Election REJECTED;
> received 2 response(s) [127.0.0.1_63765<-127.0.0.1_63776#0:FAIL-t32,
> 127.0.0.1_63765<-127.0.0.1_63790#0:FAIL-t32] and 0 exception(s);
> 127.0.0.1_63765:t32, leader=null, voted=127.0.0.1_63765, raftlog=[(t:31,
> i:185)], conf=[127.0.0.1_63765:127.0.0.1:63765,
> 127.0.0.1_63776:127.0.0.1:63776, 127.0.0.1_63790:127.0.0.1:63790], old=null
> 2018-03-21 22:05:02,647 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) -
> raft.server.rpc.sleep.time = 25 ms (default)
> 2018-03-21 22:05:02,650 INFO impl.RaftServerImpl
> (RaftServerImpl.java:setRole(159)) - 127.0.0.1_63765 changes role from
> CANDIDATE to FOLLOWER at term 32 for changeToFollower
> 2018-03-21 22:05:02,650 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) -
> raft.server.log.appender.buffer.capacity = 134217728 (custom)
> 2018-03-21 22:05:02,650 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) -
> raft.server.log.appender.batch.enabled = true (custom)
> 2018-03-21 22:05:02,652 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) -
> raft.server.log.appender.snapshot.chunk.size.max = 16MB (=16777216) (default)
> 2018-03-21 22:05:02,653 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) -
> raft.grpc.server.leader.outstanding.appends.max = 128 (default)
> 2018-03-21 22:05:02,653 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) -
> raft.server.log.appender.buffer.capacity = 134217728 (custom)
> 2018-03-21 22:05:02,653 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) -
> raft.server.log.appender.batch.enabled = true (custom)
> 2018-03-21 22:05:02,653 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) -
> raft.server.log.appender.snapshot.chunk.size.max = 16MB (=16777216) (default)
> 2018-03-21 22:05:02,653 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) -
> raft.grpc.server.leader.outstanding.appends.max = 128 (default)
> 2018-03-21 22:05:02,654 INFO storage.RaftLogWorker
> (RaftLogWorker.java:rollLogSegment(256)) - Rolling
> segment:127.0.0.1_63776-RaftLogWorker index to:186
> 2018-03-21 22:05:02,673 INFO impl.RaftServerImpl
> (ServerState.java:setLeader(215)) - 127.0.0.1_63765: change Leader from null
> to 127.0.0.1_63776 at term 32 for appendEntries
> 2018-03-21 22:05:02,673 INFO impl.RaftServerImpl
> (ServerState.java:setLeader(215)) - 127.0.0.1_63790: change Leader from null
> to 127.0.0.1_63776 at term 32 for appendEntries
> 2018-03-21 22:05:02,674 INFO storage.RaftLogWorker
> (RaftLogWorker.java:rollLogSegment(256)) - Rolling
> segment:127.0.0.1_63790-RaftLogWorker index to:186
> 2018-03-21 22:05:02,674 INFO storage.RaftLogWorker
> (RaftLogWorker.java:rollLogSegment(256)) - Rolling
> segment:127.0.0.1_63765-RaftLogWorker index to:186
> 2018-03-21 22:05:02,712 INFO server.RaftServerProtocolService
> (RaftServerProtocolService.java:onCompleted(104)) - 127.0.0.1_63765:
> appendEntries completed
> 2018-03-21 22:05:02,713 INFO impl.LogAppender
> (GRpcLogAppender.java:onCompleted(254)) - 127.0.0.1_63790 stops appending log
> entries to follower 127.0.0.1_63765(next=186, match=184, attendVote=true,
> lastRpcSendTime=86081ms, lastRpcResponseTime=86157ms)
> Mar 21, 2018 10:05:02 PM
> org.apache.ratis.shaded.io.grpc.internal.ManagedChannelImpl
> maybeTerminateChannel
> INFO: [ManagedChannelImpl@643d117e] Terminated
> 2018-03-21 22:05:03,016 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) -
> raft.grpc.flow.control.window = 1MB (=1048576) (default)
> 2018-03-21 22:05:03,017 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) -
> raft.grpc.message.size.max = 33554432 (custom)
> Mar 21, 2018 10:05:03 PM
> org.apache.ratis.shaded.io.grpc.internal.ManagedChannelImpl <init>
> INFO: [ManagedChannelImpl@4a23944d] Created with target 127.0.0.1:63776
> 2018-03-21 22:05:03,479 INFO ratis.ContainerStateMachine
> (ContainerStateMachine.java:applyTransaction(263)) - removing index 185
> 2018-03-21 22:05:03,562 ERROR storage.RaftLogWorker
> (ExitUtils.java:terminate(86)) - Terminating with exit status 1:
> 127.0.0.1_63765-RaftLogWorker failed.
> java.lang.IllegalStateException: lastWrittenIndex == 184, entry == term: 32
> index: 186
> noOp {
> }
> {code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)