Nilotpal Nandi created HDDS-6900:
------------------------------------
Summary: SCM node went down with UndeclaredThrowableException while running container balancer
Key: HDDS-6900
URL: https://issues.apache.org/jira/browse/HDDS-6900
Project: Apache Ozone
Issue Type: Bug
Components: SCM HA
Reporter: Nilotpal Nandi
An SCM node went down with an UndeclaredThrowableException while the container balancer was running and 2 other SCM nodes were shut down.
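The log shows three peers in Ratis group group-0B75F4A309CF: the leader 99c85376-060f-4b3c-8973-a2d2b1dd23e6 and the followers cbdea5d3-682d-43e6-a17a-bdce757b7764 and b6382f07-de2e-4986-8275-9146e73360a6. Assuming that is the whole group, shutting down two SCMs leaves the leader unable to reach a Raft majority, so no new log entry can commit. The minimal sketch below only illustrates that majority arithmetic. The class and method names are hypothetical, and the group size of three is an assumption read off the log.

```java
// Hypothetical helper illustrating Raft majority-quorum arithmetic.
public class QuorumCheck {

    // Smallest number of peers that forms a strict majority of the group.
    static int majority(int groupSize) {
        return groupSize / 2 + 1;
    }

    // A leader can commit an entry only if a majority (itself included) is reachable.
    static boolean canCommit(int groupSize, int livePeers) {
        return livePeers >= majority(groupSize);
    }

    public static void main(String[] args) {
        int groupSize = 3; // assumption: three SCM peers, as the log suggests
        int livePeers = 1; // only the leader is left after two SCMs are shut down
        System.out.println("majority=" + majority(groupSize)
                + " canCommit=" + canCommit(groupSize, livePeers));
        // prints "majority=2 canCommit=false"
    }
}
```

Under that assumption, a request the surviving leader submits through Ratis (such as the completeMove call in the trace) cannot commit, and the caller can only wait until its timeout expires.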
{noformat}
2022-06-15 20:00:15,634 WARN org.apache.ratis.grpc.server.GrpcLogAppender: 99c85376-060f-4b3c-8973-a2d2b1dd23e6@group-0B75F4A309CF->cbdea5d3-682d-43e6-a17a-bdce757b7764-GrpcLogAppender: Leader has not got in touch with Follower 99c85376-060f-4b3c-8973-a2d2b1dd23e6@group-0B75F4A309CF->cbdea5d3-682d-43e6-a17a-bdce757b7764(c-1,m0,n310, attendVote=true, lastRpcSendTime=1, lastRpcResponseTime=32843) yet, just keep nextIndex unchanged and retry.
2022-06-15 20:00:16,887 WARN org.apache.ratis.grpc.server.GrpcLogAppender: 99c85376-060f-4b3c-8973-a2d2b1dd23e6@group-0B75F4A309CF->cbdea5d3-682d-43e6-a17a-bdce757b7764-AppendLogResponseHandler: Failed appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
2022-06-15 20:00:16,888 WARN org.apache.ratis.grpc.server.GrpcLogAppender: 99c85376-060f-4b3c-8973-a2d2b1dd23e6@group-0B75F4A309CF->cbdea5d3-682d-43e6-a17a-bdce757b7764-GrpcLogAppender: Leader has not got in touch with Follower 99c85376-060f-4b3c-8973-a2d2b1dd23e6@group-0B75F4A309CF->cbdea5d3-682d-43e6-a17a-bdce757b7764(c-1,m0,n310, attendVote=true, lastRpcSendTime=4, lastRpcResponseTime=34097) yet, just keep nextIndex unchanged and retry.
2022-06-15 20:00:18,121 ERROR org.apache.ratis.server.impl.StateMachineUpdater: 99c85376-060f-4b3c-8973-a2d2b1dd23e6@group-0B75F4A309CF-StateMachineUpdater caught a Throwable.
java.lang.reflect.UndeclaredThrowableException
	at com.sun.proxy.$Proxy19.completeMove(Unknown Source)
	at org.apache.hadoop.hdds.scm.container.replication.LegacyReplicationManager.deleteSrcDnForMove(LegacyReplicationManager.java:1249)
	at org.apache.hadoop.hdds.scm.container.replication.LegacyReplicationManager.lambda$onLeaderReadyAndOutOfSafeMode$40(LegacyReplicationManager.java:1871)
	at java.base/java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1603)
	at org.apache.hadoop.hdds.scm.container.replication.LegacyReplicationManager.onLeaderReadyAndOutOfSafeMode(LegacyReplicationManager.java:1850)
	at org.apache.hadoop.hdds.scm.container.replication.LegacyReplicationManager.notifyStatusChanged(LegacyReplicationManager.java:1649)
	at org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.notifyStatusChanged(ReplicationManager.java:375)
	at org.apache.hadoop.hdds.scm.ha.SCMServiceManager.notifyStatusChanged(SCMServiceManager.java:52)
	at org.apache.hadoop.hdds.scm.ha.SCMStateMachine.notifyTermIndexUpdated(SCMStateMachine.java:330)
	at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1566)
	at org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:239)
	at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:182)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.util.concurrent.TimeoutException
	at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886)
	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021)
	at org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.submitRequest(SCMRatisServerImpl.java:225)
	at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeRatis(SCMHAInvocationHandler.java:111)
	at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:67)
	... 13 more
2022-06-15 20:00:18,122 INFO org.apache.ratis.server.RaftServer$Division: 99c85376-060f-4b3c-8973-a2d2b1dd23e6@group-0B75F4A309CF: shutdown
2022-06-15 20:00:18,122 INFO org.apache.ratis.util.JmxRegister: Successfully un-registered JMX Bean with object name Ratis:service=RaftServer,group=group-0B75F4A309CF,id=99c85376-060f-4b3c-8973-a2d2b1dd23e6
2022-06-15 20:00:18,122 INFO org.apache.ratis.server.impl.RoleInfo: 99c85376-060f-4b3c-8973-a2d2b1dd23e6: shutdown 99c85376-060f-4b3c-8973-a2d2b1dd23e6@group-0B75F4A309CF-LeaderStateImpl
2022-06-15 20:00:18,124 INFO org.apache.ratis.server.impl.PendingRequests: 99c85376-060f-4b3c-8973-a2d2b1dd23e6@group-0B75F4A309CF-PendingRequests: sendNotLeaderResponses
2022-06-15 20:00:18,125 WARN org.apache.ratis.grpc.server.GrpcLogAppender: 99c85376-060f-4b3c-8973-a2d2b1dd23e6@group-0B75F4A309CF->b6382f07-de2e-4986-8275-9146e73360a6-GrpcLogAppender: Wait interrupted by java.lang.InterruptedException
2022-06-15 20:00:18,128 INFO org.apache.hadoop.hdds.scm.ha.SCMStateMachine: current leader SCM steps down.
{noformat}
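The trace suggests how a timeout turned into an UndeclaredThrowableException: completeMove is called on a JDK dynamic proxy ($Proxy19), whose handler (SCMHAInvocationHandler.invoke, then invokeRatis) calls SCMRatisServerImpl.submitRequest, which blocks on CompletableFuture.get with a timeout and throws the checked TimeoutException. java.lang.reflect.Proxy wraps any checked exception that the interface method does not declare in an UndeclaredThrowableException, and the wrapping seen in the trace is consistent with completeMove not declaring TimeoutException. The minimal sketch below reproduces only that JDK behavior. The MoveTracker interface, its parameter, and the 100 ms timeout are hypothetical stand-ins, not the Ozone API.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.lang.reflect.UndeclaredThrowableException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class UndeclaredDemo {

    // Hypothetical interface: completeMove declares no checked exceptions.
    interface MoveTracker {
        void completeMove(long containerId);
    }

    // Calls completeMove through a proxy and returns whatever it throws (or null).
    static Throwable callThroughProxy() {
        InvocationHandler handler = (proxy, method, methodArgs) -> {
            // Stand-in for a Ratis request that never commits (no quorum):
            // nobody completes this future, so get() throws TimeoutException.
            CompletableFuture<Object> reply = new CompletableFuture<>();
            return reply.get(100, TimeUnit.MILLISECONDS);
        };
        MoveTracker tracker = (MoveTracker) Proxy.newProxyInstance(
                MoveTracker.class.getClassLoader(),
                new Class<?>[] {MoveTracker.class},
                handler);
        try {
            tracker.completeMove(1L);
            return null;
        } catch (UndeclaredThrowableException e) {
            // The proxy wrapped the undeclared checked TimeoutException.
            return e;
        }
    }

    public static void main(String[] args) {
        Throwable t = callThroughProxy();
        System.out.println("caught " + t.getClass().getSimpleName()
                + " caused by " + t.getCause().getClass().getSimpleName());
        // prints "caught UndeclaredThrowableException caused by TimeoutException"
    }
}
```

In the reported failure the wrapped exception was not caught along the LegacyReplicationManager path; per the log, StateMachineUpdater caught it as a Throwable, the Raft server division shut down, and the leader SCM stepped down.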
--
This message was sent by Atlassian Jira
(v8.20.7#820007)