[
https://issues.apache.org/jira/browse/HDDS-13551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18012833#comment-18012833
]
Bablu Raul commented on HDDS-13551:
-----------------------------------
*Expected Result:*
After a new SCM leader is elected, reconciliation should resume automatically
and complete without errors.
*Actual Result:*
Even after a new SCM leader has been elected, running the reconcile command
fails with a {{ServerNotLeaderException}}, and reconciliation does not
proceed.
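
For triage, the retry records in the quoted log below can be reduced to one row per failover attempt: which node the client tried and which exception came back. The following is only a minimal sketch and is not part of Ozone. The helper names {{unquote}} and {{summarize}} and both regular expressions are assumptions derived solely from the log excerpt in this issue, and the sample text is copied from that excerpt.

```python
import re

# Each retry record starts with this marker in the client log.
RECORD_MARKER = "retry.RetryInvocationHandler:"

# Exception wrapped in RemoteException(...), or a bare java.net exception.
EXC_RE = re.compile(r"RemoteException\(([\w.$]+)\)|(java\.net\.\w+Exception)")

# Tail of a record: target node and the failover attempt counter.
ATTEMPT_RE = re.compile(
    r"nodeId=(\w+),nodeAddress=([^/\s]+)/\S+\s+after (\d+) failover attempts"
)


def unquote(email_text):
    """Strip the leading '>' quote marker and re-join the wrapped lines."""
    lines = [re.sub(r"^>\s?", "", line) for line in email_text.splitlines()]
    return " ".join(line.strip() for line in lines if line.strip())


def summarize(log_text):
    """Return one (attempt, node_id, host, exception) tuple per retry record."""
    rows = []
    # Text before the first marker is not a retry record, so skip chunk 0.
    for chunk in log_text.split(RECORD_MARKER)[1:]:
        exc = EXC_RE.search(chunk)
        att = ATTEMPT_RE.search(chunk)
        # Keep only the simple class name of the exception.
        name = (exc.group(1) or exc.group(2)).rsplit(".", 1)[-1] if exc else None
        rows.append((
            int(att.group(3)) if att else None,  # None if the record is cut off
            att.group(1) if att else None,
            att.group(2) if att else None,
            name,
        ))
    return rows


# Two records copied from the log excerpt in this issue.
SAMPLE = """\
> retry.RetryInvocationHandler: com.google.protobuf.ServiceException:
> java.net.ConnectException: Call From st-ozone-qkgzv8-dbnlt/10.104.11.242 to
> ccycloud-1.newom.root.comops.site:9860 failed on connection exception:
> java.net.ConnectException: Connection refused; For more details see:
> http://wiki.apache.org/hadoop/ConnectionRefused, while invoking
> $Proxy18.submitRequest over
> nodeId=node1,nodeAddress=ccycloud-1.newom.root.comops.site/10.140.132.65:9860
> after 4 failover attempts. Trying to failover after sleeping for 2000ms.
> retry.RetryInvocationHandler: com.google.protobuf.ServiceException:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdds.ratis.ServerNotLeaderException):
> Server:333ad543-2b65-4a7c-b5c1-b8faadc1d7f4 is not the leader. Suggested
> leader is Server:ccycloud-10.newom.root.comops.site:9860.
"""

if __name__ == "__main__":
    for row in summarize(unquote(SAMPLE)):
        print(row)
```

A record that is cut off before its "after N failover attempts" tail, like the last one in the excerpt, yields {{None}} for the attempt, node, and host fields.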
> ServerNotLeaderException after SCM leader is stopped and new leader is
> elected
> -------------------------------------------------------------------------------
>
> Key: HDDS-13551
> URL: https://issues.apache.org/jira/browse/HDDS-13551
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Bablu Raul
> Priority: Major
>
> When I stop the SCM leader (i.e., data-10:LEADER), the system correctly
> triggers a leader election and successfully elects a new SCM leader. This can
> be verified through the CLI, where the new leader is visible:
> {code:java}
> data-17:FOLLOWER
> data-1:FOLLOWER
> data-10:LEADER{code}
> {code:java}
> data-17:FOLLOWER
> data-1:LEADER
> data-10:FOLLOWER {code}
> {code:java}
> 2025-08-07 05:15:11,853|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|25/08/07 05:15:11 INFO
> retry.RetryInvocationHandler: com.google.protobuf.ServiceException:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdds.scm.exceptions.SCMException):
> Cannot reconcile container #5001 in state CLOSED with replica states:
> CLOSED, CLOSED, CLOSING
> 2025-08-07 05:15:11,854|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer.reconcileContainer(SCMClientProtocolServer.java:1542)
> 2025-08-07 05:15:11,854|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.reconcileContainer(StorageContainerLocationProtocolServerSideTranslatorPB.java:1360)
> 2025-08-07 05:15:11,854|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.processRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:739)
> 2025-08-07 05:15:11,854|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
> 2025-08-07 05:15:11,855|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:235)
> 2025-08-07 05:15:11,855|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
> 2025-08-07 05:15:11,855|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
> 2025-08-07 05:15:11,855|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> 2025-08-07 05:15:11,855|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
> 2025-08-07 05:15:11,856|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
> 2025-08-07 05:15:11,856|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> java.security.AccessController.doPrivileged(Native Method)
> 2025-08-07 05:15:11,856|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> javax.security.auth.Subject.doAs(Subject.java:422)
> 2025-08-07 05:15:11,856|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1910)
> 2025-08-07 05:15:11,856|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
> 2025-08-07 05:15:11,856|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|, while invoking
> $Proxy18.submitRequest over
> nodeId=node2,nodeAddress=ccycloud-10.newom.root.comops.site/10.140.137.141:9860
> after 3 failover attempts. Trying to failover after sleeping for 2000ms.
> 2025-08-07 05:15:13,856|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|25/08/07 05:15:13 INFO
> retry.RetryInvocationHandler: com.google.protobuf.ServiceException:
> java.net.ConnectException: Call From st-ozone-qkgzv8-dbnlt/10.104.11.242 to
> ccycloud-1.newom.root.comops.site:9860 failed on connection exception:
> java.net.ConnectException: Connection refused; For more details see:
> http://wiki.apache.org/hadoop/ConnectionRefused, while invoking
> $Proxy18.submitRequest over
> nodeId=node1,nodeAddress=ccycloud-1.newom.root.comops.site/10.140.132.65:9860
> after 4 failover attempts. Trying to failover after sleeping for 2000ms.
> 2025-08-07 05:15:15,863|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|25/08/07 05:15:15 INFO
> retry.RetryInvocationHandler: com.google.protobuf.ServiceException:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdds.scm.exceptions.SCMException):
> Cannot reconcile container #5001 in state CLOSED with replica states:
> CLOSED, CLOSED, CLOSING
> 2025-08-07 05:15:15,864|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer.reconcileContainer(SCMClientProtocolServer.java:1542)
> 2025-08-07 05:15:15,864|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.reconcileContainer(StorageContainerLocationProtocolServerSideTranslatorPB.java:1360)
> 2025-08-07 05:15:15,864|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.processRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:739)
> 2025-08-07 05:15:15,865|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
> 2025-08-07 05:15:15,865|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:235)
> 2025-08-07 05:15:15,865|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
> 2025-08-07 05:15:15,865|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
> 2025-08-07 05:15:15,865|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> 2025-08-07 05:15:15,865|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
> 2025-08-07 05:15:15,866|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
> 2025-08-07 05:15:15,866|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> java.security.AccessController.doPrivileged(Native Method)
> 2025-08-07 05:15:15,866|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> javax.security.auth.Subject.doAs(Subject.java:422)
> 2025-08-07 05:15:15,866|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1910)
> 2025-08-07 05:15:15,866|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
> 2025-08-07 05:15:15,866|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|, while invoking
> $Proxy18.submitRequest over
> nodeId=node2,nodeAddress=ccycloud-10.newom.root.comops.site/10.140.137.141:9860
> after 5 failover attempts. Trying to failover after sleeping for 2000ms.
> 2025-08-07 05:15:17,875|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|25/08/07 05:15:17 INFO
> retry.RetryInvocationHandler: com.google.protobuf.ServiceException:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdds.ratis.ServerNotLeaderException):
> Server:333ad543-2b65-4a7c-b5c1-b8faadc1d7f4 is not the leader. Suggested
> leader is Server:ccycloud-10.newom.root.comops.site:9860.
> 2025-08-07 05:15:17,876|INFO|MainThread|machine.py:205 -
> run()||GUID=51debc2b-956b-4e05-b036-ced4aa0547f4|at
> org.apache.hadoop.hdds.ratis.ServerNotLeaderException.convertToNotLeaderException(ServerNotLeaderException.java:102)
> {code}