[
https://issues.apache.org/jira/browse/HDDS-11350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877615#comment-17877615
]
Siddhant Sangwan commented on HDDS-11350:
-----------------------------------------
[~jyosin] thanks for reporting this.
From the SCM logs, it's clear that the method
'{{public List<ContainerBalancerTaskIterationStatusInfo> getCurrentIterationsStatistic()}}'
is called before the balancing thread has initialised the data structures and
classes used by the balancer. One of these is the variable
'{{private FindTargetStrategy findTargetStrategy}}', which is null at this
point. This results in the null pointer exception.
SCM logs:
{code:java}
2024-08-21 04:30:02,935 INFO
[node1-ContainerBalancerTask-1]-org.apache.hadoop.hdds.scm.container.balancer.ContainerBalancerTask:
ContainerBalancer will sleep for 180000 ms while waiting for updated usage
information from Datanodes.
2024-08-21 04:30:06,906 INFO [Socket Reader #1 for port
9860]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for
[email protected] (auth:KERBEROS)
2024-08-21 04:30:06,918 INFO [Socket Reader #1 for port
9860]-SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
Authorization successful for [email protected](auth:KERBEROS) for protocol=interface
org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocol
2024-08-21 04:30:06,924 WARN [IPC Server handler 1 on
9860]-org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9860, call Call#0
Retry#0
org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocol.submitRequest
from 10.64.62.70:46350
java.lang.NullPointerException
at
org.apache.hadoop.hdds.scm.container.balancer.ContainerBalancerTask.getCurrentIterationsStatistic(ContainerBalancerTask.java:353)
at
org.apache.hadoop.hdds.scm.container.balancer.ContainerBalancer.getBalancerStatusInfo(ContainerBalancer.java:191)
at
org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer.getContainerBalancerStatusInfo(SCMClientProtocolServer.java:1213)
at
org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.getContainerBalancerStatusInfo(StorageContainerLocationProtocolServerSideTranslatorPB.java:1210)
at
org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.processRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:608)
at
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
at
org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:233)
at
org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
{code}
As seen in these logs, the balancing thread goes to sleep because
"hdds.container.balancer.trigger.du.before.move.enable" is set to true. It does
so before it has initialised findTargetStrategy. The call to check the status
arrives while the thread is still asleep, so the method dereferences the null
field and throws the exception.
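To illustrate the race, here is a minimal sketch of one possible guard. It is not the actual fix for this issue, and it does not use the real Ozone classes: BalancerStatusSketch, Task and IterationState are hypothetical stand-ins, and the real method returns ContainerBalancerTaskIterationStatusInfo objects rather than integers. The idea is only that the status call reads the not-yet-initialised field once and returns an empty result when it is still null, instead of dereferencing it.

```java
import java.util.Collections;
import java.util.List;

public class BalancerStatusSketch {

  /** Hypothetical stand-in for state that the balancing thread sets up late. */
  static final class IterationState {
    private final int iterationNumber;

    IterationState(int iterationNumber) {
      this.iterationNumber = iterationNumber;
    }

    int getIterationNumber() {
      return iterationNumber;
    }
  }

  /** Hypothetical stand-in for the balancer task; not the real Ozone class. */
  static final class Task {
    // Written by the balancing thread and read by RPC handler threads,
    // hence volatile. It stays null until initialise() has run.
    private volatile IterationState state;

    /** Called by the balancing thread once it wakes up and starts iterating. */
    void initialise() {
      state = new IterationState(1);
    }

    /** Called by the status RPC; may run before initialise(). */
    List<Integer> getCurrentIterationsStatistic() {
      IterationState snapshot = state; // read the field exactly once
      if (snapshot == null) {
        // Not initialised yet: report "no statistics" instead of throwing.
        return Collections.emptyList();
      }
      return Collections.singletonList(snapshot.getIterationNumber());
    }
  }

  public static void main(String[] args) {
    Task task = new Task();
    // Status requested while the balancing thread is still sleeping.
    System.out.println(task.getCurrentIterationsStatistic()); // prints []
    task.initialise();
    System.out.println(task.getCurrentIterationsStatistic()); // prints [1]
  }
}
```

Whether an empty list, a "not started yet" status, or eager initialisation before the sleep is the right behaviour is a design decision for the actual patch; the sketch only shows the null check.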
> NullPointerException thrown on checking container balancer status
> -----------------------------------------------------------------
>
> Key: HDDS-11350
> URL: https://issues.apache.org/jira/browse/HDDS-11350
> Project: Apache Ozone
> Issue Type: Bug
> Components: Balancer
> Reporter: Jyotirmoy Sinha
> Assignee: Siddhant Sangwan
> Priority: Major
>
> Scenario - Run container balancer when there is no data in the cluster.
> Configs -
> {code:java}
> "hdds.container.balancer.trigger.du.before.move.enable": "true",
> "ozone.scm.container.size": "1GB",
> "hdds.container.balancer.balancing.iteration.interval": "5m",
> "hdds.container.balancer.size.moved.max.per.iteration": "2GB" {code}
> Error stacktrace -
> {code:java}
> # /opt/cloudera/parcels/CDH/bin/ozone admin containerbalancer start -t 1 -d
> 100
> Container Balancer started successfully.
> # /opt/cloudera/parcels/CDH/bin/ozone admin containerbalancer status
> 24/08/21 14:14:30 INFO retry.RetryInvocationHandler:
> com.google.protobuf.ServiceException:
> org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException):
> java.lang.NullPointerException
> at
> org.apache.hadoop.hdds.scm.container.balancer.ContainerBalancerTask.getCurrentIterationsStatistic(ContainerBalancerTask.java:353)
> at
> org.apache.hadoop.hdds.scm.container.balancer.ContainerBalancer.getBalancerStatusInfo(ContainerBalancer.java:191)
> at
> org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer.getContainerBalancerStatusInfo(SCMClientProtocolServer.java:1213)
> at
> org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.getContainerBalancerStatusInfo(StorageContainerLocationProtocolServerSideTranslatorPB.java:1210)
> at
> org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.processRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:608)
> at
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
> at
> org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:233)
> at
> org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
> at java.base/java.security.AccessController.doPrivileged(Native Method)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
> , while invoking $Proxy20.submitRequest over
> nodeId=node1,nodeAddress=ccycloud-1.quasar-ypdsqw.root.comops.site/10.140.49.132:9860
> after 2 failover attempts. Trying to failover after sleeping for 2000ms.
> 24/08/21 14:14:34 INFO retry.RetryInvocationHandler:
> com.google.protobuf.ServiceException:
> org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException):
> java.lang.NullPointerException
> at
> org.apache.hadoop.hdds.scm.container.balancer.ContainerBalancerTask.getCurrentIterationsStatistic(ContainerBalancerTask.java:353)
> at
> org.apache.hadoop.hdds.scm.container.balancer.ContainerBalancer.getBalancerStatusInfo(ContainerBalancer.java:191)
> at
> org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer.getContainerBalancerStatusInfo(SCMClientProtocolServer.java:1213)
> at
> org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.getContainerBalancerStatusInfo(StorageContainerLocationProtocolServerSideTranslatorPB.java:1210)
> at
> org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.processRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:608)
> at
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
> at
> org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:233)
> at
> org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
> at java.base/java.security.AccessController.doPrivileged(Native Method)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
> , while invoking $Proxy20.submitRequest over
> nodeId=node1,nodeAddress=ccycloud-1.quasar-ypdsqw.root.comops.site/10.140.49.132:9860
> after 4 failover attempts. Trying to failover after sleeping for 2000ms.
> 24/08/21 14:14:36 INFO retry.RetryInvocationHandler:
> com.google.protobuf.ServiceException:
> org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException):
> java.lang.NullPointerException {code}