[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

huhaiyang (Jira) Thu, 03 Sep 2020 02:13:58 -0700


     [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


huhaiyang updated HDFS-15556:
-----------------------------
    Description: 
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
because DataNode is identified as busy and unable to allocate available nodes 
in choose  DataNode, program loop execution results in high CPU and reduces the 
processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 org.ap
ache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from 
xxxxx:34766
java.lang.NullPointerException
        at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
        at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
        at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}


{code:java}
// DatanodeDescriptor#updateStorageStats
...
for (StorageReport report : reports) {

      DatanodeStorageInfo storage = null;
      synchronized (storageMap) {
        storage =
            storageMap.get(report.getStorage().getStorageID());
      }
      if (checkFailedStorages) {
        failedStorageInfos.remove(storage);
      }

      storage.receivedHeartbeat(report);  //  NPE exception occurred here 
      // skip accounting for capacity of PROVIDED storages!
      if (StorageType.PROVIDED.equals(storage.getStorageType())) {
        continue;
      }
...
{code}


  was:
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
In choose  DataNode because DataNode is identified as busy and unable to 
allocate available nodes, program loop execution results in high CPU and 
reduces the processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 org.ap
ache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from 
xxxxx:34766
java.lang.NullPointerException
        at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
        at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
        at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}


{code:java}
// DatanodeDescriptor#updateStorageStats
...
for (StorageReport report : reports) {

      DatanodeStorageInfo storage = null;
      synchronized (storageMap) {
        storage =
            storageMap.get(report.getStorage().getStorageID());
      }
      if (checkFailedStorages) {
        failedStorageInfos.remove(storage);
      }

      storage.receivedHeartbeat(report);  //  NPE exception occurred here 
      // skip accounting for capacity of PROVIDED storages!
      if (StorageType.PROVIDED.equals(storage.getStorageType())) {
        continue;
      }
...
{code}



> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> ------------------------------------------------------------------------
>
>                 Key: HDFS-15556
>                 URL: https://issues.apache.org/jira/browse/HDFS-15556
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.2.0
>            Reporter: huhaiyang
>            Priority: Critical
>
> In our cluster, the NameNode appears NPE when processing lifeline messages 
> sent by the DataNode, which will cause an maxLoad exception calculated by NN.
> because DataNode is identified as busy and unable to allocate available nodes 
> in choose  DataNode, program loop execution results in high CPU and reduces 
> the processing performance of the cluster.
> *NameNode the exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 org.ap
> ache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from 
> xxxxx:34766
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
>         at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>       DatanodeStorageInfo storage = null;
>       synchronized (storageMap) {
>         storage =
>             storageMap.get(report.getStorage().getStorageID());
>       }
>       if (checkFailedStorages) {
>         failedStorageInfos.remove(storage);
>       }
>       storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>       // skip accounting for capacity of PROVIDED storages!
>       if (StorageType.PROVIDED.equals(storage.getStorageType())) {
>         continue;
>       }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

Reply via email to