[
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
huhaiyang updated HDFS-15556:
-----------------------------
Description:
In our cluster, the NameNode throws an NPE when processing lifeline messages
sent by the DataNode, which causes the maxLoad calculated by the NN to be
abnormal. When choosing a DataNode, the DataNodes are then identified as busy
and no available node can be allocated; the selection loop keeps executing,
which results in high CPU usage and reduces the processing performance of the
cluster.
*NameNode exception stack:*
{code:java}
2020-09-02 11:01:57,043 DEBUG org.apache.hadoop.ipc.Server: Served: sendLifeline, queueTime= 2 procesingTime= 0 exception= NullPointerException
2020-09-02 11:01:57,044 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8022, call Call#68269 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from xxx:47138
java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:475)
at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:391)
at org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1825)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}
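The chain described above (an exception during a lifeline update, a skewed maxLoad, every DataNode looking busy) can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: the class and method names are hypothetical and are not Hadoop classes, the update is assumed to follow a subtract / update / add-back pattern on a cluster-wide aggregate, and maxLoad is modeled as a hypothetical factor times the average load. It is not the actual NameNode code or the fix for this issue.

```java
/** Illustrative sketch only; none of these names are Hadoop classes. */
public class LifelineLoadSketch {

    /** Hypothetical cluster-wide aggregate used to derive an average load. */
    static class ClusterStats {
        int nodes;
        int totalXceivers;

        void subtract(Node n) { nodes--; totalXceivers -= n.xceivers; }
        void add(Node n)      { nodes++; totalXceivers += n.xceivers; }

        double averageLoad() {
            return nodes == 0 ? 0.0 : (double) totalXceivers / nodes;
        }
    }

    /** Hypothetical per-node state. */
    static class Node {
        int xceivers;

        Node(int xceivers) { this.xceivers = xceivers; }

        void update(Integer reportedXceivers) {
            // Unboxing a null Integer throws NullPointerException,
            // standing in for the NPE seen in the stack trace.
            this.xceivers = reportedXceivers;
        }
    }

    /** Assumed pattern: an exception in the middle skips the add-back. */
    static void unsafeUpdate(ClusterStats stats, Node n, Integer reported) {
        stats.subtract(n);
        n.update(reported); // may throw
        stats.add(n);       // never reached when update() throws
    }

    /** One possible guard: skip bad input and always restore the aggregate. */
    static void safeUpdate(ClusterStats stats, Node n, Integer reported) {
        stats.subtract(n);
        try {
            if (reported != null) {
                n.update(reported);
            }
        } finally {
            stats.add(n);
        }
    }

    /** Assumed busy check: a node is busy when its load exceeds maxLoad. */
    static boolean isTooBusy(ClusterStats stats, Node n, double factor) {
        double maxLoad = factor * stats.averageLoad();
        return n.xceivers > maxLoad;
    }

    public static void main(String[] args) {
        ClusterStats stats = new ClusterStats();
        Node[] cluster = { new Node(10), new Node(10), new Node(10) };
        for (Node n : cluster) {
            stats.add(n);
        }
        System.out.println("busy before: " + isTooBusy(stats, cluster[0], 2.0));

        for (Node n : cluster) {
            try {
                unsafeUpdate(stats, n, null);
            } catch (NullPointerException e) {
                // The RPC layer would log the exception and move on.
            }
        }
        System.out.println("busy after:  " + isTooBusy(stats, cluster[0], 2.0));
    }
}
```

In this sketch, three nodes with 10 xceivers each and a factor of 2 give a maxLoad of 20, so no node is busy. After one failed unsafe update per node the aggregate is empty, maxLoad drops to 0, and every node is reported busy, which is the kind of state in which a placement loop could spin without finding a target. The guarded variant leaves the aggregate intact.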
> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> ------------------------------------------------------------------------
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.2.0
> Reporter: huhaiyang
> Priority: Critical