Li Cheng created HDDS-3559:
------------------------------
Summary: Datanode doesn't handle java heap OutOfMemory exception
Key: HDDS-3559
URL: https://issues.apache.org/jira/browse/HDDS-3559
Project: Hadoop Distributed Data Store
Issue Type: Bug
Components: Ozone Datanode
Affects Versions: 0.5.0
Reporter: Li Cheng
2020-05-05 15:47:41,568 [Datanode State Machine Thread - 167] WARN org.apache.hadoop.ozone.container.common.statemachine.EndpointStateMachine: Unable to communicate to SCM server at host-10-51-87-181:9861 for past 0 seconds.
java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
    at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolClientSideTranslatorPB.submitRequest(StorageContainerDatanodeProtocolClientSideTranslatorPB.java:118)
    at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolClientSideTranslatorPB.sendHeartbeat(StorageContainerDatanodeProtocolClientSideTranslatorPB.java:148)
    at org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask.call(HeartbeatEndpointTask.java:145)
    at org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask.call(HeartbeatEndpointTask.java:76)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.getReturnMessage(ProtobufRpcEngine.java:293)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:270)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy38.submitRequest(Unknown Source)
    at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolClientSideTranslatorPB.submitRequest(StorageContainerDatanodeProtocolClientSideTranslatorPB.java:116)
On a cluster, one datanode stopped reporting to SCM while its status remained
unknown. The datanode process was still running. The log shows a Java heap
OutOfMemoryError raised while serializing the protobuf for an RPC message.
Instead of handling the error, the datanode silently stops sending reports to
SCM and the process becomes stale.
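
One possible way to surface this failure, sketched below, is to wrap the heartbeat task so that an OutOfMemoryError anywhere in the exception's cause chain triggers a fatal-error handler instead of being logged as an ordinary communication warning. This is only an illustration, not the datanode's actual code: FatalErrorGuard, causedByOom and guard are hypothetical names, and the sketch assumes the OutOfMemoryError is reachable through getCause(), which the truncated trace above does not confirm.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Hypothetical sketch: FatalErrorGuard, causedByOom and guard are illustrative
// names, not part of Ozone or Hadoop.
public class FatalErrorGuard {

  // Walks the cause chain looking for an OutOfMemoryError. The depth bound
  // keeps a cyclic cause chain from looping forever.
  public static boolean causedByOom(Throwable t) {
    for (int depth = 0; t != null && depth < 32; depth++) {
      if (t instanceof OutOfMemoryError) {
        return true;
      }
      t = t.getCause();
    }
    return false;
  }

  // Wraps a task so that an OOM anywhere in the failure's cause chain is
  // reported to onFatal before the original failure is rethrown.
  public static <T> Callable<T> guard(Callable<T> task, Consumer<Throwable> onFatal) {
    return () -> {
      try {
        return task.call();
      } catch (Exception e) {
        if (causedByOom(e)) {
          onFatal.accept(e);
        }
        throw e;
      } catch (Error e) {
        if (causedByOom(e)) {
          onFatal.accept(e);
        }
        throw e;
      }
    };
  }

  public static void main(String[] args) {
    AtomicBoolean fatalSeen = new AtomicBoolean(false);

    // Stand-in for a heartbeat call that fails the way the log above shows:
    // an IOException whose cause chain ends in a heap OOM.
    Callable<Void> heartbeat = () -> {
      throw new IOException(new RuntimeException(
          new OutOfMemoryError("Java heap space")));
    };

    // A real datanode would more likely terminate the process here; the demo
    // only records the event so it can run to completion.
    Callable<Void> guarded = guard(heartbeat, t -> fatalSeen.set(true));

    try {
      guarded.call();
    } catch (Exception e) {
      System.out.println("heartbeat failed: " + e);
    }
    System.out.println("fatal handler fired: " + fatalSeen.get());
  }
}
```

In a real deployment the handler would probably exit the process so that the node is restarted or marked dead rather than left stale. Running the JVM with -XX:+ExitOnOutOfMemoryError is another option that needs no code change.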