[
https://issues.apache.org/jira/browse/HDFS-9684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122615#comment-15122615
]
Chris Nauroth commented on HDFS-9684:
-------------------------------------
This is likely to be a duplicate of HDFS-9046, though I have not reviewed
either of the patches in enough detail to say for sure.
bq. ...I think conclusion is to kill the DN in case of OOM.
In general, I am opposed to attempting recovery from {{OutOfMemoryError}},
especially if it's a true memory allocation problem and not thread exhaustion
as shown here. The trouble with trying to recover is that it's extremely
difficult to predict what state we were in right before the memory allocation
failed, and therefore we can't tell what kind of repair work might be required.
It would be easy to end up with an internal data structure half-modified with
no way to either roll back or roll forward to complete the modification later.
Then, the process keeps running in an indeterminate state that we never
anticipated, so its behavior will be unpredictable.
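To make the half-modified case concrete, here is a minimal sketch. It is not Hadoop code: the class {{HalfModifiedDemo}}, its two maps, and the {{failMidUpdate}} switch are all hypothetical, and the error is thrown by hand to stand in for a real allocation or thread-start failure.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical registry that must keep two maps in sync; not Hadoop code. */
public class HalfModifiedDemo {
    static final Map<String, Long> sizeById = new HashMap<>();
    static final Map<String, String> ownerById = new HashMap<>();

    /** Set to true to simulate a failure between the two writes. */
    static boolean failMidUpdate = false;

    static void register(String id, long size, String owner) {
        sizeById.put(id, size);            // step 1 succeeds
        if (failMidUpdate) {
            // Stand-in for a real allocation or thread-start failure.
            throw new OutOfMemoryError("simulated");
        }
        ownerById.put(id, owner);          // step 2 is skipped on failure
    }

    /** The invariant the rest of the program silently relies on. */
    static boolean consistent() {
        return sizeById.keySet().equals(ownerById.keySet());
    }

    public static void main(String[] args) {
        register("blk_1", 128L, "clientA");
        failMidUpdate = true;
        try {
            register("blk_2", 64L, "clientB");
        } catch (OutOfMemoryError e) {
            // "Recovery": the process keeps running, but nothing repaired the maps.
            System.out.println("caught: " + e.getMessage());
        }
        System.out.println("consistent = " + consistent());
    }
}
```

After the catch block the process is still up, but {{consistent()}} returns false and nothing in the handler knows which step was interrupted. A real error can surface at any allocation, so the set of possible broken states is far larger than in this two-step example.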
Alas, we already have established code that tries to recover from
{{OutOfMemoryError}}, most notably in the RPC {{Server}}. Some of us prefer to
launch the JVM with the {{-XX:OnOutOfMemoryError}} argument set so that the
process kills itself.
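For reference, a commonly used value for that argument is {{-XX:OnOutOfMemoryError="kill -9 %p"}}, where the JVM substitutes its own process id for {{%p}} and runs the command when an {{OutOfMemoryError}} is first thrown. The sketch below is only a rough in-process analogue, not what the flag does and not existing Hadoop code: the class {{FailFastOnOom}} and its injectable {{terminator}} are hypothetical. Unlike the JVM flag, it only sees errors that no code has caught, so it would not help where existing code swallows the error.

```java
import java.util.function.IntConsumer;

/** Hypothetical fail-fast handler for uncaught OutOfMemoryError; not part of Hadoop. */
public class FailFastOnOom implements Thread.UncaughtExceptionHandler {
    // Injected so the termination step can be replaced in tests.
    private final IntConsumer terminator;

    public FailFastOnOom(IntConsumer terminator) {
        this.terminator = terminator;
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        if (e instanceof OutOfMemoryError) {
            // Do not attempt cleanup: state may already be inconsistent.
            terminator.accept(1);
        } else {
            // Other throwables are only reported by this sketch.
            System.err.println("Uncaught in " + t.getName() + ": " + e);
        }
    }

    /** Installs the handler process-wide; halt() skips shutdown hooks. */
    public static void install() {
        Thread.setDefaultUncaughtExceptionHandler(
            new FailFastOnOom(code -> Runtime.getRuntime().halt(code)));
    }
}
```

Calling {{FailFastOnOom.install()}} early in startup would apply this to every thread that does not set its own handler. {{Runtime.halt}} is used rather than {{System.exit}} on the assumption that shutdown hooks should not run against possibly corrupted state.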
> DataNode stopped sending heartbeat after getting OutOfMemoryError from
> DataTransfer thread.
> -------------------------------------------------------------------------------------------
>
> Key: HDFS-9684
> URL: https://issues.apache.org/jira/browse/HDFS-9684
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.7.1
> Reporter: Surendra Singh Lilhore
> Assignee: Surendra Singh Lilhore
> Priority: Blocker
> Attachments: HDFS-9684.01.patch
>
>
> {noformat}
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:714)
> at org.apache.hadoop.hdfs.server.datanode.DataNode.transferBlock(DataNode.java:1999)
> at org.apache.hadoop.hdfs.server.datanode.DataNode.transferBlocks(DataNode.java:2008)
> at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:657)
> at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:615)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:857)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:671)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:823)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)