[ 
https://issues.apache.org/jira/browse/HDFS-9684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122615#comment-15122615
 ] 

Chris Nauroth commented on HDFS-9684:
-------------------------------------

This is likely to be a duplicate of HDFS-9046, though I have not reviewed 
either of the patches in enough detail to say for sure.

bq. ...I think conclusion is to kill the DN in case of OOM.

In general, I am opposed to attempting recovery from {{OutOfMemoryError}}, 
especially if it's a true memory allocation problem and not thread exhaustion 
as shown here.  The trouble with trying to recover is that it's extremely 
difficult to predict what state we were in right before the memory allocation 
failed, and therefore we can't tell what kind of repair work might be required.  
It would be easy to end up with an internal data structure half-modified, with 
no way to either roll back or roll forward to complete the modification later.  
Then the process keeps running in an indeterminate state that we never 
anticipated, so its behavior will be unpredictable.

Alas, we already have established code that tries to recover from 
{{OutOfMemoryError}}, most notably in the RPC {{Server}}.  Some of us prefer to 
launch the JVM with the {{-XX:OnOutOfMemoryError}} argument set so that the 
process kills itself.
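
To illustrate the fail-fast preference, here is a minimal Java sketch, not code from Hadoop or from either patch. The class {{FailFastOnOom}}, the method {{failFast}}, and the injectable terminator are all hypothetical names introduced for illustration. The idea is only that a task which hits {{OutOfMemoryError}} hands control to a terminator instead of attempting repair. In a real process the terminator might call {{Runtime.getRuntime().halt(status)}}, or the same effect might come from the JVM flag, commonly written as {{-XX:OnOutOfMemoryError="kill -9 %p"}}.

```java
import java.util.function.IntConsumer;

public class FailFastOnOom {

    /**
     * Hypothetical helper: wraps a task so that an OutOfMemoryError is passed
     * to the given terminator instead of being "recovered" from.
     */
    static Runnable failFast(Runnable task, IntConsumer terminator) {
        return () -> {
            try {
                task.run();
            } catch (OutOfMemoryError oom) {
                // No repair attempt: in-memory state may be half-modified,
                // so the only safe action is to stop the process.
                terminator.accept(1);
            }
        };
    }

    public static void main(String[] args) {
        // Demo terminator that only reports; a production terminator could be
        // status -> Runtime.getRuntime().halt(status)
        IntConsumer report = status ->
            System.out.println("would halt with status " + status);

        Runnable work = failFast(
            () -> { throw new OutOfMemoryError("unable to create new native thread"); },
            report);
        work.run();
    }
}
```

The terminator is injected so the policy can be exercised without actually killing the JVM. Whether to use {{halt}} (which skips shutdown hooks) or an external kill is a deployment choice this sketch does not settle.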

> DataNode stopped sending heartbeat after getting OutOfMemoryError from 
> DataTransfer thread.
> -------------------------------------------------------------------------------------------
>
>                 Key: HDFS-9684
>                 URL: https://issues.apache.org/jira/browse/HDFS-9684
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.7.1
>            Reporter: Surendra Singh Lilhore
>            Assignee: Surendra Singh Lilhore
>            Priority: Blocker
>         Attachments: HDFS-9684.01.patch
>
>
> {noformat}
> java.lang.OutOfMemoryError: unable to create new native thread
>       at java.lang.Thread.start0(Native Method)
>       at java.lang.Thread.start(Thread.java:714)
>       at org.apache.hadoop.hdfs.server.datanode.DataNode.transferBlock(DataNode.java:1999)
>       at org.apache.hadoop.hdfs.server.datanode.DataNode.transferBlocks(DataNode.java:2008)
>       at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:657)
>       at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:615)
>       at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:857)
>       at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:671)
>       at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:823)
>       at java.lang.Thread.run(Thread.java:745)
> {noformat}



