[ 
https://issues.apache.org/jira/browse/HDFS-9684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15117387#comment-15117387
 ] 

Kihwal Lee commented on HDFS-9684:
----------------------------------

If we are to make datanode recoverable from such conditions, we need to take 
care of other essential services running in datanode. E.g. I've seen DU threads 
silently terminating, causing storage report to be stale. Sometimes crippled 
datanodes keep heartbeating so clients are sent there and cause more failures.  
It feels like we need a self healthcheck in datanode along with recovery 
mechanism. 

> DataNode stopped sending heartbeat after getting OutOfMemoryError form 
> DataTransfer thread.
> -------------------------------------------------------------------------------------------
>
>                 Key: HDFS-9684
>                 URL: https://issues.apache.org/jira/browse/HDFS-9684
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.7.1
>            Reporter: Surendra Singh Lilhore
>            Assignee: Surendra Singh Lilhore
>            Priority: Blocker
>         Attachments: HDFS-9684.01.patch
>
>
> {noformat}
> java.lang.OutOfMemoryError: unable to create new native thread
>       at java.lang.Thread.start0(Native Method)
>       at java.lang.Thread.start(Thread.java:714)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.transferBlock(DataNode.java:1999)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.transferBlocks(DataNode.java:2008)
>       at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:657)
>       at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:615)
>       at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:857)
>       at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:671)
>       at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:823)
>       at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to