[ 
https://issues.apache.org/jira/browse/HADOOP-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668220#action_12668220
 ] 

Raghu Angadi commented on HADOOP-4672:
--------------------------------------

I suspect this problem is going to be very hard to reproduce. Even after 
finding the cause, we most likely won't be able to do much about it.

Fortunately, all the stuck threads are reading from or writing to sockets that 
don't have a timeout. So one workaround is to set a timeout (something like 
10 minutes).

Currently the following need a timeout:
    - upstream socket in datanode write pipeline
    - IPC client writes to the server (reads already have a timeout and 
are controlled by pings)
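
As a rough illustration of the workaround (not the actual Hadoop change), the 
sketch below shows how a read timeout keeps a thread from blocking forever on 
a socket whose peer never sends anything. The class and method names 
(SocketTimeoutSketch, readTimesOut) are hypothetical, and the timeout is kept 
short so the example finishes quickly; the value suggested above would be on 
the order of 10 minutes. Note that Socket.setSoTimeout() only bounds blocking 
reads. Writes on a plain java.net.Socket are not covered by it, so the write 
side would need a different mechanism, for example a non-blocking channel 
with a timed select.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical sketch: a blocking read that gives up after a timeout.
public class SocketTimeoutSketch {

    // Returns true if the read was abandoned after timeoutMillis,
    // false if it returned normally (data or end of stream).
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        // The local server never calls accept() and never writes; the
        // connection completes through the listen backlog, so the client
        // ends up with a connected but silent peer.
        try (ServerSocket server =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client =
                 new Socket(server.getInetAddress(), server.getLocalPort())) {

            // Without this call, the read below would block indefinitely.
            client.setSoTimeout(timeoutMillis);

            InputStream in = client.getInputStream();
            try {
                in.read();
                return false;
            } catch (SocketTimeoutException e) {
                // The socket is still usable here; the caller decides
                // whether to retry or to close the connection.
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(500));
    }
}
```

With a timeout in place the stuck thread gets an exception it can handle, 
for example by closing the connection and reconnecting, instead of hanging.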


> RPC on Datanode blocked forever.
> --------------------------------
>
>                 Key: HADOOP-4672
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4672
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs, io
>    Affects Versions: 0.17.0
>         Environment: Java SE 1.6.0-b105 on Linux 2.6.x
>            Reporter: Raghu Angadi
>
> We recently noticed that a number of datanodes got stuck. The main thread that 
> sends heartbeats and block reports is blocked in select() inside the 
> blockReport() RPC.  I will add a stack trace in the next comment.
> I am not sure why select was blocked forever since there is no connection 
> open to NameNode. In fact, NN was restarted in between. It could be some JDK 
> bug or a Hadoop bug.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
