[ 
https://issues.apache.org/jira/browse/HDFS-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197601#comment-13197601
 ] 

Harsh J commented on HDFS-1603:
-------------------------------

bq. I've noticed a failure during an unlock call that occurs AFTER an SD has 
been detected as a failed point. The unlock call went ahead and blocked via a 
native call to the NFS lock daemon - and since the NFS server was down, it just 
hung (odd that the timeout did not apply, probably an NFS lockd issue, but I do 
not feel it's OK to unlock after a directory has caused a processIOError call).

Disregard the above. It was caused by a lockd bug in an earlier release of 
CentOS, as I'd suspected.

For:
bq. ATM and I just brainstormed about this a little bit over some iced coffee. 
Though on the surface it doesn't look too hard to implement timeouts on namedir 
operations, it would actually have to be done in a lot of places (e.g. 
mkdirs/move calls on storage directories, writing edits, saving images, etc). 
Timing out some of these things isn't entirely straightforward, since the 
underlying calls aren't interruptible.

Since the hang is in processIOError, or a similar call that handles the 
directory errors, let's put a timeout there instead? That should solve the same 
issue. Though in a case like the one above, where the call blocks in native 
code, the thread itself may still hang forever.
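
To make the suggestion concrete, here is a rough sketch of what such a timeout 
could look like. This is not the actual NameNode code: the class 
StorageDirTimeout and the method runWithTimeout are hypothetical names made up 
for illustration, and the Callable stands in for whatever the error-handling 
path does on a storage directory (for example the unlock). The caller stops 
waiting after the timeout, but the worker thread is only interrupted on a 
best-effort basis, so a thread blocked in a native call can remain stuck.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of HDFS: bounds how long the caller waits
// for an operation on a storage directory.
public class StorageDirTimeout {

    // Daemon threads, so a worker stuck in a native call cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "storage-dir-op");
        t.setDaemon(true);
        return t;
    });

    /**
     * Runs op on a worker thread and waits at most timeoutMs for it.
     * Returns true if op completed normally in time; false if it threw,
     * timed out, or the calling thread was interrupted while waiting.
     */
    public static boolean runWithTimeout(Callable<Void> op, long timeoutMs) {
        Future<Void> f = POOL.submit(op);
        try {
            f.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Best effort only: a thread blocked in a non-interruptible
            // native call will not react to the interrupt and may hang on.
            f.cancel(true);
            return false;
        } catch (ExecutionException e) {
            // The operation itself failed (e.g. with an IOException).
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            f.cancel(true);
            return false;
        }
    }

    public static void main(String[] args) {
        // A quick operation, and one that outlives its timeout.
        boolean fast = runWithTimeout(() -> null, 1000);
        boolean slow = runWithTimeout(() -> { Thread.sleep(5000); return null; }, 100);
        System.out.println("fast=" + fast + " slow=" + slow);
    }
}
```

With something like this, the error-handling path could treat a false return as 
"give up on this directory and carry on" rather than blocking everything else, 
at the cost of possibly leaking one stuck thread per hung directory.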
                
> Namenode gets sticky if one of namenode storage volumes disappears (removed, 
> unmounted, etc.)
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDFS-1603
>                 URL: https://issues.apache.org/jira/browse/HDFS-1603
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.21.0
>            Reporter: Konstantin Boudnik
>
> While investigating failures on HDFS-1602 it became apparent that once a 
> namenode storage volume is pulled out, the NN becomes completely "sticky" until 
> {{FSImage:processIOError: removing storage}} removes the storage from the active 
> set. During this time none of the normal NN operations are possible (e.g. 
> creating a directory on HDFS eventually times out).
> In the case of NFS this can be worked around with the soft,intr,timeo,retrans 
> mount settings. However, better handling of the situation is apparently possible 
> and needs to be implemented.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
