[ 
https://issues.apache.org/jira/browse/HDFS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124547#comment-13124547
 ] 

Aaron T. Myers commented on HDFS-2422:
--------------------------------------

Thanks a lot for the comments, Milind. Answers inline.

bq. I think it is a "good thing" (tm) that NN makes HDFS readonly when nfs is 
not accessible.

I can see arguments for both. In fact, I originally argued in favor of the 
behavior you're describing, but upon further reflection I've changed my 
opinion. At a minimum, whatever policy governs the number of failed volumes 
that can be tolerated when syncing edit logs should also apply when the 
{{NameNodeResourceChecker}} checks for available resources, for the sake of 
consistency.
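
To make the consistency point concrete, here is a minimal sketch of a resource check that applies a tolerated-failures threshold instead of failing on the first low volume. It is not the actual {{NameNodeResourceChecker}} code: the class {{VolumeToleranceSketch}}, its method names, and the {{toleratedFailures}} parameter are all hypothetical. Only the reserved amount of 104857600 bytes comes from the log in the issue description.

```java
import java.io.File;
import java.util.List;
import java.util.function.ToLongFunction;

/** Hypothetical sketch only; not the real NameNodeResourceChecker. */
public class VolumeToleranceSketch {
    // Reserved bytes per volume; 104857600 is the value reported in the issue's log line.
    static final long RESERVED_BYTES = 104857600L;

    /** Counts volumes whose usable space is below the reserved amount. */
    static int countLowVolumes(List<File> volumes, ToLongFunction<File> usableSpace) {
        int low = 0;
        for (File volume : volumes) {
            if (usableSpace.applyAsLong(volume) < RESERVED_BYTES) {
                low++;
            }
        }
        return low;
    }

    /**
     * Returns true while the number of low volumes stays within the tolerated
     * count. The assumption is that toleratedFailures is the same value the
     * edit log sync path uses, so both code paths agree.
     */
    static boolean hasAvailableResources(List<File> volumes, int toleratedFailures,
                                         ToLongFunction<File> usableSpace) {
        return countLowVolumes(volumes, usableSpace) <= toleratedFailures;
    }

    /** Convenience overload that probes the real filesystem. */
    static boolean hasAvailableResources(List<File> volumes, int toleratedFailures) {
        return hasAvailableResources(volumes, toleratedFailures, File::getUsableSpace);
    }
}
```

With two configured directories and a tolerance of one, a single volume reporting 0 bytes would not by itself trip the check, while a tolerance of zero would reproduce the strict behavior reported in this issue.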

bq. HDFS is getting public criticism about "losing" data, and if hdfs 
modifications are allowed by modifying a single destination, then it opens up a 
window for losing data.

The purpose of configuring multiple {{dfs.name.dir}} directories is exactly so 
that the NN can tolerate multiple failures and keep on humming. It's not going 
to lose any data just because one goes offline - it will just write to the 
other directories.

bq. The right thing to do is to return from safemode when the NFS volume 
becomes available again.

Please see [this 
comment|https://issues.apache.org/jira/browse/HDFS-1594?focusedCommentId=13020373&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13020373]
 for the reasoning as to why the {{NameNodeResourceChecker}} doesn't 
automatically take the NN out of safe mode when it detects a volume being low 
on space.
                
> temporary loss of NFS mount causes NN safe mode
> -----------------------------------------------
>
>                 Key: HDFS-2422
>                 URL: https://issues.apache.org/jira/browse/HDFS-2422
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.24.0
>            Reporter: Jeff Bean
>            Assignee: Aaron T. Myers
>
> We encountered a situation where the namenode dropped into safe mode after a 
> temporary outage of an NFS mount.
> At 12:10 the NFS server went offline:
> Oct  8 12:10:05 <namenode> kernel: nfs: server <nfs host> not responding, timed out
> This caused the namenode to conclude that it was low on resources:
> 2011-10-08 12:10:34,848 WARN org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker: Space available on volume '<nfs host>' is 0, which is below the configured reserved amount 104857600
> Temporary loss of NFS mount shouldn't cause safemode.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
