[
https://issues.apache.org/jira/browse/HDFS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124547#comment-13124547
]
Aaron T. Myers commented on HDFS-2422:
--------------------------------------
Thanks a lot for the comments, Milind. Answers inline.
bq. I think it is a "good thing" (tm) that NN makes HDFS readonly when nfs is
not accessible.
I can see arguments for both. In fact, I originally argued in favor of the
behavior you're describing, but upon further reflection I've changed my
opinion. At the very least, whatever policy governs the number of failed
volumes that can be tolerated when syncing edit logs should also be used when
checking for available resources in the {{NameNodeResourceChecker}}, for the
sake of consistency.
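To make the consistency point concrete, here is a rough sketch of a resource
check that tolerates a configurable number of low volumes. The class and
method names are hypothetical and this is not the actual
{{NameNodeResourceChecker}} code; {{maxFailedVolumes}} is an assumed stand-in
for whatever tolerance the edit log sync path applies. The reserved amount is
the 104857600 bytes from the log line in this issue.

```java
/**
 * Hypothetical sketch only; NOT the real NameNodeResourceChecker.
 * It shows a resource check that shares a failed-volume tolerance
 * with the edit log sync policy.
 */
public class VolumeTolerancePolicy {
    /** Reserved bytes per volume; value taken from the log line in this issue. */
    public static final long RESERVED_BYTES = 104857600L;

    /** Counts volumes whose available space is below the reserved amount. */
    public static int countLowVolumes(long[] availableBytes, long reservedBytes) {
        int low = 0;
        for (long available : availableBytes) {
            if (available < reservedBytes) {
                low++;
            }
        }
        return low;
    }

    /**
     * Returns true if no more than maxFailedVolumes volumes are low.
     * maxFailedVolumes is an assumed parameter: the same tolerance that
     * would be applied when syncing edit logs.
     */
    public static boolean hasAvailableResources(long[] availableBytes,
                                                long reservedBytes,
                                                int maxFailedVolumes) {
        return countLowVolumes(availableBytes, reservedBytes) <= maxFailedVolumes;
    }

    public static void main(String[] args) {
        // One healthy local volume, one NFS volume reporting 0 bytes available.
        long[] volumes = {500L * 1024 * 1024, 0L};
        // Strict policy: any low volume counts as a resource problem.
        System.out.println(hasAvailableResources(volumes, RESERVED_BYTES, 0));
        // Tolerant policy: one low volume is acceptable.
        System.out.println(hasAvailableResources(volumes, RESERVED_BYTES, 1));
    }
}
```

With a tolerance of 0 the check fails as soon as the NFS volume reports 0
bytes, which matches the behavior reported here. With a tolerance of 1 the
same outage would not be treated as a resource problem.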
bq. HDFS is getting public criticism about "losing" data, and if hdfs
modifications are allowed by modifying a single destination, then it opens up a
window for losing data.
The purpose of configuring multiple {{dfs.name.dir}} directories is exactly so
that the NN can tolerate multiple failures and keep on humming. It's not going
to lose any data just because one goes offline - it will just write to the
other directories.
bq. The right thing to do is to return from safemode when the NFS volume
becomes available again.
Please see [this
comment|https://issues.apache.org/jira/browse/HDFS-1594?focusedCommentId=13020373&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13020373]
for the reasoning as to why the {{NameNodeResourceChecker}} doesn't
automatically take the NN out of safe mode after it has detected a volume being
low on space.
> temporary loss of NFS mount causes NN safe mode
> -----------------------------------------------
>
> Key: HDFS-2422
> URL: https://issues.apache.org/jira/browse/HDFS-2422
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 0.24.0
> Reporter: Jeff Bean
> Assignee: Aaron T. Myers
>
> We encountered a situation where the namenode dropped into safe mode after a
> temporary outage of an NFS mount.
> At 12:10 the NFS server goes offline
> Oct 8 12:10:05 <namenode> kernel: nfs: server <nfs host> not responding,
> timed out
> This caused the namenode to conclude resource issues:
> 2011-10-08 12:10:34,848 WARN
> org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker: Space
> available on volume '<nfs host>' is 0, which is below the configured reserved
> amount 104857600
> Temporary loss of NFS mount shouldn't cause safemode.