[jira] [Commented] (HDFS-2422) temporary loss of NFS mount causes NN safe mode

Aaron T. Myers (Commented) (JIRA) Mon, 10 Oct 2011 13:48:52 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124466#comment-13124466
 ]


Aaron T. Myers commented on HDFS-2422:
--------------------------------------

Looks like this is happening because {{o.a.h.fs.DF}} reports 0 for "space 
available" on a directory that doesn't exist:

{noformat}
[01:29:11] atm@simon:~$ hadoop org.apache.hadoop.fs.DF /
df -k null
null    72718632        49480712        19543996        73%     null
[01:29:23] atm@simon:~$ hadoop org.apache.hadoop.fs.DF /foo/bar/baz
df -k null
null    0       0       0       0%      null
{noformat}

I'm guessing the particular {{dfs.name.dir}} the NN was writing to was in fact 
a subdirectory of the mount directory, so when the NFS mount went away so did 
the subdirectory, causing DF to return 0.
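
To illustrate the distinction above, here is a minimal sketch that tells a vanished directory apart from a genuinely full volume. It is not the fix proposed in this comment and it does not use {{o.a.h.fs.DF}}; it relies on plain {{java.io.File}} instead, and the class name, enum, and paths are all hypothetical.

```java
import java.io.File;

public class VolumeProbeSketch {
    // Hypothetical states; the real checker only logs a warning on low space.
    enum VolumeState { MISSING, LOW_ON_SPACE, OK }

    /**
     * Classifies a directory. A path that is not an existing directory is
     * reported as MISSING instead of being treated as "0 bytes available".
     */
    static VolumeState probe(File dir, long reservedBytes) {
        if (!dir.isDirectory()) {
            return VolumeState.MISSING;
        }
        // getUsableSpace() is documented to return 0L when the path does not
        // name a partition, which is why the existence check comes first.
        long usable = dir.getUsableSpace();
        return usable < reservedBytes ? VolumeState.LOW_ON_SPACE : VolumeState.OK;
    }

    public static void main(String[] args) {
        long reserved = 104857600L; // reserved amount from the log in the report
        // Hypothetical paths, mirroring the two DF invocations above.
        System.out.println(probe(new File("/"), reserved));
        System.out.println(probe(new File("/foo/bar/baz"), reserved));
    }
}
```

Whether a missing directory should count as a failed volume or as one that is low on space is a policy decision that this sketch leaves open.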

I think this is indicative of a more basic issue with the 
{{NameNodeResourceChecker}} policy, though. When syncing edit logs, the NN is 
designed to tolerate failure of up to N-1 of the configured {{dfs.name.dir}} 
directories, but the {{NameNodeResourceChecker}} will put the NN into safemode 
as soon as a single {{dfs.name.dir}} is low on space. The appropriate 
solution, then, seems to me to be to change the {{NameNodeResourceChecker}} 
to also tolerate up to N-1 directories being low on space.
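
The difference between the two policies can be sketched as follows. This is an illustration under stated assumptions, not the eventual patch: the class and method names are hypothetical, the directory paths are made up, and the only value taken from the report is the reserved amount of 104857600 bytes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ResourcePolicySketch {
    // Reserved amount quoted in the WARN log line of the report (100 MB).
    static final long RESERVED_BYTES = 104857600L;

    /** Behaviour described above: one low directory is enough to fail. */
    static boolean hasResourcesStrict(Map<String, Long> availableByDir) {
        for (long available : availableByDir.values()) {
            if (available < RESERVED_BYTES) {
                return false;
            }
        }
        return true;
    }

    /** Proposed behaviour: tolerate up to N-1 low directories. */
    static boolean hasResourcesTolerant(Map<String, Long> availableByDir) {
        for (long available : availableByDir.values()) {
            if (available >= RESERVED_BYTES) {
                return true; // at least one directory still has room
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Long> dirs = new LinkedHashMap<>();
        dirs.put("/data/1/name", 20_000_000_000L); // hypothetical local dir
        dirs.put("/mnt/nfs/name", 0L);             // hypothetical NFS dir, mount lost
        System.out.println("strict:   " + hasResourcesStrict(dirs));   // strict:   false
        System.out.println("tolerant: " + hasResourcesTolerant(dirs)); // tolerant: true
    }
}
```

With one healthy local directory and one lost NFS directory, the strict check would send the NN into safemode while the tolerant check would not, which matches the edit log sync behaviour of surviving N-1 failures.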

I'll create a patch to do this and upload it shortly.
                
> temporary loss of NFS mount causes NN safe mode
> -----------------------------------------------
>
>                 Key: HDFS-2422
>                 URL: https://issues.apache.org/jira/browse/HDFS-2422
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.24.0
>            Reporter: Jeff Bean
>            Assignee: Aaron T. Myers
>
> We encountered a situation where the namenode dropped into safe mode after a 
> temporary outage of an NFS mount.
> At 12:10 the NFS server went offline:
> Oct  8 12:10:05 <namenode> kernel: nfs: server <nfs host> not responding, 
> timed out
> This caused the namenode to conclude that it was short on resources:
> 2011-10-08 12:10:34,848 WARN 
> org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker: Space 
> available on volume '<nfs host>' is 0, which is below the configured reserved 
> amount 104857600
> A temporary loss of an NFS mount shouldn't cause safe mode.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
