[jira] [Commented] (HDFS-2422) The NN should tolerate the same number of low-resource volumes as failed volumes

Aaron T. Myers (Commented) (JIRA) Mon, 10 Oct 2011 16:36:56 -0700

    [ https://issues.apache.org/jira/browse/HDFS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124574#comment-13124574 ]


Aaron T. Myers commented on HDFS-2422:
--------------------------------------

bq. The transient loss of connectivity to an NFS mount currently appears as if 
the NFS mount is low on space (in fact, has 0 space left). This is unfortunate. 
If there were a way to distinguish between the two (I cannot think of any, but 
others may have an answer), it would be ideal to have the namenode come out of 
safe mode automatically when the transient error goes away.

I'm afraid I also can't think of a way to reliably distinguish between the two. 
We could, for example, check that the directory actually exists (it would not 
if the NFS mount disappeared and the configured {{dfs.name.dir}} were a 
subdirectory of the mount), but a missing directory could obviously have causes 
other than NFS mount failure.
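
To make the directory-existence idea concrete, here is a minimal sketch in plain 
Java. It is not the actual NameNodeResourceChecker code; the class name 
VolumeProbe, the Status enum and the classify method are hypothetical names 
made up for illustration. It relies only on java.io.File, whose 
getUsableSpace() is documented to return 0 when the path does not name a 
partition, which is one way a vanished directory can look like a full volume.

```java
import java.io.File;

// Hypothetical helper, not part of Hadoop: separates "directory is gone"
// from "directory exists but the volume is below the reserved amount".
public class VolumeProbe {

    public enum Status { OK, LOW_SPACE, MISSING }

    public static Status classify(File dir, long reservedBytes) {
        // A missing directory is reported separately. As noted above, this
        // does not prove that an NFS mount failed; the directory could be
        // missing for other reasons.
        if (!dir.isDirectory()) {
            return Status.MISSING;
        }
        // Usable bytes on the partition that holds the directory.
        long usable = dir.getUsableSpace();
        return usable < reservedBytes ? Status.LOW_SPACE : Status.OK;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        // 104857600 bytes, the reserved amount quoted in the log message
        // in the issue description.
        long reserved = 100L * 1024 * 1024;
        System.out.println(dir + ": " + classify(dir, reserved));
    }
}
```

Running it with the configured name directory as the only argument prints one 
of the three states. The sketch only helps when the configured directory is a 
subdirectory of the mount point, and it does not model a hung NFS mount, where 
the file system calls themselves might block.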

Even if there were a way to distinguish between the two, I would probably argue 
for not entering safe mode in the first place, but that's a separate issue.
                
> The NN should tolerate the same number of low-resource volumes as failed 
> volumes
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-2422
>                 URL: https://issues.apache.org/jira/browse/HDFS-2422
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.24.0
>            Reporter: Jeff Bean
>            Assignee: Aaron T. Myers
>         Attachments: HDFS-2422.patch
>
>
> We encountered a situation where the namenode dropped into safe mode after a 
> temporary outage of an NFS mount.
> At 12:10 the NFS server went offline:
> Oct  8 12:10:05 <namenode> kernel: nfs: server <nfs host> not responding, timed out
> This caused the namenode to conclude that there were resource issues:
> 2011-10-08 12:10:34,848 WARN org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker: Space available on volume '<nfs host>' is 0, which is below the configured reserved amount 104857600
> A temporary loss of an NFS mount shouldn't cause safe mode.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
