[ 
https://issues.apache.org/jira/browse/HDFS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124558#comment-13124558
 ] 

Aaron T. Myers commented on HDFS-2422:
--------------------------------------

Thanks for the comments, Milind.

bq. Aaron, the failed volume policy should ensure that at least two volumes are 
up when writing edit logs. If it were only writing to one volume, and staying 
writable, then there is a time period when there is a single up-to-date replica 
of edit logs that can fail and lose modifications (that is why I said the 
window opens for losing data, not that it will definitely lose data).

Ah, I misunderstood your earlier comment. That seems reasonable to me. I've 
filed HDFS-2430 to address this issue.

bq. re: automatically coming out of safemode, I think transient unavailability 
of a volume, and a volume being low on disk space should be treated 
differently. While the second case requires admin intervention, the first case 
does not.

Do you disagree with the reasoning Eli posted in the comment I linked to 
earlier? I found his argument quite compelling. If so, we should probably file 
a separate JIRA for that, along the lines of "The NN should automatically leave 
SM if sufficient resources become available again after they were previously 
low" and continue the discussion there.
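
To make the "at least two volumes" policy discussed above concrete, here is a minimal sketch, not taken from the attached patch. The class and method names (VolumePolicySketch, healthyVolumes, hasSufficientResources) are hypothetical, and it assumes an unreachable volume simply reports 0 bytes available, as in the log line quoted in the issue description.

```java
import java.util.List;

// Hypothetical illustration only; not the NameNodeResourceChecker implementation.
public class VolumePolicySketch {

    /** Counts volumes whose available space is at or above the reserved amount. */
    static int healthyVolumes(List<Long> availableBytes, long reservedBytes) {
        int healthy = 0;
        for (long available : availableBytes) {
            if (available >= reservedBytes) {
                healthy++;
            }
        }
        return healthy;
    }

    /**
     * Returns true if enough volumes are healthy to keep writing edit logs.
     * minHealthyVolumes is an assumed parameter; a value of 2 models the
     * requirement that more than one up-to-date replica always exists.
     */
    static boolean hasSufficientResources(List<Long> availableBytes,
                                          long reservedBytes,
                                          int minHealthyVolumes) {
        return healthyVolumes(availableBytes, reservedBytes) >= minHealthyVolumes;
    }
}
```

With a reserve of 104857600 bytes and a minimum of 2, three volumes where one reports 0 bytes would still pass the check, while two volumes where one reports 0 bytes would not.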
                
> The NN should tolerate the same number of low-resource volumes as failed 
> volumes
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-2422
>                 URL: https://issues.apache.org/jira/browse/HDFS-2422
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.24.0
>            Reporter: Jeff Bean
>            Assignee: Aaron T. Myers
>         Attachments: HDFS-2422.patch
>
>
> We encountered a situation where the namenode dropped into safe mode after a 
> temporary outage of an NFS mount.
> At 12:10 the NFS server goes offline
> Oct  8 12:10:05 <namenode> kernel: nfs: server <nfs host> not responding, 
> timed out
> This caused the namenode to conclude resource issues:
> 2011-10-08 12:10:34,848 WARN 
> org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker: Space 
> available on volume '<nfs host>' is 0, which is below the configured reserved 
> amount 104857600
> Temporary loss of NFS mount shouldn't cause safemode.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
