[jira] [Resolved] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt

Todd Lipcon (JIRA) Wed, 05 Dec 2012 11:14:01 -0800

     [ 
https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Todd Lipcon resolved HDFS-3049.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 2.0.3-alpha

Fixed the extra imports and committed to branch-2, thanks for the reviews.
                
> During the normal loading NN startup process, fall back on a different 
> EditLog if we see one that is corrupt
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3049
>                 URL: https://issues.apache.org/jira/browse/HDFS-3049
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: namenode
>    Affects Versions: 0.23.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>            Priority: Minor
>             Fix For: 3.0.0, 2.0.3-alpha
>
>         Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch, 
> HDFS-3049.003.patch, HDFS-3049.005.against3335.patch, 
> HDFS-3049.006.against3335.patch, HDFS-3049.007.against3335.patch, 
> HDFS-3049.010.patch, HDFS-3049.011.patch, HDFS-3049.012.patch, 
> HDFS-3049.013.patch, HDFS-3049.015.patch, HDFS-3049.017.patch, 
> HDFS-3049.018.patch, HDFS-3049.021.patch, HDFS-3049.023.patch, 
> HDFS-3049.025.patch, HDFS-3049.026.patch, HDFS-3049.027.patch, 
> HDFS-3049.028.patch, HDFS-3049.028.patch, HDFS-3049.028.patch, 
> hdfs-3049-branch-2.txt
>
>
> During the NameNode startup process, we load an image, and then apply edit 
> logs to it until we believe that we have all the latest changes.  
> Unfortunately, if there is an I/O error while reading any of these files, in 
> most cases, we simply abort the startup process.  We should try harder to 
> locate a readable edit log and/or image file.
> *There are three main use cases for this feature:*
> 1. If the operating system does not honor fsync (usually due to a 
> misconfiguration), a file may end up in an inconsistent state.
> 2. In certain older releases where we did not use fallocate() or similar to 
> pre-reserve blocks, a disk full condition may cause a truncated log in one 
> edit directory.
> 3. There may be a bug in HDFS which results in some of the data directories 
> receiving corrupt data, but not all.  This is the least likely use case.
> *Proposed changes to normal NN startup*
> * We should try a different FSImage if we can't load the first one we try.
> * We should examine other FSEditLogs if we can't load the first one(s) we try.
> * We should fail if we can't find EditLogs that would bring us up to what we 
> believe is the latest transaction ID.
> Proposed changes to recovery mode NN startup:
> we should list out all the available storage directories and allow the 
> operator to select which one he wants to use.
> Something like this:
> {code}
> Multiple storage directories found.
> 1. /foo/bar
>     edits__curent__XYZ          size:213421345       md5:2345345
>     image                                  size:213421345       md5:2345345
> 2. /foo/baz
>     edits__curent__XYZ          size:213421345       md5:2345345345
>     image                                  size:213421345       md5:2345345
> Which one would you like to use? (1/2)
> {code}
> As usual in recovery mode, we want to be flexible about error handling.  In 
> this case, this means that we should NOT fail if we can't find EditLogs that 
> would bring us up to what we believe is the latest transaction ID.
> *Not addressed by this feature*
> This feature will not address the case where an attempt to access the 
> NameNode name directory or directories hangs because of an I/O error.  This 
> may happen, for example, when trying to load an image from a hard-mounted NFS 
> directory, when the NFS server has gone away.  Just as now, the operator will 
> have to notice this problem and take steps to correct it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt

Reply via email to