[
https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Colin Patrick McCabe resolved HDFS-3049.
----------------------------------------
Resolution: Fixed
The build failures in the un-mavenized MR tests were handled by Arun in HDFS-3614.
> During the normal loading NN startup process, fall back on a different
> EditLog if we see one that is corrupt
> ------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-3049
> URL: https://issues.apache.org/jira/browse/HDFS-3049
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: name-node
> Affects Versions: 0.23.0
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Priority: Minor
> Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch,
> HDFS-3049.003.patch, HDFS-3049.005.against3335.patch,
> HDFS-3049.006.against3335.patch, HDFS-3049.007.against3335.patch,
> HDFS-3049.010.patch, HDFS-3049.011.patch, HDFS-3049.012.patch,
> HDFS-3049.013.patch, HDFS-3049.015.patch, HDFS-3049.017.patch,
> HDFS-3049.018.patch, HDFS-3049.021.patch, HDFS-3049.023.patch,
> HDFS-3049.025.patch, HDFS-3049.026.patch, HDFS-3049.027.patch,
> HDFS-3049.028.patch, HDFS-3049.028.patch, HDFS-3049.028.patch
>
>
> During the NameNode startup process, we load an image, and then apply edit
> logs to it until we believe that we have all the latest changes.
> Unfortunately, if there is an I/O error while reading any of these files, in
> most cases, we simply abort the startup process. We should try harder to
> locate a readable edit log and/or image file.
> *There are three main use cases for this feature:*
> 1. If the operating system does not honor fsync (usually due to a
> misconfiguration), a file may end up in an inconsistent state.
> 2. In certain older releases where we did not use fallocate() or similar to
> pre-reserve blocks, a disk full condition may cause a truncated log in one
> edit directory.
> 3. There may be a bug in HDFS which results in some, but not all, of the data
> directories receiving corrupt data. This is the least likely use case.
> *Proposed changes to normal NN startup*
> * We should try a different FSImage if we can't load the first one we try.
> * We should examine other FSEditLogs if we can't load the first one(s) we try.
> * We should fail if we can't find EditLogs that would bring us up to what we
> believe is the latest transaction ID.
> *Proposed changes to recovery mode NN startup*
> * We should list all the available storage directories and allow the
> operator to select which one to use.
> Something like this:
> {code}
> Multiple storage directories found.
> 1. /foo/bar
> edits__current__XYZ size:213421345 md5:2345345
> image size:213421345 md5:2345345
> 2. /foo/baz
> edits__current__XYZ size:213421345 md5:2345345345
> image size:213421345 md5:2345345
> Which one would you like to use? (1/2)
> {code}
> As usual in recovery mode, we want to be flexible about error handling. In
> this case, this means that we should NOT fail if we can't find EditLogs that
> would bring us up to what we believe is the latest transaction ID.
> *Not addressed by this feature*
> This feature will not address the case where an attempt to access the
> NameNode name directory or directories hangs because of an I/O error. This
> may happen, for example, when trying to load an image from a hard-mounted NFS
> directory, when the NFS server has gone away. Just as now, the operator will
> have to notice this problem and take steps to correct it.
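
The fallback behavior proposed in the quoted description can be sketched roughly as follows. This is a minimal, language-neutral illustration in Python, not the HDFS implementation: EditLogCandidate, CorruptLogError, MissingEditsError, and load_edits are hypothetical names invented for this sketch, and each edit log copy is modeled as a callable that returns transaction IDs. It assumes that redundant copies of a segment start at the same transaction ID. In normal mode it fails when no readable copy reaches the expected latest transaction ID; in recovery mode it stops and returns what it has.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


class CorruptLogError(Exception):
    """Raised by a reader when an edit log copy cannot be read (hypothetical)."""


class MissingEditsError(Exception):
    """Raised when no readable copy covers a required transaction ID (hypothetical)."""


@dataclass(frozen=True)
class EditLogCandidate:
    # Hypothetical stand-in for one copy of an edit log segment in one directory.
    directory: str
    first_txid: int
    read: Callable[[], List[int]]  # returns txids in order, or raises


def load_edits(candidates: Sequence[EditLogCandidate],
               start_txid: int,
               expected_last_txid: int,
               recovery_mode: bool = False) -> List[int]:
    """Collect txids from start_txid onward, falling back across redundant copies."""
    applied: List[int] = []
    next_txid = start_txid
    while next_txid <= expected_last_txid:
        loaded: List[int] = []
        # Try every copy of the segment that begins at next_txid.
        for copy in candidates:
            if copy.first_txid != next_txid:
                continue
            try:
                loaded = copy.read()
            except (CorruptLogError, OSError):
                loaded = []  # this copy is unreadable; try the next directory
            if loaded:
                break
        if not loaded:
            if recovery_mode:
                break  # recovery mode tolerates a gap and keeps what was read
            raise MissingEditsError(
                "no readable edit log starting at txid %d" % next_txid)
        applied.extend(loaded)
        next_txid = loaded[-1] + 1
    return applied
```

For example, if the copy of the first segment in /foo/bar raises CorruptLogError while the copy in /foo/baz reads cleanly, load_edits uses the /foo/baz copy and carries on with the next segment.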