[ https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon resolved HDFS-3049.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 2.0.3-alpha

Fixed the extra imports and committed to branch-2, thanks for the reviews.

> During the normal loading NN startup process, fall back on a different
> EditLog if we see one that is corrupt
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3049
>                 URL: https://issues.apache.org/jira/browse/HDFS-3049
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: namenode
>    Affects Versions: 0.23.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>            Priority: Minor
>             Fix For: 3.0.0, 2.0.3-alpha
>
>         Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch,
> HDFS-3049.003.patch, HDFS-3049.005.against3335.patch,
> HDFS-3049.006.against3335.patch, HDFS-3049.007.against3335.patch,
> HDFS-3049.010.patch, HDFS-3049.011.patch, HDFS-3049.012.patch,
> HDFS-3049.013.patch, HDFS-3049.015.patch, HDFS-3049.017.patch,
> HDFS-3049.018.patch, HDFS-3049.021.patch, HDFS-3049.023.patch,
> HDFS-3049.025.patch, HDFS-3049.026.patch, HDFS-3049.027.patch,
> HDFS-3049.028.patch, HDFS-3049.028.patch, HDFS-3049.028.patch,
> hdfs-3049-branch-2.txt
>
>
> During the NameNode startup process, we load an image, and then apply edit
> logs to it until we believe that we have all the latest changes.
> Unfortunately, if there is an I/O error while reading any of these files, in
> most cases, we simply abort the startup process. We should try harder to
> locate a readable edit log and/or image file.
> *There are three main use cases for this feature:*
> 1. If the operating system does not honor fsync (usually due to a
> misconfiguration), a file may end up in an inconsistent state.
> 2. In certain older releases where we did not use fallocate() or similar to
> pre-reserve blocks, a disk full condition may cause a truncated log in one
> edit directory.
> 3. There may be a bug in HDFS which results in some of the data directories
> receiving corrupt data, but not all. This is the least likely use case.
> *Proposed changes to normal NN startup*
> * We should try a different FSImage if we can't load the first one we try.
> * We should examine other FSEditLogs if we can't load the first one(s) we try.
> * We should fail if we can't find EditLogs that would bring us up to what we
> believe is the latest transaction ID.
> Proposed changes to recovery mode NN startup:
> we should list out all the available storage directories and allow the
> operator to select which one he wants to use.
> Something like this:
> {code}
> Multiple storage directories found.
> 1. /foo/bar
>     edits__curent__XYZ    size:213421345    md5:2345345
>     image                 size:213421345    md5:2345345
> 2. /foo/baz
>     edits__curent__XYZ    size:213421345    md5:2345345345
>     image                 size:213421345    md5:2345345
> Which one would you like to use? (1/2)
> {code}
> As usual in recovery mode, we want to be flexible about error handling. In
> this case, this means that we should NOT fail if we can't find EditLogs that
> would bring us up to what we believe is the latest transaction ID.
> *Not addressed by this feature*
> This feature will not address the case where an attempt to access the
> NameNode name directory or directories hangs because of an I/O error. This
> may happen, for example, when trying to load an image from a hard-mounted NFS
> directory, when the NFS server has gone away. Just as now, the operator will
> have to notice this problem and take steps to correct it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
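
For readers following along, the proposed changes to normal NN startup in the description above can be sketched roughly as below. This is only an illustration, not the committed patch: the class `EditLogFallbackSketch`, the `SegmentCopy` interface, and `loadEdits` are hypothetical names, and the sketch assumes that loading a copy is all-or-nothing (a failed copy leaves no partial state behind), which the real NameNode code would have to guarantee separately.

```java
import java.io.IOException;
import java.util.List;

public class EditLogFallbackSketch {

    /** Hypothetical stand-in for one on-disk copy of an edit log segment. */
    @FunctionalInterface
    public interface SegmentCopy {
        /** Reads the copy and returns the last transaction ID it contains. */
        long load() throws IOException;
    }

    /**
     * Applies each segment in order, trying the redundant copies of a segment
     * one after another until one loads. Fails if a segment has no readable
     * copy, or if the logs stop short of the expected last transaction ID.
     *
     * @param imageTxId        last transaction ID covered by the loaded image
     * @param expectedLastTxId what we believe is the latest transaction ID
     * @param segments         ordered segments, each with its redundant copies
     */
    public static long loadEdits(long imageTxId, long expectedLastTxId,
                                 List<List<SegmentCopy>> segments) throws IOException {
        long lastTxId = imageTxId;
        for (List<SegmentCopy> copies : segments) {
            IOException lastError = null;
            boolean loaded = false;
            for (SegmentCopy copy : copies) {
                try {
                    lastTxId = copy.load();
                    loaded = true;
                    break; // this copy was readable, no need to try the others
                } catch (IOException e) {
                    lastError = e; // corrupt or unreadable copy: fall back to the next one
                }
            }
            if (!loaded) {
                throw new IOException(
                    "No readable copy of the segment following txid " + lastTxId, lastError);
            }
        }
        // Normal startup is strict about gaps at the end of the log.
        if (lastTxId < expectedLastTxId) {
            throw new IOException("Edit logs end at txid " + lastTxId
                + " but expected to reach " + expectedLastTxId);
        }
        return lastTxId;
    }
}
```

Recovery mode, as described above, would relax the final check instead of throwing, since there the operator has asked for flexible error handling.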