[
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13224748#comment-13224748
]
Colin Patrick McCabe commented on HDFS-3004:
--------------------------------------------
> Wrt edit logs in the namenode directory that "seem" to have a higher txid than
> the current txid, isn't the idea that we have an option to actually truncate
> the last edit from the log? Ie in this patch you're asking if the user would
> like to truncate but not actually truncating
It's a logical truncation-- the content after that point is removed from the
state of the system, not from the files on disk.
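To illustrate the idea, here is a minimal sketch of a logical truncation. All
names in it (LogicalTruncation, Edit, replay, lastGoodTxid) are hypothetical
and are not taken from the patch; the list simply stands in for an on-disk
edit log.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the HDFS-3004 patch code.
final class LogicalTruncation {
    /** One edit-log entry: a transaction id plus an opaque operation. */
    static final class Edit {
        final long txid;
        final String op;

        Edit(long txid, String op) {
            this.txid = txid;
            this.op = op;
        }
    }

    /**
     * Replays edits into a fresh in-memory state, stopping before the first
     * edit whose txid is greater than lastGoodTxid. The input list (standing
     * in for the on-disk log) is only read, never modified.
     */
    static List<String> replay(List<Edit> log, long lastGoodTxid) {
        List<String> state = new ArrayList<>();
        for (Edit e : log) {
            if (e.txid > lastGoodTxid) {
                break; // everything from here on is logically truncated
            }
            state.add(e.op);
        }
        return state;
    }
}
```

Because the log itself is untouched, calling replay again with a different
lastGoodTxid gives a different result from the same input.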
I don't want to actually modify the old edit logs. If I did that, it would be
a big hassle for everyone concerned. For example, what happens if I get an I/O
error while writing to the old edit logs? Considering we're handling bad
on-disk data, there's a higher-than-usual chance of that happening. Then
suddenly all the edits directories are no longer the same-- my changes got
applied to some, but not all. Etc.
Also, the way I intend this to be used is that the admin might start up the
system, decide that he didn't like the way he resolved the corruption (maybe a
crucial file is missing?) and try it again. With this patch, he can easily do
this by deleting the new image, changing the last seen txid, and simply having
another go. If I start messing with or truncating the old logs, this becomes
much harder.
> Is the move of the re-check of maxSeenTxid cleanup or actually necessary now?
> I agree the re-check doesn't look necessary though now we bail before adding
> found images if we can't find the maxSeenTxId in the SD images, not sure
> that's OK.
Yeah, the re-check was never necessary. It seems to have been a typo.
Also, previously, we didn't catch the exception that might be thrown by
readTransactionIdFile, which could lead to aborting the whole NN startup
process because one directory was bad. The current solution is to ignore
directories with missing txId files.
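As a rough sketch of that behaviour (not the patch code): the class
SeenTxidScan, its methods, and the txidFileName parameter below are all
hypothetical, and the real readTransactionIdFile may look quite different.
The point is only that the exception is caught per directory, so one bad
directory does not abort the whole scan.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical illustration only; names are not from the HDFS code base.
final class SeenTxidScan {
    /** Reads a single decimal transaction id from the given file. */
    static long readTxidFile(Path file) throws IOException {
        // Throws an IOException subclass if the file is missing or unreadable.
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
        try {
            return Long.parseLong(text);
        } catch (NumberFormatException e) {
            throw new IOException("Malformed txid file: " + file, e);
        }
    }

    /**
     * Returns the highest txid recorded in any storage directory, or -1 if
     * no directory has a readable txid file.
     */
    static long maxSeenTxid(List<Path> storageDirs, String txidFileName) {
        long max = -1;
        for (Path dir : storageDirs) {
            try {
                max = Math.max(max, readTxidFile(dir.resolve(txidFileName)));
            } catch (IOException e) {
                // Ignore this directory instead of failing the whole startup.
                System.err.println("Skipping " + dir + ": " + e);
            }
        }
        return max;
    }
}
```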
I don't know how we would even go about handling an image whose last seen
transaction ID was unknown. If we do decide to handle that case, I would argue
we should probably file a separate JIRA, rather than trying to cram it into
this patch.
thanks,
C.
> Implement Recovery Mode
> -----------------------
>
> Key: HDFS-3004
> URL: https://issues.apache.org/jira/browse/HDFS-3004
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: tools
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Attachments: HDFS-3004.006.patch,
> HDFS-3004__namenode_recovery_tool.txt
>
>
> When the NameNode metadata is corrupt for some reason, we want to be able to
> fix it. Obviously, we would prefer never to get into this situation. In a perfect
> world, we never would. However, bad data on disk can happen from time to
> time, because of hardware errors or misconfigurations. In the past we have
> had to correct it manually, which is time-consuming and which can result in
> downtime.
> Recovery mode is initiated by the system administrator. When the NameNode
> starts up in Recovery Mode, it will try to load the FSImage file, apply all
> the edits from the edits log, and then write out a new image. Then it will
> shut down.
> Unlike in the normal startup process, the recovery mode startup process will
> be interactive. When the NameNode finds something that is inconsistent, it
> will prompt the operator as to what it should do. The operator can also
> choose to take the first option for all prompts by starting up with the '-f'
> flag, or typing 'a' at one of the prompts.
> I have reused as much code as possible from the NameNode in this tool.
> Hopefully, the effort that was spent developing this will also make the
> NameNode editLog and image processing even more robust than it already is.
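The interactive behaviour described in the issue summary above could look
roughly like the sketch below. The class RecoveryPrompt and its ask method
are hypothetical names chosen for illustration, not the API from the patch;
the alwaysFirst constructor argument stands in for starting with '-f', and
typing 'a' switches to the same mode part-way through.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintStream;
import java.util.List;

// Hypothetical illustration only; this is not the HDFS-3004 patch code.
final class RecoveryPrompt {
    private final BufferedReader in;
    private final PrintStream out;
    private boolean alwaysFirst;

    RecoveryPrompt(BufferedReader in, PrintStream out, boolean alwaysFirst) {
        this.in = in;
        this.out = out;
        this.alwaysFirst = alwaysFirst;
    }

    /**
     * Asks the operator to pick one of the choices and returns the pick.
     * Assumes "a" is never itself one of the choices.
     */
    String ask(String question, List<String> choices) throws IOException {
        if (alwaysFirst) {
            return choices.get(0); // no prompt once "always" is in effect
        }
        while (true) {
            out.println(question + " " + choices
                + " (or 'a' to always take the first option)");
            String line = in.readLine();
            if (line == null) {
                throw new IOException("End of input while waiting for an answer");
            }
            String answer = line.trim();
            if (answer.equals("a")) {
                alwaysFirst = true;
                return choices.get(0);
            }
            if (choices.contains(answer)) {
                return answer;
            }
            out.println("Unrecognized answer: " + answer);
        }
    }
}
```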