[
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13224750#comment-13224750
]
Colin Patrick McCabe commented on HDFS-3004:
--------------------------------------------
Oops, forgot to post the second half of my commentary. That's what I get for
using an external editor. Here goes:
> logTruncateMessage should probably be WARN instead of ERROR since we're doing
> it intentionally (ie this code path isn't an error case), but we want it to
> have a high log level so we always see it.
Yeah.
> In the arg checking loop can just test for one additional argument rather
> than looping since we only support 1 argument
The rationale for this is that if we don't do it, someone might pass a command
line like: namenode -recover -f -backup
Since the last StartupOption wins, this would lead to us NOT starting up in
recovery mode at all. I felt that this was confusing and would rather it just
be a parse error.
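As a rough sketch of what I mean (class and method names here are mine, not the actual NameNode parser), the stricter check accepts only '-f' after '-recover' and rejects any other trailing option instead of letting it silently win:

```java
import java.util.Arrays;
import java.util.List;

public class RecoverArgCheck {
    // Valid recovery invocations: "-recover", optionally followed by "-f".
    // Any other trailing StartupOption (e.g. "-backup") becomes a parse
    // error, rather than silently overriding recovery mode because the
    // last option wins.
    static boolean isValidRecoveryInvocation(String... args) {
        List<String> a = Arrays.asList(args);
        int i = a.indexOf("-recover");
        if (i < 0) {
            return false;              // not a recovery-mode command line
        }
        for (int j = i + 1; j < a.size(); j++) {
            if (!a.get(j).equals("-f")) {
                return false;          // e.g. "-recover -f -backup"
            }
        }
        return true;
    }
}
```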
> Looks like loadEditRecords used to throw EditLogInputException in cases it
> now throws IOE. Also, let's pull the recovery code out to a separate method
> vs implementing inline in the catch block. It may even make sense to have a
> separate loadEditRecordsWithRecovery method
EditLogInputException was added to the code very recently, by the HA merge, so
I should probably consult with the original author. By my reading, though,
EditLogInputException isn't caught anywhere; it's always just treated as an
IOE. I removed it because it didn't seem helpful.
We're decoding the edit log, so logically everything is an
EditLogInputException, right? Not helpful.
The distinction that I was trying to make is between errors that let you skip
a few edit log entries (example: a bad checksum) and errors that can't be
skipped over (example: a premature end of file). I would rather not worry too
much about the exception hierarchy now, and instead address it in a future
patch that adds full support for skipping transactions... if that makes sense.
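To make the distinction concrete, here is a simplified sketch (the class and method names are illustrative, not the real HDFS types): a per-entry checksum failure can be skipped and replay continues, while a premature end of file ends replay entirely:

```java
import java.io.EOFException;
import java.io.IOException;

public class EditReplaySketch {
    // Stand-in for a per-entry decode failure that recovery can skip past.
    static class ChecksumException extends IOException {}

    // Stand-in for the edit log reader: returns the next op, or null at a
    // clean end of log.
    interface EditSource {
        Object nextOp() throws IOException;
    }

    // Replays ops, optionally skipping entries that fail their checksum.
    // Returns the number of ops applied.
    static int replay(EditSource in, boolean skipBroken) throws IOException {
        int applied = 0;
        while (true) {
            try {
                Object op = in.nextOp();
                if (op == null) {
                    break;             // clean end of log
                }
                applied++;             // apply the op here
            } catch (ChecksumException e) {
                if (!skipBroken) {
                    throw e;           // strict mode: fail the load
                }
                // skippable error: resync past the bad entry and continue
            } catch (EOFException e) {
                break;                 // premature EOF: nothing more to decode
            }
        }
        return applied;
    }
}
```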
> Needs some more test cases
Yeah... definitely. I'll try to expand the test coverage.
C.
> Implement Recovery Mode
> -----------------------
>
> Key: HDFS-3004
> URL: https://issues.apache.org/jira/browse/HDFS-3004
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: tools
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Attachments: HDFS-3004.006.patch,
> HDFS-3004__namenode_recovery_tool.txt
>
>
> When the NameNode metadata is corrupt for some reason, we want to be able to
> fix it. Obviously, we would prefer never to get into this state; in a
> perfect world, we never would. However, bad data on disk can happen from
> time to time, because of hardware errors or misconfigurations. In the past
> we have had to correct it manually, which is time-consuming and can result
> in downtime.
> Recovery Mode is initiated by the system administrator. When the NameNode
> starts up in Recovery Mode, it will try to load the FSImage file, apply all
> the edits from the edit log, and then write out a new image. Then it will
> shut down.
> Unlike the normal startup process, the Recovery Mode startup process is
> interactive. When the NameNode finds something inconsistent, it will prompt
> the operator as to what it should do. The operator can also choose to take
> the first option for all prompts by starting up with the '-f' flag, or by
> typing 'a' at any prompt.
> I have reused as much code as possible from the NameNode in this tool.
> Hopefully, the effort that was spent developing this will also make the
> NameNode editLog and image processing even more robust than it already is.
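The prompting behavior described in the issue above could be sketched like this (a simplified illustration; the real tool's prompt text, options, and class names may differ). Passing '-f' at startup, or typing 'a' at a prompt, makes every subsequent prompt take its first option automatically:

```java
import java.util.Scanner;

public class RecoveryPrompt {
    private boolean always;   // set by the '-f' flag, or by typing 'a'

    RecoveryPrompt(boolean force) {
        this.always = force;
    }

    // Asks the operator to choose among the options; option 0 is the
    // default. Returns the chosen option's index.
    int ask(String question, String[] options, Scanner in) {
        if (always) {
            return 0;          // non-interactive: always take the first option
        }
        System.out.println(question);
        for (int i = 0; i < options.length; i++) {
            System.out.println("  " + (i + 1) + ") " + options[i]);
        }
        System.out.print("Choose (1-" + options.length
                + ", or 'a' for always-first): ");
        String line = in.nextLine().trim();
        if (line.equalsIgnoreCase("a")) {
            always = true;     // take the first option from now on
            return 0;
        }
        return Integer.parseInt(line) - 1;
    }
}
```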
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira