[
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13224750#comment-13224750
]
Colin Patrick McCabe commented on HDFS-3004:
--------------------------------------------
Oops, forgot to post the second half of my commentary. That's what I get for
using an external editor. Here goes:
> logTruncateMessage should probably be WARN instead of ERROR since we're doing
> it intentionally (ie this code path isn't an error case), but we want it to
> have a high log level so we always see it.
Yeah.
> In the arg checking loop can just test for one additional argument rather
> than looping since we only support 1 argument
The rationale for this is that if we don't do it, someone might pass a command
line like: namenode -recover -f -backup
Since the last StartupOption wins, this would lead to us NOT starting up in
recovery mode at all. I felt that this was confusing and would rather it just
be a parse error.
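As a rough sketch of what I mean (class and method names here are mine, not the actual NameNode parser), the stricter check accepts only '-f' after '-recover' and rejects any other trailing option instead of letting it silently win:

```java
import java.util.Arrays;
import java.util.List;

public class RecoverArgCheck {
    // Valid recovery invocations: "-recover", optionally followed by "-f".
    // Any other trailing StartupOption (e.g. "-backup") becomes a parse
    // error, rather than silently overriding recovery mode because the
    // last option wins.
    static boolean isValidRecoveryInvocation(String... args) {
        List<String> a = Arrays.asList(args);
        int i = a.indexOf("-recover");
        if (i < 0) {
            return false;              // not a recovery-mode command line
        }
        for (int j = i + 1; j < a.size(); j++) {
            if (!a.get(j).equals("-f")) {
                return false;          // e.g. "-recover -f -backup"
            }
        }
        return true;
    }
}
```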
> Looks like loadEditRecords used to throw EditLogInputException in cases it
> now throws IOE. Also, let's pull the recovery code out to a separate method
> vs implementing inline in the catch block. It may even make sense to have a
> separate loadEditRecordsWithRecovery method
EditLogInputException was added to the code very recently, by the HA merge, so
I should probably consult with the original author. By my reading, though,
EditLogInputException isn't caught anywhere; it's always just treated as an
IOE. I removed it because it didn't seem helpful.
We're decoding the edit log, so logically everything is an
EditLogInputException, right? Not helpful.
The distinction that I was trying to make is between errors that let you skip
a few edit log entries (example: a bad checksum) and errors that can't be
skipped over (example: a premature end of file). I would rather not worry too
much about the exception hierarchy now, and instead address it in a future
patch that adds full support for skipping transactions... if that makes sense.
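To make the distinction concrete, here is a simplified sketch (the class and method names are illustrative, not the real HDFS types): a per-entry checksum failure can be skipped and replay continues, while a premature end of file ends replay entirely:

```java
import java.io.EOFException;
import java.io.IOException;

public class EditReplaySketch {
    // Stand-in for a per-entry decode failure that recovery can skip past.
    static class ChecksumException extends IOException {}

    // Stand-in for the edit log reader: returns the next op, or null at a
    // clean end of log.
    interface EditSource {
        Object nextOp() throws IOException;
    }

    // Replays ops, optionally skipping entries that fail their checksum.
    // Returns the number of ops applied.
    static int replay(EditSource in, boolean skipBroken) throws IOException {
        int applied = 0;
        while (true) {
            try {
                Object op = in.nextOp();
                if (op == null) {
                    break;             // clean end of log
                }
                applied++;             // apply the op here
            } catch (ChecksumException e) {
                if (!skipBroken) {
                    throw e;           // strict mode: fail the load
                }
                // skippable error: resync past the bad entry and continue
            } catch (EOFException e) {
                break;                 // premature EOF: nothing more to decode
            }
        }
        return applied;
    }
}
```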
> Needs some more test cases
Yeah... definitely. I'll try to expand the test coverage.
C.
> Implement Recovery Mode
> -----------------------
>
> Key: HDFS-3004
> URL: https://issues.apache.org/jira/browse/HDFS-3004
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: tools
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Attachments: HDFS-3004.006.patch,
> HDFS-3004__namenode_recovery_tool.txt
>
>
> When the NameNode metadata is corrupt for some reason, we want to be able to
> fix it. Obviously, we would prefer never to get into this state; in a
> perfect world, we never would. However, bad data on disk can happen from
> time to time, because of hardware errors or misconfigurations. In the past
> we have had to correct it manually, which is time-consuming and can result
> in downtime.
> Recovery Mode is initiated by the system administrator. When the NameNode
> starts up in Recovery Mode, it will try to load the FSImage file, apply all
> the edits from the edit log, and then write out a new image. Then it will
> shut down.
> Unlike the normal startup process, the Recovery Mode startup process is
> interactive. When the NameNode finds something inconsistent, it will prompt
> the operator as to what it should do. The operator can also choose to take
> the first option for all prompts by starting up with the '-f' flag, or by
> typing 'a' at any prompt.
> I have reused as much code as possible from the NameNode in this tool.
> Hopefully, the effort that was spent developing this will also make the
> NameNode editLog and image processing even more robust than it already is.
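The prompting behavior described in the issue above could be sketched like this (a simplified illustration; the real tool's prompt text, options, and class names may differ). Passing '-f' at startup, or typing 'a' at a prompt, makes every subsequent prompt take its first option automatically:

```java
import java.util.Scanner;

public class RecoveryPrompt {
    private boolean always;   // set by the '-f' flag, or by typing 'a'

    RecoveryPrompt(boolean force) {
        this.always = force;
    }

    // Asks the operator to choose among the options; option 0 is the
    // default. Returns the chosen option's index.
    int ask(String question, String[] options, Scanner in) {
        if (always) {
            return 0;          // non-interactive: always take the first option
        }
        System.out.println(question);
        for (int i = 0; i < options.length; i++) {
            System.out.println("  " + (i + 1) + ") " + options[i]);
        }
        System.out.print("Choose (1-" + options.length
                + ", or 'a' for always-first): ");
        String line = in.nextLine().trim();
        if (line.equalsIgnoreCase("a")) {
            always = true;     // take the first option from now on
            return 0;
        }
        return Integer.parseInt(line) - 1;
    }
}
```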
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira