[jira] [Commented] (HDFS-3004) Implement Recovery Mode

Todd Lipcon (Commented) (JIRA) Mon, 19 Mar 2012 17:44:05 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233065#comment-13233065
 ]


Todd Lipcon commented on HDFS-3004:
-----------------------------------

- The docs reference a '-f' flag, but the code seems to be using 
{{-chooseFirst}}.
- Regarding above, I think {{alwaysChooseFirst}} in the code is actually hard 
to understand. I'd prefer {{autoChooseDefault}} - the fact that the default is 
listed first is sort of ancillary. Similarly, rename {{firstChoice}} to 
{{defaultChoice}} in the {{ask}} function.
- I think "q/quit" should probably be an option for all {{ask()}} calls.
- Rename {{StartupOption.getRecoveryContext}} to 
{{StartupOption.createRecoveryContext}} since it creates a new one rather than 
returning a singleton
- There are a few unrelated changes in FSEditLogLoader that change formatting 
of code that wasn't touched in the patch
- You can't remove EditLogInputException. It's important for EditLogTailer to 
be able to distinguish a transient issue with reading input from the shared 
directory from an actual invalid operation -- we want to fail fast in the case 
of the invalid operation, whereas we want to keep retrying in the case of a 
transient IO error. See HDFS-2709 for details. The fact that you had to change 
TestFailureToReadEdits is a red flag here.
- Does RecoveryContext need to be a public class? If so, you need to add an 
InterfaceAudience annotation to mark it as Private audience
- Same with OpInstanceCache
- It seems strange to be passing OpInstanceCache to all of the 
{{getInstance()}} methods. Instead, can we kill off all the {{getInstance}} 
methods, and add something like the following to OpInstanceCache:

{code}
  @SuppressWarnings("unchecked")
  public <T extends FSEditLogOp> T getInstance(FSEditLogOpCodes code, Class<T> 
clazz) {
    return (T)inst.get(code);
  }
{code}

this could happen in a separate follow-up JIRA though.
- Move {{logTruncateMessage}} down lower in the file near the other logging 
stuff, since it's just a utility method. Also, no need to use StringBuilder 
here since it's not performance-critical. Easier to read simple string 
concatenation.
- In FileJournalManager, please retain the log message when reading from the 
middle of a stream.


- I think the {{skipUntil}} API should actually throw an exception if it could 
not skip to the requested txid -- we always expect to be able to do that, with 
the current usage. Then, {{putOp()}} could be made private, which would keep 
the EditLogInputStream API cleaner (i.e not leak the fact that there's a 
one-element buffer hiding inside). I'm actually not clear on why this change is 
part of this patch at all -- how is this change necessary in recovery mode?
- Rather than adding the new {{skipBrokenEdits}} to {{nextOp()}} and 
{{readOp()}} I think it's better to add a new API, like {{void 
skipUntilNextValidEdit()}} or {{void resyncToValidEdit()}}. Then the next call 
to {{readOp()}} would return that edit. I think this is easier to understand, 
and parallels the SequenceFile.Reader API we have elsewhere in Hadoop.

- It seems like, in the case that an error is a mismatched transaction ID, we 
want one of the recovery choices to be "apply the edit anyway" -- eg imagine 
that for some reason our log just skips from txid 1 to txid 3. When we hit txid 
3, we'll see an error that we expected a different one. But we'd like to 
continue applying from txid 3, ignoring the gap, right?

----
{code}
-            if (txId != expectedTxId) {
-              throw new IOException("Expected transaction ID " +
-                  expectedTxId + " but got " + txId);
+            if (!skipBrokenEdits) {
+              if (op.txid != expectedTxId) {
{code}
Rather than nested ifs, just {{&&}} the two conditions

----
Formatting nit:
{code}
+  /** Display a prompt to the user and get his or her choice.
+   *  
{code}
The first line of text in the javadoc should be on the line after the {{/**}}

                
> Implement Recovery Mode
> -----------------------
>
>                 Key: HDFS-3004
>                 URL: https://issues.apache.org/jira/browse/HDFS-3004
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: tools
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
> HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
> HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
> HDFS-3004.019.patch, HDFS-3004__namenode_recovery_tool.txt
>
>
> When the NameNode metadata is corrupt for some reason, we want to be able to 
> fix it.  Obviously, we would prefer never to get in this case.  In a perfect 
> world, we never would.  However, bad data on disk can happen from time to 
> time, because of hardware errors or misconfigurations.  In the past we have 
> had to correct it manually, which is time-consuming and which can result in 
> downtime.
> Recovery mode is initialized by the system administrator.  When the NameNode 
> starts up in Recovery Mode, it will try to load the FSImage file, apply all 
> the edits from the edits log, and then write out a new image.  Then it will 
> shut down.
> Unlike in the normal startup process, the recovery mode startup process will 
> be interactive.  When the NameNode finds something that is inconsistent, it 
> will prompt the operator as to what it should do.   The operator can also 
> choose to take the first option for all prompts by starting up with the '-f' 
> flag, or typing 'a' at one of the prompts.
> I have reused as much code as possible from the NameNode in this tool.  
> Hopefully, the effort that was spent developing this will also make the 
> NameNode editLog and image processing even more robust than it already is.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HDFS-3004) Implement Recovery Mode

Reply via email to