[jira] [Commented] (HDFS-3004) Implement Recovery Mode

Todd Lipcon (Commented) (JIRA) Sat, 31 Mar 2012 17:56:54 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243623#comment-13243623
 ]


Todd Lipcon commented on HDFS-3004:
-----------------------------------

Looking pretty good, just style nits:

{code}
+    /* This is a trivial implementation which just assumes that any errors mean
+     * that there is nothing more of value in the log.  You can override this.
+     */
{code}
Style: use // comments for inline comments. I think it's clearer to say 
"subclasses will likely want to override this" instead of "You can override 
this". The other option, which I think is somewhat reasonable, is to just do: 
{{throw new UnsupportedOperationException(this.getClass() + " does not support 
resyncing to next edit");}} since it seems like the implementation that's there 
now is worse than just failing.

----
{code}
+   * After this function returns, the next call to reaOp will return either
{code}
typo: readOp

----

bq. Basically, we NEVER want to apply a transaction that has a lower or equal 
ID to the previous one. That's why the ''continue'' is there in the else 
clause. We will try to recover from an edit log with gaps in it, though. (That 
is sort of the point of recovery).

I can see cases where we might want to "apply the out-of-order edit anyway". 
But let's leave that for a follow-up JIRA.

----

{code}
+            LOG.error("Encountered exception on operation " +
+              op.toString() + ": \n" + StringUtils.stringifyException(e));
{code}
Use the second argument of LOG.error to log the exception, rather than using 
stringifyException.

----
{code}
+    }
+    finally {
{code}
style: combine to one line. Also an example of this inside 
EltsTestGarbageInEditLog later in the patch.

----
{code}
+            /* If we encountered an exception or an end-of-file condition,
+             * do not advance the input stream. */
{code}
// comments

----
{code}
+          if (!skipBrokenEdits)
+            throw e;
+          if (in.skip(1) < 1)
+            return null;
...
+      if (response.equalsIgnoreCase(firstChoice))
+        return firstChoice;
...
+    if (operation == StartupOption.RECOVER)
+      return;
{code}
need {} braces (also a few other places).

For simple "return" or "break", you can put it on the same line as the if, 
without braces, if it fits, but otherwise, we always use {}s

----
{code}
+        } else {
+        assertEquals(prevTxId, elts.getLastValidTxId());
+        }
{code}
indentation

----
{code}
+  public static Set<Long> setFromArray(long[] arr) {
...
{code}

Instead of this, you can just use Sets.newHashSet(1,2,3...) from guava

----

{code}
+      Set <Long> validTxIds = elts.getValidTxIds();
{code}
no space between {{Set}} and {{<Long>}}

----
{code}
+      if ((elfos != null) && (elfos.isOpen()))
+        elfos.close();
+      if (elfis != null)
+        elfis.close();
{code}
use IOUtils.cleanup

                
> Implement Recovery Mode
> -----------------------
>
>                 Key: HDFS-3004
>                 URL: https://issues.apache.org/jira/browse/HDFS-3004
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: tools
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
> HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
> HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
> HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
> HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, 
> HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, 
> HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, 
> HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, 
> HDFS-3004__namenode_recovery_tool.txt
>
>
> When the NameNode metadata is corrupt for some reason, we want to be able to 
> fix it.  Obviously, we would prefer never to get in this case.  In a perfect 
> world, we never would.  However, bad data on disk can happen from time to 
> time, because of hardware errors or misconfigurations.  In the past we have 
> had to correct it manually, which is time-consuming and which can result in 
> downtime.
> Recovery mode is initialized by the system administrator.  When the NameNode 
> starts up in Recovery Mode, it will try to load the FSImage file, apply all 
> the edits from the edits log, and then write out a new image.  Then it will 
> shut down.
> Unlike in the normal startup process, the recovery mode startup process will 
> be interactive.  When the NameNode finds something that is inconsistent, it 
> will prompt the operator as to what it should do.   The operator can also 
> choose to take the first option for all prompts by starting up with the '-f' 
> flag, or typing 'a' at one of the prompts.
> I have reused as much code as possible from the NameNode in this tool.  
> Hopefully, the effort that was spent developing this will also make the 
> NameNode editLog and image processing even more robust than it already is.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HDFS-3004) Implement Recovery Mode

Reply via email to