[jira] [Commented] (HDFS-1981) When namenode goes down while checkpointing and if is started again subsequent Checkpointing is always failing

Konstantin Shvachko (JIRA) Fri, 24 Jun 2011 17:53:13 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054781#comment-13054781
 ]


Konstantin Shvachko commented on HDFS-1981:
-------------------------------------------

Not sure what introduced it, but the problem is that NN does not call 
saveNamespace() when editsNew is present.
This only happens in Ramakrishna's scenario, when editsNew is empty. That is, 
when you start the checkpoint and the NN fails before anything in the 
namespace is modified.

Deleting editsNew is probably valid, but not consistent, since at this stage 
NN is in read-only mode. That is, if something goes wrong we should leave the 
storage directory in exactly the same state as it was before the startup.

I propose to increment numEdits if editsNew exists. This will trigger saving 
namespace after loading. So just one line change:
{code}
  if (editsNew.exists() && editsNew.length() > 0) {
+   numEdits++;
    edits = new EditLogFileInputStream(editsNew);
    numEdits += loader.loadFSEdits(edits);
    edits.close();
  }
{code}
Well, maybe not one line, as you need to increment even if 
{{editsNew.length() == 0}}.
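
To make the empty-file case concrete, here is a minimal, self-contained model of the counting logic. It is only a sketch, not the actual FSImage code: the class {{EditsNewCountSketch}} and the methods {{countEdits}} and {{needsSave}} are hypothetical names, and it assumes that a non-zero numEdits is what triggers saving the namespace after loading.

```java
// Sketch only: a simplified model of the proposal, not the real FSImage code.
// All names here are hypothetical and exist purely for illustration.
public class EditsNewCountSketch {

    /**
     * Returns the value numEdits would have after loading.
     *
     * @param editsFromEdits    records already loaded from the regular edits file
     * @param editsNewExists    whether editsNew is present in the storage directory
     * @param editsNewLength    size of editsNew in bytes (0 for an empty file)
     * @param recordsInEditsNew records a loader would read from editsNew
     */
    static int countEdits(int editsFromEdits, boolean editsNewExists,
                          long editsNewLength, int recordsInEditsNew) {
        int numEdits = editsFromEdits;
        if (editsNewExists) {
            // Count the mere presence of editsNew, even when it is empty,
            // so that the namespace is saved after loading.
            numEdits++;
            if (editsNewLength > 0) {
                // Only a non-empty editsNew has records to load.
                numEdits += recordsInEditsNew;
            }
        }
        return numEdits;
    }

    /** Assumption: the namespace is saved whenever numEdits is non-zero. */
    static boolean needsSave(int numEdits) {
        return numEdits > 0;
    }

    public static void main(String[] args) {
        // Empty editsNew left behind by an interrupted checkpoint.
        System.out.println(needsSave(countEdits(0, true, 0L, 0)));
        // No editsNew and no edits: nothing to save.
        System.out.println(needsSave(countEdits(0, false, 0L, 0)));
    }
}
```

The increment sits outside the length check, so an empty editsNew still forces a save, while the storage directory itself is left untouched during loading.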

Your test should work in this case as well. Could you please convert it to 
JUnit4 and use {{MiniDFSCluster.Builder}} instead of a direct constructor?

> When namenode goes down while checkpointing and if is started again 
> subsequent Checkpointing is always failing
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-1981
>                 URL: https://issues.apache.org/jira/browse/HDFS-1981
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.22.0
>         Environment: Linux
>            Reporter: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.22.0
>
>         Attachments: HDFS-1981.patch
>
>
> This scenario applies to both the NN and BNN cases.
> When the namenode goes down after creating edits.new, on the subsequent 
> restart divertFileStreams does not happen to edits.new, as the edits.new 
> file is already present and its size is zero.
> So on trying to saveCheckPoint an exception occurs:
> 2011-05-23 16:38:57,476 WARN org.mortbay.log: /getimage: java.io.IOException: 
> GetImage failed. java.io.IOException: Namenode has an edit log with timestamp 
> of 2011-05-23 16:38:56 but new checkpoint was created using editlog  with 
> timestamp 2011-05-23 16:37:30. Checkpoint Aborted.
> Is this a bug, or is that the expected behaviour?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
