[ https://issues.apache.org/jira/browse/HADOOP-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529964 ]
Hadoop QA commented on HADOOP-1076:
-----------------------------------

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12366319/secondaryRestart4.patch
against trunk revision r578879.

    @author +1. The patch does not contain any @author tags.

    javadoc +1. The javadoc tool did not generate any warning messages.

    javac +1. The applied patch does not generate any new compiler warnings.

    findbugs +1. The patch does not introduce any new Findbugs warnings.

    core tests +1. The patch passed core unit tests.

    contrib tests +1. The patch passed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/815/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/815/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/815/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/815/console

This message is automatically generated.

> Periodic checkpointing cannot resume if the secondary name-node fails.
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-1076
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1076
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Konstantin Shvachko
>            Assignee: dhruba borthakur
>             Fix For: 0.15.0
>
>         Attachments: secondaryRestart4.patch
>
>
> If the secondary name-node fails during checkpointing, then the primary node
> will have 2 edits files:
> "edits" - the one the current checkpoint is to be based upon.
> "edits.new" - where new name space edits are currently logged.
> The problem is that the primary node cannot do checkpointing as long as the
> "edits.new" file is in place.
> That is, even if the secondary name-node is restarted, periodic checkpointing
> is not going to be resumed.
> In fact, the primary node will keep throwing an exception complaining about
> the existing "edits.new".
> There is only one way to get rid of the edits.new file - to restart the
> primary name-node.
> So, in effect, if the secondary name-node fails you have to restart the whole
> cluster.
> Here is a rather simple modification to the current approach, which we
> discussed with Dhruba.
> 1. When the secondary node requests rollEditLog(), the primary node should
> roll the edit log only if it has not already been rolled. Otherwise the
> existing "edits" file will be used for checkpointing, and the primary node
> will keep accumulating new edits in "edits.new".
> In order to make this work, the primary node should also ignore any
> rollFSImage() requests while it is already performing one. Otherwise the new
> image can become corrupted if two secondary nodes request rollFSImage() at
> the same time.
> 2. Also, after the periodic checkpointing patch HADOOP-227 I see pieces of
> unused code. I noticed one data member, SecondaryNameNode.localName, and at
> least 4 methods in FSEditLog that are not used anywhere. We should remove
> them and others alike if found. Supporting unused code is such a waste of
> time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
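The roll protocol proposed above (roll the edit log only if it has not already been rolled; ignore a rollFSImage() while one is underway) can be sketched as follows. This is a minimal illustrative model, not Hadoop's actual FSEditLog API: the class name FSEditLogSketch and its boolean fields are hypothetical stand-ins for the presence of the "edits"/"edits.new" files, and the methods return false where the real name-node would silently ignore the request.

```java
// Hypothetical sketch of the proposed idempotent roll protocol; names and
// fields are illustrative stand-ins, not Hadoop's real FSEditLog.
class FSEditLogSketch {

    // true while an "edits.new" file exists, i.e. the log is already rolled
    private boolean editsNewExists = false;

    // true while a rollFSImage() is being carried out
    private boolean imageRollInProgress = false;

    // Roll only if a previous roll has not already produced "edits.new".
    // Returns false when the request is ignored, so a restarted secondary
    // simply checkpoints against the existing "edits" file.
    synchronized boolean rollEditLog() {
        if (editsNewExists) {
            return false; // already rolled: reuse "edits" for the checkpoint
        }
        editsNewExists = true; // new namespace edits now go to "edits.new"
        return true;
    }

    // Ignore rollFSImage() while one is already in progress, so two
    // secondary name-nodes rolling at the same time cannot corrupt the image.
    // (In the real system the image install happens outside any single lock,
    // which is why the explicit in-progress flag is needed.)
    synchronized boolean rollFSImage() {
        if (imageRollInProgress) {
            return false; // drop the duplicate request
        }
        imageRollInProgress = true;
        try {
            // ...install the checkpointed image, rename edits.new -> edits...
            editsNewExists = false;
        } finally {
            imageRollInProgress = false;
        }
        return true;
    }
}
```

With this logic a secondary that crashed mid-checkpoint can simply call rollEditLog() again on restart: the call is a no-op, the old "edits" is used for the new checkpoint, and no name-node restart is required.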