[ https://issues.apache.org/jira/browse/HADOOP-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529964 ]
Hadoop QA commented on HADOOP-1076:
-----------------------------------

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12366319/secondaryRestart4.patch
against trunk revision r578879.

    @author +1. The patch does not contain any @author tags.

    javadoc +1. The javadoc tool did not generate any warning messages.

    javac +1. The applied patch does not generate any new compiler warnings.

    findbugs +1. The patch does not introduce any new Findbugs warnings.

    core tests +1. The patch passed core unit tests.

    contrib tests +1. The patch passed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/815/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/815/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/815/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/815/console

This message is automatically generated.

> Periodic checkpointing cannot resume if the secondary name-node fails.
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-1076
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1076
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Konstantin Shvachko
>            Assignee: dhruba borthakur
>             Fix For: 0.15.0
>
>         Attachments: secondaryRestart4.patch
>
>
> If the secondary name-node fails during checkpointing, then the primary node
> will have 2 edits files:
> "edits" - the one the current checkpoint is to be based upon.
> "edits.new" - where new name space edits are currently logged.
> The problem is that the primary node cannot do checkpointing as long as the
> "edits.new" file is in place.
> That is, even if the secondary name-node is restarted, periodic checkpointing
> is not going to be resumed.
> In fact, the primary node will keep throwing an exception complaining about
> the existing "edits.new".
> There is only one way to get rid of the edits.new file - to restart the
> primary name-node.
> So, in effect, if the secondary name-node fails you have to restart the whole
> cluster.
> Here is a rather simple modification to the current approach, which we
> discussed with Dhruba.
> 1. When the secondary node requests rollEditLog(), the primary node should
> roll the edit log only if it has not already been rolled. Otherwise the
> existing "edits" file will be used for checkpointing, and the primary node
> will keep accumulating new edits in "edits.new".
> In order to make this work, the primary node should also ignore any
> rollFSImage() requests while it is already performing one. Otherwise the new
> image can become corrupted if two secondary nodes request rollFSImage() at
> the same time.
> 2. Also, after the periodic checkpointing patch HADOOP-227 I see pieces of
> unused code. I noticed one data member, SecondaryNameNode.localName, and at
> least 4 methods in FSEditLog that are not used anywhere. We should remove
> them and others alike if found. Supporting unused code is such a waste of
> time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
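The roll protocol proposed above (roll the edit log only if it has not already been rolled; ignore a rollFSImage() while one is underway) can be sketched as follows. This is a minimal illustrative model, not Hadoop's actual FSEditLog API: the class name FSEditLogSketch and its boolean fields are hypothetical stand-ins for the presence of the "edits"/"edits.new" files, and the methods return false where the real name-node would silently ignore the request.

```java
// Hypothetical sketch of the proposed idempotent roll protocol; names and
// fields are illustrative stand-ins, not Hadoop's real FSEditLog.
class FSEditLogSketch {

    // true while an "edits.new" file exists, i.e. the log is already rolled
    private boolean editsNewExists = false;

    // true while a rollFSImage() is being carried out
    private boolean imageRollInProgress = false;

    // Roll only if a previous roll has not already produced "edits.new".
    // Returns false when the request is ignored, so a restarted secondary
    // simply checkpoints against the existing "edits" file.
    synchronized boolean rollEditLog() {
        if (editsNewExists) {
            return false; // already rolled: reuse "edits" for the checkpoint
        }
        editsNewExists = true; // new namespace edits now go to "edits.new"
        return true;
    }

    // Ignore rollFSImage() while one is already in progress, so two
    // secondary name-nodes rolling at the same time cannot corrupt the image.
    // (In the real system the image install happens outside any single lock,
    // which is why the explicit in-progress flag is needed.)
    synchronized boolean rollFSImage() {
        if (imageRollInProgress) {
            return false; // drop the duplicate request
        }
        imageRollInProgress = true;
        try {
            // ...install the checkpointed image, rename edits.new -> edits...
            editsNewExists = false;
        } finally {
            imageRollInProgress = false;
        }
        return true;
    }
}
```

With this logic a secondary that crashed mid-checkpoint can simply call rollEditLog() again on restart: the call is a no-op, the old "edits" is used for the new checkpoint, and no name-node restart is required.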