[ https://issues.apache.org/jira/browse/HADOOP-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530243 ]
Hudson commented on HADOOP-1076:
--------------------------------

Integrated in Hadoop-Nightly #250 (See [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/250/])

> Periodic checkpointing cannot resume if the secondary name-node fails.
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-1076
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1076
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Konstantin Shvachko
>            Assignee: dhruba borthakur
>             Fix For: 0.15.0
>
>         Attachments: secondaryRestart4.patch
>
>
> If the secondary name-node fails during checkpointing, the primary node will
> have 2 edits files:
> "edits" - the one the current checkpoint is to be based upon.
> "edits.new" - where new name space edits are currently logged.
> The problem is that the primary node cannot do checkpointing while the
> "edits.new" file is in place.
> That is, even if the secondary name-node is restarted, periodic checkpointing
> is not going to be resumed.
> In fact, the primary node will throw an exception complaining about the
> existing "edits.new".
> There is only one way to get rid of the edits.new file: restart the
> primary name-node.
> So in a way, if the secondary name-node fails, you have to restart the whole
> cluster.
> Here is a rather simple modification to the current approach, which we
> discussed with Dhruba.
> When the secondary node requests rollEditLog(), the primary node should roll
> the edit log only if it has not already been rolled. Otherwise the existing
> "edits" file will be used for checkpointing, and the primary node will keep
> accumulating new edits in "edits.new".
> To make this work, the primary node should also ignore any rollFSImage()
> requests when it has already started to perform one. Otherwise the new image
> can become corrupted if two secondary nodes request rollFSImage() at the
> same time.
> 2. Also, after the periodic checkpointing patch HADOOP-227 I see pieces of
> unused code.
> I noticed one data member, SecondaryNameNode.localName, and at least 4 methods
> in FSEditLog that are not used anywhere. We should remove them and any others
> like them if found.
> Supporting unused code is such a waste of time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
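
For readers following the quoted description, here is a minimal sketch of the proposed behaviour: rollEditLog() rolls only if the log has not already been rolled, and a rollFSImage() request is ignored while another one is in progress. It is an illustration under assumptions, not the code in secondaryRestart4.patch. The class name CheckpointSketch, its boolean fields, and the split into beginRollFSImage() / finishRollFSImage() are hypothetical and do not correspond to Hadoop's actual classes.

```java
/**
 * Hypothetical in-memory model of the primary node's checkpoint state.
 * It only tracks flags; a real name-node would rename files on disk.
 */
class CheckpointSketch {
    // True while "edits.new" exists, i.e. the edit log has been rolled
    // and the checkpoint based on "edits" has not been completed yet.
    private boolean editsNewExists = false;

    // True while an image roll is being performed.
    private boolean imageRollInProgress = false;

    /**
     * Rolls the edit log only if it has not been rolled already.
     *
     * @return true if a fresh "edits.new" was started, false if the log was
     *         already rolled and the existing "edits" file should be reused
     */
    synchronized boolean rollEditLog() {
        if (editsNewExists) {
            // A previous checkpoint attempt (for example by a failed
            // secondary) already rolled the log: reuse "edits" as it is and
            // keep accumulating new edits in "edits.new".
            return false;
        }
        editsNewExists = true;
        return true;
    }

    /**
     * Starts an image roll unless one is already running.
     *
     * @return false if the request is ignored because a roll is in progress
     */
    synchronized boolean beginRollFSImage() {
        if (imageRollInProgress) {
            return false; // ignore the concurrent request
        }
        imageRollInProgress = true;
        return true;
    }

    /** Completes the image roll: "edits.new" takes the place of "edits". */
    synchronized void finishRollFSImage() {
        if (!imageRollInProgress) {
            throw new IllegalStateException("no image roll in progress");
        }
        editsNewExists = false;
        imageRollInProgress = false;
    }
}
```

With this model, a restarted secondary node that calls rollEditLog() after a failed checkpoint gets false back instead of an exception and can proceed with the existing "edits" file, so the primary name-node does not need a restart.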