[ https://issues.apache.org/jira/browse/HDFS-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831258#action_12831258 ]
Todd Lipcon commented on HDFS-955:
----------------------------------

I took some pseudocode notes on what's currently going on in the load/save code:

{noformat}
loadFSImage:
  - find a pair of EDITS and IMAGE that have the same checkpoint time and are
    from the latest checkpointTime (this ignores *_NEW)
  - recoverInterruptedCheckpoint:
      if there is an IMAGE_NEW:
        if there is EDITS_NEW:
          delete IMAGE_NEW (since we assume we can replay from IMAGE + EDITS + EDITS_NEW?)
        else:
          replace IMAGE with IMAGE_NEW, delete IMAGE_NEW
  - load IMAGE
  - load EDITS
  - load EDITS_NEW
  - if need to save:
      saveFSImage:
        save IMAGE_NEW
        truncate EDITS
        if EDITS_NEW exists:
          truncate EDITS_NEW
      rollFSImage:
        purgeEditLog: replace EDITS with EDITS_NEW
        renameCheckpoint: replace IMAGE with IMAGE_NEW
{noformat}

Next I'll look at a failure at each point and see whether recovery works. Longer term we should also figure out how to use either the FI (fault injection) test framework or some clever Mockito spies to inject these failures in unit tests.

> FSImage.saveFSImage can lose edits
> ----------------------------------
>
>                 Key: HDFS-955
>                 URL: https://issues.apache.org/jira/browse/HDFS-955
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.21.0, 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Blocker
>
> This is a continuation of a discussion from HDFS-909. The FSImage.saveFSImage
> function (implementing dfsadmin -saveNamespace) can corrupt the NN storage
> such that all current edits are lost.
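
For anyone following along, the recoverInterruptedCheckpoint rule from the pseudocode notes can be modelled as a small Java sketch. This is only an illustration of the decision logic as written in the notes, not the actual FSImage code: the class name, the in-memory map standing in for a storage directory, and the returned status strings are all assumptions made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory model of one storage directory (file name -> contents). */
final class CheckpointRecoverySketch {
    // Shorthand names taken from the notes, not the real on-disk file names.
    static final String IMAGE = "IMAGE";
    static final String IMAGE_NEW = "IMAGE_NEW";
    static final String EDITS_NEW = "EDITS_NEW";

    /** Applies the recovery rule from the notes and reports which branch ran. */
    static String recoverInterruptedCheckpoint(Map<String, String> dir) {
        if (!dir.containsKey(IMAGE_NEW)) {
            return "nothing to recover";
        }
        if (dir.containsKey(EDITS_NEW)) {
            // The checkpoint did not finish: drop the new image and rely on
            // replaying IMAGE + EDITS + EDITS_NEW.
            dir.remove(IMAGE_NEW);
            return "deleted IMAGE_NEW";
        }
        // No EDITS_NEW left: replace IMAGE with IMAGE_NEW, then IMAGE_NEW is gone.
        dir.put(IMAGE, dir.remove(IMAGE_NEW));
        return "promoted IMAGE_NEW";
    }

    public static void main(String[] args) {
        Map<String, String> dir = new HashMap<>();
        dir.put(IMAGE, "old image");
        dir.put("EDITS", "edits");
        dir.put(EDITS_NEW, "newer edits");
        dir.put(IMAGE_NEW, "partially saved image");

        System.out.println(recoverInterruptedCheckpoint(dir));
        System.out.println(dir.keySet());
    }
}
```

A model like this makes it easy to enumerate "crash after step N" states by hand: build the map as it would look after each step of saveFSImage/rollFSImage, run the rule, and check whether the surviving files still cover all edits.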