[ https://issues.apache.org/jira/browse/HDFS-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841983#action_12841983 ]
Konstantin Shvachko commented on HDFS-955:
------------------------------------------

Unfortunately, this solution does not work either. The problem is that it assumes all files are in the same directory, while in our model the edits and image directories may be independent of each other. This means we cannot rely on the presence or absence of EDITS_NEW (and EDITS) to decide whether to remove or promote IMAGE_NEW, because the system can die when EDITS_NEW has been renamed to EDITS in one directory but not in another. We are trying to reconstruct, by examining the remaining files, the stage of the NN storage transformation sequence at which the crash occurred. This is error-prone and introduces unnecessary complexity. We should instead apply the technique used in the BackupNode and for the upgrade.

h3. A Better Solution

The idea is to create a temporary directory, accumulate all necessary changes to the persistent data in it, and then rename it to {{current}} once the new data is ready. The rename is two-step, not atomic, but it minimizes the recovery effort. Here is how saveFSImage() should work (a code sketch of both procedures is given at the end of this message).
# Create prospective_current.tmp and write the necessary files into it:
#- Save the new image into prospective_current.tmp/IMAGE
#- Create an empty prospective_current.tmp/EDITS
#- Create VERSION and fstime files in prospective_current.tmp and write the new checkpointTime
# Rename current to removed_current.tmp
# Rename prospective_current.tmp to current
# Remove removed_current.tmp

And the recovery procedure is very simple:
- if current.exists && prospective_current.tmp.exists then remove prospective_current.tmp
- if ! current.exists && prospective_current.tmp.exists then rename prospective_current.tmp to current and remove removed_current.tmp

It is important that the image and edits directories are operated on (created and recovered) independently of each other, yet maintain the same meta-data state. I plan to implement this algorithm and will try to reuse some code from the BN. I will not change the checkpoint procedure for the SNN, since it is deprecated, and it should not cause problems, because:
- A checkpoint cannot start while saveFSImage is in progress.
- If the checkpoint image upload started before saveFSImage, the upload will continue into current, and the subsequent rollFSImage will fail, either because the NN is in safe mode (saveFSImage is still in progress) or because EDITS_NEW no longer exists (saveFSImage has already completed).

> FSImage.saveFSImage can lose edits
> ----------------------------------
>
>                 Key: HDFS-955
>                 URL: https://issues.apache.org/jira/browse/HDFS-955
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.20.1, 0.21.0, 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Blocker
>         Attachments: hdfs-955-moretests.txt, hdfs-955-unittest.txt, PurgeEditsBeforeImageSave.patch
>
> This is a continuation of a discussion from HDFS-909. The FSImage.saveFSImage function (implementing dfsadmin -saveNamespace) can corrupt the NN storage such that all current edits are lost.
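To make the proposed two-step swap and its recovery concrete, here is a minimal Java sketch operating on plain java.io.File. The directory names (current, prospective_current.tmp, removed_current.tmp) are taken from the procedure above; the writeImage/writeEmptyEdits/writeVersionAndTime helpers are hypothetical placeholders for the real FSImage serialization code, and the cleanup of a leftover removed_current.tmp is an illustrative addition not listed in the recovery steps above.

{code:java}
import java.io.File;
import java.io.IOException;

public class DirectorySwapSketch {

  static final String CURRENT = "current";
  static final String PROSPECTIVE = "prospective_current.tmp";
  static final String REMOVED = "removed_current.tmp";

  /** Steps 1-4 of the proposed saveFSImage() for a single storage directory. */
  static void saveStorageDirectory(File root) throws IOException {
    File prospective = new File(root, PROSPECTIVE);
    File current = new File(root, CURRENT);
    File removed = new File(root, REMOVED);

    // 1. Accumulate all new persistent data in prospective_current.tmp.
    if (!prospective.mkdir())
      throw new IOException("Cannot create " + prospective);
    writeImage(prospective);            // new fsimage
    writeEmptyEdits(prospective);       // empty edits log
    writeVersionAndTime(prospective);   // VERSION + fstime with new checkpointTime

    // 2-4. Two-step (non-atomic) swap; recovery below handles a crash in between.
    if (!current.renameTo(removed))
      throw new IOException("Cannot rename " + current + " to " + removed);
    if (!prospective.renameTo(current))
      throw new IOException("Cannot rename " + prospective + " to " + current);
    deleteRecursively(removed);
  }

  /** Recovery procedure, run independently for each image/edits directory. */
  static void recoverStorageDirectory(File root) throws IOException {
    File prospective = new File(root, PROSPECTIVE);
    File current = new File(root, CURRENT);
    File removed = new File(root, REMOVED);

    if (prospective.exists()) {
      if (current.exists()) {
        // Crash before step 2: the old state is intact, discard the half-written one.
        deleteRecursively(prospective);
      } else {
        // Crash between steps 2 and 3: the new state is complete, promote it.
        if (!prospective.renameTo(current))
          throw new IOException("Cannot rename " + prospective + " to " + current);
        deleteRecursively(removed);
      }
    } else if (removed.exists()) {
      // Not in the recovery steps above: a crash after step 3 leaves
      // removed_current.tmp behind; it can simply be deleted.
      deleteRecursively(removed);
    }
  }

  // --- hypothetical helpers, stand-ins for the real serialization code ---
  static void writeImage(File dir) throws IOException { /* save namespace image */ }
  static void writeEmptyEdits(File dir) throws IOException { /* create empty edits */ }
  static void writeVersionAndTime(File dir) throws IOException { /* VERSION, fstime */ }

  static void deleteRecursively(File f) throws IOException {
    if (!f.exists()) return;
    File[] children = f.listFiles();
    if (children != null)
      for (File c : children) deleteRecursively(c);
    if (!f.delete())
      throw new IOException("Cannot delete " + f);
  }
}
{code}

The point of the swap order is that at any crash point exactly one of the two directories holds a complete, consistent state, so the recovery needs only to look at which directories exist rather than at the individual files inside them.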