[ https://issues.apache.org/jira/browse/HDFS-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840412#action_12840412 ]
Konstantin Shvachko commented on HDFS-955:
------------------------------------------

h3. The Problem

Our recovery logic for the IMAGE_NEW file was originally intended for checkpoint recovery, and it works in that case. But it does not work for recovery from a saveFSImage() failure.

The storage directory may contain four files: IMAGE, EDITS, EDITS_NEW, and IMAGE_NEW.

Here are the steps we perform during checkpoint:
0. Initially the storage directory has IMAGE and EDITS files only.
1. Start checkpoint. NN creates EDITS_NEW, and starts streaming edits into it.
2. Upload IMAGE_NEW from SNN to NN storage directory.
3. When upload is done, rename EDITS_NEW -> EDITS.
4. Rename IMAGE_NEW -> IMAGE. Back to the initial state.

Here is the time-line of which combination of files represents the _current_ state of the file system relative to the events above.

IMAGE + EDITS
--- (1) --- IMAGE + EDITS + EDITS_NEW
--- (2) --- IMAGE + EDITS + EDITS_NEW
--- (3) --- IMAGE_NEW + EDITS
--- (4) --- IMAGE + EDITS

The recovery procedure:
- If EDITS_NEW.exists, then we know NN failed after 1 or 2, but before 3, and our recovery strategy is to discard IMAGE_NEW.
- If ! EDITS_NEW.exists && IMAGE_NEW.exists, then NN failed after 3, but before 4, and we recover by upgrading IMAGE_NEW to IMAGE.

Now let's see what happens when we save the image during startup or saveNamespace. Here are the steps we perform when we call saveFSImage():
0. Initially the storage directory has IMAGE, EDITS, and potentially EDITS_NEW, which have all been loaded and digested in NN RAM.
1. Create EDITS_NEW if missing.
2. Save IMAGE_NEW.
3. Empty EDITS and EDITS_NEW.
4. Rename EDITS_NEW -> EDITS.
5. Rename IMAGE_NEW -> IMAGE.

We use the same recovery procedure here as in checkpointing, which leads to data loss in the following failure scenario. If we fail after 3 but before 4, then we will discard IMAGE_NEW, because EDITS_NEW.exists.
But the latest updates in EDITS and/or EDITS_NEW have already been wiped out, and we lose these edits forever.

The main reason the checkpointing logic does not work for saving is that IMAGE_NEW has different semantics in these two cases.
- In checkpoint IMAGE_NEW = IMAGE + EDITS
- In saveFSImage IMAGE_NEW = IMAGE + EDITS + EDITS_NEW

h3. The Solution

Different images should be represented by separate files and treated differently. I'll denote them
- IMAGE_CKPT = IMAGE + EDITS, the checkpoint image
- IMAGE_LAST = IMAGE + EDITS + EDITS_NEW, the last saved image

So the checkpoint process will create IMAGE_CKPT and will work with it exactly as before, no changes here.

saveFSImage will save NN's memory state into IMAGE_LAST, and should consist of the following steps:
0. Initially the storage directory has IMAGE, EDITS, and potentially EDITS_NEW and IMAGE_CKPT.
1. Save image into IMAGE_LAST.
2. Remove EDITS, IMAGE_CKPT, and EDITS_NEW - in the order listed.
3. Rename IMAGE_LAST -> IMAGE.
4. Create empty EDITS.

It is important to note that a checkpoint cannot start once saveFSImage has started, because NN is in safe mode, and because it holds the NN lock. If the upload of IMAGE_CKPT has started (stage c-2) it will proceed concurrently with the save. But rollEdits() (stage c-3) will fail if called during saveFSImage.

Here is the time-line of which combination of files represents the _current_ state of the file system relative to the events above.

IMAGE + EDITS + EDITS_NEW
--- (1) --- IMAGE + EDITS + EDITS_NEW
--- (2) --- IMAGE_LAST
--- (3) --- IMAGE
--- (4) --- IMAGE + EDITS

The recovery procedure for saving image is:
- If EDITS.exists && IMAGE_LAST.exists, then we know NN failed after 1 but before 2, and we recover by discarding IMAGE_LAST.
- If ! EDITS.exists && IMAGE_LAST.exists, then NN failed during or after 2, and we recover by applying 2, 3, and 4.
- If ! EDITS.exists && ! IMAGE_LAST.exists, then NN failed after 3, and we recover by applying 4.
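
To make the recovery rules above easier to check, here is a minimal sketch in Python. It is not the HDFS code: the storage directory is modelled as a dict from file name to the set of transaction ids that file covers, and the names save_image, run_with_crash, recover and load are hypothetical helpers invented for this illustration. It crashes the proposed save sequence after every sub-step and checks that the reloaded state never loses a transaction.

```python
from itertools import islice

# Assumption: each file is modelled as the set of transaction ids it covers.
# All function names here are illustrative, not part of HDFS.

def load(storage):
    """Current namespace = IMAGE + EDITS + EDITS_NEW (whichever exist)."""
    state = set()
    for name in ("IMAGE", "EDITS", "EDITS_NEW"):
        state |= storage.get(name, set())
    return state

def save_image(storage, memory):
    """Proposed saveFSImage steps; yields after each sub-step completes."""
    storage["IMAGE_LAST"] = set(memory)                  # step 1
    yield
    for name in ("EDITS", "IMAGE_CKPT", "EDITS_NEW"):    # step 2, in the order listed
        storage.pop(name, None)
        yield
    storage["IMAGE"] = storage.pop("IMAGE_LAST")         # step 3
    yield
    storage["EDITS"] = set()                             # step 4
    yield

def run_with_crash(storage, memory, completed):
    """Run only the first `completed` sub-steps, then 'crash'."""
    for _ in islice(save_image(storage, memory), completed):
        pass

def recover(storage):
    """Recovery procedure for saving image, as listed above."""
    has_edits = "EDITS" in storage
    has_last = "IMAGE_LAST" in storage
    if has_edits and has_last:
        # Failed after 1 but before 2: discard IMAGE_LAST.
        del storage["IMAGE_LAST"]
    elif has_last:
        # Failed during or after 2: apply 2, 3, and 4.
        for name in ("EDITS", "IMAGE_CKPT", "EDITS_NEW"):
            storage.pop(name, None)
        storage["IMAGE"] = storage.pop("IMAGE_LAST")
        storage["EDITS"] = set()
    elif not has_edits:
        # Failed after 3: apply 4.
        storage["EDITS"] = set()

# Crash after every possible number of completed sub-steps (0..6).
for completed in range(7):
    storage = {"IMAGE": {1, 2}, "EDITS": {3},
               "EDITS_NEW": {4}, "IMAGE_CKPT": {1, 2, 3}}
    memory = load(storage)
    run_with_crash(storage, memory, completed)
    recover(storage)
    assert load(storage) == memory, completed
```

The sketch only covers failures between sub-steps; it does not model a partially written IMAGE_LAST or the concurrent checkpoint upload discussed in this comment.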

There is a slight complication for WinFS. It will not let us remove IMAGE_CKPT at stage 2 if the checkpointer is still writing into it. In this case we will ignore the failure and quit the procedure, delaying the rest of the steps for the future. The correct state (rename IMAGE_LAST to IMAGE) will be restored either when the checkpoint finishes or if the NN restarts due to a failure.

> FSImage.saveFSImage can lose edits
> ----------------------------------
>
> Key: HDFS-955
> URL: https://issues.apache.org/jira/browse/HDFS-955
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 0.20.1, 0.21.0, 0.22.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Blocker
> Attachments: hdfs-955-moretests.txt, hdfs-955-unittest.txt, PurgeEditsBeforeImageSave.patch
>
> This is a continuation of a discussion from HDFS-909. The FSImage.saveFSImage
> function (implementing dfsadmin -saveNamespace) can corrupt the NN storage
> such that all current edits are lost.