[ 
https://issues.apache.org/jira/browse/HDFS-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840412#action_12840412
 ] 

Konstantin Shvachko commented on HDFS-955:
------------------------------------------

h3. The Problem

Our recovery logic for the IMAGE_NEW file was originally intended for 
checkpoint recovery, and it works in that case. But it does not work for 
recovery from a saveFSImage() failure.

The storage directory may contain four files: IMAGE, EDITS, EDITS_NEW, and 
IMAGE_NEW.
Here are the steps we perform during checkpoint:
0. Initially storage directory has IMAGE and EDITS files only.
1. Start checkpoint. NN creates EDITS_NEW, and starts streaming edits into it.
2. Upload IMAGE_NEW from SNN to NN storage directory.
3. When upload is done, rename EDITS_NEW -> EDITS.
4. Rename IMAGE_NEW -> IMAGE. Back to the initial state.

Here is the time-line of which combination of files represents the _current_ 
state of the file system relative to the events above.

IMAGE + EDITS --- (1) --- IMAGE + EDITS + EDITS_NEW --- (2) --- IMAGE + EDITS + 
EDITS_NEW --- (3) --- IMAGE_NEW + EDITS --- (4) --- IMAGE + EDITS 

The recovery procedure:
- If EDITS_NEW.exists, then we know NN failed after 1 or 2, but before 3, and 
our recovery strategy is to discard IMAGE_NEW.
- If ! EDITS_NEW.exists && IMAGE_NEW.exists, then NN failed after 3, but before 
4, and we recover by upgrading IMAGE_NEW to IMAGE.
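
The two rules above can be sketched as a small decision function. This is only an illustration, not the actual FSImage code: the class name, the enum, and the idea of passing the directory listing as a set of file names are all assumptions made for the example.

```java
import java.util.Set;

/** Hypothetical sketch of the checkpoint recovery rule; not actual HDFS code. */
final class CheckpointRecovery {
    enum Action { DISCARD_IMAGE_NEW, PROMOTE_IMAGE_NEW, NONE }

    /** Decides the recovery action from the file names found in a storage directory. */
    static Action decide(Set<String> files) {
        boolean editsNew = files.contains("EDITS_NEW");
        boolean imageNew = files.contains("IMAGE_NEW");
        if (editsNew) {
            // NN failed after step 1 or 2 but before step 3.
            // Discarding is a no-op if IMAGE_NEW was never uploaded.
            return Action.DISCARD_IMAGE_NEW;
        }
        if (imageNew) {
            // NN failed after step 3 but before step 4: finish the rename.
            return Action.PROMOTE_IMAGE_NEW;
        }
        // Only IMAGE and EDITS are present: nothing to recover.
        return Action.NONE;
    }
}
```

For example, a directory holding IMAGE, EDITS, and IMAGE_NEW maps to PROMOTE_IMAGE_NEW, while any directory that still has EDITS_NEW maps to DISCARD_IMAGE_NEW.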

Now let's see what happens when we save image during startup or saveNamespace.
Here are the steps we perform when we call saveFSImage():
0. Initially the storage directory has IMAGE, EDITS, and potentially EDITS_NEW, 
which have all been loaded and digested in NN RAM.
1. Create EDITS_NEW if missing.
2. Save IMAGE_NEW.
3. Empty EDITS and EDITS_NEW.
4. Rename EDITS_NEW -> EDITS.
5. Rename IMAGE_NEW -> IMAGE.

We use the same recovery procedure here as in checkpointing, which leads to a 
data loss in the following failure scenario.
If we fail after 3 but before 4, then we will discard IMAGE_NEW, because 
EDITS_NEW.exists. 
But the latest updates in EDITS and/or EDITS_NEW have already been wiped out and 
we lose these edits forever.
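
To make the failure concrete, here is a toy in-memory simulation of a crash between steps 3 and 4 followed by the checkpoint-style recovery. Everything in it is an assumption for illustration: files are modelled as lists of operation names (op1, op2, op3), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the data-loss scenario; files are lists of operation names. */
final class SaveImageCrashDemo {
    /** Returns the operations that survive a crash after step 3 plus recovery. */
    static List<String> recoveredStateAfterCrash() {
        // Step 0: directory contents, all already loaded into NN memory.
        Map<String, List<String>> dir = new HashMap<>();
        dir.put("IMAGE", List.of("op1"));
        dir.put("EDITS", List.of("op2"));
        dir.put("EDITS_NEW", List.of("op3"));
        List<String> memory = List.of("op1", "op2", "op3");

        dir.putIfAbsent("EDITS_NEW", List.of()); // Step 1: create EDITS_NEW if missing.
        dir.put("IMAGE_NEW", memory);            // Step 2: save IMAGE_NEW.
        dir.put("EDITS", List.of());             // Step 3: empty EDITS ...
        dir.put("EDITS_NEW", List.of());         //         ... and EDITS_NEW.
        // Crash here, before step 4.

        // Checkpoint-style recovery applied on restart.
        if (dir.containsKey("EDITS_NEW")) {
            dir.remove("IMAGE_NEW");             // the only complete copy is discarded
        } else if (dir.containsKey("IMAGE_NEW")) {
            dir.put("IMAGE", dir.remove("IMAGE_NEW"));
        }

        // Reload the namespace from whatever is left.
        List<String> recovered = new ArrayList<>();
        for (String name : List.of("IMAGE", "EDITS", "EDITS_NEW")) {
            recovered.addAll(dir.getOrDefault(name, List.of()));
        }
        return recovered;
    }
}
```

In this model only op1 survives: op2 and op3 lived in the emptied edits files and in the discarded IMAGE_NEW.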

The main reason the checkpointing logic does not work for saving is that 
IMAGE_NEW has different semantics in these two cases.
- In checkpoint  IMAGE_NEW = IMAGE + EDITS
- In saveFSImage IMAGE_NEW = IMAGE + EDITS + EDITS_NEW

h3. The Solution

Different images should be represented by separate files and treated 
differently. I'll denote them
- IMAGE_CKPT = IMAGE + EDITS the checkpoint image
- IMAGE_LAST = IMAGE + EDITS + EDITS_NEW the last saved image

So the checkpoint process will create IMAGE_CKPT and will work with it exactly 
as before, no changes here.

saveFSImage will save NN's memory state into IMAGE_LAST, and should consist of 
the following steps:
0. Initially the storage directory has IMAGE, EDITS, and potentially EDITS_NEW 
and IMAGE_CKPT.
1. Save image into IMAGE_LAST.
2. Remove EDITS, IMAGE_CKPT, and EDITS_NEW - in the order listed.
3. Rename IMAGE_LAST -> IMAGE.
4. Create empty EDITS.
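
The four steps might look roughly as follows with java.nio.file. This is a minimal sketch under stated assumptions, not a proposed patch: the class and method names are hypothetical, the namespace is passed in as an opaque byte array, and real code would also need to sync data to disk and handle multiple storage directories.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch of the proposed save sequence for one storage directory. */
final class SaveImageSketch {
    static void saveImage(Path dir, byte[] namespaceBytes) throws IOException {
        Path imageLast = dir.resolve("IMAGE_LAST");

        // Step 1: save the in-memory state into IMAGE_LAST.
        Files.write(imageLast, namespaceBytes);

        // Step 2: remove EDITS, IMAGE_CKPT, and EDITS_NEW, in that order.
        // EDITS goes first so that its absence marks "IMAGE_LAST is complete".
        Files.deleteIfExists(dir.resolve("EDITS"));
        Files.deleteIfExists(dir.resolve("IMAGE_CKPT"));
        Files.deleteIfExists(dir.resolve("EDITS_NEW"));

        // Step 3: rename IMAGE_LAST -> IMAGE.
        Files.move(imageLast, dir.resolve("IMAGE"), StandardCopyOption.REPLACE_EXISTING);

        // Step 4: create an empty EDITS.
        Files.createFile(dir.resolve("EDITS"));
    }
}
```

The ordering inside step 2 is what the recovery procedure relies on: once EDITS is gone, IMAGE_LAST alone represents the current state.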

It is important to note that a checkpoint cannot start once saveFSImage has 
started, because NN is in safe mode, and because it holds the NN lock. If the upload of 
IMAGE_CKPT has started (stage c-2) it will proceed concurrently with the save. 
But rollEdits() (stage c-3) will fail if called during saveFSImage.

Here is the time-line of which combination of files represents the _current_ 
state of the file system relative to the events above. 

IMAGE + EDITS + EDITS_NEW --- (1) --- IMAGE + EDITS + EDITS_NEW --- (2) --- 
IMAGE_LAST --- (3) --- IMAGE --- (4) --- IMAGE + EDITS 

The recovery procedure for saving image is:
- If EDITS.exists && IMAGE_LAST.exists, then we know NN failed after 1 but 
before 2, and we recover by discarding IMAGE_LAST.
- If ! EDITS.exists && IMAGE_LAST.exists, then NN failed during or after 2, and 
we recover by applying 2, 3, and 4.
- If ! EDITS.exists && ! IMAGE_LAST.exists, then NN failed after 3, and we 
recover by applying 4.
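
These three cases can be written as a decision function over two existence checks. As before this is only a sketch: the class, enum, and method names are hypothetical, and the fourth combination (EDITS present, IMAGE_LAST absent) is assumed to be the normal state with nothing to recover.

```java
/** Hypothetical sketch of the recovery rule for a failed image save. */
final class SaveImageRecovery {
    enum Action { DISCARD_IMAGE_LAST, REDO_STEPS_2_3_4, CREATE_EMPTY_EDITS, NONE }

    static Action decide(boolean editsExists, boolean imageLastExists) {
        if (imageLastExists) {
            // EDITS still present: failed after step 1 but before step 2,
            // so IMAGE_LAST may be incomplete and is discarded.
            // EDITS gone: failed during or after step 2, so IMAGE_LAST is
            // complete and steps 2, 3, and 4 are (re)applied.
            return editsExists ? Action.DISCARD_IMAGE_LAST : Action.REDO_STEPS_2_3_4;
        }
        // No IMAGE_LAST. Without EDITS the failure came after step 3,
        // so only step 4 remains; with EDITS the directory is consistent.
        return editsExists ? Action.NONE : Action.CREATE_EMPTY_EDITS;
    }
}
```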

There is a slight complication for WinFS. It will not let us remove IMAGE_CKPT 
on stage 2 if the checkpointer is still writing into it. In this case we will 
ignore the failure, and quit the procedure, delaying the rest of the steps for 
the future. The correct state (rename IMAGE_LAST to IMAGE) will be restored 
either when checkpoint finishes or if the NN restarts due to a failure.


> FSImage.saveFSImage can lose edits
> ----------------------------------
>
>                 Key: HDFS-955
>                 URL: https://issues.apache.org/jira/browse/HDFS-955
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.20.1, 0.21.0, 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Blocker
>         Attachments: hdfs-955-moretests.txt, hdfs-955-unittest.txt, 
> PurgeEditsBeforeImageSave.patch
>
>
> This is a continuation of a discussion from HDFS-909. The FSImage.saveFSImage 
> function (implementing dfsadmin -saveNamespace) can corrupt the NN storage 
> such that all current edits are lost.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
