[ https://issues.apache.org/jira/browse/HDFS-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841983#action_12841983 ]

Konstantin Shvachko commented on HDFS-955:
------------------------------------------

Unfortunately, this solution does not work either. The problem is that it
assumes all files are in the same directory, while in our model the edits and
image directories may be independent of each other. This means we cannot rely
on the presence or absence of EDITS_NEW (and EDITS) to decide whether to
remove or promote IMAGE_NEW, because the system can die when EDITS_NEW has
been renamed to EDITS in one directory but not in another. We are trying here
to reconstruct, by examining the remaining files, the stage of the NN storage
transformation sequence at which it crashed. This is error-prone and
introduces unnecessary complexity. We should instead apply the technique used
in the BackupNode and for the upgrade.

h3. A Better Solution

The idea is to create a temporary directory and accumulate all necessary 
changes to the persistent data in it, and then rename it to {{current}} once 
the new data is ready. The rename is two-step, not atomic, but it minimizes the 
recovery effort. Here is how saveFSImage() should work.

# Create prospective_current.tmp, and write necessary files in it.
#- Save new image into prospective_current.tmp/IMAGE
#- Create empty prospective_current.tmp/EDITS
#- Create VERSION and fstime files in prospective_current.tmp and write new 
checkpointTime.
# Rename current to removed_current.tmp
# Rename prospective_current.tmp to current
# Remove removed_current.tmp
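
The four steps above can be sketched as follows for a single storage directory. This is an illustrative sketch only: the directory names follow the proposal, but the class name, method shape, and file contents (the actual fsimage/edits serialization is elided) are hypothetical placeholders.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical sketch of the proposed saveFSImage() sequence for one
// storage directory; not the actual HDFS implementation.
public class SaveImageSketch {

    static void saveFSImage(Path storageRoot) throws IOException {
        Path current     = storageRoot.resolve("current");
        Path prospective = storageRoot.resolve("prospective_current.tmp");
        Path removed     = storageRoot.resolve("removed_current.tmp");

        // Step 1: create prospective_current.tmp and write the new files into it.
        Files.createDirectory(prospective);
        Files.write(prospective.resolve("fsimage"), new byte[0]);  // placeholder for the saved image
        Files.createFile(prospective.resolve("edits"));            // empty EDITS
        Files.write(prospective.resolve("VERSION"), "layoutVersion=...\n".getBytes());
        Files.write(prospective.resolve("fstime"),                 // new checkpointTime
                    Long.toString(System.currentTimeMillis()).getBytes());

        // Steps 2-3: the two-step (non-atomic) rename.
        Files.move(current, removed);      // rename current -> removed_current.tmp
        Files.move(prospective, current);  // rename prospective_current.tmp -> current

        // Step 4: remove removed_current.tmp.
        deleteRecursively(removed);
    }

    static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder())
                 .forEach(p -> p.toFile().delete());
        }
    }
}
```

Note that only the window between the two moves leaves the directory without a valid current, which is what keeps recovery simple.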

And the recovery procedure is very simple:
- if current.exists && prospective_current.tmp.exists then remove prospective_current.tmp
- if ! current.exists && prospective_current.tmp.exists then rename prospective_current.tmp to current and remove removed_current.tmp
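
The two recovery rules above could look like this for one storage directory (again an illustrative sketch with hypothetical class and method names):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical sketch of startup recovery for one storage directory,
// mirroring the two rules in the comment above.
public class RecoverStorageSketch {

    static void recover(Path storageRoot) throws IOException {
        Path current     = storageRoot.resolve("current");
        Path prospective = storageRoot.resolve("prospective_current.tmp");
        Path removed     = storageRoot.resolve("removed_current.tmp");

        if (Files.exists(current) && Files.exists(prospective)) {
            // Crashed before the first rename: the new data may be
            // incomplete, so discard it; current is still authoritative.
            deleteRecursively(prospective);
        } else if (!Files.exists(current) && Files.exists(prospective)) {
            // Crashed between the two renames: the new data is complete,
            // so promote it and clean up the old directory.
            Files.move(prospective, current);
            if (Files.exists(removed)) {
                deleteRecursively(removed);
            }
        }
        // Otherwise the directory is consistent (a stray removed_current.tmp
        // left by a crash during the final removal can simply be deleted).
    }

    static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder())
                 .forEach(p -> p.toFile().delete());
        }
    }
}
```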

It is important that image and edits directories are operated on (created and
recovered) independently of each other, yet end up in the same meta-data state.
I plan to implement this algorithm, and will try to reuse some code from the BN.
I will not change the checkpoint procedure for the SNN, since it is deprecated;
this should not cause problems, because:
- A checkpoint cannot start while saveFSImage is in progress.
- If a checkpoint image upload started before saveFSImage, the upload will
continue into current, and the subsequent rollFSImage will fail either because
the NN is in safe mode (saveFSImage is still in progress) or because EDITS_NEW
no longer exists (saveFSImage has already completed).


> FSImage.saveFSImage can lose edits
> ----------------------------------
>
>                 Key: HDFS-955
>                 URL: https://issues.apache.org/jira/browse/HDFS-955
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.20.1, 0.21.0, 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Blocker
>         Attachments: hdfs-955-moretests.txt, hdfs-955-unittest.txt, 
> PurgeEditsBeforeImageSave.patch
>
>
> This is a continuation of a discussion from HDFS-909. The FSImage.saveFSImage 
> function (implementing dfsadmin -saveNamespace) can corrupt the NN storage 
> such that all current edits are lost.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.