[ https://issues.apache.org/jira/browse/HDFS-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289108#comment-14289108 ]
Hudson commented on HDFS-3519:
------------------------------
FAILURE: Integrated in Hadoop-Yarn-trunk #816 (See
[https://builds.apache.org/job/Hadoop-Yarn-trunk/816/])
HDFS-3519. Checkpoint upload may interfere with a concurrent saveNamespace.
Contributed by Ming Ma. (cnauroth: rev d3268c4b10a0f728b554ddb6d69b666a9ca13f12)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ImageServlet.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
> Checkpoint upload may interfere with a concurrent saveNamespace
> ---------------------------------------------------------------
>
> Key: HDFS-3519
> URL: https://issues.apache.org/jira/browse/HDFS-3519
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Reporter: Todd Lipcon
> Assignee: Ming Ma
> Priority: Critical
> Fix For: 2.7.0
>
> Attachments: HDFS-3519-2.patch, HDFS-3519-3.patch,
> HDFS-3519-branch-2.patch, HDFS-3519.patch, test-output.txt
>
>
> TestStandbyCheckpoints failed in [precommit build
> 2620|https://builds.apache.org/job/PreCommit-HDFS-Build/2620//testReport/]
> due to the following issue:
> - Both nodes were in Standby state and configured to checkpoint "as fast as
> possible".
> - NN1 starts to save its own namespace.
> - NN2 starts to upload a checkpoint for the same txid. Both threads are now
> writing to the same file, fsimage.ckpt_12, but the actual file contents
> correspond to the uploading thread's data.
> - NN1 finishes its saveNamespace operation while NN2 is still uploading, so
> it renames the ckpt file. However, the contents of the file are still empty
> because NN2 hasn't sent any bytes yet.
> - NN2 finishes the upload, and its rename() call fails, which causes the
> directory to be marked failed, etc.
> The result is that there is a file fsimage_12 which appears to be a finalized
> image but in fact is incompletely transferred. When the transfer completes,
> the problem "heals itself" so there wouldn't be persistent corruption unless
> the machine crashes at the same time. And even then, we'd still have the
> earlier checkpoint to restore from.
> I believe this same race could also occur in a non-HA setup if a user puts
> the NN in safe mode and issues saveNamespace operations while a 2NN is
> checkpointing.
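
The race in the description above comes from two writers sharing one temporary file name (fsimage.ckpt_12) for the same txid. One way to rule it out is to let only one writer at a time claim a given txid. The following is a minimal sketch of that idea, not the code from the committed patch: the class CheckpointGuard and its methods tryBegin and end are hypothetical names introduced here for illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper for illustration only; not taken from the HDFS-3519 patch.
class CheckpointGuard {
    // Transaction IDs for which an image file is currently being written.
    private final Set<Long> inProgress = ConcurrentHashMap.newKeySet();

    /** Returns true if the caller now owns the checkpoint for txid. */
    boolean tryBegin(long txid) {
        // Set.add is atomic here and returns false if txid is already claimed.
        return inProgress.add(txid);
    }

    /** Releases the claim; call from a finally block after rename or failure. */
    void end(long txid) {
        inProgress.remove(txid);
    }

    public static void main(String[] args) {
        CheckpointGuard guard = new CheckpointGuard();
        long txid = 12L;

        boolean saver = guard.tryBegin(txid);    // local saveNamespace claims txid 12
        boolean uploader = guard.tryBegin(txid); // concurrent upload for txid 12 is refused
        System.out.println(saver + " " + uploader); // prints "true false"

        guard.end(txid);                          // saveNamespace finished and renamed
        System.out.println(guard.tryBegin(txid)); // prints "true"
    }
}
```

With a guard like this, the second writer would skip or reject its checkpoint instead of writing into a file that the first writer is about to rename, so a half-transferred fsimage_12 would never appear as finalized.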