[jira] [Commented] (HDFS-3519) Checkpoint upload may interfere with a concurrent saveNamespace

Hadoop QA (JIRA) Mon, 12 Jan 2015 16:35:37 -0800

    [ https://issues.apache.org/jira/browse/HDFS-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274486#comment-14274486 ]


Hadoop QA commented on HDFS-3519:
---------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12691715/HDFS-3519-2.patch
  against trunk revision b78b4a1.

    {color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

    {color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

    {color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

    {color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

    {color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

                  
                  org.apache.hadoop.hdfs.qjournal.client.TestQuorumJournalManager
                  org.apache.hadoop.hdfs.server.balancer.TestBalancer
                  org.apache.hadoop.hdfs.TestReplaceDatanodeOnFailure

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9187//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9187//console

This message is automatically generated.

> Checkpoint upload may interfere with a concurrent saveNamespace
> ---------------------------------------------------------------
>
>                 Key: HDFS-3519
>                 URL: https://issues.apache.org/jira/browse/HDFS-3519
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Todd Lipcon
>            Assignee: Ming Ma
>            Priority: Critical
>         Attachments: HDFS-3519-2.patch, HDFS-3519.patch, test-output.txt
>
>
> TestStandbyCheckpoints failed in [precommit build 
> 2620|https://builds.apache.org/job/PreCommit-HDFS-Build/2620//testReport/] 
> due to the following issue:
>
> - both nodes were in Standby state, and configured to checkpoint "as fast as 
>   possible"
> - NN1 starts to save its own namespace
> - NN2 starts to upload a checkpoint for the same txid. So, both threads are 
>   writing to the same file fsimage.ckpt_12, but the actual file contents 
>   correspond to the uploading thread's data.
> - NN1 finished its saveNamespace operation while NN2 was still uploading. So, 
>   it renamed the ckpt file. However, the contents of the file are still empty 
>   since NN2 hasn't sent any bytes
> - NN2 finishes the upload, and the rename() call fails, which causes the 
>   directory to be marked failed, etc.
>
> The result is that there is a file fsimage_12 which appears to be a finalized 
> image but in fact is incompletely transferred. When the transfer completes, 
> the problem "heals itself" so there wouldn't be persistent corruption unless 
> the machine crashes at the same time. And even then, we'd still have the 
> earlier checkpoint to restore from.
>
> This same race could occur in a non-HA setup if a user puts the NN in safe 
> mode and issues saveNamespace operations concurrent with a 2NN checkpointing, 
> I believe.
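
The race in the quoted description comes from two writers sharing one temporary file for the same txid. As a rough illustration only, the sketch below shows one possible guard: a set of in-progress txids that lets exactly one writer own fsimage.ckpt_<txid> at a time. The class CheckpointGuard and its methods tryBegin/end are hypothetical names invented for this example; this is not taken from the attached patch and is not claimed to be how the NameNode handles it.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical guard (not Hadoop code): at most one writer per checkpoint txid. */
public class CheckpointGuard {
    // Transaction IDs whose fsimage.ckpt_<txid> file is currently being written.
    private final Set<Long> inProgress = ConcurrentHashMap.newKeySet();

    /** Returns true if the caller now owns the checkpoint file for this txid. */
    public boolean tryBegin(long txid) {
        // Set.add is atomic here and returns false if the txid is already present.
        return inProgress.add(txid);
    }

    /** Releases ownership after the file was renamed or the write was abandoned. */
    public void end(long txid) {
        inProgress.remove(txid);
    }

    public static void main(String[] args) {
        CheckpointGuard guard = new CheckpointGuard();
        long txid = 12L;

        boolean saveNamespaceOwns = guard.tryBegin(txid); // local saveNamespace starts
        boolean uploadOwns = guard.tryBegin(txid);        // concurrent upload, same txid

        System.out.println(saveNamespaceOwns + " " + uploadOwns); // prints "true false"

        guard.end(txid);                           // saveNamespace finished and renamed
        System.out.println(guard.tryBegin(txid));  // prints "true"
    }
}
```

With a guard of this kind, the second writer would be expected to skip or reject its upload rather than write into a file that another thread is about to rename.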



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
