[ https://issues.apache.org/jira/browse/HDFS-7203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162651#comment-14162651 ]

Hadoop QA commented on HDFS-7203:
---------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12673416/HDFS-7203.patch
  against trunk revision 9196db9.

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 1 new or modified test file.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

    {color:red}-1 core tests{color}.  The test build failed in hadoop-hdfs-project/hadoop-hdfs

    {color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8344//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8344//console

This message is automatically generated.

> Concurrent appending to the same file can cause data corruption
> ---------------------------------------------------------------
>
>                 Key: HDFS-7203
>                 URL: https://issues.apache.org/jira/browse/HDFS-7203
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Blocker
>         Attachments: HDFS-7203.patch
>
>
> When multiple threads call append against the same file, the file can 
> become corrupted. The root of the problem is that a stale file stat may be 
> used for the append in {{DFSClient}}. If the file size changes between 
> {{getFileStatus()}} and {{namenode.append()}}, {{DataStreamer}} will 
> miscompute how to align data to the checksum boundary and break the 
> invariant the datanodes rely on. (A minimal reproduction sketch follows the 
> quoted description below.)
> When this happens, the datanode may not write the last checksum. On the 
> next append attempt, the datanode cannot reposition to the partial chunk, 
> since the last checksum is missing. The append fails after running out of 
> datanodes to copy the partial block to.
> However, if more threads keep trying to append, the situation becomes more 
> serious. Within a few minutes, a lease recovery and a block recovery will 
> happen. The block recovery truncates the block to the ack'ed size in order 
> to keep only the portion of data that has been checksum-verified. The 
> problem is that during the last successful append, the last datanode 
> verified the checksum and ack'ed before writing the data and the wrong 
> metadata to disk, and every datanode in the pipeline wrote the same wrong 
> metadata. So the ack'ed size includes the corrupt portion of the data.
> Since block recovery does not perform any checksum verification, the file 
> sizes are adjusted, and after {{commitBlockSynchronization()}} another 
> thread is allowed to append to the corrupt file. This latent corruption may 
> go undetected for a very long time.
> The first failing {{append()}} would have created a partial copy of the 
> block in the temporary directory of every datanode in the cluster. After 
> this failure the block is likely under-replicated, so it will be scheduled 
> for replication once the file is closed. Before HDFS-6948, that replication 
> did not work until a node was added or restarted, because the temporary 
> file existed on all datanodes; as a result, the corruption could not be 
> detected through replication. After HDFS-6948, the corruption is detected 
> once the file is closed by lease recovery or by a subsequent append-close.
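
To make the race concrete, here is a minimal reproduction sketch, assuming a {{MiniDFSCluster}} test environment. The class name, payload size, and iteration count are illustrative assumptions rather than the attached patch's actual test; the key is a payload that is not a multiple of the checksum chunk size ({{io.bytes.per.checksum}}, 512 bytes by default), so every append leaves a partial chunk whose offset the next appender must compute from the file length it read.

{code:java}
// Hypothetical reproduction sketch (NOT the committed test): two threads
// append to the same file so that the length read by getFileStatus() can
// go stale before namenode.append() runs inside DFSClient.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class ConcurrentAppendRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    MiniDFSCluster cluster =
        new MiniDFSCluster.Builder(conf).numDataNodes(3).build();
    try {
      final FileSystem fs = cluster.getFileSystem();
      final Path file = new Path("/test/appendRace");
      fs.create(file).close();

      // 100 % 512 != 0, so each append ends in a partial checksum chunk.
      // Example: at length 612 the partial chunk holds 612 % 512 = 100
      // bytes; if a racing append already grew the file to 712, a streamer
      // that still believes the length is 612 aligns to the wrong boundary.
      Runnable appender = () -> {
        byte[] data = new byte[100];
        for (int i = 0; i < 50; i++) {
          try (FSDataOutputStream out = fs.append(file)) {
            out.write(data);
          } catch (Exception e) {
            // Losing the lease race (AlreadyBeingCreatedException) is
            // expected; the corruption comes from a *successful* append
            // that used a stale length.
          }
        }
      };

      Thread t1 = new Thread(appender);
      Thread t2 = new Thread(appender);
      t1.start(); t2.start();
      t1.join(); t2.join();
    } finally {
      cluster.shutdown();
    }
  }
}
{code}

On an unpatched build, an append that won the lease but raced the stat this way can leave the block's meta file without the last checksum, and the failure then unfolds as described above: repositioning fails, lease/block recovery truncates to the ack'ed size, and the corrupt tail survives.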



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
