[
https://issues.apache.org/jira/browse/HDFS-7203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162651#comment-14162651
]
Hadoop QA commented on HDFS-7203:
---------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12673416/HDFS-7203.patch
against trunk revision 9196db9.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The test build failed in hadoop-hdfs-project/hadoop-hdfs
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8344//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8344//console
This message is automatically generated.
> Concurrent appending to the same file can cause data corruption
> ---------------------------------------------------------------
>
> Key: HDFS-7203
> URL: https://issues.apache.org/jira/browse/HDFS-7203
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Priority: Blocker
> Attachments: HDFS-7203.patch
>
>
> When multiple threads call append against the same file, the file can get
> corrupted. The root of the problem is that a stale file stat may be used
> for append in {{DFSClient}}. If the file size changes between
> {{getFileStatus()}} and {{namenode.append()}}, {{DataStreamer}} will get
> confused about how to align data to the checksum boundary and will break the
> assumptions made by the datanodes.
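>
> A minimal sketch of the workload described above (assuming a running HDFS
> cluster reachable through the default {{FileSystem}}; the path and thread
> count are only illustrative, not an actual reproducer):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class ConcurrentAppend {
>   public static void main(String[] args) throws Exception {
>     final FileSystem fs = FileSystem.get(new Configuration());
>     final Path path = new Path("/tmp/append-race.txt");  // illustrative path
>     fs.create(path).close();                             // start from an empty file
>
>     // Several threads appending to the same file. Each append goes through
>     // DFSClient, which looks up the file stat before calling namenode.append();
>     // another thread's append landing in between makes that stat stale.
>     for (int i = 0; i < 4; i++) {
>       new Thread(new Runnable() {
>         public void run() {
>           try {
>             FSDataOutputStream out = fs.append(path);
>             out.write("some bytes of unaligned length".getBytes("UTF-8"));
>             out.close();
>           } catch (Exception e) {
>             e.printStackTrace();   // lease and recovery errors surface here
>           }
>         }
>       }).start();
>     }
>   }
> }
> {code}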
> When this happens, the datanode may not write the last checksum. On the next
> append attempt, the datanode won't be able to reposition for the partial
> chunk, since the last checksum is missing. The append will fail after running
> out of datanodes to copy the partial block to.
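>
> To make the misalignment concrete, here is a small worked illustration (not
> code from the patch); 512 bytes is the usual {{dfs.bytes-per-checksum}} default:
> {code:java}
> public class ChecksumAlignment {
>   public static void main(String[] args) {
>     final int bytesPerChecksum = 512;   // HDFS default for dfs.bytes-per-checksum
>     final long staleLen  = 1000;        // size returned by the earlier getFileStatus()
>     final long actualLen = 1300;        // size by the time namenode.append() is granted
>
>     // The client sizes its first write so that it resumes exactly at a
>     // checksum-chunk boundary, based on the length it believes the file has.
>     System.out.println("offset in last chunk (stale) : " + staleLen % bytesPerChecksum);   // 488
>     System.out.println("offset in last chunk (actual): " + actualLen % bytesPerChecksum);  // 276
>
>     // The two offsets disagree, so the data sent to the datanode no longer
>     // lines up with the partial chunk already on disk, and the checksum for
>     // that last chunk ends up wrong or never written.
>   }
> }
> {code}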
> However, if more threads keep trying to append, this leads to a more serious
> situation. Within a few minutes, lease recovery and block recovery will
> happen. Block recovery truncates the block to the ack'ed size in order to
> keep only the portion of data that has been checksum-verified. The problem is
> that, during the last successful append, the last datanode verified the
> checksum and ack'ed before writing the data and the wrong metadata to disk,
> and every datanode in the pipeline wrote the same wrong metadata. So the
> ack'ed size includes the corrupt portion of the data.
> Since block recovery does not perform any checksum verification, the block
> and file sizes are adjusted, and after {{commitBlockSynchronization()}},
> another thread will be allowed to append to the corrupt file. This latent
> corruption may not be detected for a very long time.
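>
> One way this latent corruption would eventually surface is a full read with
> client-side checksum verification left on; a hedged sketch, using an
> illustrative path:
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class ReadBack {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     fs.setVerifyChecksum(true);                     // on by default; shown for clarity
>     Path path = new Path("/tmp/append-race.txt");   // illustrative path
>     byte[] buf = new byte[64 * 1024];
>     try (FSDataInputStream in = fs.open(path)) {
>       while (in.read(buf) != -1) {
>         // drain the stream; a checksum mismatch makes the client skip the
>         // bad replica and eventually fail the read
>       }
>       System.out.println("read back cleanly");
>     } catch (IOException e) {
>       System.out.println("read failed, possible corruption: " + e);
>     }
>   }
> }
> {code}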
> The first failing {{append()}} would have created a partial copy of the block
> in the temporary directory of every datanode in the cluster. After this
> failure, the block is likely under-replicated, so it will be scheduled for
> replication once the file is closed. Before HDFS-6948, replication did not
> work until a node was added or restarted, because the temporary file was
> present on all datanodes. As a result, the corruption could not be detected
> by replication. After HDFS-6948, the corruption will be detected after the
> file is closed by lease recovery or by a subsequent append-close.