[
https://issues.apache.org/jira/browse/HDFS-7203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163268#comment-14163268
]
Vinayakumar B commented on HDFS-7203:
-------------------------------------
Good finding [~kihwal].
Your test reproduces the issue only intermittently, since it's a race between concurrent writes.
+1, Patch looks good to me.
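For anyone else trying to trigger it, a rough sketch of that kind of multi-threaded append loop (this is not the attached test; the path, payload size, thread and loop counts are made up, and it assumes fs.defaultFS points at a cluster with append enabled):
{code:java}
// Rough reproduction sketch (NOT the attached test): several threads loop on
// append-write-close against the same file. Assumes fs.defaultFS points at an
// HDFS cluster with append enabled; path, payload size and counts are made up.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConcurrentAppendRepro {
  public static void main(String[] args) throws Exception {
    final Configuration conf = new Configuration();
    final FileSystem fs = FileSystem.get(conf);
    final Path file = new Path("/tmp/append-race.dat");
    final byte[] data = new byte[257];   // deliberately not a multiple of 512

    fs.create(file).close();

    Thread[] writers = new Thread[4];
    for (int i = 0; i < writers.length; i++) {
      writers[i] = new Thread(() -> {
        for (int n = 0; n < 100; n++) {
          // Each append() internally does getFileStatus() followed by
          // namenode.append(); another thread may grow the file in between.
          try (FSDataOutputStream out = fs.append(file)) {
            out.write(data);
          } catch (IOException e) {
            // Lease conflicts between the competing appenders are expected;
            // the corruption only shows up on reads after an unlucky run.
          }
        }
      });
      writers[i].start();
    }
    for (Thread t : writers) {
      t.join();
    }
  }
}
{code}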
I feel we can avoid the two RPCs to the namenode for append if we combine the
LastBlock and {{HdfsFileStatus}} into a single response.
I will file a separate Jira for this improvement.
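Roughly along these lines (the class name below is only illustrative, not from an actual patch): the append RPC response could carry both the last block and the current file status, computed under the same namesystem lock, so the length used for checksum alignment can never lag behind the block being appended to.
{code:java}
// Illustrative sketch only: a hypothetical response wrapper letting a single
// append RPC return both pieces of state DFSClient currently fetches in two
// separate calls (getFileStatus() + append()).
import org.apache.hadoop.hdfs.protocol.HdfsFileStatus;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;

public class AppendResponse {
  private final LocatedBlock lastBlock;     // last (possibly partial) block
  private final HdfsFileStatus fileStatus;  // up-to-date length, block size, etc.

  public AppendResponse(LocatedBlock lastBlock, HdfsFileStatus fileStatus) {
    this.lastBlock = lastBlock;
    this.fileStatus = fileStatus;
  }

  public LocatedBlock getLastBlock() { return lastBlock; }
  public HdfsFileStatus getFileStatus() { return fileStatus; }
}
{code}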
> Concurrent appending to the same file can cause data corruption
> ---------------------------------------------------------------
>
> Key: HDFS-7203
> URL: https://issues.apache.org/jira/browse/HDFS-7203
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Priority: Blocker
> Attachments: HDFS-7203.patch
>
>
> When multiple threads call append against the same file, the file can get
> corrupted. The root of the problem is that a stale file stat may be used
> for append in {{DFSClient}}. If the file size changes between
> {{getFileStatus()}} and {{namenode.append()}}, {{DataStreamer}} will get
> confused about how to align data to the checksum boundary and break the
> assumption made by data nodes.
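> As a rough illustration, assuming the default 512-byte checksum chunk (the
> names below are approximate, not the exact {{DFSOutputStream}} code):
> {code:java}
> long statLen   = 600;    // stale length from getFileStatus()
> long actualLen = 1000;   // real length after a concurrent append completed
> int bytesPerChecksum = 512;
> int partialFromStat = (int) (statLen % bytesPerChecksum);    // 88
> int partialOnDisk   = (int) (actualLen % bytesPerChecksum);  // 488
> // The client fills and flushes the last chunk as if it already held 88
> // bytes, while the datanode's last chunk really holds 488, so the checksum
> // sent for that chunk no longer matches the data on disk.
> {code}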
> When this happens, the datanode may not write the last checksum. On the next
> append attempt, the datanode won't be able to reposition for the partial chunk, since
> the last checksum is missing. The append will fail after running out of data
> nodes to copy the partial block to.
> However, if there are more threads that try to append, this leads to a more
> serious situation. In a few minutes, a lease recovery and block recovery
> will happen. The block recovery truncates the block to the ack'ed size in
> order to make sure to keep only the portion of data that is
> checksum-verified. The problem is that, during the last successful append, the
> last data node verified the checksum and ack'ed before writing the data and the
> wrong metadata to disk, and all data nodes in the pipeline wrote the same wrong
> metadata. So the ack'ed size includes the corrupt portion of the data.
> Since block recovery does not perform any checksum verification, the file
> sizes are adjusted and after {{commitBlockSynchronization()}}, another thread
> will be allowed to append to the corrupt file. This latent corruption may
> not be detected for a very long time.
> The first failing {{append()}} would have created a partial copy of the block
> in the temporary directory of every data node in the cluster. After this
> failure, the block is likely under-replicated, so the file will be scheduled for
> replication after being closed. Before HDFS-6948, replication didn't work
> until a node was added or restarted, because the temporary file was present on
> all data nodes. As a result, the corruption could not be detected by replication.
> After HDFS-6948, the corruption will be detected after the file is closed by
> lease recovery or subsequent append-close.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)