[
https://issues.apache.org/jira/browse/HDFS-7203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163268#comment-14163268
]
Vinayakumar B commented on HDFS-7203:
-------------------------------------
Good finding [~kihwal].
Your test reproduces the issue only intermittently, since it's a race between concurrent writes.
+1, Patch looks good to me.
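For anyone else trying to trigger it, a rough sketch of that kind of multi-threaded append loop (this is not the attached test; the path, payload size, thread and loop counts are made up, and it assumes fs.defaultFS points at a cluster with append enabled):
{code:java}
// Rough reproduction sketch (NOT the attached test): several threads loop on
// append-write-close against the same file. Assumes fs.defaultFS points at an
// HDFS cluster with append enabled; path, payload size and counts are made up.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConcurrentAppendRepro {
  public static void main(String[] args) throws Exception {
    final Configuration conf = new Configuration();
    final FileSystem fs = FileSystem.get(conf);
    final Path file = new Path("/tmp/append-race.dat");
    final byte[] data = new byte[257];   // deliberately not a multiple of 512

    fs.create(file).close();

    Thread[] writers = new Thread[4];
    for (int i = 0; i < writers.length; i++) {
      writers[i] = new Thread(() -> {
        for (int n = 0; n < 100; n++) {
          // Each append() internally does getFileStatus() followed by
          // namenode.append(); another thread may grow the file in between.
          try (FSDataOutputStream out = fs.append(file)) {
            out.write(data);
          } catch (IOException e) {
            // Lease conflicts between the competing appenders are expected;
            // the corruption only shows up on reads after an unlucky run.
          }
        }
      });
      writers[i].start();
    }
    for (Thread t : writers) {
      t.join();
    }
  }
}
{code}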
I feel we can avoid the two RPCs to the namenode for append if we combine the
LastBlock and {{HdfsFileStatus}} into a single response.
I will file a separate Jira for this improvement.
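Roughly along these lines (the class name below is only illustrative, not from an actual patch): the append RPC response could carry both the last block and the current file status, computed under the same namesystem lock, so the length used for checksum alignment can never lag behind the block being appended to.
{code:java}
// Illustrative sketch only: a hypothetical response wrapper letting a single
// append RPC return both pieces of state DFSClient currently fetches in two
// separate calls (getFileStatus() + append()).
import org.apache.hadoop.hdfs.protocol.HdfsFileStatus;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;

public class AppendResponse {
  private final LocatedBlock lastBlock;     // last (possibly partial) block
  private final HdfsFileStatus fileStatus;  // up-to-date length, block size, etc.

  public AppendResponse(LocatedBlock lastBlock, HdfsFileStatus fileStatus) {
    this.lastBlock = lastBlock;
    this.fileStatus = fileStatus;
  }

  public LocatedBlock getLastBlock() { return lastBlock; }
  public HdfsFileStatus getFileStatus() { return fileStatus; }
}
{code}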
> Concurrent appending to the same file can cause data corruption
> ---------------------------------------------------------------
>
> Key: HDFS-7203
> URL: https://issues.apache.org/jira/browse/HDFS-7203
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Priority: Blocker
> Attachments: HDFS-7203.patch
>
>
> When multiple threads call append against the same file, the file can get
> corrupted. The root of the problem is that a stale file stat may be used
> for append in {{DFSClient}}. If the file size changes between
> {{getFileStatus()}} and {{namenode.append()}}, {{DataStreamer}} will get
> confused about how to align data to the checksum boundary and break the
> assumption made by data nodes.
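> As a rough illustration, assuming the default 512-byte checksum chunk (the
> names below are approximate, not the exact {{DFSOutputStream}} code):
> {code:java}
> long statLen   = 600;    // stale length from getFileStatus()
> long actualLen = 1000;   // real length after a concurrent append completed
> int bytesPerChecksum = 512;
> int partialFromStat = (int) (statLen % bytesPerChecksum);    // 88
> int partialOnDisk   = (int) (actualLen % bytesPerChecksum);  // 488
> // The client fills and flushes the last chunk as if it already held 88
> // bytes, while the datanode's last chunk really holds 488, so the checksum
> // sent for that chunk no longer matches the data on disk.
> {code}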
> When this happens, the datanode may not write the last checksum. On the next
> append attempt, the datanode won't be able to reposition for the partial chunk, since
> the last checksum is missing. The append will fail after running out of data
> nodes to copy the partial block to.
> However, if there are more threads that try to append, this leads to a more
> serious situation. In a few minutes, a lease recovery and block recovery
> will happen. The block recovery truncates the block to the ack'ed size in
> order to make sure to keep only the portion of data that is
> checksum-verified. The problem is that, during the last successful append, the
> last data node verified the checksum and ack'ed before writing the data and the
> wrong metadata to disk, and all data nodes in the pipeline wrote the same wrong
> metadata. So the ack'ed size includes the corrupt portion of the data.
> Since block recovery does not perform any checksum verification, the file
> sizes are adjusted and after {{commitBlockSynchronization()}}, another thread
> will be allowed to append to the corrupt file. This latent corruption may
> not be detected for a very long time.
> The first failing {{append()}} would have created a partial copy of the block
> in the temporary directory of every data node in the cluster. After this
> failure, the block is likely under-replicated, so the file will be scheduled for
> replication after being closed. Before HDFS-6948, replication didn't work
> until a node was added or restarted, because the temporary file was present on
> all data nodes. As a result, the corruption could not be detected by replication.
> After HDFS-6948, the corruption will be detected after the file is closed by
> lease recovery or subsequent append-close.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)