Kihwal Lee created HDFS-7203:
--------------------------------
Summary: Concurrent appending to the same file can cause data
corruption
Key: HDFS-7203
URL: https://issues.apache.org/jira/browse/HDFS-7203
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Kihwal Lee
Priority: Blocker
When multiple threads call append against the same file, the file can become
corrupted. The root of the problem is that a stale file stat may be used for
append in {{DFSClient}}. If the file size changes between {{getFileStatus()}}
and {{namenode.append()}}, {{DataStreamer}} gets confused about how to align
data to the checksum boundary and breaks the assumptions made by the
datanodes.
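For illustration only (not taken from a real reproduction), a minimal sketch of
the kind of workload described, written against the public {{FileSystem}} API.
The path, write size, and iteration count are arbitrary, the configuration is
assumed to point at an HDFS cluster, and whether the race actually fires
depends on timing.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConcurrentAppendSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // assumed to point at HDFS
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/tmp/append-race-test"); // hypothetical test path

    fs.create(file).close();  // create once; all later writes go through append()

    Runnable appender = () -> {
      for (int i = 0; i < 100; i++) {
        try {
          // Per the report, DFSClient fetches a file stat and then calls
          // namenode.append(); if another thread's append lands in between,
          // the length used to align data to the checksum boundary is stale.
          FSDataOutputStream out = fs.append(file);
          out.write(new byte[123]);  // deliberately not a multiple of 512
          out.close();
        } catch (Exception e) {
          // lease/recovery exceptions are expected under contention; keep going
        }
      }
    };

    Thread t1 = new Thread(appender);
    Thread t2 = new Thread(appender);
    t1.start(); t2.start();
    t1.join(); t2.join();
  }
}
{code}

Each 123-byte write leaves a partial checksum chunk behind, which is the state
the rest of this report is concerned with.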
When this happens, the datanode may not write the checksum for the last,
partial chunk. On the next append attempt, the datanode cannot reposition to
the partial chunk because that checksum is missing, so the append fails after
running out of datanodes to copy the partial block to.
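The misalignment itself is just modular arithmetic on the file length. A small
illustration with made-up numbers, assuming the default {{dfs.bytes-per-checksum}}
of 512:

{code:java}
public class ChunkOffsetExample {
  public static void main(String[] args) {
    long bytesPerChecksum = 512;      // assumed dfs.bytes-per-checksum default
    long lengthFromStaleStat = 1000;  // length the appending client saw
    long actualLength = 1100;         // length after another thread's append

    // Offset inside the last (partial) checksum chunk, as seen by each side.
    System.out.println("client-aligned offset:  "
        + lengthFromStaleStat % bytesPerChecksum);  // 488
    System.out.println("on-disk partial offset: "
        + actualLength % bytesPerChecksum);         // 76

    // The mismatch means the data and recomputed checksum sent by the client
    // no longer line up with the partial chunk actually on disk.
  }
}
{code}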
However, if more threads keep trying to append, the situation becomes more
serious. Within a few minutes, lease recovery and block recovery kick in. Block
recovery truncates the block to the acked size in order to keep only the
portion of the data that has been checksum-verified. The problem is that during
the last successful append, the last datanode in the pipeline verified the
checksum and acked before writing the data and the wrong metadata to disk, and
every datanode in the pipeline wrote the same wrong metadata. The acked size
therefore includes the corrupt portion of the data.
Since block recovery does not perform any checksum verification, the sizes are
simply adjusted, and after {{commitBlockSynchronization()}} another thread is
allowed to append to the now-corrupt file. This latent corruption may go
undetected for a very long time.
The first failing {{append()}} will have created a partial copy of the block
in the temporary directory of every datanode in the cluster. After this
failure the block is likely under-replicated, so it will be scheduled for
replication once the file is closed. Before HDFS-6948, replication could not
proceed until a node was added or restarted, because the temporary copy of the
block sat on every datanode; as a result, replication could not detect the
corruption. After HDFS-6948, the corruption will be detected once the file is
closed by lease recovery or by a subsequent append and close.