Wei-Chiu Chuang created HDFS-11160:
--------------------------------------
Summary: VolumeScanner incorrectly reports good replicas as
corrupt due to race condition
Key: HDFS-11160
URL: https://issues.apache.org/jira/browse/HDFS-11160
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Environment: CDH5.7.4
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang
Due to a race condition initially reported in HDFS-6804, VolumeScanner may
erroneously report good replicas as corrupt. This is serious because it can
lead to data loss if all replicas of a block are declared corrupt.
We are investigating an incident that caused a very high block corruption rate
in a relatively small cluster. Initially, we thought HDFS-11056 was to blame.
However, after applying the HDFS-11056 fix, we are still seeing VolumeScanner
report corrupt replicas.
It turns out that if a replica is being appended to while VolumeScanner is
scanning it, VolumeScanner may compare the new checksum against the old
data, causing a checksum mismatch.
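A minimal sketch of the mismatch, for illustration only. It does not use any
HDFS classes: the class StaleReadRace, the helpers crc and scannerCheck, and the
512-byte chunk are all hypothetical stand-ins, and java.util.zip.CRC32 stands in
for whatever checksum the DataNode is configured with. It only shows that a
checksum computed after an append will not match the pre-append bytes.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical stand-in, not HDFS code: models one partial last chunk.
class StaleReadRace {
    // Checksum over the first len bytes of the chunk.
    static long crc(byte[] data, int len) {
        CRC32 c = new CRC32();
        c.update(data, 0, len);
        return c.getValue();
    }

    // What a scanner effectively does: recompute and compare.
    static boolean scannerCheck(byte[] dataSeen, int lenSeen, long checksumSeen) {
        return crc(dataSeen, lenSeen) == checksumSeen;
    }

    public static void main(String[] args) {
        byte[] chunk = new byte[512];

        // State before the append: 100 bytes in the last partial chunk.
        Arrays.fill(chunk, 0, 100, (byte) 'a');
        int oldLen = 100;
        long oldChecksum = crc(chunk, oldLen);

        // The append adds 50 bytes and rewrites the last checksum.
        Arrays.fill(chunk, 100, 150, (byte) 'b');
        long newChecksum = crc(chunk, 150);

        // Consistent view: old length with old checksum.
        System.out.println("consistent view ok: "
            + scannerCheck(chunk, oldLen, oldChecksum));
        // Racy view: old length paired with the new checksum.
        System.out.println("racy view ok: "
            + scannerCheck(chunk, oldLen, newChecksum));
    }
}
```

In the racy view the data itself is intact; only the pairing of length and
checksum is inconsistent, which is why a good replica can look corrupt.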
I have a unit test to reproduce the error. Will attach later.
To fix it, I propose that a FinalizedReplica object also have a lastChecksum
field, like ReplicaBeingWritten, and that BlockSender use the in-memory
lastChecksum to verify the partial data in the last chunk on disk. I am filing
this JIRA to discuss a good fix for this issue.
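A rough sketch of the idea behind this proposal, under stated assumptions. None
of the names below are HDFS APIs: Replica, ReplicaSnapshot, snapshot and verify
are hypothetical, a byte array stands in for the on-disk last chunk, and
java.util.zip.CRC32 stands in for the configured checksum. The point is only
that a reader which captures the length and last checksum together from memory
can still verify the bytes it sees, even if an append lands afterwards.

```java
import java.util.zip.CRC32;

// Hypothetical sketch, not the actual HDFS change.
class LastChecksumSketch {
    static long crc(byte[] data, int len) {
        CRC32 c = new CRC32();
        c.update(data, 0, len);
        return c.getValue();
    }

    // Length and last-chunk checksum captured at the same instant.
    static final class ReplicaSnapshot {
        final int visibleLength;
        final long lastChecksum;

        ReplicaSnapshot(int visibleLength, long lastChecksum) {
            this.visibleLength = visibleLength;
            this.lastChecksum = lastChecksum;
        }
    }

    // Stand-in for a replica that keeps its last checksum in memory.
    static final class Replica {
        private final byte[] lastChunk = new byte[512]; // models on-disk bytes
        private int length;
        private long lastChecksum; // CRC32 of zero bytes is 0

        // Append-only: earlier bytes are never modified.
        synchronized void append(byte[] more) {
            System.arraycopy(more, 0, lastChunk, length, more.length);
            length += more.length;
            lastChecksum = crc(lastChunk, length);
        }

        // Length and checksum are read under the same lock as the writer.
        synchronized ReplicaSnapshot snapshot() {
            return new ReplicaSnapshot(length, lastChecksum);
        }

        // Unsynchronized read of the chunk; fine for this single-threaded demo.
        byte[] readDisk() {
            return lastChunk;
        }
    }

    // Verify the bytes covered by the snapshot against the snapshot's checksum.
    static boolean verify(Replica r, ReplicaSnapshot snap) {
        return crc(r.readDisk(), snap.visibleLength) == snap.lastChecksum;
    }

    public static void main(String[] args) {
        Replica r = new Replica();
        r.append(new byte[] {1, 2, 3});
        ReplicaSnapshot snap = r.snapshot(); // reader captures a consistent view
        r.append(new byte[] {4, 5});         // append lands afterwards
        System.out.println("snapshot still verifies: " + verify(r, snap));
    }
}
```

Whether the real fix should snapshot the checksum this way, or handle locking
differently, is exactly what this JIRA is meant to discuss.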