Wei-Chiu Chuang created HDFS-11160:
--------------------------------------

             Summary: VolumeScanner incorrectly reports good replicas as corrupt due to race condition
                 Key: HDFS-11160
                 URL: https://issues.apache.org/jira/browse/HDFS-11160
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
         Environment: CDH5.7.4
            Reporter: Wei-Chiu Chuang
            Assignee: Wei-Chiu Chuang


Due to a race condition initially reported in HDFS-6804, VolumeScanner may 
erroneously report good replicas as corrupt. This is serious because it can 
result in data loss if all replicas of a block are declared corrupt.

We are investigating an incident that caused a very high block corruption rate 
in a relatively small cluster. Initially, we thought HDFS-11056 was to blame. 
However, after applying the fix for HDFS-11056, we are still seeing 
VolumeScanner report corrupt replicas.

It turns out that if a replica is being appended to while VolumeScanner is 
scanning it, VolumeScanner may compare the new checksum against old data, 
causing a checksum mismatch.
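
The interleaving described above can be illustrated with a small standalone 
simulation. This is only a sketch, not HDFS code: the class and method names 
are hypothetical, java.util.zip.CRC32 stands in for whatever checksum type the 
cluster is configured with, and the exact read order (data first, checksum 
second) is an assumption made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical stand-in for a replica whose last chunk is partial.
public class StaleChecksumDemo {
    // Simulated block file contents and the last-chunk checksum in the meta file.
    static byte[] blockData = "hello".getBytes(StandardCharsets.US_ASCII);
    static long lastChunkChecksum = crc(blockData);

    static long crc(byte[] bytes) {
        CRC32 c = new CRC32();
        c.update(bytes, 0, bytes.length);
        return c.getValue();
    }

    // Simulated append: grows the partial chunk and overwrites its checksum.
    static void append(byte[] more) {
        byte[] grown = Arrays.copyOf(blockData, blockData.length + more.length);
        System.arraycopy(more, 0, grown, blockData.length, more.length);
        blockData = grown;
        lastChunkChecksum = crc(grown);
    }

    // Returns true when the scanner would report a (false) checksum mismatch.
    static boolean scanWithInterleavedAppend() {
        byte[] seenData = blockData.clone();      // scanner reads the old data
        append(" world".getBytes(StandardCharsets.US_ASCII)); // append lands in between
        long seenChecksum = lastChunkChecksum;    // scanner reads the new checksum
        return crc(seenData) != seenChecksum;     // old data vs. new checksum
    }

    public static void main(String[] args) {
        System.out.println("false corruption report: " + scanWithInterleavedAppend());
    }
}
```

Neither the data nor the checksum is actually damaged in this simulation; the 
mismatch comes purely from reading the two at different points in time.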

I have a unit test that reproduces the error and will attach it later.

To fix it, I propose that a FinalizedReplica object also have a lastChecksum 
field, like ReplicaBeingWritten, and that BlockSender use the in-memory 
lastChecksum to verify the partial data in the last chunk on disk. I am filing 
this JIRA to discuss a good fix for this issue.
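
As a starting point for that discussion, here is one possible reading of the 
proposal as a standalone sketch. It does not use the real FinalizedReplica or 
BlockSender classes: Replica, ChunkState, snapshot() and verify() are all 
hypothetical names, and CRC32 is again only a stand-in checksum. The assumption 
is that the reader takes the last-chunk length and checksum together from 
memory, then verifies only the bytes that snapshot covers.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

public class LastChecksumSketch {
    // Immutable pair: visible length and the checksum covering exactly that many bytes.
    public static final class ChunkState {
        public final int length;
        public final long checksum;
        ChunkState(int length, long checksum) {
            this.length = length;
            this.checksum = checksum;
        }
    }

    // Hypothetical replica that keeps the last-chunk checksum in memory.
    public static final class Replica {
        private byte[] onDisk = new byte[0];
        private ChunkState lastChunk = new ChunkState(0, crc(onDisk, 0));

        // Data and in-memory checksum state are updated under the same lock.
        public synchronized void append(byte[] more) {
            byte[] grown = Arrays.copyOf(onDisk, onDisk.length + more.length);
            System.arraycopy(more, 0, grown, onDisk.length, more.length);
            onDisk = grown;
            lastChunk = new ChunkState(grown.length, crc(grown, grown.length));
        }

        public synchronized ChunkState snapshot() { return lastChunk; }

        public synchronized byte[] readData() { return onDisk; }
    }

    static long crc(byte[] bytes, int len) {
        CRC32 c = new CRC32();
        c.update(bytes, 0, len);
        return c.getValue();
    }

    // Verifies against the snapshot even if an append happens before the data read.
    public static boolean verify(Replica r, Runnable interleaved) {
        ChunkState s = r.snapshot();   // length and checksum taken together
        interleaved.run();             // a concurrent append may land here
        byte[] data = r.readData();    // may now be longer than s.length
        return crc(data, s.length) == s.checksum; // check only the covered bytes
    }

    public static void main(String[] args) {
        Replica r = new Replica();
        r.append("hello".getBytes(StandardCharsets.US_ASCII));
        boolean ok = verify(r, () -> r.append(" world".getBytes(StandardCharsets.US_ASCII)));
        System.out.println("replica verified: " + ok);
    }
}
```

The sketch relies on appends only ever adding bytes after the snapshotted 
length; whether that holds for every write path in the DataNode is part of what 
needs to be discussed.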



