DataNode restarts may introduce corrupt/duplicated/lost replicas when handling
detached replicas
------------------------------------------------------------------------------------------------
Key: HDFS-550
URL: https://issues.apache.org/jira/browse/HDFS-550
Project: Hadoop HDFS
Issue Type: Bug
Components: data-node
Affects Versions: 0.21.0
Reporter: Hairong Kuang
Assignee: Hairong Kuang
Priority: Blocker
Fix For: Append Branch
Current trunk first unlinks a finalized replica before appending to the
block. The unlink is done by temporarily copying the block file from the
"current" subtree to a directory called "detach" under the volume's data
directory, and then copying it back once the unlink succeeds. On restart, a
datanode recovers a failed unlink by copying the replicas under "detach" back
to "current".
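For illustration, here is a minimal sketch of that unlink-via-detach flow (the
copy-aside-and-copy-back breaks the hard link the block file may share with an
upgrade snapshot). The class and method names below are hypothetical and do not
correspond to the actual FSDataset code:

{code:java}
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Illustrative sketch only; names do not match the real FSDataset implementation.
class DetachSketch {
  /** Unlink a finalized replica before an append: copy it aside into the
   *  volume's "detach" directory, then copy it back so the block file no
   *  longer shares an inode with a snapshot hard link. */
  static void unlink(File blockFile, File detachDir) throws IOException {
    File detached = new File(detachDir, blockFile.getName());
    Files.copy(blockFile.toPath(), detached.toPath(),
               StandardCopyOption.REPLACE_EXISTING);   // step 1: copy into "detach"
    Files.copy(detached.toPath(), blockFile.toPath(),
               StandardCopyOption.REPLACE_EXISTING);   // step 2: copy back on success
    if (!detached.delete()) {
      throw new IOException("Failed to remove " + detached);
    }
  }
}
{code}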
There are two bugs with this implementation:
1. The "detach" directory does not include in a snapshot. so rollback will
cause the "detaching" replicas to be lost.
2. After a replica is copied to the "detach" directory, the information about
its original location is lost. The current implementation erroneously assumes
that the replica to be unlinked is under "current". This can cause two
replicas with the same block id to coexist on a datanode. Also, if the replica
under "detach" is corrupt, the corrupt replica is moved to "current" without
being detected, polluting the datanode's data (see the sketch after this list).
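A sketch of the restart-time recovery path, showing where the second bug
bites. Again, the names are illustrative only, not the actual FSDataset code:

{code:java}
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Illustrative sketch only; names do not match the real FSDataset implementation.
class DetachRecoverySketch {
  /** Copy every replica left in "detach" back into "current" after a restart. */
  static void recoverDetachedReplicas(File detachDir, File currentDir)
      throws IOException {
    File[] leftovers = detachDir.listFiles();
    if (leftovers == null) {
      return;                               // nothing was being detached
    }
    for (File f : leftovers) {
      // Bug 2: the replica's original location was never recorded, so it is
      // copied straight into "current"; if a replica with the same block id
      // already lives elsewhere, two instances now coexist on the datanode.
      File target = new File(currentDir, f.getName());
      // Also bug 2: no checksum verification is done here, so a corrupt copy
      // under "detach" is moved into "current" undetected.
      Files.copy(f.toPath(), target.toPath(),
                 StandardCopyOption.REPLACE_EXISTING);
      if (!f.delete()) {
        throw new IOException("Failed to remove " + f + " after recovery");
      }
    }
  }
}
{code}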