[
https://issues.apache.org/jira/browse/HDFS-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Allen Wittenauer resolved HDFS-1225.
------------------------------------
Resolution: Incomplete
Append got overhauled in 2.x. Closing.
> Block lost when primary crashes in recoverBlock
> -----------------------------------------------
>
> Key: HDFS-1225
> URL: https://issues.apache.org/jira/browse/HDFS-1225
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 0.20-append
> Reporter: Thanh Do
>
> - Summary: Block is lost if the primary datanode crashes in the middle of
> tryUpdateBlock.
>
> - Setup:
> # available datanodes = 2
> # replicas = 2
> # disks / datanode = 1
> # failures = 1
> # failure type = crash
> # when/where failure happens = (see below)
>
> - Details:
> Suppose we have 2 datanodes, dn1 and dn2, and dn1 is the primary.
> The client appends to blk_X_1001, and a crash happens during dn1.recoverBlock,
> right after blk_X_1001.meta is renamed to blk_X_1001.meta_tmp1002.
> Interestingly, in this case block X is eventually lost. Why?
> After dn1.recoverBlock crashes at the rename, what is left in dn1's current
> directory is:
> 1) blk_X
> 2) blk_X_1001.meta_tmp1002
> ==> this is an invalid replica, because the block has no meta file
> associated with it.
> dn2 (after the dn1 crash) now contains:
> 1) blk_X
> 2) blk_X_1002.meta
> (note that the rename at dn2 completed, because dn1 called dn2.updateBlock()
> before calling its own updateBlock(); see the rename sketch just below)
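> Below is a minimal, hypothetical Java sketch of the non-atomic rename window
> described above, loosely modeled on tryUpdateBlock in 0.20-append; the class,
> method, and variable names are assumptions for illustration, not the real
> Hadoop code:
>
>   import java.io.File;
>   import java.io.IOException;
>
>   // Sketch: re-stamping a replica's meta file from oldGS to newGS via
>   // two renames, with a crash window between them.
>   public class TryUpdateBlockSketch {
>     static void updateMeta(File dir, String blk, long oldGS, long newGS)
>         throws IOException {
>       File oldMeta = new File(dir, blk + "_" + oldGS + ".meta");
>       File tmpMeta = new File(dir, blk + "_" + oldGS + ".meta_tmp" + newGS);
>       File newMeta = new File(dir, blk + "_" + newGS + ".meta");
>
>       // Step 1: move the old meta file aside.
>       if (!oldMeta.renameTo(tmpMeta)) {
>         throw new IOException("rename " + oldMeta + " -> " + tmpMeta + " failed");
>       }
>       // <-- dn1 crashed here: only blk_X and blk_X_1001.meta_tmp1002 remain,
>       //     so the replica has no valid meta file and is treated as invalid.
>
>       // Step 2 (elided): truncate the block file to the recovered length.
>
>       // Step 3: give the meta file the new generation stamp.
>       if (!tmpMeta.renameTo(newMeta)) {
>         throw new IOException("rename " + tmpMeta + " -> " + newMeta + " failed");
>       }
>     }
>   }
>
> A crash between step 1 and step 3 yields exactly the dn1 state above; a run
> that reaches step 3 yields the dn2 state (blk_X plus blk_X_1002.meta).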
> But namenode.commitBlockSynchronization is never sent, because dn1 crashed
> before reaching it. Therefore, from the namenode's point of view, block X
> still has GS 1001, while the only intact replica (on dn2) now carries GS 1002.
> Hence, the block is lost.
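> To see why the commit never happens, here is an equally hypothetical
> condensation of the primary's recovery ordering (updateBlock on every
> replica, remote targets before the primary itself, and only then
> commitBlockSynchronization); the interfaces below are simplifications,
> not the real 0.20-append protocol classes:
>
>   import java.io.IOException;
>
>   public class RecoverBlockSketch {
>     interface Replica {
>       void updateBlock(String blk, long newGS) throws IOException;
>     }
>     interface NameNode {
>       void commitBlockSynchronization(String blk, long newGS);
>     }
>
>     // The primary re-stamps each replica in turn; in this report dn2
>     // (remote) is updated before dn1 (the primary itself).
>     static void recoverBlock(String blk, long newGS,
>                              Replica[] replicas, NameNode namenode)
>         throws IOException {
>       for (Replica r : replicas) {
>         r.updateBlock(blk, newGS);  // dn1 crashes inside its own call
>       }
>       // Reached only if every updateBlock succeeded. With dn1 crashed
>       // mid-rename, the namenode never hears about newGS (1002) and
>       // still expects GS 1001, which no valid replica carries.
>       namenode.commitBlockSynchronization(blk, newGS);
>     }
>   }
>
> Since the commit is the final step and runs only on the primary, any primary
> crash during the per-replica updates leaves the namenode with the stale
> generation stamp.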
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do ([email protected]) and
> Haryadi Gunawi ([email protected])
--
This message was sent by Atlassian JIRA
(v6.2#6252)