[
https://issues.apache.org/jira/browse/HDFS-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Allen Wittenauer resolved HDFS-1225.
------------------------------------
Resolution: Incomplete
Append got overhauled in 2.x. Closing.
> Block lost when primary crashes in recoverBlock
> -----------------------------------------------
>
> Key: HDFS-1225
> URL: https://issues.apache.org/jira/browse/HDFS-1225
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 0.20-append
> Reporter: Thanh Do
>
> - Summary: Block is lost if the primary datanode crashes in the middle of
> tryUpdateBlock.
>
> - Setup:
> # available datanodes = 2
> # replicas = 2
> # disks / datanode = 1
> # failures = 1
> # failure type = crash
> # when/where failure happens = (see below)
>
> - Details:
> Suppose we have 2 datanodes, dn1 and dn2, and dn1 is the primary.
> The client appends to blk_X_1001, and a crash happens during dn1.recoverBlock,
> right after blk_X_1001.meta is renamed to blk_X_1001.meta_tmp1002.
> Interestingly, in this case block X is eventually lost. Why?
> After dn1.recoverBlock crashes at the rename, what is left in dn1's current
> directory is:
> 1) blk_X
> 2) blk_X_1001.meta_tmp1002
> ==> this is an invalid replica, because the block has no meta file
> associated with it.
> dn2 (after the dn1 crash) now contains:
> 1) blk_X
> 2) blk_X_1002.meta
> (note that the rename at dn2 completed, because dn1 called dn2.updateBlock()
> before calling its own updateBlock(); see the rename sketch just below)
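> Below is a minimal, hypothetical Java sketch of the non-atomic rename window
> described above, loosely modeled on tryUpdateBlock in 0.20-append; the class,
> method, and variable names are assumptions for illustration, not the real
> Hadoop code:
>
>   import java.io.File;
>   import java.io.IOException;
>
>   // Sketch: re-stamping a replica's meta file from oldGS to newGS via
>   // two renames, with a crash window between them.
>   public class TryUpdateBlockSketch {
>     static void updateMeta(File dir, String blk, long oldGS, long newGS)
>         throws IOException {
>       File oldMeta = new File(dir, blk + "_" + oldGS + ".meta");
>       File tmpMeta = new File(dir, blk + "_" + oldGS + ".meta_tmp" + newGS);
>       File newMeta = new File(dir, blk + "_" + newGS + ".meta");
>
>       // Step 1: move the old meta file aside.
>       if (!oldMeta.renameTo(tmpMeta)) {
>         throw new IOException("rename " + oldMeta + " -> " + tmpMeta + " failed");
>       }
>       // <-- dn1 crashed here: only blk_X and blk_X_1001.meta_tmp1002 remain,
>       //     so the replica has no valid meta file and is treated as invalid.
>
>       // Step 2 (elided): truncate the block file to the recovered length.
>
>       // Step 3: give the meta file the new generation stamp.
>       if (!tmpMeta.renameTo(newMeta)) {
>         throw new IOException("rename " + tmpMeta + " -> " + newMeta + " failed");
>       }
>     }
>   }
>
> A crash between step 1 and step 3 yields exactly the dn1 state above; a run
> that reaches step 3 yields the dn2 state (blk_X plus blk_X_1002.meta).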
> But namenode.commitBlockSynchronization is never sent, because dn1 crashed
> before reaching it. Therefore, from the namenode's point of view, block X
> still has GS 1001, while the only intact replica (on dn2) now carries GS 1002.
> Hence, the block is lost.
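> To see why the commit never happens, here is an equally hypothetical
> condensation of the primary's recovery ordering (updateBlock on every
> replica, remote targets before the primary itself, and only then
> commitBlockSynchronization); the interfaces below are simplifications,
> not the real 0.20-append protocol classes:
>
>   import java.io.IOException;
>
>   public class RecoverBlockSketch {
>     interface Replica {
>       void updateBlock(String blk, long newGS) throws IOException;
>     }
>     interface NameNode {
>       void commitBlockSynchronization(String blk, long newGS);
>     }
>
>     // The primary re-stamps each replica in turn; in this report dn2
>     // (remote) is updated before dn1 (the primary itself).
>     static void recoverBlock(String blk, long newGS,
>                              Replica[] replicas, NameNode namenode)
>         throws IOException {
>       for (Replica r : replicas) {
>         r.updateBlock(blk, newGS);  // dn1 crashes inside its own call
>       }
>       // Reached only if every updateBlock succeeded. With dn1 crashed
>       // mid-rename, the namenode never hears about newGS (1002) and
>       // still expects GS 1001, which no valid replica carries.
>       namenode.commitBlockSynchronization(blk, newGS);
>     }
>   }
>
> Since the commit is the final step and runs only on the primary, any primary
> crash during the per-replica updates leaves the namenode with the stale
> generation stamp.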
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do ([email protected]) and
> Haryadi Gunawi ([email protected])
--
This message was sent by Atlassian JIRA
(v6.2#6252)