[
https://issues.apache.org/jira/browse/HADOOP-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653974#action_12653974
]
Hairong Kuang commented on HADOOP-4742:
---------------------------------------
ant test-core succeeded:
BUILD SUCCESSFUL
Total time: 115 minutes 14 seconds
ant test-patch result:
[exec] -1 overall.
[exec] +1 @author. The patch does not contain any @author tags.
[exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
[exec] Please justify why no tests are needed for this patch.
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
> Mistaken replica deletion in Hadoop 0.18.1
> ------------------------------------------
>
> Key: HADOOP-4742
> URL: https://issues.apache.org/jira/browse/HADOOP-4742
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.18.1
> Environment: CentOS 5.2, JDK 1.6,
> 16 datanodes and 1 namenode, each with 8 GB of memory and a 4-core CPU, connected
> by Gigabit Ethernet
> Reporter: Wang Xu
> Assignee: Wang Xu
> Priority: Blocker
> Fix For: 0.18.3
>
> Attachments: blockReceived-br18.patch, blockReceived.patch,
> HADOOP-4742.diff
>
>
> We recently deployed a 0.18.1 cluster and ran some tests. We found that
> if we corrupt a block, the namenode detects the corruption and re-replicates
> the block as soon as a client reads it. However, at the same time the namenode
> deletes a healthy replica (the source of that replication). I think this
> issue may affect the whole 0.18 tree.
> After some tracing, I found that FSNamesystem.addStoredBlock() checks
> the number of replicas after adding the block to blocksMap:
> | NumberReplicas num = countNodes(storedBlock);
> | int numLiveReplicas = num.liveReplicas();
> | int numCurrentReplica = numLiveReplicas
> | + pendingReplications.getNumReplicas(block);
> which means both the live replicas and the pending replications are counted.
> But at the end of FSNamesystem.blockReceived(), which calls addStoredBlock(),
> addStoredBlock() is invoked first and the pendingReplications count is only
> decremented afterwards:
> | //
> | // Modify the blocks->datanode map and node's map.
> | //
> | addStoredBlock(block, node, delHintNode );
> | pendingReplications.remove(block);
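> To make the double count concrete, here is a rough sketch of the arithmetic
> at that point, using the numbers from the logs below (the replication factor
> of 3 is my assumption):
> | // blockReceived() for the re-replicated copy on 192.168.33.45:
> | int numLiveReplicas = 3;  // two remaining healthy replicas plus the new one,
> |                           // which addStoredBlock() has already added to blocksMap
> | int numPending = 1;       // the same transfer is still in pendingReplications
> | int numCurrentReplica = numLiveReplicas + numPending;
> | // = 4: the new replica is counted twice, once as live and once as pending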
> Hence, the newly replicated replica is counted twice, gets marked as excess,
> and leads to a mistaken deletion.
> I think reordering these two calls in blockReceived() may solve this
> issue:
> --- FSNamesystem.java-orig 2008-11-28 13:34:40.000000000 +0800
> +++ FSNamesystem.java 2008-11-28 13:54:12.000000000 +0800
> @@ -3152,8 +3152,8 @@
> //
> // Modify the blocks->datanode map and node's map.
> //
> - addStoredBlock(block, node, delHintNode );
> pendingReplications.remove(block);
> + addStoredBlock(block, node, delHintNode );
> }
> long[] getStats() throws IOException {
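> With the two calls reordered as in the patch above, the same sketch becomes
> (same illustrative numbers):
> | int numLiveReplicas = 3;  // the new replica is still counted once via blocksMap
> | int numPending = 0;       // the pending entry has already been removed
> | int numCurrentReplica = numLiveReplicas + numPending;
> | // = 3, matching the replication factor, so nothing is marked as excess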
> The following are the logs of the mistaken deletion, with additional
> logging info inserted by me.
> 2008-11-28 11:22:08,866 INFO org.apache.hadoop.dfs.StateChange: *DIR*
> NameNode.reportBadBlocks
> 2008-11-28 11:22:08,866 INFO org.apache.hadoop.dfs.StateChange: BLOCK
> NameSystem.addToCorruptReplicasMap: blk_3828935579548953768 added as
> corrupt on 192.168.33.51:50010 by /192.168.33.51
> 2008-11-28 11:22:10,179 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> ask 192.168.33.50:50010 to replicate blk_3828935579548953768_1184 to
> datanode(s) 192.168.33.45:50010
> 2008-11-28 11:22:12,629 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated: 192.168.33.45:50010 is
> added to blk_3828935579548953768_1184 size 67108864
> 2008-11-28 11:22:12,629 INFO org.apache.hadoop.dfs.StateChange: Wang
> Xu* NameSystem.addStoredBlock: current replicas 4 in which has 1
> pendings
> 2008-11-28 11:22:12,630 INFO org.apache.hadoop.dfs.StateChange: DIR*
> NameSystem.invalidateBlock: blk_3828935579548953768_1184 on
> 192.168.33.51:50010
> 2008-11-28 11:22:12,630 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> NameSystem.delete: blk_3828935579548953768 is added to invalidSet of
> 192.168.33.51:50010
> 2008-11-28 11:22:13,180 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> ask 192.168.33.44:50010 to delete blk_3828935579548953768_1184
> 2008-11-28 11:22:13,181 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> ask 192.168.33.51:50010 to delete blk_3828935579548953768_1184