[
https://issues.apache.org/jira/browse/HADOOP-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653974#action_12653974
]
Hairong Kuang commented on HADOOP-4742:
---------------------------------------
ant test-core succeeded:
BUILD SUCCESSFUL
Total time: 115 minutes 14 seconds
ant test-patch result:
[exec] -1 overall.
[exec] +1 @author. The patch does not contain any @author tags.
[exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
[exec] Please justify why no tests are needed for this patch.
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
> Mistaken replica deletion in Hadoop 0.18.1
> ------------------------------------------
>
> Key: HADOOP-4742
> URL: https://issues.apache.org/jira/browse/HADOOP-4742
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.18.1
> Environment: CentOS 5.2, JDK 1.6,
> 16 datanodes and 1 namenode, each with 8 GB of memory and a 4-core CPU, connected
> by Gigabit Ethernet
> Reporter: Wang Xu
> Assignee: Wang Xu
> Priority: Blocker
> Fix For: 0.18.3
>
> Attachments: blockReceived-br18.patch, blockReceived.patch,
> HADOOP-4742.diff
>
>
> We recently deployed a 0.18.1 cluster and ran some tests. We found that
> if we corrupt a block, the namenode detects the corruption and re-replicates
> the block as soon as a client reads it. However, at the same time the namenode
> deletes a healthy replica (the source of that replication). I think this
> issue may affect the whole 0.18 tree.
> After some tracing, I found that FSNamesystem.addStoredBlock() checks
> the number of replicas after adding the block to blocksMap:
> | NumberReplicas num = countNodes(storedBlock);
> | int numLiveReplicas = num.liveReplicas();
> | int numCurrentReplica = numLiveReplicas
> | + pendingReplications.getNumReplicas(block);
> which means both the live replicas and the pending replications are counted.
> But at the end of FSNamesystem.blockReceived(), which calls addStoredBlock(),
> addStoredBlock() is invoked first and the pendingReplications count is only
> decremented afterwards:
> | //
> | // Modify the blocks->datanode map and node's map.
> | //
> | addStoredBlock(block, node, delHintNode );
> | pendingReplications.remove(block);
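> To make the double count concrete, here is a rough sketch of the arithmetic
> at that point, using the numbers from the logs below (the replication factor
> of 3 is my assumption):
> | // blockReceived() for the re-replicated copy on 192.168.33.45:
> | int numLiveReplicas = 3;  // two remaining healthy replicas plus the new one,
> |                           // which addStoredBlock() has already added to blocksMap
> | int numPending = 1;       // the same transfer is still in pendingReplications
> | int numCurrentReplica = numLiveReplicas + numPending;
> | // = 4: the new replica is counted twice, once as live and once as pending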
> Hence, the newly replicated replica is counted twice, gets marked as excess,
> and leads to a mistaken deletion.
> I think reordering these two calls in blockReceived() may solve this
> issue:
> --- FSNamesystem.java-orig 2008-11-28 13:34:40.000000000 +0800
> +++ FSNamesystem.java 2008-11-28 13:54:12.000000000 +0800
> @@ -3152,8 +3152,8 @@
> //
> // Modify the blocks->datanode map and node's map.
> //
> - addStoredBlock(block, node, delHintNode );
> pendingReplications.remove(block);
> + addStoredBlock(block, node, delHintNode );
> }
> long[] getStats() throws IOException {
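> With the two calls reordered as in the patch above, the same sketch becomes
> (same illustrative numbers):
> | int numLiveReplicas = 3;  // the new replica is still counted once via blocksMap
> | int numPending = 0;       // the pending entry has already been removed
> | int numCurrentReplica = numLiveReplicas + numPending;
> | // = 3, matching the replication factor, so nothing is marked as excess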
> The following are the logs of the mistaken deletion, with additional
> logging info inserted by me.
> 2008-11-28 11:22:08,866 INFO org.apache.hadoop.dfs.StateChange: *DIR*
> NameNode.reportBadBlocks
> 2008-11-28 11:22:08,866 INFO org.apache.hadoop.dfs.StateChange: BLOCK
> NameSystem.addToCorruptReplicasMap: blk_3828935579548953768 added as
> corrupt on 192.168.33.51:50010 by /192.168.33.51
> 2008-11-28 11:22:10,179 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> ask 192.168.33.50:50010 to replicate blk_3828935579548953768_1184 to
> datanode(s) 192.168.33.45:50010
> 2008-11-28 11:22:12,629 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated: 192.168.33.45:50010 is
> added to blk_3828935579548953768_1184 size 67108864
> 2008-11-28 11:22:12,629 INFO org.apache.hadoop.dfs.StateChange: Wang
> Xu* NameSystem.addStoredBlock: current replicas 4 in which has 1
> pendings
> 2008-11-28 11:22:12,630 INFO org.apache.hadoop.dfs.StateChange: DIR*
> NameSystem.invalidateBlock: blk_3828935579548953768_1184 on
> 192.168.33.51:50010
> 2008-11-28 11:22:12,630 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> NameSystem.delete: blk_3828935579548953768 is added to invalidSet of
> 192.168.33.51:50010
> 2008-11-28 11:22:13,180 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> ask 192.168.33.44:50010 to delete blk_3828935579548953768_1184
> 2008-11-28 11:22:13,181 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> ask 192.168.33.51:50010 to delete blk_3828935579548953768_1184