[ 
https://issues.apache.org/jira/browse/HDFS-9243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970318#comment-14970318
 ] 

Walter Su commented on HDFS-9243:
---------------------------------

The reason is that ReplicationMonitor chose the same DataNode as a target twice 
for the same block.

totalNodes=\{DN0, DN1, DN2\}
One block has replFactor=3 and liveNodes=\{DN0\}, so the block is under-replicated.
For some reason, DN2 is not chosen by ReplicationMonitor.
ReplicationMonitor chooses DN1 as the target and schedules the 1st recovery. 
pendingNum=1.
Later, before DN1 reports, ReplicationMonitor finds liveNodes + pendingNum < 
replFactor, so it chooses DN1 as the target again and schedules the 2nd 
recovery. pendingNum=2.

DN1 can't hold 2 identical replicas, so only one replica is recovered at DN1. 
But now liveNodes + pendingNum = replFactor, so no further recovery is 
scheduled, and the block waits until the remaining pending entry times out in 
the {{PendingReplicationBlocks}} map.
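
The sequence above can be replayed with a small model. This is only a sketch of the accounting as described in this comment, not the actual BlockManager or {{PendingReplicationBlocks}} code; the class {{DuplicateTargetSketch}} and its methods are hypothetical names, and it assumes that only a successfully stored replica clears one pending entry.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of the counters described above; not real HDFS code. */
class DuplicateTargetSketch {
    static final int REPL_FACTOR = 3;

    // DataNodes that currently hold a finalized replica of the block.
    final Set<String> liveNodes = new HashSet<>();
    // Recoveries that were scheduled but not yet confirmed (assumed to be a plain counter).
    int pendingNum = 0;

    /** The check the monitor is described as making: liveNodes + pendingNum < replFactor. */
    boolean needsMoreWork() {
        return liveNodes.size() + pendingNum < REPL_FACTOR;
    }

    /** Schedule a recovery to the given target; nothing prevents a repeated target. */
    void schedule(String target) {
        pendingNum++;
    }

    /** A transfer finishes on the target. Returns false if the replica already existed. */
    boolean complete(String target) {
        boolean added = liveNodes.add(target);
        if (added) {
            pendingNum--; // assumption: only a newly stored replica clears a pending entry
        }
        return added;
    }

    /** Replays the scenario: DN0 is live, DN1 is chosen twice, DN2 is never chosen. */
    static DuplicateTargetSketch replay() {
        DuplicateTargetSketch block = new DuplicateTargetSketch();
        block.liveNodes.add("DN0");
        if (block.needsMoreWork()) block.schedule("DN1"); // 1st recovery, pendingNum=1
        if (block.needsMoreWork()) block.schedule("DN1"); // 2nd recovery, pendingNum=2
        block.complete("DN1"); // first transfer stores the replica, pendingNum=1
        block.complete("DN1"); // second transfer fails: replica already exists
        return block;
    }

    public static void main(String[] args) {
        DuplicateTargetSketch block = replay();
        System.out.println("liveNodes=" + block.liveNodes.size()
                + " pendingNum=" + block.pendingNum
                + " needsMoreWork=" + block.needsMoreWork());
    }
}
```

In this model the block ends with 2 live replicas and 1 stale pending entry, so the check no longer asks for more work even though the block is still one replica short. That matches the stall described above, which lasts until the pending entry times out.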

Some test cases wait for liveNodes to reach replFactor, but this test case 
couldn't wait 5 min, so it failed. Due to the randomness of 
BlockPlacementPolicy and the relatively large number of DNs, this may not be a 
problem in production.

I checked the 
[testReport|https://builds.apache.org/job/PreCommit-HDFS-Build/13124/testReport/org.apache.hadoop.hdfs.server.blockmanagement/TestUnderReplicatedBlocks/testSetrepIncWithUnderReplicatedBlocks/]
 and saw:
{noformat}
2015-10-22 15:10:18,400 [DataXceiver for client  at /127.0.0.1:47728 [Receiving 
block BP-57081724-67.195.81.153-1445526607531:blk_1073741825_1001]] INFO  
datanode.DataNode (DataXceiver.java:run(270)) - 127.0.0.1:47391:DataXceiver 
error processing WRITE_BLOCK operation  src: /127.0.0.1:47728 dst: 
/127.0.0.1:47391; 
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block 
BP-57081724-67.195.81.153-1445526607531:blk_1073741825_1001 already exists in 
state FINALIZED and thus cannot be created.
{noformat}
I've seen this before in HDFS-9275.

> TestUnderReplicatedBlocks#testSetrepIncWithUnderReplicatedBlocks test timeout
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-9243
>                 URL: https://issues.apache.org/jira/browse/HDFS-9243
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: HDFS
>            Reporter: Wei-Chiu Chuang
>            Priority: Minor
>
> org.apache.hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks 
> sometimes times out.
> This is happening on trunk, as can be observed in several recent Jenkins jobs 
> (e.g. https://builds.apache.org/job/Hadoop-Hdfs-trunk/2423/  
> https://builds.apache.org/job/Hadoop-Hdfs-trunk/2386/ 
> https://builds.apache.org/job/Hadoop-Hdfs-trunk/2351/ 
> https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/472/ ).
> On my local Linux machine, this test case times out 6 out of 10 times. When 
> it does not time out, the test takes about 20 seconds; otherwise it takes 
> more than 60 seconds and then times out.
> I suspect it's a deadlock issue, as a deadlock occurred in this test case 
> before, in HDFS-5527.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
