Ratandeep Ratti created HDFS-6681:
-------------------------------------
Summary:
TestRBWBlockInvalidation#testBlockInvalidationWhenRBWReplicaMissedInDN is flaky
and sometimes gets stuck in infinite loops
Key: HDFS-6681
URL: https://issues.apache.org/jira/browse/HDFS-6681
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Ratandeep Ratti
This testcase has 3 potentially infinite loops, each of which breaks only when a
certain condition is satisfied; paraphrased, they look like the sketch below.
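(A sketch under assumptions, not a verbatim quote of the test: countReplicas(..)
is taken to be a helper wrapping namesystem.getBlockManager().countNodes(..),
which returns a NumberReplicas carrying liveReplicas()/corruptReplicas()
counters; namesystem and blk are the test's FSNamesystem and block handles, and
the sleep interval is approximate.)
{code:java}
// 1st loop: wait until the corrupt replica is marked, leaving 1 live one.
while (countReplicas(namesystem, blk).liveReplicas() != 1) {
  Thread.sleep(100);
}
// 2nd loop: wait until re-replication restores the replication factor (2).
while (countReplicas(namesystem, blk).liveReplicas() != 2) {
  Thread.sleep(100);
}
// 3rd loop: wait until the corrupt replica has been invalidated everywhere.
while (countReplicas(namesystem, blk).corruptReplicas() != 0) {
  Thread.sleep(100);
}
{code}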
1st loop: waits until there is a single live replica. The test assumes this will
hold, since it has just corrupted the block's replica on one of the datanodes
(the testcase uses a replication factor of 2). One scenario in which this loop
will never break: the Namenode invalidates the corrupt replica, schedules a
replication command, and the newly copied replica is reported as added, all
before the testcase gets a chance to observe the live-replica count at 1.
2nd loop: waits until there are 2 live replicas. The test assumes this will hold
(in some time) because the first loop has broken, implying there is a single
live replica, so it should only be a matter of time before the Namenode
schedules a replication command to copy a replica to another datanode. One
scenario in which this loop will never break: the Namenode schedules the new
replica on the very node on which we corrupted the block. That destination
datanode will not copy the block, complaining that it already has the
(corrupted) replica in a being-created (RBW) state. The resulting situation:
the Namenode has scheduled a copy to a datanode and the block now sits in the
Namenode's pending replication queue, but the block will never be removed from
that queue, because the Namenode will never receive a report from the datanode
that the block was 'added'.
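For reference, the datanode-side rejection is essentially a guard of this shape
(a paraphrase from memory of the replica-map check in
FsDatasetImpl#createTemporary; variable names assumed, not a verbatim quote):
{code:java}
// Paraphrased guard on the destination datanode: if any replica for this
// block already exists locally (here, the corrupted RBW replica), the
// create is refused, so the scheduled transfer never completes.
ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(), b.getBlockId());
if (replicaInfo != null) {
  throw new ReplicaAlreadyExistsException("Block " + b
      + " already exists in state " + replicaInfo.getState()
      + " and thus cannot be created.");
}
{code}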
Note: The block is moved from the 'pending replication' queue to the 'needed
replication' queue once the pending timeout (5 minutes by default, configurable
via dfs.namenode.replication.pending.timeout-sec) expires. The Namenode then
actively tries to schedule replications for blocks in the 'needed replication'
queue. This can eventually cause the 2nd loop to break, but only after more
than 5 minutes.
3rd loop: This loop checks that there are no corrupt replicas. I don't see a
scenario in which this loop can go on forever, since once the live-replica
count goes back to normal (2), the corrupted replica will be invalidated and
removed.
I guess increasing the heartbeat interval, so that the testcase has enough time
to check the condition in loop 1 before a datanode reports a successful copy,
should help avoid the race condition in loop 1. Regarding loop 2, I guess we
can reduce the timeout after which a block is moved from the pending
replication queue to the needed replication queue; a sketch of both overrides
is below.
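A minimal sketch of what that could look like in the test's cluster setup. The
configuration keys (DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY and
DFSConfigKeys.DFS_NAMENODE_REPLICATION_PENDING_TIMEOUT_SEC_KEY) are the
existing HDFS knobs for these two timings; the concrete values and the wrapper
class name are illustrative only:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class Hdfs6681Sketch {  // hypothetical wrapper, for illustration
  public static void main(String[] args) throws Exception {
    Configuration conf = new HdfsConfiguration();
    // Slow down datanode heartbeats (seconds; illustrative value) so the
    // test can observe the 1-live-replica state before a datanode reports
    // the newly copied replica.
    conf.setLong(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, 30L);
    // Shrink the pending-replication timeout (seconds; illustrative value)
    // so a transfer stuck in the pending queue is re-queued into 'needed
    // replication' quickly instead of after the 5-minute default.
    conf.setInt(
        DFSConfigKeys.DFS_NAMENODE_REPLICATION_PENDING_TIMEOUT_SEC_KEY, 5);
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
        .numDataNodes(2).build();
    try {
      // ... corrupt the RBW replica and run the three wait loops ...
    } finally {
      cluster.shutdown();
    }
  }
}
{code}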
--
This message was sent by Atlassian JIRA
(v6.2#6252)