[
https://issues.apache.org/jira/browse/HDFS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019049#comment-13019049
]
Matt Foley commented on HDFS-1562:
----------------------------------
Hi Eli, thanks for pointing out the relationship between these two bugs. If
you wish, since you are effectively rewriting the whole unit test file, I'll
withdraw my patch except for a one-line fix which should not interfere with
auto-merge when you submit.
That said, I think your patch may be subject to the same problem I fix in
HDFS-1828:
The primary problem in HDFS-1828 was that
testSufficientlyReplicatedBlocksWithNotEnoughRacks() waited "while ((numRacks <
2) || (curReplicas != REPLICATION_FACTOR) || (neededReplicationSize > 0))"
[line 79], and then asserted "(curReplicas == REPLICATION_FACTOR)" [line 95];
when in fact, under the circumstances of the test, it was appropriate to expect
curReplicas == REPLICATION_FACTOR+1 transiently.
It looks like the same issue remains in your patch: waitForReplication() waits
"while ((curRacks < racks || curReplicas < replicas || curNeededReplicas >
neededReplicas) && count < 10)", and then does "assertEquals(replicas,
curReplicas)". So it will have the same problem, unless you never call it in a
context where curReplicas > replicas might occur.
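To make the concern concrete, here's the kind of loop I have in mind. This is just a sketch, not your patch; getCurRacks/getCurReplicas/getCurNeededReplicas are placeholders for however the test reads those counts from the namenode, and assertEquals/Block are the usual JUnit and o.a.h.hdfs.protocol imports:
{code:java}
// Sketch only. The point is the "!=" in the replica check: keep polling until the
// count settles at exactly the expected value, instead of exiting the loop on
// "< replicas" while a transient extra replica is still being cleaned up.
private void waitForReplication(Block b, int racks, int replicas, int neededReplicas)
    throws Exception {
  int curRacks = 0, curReplicas = 0, curNeededReplicas = 0;
  int count = 0;
  do {
    Thread.sleep(500);
    curRacks = getCurRacks(b);                   // placeholder accessors, not real code
    curReplicas = getCurReplicas(b);
    curNeededReplicas = getCurNeededReplicas(b);
    count++;
  } while ((curRacks < racks
            || curReplicas != replicas           // "!=" rather than "<"
            || curNeededReplicas > neededReplicas)
           && count < 10);
  assertEquals(replicas, curReplicas);
}
{code}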
A couple of additional suggestions:
1. You added a waitForReplication() method. Can you instead use
DFSTestUtil.waitReplication()? (BTW, that method correctly checks for the
replication count being != the expected value rather than just < it.) Or, if you
need the block-oriented signature of your version, could you consider adding it
to DFSTestUtil instead of leaving it in just the one unit test module? A usage
sketch of the existing helper follows.
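For reference, the existing path-oriented helper is used like this (a sketch only; the path, replication factor, and the fs/fileLen/seed variables are made up for illustration and assumed to be in scope in the test):
{code:java}
// Sketch: DFSTestUtil.waitReplication polls until the file's blocks report
// exactly the requested replication (it compares with "!=", not "<").
Path rackTestFile = new Path("/testRackPolicy/file1");   // hypothetical test path
short replFactor = 2;                                     // hypothetical target replication
DFSTestUtil.createFile(fs, rackTestFile, fileLen, replFactor, seed);
DFSTestUtil.waitReplication(fs, rackTestFile, replFactor);
{code}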
2. I'm concerned about waitForCorruptReplicas(), because it polls for a
problematic condition that is supposed to be self-healing, and it uses a fairly
coarse poll interval (a whole second). It is possible for such a test to
"miss" the condition it is trying to catch. See HDFS-1806, where I just fixed
such a problem by changing a polling interval from 100ms to 5ms.
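Something along these lines, for example, would shrink the window in which the transient corrupt-replica state could come and go unobserved (again just a sketch; the method signature and getCorruptReplicaCount() are placeholders, not your code, and fail() is the usual JUnit static import):
{code:java}
// Sketch only: poll at a fine granularity so a corrupt replica that is detected
// and then re-replicated/cleaned up within a second is still observed.
private void waitForCorruptReplicas(Block b, int expectedCorrupt, long timeoutMs)
    throws Exception {
  long deadline = System.currentTimeMillis() + timeoutMs;
  while (System.currentTimeMillis() < deadline) {
    if (getCorruptReplicaCount(b) == expectedCorrupt) {  // placeholder accessor
      return;
    }
    Thread.sleep(5);   // 5ms rather than 1000ms, along the lines of HDFS-1806
  }
  fail("Timed out waiting for " + expectedCorrupt + " corrupt replica(s)");
}
{code}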
Now, I haven't had time to fully understand the tests in your new version. It
may be that you are controlling other parameters, such as the values of
DFS_HEARTBEAT_INTERVAL, DFS_BLOCKREPORT_INTERVAL, and
DFS_NAMENODE_REPLICATION_INTERVAL, in a way that prevents the condition from
self-healing within the period over which you are waiting for it. But I have
seen corrupt replicas be recognized and eliminated in less than a second on a
tiny cluster, given the right intersection of events. Since such issues become
long-lived intermittent false positives for lots of people on Hudson :-) I hope
you don't mind my asking you to reason through why this construct can't miss
its condition; a sketch of the kind of interval tightening I mean is below.
Thanks.
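For concreteness, this is only a sketch against MiniDFSCluster, assuming the usual DFSConfigKeys names; the specific values and rack layout are assumptions, not a claim about what your tests do:
{code:java}
// Sketch: slow down the daemons' self-healing machinery so a transient condition
// (e.g. a corrupt replica awaiting invalidation) stays observable while the test polls.
// Configuration/HdfsConfiguration/DFSConfigKeys/MiniDFSCluster come from the usual
// org.apache.hadoop(.hdfs) packages.
Configuration conf = new HdfsConfiguration();
conf.setLong(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, 30);                // seconds (assumed value)
conf.setLong(DFSConfigKeys.DFS_BLOCKREPORT_INTERVAL_MSEC_KEY, 30 * 1000);  // milliseconds (assumed value)
conf.setInt(DFSConfigKeys.DFS_NAMENODE_REPLICATION_INTERVAL_KEY, 30);      // seconds (assumed value)
MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
    .numDataNodes(3)
    .racks(new String[] {"/rack1", "/rack1", "/rack2"})
    .build();
{code}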
> Add rack policy tests
> ---------------------
>
> Key: HDFS-1562
> URL: https://issues.apache.org/jira/browse/HDFS-1562
> Project: Hadoop HDFS
> Issue Type: Test
> Components: name-node, test
> Affects Versions: 0.23.0
> Reporter: Eli Collins
> Assignee: Eli Collins
> Attachments: hdfs-1562-1.patch, hdfs-1562-2.patch
>
>
> The existing replication tests (TestBlocksWithNotEnoughRacks,
> TestPendingReplication, TestOverReplicatedBlocks, TestReplicationPolicy,
> TestUnderReplicatedBlocks, and TestReplication) are missing tests for rack
> policy violations. This jira adds the following tests which I created when
> generating a new patch for HDFS-15.
> * Test that blocks that have a sufficient number of total replicas, but are
> not replicated cross rack, get replicated cross rack when a rack becomes
> available.
> * Test that new blocks for an underreplicated file will get replicated cross
> rack.
> * Mark a block as corrupt, test that when it is re-replicated it is still
> replicated across racks.
> * Reduce the replication factor of a file, making sure that the only block
> that is across racks is not removed when deleting replicas.
> * Test that when a block is replicated because a replica is lost due to host
> failure, the rack policy is preserved.
> * Test that when the excess replicas of a block are reduced due to a node
> re-joining the cluster, the rack policy is not violated.
> * Test that rack policy is still respected when blocks are replicated due to
> node decommissioning.
> * Test that rack policy is still respected when blocks are replicated due to
> node decommissioning, even when the blocks are over-replicated.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira