[
https://issues.apache.org/jira/browse/HDFS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019049#comment-13019049
]
Matt Foley commented on HDFS-1562:
----------------------------------
Hi Eli, thanks for pointing out the relationship between these two bugs. If
you wish, since you are effectively rewriting the whole unit test file, I'll
withdraw my patch except for a one-line fix which should not interfere with
auto-merge when you submit.
That said, I think your patch may be subject to the same problem I fix in
HDFS-1828:
The primary problem in HDFS-1828 was that
testSufficientlyReplicatedBlocksWithNotEnoughRacks() waited "while ((numRacks <
2) || (curReplicas != REPLICATION_FACTOR) || (neededReplicationSize > 0))"
[line 79], and then asserted "(curReplicas == REPLICATION_FACTOR)" [line 95];
when in fact, under the circumstances of the test, it was appropriate to expect
curReplicas == REPLICATION_FACTOR+1 transiently.
It looks like the same issue remains in your patch: waitForReplication() waits
"while ((curRacks < racks || curReplicas < replicas || curNeededReplicas >
neededReplicas) && count < 10)", and then does "assertEquals(replicas,
curReplicas)". So it will have the same problem, unless you never call it in a
context where curReplicas > replicas might occur.
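To make the concern concrete, here's the kind of loop I have in mind. This is just a sketch, not your patch; getCurRacks/getCurReplicas/getCurNeededReplicas are placeholders for however the test reads those counts from the namenode, and assertEquals/Block are the usual JUnit and o.a.h.hdfs.protocol imports:
{code:java}
// Sketch only. The point is the "!=" in the replica check: keep polling until the
// count settles at exactly the expected value, instead of exiting the loop on
// "< replicas" while a transient extra replica is still being cleaned up.
private void waitForReplication(Block b, int racks, int replicas, int neededReplicas)
    throws Exception {
  int curRacks = 0, curReplicas = 0, curNeededReplicas = 0;
  int count = 0;
  do {
    Thread.sleep(500);
    curRacks = getCurRacks(b);                   // placeholder accessors, not real code
    curReplicas = getCurReplicas(b);
    curNeededReplicas = getCurNeededReplicas(b);
    count++;
  } while ((curRacks < racks
            || curReplicas != replicas           // "!=" rather than "<"
            || curNeededReplicas > neededReplicas)
           && count < 10);
  assertEquals(replicas, curReplicas);
}
{code}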
A couple of additional suggestions:
1. You added a waitForReplication() method. Can you instead use
DFSTestUtil.waitReplication()? (BTW, that method correctly checks for the
replication count being != the expected value rather than just < it.) Or, if you
need the block-oriented signature of your version, could you consider adding it
to DFSTestUtil instead of leaving it in just the one unit test module? A usage
sketch of the existing helper follows.
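For reference, the existing path-oriented helper is used like this (a sketch only; the path, replication factor, and the fs/fileLen/seed variables are made up for illustration and assumed to be in scope in the test):
{code:java}
// Sketch: DFSTestUtil.waitReplication polls until the file's blocks report
// exactly the requested replication (it compares with "!=", not "<").
Path rackTestFile = new Path("/testRackPolicy/file1");   // hypothetical test path
short replFactor = 2;                                     // hypothetical target replication
DFSTestUtil.createFile(fs, rackTestFile, fileLen, replFactor, seed);
DFSTestUtil.waitReplication(fs, rackTestFile, replFactor);
{code}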
2. I'm concerned about waitForCorruptReplicas(), because it polls for a
problematic condition that is supposed to be self-healing, and it uses a fairly
coarse poll interval (a whole second). It is possible for such a test to
"miss" the condition it is trying to catch. See HDFS-1806, where I just fixed
such a problem by changing a polling interval from 100ms to 5ms.
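Something along these lines, for example, would shrink the window in which the transient corrupt-replica state could come and go unobserved (again just a sketch; the method signature and getCorruptReplicaCount() are placeholders, not your code, and fail() is the usual JUnit static import):
{code:java}
// Sketch only: poll at a fine granularity so a corrupt replica that is detected
// and then re-replicated/cleaned up within a second is still observed.
private void waitForCorruptReplicas(Block b, int expectedCorrupt, long timeoutMs)
    throws Exception {
  long deadline = System.currentTimeMillis() + timeoutMs;
  while (System.currentTimeMillis() < deadline) {
    if (getCorruptReplicaCount(b) == expectedCorrupt) {  // placeholder accessor
      return;
    }
    Thread.sleep(5);   // 5ms rather than 1000ms, along the lines of HDFS-1806
  }
  fail("Timed out waiting for " + expectedCorrupt + " corrupt replica(s)");
}
{code}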
Now, I haven't had time to fully understand the tests in your new version. It
may be that you are controlling other parameters, such as the values of
DFS_HEARTBEAT_INTERVAL, DFS_BLOCKREPORT_INTERVAL, and
DFS_NAMENODE_REPLICATION_INTERVAL, in a way that prevents the condition from
self-healing within the period over which you are waiting for it. But I have
seen corrupt replicas be recognized and eliminated in less than a second on a
tiny cluster, given the right intersection of events. Since such issues become
long-lived intermittent false positives for lots of people on Hudson :-) I hope
you don't mind my asking you to reason through why this construct can't miss
its condition; a sketch of the kind of interval tightening I mean is below.
Thanks.
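For concreteness, this is only a sketch against MiniDFSCluster, assuming the usual DFSConfigKeys names; the specific values and rack layout are assumptions, not a claim about what your tests do:
{code:java}
// Sketch: slow down the daemons' self-healing machinery so a transient condition
// (e.g. a corrupt replica awaiting invalidation) stays observable while the test polls.
// Configuration/HdfsConfiguration/DFSConfigKeys/MiniDFSCluster come from the usual
// org.apache.hadoop(.hdfs) packages.
Configuration conf = new HdfsConfiguration();
conf.setLong(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, 30);                // seconds (assumed value)
conf.setLong(DFSConfigKeys.DFS_BLOCKREPORT_INTERVAL_MSEC_KEY, 30 * 1000);  // milliseconds (assumed value)
conf.setInt(DFSConfigKeys.DFS_NAMENODE_REPLICATION_INTERVAL_KEY, 30);      // seconds (assumed value)
MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
    .numDataNodes(3)
    .racks(new String[] {"/rack1", "/rack1", "/rack2"})
    .build();
{code}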
> Add rack policy tests
> ---------------------
>
> Key: HDFS-1562
> URL: https://issues.apache.org/jira/browse/HDFS-1562
> Project: Hadoop HDFS
> Issue Type: Test
> Components: name-node, test
> Affects Versions: 0.23.0
> Reporter: Eli Collins
> Assignee: Eli Collins
> Attachments: hdfs-1562-1.patch, hdfs-1562-2.patch
>
>
> The existing replication tests (TestBlocksWithNotEnoughRacks,
> TestPendingReplication, TestOverReplicatedBlocks, TestReplicationPolicy,
> TestUnderReplicatedBlocks, and TestReplication) are missing tests for rack
> policy violations. This jira adds the following tests which I created when
> generating a new patch for HDFS-15.
> * Test that blocks that have a sufficient number of total replicas, but are
> not replicated cross rack, get replicated cross rack when a rack becomes
> available.
> * Test that new blocks for an underreplicated file will get replicated cross
> rack.
> * Mark a block as corrupt, test that when it is re-replicated it is still
> replicated across racks.
> * Reduce the replication factor of a file, making sure that the only block
> that is across racks is not removed when deleting replicas.
> * Test that when a block is replicated because a replica is lost due to host
> failure, the rack policy is preserved.
> * Test that when the excess replicas of a block are reduced due to a node
> re-joining the cluster, the rack policy is not violated.
> * Test that rack policy is still respected when blocks are replicated due to
> node decommissioning.
> * Test that rack policy is still respected when blocks are replicated due to
> node decommissioning, even when the blocks are over-replicated.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira