[ 
https://issues.apache.org/jira/browse/HDFS-14960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107684#comment-17107684
 ] 

Jim Brennan commented on HDFS-14960:
------------------------------------

On further investigation of this, I realized that the balancer does not pay any 
attention to {{DFSConfigKeys.DFS_USE_DFS_NETWORK_TOPOLOGY_KEY}}.

Here are the config settings for TestBalancerWithNodeGroup:
{code:java}
  static Configuration createConf() {
    Configuration conf = new HdfsConfiguration();
    TestBalancer.initConf(conf);
    conf.setLong(DFSConfigKeys.DFS_BLOCK_SIZE_KEY, DEFAULT_BLOCK_SIZE);
    conf.setBoolean(DFSConfigKeys.DFS_USE_DFS_NETWORK_TOPOLOGY_KEY, false);
    conf.set(CommonConfigurationKeysPublic.NET_TOPOLOGY_IMPL_KEY,
        NetworkTopologyWithNodeGroup.class.getName());
    conf.set(DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY, 
        BlockPlacementPolicyWithNodeGroup.class.getName());
    return conf;
  }
{code}
Prior to HDFS-14958, we were not setting 
{{DFSConfigKeys.DFS_USE_DFS_NETWORK_TOPOLOGY_KEY = false}}, so 
BlockPlacementPolicyWithNodeGroup was being initialized with a clusterMap of 
type DFSNetworkTopology. This did not affect this test though, because the 
balancer ignores that flag.  The Balancer only pays attention to 
{{CommonConfigurationKeysPublic.NET_TOPOLOGY_IMPL_KEY}} which was already set 
to NetworkTopologyWithNodeGroup.

This is why the test never failed - it is specifically testing the results of 
the balancer. The only reason I found the issue in HDFS-14958 was because we 
had some internal changes that caused it to fail. But the apache version never 
actually failed because of HDFS-14958.

Given this, I thought I should double-check that the test does fail if the 
Balancer doesn't use NetworkTopologyWithNodeGroup. So I set it to use 
NetworkTopology and the test passed!

Looking at it more closely, I was surprised in particular that 
testBalancerEndInNoMoveProgress() was succeeding in this case. I would expect 
that with NetworkTopology there would be some block moves. But the code to 
verify that it finishes with no moves seems to allow moves:
{code:java}
    final int r = Balancer.run(namenodes, BalancerParameters.DEFAULT, conf);
    Assert.assertTrue(r == ExitStatus.SUCCESS.getExitCode() ||
        (r == ExitStatus.NO_MOVE_PROGRESS.getExitCode()));
{code}
I don't understand why SUCCESS is a valid return for this case. Isn't the point 
of this test case to verify that no block moves were done?

Sure enough, if I change that assert to be more restrictive:
{code:java}
    Assert.assertTrue(r == ExitStatus.NO_MOVE_PROGRESS.getExitCode());
{code}
then testBalancerEndInNoMoveProgress() fails when the topology is not 
{{NetworkTopologyWithNodeGroup}}.

With this change in place, however, when I went back to using 
{{NetworkTopologyWithNodeGroup}} I ran into a new failure. 
testBalancerWithRackLocality() was failing on the modified assert. I don't see 
why this test case was using the runBalanceCanFinish() in the first place 
though. I changed it to just use runBalancer(), and it passes.  This seems more 
correct to me, although I am definitely not an expert in this area of the code.

As suggested by [~hemanthboyina] and others, I also added a precondition check 
to BlockPlacementPolicyWithNodeGroup.initialize() to verify that clusterMap is 
an instance of NetworkTopologyWithNodeGroup.   With this change, all of the 
test cases in this test fail immediately if you misconfigure it to use 
DFSNetworkTopology with BlockPlacementPolicyWithNodeGroup.

> TestBalancerWithNodeGroup should not succeed with DFSNetworkTopology
> --------------------------------------------------------------------
>
>                 Key: HDFS-14960
>                 URL: https://issues.apache.org/jira/browse/HDFS-14960
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs
>    Affects Versions: 3.1.3
>            Reporter: Jim Brennan
>            Assignee: Jim Brennan
>            Priority: Minor
>
> As reported in HDFS-14958, TestBalancerWithNodeGroup was succeeding even 
> though it was using DFSNetworkTopology instead of 
> NetworkTopologyWithNodeGroup.
> [~inigoiri] rightly suggested that this indicates the test is not very good - 
> it should fail when run without NetworkTopologyWithNodeGroup.
> We should improve this test.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to