[
https://issues.apache.org/jira/browse/HDFS-14960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107684#comment-17107684
]
Jim Brennan commented on HDFS-14960:
------------------------------------
On further investigation of this, I realized that the balancer does not pay any
attention to {{DFSConfigKeys.DFS_USE_DFS_NETWORK_TOPOLOGY_KEY}}.
Here are the config settings for TestBalancerWithNodeGroup:
{code:java}
static Configuration createConf() {
Configuration conf = new HdfsConfiguration();
TestBalancer.initConf(conf);
conf.setLong(DFSConfigKeys.DFS_BLOCK_SIZE_KEY, DEFAULT_BLOCK_SIZE);
conf.setBoolean(DFSConfigKeys.DFS_USE_DFS_NETWORK_TOPOLOGY_KEY, false);
conf.set(CommonConfigurationKeysPublic.NET_TOPOLOGY_IMPL_KEY,
NetworkTopologyWithNodeGroup.class.getName());
conf.set(DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
BlockPlacementPolicyWithNodeGroup.class.getName());
return conf;
}
{code}
Prior to HDFS-14958, we were not setting
{{DFSConfigKeys.DFS_USE_DFS_NETWORK_TOPOLOGY_KEY = false}}, so
BlockPlacementPolicyWithNodeGroup was being initialized with a clusterMap of
type DFSNetworkTopology. This did not affect this test though, because the
balancer ignores that flag. The Balancer only pays attention to
{{CommonConfigurationKeysPublic.NET_TOPOLOGY_IMPL_KEY}} which was already set
to NetworkTopologyWithNodeGroup.
This is why the test never failed - it is specifically testing the results of
the balancer. The only reason I found the issue in HDFS-14958 was because we
had some internal changes that caused it to fail. But the apache version never
actually failed because of HDFS-14958.
Given this, I thought I should double-check that the test does fail if the
Balancer doesn't use NetworkTopologyWithNodeGroup. So I set it to use
NetworkTopology and the test passed!
Looking at it more closely, I was surprised in particular that
testBalancerEndInNoMoveProgress() was succeeding in this case. I would expect
that with NetworkTopology there would be some block moves. But the code to
verify that it finishes with no moves seems to allow moves:
{code:java}
final int r = Balancer.run(namenodes, BalancerParameters.DEFAULT, conf);
Assert.assertTrue(r == ExitStatus.SUCCESS.getExitCode() ||
(r == ExitStatus.NO_MOVE_PROGRESS.getExitCode()));
{code}
I don't understand why SUCCESS is a valid return for this case. Isn't the point
of this test case to verify that no block moves were done?
Sure enough, if I change that assert to be more restrictive:
{code:java}
Assert.assertTrue(r == ExitStatus.NO_MOVE_PROGRESS.getExitCode());
{code}
then testBalancerEndInNoMoveProgress() fails when the topology is not
{{NetworkTopologyWithNodeGroup}}.
With this change in place, however, when I went back to using
{{NetworkTopologyWithNodeGroup}} I ran into a new failure.
testBalancerWithRackLocality() was failing on the modified assert. I don't see
why this test case was using the runBalanceCanFinish() in the first place
though. I changed it to just use runBalancer(), and it passes. This seems more
correct to me, although I am definitely not an expert in this area of the code.
As suggested by [~hemanthboyina] and others, I also added a precondition check
to BlockPlacementPolicyWithNodeGroup.initialize() to verify that clusterMap is
an instance of NetworkTopologyWithNodeGroup. With this change, all of the
test cases in this test fail immediately if you misconfigure it to use
DFSNetworkTopology with BlockPlacementPolicyWithNodeGroup.
> TestBalancerWithNodeGroup should not succeed with DFSNetworkTopology
> --------------------------------------------------------------------
>
> Key: HDFS-14960
> URL: https://issues.apache.org/jira/browse/HDFS-14960
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Affects Versions: 3.1.3
> Reporter: Jim Brennan
> Assignee: Jim Brennan
> Priority: Minor
>
> As reported in HDFS-14958, TestBalancerWithNodeGroup was succeeding even
> though it was using DFSNetworkTopology instead of
> NetworkTopologyWithNodeGroup.
> [~inigoiri] rightly suggested that this indicates the test is not very good -
> it should fail when run without NetworkTopologyWithNodeGroup.
> We should improve this test.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]