[ 
https://issues.apache.org/jira/browse/HDFS-11682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045163#comment-16045163
 ] 

Manoj Govindassamy commented on HDFS-11682:
-------------------------------------------

[~eddyxu],

From your explanation, it looks like it is inevitable to run {{runBalancer}} a
couple of times with updated HB to trigger additional balancing if needed. +1
(non-binding) with one comment below.

{noformat}
while (retry > 0) {
  // start rebalancing
  Collection<URI> namenodes = DFSUtil.getInternalNsRpcUris(conf);
  final int run = runBalancer(namenodes, p, conf);
  .. ..
  waitForHeartBeat(totalUsedSpace, totalCapacity, client, cluster);
  LOG.info("  .");
  try {
    waitForBalancer(totalUsedSpace, totalCapacity, client, cluster, p,
        excludedNodes);
  } catch (TimeoutException e) {
    // See HDFS-11682. NN may not get heartbeat to reflect the newest
    // block changes.
    retry--;
    if (retry == 0) {
      throw e;
    }
    LOG.warn("The cluster has not balanced yet, retry...");
    continue;
  }
  break;
}
{noformat}

{{waitForHeartBeat}} in the above loop can also time out and throw
{{TimeoutException}}, but unlike the one thrown by {{waitForBalancer}}, that
exception is not caught. It bypasses the retry logic, so the caller could still
fail because of this.
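
One possible way to cover both waits, as a rough sketch only: put every step
that can time out inside the same try block, so any {{TimeoutException}} uses
up one retry. The class {{RetryOnTimeoutSketch}}, the {{Round}} interface and
{{runWithRetries}} are hypothetical names made up for this illustration; they
are not part of {{TestBalancer}} or the patch, and the lambda merely simulates
a heartbeat wait that times out.

```java
import java.util.concurrent.TimeoutException;

public class RetryOnTimeoutSketch {

  /** Hypothetical stand-in for one balancing round plus the waits after it. */
  @FunctionalInterface
  interface Round {
    void run() throws TimeoutException;
  }

  /**
   * Runs the round up to maxAttempts times and returns the number of attempts
   * used. The last TimeoutException is rethrown once attempts are exhausted.
   */
  static int runWithRetries(int maxAttempts, Round round)
      throws TimeoutException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    for (int attempt = 1; ; attempt++) {
      try {
        // Everything that may time out lives inside the try block, so a
        // timeout in any step consumes one retry instead of escaping.
        round.run();
        return attempt;
      } catch (TimeoutException e) {
        if (attempt >= maxAttempts) {
          throw e;
        }
        System.err.println("The cluster has not balanced yet, retry...");
      }
    }
  }

  public static void main(String[] args) throws TimeoutException {
    int[] calls = {0};
    int used = runWithRetries(3, () -> {
      // Simulate the heartbeat wait timing out on the first two rounds.
      if (++calls[0] < 3) {
        throw new TimeoutException("simulated heartbeat wait timeout");
      }
    });
    System.out.println("succeeded after " + used + " attempt(s)");
  }
}
```

In the test itself this would roughly correspond to moving the
{{waitForHeartBeat}} call inside the existing try block, assuming a retry is
the desired reaction to a heartbeat timeout as well.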

> TestBalancer#testBalancerWithStripedFile is flaky
> -------------------------------------------------
>
>                 Key: HDFS-11682
>                 URL: https://issues.apache.org/jira/browse/HDFS-11682
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 3.0.0-alpha4
>            Reporter: Andrew Wang
>            Assignee: Lei (Eddy) Xu
>         Attachments: HDFS-11682.00.patch, HDFS-11682.01.patch, 
> IndexOutOfBoundsException.log, timeout.log
>
>
> Saw this fail in two different ways on a precommit run, but pass locally.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
