[
https://issues.apache.org/jira/browse/HDFS-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998534#comment-15998534
]
Kihwal Lee commented on HDFS-11742:
-----------------------------------
Instead of reverting, I am making a simple change to make it more usable. This
will prevent users from hitting the same issues we had. The changes from
HDFS-8188 do allow running the balancer at a higher throughput, but getting
there requires turning multiple knobs. And when it runs slower than in the
previous release, users will have no clue why. The default config
values may result in degraded performance for users running a cluster with more
than 20 nodes.
The main problem with HDFS-8188 is the way a thread pool is created per target.
If the thread count reaches the limit (max mover threads), the remaining pending
moves are simply dropped (or even worse, the balancer hangs without HDFS-11377),
leading to degraded performance as demonstrated above with graphs. The suggested
workaround of "set the mover thread limit to 10,000 or 30,000" simply means
removing the limit, i.e. the design cannot work with the limit in place.
The suggested improvement calculates the size of each mover thread pool,
instead of using the configured fixed value. The total thread count limit is
honored without causing the degradation seen with the original design.
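
For illustration only, here is a minimal sketch of that sizing idea in Java. It
assumes the total mover thread limit is split evenly across the targets; the
class name, method name, and the numbers used are hypothetical and are not
taken from the attached patches.

```java
public final class MoverPoolSizing {

    private MoverPoolSizing() {
    }

    /**
     * Hypothetical helper: derive the per-target pool size from the total
     * mover thread limit instead of using a fixed configured value.
     * Integer division rounds down, so threadsPerTarget * targetCount never
     * exceeds maxMoverThreads. A result of 0 means the limit is too small to
     * give every target a thread.
     */
    static int threadsPerTarget(int maxMoverThreads, int targetCount) {
        if (maxMoverThreads < 0 || targetCount <= 0) {
            throw new IllegalArgumentException(
                "maxMoverThreads must be >= 0 and targetCount must be > 0");
        }
        return maxMoverThreads / targetCount;
    }

    public static void main(String[] args) {
        int maxMoverThreads = 1000; // assumed limit, for illustration only
        for (int targets : new int[] {20, 280, 2400}) {
            int perTarget = threadsPerTarget(maxMoverThreads, targets);
            System.out.println(targets + " targets: " + perTarget
                + " threads per target, " + (perTarget * targets)
                + " threads in total");
        }
    }
}
```

With an even split the total never exceeds the limit, whatever the number of
targets. The case where the result is 0 still needs separate handling, for
example a warning that the limit is too small for the number of targets.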
> Improve balancer usability after HDFS-8188
> ------------------------------------------
>
> Key: HDFS-11742
> URL: https://issues.apache.org/jira/browse/HDFS-11742
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Priority: Blocker
> Attachments: balancer2.8.png, HDFS-11742.branch-2.8.patch,
> HDFS-11742.branch-2.patch, HDFS-11742.trunk.patch
>
>
> We ran the 2.8 balancer with HDFS-8188 on a 280-node and a 2,400-node cluster.
> In both cases, it would hang forever after two iterations. The two iterations
> were also moving things at a significantly lower rate. The hang itself is
> fixed by HDFS-11377, but the design limitation remains, so the balancer
> throughput actually ends up lower.
> Instead of reverting HDFS-8188 as originally suggested, I am making a small
> change to make it less error prone and more usable.