[
https://issues.apache.org/jira/browse/HDFS-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Binglin Chang updated HDFS-5580:
--------------------------------
Attachment: HDFS-5580.v1.patch
Bug analysis:
In Balancer.PendingBlockMove.chooseProxySource()
{code}
boolean find = false;
for (BalancerDatanode loc : block.getLocations()) {
  // check if there is replica which is on the same rack with the target
  if (cluster.isOnSameRack(loc.getDatanode(), targetDN) && addTo(loc)) {
    find = true;
    // if cluster is not nodegroup aware or the proxy is on the same
    // nodegroup with target, then we already find the nearest proxy
    if (!cluster.isNodeGroupAware()
        || cluster.isOnSameNodeGroup(loc.getDatanode(), targetDN)) {
      return true;
    }
  }
  if (!find) {
    // find out a non-busy replica out of rack of target
    find = addTo(loc);
  }
}
{code}
Because addTo(loc) in the same-rack branch runs even when find is already
true, a PendingBlockMove may be added to multiple locations instead of one.
The consumer thread pool removes only one pair of PendingBlockMove at a time,
leaving stray PendingBlockMove entries in the queue.
Balancer.waitForMoveCompletion waits for the queue to become empty, which
never happens, causing the Balancer to wait forever.
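One way to avoid queuing the same move on several replicas is to rank the candidates first and call the enqueue step only until it succeeds once. The following is a minimal sketch of that idea, not the attached patch: Replica, addPendingMove, distance and the class name ProxySourceSketch are hypothetical stand-ins for BalancerDatanode, addTo and the cluster topology checks.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ProxySourceSketch {
    /** Hypothetical stand-in for a replica location (not the real BalancerDatanode). */
    static final class Replica {
        final String name;
        final String rack;
        final String nodeGroup;
        final boolean busy;
        final List<String> pendingMoves = new ArrayList<>();

        Replica(String name, String rack, String nodeGroup, boolean busy) {
            this.name = name;
            this.rack = rack;
            this.nodeGroup = nodeGroup;
            this.busy = busy;
        }

        /** Plays the role of addTo(loc): queue the move unless this node is busy. */
        boolean addPendingMove(String move) {
            if (busy) {
                return false;
            }
            pendingMoves.add(move);
            return true;
        }
    }

    /** Lower is nearer: 0 = same node group, 1 = same rack, 2 = off rack. */
    static int distance(Replica r, Replica target, boolean nodeGroupAware) {
        boolean sameRack = r.rack.equals(target.rack);
        if (nodeGroupAware && sameRack && r.nodeGroup.equals(target.nodeGroup)) {
            return 0;
        }
        return sameRack ? 1 : 2;
    }

    /** Queues the move on exactly one non-busy replica, nearest first; null if none. */
    static Replica chooseProxySource(List<Replica> locations, Replica target,
                                     boolean nodeGroupAware, String move) {
        List<Replica> ordered = new ArrayList<>(locations);
        // List.sort is stable, so replicas at equal distance keep their original order.
        ordered.sort(Comparator.comparingInt(r -> distance(r, target, nodeGroupAware)));
        for (Replica r : ordered) {
            if (r.addPendingMove(move)) {
                return r; // stop after the first successful enqueue
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Replica target = new Replica("target", "rack1", "group1", false);
        List<Replica> locations = new ArrayList<>();
        locations.add(new Replica("offRack", "rack2", "group9", false));
        locations.add(new Replica("sameRack", "rack1", "group2", false));
        locations.add(new Replica("sameGroup", "rack1", "group1", false));
        Replica chosen = chooseProxySource(locations, target, true, "move-1");
        System.out.println("chosen proxy: " + chosen.name);
    }
}
```

With the replica order used in main (off rack, then same rack, then same node group), the loop quoted above would call addTo successfully on all three locations, while the sketch queues the move on a single replica.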
> Infinite loop in Balancer.waitForMoveCompletion
> -----------------------------------------------
>
> Key: HDFS-5580
> URL: https://issues.apache.org/jira/browse/HDFS-5580
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Binglin Chang
> Assignee: Binglin Chang
> Attachments: HDFS-5580.v1.patch, TestBalancerWithNodeGroupTimeout.log
>
>
> In a recent
> [build|https://builds.apache.org/job/PreCommit-HDFS-Build/5592//testReport/org.apache.hadoop.hdfs.server.balancer/TestBalancerWithNodeGroup/testBalancerWithNodeGroup/]
> for HDFS-5574, TestBalancerWithNodeGroup timed out. This is also mentioned in
> HDFS-4376
> [here|https://issues.apache.org/jira/browse/HDFS-4376?focusedCommentId=13799402&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13799402].
>
> It looks like the bug was introduced by HDFS-4376.