[
https://issues.apache.org/jira/browse/HDFS-742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026612#comment-14026612
]
Mit Desai commented on HDFS-742:
--------------------------------
Attaching the patch. Unfortunately, I do not have a way to reproduce the issue,
so I am unable to provide a test that verifies the change.
Here is an explanation of the part of the Balancer code that makes it hang forever.
In the following while loop in Balancer.java, when the Balancer determines
that it should fetch more blocks, it calls {{getBlockList()}} and decrements
{{blocksToReceive}} by the number of blocks returned. It then restarts from the
top of the loop.
{code}
while(!isTimeUp && getScheduledSize()>0 &&
    (!srcBlockList.isEmpty() || blocksToReceive>0)) {
  ## SOME LINES OMITTED ##
  filterMovedBlocks(); // filter already moved blocks
  if (shouldFetchMoreBlocks()) {
    // fetch new blocks
    try {
      blocksToReceive -= getBlockList();
      continue;
    } catch (IOException e) {
  ## SOME LINES OMITTED ##
  // check if time is up or not
  if (Time.now()-startTime > MAX_ITERATION_TIME) {
    isTimeUp = true;
    continue;
  }
  ## SOME LINES OMITTED ##
}
{code}
The problem is that if the datanode is decommissioned, {{getBlockList()}}
does not return any blocks, so {{blocksToReceive}} is not changed. The loop
keeps repeating this fetch indefinitely because {{blocksToReceive}} always
stays greater than 0, and {{isTimeUp}} is never set to true because the
{{continue}} after the fetch skips the time check further down. In the
submitted patch, the time-up check is moved to the top of the loop, so it runs
on every pass and the loop exits once {{MAX_ITERATION_TIME}} has elapsed.
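To make the effect of the reordering easier to see, here is a minimal, self-contained sketch of the loop. It is not the actual Balancer.java code: the class {{BalancerLoopSketch}}, the method {{runLoop}}, the simulated clock ({{TICK_MILLIS}}), the value chosen for {{MAX_ITERATION_TIME}}, and the {{safetyCap}} guard are all assumptions made for illustration. A {{blocksPerFetch}} of 0 stands in for a decommissioned datanode whose block list comes back empty.

```java
// Simplified model of the dispatch loop; NOT the real Balancer.java.
final class BalancerLoopSketch {
    // Hypothetical values: iteration time limit and simulated cost of one pass (ms).
    static final long MAX_ITERATION_TIME = 1_000L;
    static final long TICK_MILLIS = 100L;

    /**
     * Runs the loop against a simulated clock and returns the number of passes made.
     *
     * @param checkTimeFirst  true models the patched order (time check at the top)
     * @param blocksToReceive blocks still wanted from the source datanode
     * @param blocksPerFetch  result of the stand-in for getBlockList(); 0 models a
     *                        decommissioned datanode
     * @param safetyCap       guard so the unpatched order cannot spin forever here
     */
    static int runLoop(boolean checkTimeFirst, long blocksToReceive,
                       long blocksPerFetch, int safetyCap) {
        long now = 0L;                 // simulated Time.now()
        final long startTime = now;
        boolean isTimeUp = false;
        int passes = 0;

        while (!isTimeUp && blocksToReceive > 0 && passes < safetyCap) {
            passes++;
            now += TICK_MILLIS;

            // Patched order: the time check runs on every pass.
            if (checkTimeFirst && now - startTime > MAX_ITERATION_TIME) {
                isTimeUp = true;
                continue;
            }

            // Stand-in for shouldFetchMoreBlocks(): blocks are still wanted.
            boolean shouldFetchMoreBlocks = blocksToReceive > 0;
            if (shouldFetchMoreBlocks) {
                blocksToReceive -= blocksPerFetch;
                continue;              // restarts the loop, skipping the code below
            }

            // Original order: the time check sits after the fetch and is skipped
            // whenever the fetch branch above is taken.
            if (!checkTimeFirst && now - startTime > MAX_ITERATION_TIME) {
                isTimeUp = true;
                continue;
            }
        }
        return passes;
    }

    public static void main(String[] args) {
        System.out.println("patched, empty block list:  " + runLoop(true, 5, 0, 10_000));
        System.out.println("original, empty block list: " + runLoop(false, 5, 0, 10_000));
        System.out.println("patched, healthy datanode:  " + runLoop(true, 5, 5, 10_000));
    }
}
```

In this model the patched order stops after 11 passes, when the simulated clock first exceeds the limit. The original order only stops because of the artificial {{safetyCap}}. A healthy datanode finishes in one pass with either order.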
> A down DataNode makes Balancer to hang on repeatingly asking NameNode its
> partial block list
> --------------------------------------------------------------------------------------------
>
> Key: HDFS-742
> URL: https://issues.apache.org/jira/browse/HDFS-742
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: balancer
> Reporter: Hairong Kuang
> Assignee: Mit Desai
> Attachments: HDFS-742.patch
>
>
> We had a balancer that had not made any progress for a long time. It turned
> out it was repeatedly asking the NameNode for a partial block list of one
> datanode, which was down while the balancer was running.
> NameNode should notify Balancer that the datanode is not available and
> Balancer should stop asking for the datanode's block list.
--
This message was sent by Atlassian JIRA
(v6.2#6252)