[ 
https://issues.apache.org/jira/browse/HDFS-742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026612#comment-14026612
 ] 

Mit Desai commented on HDFS-742:
--------------------------------

Attaching the patch. Unfortunately, I do not have a way to reproduce the issue, 
so I am unable to provide a test that verifies the change.
Here is an explanation of the part of the Balancer code that makes it hang forever.

In the following while loop in Balancer.java, when the Balancer determines 
that it should fetch more blocks, it calls {{getBlockList()}} and decrements 
{{blocksToReceive}} by the number of blocks returned. It then restarts from 
the top of the loop.

{code}
 while(!isTimeUp && getScheduledSize()>0 &&
          (!srcBlockList.isEmpty() || blocksToReceive>0)) {
       
## SOME LINES OMITTED ##

        filterMovedBlocks(); // filter already moved blocks
        if (shouldFetchMoreBlocks()) {
          // fetch new blocks
          try {
            blocksToReceive -= getBlockList();
            continue;
          } catch (IOException e) {
            
## SOME LINES OMITTED ##
        
        // check if time is up or not
        if (Time.now()-startTime > MAX_ITERATION_TIME) {
          isTimeUp = true;
          continue;
        }
## SOME LINES OMITTED ##

 }
{code}

The problem is that if the datanode is decommissioned, {{getBlockList()}} 
returns nothing, so {{blocksToReceive}} is left unchanged. The loop then 
repeats indefinitely, because {{blocksToReceive}} stays greater than 0 and the 
{{continue}} after the fetch skips the time check further down, so 
{{isTimeUp}} is never set to true. The submitted patch moves the time-up check 
to the top of the loop, so each iteration first checks the elapsed time and 
proceeds only if the time limit has not been reached.
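
To illustrate the reordering described above, here is a minimal, self-contained sketch. It is not the attached patch and not the actual Balancer code: the class {{BalancerLoopSketch}}, the {{run}} method, the {{safetyCap}} parameter, the injected clock and the stubbed {{getBlockList()}} / {{shouldFetchMoreBlocks()}} are all assumptions made for the demonstration. It only models a fetch that never reduces {{blocksToReceive}}.

```java
import java.util.function.LongSupplier;

public class BalancerLoopSketch {
    static final long MAX_ITERATION_TIME = 100; // illustrative limit, in clock units

    // Stand-in for getBlockList(): models a datanode that yields no blocks.
    static long getBlockList() {
        return 0;
    }

    // Stand-in for shouldFetchMoreBlocks(): always asks for more in this model.
    static boolean shouldFetchMoreBlocks() {
        return true;
    }

    // Runs the simplified loop and returns how many iterations were executed.
    // safetyCap is hypothetical; it only keeps the demo itself from spinning forever.
    static long run(boolean checkTimeFirst, LongSupplier clock, long safetyCap) {
        long startTime = clock.getAsLong();
        long blocksToReceive = 10;
        boolean isTimeUp = false;
        long iterations = 0;

        while (!isTimeUp && blocksToReceive > 0 && iterations < safetyCap) {
            iterations++;

            // Reordered variant: the time check runs before anything else.
            if (checkTimeFirst && clock.getAsLong() - startTime > MAX_ITERATION_TIME) {
                isTimeUp = true;
                continue;
            }

            if (shouldFetchMoreBlocks()) {
                // Nothing is returned, so blocksToReceive never shrinks.
                blocksToReceive -= getBlockList();
                continue; // skips everything below
            }

            // Original ordering: never reached while the fetch branch keeps firing.
            if (!checkTimeFirst && clock.getAsLong() - startTime > MAX_ITERATION_TIME) {
                isTimeUp = true;
            }
        }
        return iterations;
    }

    public static void main(String[] args) {
        long cap = 1000;
        long[] t1 = {0};
        long[] t2 = {0};
        // Fake clocks that advance by 10 units per reading, for determinism.
        long original = run(false, () -> t1[0] += 10, cap);
        long reordered = run(true, () -> t2[0] += 10, cap);
        System.out.println("original ordering stopped by safety cap: " + (original == cap));
        System.out.println("reordered loop iterations: " + reordered);
    }
}
```

With the time check at the bottom, only the artificial cap ends the loop; with the check at the top, the loop exits once the elapsed time exceeds the limit, even though {{blocksToReceive}} never changes.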

> A down DataNode makes Balancer to hang on repeatingly asking NameNode its 
> partial block list
> --------------------------------------------------------------------------------------------
>
>                 Key: HDFS-742
>                 URL: https://issues.apache.org/jira/browse/HDFS-742
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: balancer
>            Reporter: Hairong Kuang
>            Assignee: Mit Desai
>         Attachments: HDFS-742.patch
>
>
> We had a balancer that had not made any progress for a long time. It turned 
> out it was repeatedly asking the NameNode for a partial block list of one 
> datanode, which went down while the balancer was running.
> NameNode should notify Balancer that the datanode is not available and 
> Balancer should stop asking for the datanode's block list.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
