[ 
https://issues.apache.org/jira/browse/HDFS-10966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15678192#comment-15678192
 ] 

Zhe Zhang edited comment on HDFS-10966 at 11/19/16 12:22 AM:
-------------------------------------------------------------

Thanks Kihwal for the review! Uploading a new patch:
# This patch does change the Balancer behavior introduced by HDFS-4261, 
specifically the timeout logic, but I don't think there's a negative effect. 
By staying in the {{dispatchBlocks}} while loop longer, the only overhead is 
the call to {{chooseNextMove}}, which only checks local state and adds no 
NameNode workload. Even if we jump out of the while loop, the thread for that 
Source cannot be reused for another Source anyway. In {{TestBalancer}} I reset 
the config value to 5s, and the run time is normal.
# Added to {{hdfs-default.xml}}, thx for the catch.
# I think it is a good idea, added.



> Enhance Dispatcher logic on deciding when to give up a source DataNode
> ----------------------------------------------------------------------
>
>                 Key: HDFS-10966
>                 URL: https://issues.apache.org/jira/browse/HDFS-10966
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: balancer & mover
>            Reporter: Zhe Zhang
>            Assignee: Mark Wagner
>         Attachments: HDFS-10966.00.patch, HDFS-10966.01.patch
>
>
> When a {{Dispatcher}} thread works on a source DataNode, in each iteration it 
> tries to execute a {{PendingMove}}. If no block is moved after 5 iterations, 
> this source (over-utilized) DataNode is given up for this Balancer iteration 
> (20 mins). This is problematic if the source DataNode was heavily loaded at 
> the beginning of the iteration. It will quickly encounter 5 unsuccessful 
> moves and be abandoned.
> We should enhance this logic by, e.g., using elapsed time instead of the 
> number of iterations.
> {code}
> // Check if the previous move was successful
>         } else {
>           // source node cannot find a pending block to move, iteration +1
>           noPendingMoveIteration++;
>           // in case no blocks can be moved for source node's task,
>           // jump out of while-loop after 5 iterations.
>           if (noPendingMoveIteration >= MAX_NO_PENDING_MOVE_ITERATIONS) {
>             LOG.info("Failed to find a pending move " + noPendingMoveIteration
>                 + " times.  Skipping " + this);
>             resetScheduledSize();
>           }
>         }
> {code}
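
The elapsed-time idea proposed in the description can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the attached patches: the class {{NoMoveDeadline}}, its methods, and the injected clock are hypothetical names introduced here. The source is given up only once no move has been scheduled for longer than a configured interval, rather than after a fixed number of empty iterations.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not from the HDFS-10966 patch): tracks how long a
 * source DataNode has gone without a scheduled move and reports when the
 * dispatcher loop should give up on it.
 */
final class NoMoveDeadline {
  private final long maxNoMoveIntervalMs; // assumed to come from configuration
  private final LongSupplier clockMs;     // injectable clock, in milliseconds
  private long lastMoveTimeMs;

  NoMoveDeadline(long maxNoMoveIntervalMs, LongSupplier clockMs) {
    this.maxNoMoveIntervalMs = maxNoMoveIntervalMs;
    this.clockMs = clockMs;
    this.lastMoveTimeMs = clockMs.getAsLong();
  }

  /** Call whenever a pending move was found and scheduled. */
  void recordMove() {
    lastMoveTimeMs = clockMs.getAsLong();
  }

  /** True once no move has been recorded for longer than the interval. */
  boolean shouldGiveUp() {
    return clockMs.getAsLong() - lastMoveTimeMs > maxNoMoveIntervalMs;
  }
}
```

In a loop shaped like the one quoted above, {{recordMove()}} would take the place of resetting the counter and {{shouldGiveUp()}} would take the place of the {{MAX_NO_PENDING_MOVE_ITERATIONS}} comparison. Injecting the clock keeps the timeout testable without sleeping; a real caller would presumably pass a monotonic time source.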


