[ https://issues.apache.org/jira/browse/HADOOP-4116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630713#action_12630713 ]
Hairong Kuang commented on HADOOP-4116:
---------------------------------------

Proposed changes to the Balancer:

1. Remove the use of the semaphore at DataNodes. Instead, a DataNode uses a counter to manage the number of concurrent block moves. On receiving a block move request while the maximum number of block moves is in progress, it rejects the request immediately.
2. Let the receiver initiate the block move; the sender rejects the request when its maximum number of concurrent moves has already been reached. As a result, when either the sender or the receiver does not have the resources to handle a block move, the block content is not transferred across the network.
3. The balancer does not set a timeout on a socket. Instead, it sets the KeepAlive option on the socket. So a block move does not time out no matter how slowly it goes, and the next phase of scheduling does not start while a block move is pending.

> Balancer should provide better resource management
> --------------------------------------------------
>
>                 Key: HADOOP-4116
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4116
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.17.0
>            Reporter: Raghu Angadi
>            Assignee: Hairong Kuang
>
> The number of threads is currently limited on datanodes. Once these threads
> are occupied, DataNode does not accept any more requests (DOS). Recently we
> saw a case where most of the 256 threads were waiting in
> {{DataXceiver.replaceBlock()}} trying to acquire {{balancingSem}}. Since
> rebalancing is (heavily) throttled, I would think this would be the common
> case.
> These operations waiting for active rebalancing threads to finish need not
> take up a thread.
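
As an illustration of item 1 in the proposed changes, the counter-based admission check might look roughly like the sketch below. This is not code from the Hadoop source: the class name `BlockMoveThrottle`, its methods, and the limit passed to the constructor are all hypothetical, and the sketch only shows the "reject immediately instead of blocking on a semaphore" idea.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical helper (not part of Hadoop): tracks concurrent block moves
 * with a counter and never blocks the calling thread.
 */
final class BlockMoveThrottle {
    private final int maxConcurrentMoves;               // assumed configurable limit
    private final AtomicInteger activeMoves = new AtomicInteger(0);

    BlockMoveThrottle(int maxConcurrentMoves) {
        this.maxConcurrentMoves = maxConcurrentMoves;
    }

    /**
     * Tries to reserve a slot for a block move.
     * Returns false right away when the limit is reached, so the caller
     * can reject the request instead of parking a thread.
     */
    boolean tryAcquire() {
        while (true) {
            int current = activeMoves.get();
            if (current >= maxConcurrentMoves) {
                return false;                            // at capacity: reject immediately
            }
            if (activeMoves.compareAndSet(current, current + 1)) {
                return true;                             // slot reserved
            }
            // Another thread changed the counter; re-read and retry.
        }
    }

    /** Releases a slot; call once for each successful tryAcquire(). */
    void release() {
        activeMoves.decrementAndGet();
    }

    /** Number of block moves currently in progress. */
    int active() {
        return activeMoves.get();
    }
}
```

Under these assumptions, a request handler would call `tryAcquire()` before reading any block data, send a rejection to the peer if it returns `false`, and call `release()` in a `finally` block once the move completes or fails, so a throttled rebalancing request never occupies a DataNode thread while waiting.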