[ 
https://issues.apache.org/jira/browse/HADOOP-4116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634441#action_12634441
 ] 

Steve Loughran commented on HADOOP-4116:
----------------------------------------

The ping operation we are proposing for the lifecycle in HADOOP-3628 and 
HADOOP-3969 does a better health check: it asks the far end whether it thinks 
it is healthy, so it can detect both a dead endpoint (given suitable timeouts) 
and a machine that reports itself as unwell. But the most reliable way to check 
system health is to give that node real work and see whether it completes it 
in time. That could be done as a low-priority job across a cluster: queue work 
and check the results, though you'd need to direct the work to specific nodes 
somehow. 
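
As a rough illustration of the "give it real work and see if it finishes in 
time" idea, here is a minimal sketch. It is not the ping API from HADOOP-3628 
or HADOOP-3969; the WorkProbe class, the Health enum and the probe() method are 
hypothetical names, and a local ExecutorService merely stands in for "a 
specific node", since how to direct work to a given node is the open question.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: judges a node by whether it finishes real work in time.
final class WorkProbe {
    enum Health { HEALTHY, UNWELL, UNRESPONSIVE }

    // 'node' is an assumption: an executor standing in for one specific node.
    // 'work' returns true if the node produced the expected result.
    static Health probe(ExecutorService node, Callable<Boolean> work,
                        long timeoutMillis) throws InterruptedException {
        Future<Boolean> result = node.submit(work);
        try {
            // Wait for the work, but never longer than the deadline.
            boolean ok = result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return ok ? Health.HEALTHY : Health.UNWELL;
        } catch (TimeoutException e) {
            // No answer in time: treat the node as dead or stuck.
            result.cancel(true);
            return Health.UNRESPONSIVE;
        } catch (ExecutionException e) {
            // The work ran but failed: the node answered, but is not healthy.
            return Health.UNWELL;
        }
    }
}
```

The three outcomes mirror the distinction above: a node that completes the 
work, a node that responds but is unwell, and a node that never responds 
within the timeout.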

> Balancer should provide better resource management
> --------------------------------------------------
>
>                 Key: HADOOP-4116
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4116
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.0
>            Reporter: Raghu Angadi
>            Assignee: Hairong Kuang
>            Priority: Blocker
>             Fix For: 0.18.2, 0.19.0
>
>         Attachments: balancerRM.patch, balancerRM1.patch, 
> balancerRM2-b18.patch, balancerRM2.patch
>
>
> The number of threads is currently limited on datanodes. Once these threads 
> are occupied, the DataNode does not accept any more requests (DoS). Recently 
> we saw a case where most of the 256 threads were waiting in 
> {{DataXceiver.replaceBlock()}} trying to acquire {{balancingSem}}. Since 
> rebalancing is (heavily) throttled, I would think this would be the common 
> case. 
> These operations waiting for active rebalancing threads to finish need not 
> take up a thread. 
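
One way to keep waiting operations from holding a thread is to try for the 
permit without blocking and reject the request when none is free. The sketch 
below illustrates only that general pattern with java.util.concurrent.Semaphore; 
it is not taken from the attached patches, and BalanceThrottle and tryRun() are 
hypothetical names, not DataNode code.

```java
import java.util.concurrent.Semaphore;

// Hypothetical throttle: limits concurrent block moves without parking threads.
final class BalanceThrottle {
    private final Semaphore permits;

    BalanceThrottle(int maxConcurrentMoves) {
        this.permits = new Semaphore(maxConcurrentMoves);
    }

    // Runs 'move' only if a permit is free right now.
    // Returns false immediately instead of blocking the calling thread,
    // so the caller can report "busy" and let the requester retry later.
    boolean tryRun(Runnable move) {
        if (!permits.tryAcquire()) {
            return false;
        }
        try {
            move.run();
            return true;
        } finally {
            // Always hand the permit back, even if the move throws.
            permits.release();
        }
    }
}
```

With this shape, a request that arrives while all permits are taken costs a 
thread only for the time it takes to be refused, rather than for the duration 
of the throttled moves ahead of it.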

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
