[ https://issues.apache.org/jira/browse/HADOOP-4116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634441#action_12634441 ]

Steve Loughran commented on HADOOP-4116:
----------------------------------------

The ping operation we are proposing for the lifecycle in HADOOP-3628 and
HADOOP-3969 does a better health check, as it asks the far end if it thinks
it is happy, and can detect a dead end (with suitable timeouts) and a machine
that thinks it is unwell.

But the most reliable way to check system health is to give that node real
work and see if it completes it within time. That's something that could be
done as a low priority job across a cluster: queue work and check the
results, though you'd need to direct the work to specific nodes somehow.

> Balancer should provide better resource management
> --------------------------------------------------
>
>                 Key: HADOOP-4116
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4116
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.0
>            Reporter: Raghu Angadi
>            Assignee: Hairong Kuang
>            Priority: Blocker
>             Fix For: 0.18.2, 0.19.0
>
>         Attachments: balancerRM.patch, balancerRM1.patch,
> balancerRM2-b18.patch, balancerRM2.patch
>
>
> The number of threads are currently limited on datanodes. Once these threads
> are occupied, DataNode does not accept any more requests (DOS). Recently we
> saw a case where most of the 256 threads were waiting in
> {{DataXceiver.replaceBlock()}} trying to acquire {{balancingSem}}. Since
> rebalancing is (heavily) throttled, I would think this would be the common
> case.
> These operations waiting for active rebalancing threads to finish need not
> take up a thread.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
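
To make the point in the issue description concrete (requests waiting on
{{balancingSem}} need not hold a thread), here is a minimal sketch of one
possible approach: try to take a permit without blocking and turn the request
away at once if none is free. This is an illustration only, not the attached
patch or the actual DataXceiver code. The class BalanceThrottle, the method
tryReplaceBlock() and the permit count are all hypothetical names chosen for
the example; only java.util.concurrent.Semaphore is a real API.

```java
import java.util.concurrent.Semaphore;

public class BalanceThrottle {
    // Hypothetical stand-in for balancingSem; the limit is an assumption.
    private final Semaphore balancingSem;

    public BalanceThrottle(int maxConcurrentMoves) {
        this.balancingSem = new Semaphore(maxConcurrentMoves);
    }

    /**
     * Runs the move if a permit is free. Otherwise returns false at once,
     * so the calling thread is not parked waiting on the semaphore.
     */
    public boolean tryReplaceBlock(Runnable move) {
        if (!balancingSem.tryAcquire()) {
            return false; // busy: the caller is expected to retry later
        }
        try {
            move.run();
            return true;
        } finally {
            balancingSem.release();
        }
    }

    public static void main(String[] args) {
        BalanceThrottle throttle = new BalanceThrottle(1);
        boolean outer = throttle.tryReplaceBlock(() -> {
            // The only permit is held here, so a second request is rejected.
            boolean inner = throttle.tryReplaceBlock(() -> {});
            System.out.println("inner accepted: " + inner);
        });
        System.out.println("outer accepted: " + outer);
    }
}
```

With a single permit, the nested request is refused while the outer move is
still running, and the refusing thread returns immediately. The trade-off is
that the sender (the balancer, in this scenario) has to handle the rejection
and retry, rather than relying on the datanode to queue the request.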