[ https://issues.apache.org/jira/browse/HADOOP-4116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634074#action_12634074 ]
Steve Loughran commented on HADOOP-4116:
----------------------------------------

Keepalives are a fairly weak way of assessing "liveness" because
- they work at the network stack level, so your app may be dead while the KA packets keep flowing happily
- if there are a lot of (idle) connections between two hosts, a lot of KA traffic can be generated, rather than the one packet per host that a lot of protocols (CORBA and DCOM, for example) use to communicate "we are still alive".

I think this proposal is better than nothing, but we need to be aware of its limitations. It will detect a network partition, but not a hung far end if the network stack is still up.

> Balancer should provide better resource management
> --------------------------------------------------
>
>                 Key: HADOOP-4116
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4116
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.0
>            Reporter: Raghu Angadi
>            Assignee: Hairong Kuang
>            Priority: Blocker
>             Fix For: 0.18.2, 0.19.0
>
>         Attachments: balancerRM.patch, balancerRM1.patch
>
>
> The number of threads is currently limited on datanodes. Once these threads
> are occupied, DataNode does not accept any more requests (DOS). Recently we
> saw a case where most of the 256 threads were waiting in
> {{DataXceiver.replaceBlock()}} trying to acquire {{balancingSem}}. Since
> rebalancing is (heavily) throttled, I would think this would be the common
> case.
> These operations waiting for active rebalancing threads to finish need not
> take up a thread.
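For illustration only, the point in the issue description that waiting operations "need not take up a thread" can be sketched with a non-blocking permit check. This is a minimal sketch under stated assumptions, not the DataNode code and not what the attached patches do: the class `BalanceThrottle`, the method `tryReplaceBlock`, and the permit count are all hypothetical names and values. Only `java.util.concurrent.Semaphore` and its `tryAcquire()`/`release()` methods are real APIs.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical sketch; names and limits are assumptions, not Hadoop code. */
public class BalanceThrottle {
    // Assumed cap on concurrent balancing moves (illustrative value only).
    private final Semaphore permits;

    public BalanceThrottle(int maxConcurrentMoves) {
        this.permits = new Semaphore(maxConcurrentMoves);
    }

    /**
     * Runs the move only if a permit is free right now. Otherwise returns
     * false immediately, so the calling thread goes back to the pool
     * instead of parking on the semaphore.
     */
    public boolean tryReplaceBlock(Runnable move) {
        if (!permits.tryAcquire()) {
            return false; // caller would tell the requester to retry later
        }
        try {
            move.run();
            return true;
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        BalanceThrottle throttle = new BalanceThrottle(1);
        boolean outer = throttle.tryReplaceBlock(() -> {
            // The single permit is held here, so a second request is refused at once.
            boolean inner = throttle.tryReplaceBlock(() -> { });
            System.out.println("inner accepted: " + inner);
        });
        System.out.println("outer accepted: " + outer);
    }
}
```

The trade-off in this sketch is that a refused request has to be retried by whoever sent it, which moves the waiting out of the datanode's thread pool rather than eliminating it.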