Xiaobo Peng created HADOOP-9640:
-----------------------------------
Summary: RPC Congestion Control
Key: HADOOP-9640
URL: https://issues.apache.org/jira/browse/HADOOP-9640
Project: Hadoop Common
Issue Type: Improvement
Reporter: Xiaobo Peng
In several production Hadoop cluster incidents, the Namenode became overloaded
and unresponsive. This task is to improve the system so that it detects RPC
congestion early and provides diagnostic information for alerts that identify
suspicious jobs/users, so that service can be restored quickly.
Excerpted from the communication of one incident, “The map task of a user was
creating huge number of small files in the user directory. Due to the heavy
load on NN, the JT also was unable to communicate with NN...The cluster became
responsive only once the job was killed.”
Excerpted from the communication of another incident, “Namenode was overloaded
by GetBlockLocation requests. All other requests to namenode were also affected
by this and hence all jobs slowed down. Cluster almost came to a grinding
halt…Eventually killed jobtracker to kill all jobs that are running.”
Excerpted from HDFS-945, “We've seen defective applications cause havoc on the
NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories (60k
files) etc.”
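As a rough illustration of the kind of diagnostic information described above,
the sketch below counts RPC calls per user and reports the users responsible
for a large share of the load. It is only a sketch under stated assumptions:
RpcLoadTracker, record, heavyUsers and reset are hypothetical names, not
existing Hadoop APIs, and a real implementation would need to hook into the
RPC server and roll its counters over time windows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical helper (not part of Hadoop): per-user RPC call counts for one window. */
class RpcLoadTracker {
    private final Map<String, LongAdder> callsByUser = new ConcurrentHashMap<>();
    private final LongAdder totalCalls = new LongAdder();

    /** Record one RPC call issued by the given user. */
    void record(String user) {
        callsByUser.computeIfAbsent(user, u -> new LongAdder()).increment();
        totalCalls.increment();
    }

    /** Return users whose share of all recorded calls is at least minShare (0.0 to 1.0). */
    List<String> heavyUsers(double minShare) {
        long total = totalCalls.sum();
        List<String> result = new ArrayList<>();
        if (total == 0) {
            return result; // nothing recorded yet
        }
        for (Map.Entry<String, LongAdder> e : callsByUser.entrySet()) {
            if ((double) e.getValue().sum() / total >= minShare) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** Start a new window. Not atomic with respect to concurrent record() calls. */
    void reset() {
        callsByUser.clear();
        totalCalls.reset();
    }
}
```

An alert could then include the output of heavyUsers(0.5), for example, so
that an operator sees which user accounts for at least half of the calls in the
current window. The 0.5 threshold is an arbitrary example value.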