[
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Li updated HADOOP-9640:
-----------------------------
Attachment: faircallqueue.patch
Combats congestion via RPC-level priority
> RPC Congestion Control
> ----------------------
>
> Key: HADOOP-9640
> URL: https://issues.apache.org/jira/browse/HADOOP-9640
> Project: Hadoop Common
> Issue Type: Improvement
> Affects Versions: 3.0.0, 2.2.0
> Reporter: Xiaobo Peng
> Labels: hdfs, qos, rpc
> Attachments: faircallqueue.patch,
> rpc-congestion-control-draft-plan.pdf
>
>
> Several production Hadoop cluster incidents occurred where the Namenode was
> overloaded and failed to be responsive. This task is to improve the system
> to detect RPC congestion early, and to provide good diagnostic information
> for alerts that identify suspicious jobs/users so as to restore services
> quickly.
> Excerpted from the communication of one incident, “The map task of a user was
> creating huge number of small files in the user directory. Due to the heavy
> load on NN, the JT also was unable to communicate with NN...The cluster
> became responsive only once the job was killed.”
> Excerpted from the communication of another incident, “Namenode was
> overloaded by GetBlockLocation requests (Correction: should be getFileInfo
> requests. the job had a bug that called getFileInfo for a nonexistent file in
> an endless loop). All other requests to namenode were also affected by this
> and hence all jobs slowed down. Cluster almost came to a grinding
> halt…Eventually killed jobtracker to kill all jobs that are running.”
> Excerpted from HDFS-945, “We've seen defective applications cause havoc on
> the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories
> (60k files) etc.”
--
This message was sent by Atlassian JIRA
(v6.1#6144)