[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Li updated HADOOP-9640:
-----------------------------

    Description: 
Several production Hadoop cluster incidents occurred where the Namenode was 
overloaded and failed to respond. 

We can improve quality of service for users during Namenode peak loads by 
replacing the FIFO call queue with a [Fair Call 
Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
(This plan supersedes rpc-congestion-control-draft-plan.pdf.)
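For illustration only, here is a minimal sketch of the idea (hypothetical class name, 
priority thresholds, and weights; not the attached patch): incoming calls are routed 
into priority sub-queues based on recent per-user call volume, and the handler threads 
drain the levels with a weighted round-robin, so a single heavy user cannot starve 
everyone else.

{code:java}
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Illustrative sketch only: calls from heavy users are pushed into
 * lower-priority sub-queues, and handlers drain the sub-queues with a
 * weighted round-robin so light users keep getting served under load.
 */
public class FairCallQueueSketch<E> {

  private final BlockingQueue<E>[] queues;   // index 0 = highest priority
  private final int[] drainWeights;          // calls served per level before rotating
  private final Map<String, AtomicLong> callCounts = new ConcurrentHashMap<>();
  private int currentQueue = 0;
  private int drainedFromCurrent = 0;

  @SuppressWarnings("unchecked")
  public FairCallQueueSketch(int levels, int capacityPerLevel) {
    queues = new BlockingQueue[levels];
    drainWeights = new int[levels];
    for (int i = 0; i < levels; i++) {
      queues[i] = new ArrayBlockingQueue<>(capacityPerLevel);
      drainWeights[i] = 1 << (levels - 1 - i);   // e.g. 8, 4, 2, 1 for 4 levels
    }
  }

  /** Hypothetical scheduler: the more calls a user has made, the lower its priority. */
  private int priorityFor(String user) {
    long calls = callCounts.computeIfAbsent(user, u -> new AtomicLong()).incrementAndGet();
    if (calls < 100) {
      return 0;
    }
    if (calls < 1000) {
      return Math.min(1, queues.length - 1);
    }
    return queues.length - 1;
  }

  /** Called by the RPC reader threads instead of offering to a single FIFO queue. */
  public void put(String user, E call) throws InterruptedException {
    queues[priorityFor(user)].put(call);
  }

  /** Called by the RPC handler threads; weighted round-robin across the levels. */
  public synchronized E poll() {
    // Rotate away from the current level once it has used up its share.
    if (drainedFromCurrent >= drainWeights[currentQueue]) {
      currentQueue = (currentQueue + 1) % queues.length;
      drainedFromCurrent = 0;
    }
    for (int scanned = 0; scanned < queues.length; scanned++) {
      E call = queues[currentQueue].poll();
      if (call != null) {
        drainedFromCurrent++;
        return call;
      }
      // Nothing waiting at this level; move on to the next one.
      currentQueue = (currentQueue + 1) % queues.length;
      drainedFromCurrent = 0;
    }
    return null;   // every sub-queue is empty
  }
}
{code}

A production version would decay the per-user counts over time and block in take() 
rather than returning null, but the fairness mechanism is the same.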

Excerpted from the communication of one incident, “The map task of a user was 
creating huge number of small files in the user directory. Due to the heavy 
load on NN, the JT also was unable to communicate with NN...The cluster became 
responsive only once the job was killed.”

Excerpted from the communication of another incident, “Namenode was overloaded 
by GetBlockLocation requests (Correction: should be getFileInfo requests. the 
job had a bug that called getFileInfo for a nonexistent file in an endless 
loop). All other requests to namenode were also affected by this and hence all 
jobs slowed down. Cluster almost came to a grinding halt…Eventually killed 
jobtracker to kill all jobs that are running.”

Excerpted from HDFS-945, “We've seen defective applications cause havoc on the 
NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories (60k 
files) etc.”


  was:
Several production Hadoop cluster incidents occurred where the Namenode was 
overloaded and became unresponsive. This task is to improve the system to 
detect RPC congestion early and to provide good diagnostic information for 
alerts that identify suspicious jobs/users, so services can be restored quickly.

Excerpted from the communication of one incident, “The map task of a user was 
creating huge number of small files in the user directory. Due to the heavy 
load on NN, the JT also was unable to communicate with NN...The cluster became 
responsive only once the job was killed.”

Excerpted from the communication of another incident, “Namenode was overloaded 
by GetBlockLocation requests (Correction: should be getFileInfo requests. the 
job had a bug that called getFileInfo for a nonexistent file in an endless 
loop). All other requests to namenode were also affected by this and hence all 
jobs slowed down. Cluster almost came to a grinding halt…Eventually killed 
jobtracker to kill all jobs that are running.”

Excerpted from HDFS-945, “We've seen defective applications cause havoc on the 
NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories (60k 
files) etc.”



> RPC Congestion Control
> ----------------------
>
>                 Key: HADOOP-9640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9640
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Xiaobo Peng
>              Labels: hdfs, qos, rpc
>         Attachments: NN-denial-of-service-updated-plan.pdf, 
> faircallqueue.patch, rpc-congestion-control-draft-plan.pdf
>
>
> Several production Hadoop cluster incidents occurred where the Namenode was 
> overloaded and failed to respond. 
> We can improve quality of service for users during Namenode peak loads by 
> replacing the FIFO call queue with a [Fair Call 
> Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
> (This plan supersedes rpc-congestion-control-draft-plan.pdf.)
> Excerpted from the communication of one incident, “The map task of a user was 
> creating huge number of small files in the user directory. Due to the heavy 
> load on NN, the JT also was unable to communicate with NN...The cluster 
> became responsive only once the job was killed.”
> Excerpted from the communication of another incident, “Namenode was 
> overloaded by GetBlockLocation requests (Correction: should be getFileInfo 
> requests. the job had a bug that called getFileInfo for a nonexistent file in 
> an endless loop). All other requests to namenode were also affected by this 
> and hence all jobs slowed down. Cluster almost came to a grinding 
> halt…Eventually killed jobtracker to kill all jobs that are running.”
> Excerpted from HDFS-945, “We've seen defective applications cause havoc on 
> the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories 
> (60k files) etc.”



