[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2019-02-08 Thread Erik Krogen (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763778#comment-16763778
 ] 

Erik Krogen commented on HADOOP-9640:
-

Moved HADOOP-10286, HADOOP-13029, HADOOP-10599, HADOOP-10598 to issues instead 
of subtasks. Now with all subtasks complete, I will close out this umbrella.

> RPC Congestion Control with FairCallQueue
> -
>
> Key: HADOOP-9640
> URL: https://issues.apache.org/jira/browse/HADOOP-9640
> Project: Hadoop Common
> Issue Type: Improvement
> Affects Versions: 2.2.0, 3.0.0-alpha1
> Reporter: Xiaobo Peng
> Assignee: Chris Li
> Priority: Major
> Labels: hdfs, qos, rpc
> Attachments: FairCallQueue-PerformanceOnCluster.pdf, 
> MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf, 
> faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch, 
> faircallqueue4.patch, faircallqueue5.patch, faircallqueue6.patch, 
> faircallqueue7_with_runtime_swapping.patch, 
> rpc-congestion-control-draft-plan.pdf
>
>
> For an easy-to-read summary see: 
> http://www.ebaytechblog.com/2014/08/21/quality-of-service-in-hadoop/
> Several production Hadoop cluster incidents occurred where the Namenode was 
> overloaded and failed to respond. 
> We can improve quality of service for users during namenode peak loads by 
> replacing the FIFO call queue with a 
> [Fair Call Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf] 
> (this plan supersedes rpc-congestion-control-draft-plan).
> Excerpted from the communication of one incident, “The map task of a user was 
> creating huge number of small files in the user directory. Due to the heavy 
> load on NN, the JT also was unable to communicate with NN...The cluster 
> became responsive only once the job was killed.”
> Excerpted from the communication of another incident, “Namenode was 
> overloaded by GetBlockLocation requests (Correction: should be getFileInfo 
> requests. the job had a bug that called getFileInfo for a nonexistent file in 
> an endless loop). All other requests to namenode were also affected by this 
> and hence all jobs slowed down. Cluster almost came to a grinding 
> halt…Eventually killed jobtracker to kill all jobs that are running.”
> Excerpted from HDFS-945, “We've seen defective applications cause havoc on 
> the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories 
> (60k files) etc.”



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2019-02-08 Thread Wei-Chiu Chuang (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763774#comment-16763774
 ] 

Wei-Chiu Chuang commented on HADOOP-9640:
-

Works for me. Thanks for the help here! We didn't enable FairCallQueue and the 
related features in CDH5, and we have only just started looking into this 
feature. I am reviewing HADOOP-10286 (as well as HDFS-12345).




[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2019-02-08 Thread Erik Krogen (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763772#comment-16763772
 ] 

Erik Krogen commented on HADOOP-9640:
-

Hi [~linyiqun], I just uploaded some documentation at HADOOP-16097. Please take 
a look!

[~jojochuang], regarding your comment about cleaning up this JIRA, I am 
thinking of moving all unresolved subtasks out (I would consider all of the 
as-yet unresolved tasks to be follow-ons) and closing this umbrella. Let me 
know your thoughts. Also, if you have some time to help review HADOOP-10286, it 
would be appreciated :)




[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2018-06-28 Thread Yiqun Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526157#comment-16526157
 ] 

Yiqun Lin commented on HADOOP-9640:
---

Hi all,
Any progress on this issue? This looks like a good feature that can be used in 
production environments. Is there any documentation from which we can learn how 
to configure it? Much appreciated :).




[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2015-09-07 Thread Ajith S (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14733724#comment-14733724
 ] 

Ajith S commented on HADOOP-9640:
-

I missed checking the sub-tasks. I would like to update the documentation for 
FairCallQueue, as it introduces a lot of configuration. It would be better if 
we could explain it briefly, similar to 
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html.
I have added a sub-task for it.



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2015-09-04 Thread Chris Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731185#comment-14731185
 ] 

Chris Li commented on HADOOP-9640:
--

[~ajithshetty] Sure, what did you have in mind?



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2015-09-03 Thread Ajith S (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14730272#comment-14730272
 ] 

Ajith S commented on HADOOP-9640:
-

Hi [~chrilisf],

Any progress on this issue? If you are not looking into it, I would like to 
continue the work :)



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2014-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177698#comment-14177698
 ] 

Hadoop QA commented on HADOOP-9640:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12641612/FairCallQueue-PerformanceOnCluster.pdf
  against trunk revision e90718f.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/4932//console

This message is automatically generated.



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2014-05-19 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002801#comment-14002801
 ] 

Ming Ma commented on HADOOP-9640:
-

Sorry, there was a typo above; I meant thousands of RPC requests in the RPC queue.



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2014-05-05 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990308#comment-13990308
 ] 

Ming Ma commented on HADOOP-9640:
-

Thanks, Chris.

1. The current approach drops calls when the RPC queue is full, and the client 
relies on the RPC timeout. It would be interesting to confirm whether it is 
useful to have the RPC server throw an exception back to the client and have 
the client do exponential backoff, or maybe just block the RPC reader thread 
instead.

2. The RPC-based approach doesn't account for HTTP requests such as webHDFS. 
Based on some test results, it seems Jetty uses around 250 threads, small 
compared to the thousands of RPC handler threads. a) Bad application traffic 
from webHDFS still has an impact on RPC latency, though not as severe as in the 
RPC case. b) If there are SLA jobs based on webHDFS, then RPC throttling won't 
help much.
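
The first option in point 1 (server signals overload, client backs off) can be sketched roughly as follows. This is a minimal illustration in Python rather than Hadoop's Java, and it is not how the Hadoop IPC client is implemented: `ServerBusyError`, `backoff_delays`, `call_with_backoff`, and all delay parameters are hypothetical names and values chosen for the example.

```python
import random
import time


class ServerBusyError(Exception):
    """Hypothetical signal that the server's call queue is full."""


def backoff_delays(base=0.1, factor=2.0, cap=5.0, attempts=5):
    """Return the un-jittered delay in seconds before each retry."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def call_with_backoff(rpc, base=0.1, factor=2.0, cap=5.0, attempts=5,
                      sleep=time.sleep, rng=random.random):
    """Invoke rpc(); on ServerBusyError wait, then retry with a longer wait."""
    for delay in backoff_delays(base, factor, cap, attempts):
        try:
            return rpc()
        except ServerBusyError:
            # Jitter: sleep a random fraction of the delay so that many
            # rejected clients do not all retry at the same instant.
            sleep(delay * rng())
    # Final attempt; if it still fails, the error propagates to the caller.
    return rpc()
```

Compared with silently dropping the call and waiting for the RPC timeout, an explicit rejection lets the client slow down immediately; the cap keeps the worst-case wait bounded.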



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2014-05-02 Thread Chris Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988113#comment-13988113
 ] 

Chris Li commented on HADOOP-9640:
--

Uploaded patches to HADOOP-10279, HADOOP-10281, and HADOOP-10282 for feedback. 
The new scheduler fixes the performance issues identified in the earlier PDF 
too.



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2014-04-23 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977949#comment-13977949
 ] 

Ming Ma commented on HADOOP-9640:
-

Nice work. Some high-level comments:

1. At some point, we might need to prioritize DN RPC over client RPC so that, 
no matter what applications do to NN RPC and FSNamesystem's global lock, DN 
requests will be processed in a timely manner. We can do it in two ways: a) 
configure a global RPC server and have the pluggable CallQueue handle that, or 
b) have one RPC server for client requests and one RPC server for service 
requests; for that we will need some abstraction like 
https://issues.apache.org/jira/browse/HDFS-5639.

2. CallQueue priority policy. Perhaps this could be left to the plugin 
implementation. It can be a somewhat soft policy like FairCallQueue, or have 
some sort of allocation quota like other schedulers, e.g., if we know the 
cluster has allocated 50% to some group at the YARN layer, perhaps it is OK to 
assume that NN RPC requests for that group can be around 50%.
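
The quota idea in point 2 can be sketched as a small priority-assignment function. This is only an illustration of the policy described above, written in Python rather than Hadoop's Java, and it is not the FairCallQueue implementation: `assign_priority`, the group names, the quota table, and the number of levels are all hypothetical.

```python
def assign_priority(group, recent_calls, quotas, levels=4):
    """Map a caller group to a queue level, where 0 is the highest priority.

    recent_calls: dict of group -> number of calls seen in the current window.
    quotas: dict of group -> allowed share of all calls, between 0 and 1.
    """
    total = sum(recent_calls.values())
    if total == 0:
        return 0  # no load observed yet, so nothing to penalize
    share = recent_calls.get(group, 0) / total
    quota = quotas.get(group, 0.0)
    if share <= quota:
        return 0  # within its allocation: serve from the top queue
    if quota == 0.0:
        return levels - 1  # active but has no allocation: lowest priority
    # The further a group exceeds its quota, the lower its priority.
    overshoot = share / quota  # greater than 1 at this point
    return min(levels - 1, int(overshoot))
```

For example, with `recent_calls = {"etl": 50, "adhoc": 50}` and `quotas = {"etl": 0.5, "adhoc": 0.1}`, the `etl` group stays at level 0 while `adhoc`, at five times its quota, falls to the lowest level.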



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2014-04-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979194#comment-13979194
 ] 

Hadoop QA commented on HADOOP-9640:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12641612/FairCallQueue-PerformanceOnCluster.pdf
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3841//console

This message is automatically generated.



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2014-02-27 Thread Chris Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915068#comment-13915068
 ] 

Chris Li commented on HADOOP-9640:
--

Hey all, 

Can anyone check out https://issues.apache.org/jira/browse/HADOOP-10280 and 
give feedback on the next stage?



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2014-02-18 Thread Chris Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13904731#comment-13904731
 ] 

Chris Li commented on HADOOP-9640:
--

Ignore above comment, it was meant for HADOOP-10278



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2014-01-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881747#comment-13881747
 ] 

Hadoop QA commented on HADOOP-9640:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12624909/faircallqueue7_with_runtime_swapping.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 1547 javac 
compiler warnings (more than the trunk's current 1546 warnings).

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 5 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.ipc.TestQueueRuntimeReconfigure
  org.apache.hadoop.hdfs.server.namenode.TestNameNodeHttpServer

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3467//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3467//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-common.html
Javac warnings: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3467//artifact/trunk/patchprocess/diffJavacWarnings.txt
Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3467//console

This message is automatically generated.



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2014-01-24 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881428#comment-13881428
 ] 

Kihwal Lee commented on HADOOP-9640:


Can low priority requests starve higher priority requests? If a low priority 
call queue is full and all reader threads are blocked on put() for adding calls 
belonging to that queue, newly arriving higher priority requests won't get 
processed even if their corresponding queue is not full.  If the request rate 
stays greater than service rate for some time in this state, the listen queue 
will likely overflow and all types of requests will suffer regardless of 
priority. 
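
The scenario described above can be reproduced in miniature. The following is only a sketch under stated assumptions, not code from any attached patch: the class name `ReaderBlockingSketch`, the single reader thread, and the queue capacity of 1 are all hypothetical and chosen to make the effect easy to see. The reader blocks in put() on the full low-priority queue, so the high-priority call behind it is never enqueued even though its queue has room.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of a reader thread blocked on a full queue.
class ReaderBlockingSketch {

    // Returns true if the high-priority call was enqueued within waitMillis.
    static boolean highPriorityEnqueued(long waitMillis) {
        BlockingQueue<String> low = new ArrayBlockingQueue<>(1);
        BlockingQueue<String> high = new ArrayBlockingQueue<>(1);
        low.add("low-1"); // the low-priority queue is now full; nothing drains it

        // A single reader thread handles calls strictly in arrival order.
        Thread reader = new Thread(() -> {
            try {
                low.put("low-2");   // blocks here because the low queue is full
                high.put("high-1"); // not reached while the reader is blocked
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.setDaemon(true);
        reader.start();

        try {
            // Wait for the high-priority call to show up in its (empty) queue.
            return high.poll(waitMillis, TimeUnit.MILLISECONDS) != null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            reader.interrupt(); // release the blocked reader
        }
    }
}
```

Calling `ReaderBlockingSketch.highPriorityEnqueued(200)` returns false in this setup, because the only reader never gets past the blocked put(). With more reader threads the effect takes longer to appear but is the same once all of them are blocked.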



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2014-01-24 Thread Chris Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881465#comment-13881465
 ] 

Chris Li commented on HADOOP-9640:
--

[~kihwal] In the first 6 versions of this patch, this does indeed happen. It's 
partially alleviated due to the round-robin withdrawal from the queues.

In the latest iteration of the patch (7), the reader threads would lock on the 
queue's putLock like they do in trunk. I think this behavior is more intuitive.

Today I will be breaking this JIRA up into subtasks to make it easier to review.





[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2014-01-24 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881556#comment-13881556
 ] 

Daryn Sharp commented on HADOOP-9640:
-

Agreed, this needs subtasks.  General comments/requests:
# Please make the default callq a {{BlockingQueue}} again, and have your custom 
implementations conform to the interface.
# The default callq should remain a {{LinkedBlockingQueue}}, not a 
{{FIFOCallQueue}}.  You're doing some pretty tricky locking and I'd rather 
trust the JDK.
# Call.getRemoteUser() would be much cleaner to get the UGI than an interface + 
enum to get user and group.
# Using the literal string "unknown!" for a user or group is not a good idea.

The more I think about it, multiple queues will exacerbate the congestion 
problem, as Kihwal points out.  For that reason, I'd like to see minimal 
invasiveness in the Server class - that way I'll feel safe and you are free to 
experiment with alternate implementations.



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2014-01-24 Thread Chris Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881571#comment-13881571
 ] 

Chris Li commented on HADOOP-9640:
--

[~daryn] 

Thanks for your feedback.

Some points of clarification:

3. The identity is meant to be configurable, so you can schedule by user, by 
group, and in the future by job.
4. Any suggestions?



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2014-01-24 Thread Chris Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881614#comment-13881614
 ] 

Chris Li commented on HADOOP-9640:
--

I've uploaded the first of the patches to 
https://issues.apache.org/jira/browse/HADOOP-10278

It allows the user to use a custom call queue specified via configuration, but 
falls back on a LinkedBlockingQueue otherwise.
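
As a rough illustration of that behavior (not the code in the HADOOP-10278 patch), a configuration-driven factory could look like the sketch below. The class name `CallQueueFactorySketch`, the key `example.callqueue.impl`, and the use of a plain `Map` in place of Hadoop's configuration object are all assumptions made for the example. It also assumes the custom class has a public constructor taking an `int` capacity.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: pick the call queue class from configuration.
class CallQueueFactorySketch {

    // Hypothetical configuration key, used only in this example.
    static final String IMPL_KEY = "example.callqueue.impl";

    @SuppressWarnings("unchecked")
    static <E> BlockingQueue<E> create(Map<String, String> conf, int capacity) {
        String className = conf.get(IMPL_KEY);
        if (className == null) {
            // Nothing configured: fall back on the JDK's LinkedBlockingQueue.
            return new LinkedBlockingQueue<>(capacity);
        }
        try {
            Class<?> clazz = Class.forName(className);
            if (!BlockingQueue.class.isAssignableFrom(clazz)) {
                throw new IllegalArgumentException(
                    className + " does not implement BlockingQueue");
            }
            // Assumes a public constructor that takes the capacity as an int.
            return (BlockingQueue<E>)
                clazz.getConstructor(int.class).newInstance(capacity);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                "Cannot instantiate call queue " + className, e);
        }
    }
}
```

In this sketch a misconfigured class name fails loudly rather than silently falling back; whether the real patch does the same is not stated here.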

I'd like to take any further discussions about this aspect to the subtask, and 
get some feedback.

Thanks



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2014-01-23 Thread Mayank Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880231#comment-13880231
 ] 

Mayank Bansal commented on HADOOP-9640:
---

Hi [~sureshms] 

Can you please take a look at this jira?

Thanks,
Mayank



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2014-01-10 Thread Chris Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868553#comment-13868553
 ] 

Chris Li commented on HADOOP-9640:
--

I'd like to get people's thoughts on a new feature: Dynamic reconfiguration

h5. Motivation
1. The tuning of parameters will be important for optimal performance
2. We can recover faster from bad parameters
3. The cost of doing a NN failover to change parameters is too high, while this 
would be much faster (seconds)

h5. User Interface

Much like `hadoop mradmin -refreshQueueAcls`, the user would run a command to 
reload the CallQueue based on the current config.
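
One way such a reload could work internally is to swap the queue behind an atomic reference and move pending calls across. This is a minimal sketch only, not the design in faircallqueue7_with_runtime_swapping.patch: the class `SwappableQueueSketch` and its methods are hypothetical, it assumes the replacement queue has room for every pending call, and it ignores the race where a producer still holds a reference to the old queue during the swap.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of swapping a call queue at runtime.
class SwappableQueueSketch<E> {

    private final AtomicReference<BlockingQueue<E>> current;

    SwappableQueueSketch(BlockingQueue<E> initial) {
        current = new AtomicReference<>(initial);
    }

    // Non-blocking enqueue into whichever queue is active right now.
    boolean offer(E call) {
        return current.get().offer(call);
    }

    // Non-blocking dequeue; returns null when the active queue is empty.
    E poll() {
        return current.get().poll();
    }

    // Installs the replacement, then moves pending calls over in order.
    // Returns the number of calls moved.
    int swap(BlockingQueue<E> replacement) {
        BlockingQueue<E> old = current.getAndSet(replacement);
        int moved = 0;
        E call;
        while ((call = old.poll()) != null) {
            if (!replacement.offer(call)) {
                // Assumption of this sketch: the new queue is large enough.
                throw new IllegalStateException("replacement queue is full");
            }
            moved++;
        }
        return moved;
    }
}
```

A production version would need to deal with producers and consumers that are mid-operation on the old queue when the swap happens, which this sketch deliberately leaves out.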




[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2013-12-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849674#comment-13849674
 ] 

Hadoop QA commented on HADOOP-9640:
---

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12618954/faircallqueue6.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3361//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3361//console

This message is automatically generated.



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2013-12-10 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844506#comment-13844506
 ] 

Suresh Srinivas commented on HADOOP-9640:
-

I had an in-person meeting with [~chrili] on this. This is excellent work!

bq. Parsing the MapReduce job name out of the DFSClient name is kind of an ugly 
hack. The client name also isn't that reliable since it's formed from the 
client's Configuration
I had suggested this to [~chrili]. I realize that the configuration passed from 
MapReduce is actually a task ID. So the client name based on that will not be 
useful, unless we parse it to get the job ID.
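
As a rough illustration of the parsing discussed here, the sketch below pulls a 
job ID out of a client name. It assumes the client name has the shape 
DFSClient_<task attempt ID>_<random>_<thread id> and that the attempt ID looks 
like attempt_<cluster timestamp>_<job number>_<m|r>_<task>_<attempt>. Both 
shapes, and the class and method names, are assumptions for illustration and 
are not taken from the patch.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JobIdFromClientName {
    // Assumed shape (not confirmed by the patch):
    // DFSClient_attempt_<cluster>_<job>_<m|r>_<task>_<attempt>_<random>_<thread>
    private static final Pattern ATTEMPT = Pattern.compile(
        "^DFSClient_attempt_(\\d+)_(\\d+)_[mr]_\\d+_\\d+(?:_.*)?$");

    /** Returns the job ID embedded in the client name, if one can be found. */
    public static Optional<String> jobId(String clientName) {
        Matcher m = ATTEMPT.matcher(clientName);
        if (!m.matches()) {
            // Clients that are not MapReduce tasks carry no job ID.
            return Optional.empty();
        }
        return Optional.of("job_" + m.group(1) + "_" + m.group(2));
    }

    public static void main(String[] args) {
        System.out.println(
            jobId("DFSClient_attempt_201312100000_0042_m_000003_0_-123456_1"));
        System.out.println(jobId("DFSClient_NONMAPREDUCE_987654_1"));
    }
}
```

Because the name is built on the client side, anything derived from it this way 
is only as trustworthy as the client, which is the weakness raised in the quoted 
comment.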

I agree that this is not the way the final solution should work. I propose 
adding some kind of configuration that can be passed to establish the context 
in which access to services is happening. Currently this is done by the 
MapReduce framework: it sets the configuration that is used to form the 
DFSClient name.

We could do the following to satisfy the various user requirements:
# Add a new configuration in common called hadoop.application.context to 
HDFS. Other services that want to do the same thing can either use this same 
configuration or find another way to configure it. This information should be 
marshalled from the client to the server. The congestion control can be built 
based on that.
# Let's also make the identities used for accounting configurable. They can be 
based on context, user, token, or default. That way people who do not like the 
default configuration can make changes.
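
The two points above could be sketched roughly as follows. This is a minimal 
illustration under stated assumptions, not the patch's design: the key name 
ipc.accounting.identity, the CallInfo holder, and the choice to treat "default" 
as the user name are all hypothetical, and a plain Map stands in for the Hadoop 
Configuration object.

```java
import java.util.Locale;
import java.util.Map;

public class AccountingIdentity {
    /** Identity sources named in the proposal. */
    public enum Mode { CONTEXT, USER, TOKEN, DEFAULT }

    /** Hypothetical stand-in for what the server knows about one call. */
    public static final class CallInfo {
        final String user;
        final String context;  // e.g. value of hadoop.application.context, may be null
        final String tokenId;  // may be null

        public CallInfo(String user, String context, String tokenId) {
            this.user = user;
            this.context = context;
            this.tokenId = tokenId;
        }
    }

    // Hypothetical key name, used only for this sketch.
    public static final String MODE_KEY = "ipc.accounting.identity";

    /** Reads the mode from configuration, falling back to DEFAULT on bad input. */
    public static Mode modeFrom(Map<String, String> conf) {
        String raw = conf.getOrDefault(MODE_KEY, "default");
        try {
            return Mode.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return Mode.DEFAULT;
        }
    }

    /** Picks the string that calls are accounted under. */
    public static String identityOf(CallInfo call, Mode mode) {
        switch (mode) {
            case CONTEXT:
                return call.context != null ? call.context : call.user;
            case TOKEN:
                return call.tokenId != null ? call.tokenId : call.user;
            case USER:
            case DEFAULT:
            default:
                // Assumption: the default accounting identity is the user name.
                return call.user;
        }
    }
}
```

Falling back to the user name when the context or token is missing keeps every 
call accountable to some identity, at the cost of mixing identity types in one 
accounting table.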



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2013-12-10 Thread Chris Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844743#comment-13844743
 ] 

Chris Li commented on HADOOP-9640:
--

bq. Add a new configuration in common called hadoop.application.context to 
HDFS. Other services that want to do the same thing can either use this same 
configuration or find another way to configure it. This information should be 
marshalled from the client to the server. The congestion control can be built 
based on that.

Just to be clear, would an example be,
1. Cluster operator specifies ipc.8020.application.context = hadoop.yarn
2. Namenode sees this, knows to load the class that generates job IDs from the 
Connection/Call?

Or were you thinking of physically adding the ID into the RPC call itself? 
That would make the RPC call size larger, but it is a cleaner solution (albeit 
one that the client could spoof).

bq. Let's also make the identities used for accounting configurable. They can 
be based on context, user, token, or default. That way people who do not like 
the default configuration can make changes.

Sounds like a good idea.



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2013-12-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843551#comment-13843551
 ] 

Hadoop QA commented on HADOOP-9640:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12617895/faircallqueue4.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3347//console

This message is automatically generated.



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2013-12-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843586#comment-13843586
 ] 

Hadoop QA commented on HADOOP-9640:
---

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12617900/faircallqueue5.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3348//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3348//console

This message is automatically generated.



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2013-12-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841670#comment-13841670
 ] 

Hadoop QA commented on HADOOP-9640:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12617457/faircallqueue2.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3343//console

This message is automatically generated.



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2013-12-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841762#comment-13841762
 ] 

Hadoop QA commented on HADOOP-9640:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12617475/faircallqueue3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:red}-1 javac{color}.  The applied patch generated 1546 javac 
compiler warnings (more than the trunk's current 1545 warnings).

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3345//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3345//artifact/trunk/patchprocess/diffJavacWarnings.txt
Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3345//console

This message is automatically generated.



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2013-12-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839654#comment-13839654
 ] 

Hadoop QA commented on HADOOP-9640:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12617098/MinorityMajorityPerformance.pdf
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3335//console

This message is automatically generated.



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2013-12-04 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839710#comment-13839710
 ] 

Daryn Sharp commented on HADOOP-9640:
-

I haven't read all the docs, and I've only skimmed the patch, but the entire 
feature _must be configurable_.  As in a toggle to directly use the 
{{LinkedBlockingQueue}} as today.  An activity surge often isn't indicative of 
abuse, nor do I necessarily want heavy users to have priority above all others 
because there are multiple equal heavy users, nor do I want to debug priority 
inversions at this time. :)
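
A toggle of the kind asked for here might look like the sketch below. It is 
only an illustration: the key ipc.callqueue.fair.enabled is a hypothetical 
name, a plain Map stands in for the Hadoop Configuration object, and 
PlaceholderFairQueue is an empty stand-in rather than the patch's queue. With 
the flag absent or false, the server gets a plain LinkedBlockingQueue, as today.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class CallQueueToggle {
    // Hypothetical key name, used only for this sketch.
    public static final String FAIR_ENABLED_KEY = "ipc.callqueue.fair.enabled";

    /** Empty stand-in for a fair queue; it adds no scheduling of its own. */
    public static final class PlaceholderFairQueue<E> extends LinkedBlockingQueue<E> {
        private static final long serialVersionUID = 1L;

        PlaceholderFairQueue(int capacity) {
            super(capacity);
        }
    }

    /** Builds the call queue; defaults to the existing FIFO behaviour. */
    public static <E> BlockingQueue<E> create(Map<String, String> conf, int capacity) {
        boolean fair = Boolean.parseBoolean(conf.getOrDefault(FAIR_ENABLED_KEY, "false"));
        if (fair) {
            return new PlaceholderFairQueue<E>(capacity);
        }
        return new LinkedBlockingQueue<E>(capacity);
    }
}
```

Keeping the default path identical to the current queue means clusters that 
never set the flag see no behavioural change.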

I do think the patch might have potential performance benefits, as your graph 
mentions, from multiple queues lowering lock contention between the 100 hungry 
handlers.  I've been working to lower lock contention, so while in the RPC 
layer I considered playing with the callQ but it wasn't even a blip in the 
profiler.

However, you can't extrapolate performance improvements from 2 client threads, 
2 server threads, and multiple queues.  I think you've effectively eliminated 
any lock contention and given each client their own queue.  2 threads will 
produce negligible contention with even 1 queue.  Things don't get ugly till 
you have many threads contending.  Measurements with at least 16-32 clients & 
server threads become interesting!
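
A measurement along those lines could start from something like the sketch 
below, which pushes calls through a single LinkedBlockingQueue with a 
configurable number of producer ("client") and consumer ("handler") threads. 
It is a rough harness under assumed parameters (queue capacity 1024, 100,000 
calls per producer), not the benchmark used for the attached graphs, and the 
wall-clock timing includes thread start-up.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class QueueContentionSketch {
    /** Runs producers and consumers against one queue; returns calls handled. */
    public static long run(int producers, int consumers, int callsPerProducer) {
        final BlockingQueue<Integer> queue = new LinkedBlockingQueue<>(1024);
        final AtomicLong handled = new AtomicLong();
        final CountDownLatch done = new CountDownLatch(producers * callsPerProducer);
        ExecutorService pool = Executors.newFixedThreadPool(producers + consumers);
        try {
            for (int c = 0; c < consumers; c++) {
                pool.execute(() -> {
                    try {
                        while (true) {
                            queue.take();              // blocks until a call arrives
                            handled.incrementAndGet();
                            done.countDown();
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt(); // stop on shutdownNow()
                    }
                });
            }
            for (int p = 0; p < producers; p++) {
                pool.execute(() -> {
                    try {
                        for (int i = 0; i < callsPerProducer; i++) {
                            queue.put(i);              // blocks while the queue is full
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            done.await();                              // wait until every call is handled
            return handled.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            pool.shutdownNow();                        // interrupts the blocked consumers
        }
    }

    public static void main(String[] args) {
        for (int threads : new int[] {2, 16, 32}) {
            long start = System.nanoTime();
            long handled = run(threads, threads, 100_000);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(threads + " clients/handlers: "
                + handled + " calls in " + ms + " ms");
        }
    }
}
```

Comparing the 2-thread run with the 16- and 32-thread runs gives a first look 
at how queue contention grows with thread count; a proper benchmark would add 
warm-up runs and repeat each configuration.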



[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

2013-12-03 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838519#comment-13838519
 ] 

Andrew Wang commented on HADOOP-9640:
-

Sorry, it looks like I read the wrong document. I'm glad you still found some 
of my comments useful, but I'll read the updated one too and get back to you :)
