[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838507#comment-13838507 ]

Chris Li commented on HADOOP-9640:
----------------------------------

Thanks for the look, Andrew.

bq. Parsing the MapReduce job name out of the DFSClient name is kind of an ugly 
hack. The client name also isn't that reliable since it's formed from the 
client's Configuration, and more generally anything in the RPC format that 
isn't a Kerberos token can be faked. Are these concerns in scope for your 
proposal?

bq. Tracking by user is also not going to work so well in a HiveServer2 setup 
where all Hive queries are run as the hive user. This is a pretty common DB 
security model, since you need this for column/row-level security.

This is definitely up for discussion. One way would be to add a new field 
specifically for QoS that provides the identity (whether tied to job or user). 

I'm not too familiar with HiveServer2 and what could be done there. Maybe 
there's some information that's passed through about the original user?
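
To make that concrete, here's a rough sketch of what a pluggable identity hook could look like. This is purely illustrative: the interface and class names below are hypothetical and not part of the patch.

{code:java}
import org.apache.hadoop.security.UserGroupInformation;

// Hypothetical sketch: a pluggable hook the scheduler could use to decide
// which identity an incoming call is accounted against. Whether this is
// keyed on user, job, or a dedicated QoS field is exactly what's up for
// discussion above.
public interface QosIdentityProvider {
  /**
   * @param ugi        the authenticated caller
   * @param clientName the DFSClient name (which may embed a MapReduce job id)
   * @return the string the scheduler should track usage under
   */
  String makeIdentity(UserGroupInformation ugi, String clientName);
}

// Simplest behavior: fall back to the authenticated user name, since anything
// parsed out of the client name can be spoofed by the client.
class UserQosIdentityProvider implements QosIdentityProvider {
  @Override
  public String makeIdentity(UserGroupInformation ugi, String clientName) {
    return ugi.getShortUserName();
  }
}
{code}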

bq. What's the purpose of separating read and write requests? Write requests 
take the write lock, and are thus more "expensive" in that sense, but your 
example of the listDir of a large directory is a read operation.

bq. In the "Identify suspects" section, I see that you present three options 
here. Which one do you think is best? Seems like you're leaning toward option 3.

bq. Does dropping an RPC result in exponential back-off from the client, a la 
TCP? Client backpressure is pretty important to reach a steady state.

The NN-denial-of-service plan (which uses a multi-level queue) supersedes the 
RPC congestion control doc (which focused on identifying bad users). 

bq. I didn't see any mention of fair share here, are you planning to adjust 
suspect thresholds based on client share?

Clients that over-use resources are automatically throttled by being placed 
into low-priority queues, reining them in. When many users are contending for 
100% of the server's resources, they will all tend toward an equal share.

Adjusting thresholds at runtime would be a future enhancement.
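
To illustrate the idea (this is a rough sketch, not the actual patch; the class 
name, accounting, and example thresholds are simplified placeholders):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Rough sketch of the scheduling idea: callers whose share of recent calls
// crosses more thresholds land in lower-priority queues, which the
// multiplexer serves less often.
public class HistoryBasedScheduler {
  // e.g. {0.25, 0.50, 0.75} for an even split over 4 queues
  private final double[] thresholds;
  private final Map<String, AtomicLong> callCounts = new ConcurrentHashMap<>();
  private final AtomicLong totalCalls = new AtomicLong();

  public HistoryBasedScheduler(double[] thresholds) {
    this.thresholds = thresholds;
  }

  /** Record a call and return the queue index to use (0 = highest priority). */
  public int getPriorityQueue(String identity) {
    long mine = callCounts.computeIfAbsent(identity, k -> new AtomicLong())
        .incrementAndGet();
    long total = totalCalls.incrementAndGet();
    double share = (double) mine / total;
    int queue = 0;
    for (double t : thresholds) {
      if (share > t) {
        queue++;
      }
    }
    return queue;
  }
}
{code}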

bq. Any thoughts on how to automatically determine these thresholds? These seem 
like kind of annoying parameters to tune.

There are two sets of parameters to tune:
1. the scheduler thresholds (default to an even split, e.g. with 4 queues: 25% 
each)
2. the multiplexer's round-robin weights (default to a logarithmic split, e.g. 
2^3 calls from queue 0, 2^2 from queue 1, etc.)

The defaults work pretty well for us, but different clusters will have 
different loads. The scheduler will provide JMX metrics to make it easier to 
tune.
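
For illustration, here is a simplified sketch of the round-robin draining with 
those default weights (the names are placeholders; a real implementation would 
also need to handle empty queues):

{code:java}
// Simplified sketch of the multiplexer's weighted round-robin: with weights
// {8, 4, 2, 1} handlers take up to 8 calls from queue 0, then 4 from queue 1,
// and so on, before wrapping back around to queue 0.
public class WeightedRoundRobinMux {
  private final int[] weights;   // e.g. {8, 4, 2, 1}, i.e. 2^3, 2^2, 2^1, 2^0
  private int currentQueue = 0;
  private int drawsLeft;

  public WeightedRoundRobinMux(int[] weights) {
    this.weights = weights;
    this.drawsLeft = weights[0];
  }

  /** Returns the index of the queue the next call should be taken from. */
  public synchronized int nextQueueIndex() {
    if (drawsLeft == 0) {
      currentQueue = (currentQueue + 1) % weights.length;
      drawsLeft = weights[currentQueue];
    }
    drawsLeft--;
    return currentQueue;
  }
}
{code}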

bq. Maybe admin / superuser commands and service RPCs should be excluded from 
this feature

Currently a config key (like ipc.8020.history-scheduler.service-users) 
specifies service users, which are given absolute high priority and will 
always be scheduled into the highest-priority queue. To exclude service RPC 
calls from this feature entirely, one could use the separate service RPC server.
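
For example, a sketch of how that could be wired up (the key name follows the 
pattern above but may change; the class and method names are just illustrative):

{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;

// Illustrative only: honor a per-port service-users key of the shape
// described above so that service users bypass history-based scheduling.
public class ServiceUserCheck {
  private final Set<String> serviceUsers;

  public ServiceUserCheck(Configuration conf, int port) {
    String key = "ipc." + port + ".history-scheduler.service-users";
    this.serviceUsers = new HashSet<>(
        Arrays.asList(conf.getTrimmedStrings(key)));
  }

  /** Service users always get queue 0; everyone else keeps their computed queue. */
  public int schedule(String user, int computedQueue) {
    return serviceUsers.contains(user) ? 0 : computedQueue;
  }
}
{code}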

bq. Do you have any preliminary benchmarks supporting the design? Performance 
is a pretty important aspect of this design.

I'll put some more numbers up shortly. Some preliminary results are on page 8 
of the 
[attachment|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].

I should have the code up soon as well.

> RPC Congestion Control
> ----------------------
>
>                 Key: HADOOP-9640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9640
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Xiaobo Peng
>              Labels: hdfs, qos, rpc
>         Attachments: NN-denial-of-service-updated-plan.pdf, 
> faircallqueue.patch, rpc-congestion-control-draft-plan.pdf
>
>
> Several production Hadoop cluster incidents occurred where the Namenode was 
> overloaded and failed to be responsive.  This task is to improve the system 
> to detect RPC congestion early, and to provide good diagnostic information 
> for alerts that identify suspicious jobs/users so as to restore services 
> quickly.
> Excerpted from the communication of one incident, “The map task of a user was 
> creating huge number of small files in the user directory. Due to the heavy 
> load on NN, the JT also was unable to communicate with NN...The cluster 
> became responsive only once the job was killed.”
> Excerpted from the communication of another incident, “Namenode was 
> overloaded by GetBlockLocation requests (Correction: should be getFileInfo 
> requests. the job had a bug that called getFileInfo for a nonexistent file in 
> an endless loop). All other requests to namenode were also affected by this 
> and hence all jobs slowed down. Cluster almost came to a grinding 
> halt…Eventually killed jobtracker to kill all jobs that are running.”
> Excerpted from HDFS-945, “We've seen defective applications cause havoc on 
> the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories 
> (60k files) etc.”



--
This message was sent by Atlassian JIRA
(v6.1#6144)
