Thanks for the response Daryn!
I agree with you that the overall average qtime will increase due to the penalty FCQ imposes on heavy users. However, in our environment, out of that same consideration I intentionally turned off call selection between queues: the cost is calculated as usual, but all users stay in the first queue. This is to avoid the overall impact. Here are our configs; the red one is what I added for internal use to turn on this feature (so that only selected users are actually added into the second queue when their cost reaches the threshold).

There are two patches for cost-based FCQ:
https://issues.apache.org/jira/browse/HADOOP-16266
https://issues.apache.org/jira/browse/HDFS-14667
Which version are you using? I am right now trying to debug them one by one.

Thanks,
Fengnan

<property>
  <name>ipc.8020.callqueue.capacity.weights</name>
  <value>99,1</value>
</property>
<property>
  <name>ipc.8020.callqueue.impl</name>
  <value>org.apache.hadoop.ipc.FairCallQueue</value>
</property>
<property>
  <name>ipc.8020.cost-provider.impl</name>
  <value>org.apache.hadoop.ipc.WeightedTimeCostProvider</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.blacklisted.users.enabled</name>
  <value>true</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.decay-factor</name>
  <value>0.01</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.period-ms</name>
  <value>20000</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.thresholds</name>
  <value>15</value>
</property>
<property>
  <name>ipc.8020.faircallqueue.multiplexer.weights</name>
  <value>99,1</value>
</property>
<property>
  <name>ipc.8020.scheduler.priority.levels</name>
  <value>2</value>
</property>

From: Daryn Sharp <da...@verizonmedia.com>
Date: Thursday, November 5, 2020 at 9:19 AM
To: Fengnan Li <loyal...@gmail.com>
Cc: Hdfs-dev <hdfs-dev@hadoop.apache.org>
Subject: Re: [E] Cost Based FairCallQueue latency issue

I submitted the original 2.8 cost-based FCQ patch (thanks to community members for porting it to other branches). We've been running with it since early 2019 on all clusters. Multiple clusters run at a baseline of ~30k+ ops/sec, with some bursting over 100k ops/sec.

If you are looking at the overall average qtime, yes, that metric is expected to increase, and it means the feature is working as designed. De-prioritizing write-heavy users will naturally result in increased qtime for those calls. Within a bucket, call N's qtime is the sum of the qtime+processing for the prior 0..N-1 calls. This will appear very high for congested low-priority buckets receiving a fraction of the service rate, and it skews the overall average.

On Fri, Oct 30, 2020 at 3:51 PM Fengnan Li <loyal...@gmail.com> wrote:

Hi all,

Has anyone deployed the cost-based fair call queue in their production cluster? We ran into some RPC queue latency degradation at ~30k-40k rps. I tried to debug but didn't find anything suspicious. It is worth mentioning that there is no memory issue from the extra heap usage for storing the call costs.

Thanks,
Fengnan
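
Daryn's qtime arithmetic can be made concrete with a small standalone sketch. This is not Hadoop code and makes illustrative assumptions: a uniform 1 ms processing time per call, 100 calls queued in each bucket, and a low-priority bucket that, mirroring the multiplexer weights 99,1, receives roughly 1 of every 100 service slots. Within each FIFO bucket, call N's queue time is the accumulated service gap of the prior 0..N-1 calls, so the starved bucket dominates the overall average:

```java
// Sketch only: models how a congested low-priority bucket skews
// the overall average queue time in a two-level fair call queue.
public class QtimeSkewDemo {
    public static void main(String[] args) {
        double processingMs = 1.0;   // assumed per-call processing time
        int callsPerBucket = 100;    // assumed queue depth in each bucket

        // High-priority bucket: served back-to-back, so call N
        // waits for the N prior calls, i.e. N * processingMs.
        double highTotal = 0;
        for (int n = 0; n < callsPerBucket; n++) {
            highTotal += n * processingMs;
        }

        // Low-priority bucket: with multiplexer weights 99,1 it gets
        // ~1 of every 100 service slots, so the gap between two of its
        // calls being served is ~100x the processing time.
        double lowGapMs = processingMs * 100;
        double lowTotal = 0;
        for (int n = 0; n < callsPerBucket; n++) {
            lowTotal += n * lowGapMs;
        }

        double highAvg = highTotal / callsPerBucket;
        double lowAvg = lowTotal / callsPerBucket;
        double overallAvg = (highTotal + lowTotal) / (2.0 * callsPerBucket);

        System.out.printf("high-priority avg qtime: %.2f ms%n", highAvg);
        System.out.printf("low-priority avg qtime:  %.2f ms%n", lowAvg);
        System.out.printf("overall avg qtime:       %.2f ms%n", overallAvg);
    }
}
```

With these toy numbers the high-priority average stays under 50 ms while the overall average is pulled up two orders of magnitude by the starved bucket, which is why the overall average qtime metric rising is expected behavior rather than a regression.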