Thanks for the response, Daryn!

 

I agree that the overall average qtime will increase due to the penalty FCQ 
imposes on heavy users. However, in our environment, for exactly that reason, I 
intentionally turned off the call selection between queues; i.e., the cost is 
calculated as usual, but all users stay in the first queue. This avoids the 
overall impact.

Here are our configs; the one marked in red is what I added for internal use to 
turn on this feature (so that only selected users are actually moved into the 
second queue when their cost reaches the threshold).
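Conceptually, the internal flag gates queue placement roughly like this (a minimal Python sketch; all names here are hypothetical illustrations, not the actual patch code, and the real logic lives inside the scheduler):

```python
# Sketch of gating second-queue placement to selected users only.
# Hypothetical names/values; the threshold mirrors decay-scheduler.thresholds = 15.

SELECTED_USERS = {"user-a", "user-b"}  # only these users can be demoted
COST_THRESHOLD = 0.15                  # fraction of total cost that triggers demotion

def priority_level(user: str, cost_fraction: float) -> int:
    """Return 0 (first queue) or 1 (second queue)."""
    # Cost is still computed and decayed for everyone as usual...
    if cost_fraction >= COST_THRESHOLD and user in SELECTED_USERS:
        return 1  # ...but only selected users are actually moved down
    return 0      # everyone else stays in the first queue

print(priority_level("user-a", 0.30))  # selected heavy user -> 1
print(priority_level("other", 0.30))   # unselected heavy user -> 0
```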

 

There are two patches for cost-based FCQ: 
https://issues.apache.org/jira/browse/HADOOP-16266 and 
https://issues.apache.org/jira/browse/HDFS-14667. Which version are you using? 

I am currently trying to debug them one by one.

 

Thanks,
Fengnan

 

<property>
  <name>ipc.8020.callqueue.capacity.weights</name>
  <value>99,1</value>
</property>
<property>
  <name>ipc.8020.callqueue.impl</name>
  <value>org.apache.hadoop.ipc.FairCallQueue</value>
</property>
<property>
  <name>ipc.8020.cost-provider.impl</name>
  <value>org.apache.hadoop.ipc.WeightedTimeCostProvider</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.blacklisted.users.enabled</name>
  <value>true</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.decay-factor</name>
  <value>0.01</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.period-ms</name>
  <value>20000</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.thresholds</name>
  <value>15</value>
</property>
<property>
  <name>ipc.8020.faircallqueue.multiplexer.weights</name>
  <value>99,1</value>
</property>
<property>
  <name>ipc.8020.scheduler.priority.levels</name>
  <value>2</value>
</property>

 

From: Daryn Sharp <da...@verizonmedia.com>
Date: Thursday, November 5, 2020 at 9:19 AM
To: Fengnan Li <loyal...@gmail.com>
Cc: Hdfs-dev <hdfs-dev@hadoop.apache.org>
Subject: Re: [E] Cost Based FairCallQueue latency issue

 

I submitted the original 2.8 cost-based FCQ patch (thanks to community members 
for porting it to other branches). We've been running with it since early 2019 
on all clusters. Multiple clusters run at a baseline of ~30k+ ops/sec, with 
some bursting over 100k ops/sec.

 

If you are looking at the overall average qtime, yes, that metric is expected 
to increase; it means the feature is working as designed. De-prioritizing 
write-heavy users will naturally result in increased qtime for those calls. 
Within a bucket, call N's qtime is the sum of the qtime plus processing time of 
the prior 0..N-1 calls. This will appear very high for congested low-priority 
buckets receiving a fraction of the service rate, and it skews the overall 
average.
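To make that cumulative effect concrete, here is a toy computation (purely illustrative numbers, not measurements from any cluster) of how a throttled low-priority bucket inflates the overall average qtime even when the high-priority bucket is healthy:

```python
# Toy model: within one bucket, call N's qtime is the cumulative
# qtime+processing of the prior 0..N-1 calls in that bucket.

def bucket_qtimes(processing_times):
    qtimes, backlog = [], 0.0
    for p in processing_times:
        qtimes.append(backlog)  # call waits for everything ahead of it
        backlog += p            # its processing adds to the next call's wait
    return qtimes

high = bucket_qtimes([1.0] * 10)   # well-serviced bucket: effective 1s/call
low = bucket_qtimes([10.0] * 10)   # throttled bucket: effective 10s/call

avg_high = sum(high) / len(high)
avg_low = sum(low) / len(low)
overall = (sum(high) + sum(low)) / (len(high) + len(low))
print(avg_high)   # 4.5
print(avg_low)    # 45.0
print(overall)    # 24.75 -- overall average dominated by the congested bucket
```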

 

 

On Fri, Oct 30, 2020 at 3:51 PM Fengnan Li <loyal...@gmail.com> wrote:

Hi all,



Has anyone deployed the cost-based Fair Call Queue in their production cluster? 
We ran into some RPC queue latency degradation at ~30k-40k rps. I tried to 
debug it but didn't find anything suspicious. It is worth mentioning that there 
is no memory issue from the extra heap usage for storing call costs.



Thanks,

Fengnan
