We're internally running the patch I submitted on HDFS-14403, which was
subsequently modified by other people in the community, so it's possible
the community flavor behaves differently.  I vaguely remember the
RpcMetrics time unit was changed from micros to millis.  Measuring in
millis has meaningless precision.

WeightedTimeCostProvider is what enables the feature.  The blacklist is a
different feature, so if twiddling that conf caused noticeable latency
differences then I'd suggest examining that change.

I don't think you are going to see much benefit from 2 queues with a 0.01
decay factor.  I'd suggest at least 4 queues with a 0.5 decay so users
generating heavy load don't pop back up in priority so quickly.
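As a sketch of what that suggestion might look like in the config (port
8020 assumed to match your setup; thresholds omitted since they'd need
tuning per workload, so treat this as illustrative, not a tested
recommendation):

```xml
<!-- Illustrative sketch only: 4 priority levels with a 0.5 decay
     factor, per the suggestion above. -->
<property>
  <name>ipc.8020.scheduler.priority.levels</name>
  <value>4</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.decay-factor</name>
  <value>0.5</value>
</property>
```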



On Thu, Nov 5, 2020 at 11:43 AM Fengnan Li <loyal...@gmail.com> wrote:

> Thanks for the response Daryn!
>
>
>
> I agree with you that the overall average qtime will increase due to the
> penalty FCQ applies to heavy users. However, in our environment, out of
> the same consideration, I intentionally turned off call selection between
> queues; i.e., the cost is calculated as usual, but all users stay in the
> first queue. This is to avoid the overall impact.
>
> Here are our configs. The red one is what I added for internal use to turn
> on this feature (so that only selected users are actually added into the
> second queue when their cost reaches the threshold).
>
>
>
> There are two patches for Cost Based FCQ.
> https://issues.apache.org/jira/browse/HADOOP-16266
> and https://issues.apache.org/jira/browse/HDFS-14667.
> Which version are you using?
>
> I am right now trying to debug one by one.
>
>
>
> Thanks,
> Fengnan
>
>
>
> <property>
>   <name>ipc.8020.callqueue.capacity.weights</name>
>   <value>99,1</value>
> </property>
> <property>
>   <name>ipc.8020.callqueue.impl</name>
>   <value>org.apache.hadoop.ipc.FairCallQueue</value>
> </property>
> <property>
>   <name>ipc.8020.cost-provider.impl</name>
>   <value>org.apache.hadoop.ipc.WeightedTimeCostProvider</value>
> </property>
> <property>
>   <name>ipc.8020.decay-scheduler.blacklisted.users.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>ipc.8020.decay-scheduler.decay-factor</name>
>   <value>0.01</value>
> </property>
> <property>
>   <name>ipc.8020.decay-scheduler.period-ms</name>
>   <value>20000</value>
> </property>
> <property>
>   <name>ipc.8020.decay-scheduler.thresholds</name>
>   <value>15</value>
> </property>
> <property>
>   <name>ipc.8020.faircallqueue.multiplexer.weights</name>
>   <value>99,1</value>
> </property>
> <property>
>   <name>ipc.8020.scheduler.priority.levels</name>
>   <value>2</value>
> </property>
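To make the thresholds setting above concrete, here is a hypothetical
sketch of how a decay-scheduler-style threshold could map a user's share
of total cost to a priority level. The function name and logic are
illustrative, not the actual DecayRpcScheduler code:

```python
# Hypothetical sketch: map a user's percentage share of total cost to a
# queue index using a list of percentage thresholds (e.g. [15] for the
# two-level config above). Illustrative only, not Hadoop's implementation.

def priority_level(user_cost, total_cost, thresholds):
    """Return the queue index for a user given percentage thresholds."""
    share = 100.0 * user_cost / total_cost if total_cost else 0.0
    for level, threshold in enumerate(thresholds):
        if share < threshold:
            return level
    return len(thresholds)  # heaviest users land in the last queue

# With thresholds=[15] (two levels), a user consuming 20% of total cost
# drops to the low-priority queue, while a 5% user stays in queue 0:
print(priority_level(20, 100, [15]))  # -> 1
print(priority_level(5, 100, [15]))   # -> 0
```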
>
>
>
> *From: *Daryn Sharp <da...@verizonmedia.com>
> *Date: *Thursday, November 5, 2020 at 9:19 AM
> *To: *Fengnan Li <loyal...@gmail.com>
> *Cc: *Hdfs-dev <hdfs-dev@hadoop.apache.org>
> *Subject: *Re: [E] Cost Based FairCallQueue latency issue
>
>
>
> I submitted the original 2.8 cost-based FCQ patch (thanks to community
> members for porting to other branches).  We've been running with it since
> early 2019 on all clusters.  Multiple clusters run at a baseline of ~30k+
> ops/sec with some bursting over 100k ops/sec.
>
>
>
> If you are looking at the overall average qtime, yes, that metric is
> expected to increase; it means the feature is working as designed.
> De-prioritizing write-heavy users will naturally result in increased
> qtime for those calls.  Within a bucket, call N's qtime is the sum of the
> qtime+processing for the prior 0..N-1 calls.  This will appear very high
> for congested low-priority buckets receiving a fraction of the service
> rate, and it skews the overall average.
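The per-bucket arithmetic described above can be sketched as a toy model
(assumptions: fixed per-call processing time, strict FIFO within a bucket,
no interleaving between buckets), showing how a congested low-priority
bucket dominates the overall average:

```python
# Toy model: within one FIFO bucket, call N waits for the processing of
# calls 0..N-1, so queue time grows linearly with queue depth.

def bucket_qtimes(num_calls, processing_time):
    """Call N's queue time is the total processing time of calls 0..N-1."""
    return [n * processing_time for n in range(num_calls)]

high = bucket_qtimes(num_calls=10, processing_time=1.0)   # well-served bucket
low = bucket_qtimes(num_calls=100, processing_time=1.0)   # congested bucket

avg = lambda xs: sum(xs) / len(xs)
print(avg(high))        # -> 4.5
print(avg(low))         # -> 49.5
print(avg(high + low))  # overall average dominated by the congested bucket
```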
>
>
>
>
>
> On Fri, Oct 30, 2020 at 3:51 PM Fengnan Li <loyal...@gmail.com> wrote:
>
> Hi all,
>
>
>
> Has someone deployed the cost-based Fair Call Queue in their production
> cluster? We ran into some RPC queue latency degradation at ~30k-40k rps.
> I tried to debug but didn't find anything suspicious. It is worth
> mentioning that there is no memory issue from the extra heap usage for
> storing the call costs.
>
>
>
> Thanks,
>
> Fengnan
>
>
