We're internally running the patch I submitted on HDFS-14403, which was subsequently modified by other people in the community, so it's possible the community flavor behaves differently. I vaguely remember the RpcMetrics time unit was changed from micros to millis. Measuring in millis has effectively meaningless precision, since typical queue times are sub-millisecond.
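The point about millisecond granularity can be seen with a toy calculation; the sample values below are made up, not measurements from any cluster:

```python
# Queue times on a busy NameNode are often well under a millisecond, so a
# millisecond-granularity metric rounds most samples down to zero and the
# distribution becomes invisible.
queue_times_us = [120, 380, 40, 950, 2100]  # illustrative microsecond samples
queue_times_ms = [t // 1000 for t in queue_times_us]
print(queue_times_ms)  # → [0, 0, 0, 0, 2]
```

Four of the five samples collapse to zero at millisecond granularity, which is why microsecond measurement matters here.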
WeightedTimeCostProvider is what enables the feature. The blacklist is a different feature, so if twiddling that conf caused noticeable latency differences then I'd suggest examining that change. I don't think you are going to see much benefit from 2 queues with a 0.01 decay factor. I'd suggest at least 4 queues with a 0.5 decay so users generating heavy load don't keep popping back up in priority so quickly.

On Thu, Nov 5, 2020 at 11:43 AM Fengnan Li <loyal...@gmail.com> wrote:

> Thanks for the response Daryn!
>
> I agree with you that the overall average qtime will increase due to the
> penalty FCQ brings to heavy users. However, in our environment, out of
> the same consideration I intentionally turned off the call selection
> between queues, i.e. the cost is calculated as usual, but all users stay
> in the first queue. This is to avoid the overall impact.
>
> Here are our configs; the red one is what I added for internal use to
> turn on this feature (making only selected users actually get added into
> the second queue when their cost reaches the threshold).
>
> There are two patches for Cost Based FCQ:
> https://issues.apache.org/jira/browse/HADOOP-16266
> and https://issues.apache.org/jira/browse/HDFS-14667.
> Which version are you using?
>
> I am right now trying to debug one by one.
>
> Thanks,
> Fengnan
>
> <property>
>   <name>ipc.8020.callqueue.capacity.weights</name>
>   <value>99,1</value>
> </property>
> <property>
>   <name>ipc.8020.callqueue.impl</name>
>   <value>org.apache.hadoop.ipc.FairCallQueue</value>
> </property>
> <property>
>   <name>ipc.8020.cost-provider.impl</name>
>   <value>org.apache.hadoop.ipc.WeightedTimeCostProvider</value>
> </property>
> <property>
>   <name>ipc.8020.decay-scheduler.blacklisted.users.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>ipc.8020.decay-scheduler.decay-factor</name>
>   <value>0.01</value>
> </property>
> <property>
>   <name>ipc.8020.decay-scheduler.period-ms</name>
>   <value>20000</value>
> </property>
> <property>
>   <name>ipc.8020.decay-scheduler.thresholds</name>
>   <value>15</value>
> </property>
> <property>
>   <name>ipc.8020.faircallqueue.multiplexer.weights</name>
>   <value>99,1</value>
> </property>
> <property>
>   <name>ipc.8020.scheduler.priority.levels</name>
>   <value>2</value>
> </property>
>
> *From:* Daryn Sharp <da...@verizonmedia.com>
> *Date:* Thursday, November 5, 2020 at 9:19 AM
> *To:* Fengnan Li <loyal...@gmail.com>
> *Cc:* Hdfs-dev <hdfs-dev@hadoop.apache.org>
> *Subject:* Re: [E] Cost Based FairCallQueue latency issue
>
> I submitted the original 2.8 cost-based FCQ patch (thanks to community
> members for porting it to other branches). We've been running with it
> since early 2019 on all clusters. Multiple clusters run at a baseline of
> ~30k+ ops/sec, with some bursting over 100k ops/sec.
>
> If you are looking at the overall average qtime, yes, that metric is
> expected to increase and means it's working as designed. De-prioritizing
> write-heavy users will naturally result in increased qtime for those
> calls. Within a bucket, call N's qtime is the sum of the qtime+processing
> for the prior 0..N-1 calls.
> This will appear very high for congested low-priority buckets receiving
> a fraction of the service rate, and will skew the overall average.
>
> On Fri, Oct 30, 2020 at 3:51 PM Fengnan Li <loyal...@gmail.com> wrote:
>
> > Hi all,
> >
> > Has someone deployed the Cost Based Fair Call Queue in their production
> > cluster? We ran into some RPC queue latency degradation at ~30k-40k rps.
> > I tried to debug but didn't find anything suspicious. It is worth
> > mentioning there is no memory issue coming with the extra heap usage
> > for storing the call cost.
> >
> > Thanks,
> >
> > Fengnan
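The decay-factor advice at the top of the thread can be illustrated with a minimal sketch. This deliberately simplifies the real DecayRpcScheduler (which ranks users by their share of total cost against the configured thresholds); the cost and threshold numbers here are hypothetical, chosen only to show how the decay factor controls how quickly a heavy user regains priority:

```python
def periods_until_below(cost, decay_factor, threshold):
    """Decay an accumulated cost once per scheduler period (as the decay
    scheduler does every period-ms) and count how many periods pass
    before it drops below the demotion threshold."""
    periods = 0
    while cost >= threshold:
        cost *= decay_factor
        periods += 1
    return periods

# A heavy user who accumulated 1000 cost units, with a hypothetical
# demotion threshold of 50 units:
print(periods_until_below(1000, 0.01, 50))  # decay-factor 0.01 → 1 period
print(periods_until_below(1000, 0.5, 50))   # decay-factor 0.5  → 5 periods
```

With a 0.01 decay factor the user's cost collapses to 1% after a single period, so they pop right back up to high priority; with 0.5 the cost halves each period and the demotion persists several periods longer, which is the behavior the suggestion of 4 queues with 0.5 decay is after.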