Patch "fs/fuse/kio: severe reordering when cs is congested" fixed reordering happening when cs congested, but unexpectedly introduced serious degradation in random read performance.
The problem is hard and requires serious investigation, so for the
current release we suggest undoing that patch. This should be safe,
since the problem it fixed was never actually observed in practice.
Still, a retest of the read tests for replicas is required.

The option "relax_congavoid" is added. The default value 1 restores the
behaviour as it was before the patch "fs/fuse/kio: severe reordering
when cs is congested", 0 returns the formally correct but badly
performing behaviour, and 2 disables congestion avoidance for local
CSes, which might be the most satisfactory variant, yet it is
experimental.

Summary of the problem: when the target CS is congested, requests are
queued in the client, waiting for the completion of some of the already
sent requests. But at high rates (>~1M iops) our own local CPU gets so
busy that, to our algorithm, this is indistinguishable from actual CS
congestion. So requests start to be queued, which drastically increases
the processing cost and puts even more pressure on the CPU. This
unexpected positive feedback reduces iops in 64-thread tests by >20%,
and I was able to construct a configuration where it is worse by 50%,
which is unacceptable.

A proper solution is still unknown; positive feedback problems are
difficult. Either we change queue accounting so that locally queued
requests are not counted, or we optimize the code to make requeueing
cheap, or we disable congestion limits on the fast path; none of these
is a quick fix.
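To make the three modes easier to follow, here is the resulting logic
in pcs_cslist_submit_read(), condensed from the diff below with
explanatory comments added (the unrelated parts of the function are
omitted):

	spin_lock(&cs->lock);
	cs_cwnd_use_or_lose(cs);
	allot = cs->eff_cwnd - cs->in_flight;

	/* relax_congavoid & 2: never throttle a local CS (experimental). */
	if ((relax_congavoid & 2) && test_bit(CS_SF_LOCAL, &cs->state))
		goto skip_cong_avoid;

	/* relax_congavoid != 0: take the pre-patch fast path while the
	 * window is not exhausted and the congestion queue is empty.
	 */
	if (relax_congavoid && allot >= 0 && list_empty(&cs->cong_queue))
		goto skip_cong_avoid;

	/* relax_congavoid == 0: always try to enqueue. Formally correct
	 * ordering, but on a busy local CPU this is what triggers the
	 * positive feedback described above.
	 */
	if (__pcs_cs_cong_enqueue_cond(ireq, cs)) {
		spin_unlock(&cs->lock);
		return 0;	/* queued, will be resubmitted later */
	}
skip_cong_avoid:
	spin_unlock(&cs->lock);

Since the parameter is declared with module_param(..., 0644), it should
also be changeable at runtime through the module parameters directory
in sysfs (the exact module name depends on the build).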
Affects: #VSTOR-103628

Signed-off-by: Alexey Kuznetsov <kuz...@virtuozzo.com>
---
 fs/fuse/kio/pcs/pcs_cs.c  |  4 +---
 fs/fuse/kio/pcs/pcs_cs.h  | 12 +++++++++++-
 fs/fuse/kio/pcs/pcs_map.c | 23 ++++++++++++++++++-----
 3 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/kio/pcs/pcs_cs.c b/fs/fuse/kio/pcs/pcs_cs.c
index 299ece8..6eef589 100644
--- a/fs/fuse/kio/pcs/pcs_cs.c
+++ b/fs/fuse/kio/pcs/pcs_cs.c
@@ -1549,11 +1549,10 @@ void pcs_cs_set_stat_up(struct pcs_cs_set *set)
 	pcs_cs_for_each_entry(set, do_update_stat, 0);
 }
 
-int pcs_cs_cong_enqueue_cond(struct pcs_int_request *ireq, struct pcs_cs *cs)
+int __pcs_cs_cong_enqueue_cond(struct pcs_int_request *ireq, struct pcs_cs *cs)
 {
 	int queued = 0;
 
-	spin_lock(&cs->lock);
 	if (cs->in_flight >= cs->eff_cwnd ||
 	    (cs->cong_queue_len && !(ireq->flags & IREQ_F_REQUEUED))) {
 		queued = 1;
@@ -1571,7 +1570,6 @@ int pcs_cs_cong_enqueue_cond(struct pcs_int_request *ireq, struct pcs_cs *cs)
 		queued, ireq, smp_processor_id(), cs->cong_queue_len,
 		cs->in_flight, cs->eff_cwnd, DENTRY_ARGS(ireq->dentry),
 		ireq->iochunk.chunk + ireq->iochunk.offset, ireq->iochunk.size);
-	spin_unlock(&cs->lock);
 
 	return queued;
 }
diff --git a/fs/fuse/kio/pcs/pcs_cs.h b/fs/fuse/kio/pcs/pcs_cs.h
index d2e80fe..1fdc502 100644
--- a/fs/fuse/kio/pcs/pcs_cs.h
+++ b/fs/fuse/kio/pcs/pcs_cs.h
@@ -128,7 +128,17 @@ static inline void pcs_cs_activate_cong_queue(struct pcs_cs *cs)
 	list_splice_tail_init(&cs->cong_queue, &cs->active_list);
 }
 
-int pcs_cs_cong_enqueue_cond(struct pcs_int_request *ireq, struct pcs_cs *cs);
+int __pcs_cs_cong_enqueue_cond(struct pcs_int_request *ireq, struct pcs_cs *cs);
+
+static inline int pcs_cs_cong_enqueue_cond(struct pcs_int_request *ireq, struct pcs_cs *cs)
+{
+	int queued = 0;
+
+	spin_lock(&cs->lock);
+	queued = __pcs_cs_cong_enqueue_cond(ireq, cs);
+	spin_unlock(&cs->lock);
+	return queued;
+}
 
 #define PCS_CS_HASH_SIZE 1024
 #define PCS_HOST_HASH_SIZE 64
diff --git a/fs/fuse/kio/pcs/pcs_map.c b/fs/fuse/kio/pcs/pcs_map.c
index 5d1dd5a..b491f38 100644
--- a/fs/fuse/kio/pcs/pcs_map.c
+++ b/fs/fuse/kio/pcs/pcs_map.c
@@ -46,6 +46,10 @@
 module_param(cs_enable_fanout, uint, 0644);
 MODULE_PARM_DESC(cs_enable_fanout, "Enable CS fanout");
 
+int relax_congavoid = 1;
+module_param(relax_congavoid, int, 0644);
+MODULE_PARM_DESC(relax_congavoid, "Weaker congestion avoidance");
+
 static struct pcs_cs_list *cs_link_to_cs_list(struct pcs_cs_link *csl)
 {
 	struct pcs_cs_record *cs_rec;
@@ -2063,15 +2067,24 @@ static int pcs_cslist_submit_read(struct pcs_int_request *ireq, struct pcs_cs_li
 	spin_lock(&cs->lock);
 	cs_cwnd_use_or_lose(cs);
 	allot = cs->eff_cwnd - cs->in_flight;
-	spin_unlock(&cs->lock);
 
-	if (allot < 0 || cs->cong_queue_len) {
-		if (pcs_cs_cong_enqueue_cond(ireq, cs))
-			return 0;
+	if ((relax_congavoid & 2) && test_bit(CS_SF_LOCAL, &cs->state))
+		goto skip_cong_avoid;
+
+	if (relax_congavoid && allot >= 0 && list_empty(&cs->cong_queue))
+		goto skip_cong_avoid;
+
+	if (__pcs_cs_cong_enqueue_cond(ireq, cs)) {
+		spin_unlock(&cs->lock);
+		return 0;
 	}
+skip_cong_avoid:
+
+	spin_unlock(&cs->lock);
+
 
 	iochunk = READ_ONCE(ireq->dentry->cluster->cfg.lmss);
-	if (allot < iochunk)
+	if (allot < (int)iochunk)
 		allot = iochunk;
 
 	if (test_bit(CS_SF_LOCAL, &cs->state))
-- 
1.8.3.1