On 3/19/26 10:53, Aaron Tomlin wrote:
> In high-performance storage environments, particularly when utilising
> RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
> latency spikes can occur when fast devices (SSDs) are starved of hardware
> tags when sharing the same blk_mq_tag_set.
>
> Currently, diagnosing this specific hardware queue contention is
> difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
> forces the current thread to block uninterruptible via io_schedule().
> While this can be inferred via sched:sched_switch or dynamically
> traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
> dedicated, out-of-the-box observability for this event.
>
> This patch introduces the block_rq_tag_wait static trace point in the
> tag allocation slow-path. It triggers immediately before the thread
> yields the CPU, exposing the exact hardware context (hctx) that is
> starved, the specific pool experiencing starvation (hardware or software
> scheduler), and the total pool depth.
>
> This provides storage engineers and performance monitoring agents
> with a zero-configuration, low-overhead mechanism to definitively
> identify shared-tag bottlenecks and tune I/O schedulers or cgroup
> throttling accordingly.
>
> Signed-off-by: Aaron Tomlin <[email protected]>
> ---
> Changes in v1 [1]:
>  - Improved the description of the trace point (Damien Le Moal)
>  - Removed the redundant "active requests" (Laurence Oberman)
>  - Introduced pool-specific starvation tracking
>
> [1]: https://lore.kernel.org/lkml/[email protected]/
>
>  block/blk-mq-tag.c           |  4 ++++
>  include/trace/events/block.h | 43 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 47 insertions(+)
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 33946cdb5716..a6691a4fe7a7 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -13,6 +13,7 @@
>  #include <linux/kmemleak.h>
>
>  #include <linux/delay.h>
> +#include <trace/events/block.h>
>  #include "blk.h"
>  #include "blk-mq.h"
>  #include "blk-mq-sched.h"
> @@ -187,6 +188,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data
> *data)
>  		if (tag != BLK_MQ_NO_TAG)
>  			break;
>
> +		trace_block_rq_tag_wait(data->q, data->hctx,
> +			!!(data->rq_flags & RQF_SCHED_TAGS));

I do not think that the "!!" is needed here: is_sched_tag is declared bool
in TP_PROTO(), so the implicit conversion already yields true or false.
Other than this, this looks OK to me.

Reviewed-by: Damien Le Moal <[email protected]>

> +
>  		bt_prev = bt;
>  		io_schedule();
>
> diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> index 6aa79e2d799c..f7708d0d7a0c 100644
> --- a/include/trace/events/block.h
> +++ b/include/trace/events/block.h
> @@ -226,6 +226,49 @@ DECLARE_EVENT_CLASS(block_rq,
>  		  IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
>  );
>
> +/**
> + * block_rq_tag_wait - triggered when a request is starved of a tag
> + * @q: request queue of the target device
> + * @hctx: hardware context of the request experiencing starvation
> + * @is_sched_tag: indicates whether the starved pool is the software
> scheduler
> + *
> + * Called immediately before the submitting context is forced to block due
> + * to the exhaustion of available tags (i.e., physical hardware driver tags
> + * or software scheduler tags). This trace point indicates that the context
> + * will be placed into an uninterruptible state via io_schedule() until an
> + * active request completes and relinquishes its assigned tag.
> + */
> +TRACE_EVENT(block_rq_tag_wait,
> +
> +	TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx, bool
> is_sched_tag),
> +
> +	TP_ARGS(q, hctx, is_sched_tag),
> +
> +	TP_STRUCT__entry(
> +		__field( dev_t,	dev		)
> +		__field( u32,	hctx_id		)
> +		__field( u32,	nr_tags		)
> +		__field( bool,	is_sched_tag	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev = disk_devt(q->disk);
> +		__entry->hctx_id = hctx->queue_num;
> +		__entry->is_sched_tag = is_sched_tag;
> +
> +		if (__entry->is_sched_tag)
> +			__entry->nr_tags = hctx->sched_tags->nr_tags;
> +		else
> +			__entry->nr_tags = hctx->tags->nr_tags;
> +	),
> +
> +	TP_printk("%d,%d hctx=%u starved on %s tags (depth=%u)",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->hctx_id,
> +		  __entry->is_sched_tag ? "scheduler" : "hardware",
> +		  __entry->nr_tags)
> +);
> +
>  /**
>   * block_rq_insert - insert block operation request into queue
>   * @rq: block IO operation request

-- 
Damien Le Moal
Western Digital Research
