Re: [PATCH v6 2/2] blk-mq: expose tag starvation counts via debugfs

Aaron Tomlin Wed, 20 May 2026 19:24:42 -0700

On Mon, May 18, 2026 at 09:14:49AM +0100, John Garry wrote:
> On 17/05/2026 22:36, Aaron Tomlin wrote:
> > In high-performance storage environments, particularly when utilising
> > RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
> > latency spikes can occur when fast devices are starved of available
> > tags.
> > 
> > This patch introduces two new debugfs attributes for each block
> > hardware queue:
> >    - /sys/kernel/debug/block/[device]/hctxN/wait_on_hw_tag
> >    - /sys/kernel/debug/block/[device]/hctxN/wait_on_sched_tag
> 
> How would these counters be used? You are just saying that we may have
> performance latency spikes and so here are two new counters.


[ ... ]

> > These files expose atomic counters that increment each time a submitting
> > context is forced into an uninterruptible sleep via io_schedule() due to
> > the complete exhaustion of physical driver tags or software scheduler
> > tags, respectively.
> > 
> > To ensure negligible performance overhead even in production
> > environments where CONFIG_BLK_DEBUG_FS is actively enabled, this
> > tracking logic utilises dynamically allocated per-CPU counters. When
> > this configuration is disabled, the tracking logic compiles down to a
> > safe no-op.
> 
> How does one normalise the values which are measured? I mean, during a
> period of high contention, we may get a bunch of threads waiting for a
> driver tag and the value in wait_on_hw_tag may jump considerably - how do
> you normalize this value in wait_on_hw_tag for meaningful analysis?

Hi John,

Thanks for the review.

Based on feedback from Jens regarding this series [1], I am actually going to
drop Patch 2 entirely in v7 in favour of a tracepoint.

To answer your questions in the context of the new tracepoint approach:

Storage engineers can use bpftrace(8) to hook the newly proposed
"block_rq_tag_wait" tracepoint. This allows us to track tag starvation
dynamically, filter it by specific devices, or even group it by the
specific process (comm) experiencing the latency spike.

Moving this to userspace completely solves the normalization problem you
highlighted. For example, using bpftrace, userspace can hook both the tag
starvation event (i.e., block_rq_tag_wait) and the standard block issue
event (i.e., block_rq_issue) over a defined time window (e.g., 5 seconds).

Userspace can then divide the waits by the total issues to get a
meaningful, normalized starvation ratio (e.g., "4% of I/Os were starved of
hardware tags in the last 5 seconds"). This is far more useful for analysis
than a contextless debugfs integer jumping by an arbitrary amount.
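
As a rough illustration of that arithmetic (not part of the series), here is
a minimal Python sketch of the userspace side. It assumes the events have
already been collected, e.g. from bpftrace output, as (timestamp, event name)
pairs; that input format and the helper name starvation_ratios() are
hypothetical, and "block_rq_tag_wait" is only the proposed tracepoint name.

```python
from collections import defaultdict

WINDOW_SECONDS = 5.0  # window length from the example above


def starvation_ratios(events, window=WINDOW_SECONDS):
    """Return {window_index: waits / issues} for each time window.

    `events` is an iterable of (timestamp_seconds, event_name) tuples.
    This input format is an assumption made for illustration only.
    Windows that saw waits but no issues map to None (ratio undefined).
    """
    waits = defaultdict(int)
    issues = defaultdict(int)

    for ts, name in events:
        idx = int(ts // window)  # bucket the event into its window
        if name == "block_rq_tag_wait":  # proposed tracepoint, may change
            waits[idx] += 1
        elif name == "block_rq_issue":  # existing block issue tracepoint
            issues[idx] += 1

    ratios = {}
    for idx in sorted(set(waits) | set(issues)):
        ratios[idx] = waits[idx] / issues[idx] if issues[idx] else None
    return ratios


if __name__ == "__main__":
    # 100 issues and 4 tag waits in the first 5 seconds, 10 issues after.
    sample = (
        [(1.0, "block_rq_issue")] * 100
        + [(2.0, "block_rq_tag_wait")] * 4
        + [(6.0, "block_rq_issue")] * 10
    )
    for idx, ratio in starvation_ratios(sample).items():
        print(idx, ratio)
```

Keeping waits and issues in separate per-window buckets means the same data
can also be keyed by device or comm if finer grouping is wanted.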

Thanks again for taking a look. I'll make sure to Cc you on v7.

[1]: https://lore.kernel.org/lkml/t47wegcgfc43nimo4vqfdedqih43ydfietb7tsaobeitxgdhxs@6lnzvbj5rhab/


Kind regards,
-- 
Aaron Tomlin
