On Fri, Oct 10, 2025 at 12:19:51PM +0200, Martin Wilck wrote: > On Thu, 2025-10-09 at 18:25 -0400, Benjamin Marzinski wrote: > > On Thu, Oct 09, 2025 at 11:57:08AM +0200, Martin Wilck wrote: > > > On Wed, 2025-10-08 at 23:04 -0400, Benjamin Marzinski wrote: > > > > Request-based devices (dm-multipath) queue I/O in blk-mq on > > > > noflush > > > > suspends. Any queued IO will make it impossible to freeze the > > > > queue. > > > > If > > > > a process attempts to update the queue limits while there is > > > > queued > > > > IO, > > > > it can be get stuck holding the limits lock, while unable to > > > > freeze > > > > the > > > > queue. If device-mapper then attempts to update the limits during > > > > a > > > > table swap, it will deadlock trying to grab the limits lock while > > > > making > > > > it impossible to flush the IO. > > > > > > > > Disallow updating the queue limits during a table swap, when > > > > updating > > > > an > > > > immutable request-based dm device (dm-multipath) during a noflush > > > > suspend. It is userspace's responsibility to make sure that the > > > > new > > > > table uses the same limits as the existing table if it asks for a > > > > noflush suspend. > > > > > > > > Signed-off-by: Benjamin Marzinski <[email protected]> > > > > --- > > > > drivers/md/dm-table.c | 4 ++++ > > > > drivers/md/dm-thin.c | 7 ++----- > > > > drivers/md/dm.c | 35 +++++++++++++++++++++++------------ > > > > 3 files changed, 29 insertions(+), 17 deletions(-) > > > > > > > > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c > > > > index ad0a60a07b93..0522cd700e0e 100644 > > > > --- a/drivers/md/dm-table.c > > > > +++ b/drivers/md/dm-table.c > > > > @@ -2043,6 +2043,10 @@ bool dm_table_supports_size_change(struct > > > > dm_table *t, sector_t old_size, > > > > return true; > > > > } > > > > > > > > +/* > > > > + * This function will be skipped by noflush reloads of immutable > > > > request > > > > + * based devices (dm-mpath). > > > > + */ > > > > int dm_table_set_restrictions(struct dm_table *t, struct > > > > request_queue *q, > > > > struct queue_limits *limits) > > > > { > > > > diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c > > > > index c84149ba4e38..6f98936f0e05 100644 > > > > --- a/drivers/md/dm-thin.c > > > > +++ b/drivers/md/dm-thin.c > > > > @@ -4383,11 +4383,8 @@ static void thin_postsuspend(struct > > > > dm_target > > > > *ti) > > > > { > > > > struct thin_c *tc = ti->private; > > > > > > > > - /* > > > > - * The dm_noflush_suspending flag has been cleared by > > > > now, > > > > so > > > > - * unfortunately we must always run this. > > > > - */ > > > > - noflush_work(tc, do_noflush_stop); > > > > + if (dm_noflush_suspending(ti)) > > > > + noflush_work(tc, do_noflush_stop); > > > > } > > > > > > > > static int thin_preresume(struct dm_target *ti) > > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c > > > > index 3ede5ba4cf7e..87e65c48dd04 100644 > > > > --- a/drivers/md/dm.c > > > > +++ b/drivers/md/dm.c > > > > @@ -2439,7 +2439,6 @@ static struct dm_table *__bind(struct > > > > mapped_device *md, struct dm_table *t, > > > > { > > > > struct dm_table *old_map; > > > > sector_t size, old_size; > > > > - int ret; > > > > > > > > lockdep_assert_held(&md->suspend_lock); > > > > > > > > @@ -2454,11 +2453,13 @@ static struct dm_table *__bind(struct > > > > mapped_device *md, struct dm_table *t, > > > > > > > > set_capacity(md->disk, size); > > > > > > > > - ret = dm_table_set_restrictions(t, md->queue, limits); > > > > - if (ret) { > > > > - set_capacity(md->disk, old_size); > > > > - old_map = ERR_PTR(ret); > > > > - goto out; > > > > + if (limits) { > > > > + int ret = dm_table_set_restrictions(t, md- > > > > >queue, > > > > limits); > > > > + if (ret) { > > > > + set_capacity(md->disk, old_size); > > > > + old_map = ERR_PTR(ret); > > > > + goto out; > > > > + } > > > > } > > > > > > > > /* > > > > @@ -2836,6 +2837,7 @@ static void dm_wq_work(struct work_struct > > > > *work) > > > > > > > > static void dm_queue_flush(struct mapped_device *md) > > > > { > > > > + clear_bit(DMF_NOFLUSH_SUSPENDING, &md->flags); > > > > clear_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags); > > > > smp_mb__after_atomic(); > > > > queue_work(md->wq, &md->work); > > > > @@ -2848,6 +2850,7 @@ struct dm_table *dm_swap_table(struct > > > > mapped_device *md, struct dm_table *table) > > > > { > > > > struct dm_table *live_map = NULL, *map = ERR_PTR(- > > > > EINVAL); > > > > struct queue_limits limits; > > > > + bool update_limits = true; > > > > int r; > > > > > > > > mutex_lock(&md->suspend_lock); > > > > @@ -2856,20 +2859,31 @@ struct dm_table *dm_swap_table(struct > > > > mapped_device *md, struct dm_table *table) > > > > if (!dm_suspended_md(md)) > > > > goto out; > > > > > > > > + /* > > > > + * To avoid a potential deadlock locking the queue > > > > limits, > > > > disallow > > > > + * updating the queue limits during a table swap, when > > > > updating an > > > > + * immutable request-based dm device (dm-multipath) > > > > during a > > > > noflush > > > > + * suspend. It is userspace's responsibility to make > > > > sure > > > > that the new > > > > + * table uses the same limits as the existing table, if > > > > it > > > > asks for a > > > > + * noflush suspend. > > > > + */ > > > > + if (dm_request_based(md) && md->immutable_target && > > > > + __noflush_suspending(md)) > > > > + update_limits = false; > > > > /* > > > > * If the new table has no data devices, retain the > > > > existing > > > > limits. > > > > * This helps multipath with queue_if_no_path if all > > > > paths > > > > disappear, > > > > * then new I/O is queued based on these limits, and > > > > then > > > > some paths > > > > * reappear. > > > > */ > > > > - if (dm_table_has_no_data_devices(table)) { > > > > + else if (dm_table_has_no_data_devices(table)) { > > > > live_map = dm_get_live_table_fast(md); > > > > if (live_map) > > > > limits = md->queue->limits; > > > > dm_put_live_table_fast(md); > > > > } > > > > > > > > - if (!live_map) { > > > > + if (update_limits && !live_map) { > > > > r = dm_calculate_queue_limits(table, &limits); > > > > if (r) { > > > > map = ERR_PTR(r); > > > > @@ -2877,7 +2891,7 @@ struct dm_table *dm_swap_table(struct > > > > mapped_device *md, struct dm_table *table) > > > > } > > > > } > > > > > > > > - map = __bind(md, table, &limits); > > > > + map = __bind(md, table, update_limits ? &limits : NULL); > > > > dm_issue_global_event(); > > > > > > > > out: > > > > @@ -2930,7 +2944,6 @@ static int __dm_suspend(struct > > > > mapped_device > > > > *md, struct dm_table *map, > > > > > > > > /* > > > > * DMF_NOFLUSH_SUSPENDING must be set before presuspend. > > > > - * This flag is cleared before dm_suspend returns. > > > > */ > > > > if (noflush) > > > > set_bit(DMF_NOFLUSH_SUSPENDING, &md->flags); > > > > @@ -2993,8 +3006,6 @@ static int __dm_suspend(struct > > > > mapped_device > > > > *md, struct dm_table *map, > > > > if (!r) > > > > set_bit(dmf_suspended_flag, &md->flags); > > > > > > > > - if (noflush) > > > > - clear_bit(DMF_NOFLUSH_SUSPENDING, &md->flags); > > > > if (map) > > > > synchronize_srcu(&md->io_barrier); > > > > > > This moves the clear_bit() behind synchronize_rcu(). Is that safe? > > > > It's not just moved behind the synchronize_rcu(). > > DMF_NOFLUSH_SUSPENDING > > is now set for the entire time the device is suspended. If people > > would > > like, I'll update the patch to rename it to DMF_NOFLUSH_SUSPEND. > > Right. I realize now that there's a smp_mb__after_atomic() in > dm_queue_flush(). > > > I did check to see if holding it for the entire suspend would cause > > issues, but I didn't see any case where it would. If I missed a > > case where __noflush_suspending() should only return true if we are > > actually in the process of suspending, I can easily update that > > function to do that. > > If this is necessary, I agree that the flag an related function should > be renamed. But there are already generic DM flags to indicate that a > queue is suspend*ed*. Perhaps, instead of changing the semantics of > DMF_NOFLUSH_SUSPENDING, it would make more sense to test > > (__noflush_suspending || test_bit(DMF_BLOCK_IO_FOR_SUSPEND) > > in dm_swap_table()?
Won't we ALWAYS be suspended when we are in dm_swap_table()? We do need to refresh the limits in some cases (the cases where multipath-tools currently reloads the table without setting noflush). What we need to know is "is this table swap happening in a noflush suspend, where userspace understands that it can't modify the device table in a way that would change the limits". For multipath, this is almost always the case. > > Anyway, I am still not convinced that we want to have this specific > exception for multipath only. I agree that Bart's solution looks better to me, if it can get in. > > > In general, I'm wondering whether we need a more generic solution > > > to > > > this problem. Therefore I've added linux-block to cc. > > > > > > The way I see it, if a device has queued IO without any means to > > > perform the IO, it can't be frozen. We'd either need to fail all > > > queued > > > IO in this case, or refuse attempts to freeze the queue. > > > > In general, it's perfectly normal to start freezing the queue while > > there are still outstanding requests. > > Sure, but my point was "without any means to perform the IO". As you > pointed out yourself, a freeze attempt in this situation would get > stuck. > > I find Bart's approach very attractive; freezing might not be necessary > at all in that case. We'dd just need to avoid a race where paths get > reinstated while the operation that would normally have required a > freeze is ongoing. I agree. Even just the timing out of freezes, his "[PATCH 2/3] block: Restrict the duration of sysfs attribute changes" would be enough to keep this from deadlocking the system. -Ben > Martin
