On 4/10/25 1:33 PM, Damien Le Moal wrote: > When a dm-delay devie is being suspended, the .presuspend() operation is > first executed (delay_presuspend()) to immediately issue all the BIOs > present in the delayed list of the device and also sets the device > may_delay boolean to false. At the same time, any new BIO is issued to > the device will not be delayed and immediately issued with delay_bio() > returning DM_MAPIO_REMAPPED. This creates a situation where potentially > 2 different contexts may be issuing write BIOs to the same zone of a > zone device without respecting the issuing order from the user, that is, > a newly issued write BIO may be issued before other write BIOs for the > same target zone that are in the device delayed list. If such situation > occurs, write BIOs may be failed by the underlying zoned device due to > an unaligned write error. > > Prevent this situation from happening by always handling newly issued > write BIOs using the delayed list of BIOs, even when the device is being > suspended. This is done by forcing the use of the worker kthread for > zoned devices, and by modifying flush_worker_fn() to always flush all > delayed BIOs if the device may_delay boolean is false. > > Fixes: d43929ef65a6 ("dm-delay: support zoned devices") > Signed-off-by: Damien Le Moal <dlem...@kernel.org>
I probably should also add: Reported-by: Benjamin Marzinski <bmarz...@redhat.com> > --- > drivers/md/dm-delay.c | 29 +++++++++++++++++++++-------- > 1 file changed, 21 insertions(+), 8 deletions(-) > > diff --git a/drivers/md/dm-delay.c b/drivers/md/dm-delay.c > index d4cf0ac2a7aa..c665b2ab1115 100644 > --- a/drivers/md/dm-delay.c > +++ b/drivers/md/dm-delay.c > @@ -128,7 +128,7 @@ static int flush_worker_fn(void *data) > struct delay_c *dc = data; > > while (!kthread_should_stop()) { > - flush_delayed_bios(dc, false); > + flush_delayed_bios(dc, !dc->may_delay); > spin_lock(&dc->delayed_bios_lock); > if (unlikely(list_empty(&dc->delayed_bios))) { > set_current_state(TASK_INTERRUPTIBLE); > @@ -213,6 +213,7 @@ static int delay_ctr(struct dm_target *ti, unsigned int > argc, char **argv) > struct delay_c *dc; > int ret; > unsigned int max_delay; > + bool is_zoned = false; > > if (argc != 3 && argc != 6 && argc != 9) { > ti->error = "Requires exactly 3, 6 or 9 arguments"; > @@ -236,6 +237,7 @@ static int delay_ctr(struct dm_target *ti, unsigned int > argc, char **argv) > if (ret) > goto bad; > max_delay = dc->read.delay; > + is_zoned = bdev_is_zoned(dc->read.dev->bdev); > > if (argc == 3) { > ret = delay_class_ctr(ti, &dc->write, argv); > @@ -251,6 +253,7 @@ static int delay_ctr(struct dm_target *ti, unsigned int > argc, char **argv) > if (ret) > goto bad; > max_delay = max(max_delay, dc->write.delay); > + is_zoned = is_zoned || bdev_is_zoned(dc->write.dev->bdev); > > if (argc == 6) { > ret = delay_class_ctr(ti, &dc->flush, argv + 3); > @@ -263,13 +266,16 @@ static int delay_ctr(struct dm_target *ti, unsigned int > argc, char **argv) > if (ret) > goto bad; > max_delay = max(max_delay, dc->flush.delay); > + is_zoned = is_zoned || bdev_is_zoned(dc->flush.dev->bdev); > > out: > - if (max_delay < 50) { > - /* > - * In case of small requested delays, use kthread instead of > - * timers and workqueue to achieve better latency. > - */ > + /* > + * In case of small requested delays, use kthread instead of timers and > + * workqueue to achieve better latency. Also use a kthread for a zoned > + * device so that we can preserve the order of write operations during > + * suspend. > + */ > + if (max_delay < 50 || is_zoned) { > dc->worker = kthread_run(&flush_worker_fn, dc, > "dm-delay-flush-worker"); > if (IS_ERR(dc->worker)) { > ret = PTR_ERR(dc->worker); > @@ -313,8 +319,15 @@ static int delay_bio(struct delay_c *dc, struct > delay_class *c, struct bio *bio) > > spin_lock(&dc->delayed_bios_lock); > if (unlikely(!dc->may_delay)) { > - spin_unlock(&dc->delayed_bios_lock); > - return DM_MAPIO_REMAPPED; > + /* > + * Issue the BIO immediately if the device is not zoned. FOr a > + * zoned device, preserver the ordering of write operations by > + * using the delay list. > + */ > + if (!bdev_is_zoned(c->dev->bdev) || c != &dc->write) { > + spin_unlock(&dc->delayed_bios_lock); > + return DM_MAPIO_REMAPPED; > + } > } > c->ops++; > list_add_tail(&delayed->list, &dc->delayed_bios); -- Damien Le Moal Western Digital Research