On Mon, Apr 14, 2025 at 03:52:18PM +0200, Mikulas Patocka wrote:
> 
> 
> On Fri, 11 Apr 2025, Benjamin Marzinski wrote:
> 
> > When using a kthread to delay the IOs, dm-delay would continuously loop,
> > checking if IOs were ready to submit. It had a cond_resched() call in
> > the loop, but might still loop hundreds of millions of times waiting for
> > an IO that was scheduled to be submitted 10s of ms in the future. With
> > the change to make dm-delay over zoned devices always use kthreads
> > regardless of the length of the delay, this wasted work only gets worse.
> > 
> > To solve this and still keep roughly the same precision for very short
> > delays, dm-delay now calls fsleep() for 1/8th of the smallest non-zero
> > delay it will place on IOs, or 1 ms, whichever is smaller. The reason
> > that dm-delay doesn't just use the actual expiration time of the next
> > delayed IO to calculated the sleep time is that delay_dtr() must wait
> > for the kthread to finish before deleting the table. If a zoned device
> > with a long delay queued an IO shortly before being suspended and
> > removed, the IO would be flushed in delay_presuspend(), but the removing
> > the device would still have to wait for the remainder of the long delay.
> > This time is now capped at 1 ms.
> > 
> > Fixes: 70bbeb29fab09 ("dm delay: for short delays, use kthread instead of 
> > timers and wq")
> > Signed-off-by: Benjamin Marzinski <bmarz...@redhat.com>
> > ---
> > This patch is meant to apply on top of Damien Le Moal's "dm-delay:
> > Prevent zoned write reordering on suspend" patch. If people think it's
> > important to avoid either this much smaller amount of looping or the
> > possible 1 ms delay on deleting a table, I can send a patch that uses
> > usleep_range_state() and msleep_interruptible() to do an interruptible
> > sleep with a duration based on the expiration time of the next delayed
> > IO.
> 
> Hi
> 
> worker_sleep_ns should be worker_sleep_us - as the value is in 
> microseconds.

Oops.

> fsleep in flush_worker_fn should be called unconditionally, to not consume 
> 100% CPU when suspending.

This is fine, but flush_worker_fn() won't busy-wait while suspending.
Since it is calling flush_delayed_bios() with flush_all equal to true,
it will handle all the bios on the list. As long as bios keep getting
added to the list while it's flushing the last batch, it will keep
looping to flush them. But it will be doing necessary work on each loop,
and flush_delayed_bios() has a cond_resched(), so even if there are a
flood of bios, it will still take breaks. As soon as bios stop
continuously arriving, flush_worker_fn() will see the empty list and the
kthread will stop.

Unconditionally sleeping here makes it more likely that dm-delay will
end up sleeping unnecessarily while a there are just a few remaining
bios on the list. Like I said in my commit message, this sleep will be
capped at 1 ms, so it's not that big of a deal. But it's a trade-off.
I'm o.k. with your version. I just not sure that it is the better
trade-off. Perhaps I'm overlooking something.

> cond_resched() shouldn't be removed because fsleep may fall back to 
> udelay.

Again, your version is fine, but I'm not sure that cond_resched() was
ever necessary, since there already is one in flush_delayed_bios().
Also, at least the way it's currently coded, fsleep() will only resort
to busy-waiting when the delay is 10 us or less, and the shortest it can
be with this code is 62 us, so I don't think this cond_resched() will
ever do anything.

> The patch should increase target version.
> 
> I fixed the patch so that it applies on the top Linus' tree and applied 
> it to the linux-dm tree.
> 
> BTW. do we need to backport this to the stable kernels? I think not, but 
> if you have some reason why should we backport it, explain it.

dm-delay is basically a testing target, so I agree that it seems
unnecessary to backport this.

-Ben

> 
> Mikulas


Reply via email to