On 4/16/25 21:45, Mikulas Patocka wrote:
> 
> 
> On Tue, 15 Apr 2025, Damien Le Moal wrote:
> 
>>> Hi
>>>
>>> I looked at the generic device mapper code and it seems that ordering of 
>>> write bios is not guaranteed with any target in case of suspend/resume.
>>>
>>> * we suspend the device:
>>> * received bios are added to md->deferred in queue_io
>>>
>>> * we resume the device:
>>> * __dm_resume calls dm_queue_flush
>>> * dm_queue_flush clears DMF_BLOCK_IO_FOR_SUSPEND and submits work item 
>>>   &md->work (dm_wq_work)
>>> * dm_resume clears DMF_SUSPENDED
>>> * the device starts accepting new bios in dm_submit_bio
>>> * dm_wq_work runs concurrently with new bios that are received, so 
>>>   ordering of bios is not preserved
>>>
>>> So it doesn't make much sense to try to fix it in dm-delay, if it isn't 
>>> supposed to work at all.
>>
>> Just need to fix the generic DM resume code then. This patch fixing dm-delay is
>> still relevant even with DM generic resume fixes.
>>
>> I can resend the dm-delay fix together with DM core resume fixes. And Benjamin
>> can re-send the dm-delay kthread timer cleanup independently (I will rebase) or
>> on top of that fix series. Does that work for you?
> 
> I would like to know why is this needed. If you have a zoned device, you 
> can send one big write bio, wait for the big bio to finish, send another 
> big write bio, wait for it to finish and so on. Then, there will be at 
> most one write bio outstanding and you don't have to care about kernel 
> reordering in-flight bios.

Except for the "big" adjective in your remark, what you are describing is zone
write plugging, which limits the number of in-flight write commands to at
most 1 per zone. That is already done for all zoned block devices at the low
level. For DM, we enable zone write plugging if and only if the DM target driver
asks for zone append emulation because the target driver cannot support native
zone append operations. E.g. dm-crypt does that: all zone append
operations are turned into regular writes so that the usual IV for
encryption can be set to the written sector.
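
To illustrate the "at most 1 in-flight write per zone" rule, here is a toy
model in Python. It is only a sketch, not the block layer implementation: the
class name, the method names and the 1024-sector zone size are all made up for
the example, and real zone write plugging operates on bios, not bare sectors.

```python
from collections import defaultdict, deque


class ZoneWritePlugModel:
    """Toy model of zone write plugging (hypothetical, not kernel code)."""

    def __init__(self, zone_sectors):
        self.zone_sectors = zone_sectors
        self.in_flight = {}                # zone number -> sector being written
        self.plugged = defaultdict(deque)  # zone number -> waiting writes, FIFO
        self.issued = []                   # order in which writes reach the device

    def submit_write(self, sector):
        zone = sector // self.zone_sectors
        if zone in self.in_flight:
            # A write is already executing in this zone: hold this one back.
            self.plugged[zone].append(sector)
        else:
            self._issue(zone, sector)

    def complete_write(self, zone):
        # The in-flight write finished: issue the next plugged write, if any.
        del self.in_flight[zone]
        if self.plugged[zone]:
            self._issue(zone, self.plugged[zone].popleft())

    def _issue(self, zone, sector):
        self.in_flight[zone] = sector
        self.issued.append(sector)


model = ZoneWritePlugModel(zone_sectors=1024)
for sector in (0, 8, 16, 1024):
    model.submit_write(sector)
```

In this model, the writes to sectors 8 and 16 stay plugged behind the write to
sector 0, while the write to sector 1024 goes out immediately since it targets
a different zone. Each call to complete_write(0) then releases the next plugged
write in arrival order.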

As for "send one big write bio", the "big" here completely depends on the device
user. We can only process what the user issues (FS or userland). The block layer
does not do write buffering.

> It seems that you want to send many small overlapping write bios - the 
> question is why? Why can't the application accumulate the content and send 
> it as one big bio?

That is the application's problem. On HDDs at least, small IOs will hurt
performance, SMR or not. Intelligent applications will try to
shape their workload to optimize performance. But that point is irrelevant here.
The kernel provides a service: processing write requests. Regardless of how big
these requests are, if they are correct (i.e. for zoned devices, they are
issued in order by the user), we must correctly execute the writes.
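
As a rough illustration of why the order matters, here is a toy Python model of
a sequential write required zone, where a write is only accepted if it starts
at the zone write pointer. This is a sketch under assumptions: the class, the
run() helper and the zone geometry are hypothetical, and a real device reports
an I/O error rather than returning False.

```python
class SequentialZone:
    """Toy model of a sequential write required zone (hypothetical names)."""

    def __init__(self, start, nr_sectors):
        self.start = start
        self.end = start + nr_sectors
        self.write_pointer = start

    def write(self, sector, nr_sectors):
        # A write must start exactly at the write pointer and fit in the zone.
        if sector != self.write_pointer or sector + nr_sectors > self.end:
            return False
        self.write_pointer += nr_sectors
        return True


def run(writes):
    """Apply (sector, nr_sectors) writes to a fresh zone, in the given order."""
    zone = SequentialZone(start=0, nr_sectors=1024)
    return [zone.write(sector, n) for sector, n in writes]


# The user issues three sequential 8-sector writes.
in_order = run([(0, 8), (8, 8), (16, 8)])

# The same writes, with the last two swapped somewhere on the way down.
reordered = run([(0, 8), (16, 8), (8, 8)])
```

With the writes applied in the order the user issued them, all three succeed.
With the last two swapped, the write to sector 16 is rejected because the write
pointer is still at sector 8, even though the user did nothing wrong.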

> I'm a bit worried that supporting this ordering will just bloat the kernel 
> with marginal benefit.

Bloat? Everything needed to preserve the order of write operations to zoned
devices is already in place, and has been for a long time. What has not been
covered are cases like suspend/resume which may, depending on what they do,
break the ordering guarantees that we have for write requests. The only reason
this has not been fixed is that I completely overlooked these cases, as zoned
block devices were in the past mostly used in enterprise systems where
suspend/resume is not really used at all. But we have zoned UFS devices these
days (smartphones), so properly supporting DM suspend/resume is important I think.



-- 
Damien Le Moal
Western Digital Research
