Hi Damien,

On Mon, Sep 23, 2024 at 15:22, Damien Le Moal <dlem...@kernel.org> wrote:
>
> On 2024/09/23 13:06, Sam Li wrote:
> > [...]
>
> >>> @@ -2837,6 +3180,19 @@ qcow2_co_pwritev_part(BlockDriverState *bs, int64_t offset, int64_t bytes,
> >>>          qiov_offset += cur_bytes;
> >>>          trace_qcow2_writev_done_part(qemu_coroutine_self(), cur_bytes);
> >>>      }
> >>> +
> >>> +    if (bs->bl.zoned == BLK_Z_HM) {
> >>> +        index = start_offset / zone_size;
> >>> +        wp = &bs->wps->wp[index];
> >>> +        if (!QCOW2_ZT_IS_CONV(*wp)) {
> >>> +            /* Advance the write pointer when the write completes */
> >>
> >> Updating the write pointer after I/O does not prevent other write
> >> requests from beginning at the same offset as this request. Multiple
> >> write request coroutines can run concurrently and only the first one
> >> should succeed. The others should fail if they are using the same
> >> offset.
> >>
> >> The comment above says "Real drives change states before it can write to
> >> the zone" and I think it's appropriate to update the write pointer
> >> before performing the write too. The qcow2 zone emulation code is
> >> different from the file-posix.c passthrough code. We are responsible for
> >> maintaining zoned metadata state and cannot wait for the result of the
> >> I/O to tell us what happened.
>
> Yes, correct. The wp MUST be updated when issuing the IO, with the
> assumption that the write IO will succeed (errors are rare!).
>
> > The problem of updating the write pointer before IO completion is the
> > failure case. It can't be predicted in advance if an IO fails or not.
> > When write I/O fails, the wp should not be updated.
>
> Correct, if an IO fails, the wp should not be updated. However, that is
> not difficult to deal with:
> 1) Under the zone lock, advance the wp position when issuing the write IO.
> 2) When the write IO completes with success, nothing else needs to be done.
> 3) When *any* write IO completes with error you need to:
>    - Lock the zone
>    - Do a report zone for the target zone of the failed write to get the
>      current wp location
>    - Update bs->wps->wp[index] using that current wp location
>    - Unlock the zone
>
> With that, one may get a few errors if multiple async writes are being
> issued, but that behavior is consistent with the same happening with a
> real drive. So no issue. And since the report zones gets you the current
> wp location, the user can restart writing from that location once it has
> dealt with all the previous write failures.

I see. To allow concurrent writes, the lock will then only be taken on the
failure path while processing append writes.
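To make sure I understand, something like the untested sketch below is
what I have in mind for the pwritev path. qcow2_report_zone_wp() is a
made-up placeholder for the "report zone" step (re-read the wp of zone
'index' from the zone metadata and store it back into bs->wps->wp[index]),
and I am assuming we reuse the colock CoMutex that BlockZoneWps already
carries for file-posix. The comparisons with *wp are also simplified,
since the real entries encode the zone type bits as well.

    if (bs->bl.zoned == BLK_Z_HM) {
        index = start_offset / zone_size;
        wp = &bs->wps->wp[index];
        if (!QCOW2_ZT_IS_CONV(*wp)) {
            qemu_co_mutex_lock(&bs->wps->colock);
            if (start_offset != *wp) {
                /* Another in-flight write already claimed this offset */
                qemu_co_mutex_unlock(&bs->wps->colock);
                return -EINVAL;
            }
            *wp += bytes;                      /* 1) advance when issuing */
            qemu_co_mutex_unlock(&bs->wps->colock);
        }
    }

    /* ... the existing cluster write loop runs here, setting ret ... */

    if (ret < 0 && bs->bl.zoned == BLK_Z_HM && !QCOW2_ZT_IS_CONV(*wp)) {
        /* 3) failure path: restore the wp from the zone metadata */
        qemu_co_mutex_lock(&bs->wps->colock);
        qcow2_report_zone_wp(bs, index);       /* hypothetical helper */
        qemu_co_mutex_unlock(&bs->wps->colock);
    }
    /* 2) success path: nothing else to do */

That way the lock only covers the wp check/advance when the write is
issued and, on errors, the metadata re-read, instead of being held across
the whole I/O.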
>
> > The alternative way is to hold the wps lock as is also required for wp
> > accessing. Therefore only one of multiple concurrent write requests
> > will succeed.
>
> That is a very simple solution that avoids the above error recovery, but
> that would be very bad for performance (especially for a pure sequential
> write workload as we would limit IOs to queue depth 1). So if we can
> avoid this simple approach, that would be a lot better.

Yeah, I'll drop this approach. It reminds me, though, of how the
file-posix driver emulates zone_append: it holds the lock whenever
accessing wps. Does that limit IOs to QD 1 too? If so, it can be improved.

-- one zone_append starts
   -> wp_lock()
      -> IO processing
         -> wp_update
            -> wp_unlock()
-- ends

https://github.com/qemu/qemu/blob/master/block/file-posix.c#L2492

Sam

>
> --
> Damien Le Moal
> Western Digital Research