On 12/19/25 19:08, Denis V. Lunev wrote:
qcow2_update_header() is the final call during image close. This means
that after this point we will have no more I/O operations on this file
descriptor. This assumption has been validated with 'strace', which
clearly confirms that this is the very last write and that there is no
sync after it. This is almost harmless when the image resides on a
local filesystem, except that the image will need a check if the change
does not reach the disk before a power loss. With a network or
distributed filesystem, however, we are in trouble: the change could
still be in flight, and another node could miss this data, e.g. during
migration.
The patch adds BDRV_REQ_FUA to the write request to fix this.
Signed-off-by: Denis V. Lunev <[email protected]>
CC: Kevin Wolf <[email protected]>
CC: Hanna Reitz <[email protected]>
---
block/qcow2.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/qcow2.c b/block/qcow2.c
index e29810d86a..abbc4e82ba 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -3252,7 +3252,7 @@ int qcow2_update_header(BlockDriverState *bs)
}
/* Write the new header */
- ret = bdrv_pwrite(bs->file, 0, s->cluster_size, header, 0);
+ ret = bdrv_pwrite(bs->file, 0, s->cluster_size, header, BDRV_REQ_FUA);
if (ret < 0) {
goto fail;
}
Thanks a lot to Andrey Drobyshev, who has attempted to review
and test this patch.
In reality the patch is non-working, and the problem is even
more severe than it appears: BDRV_REQ_FUA processing is broken in
mainstream QEMU. The generic write path ends in bdrv_aligned_pwritev():

static int coroutine_fn GRAPH_RDLOCK
bdrv_aligned_pwritev(BdrvChild *child, BdrvTrackedRequest *req,
                     int64_t offset, int64_t bytes, int64_t align,
                     QEMUIOVector *qiov, size_t qiov_offset,
                     BdrvRequestFlags flags)
{
    ....
    if (ret < 0) {
        /* Do nothing, write notifier decided to fail this request */
    } else if (flags & BDRV_REQ_ZERO_WRITE) {
        bdrv_co_debug_event(bs, BLKDBG_PWRITEV_ZERO);
        ret = bdrv_co_do_pwrite_zeroes(bs, offset, bytes, flags);
    } else if (flags & BDRV_REQ_WRITE_COMPRESSED) {
        ret = bdrv_driver_pwritev_compressed(bs, offset, bytes,
                                             qiov, qiov_offset);
    } else if (bytes <= max_transfer) {
        bdrv_co_debug_event(bs, BLKDBG_PWRITEV);
        ret = bdrv_driver_pwritev(bs, offset, bytes, qiov,
                                  qiov_offset, flags);
    } else {
        bdrv_co_debug_event(bs, BLKDBG_PWRITEV);
        while (bytes_remaining) {
            int num = MIN(bytes_remaining, max_transfer);
            int local_flags = flags;

            assert(num);
            if (num < bytes_remaining && (flags & BDRV_REQ_FUA) &&
                !(bs->supported_write_flags & BDRV_REQ_FUA)) {
                /* If FUA is going to be emulated by flush, we only
                 * need to flush on the last iteration */
                local_flags &= ~BDRV_REQ_FUA;
            }

            ret = bdrv_driver_pwritev(bs, offset + bytes - bytes_remaining,
                                      num, qiov,
                                      qiov_offset + bytes - bytes_remaining,
                                      local_flags);
            if (ret < 0) {
                break;
            }
            bytes_remaining -= num;
        }
    }
    bdrv_co_debug_event(bs, BLKDBG_PWRITEV_DONE);

    if (ret >= 0) {
        ret = 0;
    }
    bdrv_co_write_req_finish(child, offset, bytes, req, ret);

BDRV_REQ_FUA is processed inside bdrv_driver_pwritev(), and the
resulting bdrv_co_flush() is a __NO-OP (!)__, because it runs before
bdrv_co_write_req_finish() is called, which performs
qatomic_inc(&bs->write_gen). Only the next flush will really flush
the data.
This means that the flush semantics are broken inside QEMU, and this
problem looks serious to me. I'll come up with a patch.

Den