Hi,

In this patch I fixed two issues:

1. A race condition during write operations on snapshots. We now write the
   grain of data first and update the L2 metadata after, so the snapshot
   stays consistent even if the VM is destroyed in the middle of a write.

2. Non-leaf images growing during writes. Assume a snapshot chain
   (Base -> Snap1 -> Snap2 -> ... -> Leaf) and a VM running on the latest
   image of the chain (the leaf image). Non-leaf images in the chain were
   growing during writes (most noticeable when the VM performs aggressive
   writes), which is incorrect behavior according to the VMDK spec. For
   every write to an unallocated offset, the active image must query its
   ancestors for that offset and, if the offset exists in any of them,
   perform a read-from-ancestor / modify / write-to-active of the whole
   grain containing that offset. The bug occurred when the ancestor holding
   the grain was two or more generations above the active (leaf) image
   rather than its direct parent: the modified grain was written into the
   ancestor's direct child instead of the leaf. Fixed by always writing to
   the active (leaf) image.
Regards,
   Igor Lvovsky

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Fabrice Bellard
Sent: Tuesday, January 16, 2007 9:36 PM
To: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] Race condition in VMDK (QCOW*) formats.

Well, it was never said that the QCOW* code was safe if you interrupted
QEMU at some point. But I agree that it could be safer to write the sector
first and update the links after.

It could be interesting to analyze the QCOW2 snapshot handling too (what
happens if QEMU is stopped during the creation of a snapshot?).

Regards,

Fabrice.

Igor Lvovsky wrote:
> Hi all,
>
> I have a doubt about a race condition during the *write operation on
> snapshots*.
>
> I think the problem exists in the VMDK and QCOW* formats (I didn't check
> the others).
>
> The example below is from block_vmdk.c:
>
>     static int vmdk_write(BlockDriverState *bs, int64_t sector_num,
>                           const uint8_t *buf, int nb_sectors)
>     {
>         BDRVVmdkState *s = bs->opaque;
>         int ret, index_in_cluster, n;
>         uint64_t cluster_offset;
>
>         while (nb_sectors > 0) {
>             index_in_cluster = sector_num & (s->cluster_sectors - 1);
>             n = s->cluster_sectors - index_in_cluster;
>             if (n > nb_sectors)
>                 n = nb_sectors;
>             cluster_offset = get_cluster_offset(bs, sector_num << 9, 1);
>             if (!cluster_offset)
>                 return -1;
>             lseek(s->fd, cluster_offset + index_in_cluster * 512, SEEK_SET);
>             ret = write(s->fd, buf, n * 512);
>             if (ret != n * 512)
>                 return -1;
>             nb_sectors -= n;
>             sector_num += n;
>             buf += n * 512;
>         }
>         return 0;
>     }
>
> The get_cluster_offset(...) routine updates the L2 table of the metadata
> and returns the cluster_offset. After that, vmdk_write() actually writes
> the grain in the right place. So we have a timing hole here.
>
> Assume the VM performing the write operation is destroyed at this moment:
> we are left with a corrupted image (with an updated L2 table, but without
> the grain itself).
> Regards,
>    Igor Lvovsky

_______________________________________________
Qemu-devel mailing list
Qemu-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/qemu-devel
block-vmdk.diff