On Wed, Apr 22, 2015 at 04:16:38PM +0300, Denis V. Lunev wrote: > On 22/04/15 16:08, Stefan Hajnoczi wrote: > >On Wed, Mar 11, 2015 at 01:28:02PM +0300, Denis V. Lunev wrote: > >>+static int64_t allocate_cluster(BlockDriverState *bs, int64_t sector_num) > >>+{ > >>+ BDRVParallelsState *s = bs->opaque; > >>+ uint32_t idx, offset, tmp; > >>+ int64_t pos; > >>+ int ret; > >>+ > >>+ idx = sector_num / s->tracks; > >>+ offset = sector_num % s->tracks; > >>+ > >>+ if (idx >= s->catalog_size) { > >>+ return -EINVAL; > >>+ } > >>+ if (s->catalog_bitmap[idx] != 0) { > >>+ return (uint64_t)s->catalog_bitmap[idx] * s->off_multiplier + > >>offset; > >>+ } > >>+ > >>+ pos = bdrv_getlength(bs->file) >> BDRV_SECTOR_BITS; > >>+ bdrv_truncate(bs->file, (pos + s->tracks) << BDRV_SECTOR_BITS); > >>+ s->catalog_bitmap[idx] = pos / s->off_multiplier; > >>+ > >>+ tmp = cpu_to_le32(s->catalog_bitmap[idx]); > >>+ > >>+ ret = bdrv_pwrite_sync(bs->file, > >>+ sizeof(ParallelsHeader) + idx * sizeof(tmp), &tmp, > >>sizeof(tmp)); > >What is the purpose of the sync? > This is necessary to preserve image consistency on crash from > my point of view. There is no check consistency at the moment. > The sync will be removed later when proper crash detection > code will be added (patches 19, 20, 21)
Let's look at possible orderings in case of failure: A. BAT update B. Data write This sync enforces A, B ordering. If we can see B, then A must also have happened thanks to the sync. But A, B ordering is too conservative. Imagine B, A ordering and the failure where we crash before A. It means we wrote the data but never linked it into the BAT. What happens in that case? We've leaked a cluster in the underlying image file but it doesn't corrupt the visible disk from the guest point-of-view. Because your implementation uses truncate to extend the file size before A, even the A, B failure case results in a leaked cluster. So the B, A case is not worse in any way. Why do other image formats sync cluster allocation updates? Because they support backing files and in that case an A, B ordering results in data corruption so they enforce B, A ordering (the opposite of what you're trying to do!). The reason why A, B ordering results in data corruption when backing files are in use is because the guest's write request might touch only a subset of the cluster (a couple of sectors out of the whole cluster). So the guest needs to copy the remaining sectors from the backing file. If there is a dangling BAT entry like in the A, B failure case, then the guest will see a zeroed cluster instead of the contents of the backing file. This is a data corruption, but only if a backing file is being used! So the sync is not necessary, both A, B and B, A ordering work for block/parallels.c. Stefan
pgpdwKmpMj9YY.pgp
Description: PGP signature