> On 19 Jun 2019, at 20:12, Max Reitz <[email protected]> wrote:
>
> On 05.06.19 14:17, Sam Eiderman wrote:
>> Until ESXi 6.5, VMware used the vmfsSparse format for snapshots (VMDK3 in
>> QEMU).
>>
>> This format was lacking in the following:
>>
>>   * Grain directory (L1) and grain table (L2) entries were 32-bit,
>>     allowing access to only 2TB (slightly less) of data.
>>   * The grain size (default) was 512 bytes - leading to data
>>     fragmentation and many grain tables.
>>   * For space reclamation purposes, it was necessary to find all the
>>     grains which are not pointed to by any grain table - so a reverse
>>     mapping of "offset of grain in vmdk" to "grain table" must be
>>     constructed - which takes large amounts of CPU/RAM.
>>
>> The format specification can be found in VMware's documentation:
>> https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf
>>
>> In ESXi 6.5, to support snapshot files larger than 2TB, a new format was
>> introduced: SESparse (Space Efficient).
>>
>> This format fixes the above issues:
>>
>>   * All entries are now 64-bit.
>>   * The grain size (default) is 4KB.
>>   * Grain directory and grain tables are now located at the beginning
>>     of the file.
>>       + seSparse format reserves space for all grain tables.
>>       + Grain tables can be addressed using an index.
>>       + Grains are located at the end of the file and can also be
>>         addressed with an index.
>>       - seSparse vmdks of large disks (64TB) have huge preallocated
>>         headers - mainly due to L2 tables, even for empty snapshots.
>>   * The header contains a reverse mapping ("backmap") of "offset of
>>     grain in vmdk" to "grain table" and a bitmap ("free bitmap") which
>>     specifies for each grain whether it is allocated or not.
>>     Using these data structures we can implement space reclamation
>>     efficiently.
>>   * Since the header now maintains two mappings:
>>       * the regular one (grain directory & grain tables)
>>       * a reverse one (backmap and free bitmap)
>>     these data structures can lose consistency upon crash and result
>>     in a corrupted VMDK.
>>     Therefore, a journal is also added to the VMDK and is replayed
>>     when VMware reopens the file after a crash.
>>
>> Since ESXi 6.7, SESparse is the only snapshot format available.
>>
>> Unfortunately, VMware does not provide documentation regarding the new
>> seSparse format.
>>
>> This commit is based on black-box research of the seSparse format.
>> Various in-guest block operations and their effect on the snapshot file
>> were tested.
>>
>> The only VMware-provided source of information (regarding the underlying
>> implementation) was a log file on the ESXi:
>>
>>     /var/log/hostd.log
>>
>> Whenever an seSparse snapshot is created, the log is populated
>> with seSparse records.
>>
>> Relevant log records are of the form:
>>
>>     [...] Const Header:
>>     [...]   constMagic     = 0xcafebabe
>>     [...]   version        = 2.1
>>     [...]   capacity       = 204800
>>     [...]   grainSize      = 8
>>     [...]   grainTableSize = 64
>>     [...]   flags          = 0
>>     [...] Extents:
>>     [...]   Header         : <1 : 1>
>>     [...]   JournalHdr     : <2 : 2>
>>     [...]   Journal        : <2048 : 2048>
>>     [...]   GrainDirectory : <4096 : 2048>
>>     [...]   GrainTables    : <6144 : 2048>
>>     [...]   FreeBitmap     : <8192 : 2048>
>>     [...]   BackMap        : <10240 : 2048>
>>     [...]   Grain          : <12288 : 204800>
>>     [...] Volatile Header:
>>     [...]   volatileMagic    = 0xcafecafe
>>     [...]   FreeGTNumber     = 0
>>     [...]   nextTxnSeqNumber = 0
>>     [...]   replayJournal    = 0
>>
>> The sizes that are seen in the log file are in sectors.
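Two pieces of arithmetic from the description above can be checked with a small stand-alone sketch (helper names are mine, not from the patch): the "2TB (slightly less)" ceiling that follows from 32-bit sector indexes, and the conversion of the <offset : size> extent pairs, which hostd.log reports in 512-byte sectors.

```c
#include <stdint.h>

#define SECTOR_SIZE 512ULL

/* vmfsSparse kept 32-bit sector indexes, so at most 2^32 sectors
 * are addressable: 2^32 * 512 bytes = 2 TiB (exclusive upper bound). */
static uint64_t vmfs_sparse_max_bytes(void)
{
    return (UINT64_C(1) << 32) * SECTOR_SIZE;
}

/* An seSparse extent from hostd.log, given as <offset : size> in sectors. */
typedef struct SeSparseExtent {
    uint64_t offset_sectors;
    uint64_t size_sectors;
} SeSparseExtent;

static uint64_t extent_offset_bytes(SeSparseExtent e)
{
    return e.offset_sectors * SECTOR_SIZE;
}

static uint64_t extent_end_sectors(SeSparseExtent e)
{
    return e.offset_sectors + e.size_sectors;
}
```

With the values from the log, GrainDirectory <4096 : 2048> ends exactly at sector 6144, where GrainTables begins - consistent with the fixed layout the patch enforces.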
>> Extents are of the following format: <offset : size>
>>
>> This commit is a strict implementation which enforces:
>>   * magics
>>   * version number 2.1
>>   * grain size of 8 sectors (4KB)
>>   * grain table size of 64 sectors
>>   * zero flags
>>   * extent locations
>>
>> Additionally, this commit provides only a subset of the functionality
>> offered by the seSparse format:
>>   * Read-only
>>   * No journal replay
>>   * No space reclamation
>>   * No unmap support
>>
>> Hence, the journal header, journal, free bitmap and backmap extents are
>> unused; only the "classic" (L1 -> L2 -> data) grain access is
>> implemented.
>>
>> However, there are several differences in the grain access itself.
>> Grain directory (L1):
>>   * Grain directory entries are indexes (not offsets) to grain
>>     tables.
>>   * Valid grain directory entries have their highest nibble set to
>>     0x1.
>>   * Since grain tables are always located at the beginning of the
>>     file, the index can fit into 32 bits - so we can use its low
>>     part if it's valid.
>> Grain table (L2):
>>   * Grain table entries are indexes (not offsets) to grains.
>>   * If the highest nibble of the entry is:
>>       0x0:
>>         The grain is not allocated.
>>         The rest of the bytes are 0.
>>       0x1:
>>         The grain is unmapped - the guest sees a zero grain.
>>         The rest of the bits point to the previously mapped grain,
>>         see the 0x3 case.
>>       0x2:
>>         The grain is zero.
>>       0x3:
>>         The grain is allocated - to get the index calculate:
>>         ((entry & 0x0fff000000000000) >> 48) |
>>         ((entry & 0x0000ffffffffffff) << 12)
>>   * The difference between 0x1 and 0x2 is that 0x1 is an unallocated
>>     grain which results from the guest using sg_unmap to unmap the
>>     grain - but the grain itself still exists in the grain extent - a
>>     space reclamation procedure should delete it.
>>     Unmapping a zero grain has no effect (0x2 will not change to 0x1)
>>     but unmapping an unallocated grain will (0x0 to 0x1) - naturally.
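The L1/L2 decoding rules quoted above can be condensed into a stand-alone sketch (the enum and function names are mine; the patch itself implements this logic inside block/vmdk.c):

```c
#include <stdint.h>

/* High nibble of an seSparse grain table (L2) entry. */
typedef enum {
    SESPARSE_GRAIN_UNALLOCATED = 0x0, /* never written, rest of entry is 0 */
    SESPARSE_GRAIN_UNMAPPED    = 0x1, /* unmapped by guest, reads as zeroes */
    SESPARSE_GRAIN_ZERO        = 0x2, /* zero grain */
    SESPARSE_GRAIN_ALLOCATED   = 0x3, /* low bits encode the grain index */
} SeSparseGrainType;

static SeSparseGrainType se_sparse_grain_type(uint64_t l2_entry)
{
    return (SeSparseGrainType)(l2_entry >> 60);
}

/* For 0x3 entries (and the stale pointer a 0x1 entry keeps), the grain
 * index is split across two bit-fields, as quoted in the commit message. */
static uint64_t se_sparse_grain_index(uint64_t l2_entry)
{
    return ((l2_entry & 0x0fff000000000000ULL) >> 48) |
           ((l2_entry & 0x0000ffffffffffffULL) << 12);
}

/* Grain directory (L1) entries are valid when their high nibble is 0x1;
 * since grain tables sit at the start of the file, the table index fits
 * in the entry's low 32 bits. */
static int se_sparse_gd_entry_valid(uint64_t l1_entry)
{
    return (l1_entry >> 60) == 0x1;
}

static uint32_t se_sparse_gd_table_index(uint64_t l1_entry)
{
    return (uint32_t)l1_entry;
}
```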
>>
>> In order to implement seSparse, some fields had to be changed to support
>> both 32-bit and 64-bit entry sizes.
>>
>> Reviewed-by: Karl Heubaum <[email protected]>
>> Reviewed-by: Eyal Moscovici <[email protected]>
>> Reviewed-by: Arbel Moshe <[email protected]>
>> Signed-off-by: Sam Eiderman <[email protected]>
>> ---
>>  block/vmdk.c | 357 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 341 insertions(+), 16 deletions(-)
>>
>> diff --git a/block/vmdk.c b/block/vmdk.c
>> index 931eb2759c..4377779635 100644
>> --- a/block/vmdk.c
>> +++ b/block/vmdk.c
>
> [...]
>
>> +static int vmdk_open_se_sparse(BlockDriverState *bs,
>> +                               BdrvChild *file,
>> +                               int flags, Error **errp)
>> +{
>> +    int ret;
>> +    VMDKSESparseConstHeader const_header;
>> +    VMDKSESparseVolatileHeader volatile_header;
>> +    VmdkExtent *extent;
>> +
>> +    if (flags & BDRV_O_RDWR) {
>> +        error_setg(errp, "No write support for seSparse images available");
>> +        return -ENOTSUP;
>> +    }
>
> Kind of works for me, but why not bdrv_apply_auto_read_only() like I had
> proposed?  The advantage is that this would make the node read-only if
> the user has specified auto-read-only=on instead of failing.
>
Ah, I hadn't realized that bdrv_apply_auto_read_only() is preferred.
I'll send a v3.

Sam

> Max
