On 05.06.19 14:17, Sam Eiderman wrote:
> Until ESXi 6.5 VMware used the vmfsSparse format for snapshots (VMDK3 in
> QEMU).
>
> This format was lacking in the following:
>
>     * Grain directory (L1) and grain table (L2) entries were 32-bit,
>       allowing access to only 2TB (slightly less) of data.
>     * The grain size (default) was 512 bytes - leading to data
>       fragmentation and many grain tables.
>     * For space reclamation purposes, it was necessary to find all the
>       grains which are not pointed to by any grain table - so a reverse
>       mapping of "offset of grain in vmdk" to "grain table" must be
>       constructed - which takes large amounts of CPU/RAM.
>
> The format specification can be found in VMware's documentation:
> https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf
>
> In ESXi 6.5, to support snapshot files larger than 2TB, a new format was
> introduced: SESparse (Space Efficient).
>
> This format fixes the above issues:
>
>     * All entries are now 64-bit.
>     * The grain size (default) is 4KB.
>     * Grain directory and grain tables are now located at the beginning
>       of the file.
>       + seSparse format reserves space for all grain tables.
>       + Grain tables can be addressed using an index.
>       + Grains are located at the end of the file and can also be
>         addressed with an index.
>       - seSparse vmdks of large disks (64TB) have huge preallocated
>         headers - mainly due to L2 tables, even for empty snapshots.
>     * The header contains a reverse mapping ("backmap") of "offset of
>       grain in vmdk" to "grain table" and a bitmap ("free bitmap") which
>       specifies for each grain - whether it is allocated or not.
>       Using these data structures we can implement space reclamation
>       efficiently.
>     * Due to the fact that the header now maintains two mappings:
>         * The regular one (grain directory & grain tables)
>         * A reverse one (backmap and free bitmap)
>       These data structures can lose consistency upon crash and result
>       in a corrupted VMDK.
>       Therefore, a journal is also added to the VMDK and is replayed
>       when VMware reopens the file after a crash.
>
> Since ESXi 6.7 - SESparse is the only snapshot format available.
>
> Unfortunately, VMware does not provide documentation regarding the new
> seSparse format.
>
> This commit is based on black-box research of the seSparse format.
> Various in-guest block operations and their effect on the snapshot file
> were tested.
>
> The only VMware provided source of information (regarding the underlying
> implementation) was a log file on the ESXi:
>
>     /var/log/hostd.log
>
> Whenever an seSparse snapshot is created - the log is being populated
> with seSparse records.
>
> Relevant log records are of the form:
>
> [...] Const Header:
> [...]  constMagic     = 0xcafebabe
> [...]  version        = 2.1
> [...]  capacity       = 204800
> [...]  grainSize      = 8
> [...]  grainTableSize = 64
> [...]  flags          = 0
> [...] Extents:
> [...]  Header         : <1 : 1>
> [...]  JournalHdr     : <2 : 2>
> [...]  Journal        : <2048 : 2048>
> [...]  GrainDirectory : <4096 : 2048>
> [...]  GrainTables    : <6144 : 2048>
> [...]  FreeBitmap     : <8192 : 2048>
> [...]  BackMap        : <10240 : 2048>
> [...]  Grain          : <12288 : 204800>
> [...] Volatile Header:
> [...]  volatileMagic     = 0xcafecafe
> [...]  FreeGTNumber      = 0
> [...]  nextTxnSeqNumber  = 0
> [...]  replayJournal     = 0
>
> The sizes that are seen in the log file are in sectors.
> Extents are of the following format: <offset : size>
>
> This commit is a strict implementation which enforces:
>     * magics
>     * version number 2.1
>     * grain size of 8 sectors (4KB)
>     * grain table size of 64 sectors
>     * zero flags
>     * extent locations
>
> Additionally, this commit provides only a subset of the functionality
> offered by seSparse's format:
>     * Read-only
>     * No journal replay
>     * No space reclamation
>     * No unmap support
>
> Hence, journal header, journal, free bitmap and backmap extents are
> unused, only the "classic" (L1 -> L2 -> data) grain access is
> implemented.
>
> However there are several differences in the grain access itself.
> Grain directory (L1):
>     * Grain directory entries are indexes (not offsets) to grain
>       tables.
>     * Valid grain directory entries have their highest nibble set to
>       0x1.
>     * Since grain tables are always located in the beginning of the
>       file - the index can fit into 32 bits - so we can use its low
>       part if it's valid.
> Grain table (L2):
>     * Grain table entries are indexes (not offsets) to grains.
>     * If the highest nibble of the entry is:
>         0x0:
>             The grain is not allocated.
>             The rest of the bytes are 0.
>         0x1:
>             The grain is unmapped - guest sees a zero grain.
>             The rest of the bits point to the previously mapped grain,
>             see 0x3 case.
>         0x2:
>             The grain is zero.
>         0x3:
>             The grain is allocated - to get the index calculate:
>             ((entry & 0x0fff000000000000) >> 48) |
>             ((entry & 0x0000ffffffffffff) << 12)
>     * The difference between 0x1 and 0x2 is that 0x1 is an unallocated
>       grain which results from the guest using sg_unmap to unmap the
>       grain - but the grain itself still exists in the grain extent - a
>       space reclamation procedure should delete it.
>       Unmapping a zero grain has no effect (0x2 will not change to 0x1)
>       but unmapping an unallocated grain will (0x0 to 0x1) - naturally.
>
> In order to implement seSparse some fields had to be changed to support
> both 32-bit and 64-bit entry sizes.
>
> Reviewed-by: Karl Heubaum <[email protected]>
> Reviewed-by: Eyal Moscovici <[email protected]>
> Reviewed-by: Arbel Moshe <[email protected]>
> Signed-off-by: Sam Eiderman <[email protected]>
> ---
>  block/vmdk.c | 357 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 341 insertions(+), 16 deletions(-)
>
> diff --git a/block/vmdk.c b/block/vmdk.c
> index 931eb2759c..4377779635 100644
> --- a/block/vmdk.c
> +++ b/block/vmdk.c
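
For readers following the entry layout described in the commit message, here is a minimal standalone sketch of the L1/L2 entry decoding as I read it from that description. It is not the code from the patch: the type and function names (GrainState, gd_entry_decode, gt_entry_decode) are made up for illustration, and treating any other top nibble as invalid is my assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration; they do not come from the patch. */
typedef enum {
    GRAIN_UNALLOCATED, /* top nibble 0x0 */
    GRAIN_UNMAPPED,    /* top nibble 0x1: guest sees a zero grain */
    GRAIN_ZERO,        /* top nibble 0x2 */
    GRAIN_ALLOCATED,   /* top nibble 0x3 */
    GRAIN_INVALID      /* anything else (assumption: treated as an error) */
} GrainState;

/*
 * Grain directory (L1): a valid entry has its highest nibble set to 0x1
 * and carries the grain table index in its low 32 bits.
 */
static bool gd_entry_decode(uint64_t entry, uint32_t *gt_index)
{
    if ((entry >> 60) != 0x1) {
        return false;
    }
    *gt_index = (uint32_t)(entry & 0xffffffffULL);
    return true;
}

/*
 * Grain table (L2): classify the entry by its highest nibble. The grain
 * index is written to *grain_index only for allocated grains.
 */
static GrainState gt_entry_decode(uint64_t entry, uint64_t *grain_index)
{
    switch (entry >> 60) {
    case 0x0:
        return GRAIN_UNALLOCATED;
    case 0x1:
        return GRAIN_UNMAPPED;
    case 0x2:
        return GRAIN_ZERO;
    case 0x3:
        /* Formula as quoted in the commit message. */
        *grain_index = ((entry & 0x0fff000000000000ULL) >> 48) |
                       ((entry & 0x0000ffffffffffffULL) << 12);
        return GRAIN_ALLOCATED;
    default:
        return GRAIN_INVALID;
    }
}
```

With these definitions, an L2 entry of 0x3001000000000002 would decode as an allocated grain with index 0x2001 (bits 48..59 supply the low 12 bits of the index, and the low 48 bits are shifted up by 12).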
[...]
> +static int vmdk_open_se_sparse(BlockDriverState *bs,
> +                               BdrvChild *file,
> +                               int flags, Error **errp)
> +{
> +    int ret;
> +    VMDKSESparseConstHeader const_header;
> +    VMDKSESparseVolatileHeader volatile_header;
> +    VmdkExtent *extent;
> +
> +    if (flags & BDRV_O_RDWR) {
> +        error_setg(errp, "No write support for seSparse images available");
> +        return -ENOTSUP;
> +    }

Kind of works for me, but why not bdrv_apply_auto_read_only() like I had
proposed? The advantage is that this would make the node read-only if
the user has specified auto-read-only=on instead of failing.
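
To spell out the behavioural difference, here is a toy standalone model of the decision being described. It is deliberately not QEMU code: the flag names, the function name and the -EACCES return value are assumptions chosen for this illustration, and the real helper works on a BlockDriverState and reports through Error **.

```c
#include <errno.h>

/* Hypothetical stand-ins for the open flags; not the QEMU definitions. */
enum {
    OPEN_RDWR        = 1 << 0, /* node was opened read-write */
    OPEN_AUTO_RDONLY = 1 << 1  /* user allowed a fallback to read-only */
};

/*
 * Toy model of "apply auto read-only" for a driver that cannot write:
 *   - already read-only: nothing to do
 *   - read-write with the fallback allowed: downgrade to read-only
 *   - read-write without the fallback: fail
 * Returns 0 on success and a negative errno value on failure.
 */
static int apply_auto_read_only_model(unsigned *flags)
{
    if (!(*flags & OPEN_RDWR)) {
        return 0;
    }
    if (!(*flags & OPEN_AUTO_RDONLY)) {
        return -EACCES;
    }
    *flags &= ~(unsigned)OPEN_RDWR;
    return 0;
}
```

The plain BDRV_O_RDWR check in the patch corresponds to failing in both read-write cases; the model above only fails in the last one.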
Max
