On 02/23/2018 11:05 AM, Kevin Wolf wrote:
Am 23.02.2018 um 17:43 hat Eric Blake geschrieben:
OFFSET_VALID | DATA might be excusable because I can see that it's
convenient that a protocol driver refers to itself as *file instead of
returning NULL there and then the offset is valid (though it would be
pointless to actually follow the file pointer), but OFFSET_VALID without
DATA probably isn't.

So OFFSET_VALID | DATA for a protocol BDS is not just convenient, but
necessary to avoid breaking qemu-img map output.  But you are also right
that OFFSET_VALID without data makes little sense at a protocol layer. So
with that in mind, I'm auditing all of the protocol layers to make sure
OFFSET_VALID ends up as something sane.

That's one way to look at it.

The other way is that qemu-img map shouldn't ask the protocol layer for
its offset because it already knows the offset (it is what it passes as
a parameter to bdrv_co_block_status).

Anyway, it's probably not worth changing the interface, we should just
make sure that the return values of the individual drivers are

Yet another inconsistency, and it's making me scratch my head today.

By the way, in my byte-based stuff that is now pending on your tree, I tried hard to NOT change semantics or the set of flags returned by a given driver, and we agreed that's why you'd accept the series as-is and make me do this followup exercise. But it's looking like my followups may end up touching a lot of the same drivers again, now that I'm looking at what the semantics SHOULD be (and whatever I do end up tweaking, I will at least make sure that iotests is still happy with it).

First, let's read what states the NBD spec is proposing:

It defines the following flags for the flags field:

    NBD_STATE_HOLE (bit 0): if set, the block represents a hole (and future 
writes to that area may cause fragmentation or encounter an ENOSPC error); if 
clear, the block is allocated or the server could not otherwise determine its 
status. Note that the use of NBD_CMD_TRIM is related to this status, but that 
the server MAY report a hole even where NBD_CMD_TRIM has not been requested, 
and also that a server MAY report that the block is allocated even where 
NBD_CMD_TRIM has been requested.
    NBD_STATE_ZERO (bit 1): if set, the block contents read as all zeroes; if 
clear, the block contents are not known. Note that the use of 
NBD_CMD_WRITE_ZEROES is related to this status, but that the server MAY report 
zeroes even where NBD_CMD_WRITE_ZEROES has not been requested, and also that a 
server MAY report unknown content even where NBD_CMD_WRITE_ZEROES has been 

It is not an error for a server to report that a region of the export has both 
NBD_STATE_HOLE set and NBD_STATE_ZERO clear. The contents of such an area are 
undefined, and a client reading such an area should make no assumption as to 
its contents or stability.

So here's how Vladimir proposed implementing it in his series (written before my byte-based block status stuff went in to your tree):

Server side (3/9):

+ int ret = bdrv_block_status_above(bs, NULL, offset, tail_bytes, &num,
+                                          NULL, NULL);
+        if (ret < 0) {
+            return ret;
+        }
+        flags = (ret & BDRV_BLOCK_ALLOCATED ? 0 : NBD_STATE_HOLE) |
+                (ret & BDRV_BLOCK_ZERO      ? NBD_STATE_ZERO : 0);

Client side (6/9):

+    *pnum = extent.length >> BDRV_SECTOR_BITS;
+    return (extent.flags & NBD_STATE_HOLE ? 0 : BDRV_BLOCK_DATA) |
+           (extent.flags & NBD_STATE_ZERO ? BDRV_BLOCK_ZERO : 0);

Does anything there strike you as odd? In isolation, they seemed fine to me, but side-by-side, I'm scratching my head: the server queries the block layer, and turns BDRV_BLOCK_ALLOCATED into !NBD_STATE_HOLE; the client side then takes the NBD protocol and tries to turn it back into information to feed the block layer, where !NBD_STATE_HOLE now feeds BDRV_BLOCK_DATA. Why the different choice of bits?

Part of the story is that right now, we document that ONLY the block layer sets _ALLOCATED, in io.c, as a result of the driver layer returning HOLE || ZERO (there are cases where the block layer can return ZERO but not ALLOCATED, because the driver layer returned 0 but the block layer still knows that area reads as zero). So Victor's patch matches the fact that the driver shouldn't set ALLOCATED. Still, if we are tying ALLOCATED to whether there is a hole, then that seems like information we should be getting from the driver, not something synthesized after we've left the driver!

Then there's the question of file-posix.c: what should it return for a hole, ZERO|OFFSET_VALID or DATA|ZERO|OFFSET_VALID? The wording in block.h implies that if DATA is not set, then the area reads as zero to the guest, but may have indeterminate value on the underlying file - but we KNOW that a hole in a POSIX file reads as 0 rather than having indeterminate value, and returning DATA fits the current documentation (but doing so bleeds through to at least 'qemu-img map --output=json' for the raw format). I think we're overloading too many things into DATA (which layer of the chain feeds what the guest sees, and do we have a hole or is storage allocated for the data). The only uses of BDRV_BLOCK_ALLOCATED are in the computation of bdrv_is_allocated(), in qcow2 measure, and in qemu-img compare, which all really do care about the semantics of "does THIS layer provide the guest image, or do I defer to a backing layer". But the question NBD wants answered is "do I know whether there is a hole in the storage" There are also relatively few clients of BDRV_BLOCK_DATA (mirror.c, qemu-img, bdrv_co_block_status_above), and I wonder if some of them are more worried about BDRV_BLOCK_ALLOCATED instead.

I'm thinking of revamping things to still keep four bits, but with new names and semantics as follows:

BDRV_BLOCK_LOCAL - the guest gets this portion of the file from this BDS, rather than the backing chain - makes sense for format drivers, pointless for protocol drivers
BDRV_BLOCK_ZERO - this portion of the file reads as zeroes
BDRV_BLOCK_ALLOC - this portion of the file has reserved disk space
BDRV_BLOCK_OFFSET_VALID - offset for accessing raw data

For format drivers:
L Z A O   read as zero, returned file is zero at offset
L - A O   read as valid from file at offset
L Z - O   read as zero, but returned file has hole at offset
L - - O   preallocated at offset but reads as garbage - bug?
L Z A -   read as zero, but from unknown offset with storage
L - A - read as valid, but from unknown offset (including compressed, encrypted)
L Z - -   read as zero, but from unknown offset with hole
L - - -   preallocated but no offset known - bug?
- Z A O read defers to backing layer, but protocol layer contains allocated zeros at offset
- - A O   read defers to backing layer, but preallocated at offset
- Z - O   bug
- - - O   bug
- Z A -   bug
- - A -   bug
- Z - -   bug
- - - -   read defers to backing layer

For protocol drivers:
- Z A O   read as zero, offset is allocated
- - A O   read as data, offset is allocated
- Z - O   read as zero, offset is hole
- - - O   bug?
- Z A -   read as zero, but from unknown offset with storage
- - A -   read as valid, but from unknown offset
- Z - -   read as zero, but from unknown offset with hole
- - - -   can't access this portion of file

With the new bit definitions, any driver that returns RAW (necessarily with OFFSET_VALID) will have the block layer set LOCAL in addition to whatever the next layer returns (turning the protocol driver's response into the correct format layer response). Protocol drivers can omit the callback and get the sane default of '- - A O' mapped in place (or would that be better as '- - A -'?). file-posix.c would return either '- - A O' (after SEEK_DATA) or '- Z - O' (after SEEK_HOLE). NBD would map ZERO to NBD_STATE_ZERO, and ALLOC to !NBD_STATE_HOLE, in both server (block-layer-to-NBD-protocol) and client (NBD-protocol-to-block-layer). Format drivers would set LOCAL themselves (rather than the block layer synthesizing it).

bdrv_is_allocated will still let clients learn which layers are local without grabbing full mapping information, but is tied to the BDRV_BLOCK_LOCAL bit. Optimizations made during mirror based on whether and qemu-img compare previously based on BDRV_BLOCK_ALLOCATED are now based on BDRV_BLOCK_LOCAL, those based on BDRV_BLOCK_DATA are now based on BDRV_BLOCK_ALLOC.


Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Reply via email to