Hi Alex, let me try my hand at clarifying some points...

On 11/29/2016 07:57 AM, Alex Bligh wrote:
> Vladimir,
> I went back to April to reread the previous train of conversation
> then found you had helpfully summarised some if it. Comments
> below.
> Rather than comment on many of the points individual, the root
> of my confusion and to some extent uncomfortableness about this
> proposal is 'who owns the meaning the the bitmaps'.
> Some of this is my own confusion (sorry) about the use to which
> this is being put, which is I think at root a documentation issue.
> To illustrate this, you write in the FAQ section that this is for
> read only disks, but the text talks about:

I sent an earlier email in response to Wouter over our exact "goal" with
this spec extension that is a little more of an overview, but I will try
to address your questions specifically.

> +Some storage formats and operations over such formats express a
> +concept of data dirtiness. Whether the operation is block device
> +mirroring, incremental block device backup or any other operation with
> +a concept of data dirtiness, they all share a need to provide a list
> +of ranges that this particular operation treats as dirty.
> How can data be 'dirty' if it is static and unchangeable? (I thought)

In a simple case, live IO goes to e.g. hda.qcow2. These writes come from
the VM and cause the bitmap that QEMU manages to become dirty.

We intend to expose the ability to fleece dirty blocks via NBD. What
happens in this scenario would be that a snapshot of the data at the
time of the request is exported over NBD in a read-only manner.

In this way, the drive itself is R/W, but the "view" of it from NBD is
RO. While a hypothetical backup client is busy copying data out of this
temporary view, new writes are coming in to the drive, but are not being
exposed through the NBD export.

(This goes into QEMU-specifics, but those new writes are dirtying a
version of the bitmap not intended to be exposed via the NBD channel.
NBD gets effectively a snapshot of both the bitmap AND the data.)

> I now think what you are talking about backing up a *snapshot* of a disk
> that's running, where the disk itself was not connected using NBD? IE it's
> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
> an opaque state represented in a bitmap, which is binary metadata
> at some particular level of granularity. It might as well be 'happiness'
> or 'is coloured blue'. The NBD server would (normally) have no way of
> manipulating this bitmap.
> In previous comments, I said 'how come we can set the dirty bit through
> writes but can't clear it?'. This (my statement) is now I think wrong,
> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
> state of the bitmap comes from whatever sets the bitmap which is outside
> the scope of this protocol to transmit it.

You know, this is a fair point. We have not (to my knowledge) yet
carefully considered the exact bitmap management scenario when NBD is
involved in retrieving dirty blocks.

Humor me for a moment while I talk about a (completely hypothetical, not
yet fully discussed) workflow for how I envision this feature.

(1) User sets up a drive in QEMU, a bitmap is initialized, an initial
backup is made, etc.

(2) As writes come in, QEMU's bitmap is dirtied.

(3) The user decides they want to root around to see what data has
changed and would like to use NBD to do so, in contrast to QEMU's own
facilities for dumping dirty blocks.

(4) A command is issued that creates a temporary, lightweight snapshot
('fleecing') and exports this snapshot over NBD. The bitmap is
associated with the NBD export at this point at NBD server startup. (For
the sake of QEMU discussion, maybe this command is "blockdev-fleece")

(5) At this moment, the snapshot is static and represents the data at
the time the NBD server was started. The bitmap is also forked and
represents only this snapshot. The live data and bitmap continue to change.

(6) Dirty blocks are queried and copied out via NBD.

(7) The user closes the NBD instance upon completion of their task,
whatever it was. (Making a new incremental backup? Just taking a peek at
some changed data? who knows.)

The point that's interesting here is what do we do with the two bitmaps
at this point? The data delta can be discarded (this was after all just
a lightweight read-only point-in-time snapshot) but the bitmap data
needs to be dealt with.

(A) In the case of "User made a new incremental backup," the bitmap that
got forked off to serve the NBD read should be discarded.

(B) In the case of "User just wanted to look around," the bitmap should
be merged back into the bitmap it was forked from.

I don't advise a hybrid where "User copied some data, but not all" where
we need to partially clear *and* merge, but conceivably this could
happen, because the things we don't want to happen always will.

At this point maybe it's becoming obvious that actually it would be very
prudent to allow the NBD client itself to inform QEMU via the NBD
protocol which extents/blocks/(etc) that it is "done" with.

Maybe it *would* actually be useful if, in NBD allowing us to add a
"dirty" bit to the specification, we allow users to clear those bits.

Then, whether the user was trying to do (A) or (B) or the unspeakable
amalgamation of both things, it's up to the user to clear the bits
desired and QEMU can do the simple task of simply always merging the
bitmap fork upon the conclusion of the NBD fleecing exercise.

Maybe this would allow the dirty bit to have a bit more concrete meaning
for the NBD spec: "The bit stays dirty until the user clears it, and is
set when the matching block/extent/etc is written to."

With an exception that external management may cause the bits to clear.
(I.e., someone fiddles with the backing store in a way opaque to NBD,
e.g. someone clears the bitmap directly through QEMU instead of via NBD.)

> However, we have the uncomfortable (to me) situation where the protocol
> describes a flag 'dirty', with implications as to what it does, but
> no actual strict definition of how it's set. So any 'other' user has
> no real idea how to use the information, or how to implement a server
> that provides a 'dirty' bit, because the semantics of that aren't within
> the protocol. This does not sit happily with me.
> So I'm wondering whether we should simplify and generalise this spec. You
> say that for the dirty flag, there's no specification of when it is
> set and cleared - that's implementation defined. Would it not be better
> then to say 'that whole thing is private to Qemu - even the name'.
> Rather you could read the list of bitmaps a server has, with a textual
> name, each having an index (and perhaps a granularity). You could then
> ask on NBD_CMD_BLOCK_STATUS for the appropriate index, and get back that
> bitmap value. Some names (e.g. 'ALLOCATED') could be defined in the spec,
> and some (e.g. ones beginning with 'X-') could be reserved for user
> usage. So you could use 'X-QEMU-DIRTY'). If you then change what your
> dirty bit means, you could use 'X-QEMU-DIRTY2' or similar, and not
> need a protocol change to support it.

If we can't work out a more precise, semantically meaningful spec
extension then I am very happy with simply the ability to have a user
bit, or X-QEMU bits, etc etc etc.

> IE rather than looking at 'a way of reading the dirty bit', we could
> have this as a generic way of reading opaque bitmaps. Only one (allocation)
> might be given meaning to start off with, and it wouldn't be necessary
> for all servers to support that - i.e. you could support bitmap reading
> without having an ALLOCATION bitmap available.
> This spec would then be limited to the transmission of the bitmaps
> (remove the word 'dirty' entirely, except perhaps as an illustrative
> use case), and include only the definition of the allocation bitmap.
> Some more nits:
>> Also, bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now
>> NBD_FLAG_CAN_MULTI_CONN in master branch.
>> And, finally, I've rebased this onto current state of
>> extension-structured-reply branch (which itself should be rebased on
>> master IMHO).
> Each documentation branch should normally be branched off master unless
> it depends on another extension (in which case it will be branched from that).
> I haven't been rebasing them frequently as it can disrupt those working
> on the branches. There's only really an issue around rebasing where you
> depend on another branch.
>> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>>   A: 1: server replies with status descriptors of any size, granularity
>>         is hidden from the client
>>      2: dirty/allocated requests are separate and unrelated to each
>>         other, so their granularities are not intersecting
> I'm OK with this, but note that you do actually mention a granularity
> of sorts in the spec (512 byes) - I think you should replace that
> with the minimum block size.
>> 3. Q: selecting of dirty bitmap to export
>>   A: several variants:
>>      1: id of bitmap is in flags field of request
>>          pros: - simple
>>          cons: - it's a hack. flags field is for other uses.
>>                - we'll have to map bitmap names to these "ids"
>>      2: introduce extended nbd requests with variable length and exploit this
>>         feature for BLOCK_STATUS command, specifying bitmap identifier.
>>         pros: - looks like a true way
>>         cons: - we have to create additional extension
>>               - possible we have to create a map,
>>                 {<QEMU bitmap name> <=> <NBD bitmap id>}
>>      3: exteranl tool should select which bitmap to export. So, in case of 
>> Qemu
>>         it should be something like qmp command block-export-dirty-bitmap.
>>         pros: - simple
>>               - we can extend it to behave like (2) later
>>         cons: - additional qmp command to implement (possibly, the lesser 
>> evil)
>>         note: Hmm, external tool can make chose between allocated/dirty data 
>> too,
>>               so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all.
> Yes, this is all pretty horrible. I suspect we want to do something like (2),
> and permit extra data across (in my proposal, it would only be one byte to 
> select
> the index). I suppose one could ask for a list of bitmaps.

Having missed most of the discussion on v1/v2, is it a given that we
want in-band identification of bitmaps?

I guess this might depend very heavily on the nature of the definition
of the "dirty bit" in the NBD spec.

>> 4. Q: Should not get_{allocated,dirty} be separate commands?
>>   cons: Two commands with almost same semantic and similar means?
>>   pros: However here is a good point of separating clearly defined and native
>>         for block devices GET_BLOCK_STATUS from user-driven and actually
>>         undefined data, called 'dirtyness'.
> I'm suggesting one generic 'read bitmap' command like you.
>> 5. Number of status descriptors, sent by server, should be restricted
>>   variants:
>>   1: just allow server to restrict this as it wants (which was done in v3)
>>   2: (not excluding 1). Client specifies somehow the maximum for number
>>      of descriptors.
>>      2.1: add command flag, which will request only one descriptor
>>           (otherwise, no restrictions from the client)
>>      2.2: again, introduce extended nbd requests, and add field to
>>           specify this maximum
> I think some form of extended request is the way to go, but out of
> interest, what's the issue with as many descriptors being sent as it
> takes to encode the reply? The client can just consume the remainder
> (without buffering) and reissue the request at a later point for
> the areas it discarded.
>> 6. A: What to do with unspecified flags (in request/reply)?
>>   I think the normal variant is to make them reserved. (Server should
>>   return EINVAL if found unknown bits, client should consider replay
>>   with unknown bits as an error)
> Yeah.
>> +
>> +
>> +    A block status query request. Length and offset define the range
>> +    of interest. Clients SHOULD NOT use this request unless the server
> MUST NOT is what we say elsewhere I believe.
>> +    set `NBD_CMD_SEND_BLOCK_STATUS` in the transmission flags, which
>> +    in turn requires the client to first negotiate structured replies.
>> +    For a successful return, the server MUST use a structured reply,
>> +    containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
> Nit: are you saying that non-structured error replies are permissible?
> You're always/often going to get a non-structured  (simple) error reply
> if the server doesn't support the command, but I think it would be fair to 
> say the
> server MUST use a structured reply to NBD_CMD_SEND_BLOCK_STATUS if
> it supports the command. This is effectively what we say re NBD_CMD_READ.
>> +
>> +    The list of block status descriptors within the
>> +    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
>> +    of the file starting from specified *offset*, and the sum of the
>> +    *length* fields of each descriptor MUST not be greater than the
>> +    overall *length* of the request. This means that the server MAY
>> +    return less data than required. However the server MUST return at
>> +    least one status descriptor
> I'm not sure I understand why that's useful. What should the client
> infer from the server refusing to provide information? We don't
> permit short reads etc.
>> .  The server SHOULD use different
>> +    *status* values between consecutive descriptors, and SHOULD use
>> +    descriptor lengths that are an integer multiple of 512 bytes where
>> +    possible (the first and last descriptor of an unaligned query being
>> +    the most obvious places for an exception).
> Surely better would be an an integer multiple of the minimum block
> size. Being able to offer bitmap support at finer granularity than
> the absolute minimum block size helps no one, and if it were possible
> to support a 256 byte block size (I think some floppy disks had that)
> I see no reason not to support that as a granularity.

Anyway, I hope I am being useful and just not more confounding. It seems
to me that we're having difficulty conveying precisely what it is we're
trying to accomplish, so I hope that I am making a good effort in
elaborating on our goals/requirements.

If not, well. You've got my address for hatemail :)


Reply via email to