Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-08 Thread Alex Bligh

> On 8 Dec 2016, at 15:59, Eric Blake  wrote:
> 
> We should use similar wording to whatever we already say about what a
> client would see when reading data cleared by NBD_CMD_TRIM.  After all,
> the status of STATE_HOLE set and STATE_ZERO clear is what you logically
> get when TRIM cannot guarantee reads-as-zero.

Yes. It was actually exactly that discussion I was trying to remember.

--
Alex Bligh






signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-08 Thread Eric Blake
On 12/08/2016 08:40 AM, Alex Bligh wrote:

>>> + metadata context is the basic "exists at all" metadata context.
>>>
>>> Disagree. You're saying that if a server supports metadata contexts
>>> at all, it must support this one.
>>
>> No, I'm trying to say that this metadata context exposes whether the
>> *block* exists at all (i.e., it exports NBD_STATE_HOLE). I should
>> probably clarify that wording then, if you misunderstood it in that way.
> 
> Ah. Perhaps 'exists at all' itself is misleading. 'Occupies storage
> space'. Or 'is not a hole'?
> 
>> No server MUST implement it (although we might want to make it a SHOULD)
> 
> Don't see why it should even be a 'SHOULD' to be honest. An nbd
> server cooked up for a specific purpose, or with a backend that
> can't provide that (or where there is never a hole) shouldn't
> be criticised.
> 
>>> I think the last sentence is probably meant to read something like:
>>>
>>> If a server supports the "BASE:allocation" metadata context, then
>>> writing to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail
>>> with ENOSPC.
>>
>> No, it can't.
>>
>> Other metadata contexts may change the semantics to the extent that if
>> the block is allocated at the BASE:allocation context, it would still
>> need to space on another plane. In that case, writing to an area which
>> is not STATE_HOLE might still fail.

Not just that, but it is ALWAYS permissible to report NBD_STATE_HOLE as
clear (just not always optimal) - to allow servers that can't determine
sparseness information, but DO know how to communicate extents to the
client.  Yes, it is boring to communicate a single extent that the
entire file is data, and clients can't optimize their usage in that
case, but it should always be considered semantically correct to do so,
since the presence and knowledge of holes is merely an optimization
opportunity, not a data correctness issue.


>>> Why 512 bytes as opposed to 'minimum block size' (or is it because
>>> that is also an experimental extension)?
>>
>> Yes, and this extension does not depend on that one. Hence why it is
>> also a SHOULD and not a MUST.
> 
> OK. As a separate discussion I think we should talk about promoting
> WRITE_ZEROES and INFO. The former has a reference implementation,
> I think Eric did a qemu implementation, and I did a gonbdserver
> implementation. The latter I believe lacks a reference implementation.

Yes, I still have reference qemu patches for INFO; they did not make it
into qemu 2.8 (while WRITE_ZEROES did), but should make it into 2.9.

I also hope to get structured reads into qemu 2.9, but that's a bigger
task, as I don't have reference patches complete yet.  On the other
hand, since BLOCK_STATUS depends on structured reply, I have all the
more reason to complete it soon.

>>> +A client SHOULD NOT read from an area that has both `NBD_STATE_HOLE`
>>> +set and `NBD_STATE_ZERO` clear.
>>>
>>> That makes no sense, as normal data has both these bits clear! This
>>> also implies that to comply with this SHOULD, a client needs to
>>> request block status before any read, which is ridiculous. This
>>> should be dropped.
>>
>> No, it should not, although it may need rewording. It clarifies that
>> having STATE_HOLE set (i.e., there's no valid data in the given range)
>> and STATE_ZERO clear (i.e., we don't assert that it would read as
>> all-zeroes) is not an invalid thing for a server to set. The spec here
>> clarifies what a client should do with that information if it gets it
>> (i.e., "don't read it, it doesn't contain anything interesting").
> 
> That's fair enough until the last bit in brackets. Rather than saying
> a client SHOULD NOT read it, it should simply say that a read on
> such areas will succeed but the data read is undefined (and may
> not be stable).

We should use similar wording to whatever we already say about what a
client would see when reading data cleared by NBD_CMD_TRIM.  After all,
the status of STATE_HOLE set and STATE_ZERO clear is what you logically
get when TRIM cannot guarantee reads-as-zero.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-08 Thread Alex Bligh
Wouter,

>> +- `NBD_OPT_META_CONTEXT` (10)
>> +
>> +Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
>> +followed by an `NBD_REP_ACK`. If a server replies to such a request
>> +with no error message, clients
>> 
>> "*the* server" / "*the* cient"
>> 
>> Perhaps only an 'NBD_REP_ERR_UNSUP' error should prevent the
>> client querying the server.
> 
> I don't think that's necessarily a good idea. I think if the server
> replies with NBD_REP_ERR_INVALID, that means it understands the option
> but the user sent something invalid and now we haven't selected any
> contexts. That means the server won't be able to provide any metadata
> during transmission, either.
> 
> Perhaps it should be made clearer that once contexts have been selected
> (even if errors occurred later on), NBD_CMD_BLOCK_STATUS MAY be used.
> 
>> +MAY send NBD_CMD_BLOCK_STATUS
>> +commands during the transmission phase.
>> 
>> Add: "If a server replies to such a request with NBD_REP_ERR_UNSUP,
>> the client MUST NOT send NBD_CMD_BLOCK_STATUS commands during the
>> transmission phase."
> 
> That would be counter to the above.

My main point is that the current text does not prohibit sending
NBD_CMD_BLOCK_STATUS if it hasn't been established that the server
supports it.

>> Why not just define the format of a metadata-context string to be
>> of the form ':' (perhaps this definition goes
>> elsewhere), and here just say the returned list is a left-match
>> of all the available metadata-contexts, i.e. all those metadata
>> contexts whose names either start or consist entirely of the
>> specified string. If an empty string is specified, all metadata
>> contexts are returned.
> 
> I also want to make it possible for an implementation to define its own
> syntax. Say, a "last-snapshot" thing, or something that says
> "diff:snapshot1-snapshot2", or whatever.

That's a good idea, but doesn't preclude using a colon as a namespace
separator.

>> +The type may be one of:
>> +- `NBD_META_LIST_CONTEXT` (1): the list of metadata contexts
>> +  selected by the query string is returned to the client without
>> +  changing any state (i.e., this does not add metadata contexts
>> +  for further usage).
>> 
>> Somewhere it should say this list is returned by sending
>> zero or more NBD_REP_META_CONTEXT records followed by a NBD_REP_ACK.
> 
> We do that above already.

Must have missed it.

>> A metadata context represents metadata concerning the selected
>> export in the form of a list of extents, each with a status. The
>> meaning of the metadata and the status is dependent upon the
>> context. Each metadata context has a name which is colon separated,
>> in the form ':'. Namespaces that start with "X-"
>> are vendor dependent extensions.
> 
> No, I wouldn't do that, since by definition, every namespace is
> vendor-dependent.
> 
> Maybe we could ask that people who want to implement their own metadata
> context type (rather than be compatible with an existing one) should
> register their namespace somewhere, though.

What I'm trying to say is that there should be three types of namespace:

* "BASE" (or better "base" as Vladimir points out) which is defined within
  the document, i.e. it will say "base:allocated" does X, "base:foo" does
  Y.
* Registered, e.g. "qemu", where the document would say: "The following
  is a list of registered namespaces: ... qemu: to qemu.org" or whatever.
* Unregistered, e.g. "X-Alex-Experiment" where the document merely mentions
  that namespaces beginning with "X-" can be used by anyone, like X- headers
  in SMTP and HTTP.

The purpose of distinguishing between registered and unregistered is
that otherwise we might get two vendors with a product called "fastnbd"
(or whatever) who both pick the same namespace.

I suppose an alternative would be to go the Java-ish way and suggest
people use a domain name (so 'qemu' would be 'qemu.org:whatever').

>> + metadata context is the basic "exists at all" metadata context.
>> 
>> Disagree. You're saying that if a server supports metadata contexts
>> at all, it must support this one.
> 
> No, I'm trying to say that this metadata context exposes whether the
> *block* exists at all (i.e., it exports NBD_STATE_HOLE). I should
> probably clarify that wording then, if you misunderstood it in that way.

Ah. Perhaps 'exists at all' itself is misleading. 'Occupies storage
space'. Or 'is not a hole'?

> No server MUST implement it (although we might want to make it a SHOULD)

Don't see why it should even be a 'SHOULD' to be honest. An nbd
server cooked up for a specific purpose, or with a backend that
can't provide that (or where there is never a hole) shouldn't
be criticised.

>> I think the last sentence is probably meant to read something like:
>> 
>> If a server supports the "BASE:allocation" metadata context, then
>> writing to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail
>> with ENOSPC.
> 
> No, it can't.
> 
> Other metadata 

Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-08 Thread Wouter Verhelst
On Thu, Dec 08, 2016 at 03:39:19AM +, Alex Bligh wrote:
> 
> > On 2 Dec 2016, at 18:45, Alex Bligh  wrote:
> > 
> > Thanks. That makes sense - or enough sense for me to carry on commenting!
> 
> 
> I finally had some time to go through this extension in detail. Rather
> than comment on all the individual patches, I squashed them into a single
> commit, did a 'git format-patch' on it, and commented on that.
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index c443494..9c0981f 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -871,6 +869,50 @@ of the newstyle negotiation.
> 
> Defined by the experimental `INFO` 
> [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
> 
> +- `NBD_OPT_META_CONTEXT` (10)
> +
> +Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
> +followed by an `NBD_REP_ACK`. If a server replies to such a request
> +with no error message, clients
> 
> "*the* server" / "*the* cient"
> 
> Perhaps only an 'NBD_REP_ERR_UNSUP' error should prevent the
> client querying the server.

I don't think that's necessarily a good idea. I think if the server
replies with NBD_REP_ERR_INVALID, that means it understands the option
but the user sent something invalid and now we haven't selected any
contexts. That means the server won't be able to provide any metadata
during transmission, either.

Perhaps it should be made clearer that once contexts have been selected
(even if errors occurred later on), NBD_CMD_BLOCK_STATUS MAY be used.

> +MAY send NBD_CMD_BLOCK_STATUS
> +commands during the transmission phase.
> 
> Add: "If a server replies to such a request with NBD_REP_ERR_UNSUP,
> the client MUST NOT send NBD_CMD_BLOCK_STATUS commands during the
> transmission phase."

That would be counter to the above.

> +If the query string is syntactically invalid, the server SHOULD send
> +`NBD_REP_ERR_INVALID`.
> 
> 'MUST send' else it implies sending nothing is permissible.

Yes, I had already fixed that locally (but didn't push yet, because that
patch is... big, and needs some rework. I'll look at it again later,
incorporating your comments)

> +If the query string is syntactically valid
> +but finds no metadata contexts, the server MUST send a single
> +reply of type `NBD_REP_ACK`.
> +
> +This option MUST NOT be requested unless structured replies have
> 
> Active voice better:
> 
> "The client MUST NOT send this option unless" ...

Right.

> +been negotiated first. If a client attempts to do so, a server
> +SHOULD send `NBD_REP_ERR_INVALID`.
> +
> +Data:
> +- 32 bits, type
> +- String, query to select a subset of the available metadata
> +  contexts. If this is not specified (i.e., length is 4 and no
> +  command is sent), then the server MUST send all the metadata
> +  contexts it knows about. If specified, this query string MUST
> +  start with a name that uniquely identifies a server
> +  implementation; e.g., the reference implementation that
> +  accompanies this document would support query strings starting
> +  with 'nbd-server:'
> 
> Why not just define the format of a metadata-context string to be
> of the form ':' (perhaps this definition goes
> elsewhere), and here just say the returned list is a left-match
> of all the available metadata-contexts, i.e. all those metadata
> contexts whose names either start or consist entirely of the
> specified string. If an empty string is specified, all metadata
> contexts are returned.

I also want to make it possible for an implementation to define its own
syntax. Say, a "last-snapshot" thing, or something that says
"diff:snapshot1-snapshot2", or whatever.

> +The type may be one of:
> +- `NBD_META_LIST_CONTEXT` (1): the list of metadata contexts
> +  selected by the query string is returned to the client without
> +  changing any state (i.e., this does not add metadata contexts
> +  for further usage).
> 
> Somewhere it should say this list is returned by sending
> zero or more NBD_REP_META_CONTEXT records followed by a NBD_REP_ACK.

We do that above already.

> +- `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts
> +  selected by the query string is added to the list of existing
> +  metadata contexts.
> +- `NBD_META_DEL_CONTEXT` (3): the list of metadata contexts
> +  selected by the query string is removed from the list of used
> +  metadata contexts. Servers SHOULD NOT reuse existing metadata
> +  context IDs.
> +
> +The syntax of the query string is not specified, except that
> +implementations MUST support adding and removing individual metadata
> +contexts by simply listing their names.
> 
> This seems slightly over complicated. Rather than have a list held
> by the server of active metadata contexts, we could simply have
> two NBD_OPT_ commands, say NBD_OPT_LIST_META_CONTEXTS and
> NBD_OPT_SET_META_CONTEXTS (which 

Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-06 Thread John Snow


On 12/06/2016 08:32 AM, Wouter Verhelst wrote:
> Hi John
> 
> Sorry for the late reply; weekend was busy, and so was monday.
> 

No problems.

> On Fri, Dec 02, 2016 at 03:39:08PM -0500, John Snow wrote:
>> On 12/02/2016 01:45 PM, Alex Bligh wrote:
>>> John,
>>>
> +Some storage formats and operations over such formats express a
> +concept of data dirtiness. Whether the operation is block device
> +mirroring, incremental block device backup or any other operation with
> +a concept of data dirtiness, they all share a need to provide a list
> +of ranges that this particular operation treats as dirty.
>
> How can data be 'dirty' if it is static and unchangeable? (I thought)
>

 In a simple case, live IO goes to e.g. hda.qcow2. These writes come from
 the VM and cause the bitmap that QEMU manages to become dirty.

 We intend to expose the ability to fleece dirty blocks via NBD. What
 happens in this scenario would be that a snapshot of the data at the
 time of the request is exported over NBD in a read-only manner.

 In this way, the drive itself is R/W, but the "view" of it from NBD is
 RO. While a hypothetical backup client is busy copying data out of this
 temporary view, new writes are coming in to the drive, but are not being
 exposed through the NBD export.

 (This goes into QEMU-specifics, but those new writes are dirtying a
 version of the bitmap not intended to be exposed via the NBD channel.
 NBD gets effectively a snapshot of both the bitmap AND the data.)
>>>
>>> Thanks. That makes sense - or enough sense for me to carry on commenting!
>>>
>>
>> Whew! I'm glad.
>>
> I now think what you are talking about backing up a *snapshot* of a disk
> that's running, where the disk itself was not connected using NBD? IE it's
> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is 
> effectively
> an opaque state represented in a bitmap, which is binary metadata
> at some particular level of granularity. It might as well be 'happiness'
> or 'is coloured blue'. The NBD server would (normally) have no way of
> manipulating this bitmap.
>
> In previous comments, I said 'how come we can set the dirty bit through
> writes but can't clear it?'. This (my statement) is now I think wrong,
> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
> state of the bitmap comes from whatever sets the bitmap which is outside
> the scope of this protocol to transmit it.
>

 You know, this is a fair point. We have not (to my knowledge) yet
 carefully considered the exact bitmap management scenario when NBD is
 involved in retrieving dirty blocks.

 Humor me for a moment while I talk about a (completely hypothetical, not
 yet fully discussed) workflow for how I envision this feature.

 (1) User sets up a drive in QEMU, a bitmap is initialized, an initial
 backup is made, etc.

 (2) As writes come in, QEMU's bitmap is dirtied.

 (3) The user decides they want to root around to see what data has
 changed and would like to use NBD to do so, in contrast to QEMU's own
 facilities for dumping dirty blocks.

 (4) A command is issued that creates a temporary, lightweight snapshot
 ('fleecing') and exports this snapshot over NBD. The bitmap is
 associated with the NBD export at this point at NBD server startup. (For
 the sake of QEMU discussion, maybe this command is "blockdev-fleece")

 (5) At this moment, the snapshot is static and represents the data at
 the time the NBD server was started. The bitmap is also forked and
 represents only this snapshot. The live data and bitmap continue to change.

 (6) Dirty blocks are queried and copied out via NBD.

 (7) The user closes the NBD instance upon completion of their task,
 whatever it was. (Making a new incremental backup? Just taking a peek at
 some changed data? who knows.)

 The point that's interesting here is what do we do with the two bitmaps
 at this point? The data delta can be discarded (this was after all just
 a lightweight read-only point-in-time snapshot) but the bitmap data
 needs to be dealt with.

 (A) In the case of "User made a new incremental backup," the bitmap that
 got forked off to serve the NBD read should be discarded.

 (B) In the case of "User just wanted to look around," the bitmap should
 be merged back into the bitmap it was forked from.

 I don't advise a hybrid where "User copied some data, but not all" where
 we need to partially clear *and* merge, but conceivably this could
 happen, because the things we don't want to happen always will.

 At this point maybe it's becoming obvious that actually it would be very
 prudent to allow the NBD client itself to inform QEMU via the NBD
 

Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-06 Thread Wouter Verhelst
Hi John

Sorry for the late reply; weekend was busy, and so was monday.

On Fri, Dec 02, 2016 at 03:39:08PM -0500, John Snow wrote:
> On 12/02/2016 01:45 PM, Alex Bligh wrote:
> > John,
> > 
> >>> +Some storage formats and operations over such formats express a
> >>> +concept of data dirtiness. Whether the operation is block device
> >>> +mirroring, incremental block device backup or any other operation with
> >>> +a concept of data dirtiness, they all share a need to provide a list
> >>> +of ranges that this particular operation treats as dirty.
> >>>
> >>> How can data be 'dirty' if it is static and unchangeable? (I thought)
> >>>
> >>
> >> In a simple case, live IO goes to e.g. hda.qcow2. These writes come from
> >> the VM and cause the bitmap that QEMU manages to become dirty.
> >>
> >> We intend to expose the ability to fleece dirty blocks via NBD. What
> >> happens in this scenario would be that a snapshot of the data at the
> >> time of the request is exported over NBD in a read-only manner.
> >>
> >> In this way, the drive itself is R/W, but the "view" of it from NBD is
> >> RO. While a hypothetical backup client is busy copying data out of this
> >> temporary view, new writes are coming in to the drive, but are not being
> >> exposed through the NBD export.
> >>
> >> (This goes into QEMU-specifics, but those new writes are dirtying a
> >> version of the bitmap not intended to be exposed via the NBD channel.
> >> NBD gets effectively a snapshot of both the bitmap AND the data.)
> > 
> > Thanks. That makes sense - or enough sense for me to carry on commenting!
> > 
> 
> Whew! I'm glad.
> 
> >>> I now think what you are talking about backing up a *snapshot* of a disk
> >>> that's running, where the disk itself was not connected using NBD? IE it's
> >>> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is 
> >>> effectively
> >>> an opaque state represented in a bitmap, which is binary metadata
> >>> at some particular level of granularity. It might as well be 'happiness'
> >>> or 'is coloured blue'. The NBD server would (normally) have no way of
> >>> manipulating this bitmap.
> >>>
> >>> In previous comments, I said 'how come we can set the dirty bit through
> >>> writes but can't clear it?'. This (my statement) is now I think wrong,
> >>> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
> >>> state of the bitmap comes from whatever sets the bitmap which is outside
> >>> the scope of this protocol to transmit it.
> >>>
> >>
> >> You know, this is a fair point. We have not (to my knowledge) yet
> >> carefully considered the exact bitmap management scenario when NBD is
> >> involved in retrieving dirty blocks.
> >>
> >> Humor me for a moment while I talk about a (completely hypothetical, not
> >> yet fully discussed) workflow for how I envision this feature.
> >>
> >> (1) User sets up a drive in QEMU, a bitmap is initialized, an initial
> >> backup is made, etc.
> >>
> >> (2) As writes come in, QEMU's bitmap is dirtied.
> >>
> >> (3) The user decides they want to root around to see what data has
> >> changed and would like to use NBD to do so, in contrast to QEMU's own
> >> facilities for dumping dirty blocks.
> >>
> >> (4) A command is issued that creates a temporary, lightweight snapshot
> >> ('fleecing') and exports this snapshot over NBD. The bitmap is
> >> associated with the NBD export at this point at NBD server startup. (For
> >> the sake of QEMU discussion, maybe this command is "blockdev-fleece")
> >>
> >> (5) At this moment, the snapshot is static and represents the data at
> >> the time the NBD server was started. The bitmap is also forked and
> >> represents only this snapshot. The live data and bitmap continue to change.
> >>
> >> (6) Dirty blocks are queried and copied out via NBD.
> >>
> >> (7) The user closes the NBD instance upon completion of their task,
> >> whatever it was. (Making a new incremental backup? Just taking a peek at
> >> some changed data? who knows.)
> >>
> >> The point that's interesting here is what do we do with the two bitmaps
> >> at this point? The data delta can be discarded (this was after all just
> >> a lightweight read-only point-in-time snapshot) but the bitmap data
> >> needs to be dealt with.
> >>
> >> (A) In the case of "User made a new incremental backup," the bitmap that
> >> got forked off to serve the NBD read should be discarded.
> >>
> >> (B) In the case of "User just wanted to look around," the bitmap should
> >> be merged back into the bitmap it was forked from.
> >>
> >> I don't advise a hybrid where "User copied some data, but not all" where
> >> we need to partially clear *and* merge, but conceivably this could
> >> happen, because the things we don't want to happen always will.
> >>
> >> At this point maybe it's becoming obvious that actually it would be very
> >> prudent to allow the NBD client itself to inform QEMU via the NBD
> >> protocol which extents/blocks/(etc) that it is "done" with.
> >>
> >> Maybe it 

Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-02 Thread Wouter Verhelst
Hi Vladimir,

On Thu, Dec 01, 2016 at 02:26:28PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 01.12.2016 13:14, Wouter Verhelst wrote:
[...]
> > -- `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts
> > +- `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts
> > selected by the query string is added to the list of existing
> > -  allocation contexts.
> 
> If I understand correctly, it should be not 'existing', but 'exporting'. 
> So there are several contexts, server knows about. They are definitely 
> exists. Some of them may be selected (by client) for export (to client, 
> through get_block_status).
> 
> so, what about 'list of metadata contexts to export' or something like this?

Yes, good idea. Thanks.

[...]
> > +The "BASE:allocation" metadata context is the basic "exists at all"
> > +metadata context. If an extent is marked with `NBD_STATE_HOLE` at that
> > +context, this means that the given extent is not allocated in the
> > +backend storage, and that writing to the extent MAY result in the ENOSPC
> > +error. This supports sparse file semantics on the server side. If a
> > +server has only one metadata context (the default), then writing to an
> > +extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.
> 
> this dependence looks strange. user defined metadata, why it affects 
> allocation?

The reference nbd-server implementation (i.e., the one that accompanies
this document) has a "copy-on-write" mode in which modifications are
written to a separate file. This separate file can easily be a sparse
file.

Let's say you have an export which is fully allocated; the
BASE:allocation state would have STATE_HOLE cleared. However, when
writing to a particular block that has not yet been written to during
that session, the copy-on-write mode would have to allocate a new block
in the copy-on-write file, which may fail with ENOSPC.

Perhaps the above paragraph could be updated in the sense that it SHOULD
NOT fail in other cases, but that it may depend on the semantics of the
other metadata contexts, whether active (i.e., selected with
NBD_OPT_META_CONTEXT) or not.

Side note: that does mean that some metadata contexts may be specific to
particular exports (e.g., you may have a list of named snapshots to
select, and which ones would exist would depend on the specific export
chosen). I guess that means the metadata contexts should have the name
of the export, too.

> At least, 'only one' is not descriptive, it would be better to mention
> 'BASE:allocation' name.

Yes, that would make things clearer, indeed.

> (I hope, I can ask server to export only dirty_bitmap context, not
> exporting allocation?

That was the point of naming that context and making it selectable :-)

> In that case this 'only one' would be dirty_bitmap)
> 
> >   For all other cases, this specification requires no specific semantics
> 
> what are 'other cases'? For all other metadata contexts? Or for all 
> cases when we have more than one context?

Other metadata contexts. I'll rephrease it in that sense.

> > -of allocation contexts. Implementations could support allocation
> > +of metadata contexts. Implementations could support metadata
> >   contexts with semantics like the following:
> >   
> > -- Incremental snapshots; if a block is allocated in one allocation
> > +- Incremental snapshots; if a block is allocated in one metadata
> > context, that implies that it is also allocated in the next level up.
> >   - Various bits of data about the backend of the storage; e.g., if the
> > -  storage is written on a RAID array, an allocation context could
> > +  storage is written on a RAID array, a metadata context could
> > return information about the redundancy level of a given extent
> >   - If the backend implements a write-through cache of some sort, or
> > -  synchronises with other servers, an allocation context could state
> > -  that an extent is "allocated" once it has reached permanent storage
> > +  synchronises with other servers, a metadata context could state
> > +  that an extent is "active" once it has reached permanent storage
> > and/or is synchronized with other servers.
> 
> Incremental snapshots sounds strange for me. Snapshots are just 
> snapshots.. Backup may be incremental, but it is not about snapshots.. I 
> think this example may be safely deleted from the spec.

Yeah, I'll just remove all the examples; they're not really critical,
anyway, and might indeed confuse people.

[...]
> >   Servers MUST return an `NBD_REPLY_TYPE_BLOCK_STATUS` chunk for every
> > -allocation context ID, except if the semantics of particular
> > -allocation contexts mean that the information for one allocation
> > -context is implied by the information for another.
> > +metadata context ID, except if the semantics of particular
> > +metadata contexts mean that the information for one active metadata
> > +context is implied by the information for 

Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-01 Thread Vladimir Sementsov-Ogievskiy

01.12.2016 13:14, Wouter Verhelst wrote:

Here's another update.

Changes since previous version:
- Rename "allocation context" to "metadata context"
- Stop making metadata context 0 be special; instead, name it
   "BASE:allocation" and allow it to be selected like all other contexts.
- Clarify in a bit more detail when a server MAY omit metadata
   information on a particular metadata context (i.e., only if another
   metadata context that we actually got information for implies it can't
   have meaning). This was always meant that way, but the spec could have
   been a bit more explicit about it.
- Change one SHOULD to a MUST, where it should not have been a SHOULD
   in the first place.

(This applies on top of my previous patch)

Applied and visible at the usual place.

diff --git a/doc/proto.md b/doc/proto.md
index fe7ae53..9c0981f 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -869,16 +869,16 @@ of the newstyle negotiation.
  
  Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
  
-- `NBD_OPT_ALLOC_CONTEXT` (10)

+- `NBD_OPT_META_CONTEXT` (10)
  
-Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context,

+Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
  followed by an `NBD_REP_ACK`. If a server replies to such a request
  with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
  commands during the transmission phase.
  
  If the query string is syntactically invalid, the server SHOULD send

  `NBD_REP_ERR_INVALID`. If the query string is syntactically valid
-but finds no allocation contexts, the server MUST send a single
+but finds no metadata contexts, the server MUST send a single
  reply of type `NBD_REP_ACK`.
  
  This option MUST NOT be requested unless structured replies have

@@ -887,9 +887,9 @@ of the newstyle negotiation.
  
  Data:

  - 32 bits, type
-- String, query to select a subset of the available allocation
+- String, query to select a subset of the available metadata
contexts. If this is not specified (i.e., length is 4 and no
-  command is sent), then the server MUST send all the allocation
+  command is sent), then the server MUST send all the metadata
contexts it knows about. If specified, this query string MUST
start with a name that uniquely identifies a server
implementation; e.g., the reference implementation that
@@ -897,18 +897,22 @@ of the newstyle negotiation.
with 'nbd-server:'
  
  The type may be one of:

-- `NBD_ALLOC_LIST_CONTEXT` (1): the list of allocation contexts
+- `NBD_META_LIST_CONTEXT` (1): the list of metadata contexts
selected by the query string is returned to the client without
-  changing any state (i.e., this does not add allocation contexts
+  changing any state (i.e., this does not add metadata contexts
for further usage).
-- `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts
+- `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts
selected by the query string is added to the list of existing


If I understand correctly, it should be not 'existing', but 'exporting'. 
So there are several contexts, server knows about. They are definitely 
exists. Some of them may be selected (by client) for export (to client, 
through get_block_status).


so, what about 'list of metadata contexts to export' or something like this?


-  allocation contexts.
-- `NBD_ALLOC_DEL_CONTEXT` (3): the list of allocation contexts
+  metadata contexts.
+- `NBD_META_DEL_CONTEXT` (3): the list of metadata contexts
selected by the query string is removed from the list of used
-  allocation contexts. Servers SHOULD NOT reuse existing allocation
+  metadata contexts. Servers SHOULD NOT reuse existing metadata
context IDs.
  
+The syntax of the query string is not specified, except that

+implementations MUST support adding and removing individual metadata
+contexts by simply listing their names.
+
   Option reply types
  
  These values are used in the "reply type" field, sent by the server

@@ -920,7 +924,7 @@ during option haggling in the fixed newstyle negotiation.
  information is available, or when sending data related to the option
  (in the case of `NBD_OPT_LIST`) has finished. No data.
  
-* `NBD_REP_SERVER` (2)

+- `NBD_REP_SERVER` (2)
  
  A description of an export. Data:
  
@@ -935,21 +939,20 @@ during option haggling in the fixed newstyle negotiation.

particular client request, this field is defined to be a string
suitable for direct display to a human being.
  
-* `NBD_REP_INFO` (3)

+- `NBD_REP_INFO` (3)
  
  Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
  
-- `NBD_REP_ALLOC_CONTEXT` (4)

+- `NBD_REP_META_CONTEXT` (4)
  
-  

Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-01 Thread Wouter Verhelst
Here's another update.

Changes since previous version:
- Rename "allocation context" to "metadata context"
- Stop making metadata context 0 be special; instead, name it
  "BASE:allocation" and allow it to be selected like all other contexts.
- Clarify in a bit more detail when a server MAY omit metadata
  information on a particular metadata context (i.e., only if another
  metadata context that we actually got information for implies it can't
  have meaning). This was always meant that way, but the spec could have
  been a bit more explicit about it.
- Change one SHOULD to a MUST, where it should not have been a SHOULD
  in the first place.

(This applies on top of my previous patch)

Applied and visible at the usual place.

diff --git a/doc/proto.md b/doc/proto.md
index fe7ae53..9c0981f 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -869,16 +869,16 @@ of the newstyle negotiation.
 
 Defined by the experimental `INFO` 
[extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
-- `NBD_OPT_ALLOC_CONTEXT` (10)
+- `NBD_OPT_META_CONTEXT` (10)
 
-Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context,
+Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
 followed by an `NBD_REP_ACK`. If a server replies to such a request
 with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
 commands during the transmission phase.
 
 If the query string is syntactically invalid, the server SHOULD send
 `NBD_REP_ERR_INVALID`. If the query string is syntactically valid
-but finds no allocation contexts, the server MUST send a single
+but finds no metadata contexts, the server MUST send a single
 reply of type `NBD_REP_ACK`.
 
 This option MUST NOT be requested unless structured replies have
@@ -887,9 +887,9 @@ of the newstyle negotiation.
 
 Data:
 - 32 bits, type
-- String, query to select a subset of the available allocation
+- String, query to select a subset of the available metadata
   contexts. If this is not specified (i.e., length is 4 and no
-  command is sent), then the server MUST send all the allocation
+  command is sent), then the server MUST send all the metadata
   contexts it knows about. If specified, this query string MUST
   start with a name that uniquely identifies a server
   implementation; e.g., the reference implementation that
@@ -897,18 +897,22 @@ of the newstyle negotiation.
   with 'nbd-server:'
 
 The type may be one of:
-- `NBD_ALLOC_LIST_CONTEXT` (1): the list of allocation contexts
+- `NBD_META_LIST_CONTEXT` (1): the list of metadata contexts
   selected by the query string is returned to the client without
-  changing any state (i.e., this does not add allocation contexts
+  changing any state (i.e., this does not add metadata contexts
   for further usage).
-- `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts
+- `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts
   selected by the query string is added to the list of existing
-  allocation contexts.
-- `NBD_ALLOC_DEL_CONTEXT` (3): the list of allocation contexts
+  metadata contexts.
+- `NBD_META_DEL_CONTEXT` (3): the list of metadata contexts
   selected by the query string is removed from the list of used
-  allocation contexts. Servers SHOULD NOT reuse existing allocation
+  metadata contexts. Servers SHOULD NOT reuse existing metadata
   context IDs.
 
+The syntax of the query string is not specified, except that
+implementations MUST support adding and removing individual metadata
+contexts by simply listing their names.
+
  Option reply types
 
 These values are used in the "reply type" field, sent by the server
@@ -920,7 +924,7 @@ during option haggling in the fixed newstyle negotiation.
 information is available, or when sending data related to the option
 (in the case of `NBD_OPT_LIST`) has finished. No data.
 
-* `NBD_REP_SERVER` (2)
+- `NBD_REP_SERVER` (2)
 
 A description of an export. Data:
 
@@ -935,21 +939,20 @@ during option haggling in the fixed newstyle negotiation.
   particular client request, this field is defined to be a string
   suitable for direct display to a human being.
 
-* `NBD_REP_INFO` (3)
+- `NBD_REP_INFO` (3)
 
 Defined by the experimental `INFO` 
[extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
-- `NBD_REP_ALLOC_CONTEXT` (4)
+- `NBD_REP_META_CONTEXT` (4)
 
-A description of an allocation context. Data:
+A description of a metadata context. Data:
 
-- 32 bits, NBD allocation context ID. If the request was NOT of type
-  `NBD_ALLOC_LIST_CONTEXT`, this field MUST NOT be zero.
-- String, name of the allocation context. This is not required to be
+- 32 bits, NBD metadata context ID.
+- String, name of the metadata context. This is not required to be
   a 

Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-30 Thread Sergey Talantov
Hi, Wouter!

> Actually, come to think of that. What is the exact use case for this thing? I 
> understand you're trying to create incremental backups of things, which would 
> imply you don't write from the client that is getting the ?
> block status thingies, right?

Overall, the most desired use case for this NBD extension is to allow 3-rd 
party software to make incremental backups.
Acronis (vendor of backup solutions) would support qemu backup if block status 
is provided.


-Original Message-
From: Wouter Verhelst [mailto:w...@uter.be] 
Sent: Sunday, November 27, 2016 22:17
To: Vladimir Sementsov-Ogievskiy 
Cc: nbd-gene...@lists.sourceforge.net; kw...@redhat.com; qemu-devel@nongnu.org; 
pborzen...@virtuozzo.com; stefa...@redhat.com; pbonz...@redhat.com; 
m...@pengutronix.de; d...@openvz.org
Subject: Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

Hi Vladimir,

Quickly: the reason I haven't merged this yes is twofold:
- I wasn't thrilled with the proposal at the time. It felt a bit
  hackish, and bolted onto NBD so you could use it, but without defining
  everything in the NBD protocol. "We're reading some data, but it's not
  about you". That didn't feel right
- There were a number of questions still unanswered (you're answering a
  few below, so that's good).

For clarity, I have no objection whatsoever to adding more commands if they're 
useful, but I would prefer that they're also useful with NBD on its own, i.e., 
without requiring an initiation or correlation of some state through another 
protocol or network connection or whatever. If that's needed, that feels like I 
didn't do my job properly, if you get my point.

On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> With the availability of sparse storage formats, it is often needed to 
> query status of a particular range and read only those blocks of data 
> that are actually present on the block device.
> 
> To provide such information, the patch adds the BLOCK_STATUS extension 
> with one new NBD_CMD_BLOCK_STATUS command, a new structured reply 
> chunk format, and a new transmission flag.
> 
> There exists a concept of data dirtiness, which is required during, 
> for example, incremental block device backup. To express this concept 
> via NBD protocol, this patch also adds a flag to NBD_CMD_BLOCK_STATUS 
> to request dirtiness information rather than provisioning information; 
> however, with the current proposal, data dirtiness is only useful with 
> additional coordination outside of the NBD protocol (such as a way to 
> start and stop the server from tracking dirty sectors).  Future NBD 
> extensions may add commands to control dirtiness through NBD.
> 
> Since NBD protocol has no notion of block size, and to mimic SCSI "GET 
> LBA STATUS" command more closely, it has been chosen to return a list 
> of extents in the response of NBD_CMD_BLOCK_STATUS command, instead of 
> a bitmap.
> 
> CC: Pavel Borzenkov 
> CC: Denis V. Lunev 
> CC: Wouter Verhelst 
> CC: Paolo Bonzini 
> CC: Kevin Wolf 
> CC: Stefan Hajnoczi 
> Signed-off-by: Eric Blake 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
> 
> v3:
> 
> Hi all. This is almost a resend of v2 (by Eric Blake), The only change 
> is removing the restriction, that sum of status descriptor lengths 
> must be equal to requested length. I.e., let's permit the server to 
> replay with less data than required if it wants.

Reasonable, yes. The length that the client requests should be a maximum (i.e.
"I'm interested in this range"), not an exact request.

> Also, bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now  
> NBD_FLAG_CAN_MULTI_CONN in master branch.

Right.

> And, finally, I've rebased this onto current state of 
> extension-structured-reply branch (which itself should be rebased on 
> master IMHO).

Probably a good idea, given the above.

> By this resend I just want to continue the diqussion, started about 
> half a year ago. Here is a summary of some questions and ideas from v2
> diqussion:
> 
> 1. Q: Synchronisation. Is such data (dirty/allocated) reliable? 
>A: This all is for read-only disks, so the data is static and unchangeable.

I think we should declare that it's up to the client to ensure no other writes 
happen without its knowledge. This may be because the client and server 
communicate out of band about state changes, or because the client somehow 
knows that it's the only writer, or whatever.

We can easily do that by declaring that the result of that command only talks 
about *current* state, and that concurrent writes by different clients may 
invalidate the state. This is true for NBD in general (i.e., concurrent read or 
write commands from other clients may confuse file systems on top of NBD), so 
it doesn't 

Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Wouter Verhelst
On Tue, Nov 29, 2016 at 06:07:56PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 29.11.2016 17:52, Alex Bligh wrote:
> > Vladimir,
> >> if the bitmap is 010101010101 we will have too many descriptors.
> >> For example, 16tb disk, 64k granularity -> 2G of descriptors
> >> payload.
> > Yep. And the cost of consuming and retrying is quite high. One
> > option would be for the client to realise this is a possibility, and
> > not request the entire extent map for a 16TB disk, as it might be
> > very large! Even if the client worked at e.g. a 64MB level (where
> > they'd get a maximum of 1024 extents per reply), this isn't going to
> > noticeably increase the round trip timing. One issue here is that to
> > determine a 'reasonable' size, the client needs to know the minimum
> > length of any extent.
> 
> and with this approach we will in turn have overhead of too many 
> requests for  or  bitmaps.

This is why my proposal suggests the server may abort the sent extents
after X extents (sixteen, but that number can certainly change) have
been sent. After all, the server will have a better view on what's going
to be costly in terms of "number of extents".

> > I think the answer is probably a 'maximum number of extents' in the request 
> > packet.
> >
> > Of course with statuses in extent, the final extent could be
> > represented as 'I don't know, break this bit into a separate
> > request' status.
> >
> 
> With such predefined status, we can postpone creating extended requests, 
> have number of extents restricted by server and have sum of extents 
> lengths be equal to request length.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12



Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Wouter Verhelst
Hi Vladimir,

On Tue, Nov 29, 2016 at 03:41:10PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Hi,
> 
> 29.11.2016 13:50, Wouter Verhelst wrote:
> 
> Hi,
> 
> On Mon, Nov 28, 2016 at 06:33:24PM +0100, Wouter Verhelst wrote:
> 
> However, I'm arguing that if we're going to provide information about
> snapshots, we should be able to properly refer to these snapshots from
> within an NBD context. My previous mail suggested adding a negotiation
> message that would essentially ask the server "tell me about the
> snapshots you know about", giving them an NBD identifier in the 
> process
> (accompanied by a "foreign" identifier that is decidedly *not* an NBD
> identifier and that could be used to match the NBD identifier to
> something implementation-defined). This would be read-only 
> information;
> the client cannot ask the server to create new snapshots. We can then
> later in the protocol refer to these snapshots by way of that NBD
> identifier.
> 
> To make this a bit more concrete, I've changed the proposal like so:
> 
[...]
> +# Allocation contexts
> +
> +Allocation context 0 is the basic "exists at all" allocation context. If
> +an extent is not allocated at allocation context 0, it MUST NOT be
> +listed as allocated at another allocation context. This supports sparse
> 
> 
> allocated here is range with unset NBD_STATE_HOLE bit?

Eh, yes. I should clarify that a bit further (no time right now though,
but patches are certainly welcome).

> +file semantics on the server side. If a server has only one allocation
> +context (the default), then writing to an extent which is allocated in
> +that allocation context 0 MUST NOT fail with ENOSPC.
> +
> +For all other cases, this specification requires no specific semantics
> +of allocation contexts. Implementations could support allocation
> +contexts with semantics like the following:
> +
> +- Incremental snapshots; if a block is allocated in one allocation
> +  context, that implies that it is also allocated in the next level up.
> +- Various bits of data about the backend of the storage; e.g., if the
> +  storage is written on a RAID array, an allocation context could
> +  return information about the redundancy level of a given extent
> +- If the backend implements a write-through cache of some sort, or
> +  synchronises with other servers, an allocation context could state
> +  that an extent is "allocated" once it has reached permanent storage
> +  and/or is synchronized with other servers.
> +
> +The only requirement of an allocation context is that it MUST be
> +representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`.
> +
> +Likewise, the syntax of query strings is not specified by this document.
> +
> +Server implementations SHOULD document their syntax for query strings
> +and semantics for resulting allocation contexts in a document like this
> +one.
> 
> 
> IMHO, redundant paragraph for this spec.

The "SHOULD document" one? I want that there at any rate, just to make
clear that it's probably a good thing to have it (but no, it's not part
of the formal protocol spec).

> +
>  ### Transmission phase
> 
>   Flag fields
> @@ -983,6 +1065,9 @@ valid may depend on negotiation during the handshake 
> phase.
> content chunk in reply.  MUST NOT be set unless the transmission
> flags include `NBD_FLAG_SEND_DF`.  Use of this flag MAY trigger an
> `EOVERFLOW` error chunk, if the request length is too large.
> +- bit 3, `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`. If
> +  set, the client is interested in only one extent per allocation
> +  context.
> 
>  # Structured reply flags
> 
> @@ -1371,38 +1456,48 @@ adds a new `NBD_CMD_BLOCK_STATUS` command which 
> returns a list of
>  ranges with their respective states.  This extension is not available
>  unless the client also negotiates the `STRUCTURED_REPLY` extension.
> 
> -* `NBD_FLAG_SEND_BLOCK_STATUS`
> -
> -The server SHOULD set this transmission flag to 1 if structured
> -replies have been negotiated, and the `NBD_CMD_BLOCK_STATUS`
> -request is supported.
> -
>  * `NBD_REPLY_TYPE_BLOCK_STATUS`
> 
> -*length* MUST be a positive integer multiple of 8.  This reply
> +*length* MUST be 4 + (a positive integer multiple of 8).  This reply
>  represents a series of consecutive block descriptors where the sum
>  of the lengths of the descriptors MUST not be greater than the
> -length of the original request.  This chunk type MUST appear at most
> -once in a structured reply. Valid as a reply to
> +length of the original request. This chunk type MUST appear at most
> +once per allocation ID 

Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Alex Bligh

> On 29 Nov 2016, at 10:50, Wouter Verhelst  wrote:
> 
> +- `NBD_OPT_ALLOC_CONTEXT` (10)
> +
> +Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context,
> +followed by an `NBD_REP_ACK`. If a server replies to such a request
> +with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
> +commands during the transmission phase.

I haven't read this in detail but this seems to me to be similar to
the idea I just posted (sorry - kept getting interrupted whilst writing
the email) re multiple bitmaps?

But the name 'ALLOC_CONTEXT' is a bit weird. Why not call 'metadata
bitmaps' or 'metadata extents' or something. Metadata seems right as
it's data about data.

> +# Allocation contexts
> +
> +Allocation context 0 is the basic "exists at all" allocation context. If
> +an extent is not allocated at allocation context 0, it MUST NOT be
> +listed as allocated at another allocation context. This supports sparse
> +file semantics on the server side. If a server has only one allocation
> +context (the default), then writing to an extent which is allocated in
> +that allocation context 0 MUST NOT fail with ENOSPC.
> +
> +For all other cases, this specification requires no specific semantics
> +of allocation contexts. Implementations could support allocation
> +contexts with semantics like the following:
> +
> +- Incremental snapshots; if a block is allocated in one allocation
> +  context, that implies that it is also allocated in the next level up.
> +- Various bits of data about the backend of the storage; e.g., if the
> +  storage is written on a RAID array, an allocation context could
> +  return information about the redundancy level of a given extent
> +- If the backend implements a write-through cache of some sort, or
> +  synchronises with other servers, an allocation context could state
> +  that an extent is "allocated" once it has reached permanent storage
> +  and/or is synchronized with other servers.
> +
> +The only requirement of an allocation context is that it MUST be
> +representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`.
> +
> +Likewise, the syntax of query strings is not specified by this document.
> +
> +Server implementations SHOULD document their syntax for query strings
> +and semantics for resulting allocation contexts in a document like this
> +one.
> +

But this seems strange. Whether something is 'allocated' seems orthogonal
to me to whether it needs backing up. Even the fact it's been zeroed
(now a hole) might need backing up).

So don't we need multiple independent lists of extents? Of course a server
might *implement* them under the hood with separate bitmaps or one big
bitmap, or no bitmap at all (for instance using file extents on POSIX).



-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Vladimir Sementsov-Ogievskiy

Hi,

29.11.2016 13:50, Wouter Verhelst wrote:

Hi,

On Mon, Nov 28, 2016 at 06:33:24PM +0100, Wouter Verhelst wrote:

However, I'm arguing that if we're going to provide information about
snapshots, we should be able to properly refer to these snapshots from
within an NBD context. My previous mail suggested adding a negotiation
message that would essentially ask the server "tell me about the
snapshots you know about", giving them an NBD identifier in the process
(accompanied by a "foreign" identifier that is decidedly *not* an NBD
identifier and that could be used to match the NBD identifier to
something implementation-defined). This would be read-only information;
the client cannot ask the server to create new snapshots. We can then
later in the protocol refer to these snapshots by way of that NBD
identifier.

To make this a bit more concrete, I've changed the proposal like so:

 From bc6d0df4156e670be7b6adea4f2813f44ffa7202 Mon Sep 17 00:00:00 2001
From: Wouter Verhelst 
Date: Tue, 29 Nov 2016 11:46:04 +0100
Subject: [PATCH] Update with allocation contexts etc

Signed-off-by: Wouter Verhelst 
---
  doc/proto.md | 210 +--
  1 file changed, 145 insertions(+), 65 deletions(-)

diff --git a/doc/proto.md b/doc/proto.md
index dfdd06d..fe7ae53 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -768,8 +768,6 @@ The field has the following format:
to that command to the client. In the absense of this flag, clients
SHOULD NOT multiplex their commands over more than one connection to
the export.
-- bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`; defined by the experimental
-  `BLOCK_STATUS` extension; see below.
  
  Clients SHOULD ignore unknown flags.
  
@@ -871,6 +869,46 @@ of the newstyle negotiation.
  
  Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
  
+- `NBD_OPT_ALLOC_CONTEXT` (10)

+
+Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context,
+followed by an `NBD_REP_ACK`. If a server replies to such a request
+with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
+commands during the transmission phase.
+
+If the query string is syntactically invalid, the server SHOULD send
+`NBD_REP_ERR_INVALID`. If the query string is syntactically valid
+but finds no allocation contexts, the server MUST send a single
+reply of type `NBD_REP_ACK`.
+
+This option MUST NOT be requested unless structured replies have
+been negotiated first. If a client attempts to do so, a server
+SHOULD send `NBD_REP_ERR_INVALID`.
+
+Data:
+- 32 bits, type
+- String, query to select a subset of the available allocation
+  contexts. If this is not specified (i.e., length is 4 and no
+  command is sent), then the server MUST send all the allocation
+  contexts it knows about. If specified, this query string MUST
+  start with a name that uniquely identifies a server
+  implementation; e.g., the reference implementation that
+  accompanies this document would support query strings starting
+  with 'nbd-server:'
+
+The type may be one of:
+- `NBD_ALLOC_LIST_CONTEXT` (1): the list of allocation contexts
+  selected by the query string is returned to the client without
+  changing any state (i.e., this does not add allocation contexts
+  for further usage).
+- `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts
+  selected by the query string is added to the list of existing
+  allocation contexts.
+- `NBD_ALLOC_DEL_CONTEXT` (3): the list of allocation contexts
+  selected by the query string is removed from the list of used
+  allocation contexts. Servers SHOULD NOT reuse existing allocation
+  context IDs.
+
   Option reply types
  
  These values are used in the "reply type" field, sent by the server

@@ -901,6 +939,18 @@ during option haggling in the fixed newstyle negotiation.
  
  Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
  
+- `NBD_REP_ALLOC_CONTEXT` (4)

+
+A description of an allocation context. Data:
+
+- 32 bits, NBD allocation context ID. If the request was NOT of type
+  `NBD_ALLOC_LIST_CONTEXT`, this field MUST NOT be zero.
+- String, name of the allocation context. This is not required to be
+  a human-readable string, but it MUST be valid UTF-8 data.
+
+Allocation context ID 0 is implied, and always exists. It cannot be
+removed.
+
  There are a number of error reply types, all of which are denoted by
  having bit 31 set. All error replies MAY have some data set, in which
  case that data is an error message string suitable for display to the user.
@@ -938,15 +988,47 @@ case that data is an error message string suitable for 
display to the user.
  
  Defined by the experimental `INFO` 

Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Vladimir Sementsov-Ogievskiy

29.11.2016 13:18, Kevin Wolf wrote:

Am 27.11.2016 um 20:17 hat Wouter Verhelst geschrieben:

3. Q: selecting of dirty bitmap to export
A: several variants:
   1: id of bitmap is in flags field of request
   pros: - simple
   cons: - it's a hack. flags field is for other uses.
 - we'll have to map bitmap names to these "ids"
   2: introduce extended nbd requests with variable length and exploit this
  feature for BLOCK_STATUS command, specifying bitmap identifier.
  pros: - looks like a true way
  cons: - we have to create additional extension
- possible we have to create a map,
  { <=> }
   3: exteranl tool should select which bitmap to export. So, in case of 
Qemu
  it should be something like qmp command block-export-dirty-bitmap.
  pros: - simple
- we can extend it to behave like (2) later
  cons: - additional qmp command to implement (possibly, the lesser 
evil)
  note: Hmm, external tool can make chose between allocated/dirty data 
too,
so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all.

Downside of 3, though, is that it moves the definition of what the
different states mean outside of the NBD protocol (i.e., the protocol
messages are not entirely defined anymore, and their meaning depends on
the clients and servers in use).

Another point to consider is that option 3 doesn't allow you to access
two (or more) different bitmaps from the client without using the side
channel all the time to switch back and forth between them and having to
drain the request queue each time to avoid races.

In general, if we have something that "looks like the true way", I'd
advocate to choose this option. Experience tells that we'd regret
anything simpler in a year or two, and then we'll have to do the real
thing anyway, but still need to support the quick hack for
compatibility.


5. Number of status descriptors, sent by server, should be restricted
variants:
1: just allow server to restrict this as it wants (which was done in v3)
2: (not excluding 1). Client specifies somehow the maximum for number
   of descriptors.
   2.1: add command flag, which will request only one descriptor
(otherwise, no restrictions from the client)
   2.2: again, introduce extended nbd requests, and add field to
specify this maximum

I think having a flag which requests just one descriptor can be useful,
but I'm hesitant to add it unless it's actually going to be used; so in
other words, I'll leave the decision on that bit to you.

The native qemu block layer interface returns the status of only one
contiguous chunks, so the easiest way to implement the NBD block driver
in qemu would always use that bit.

On the other hand, it would be possible for the NBD block driver in qemu
to cache the rest of the response internally and answer the next request
from the cache instead of sending a new request over the network. Maybe
that's what it should be doing anyway for good performance.


Also, an idea on 2-4:

 As we say, that dirtiness is unknown for NBD, and external tool
 should specify, manage and understand, which data is actually
 transmitted, why not just call it user_data and leave status field
 of reply chunk unspecified in this case?

 So, I propose one flag for NBD_CMD_BLOCK_STATUS:
 NBD_FLAG_STATUS_USER. If it is clear, than behaviour is defined by
 Eric's 'Block provisioning status' paragraph.  If it is set, we just
 leave status field for some external... protocol? Who knows, what is
 this user data.

Yes, this sounds like a reasonable approach.

Makes sense to me, too.

However, if we have use for a single NBD_FLAG_STATUS_USER bit, we also
have use for a second one. If we go with one of the options where the
bitmap is selected per command, we're fine because you can simply move
the second bit to a different bitmap and do two requests. If we have
only a single active bitmap at a time, though, this feels like an actual
problem.


Another idea, about backups themselves:

 Why do we need allocated/zero status for backup? IMHO we don't.

Well, I've been thinking so all along, but then I don't really know what
it is, in detail, that you want to do :-)

I think we do. A block can be dirtied by discarding/write_zero-ing it.
Then we want the dirty bit to know that we need to include this block in
the incremental backup, but we also want to know that we don't actually
have to transfer the data in it.


And we will know that automatically, by using structured read command, 
so separate call to get_block_status is not needed.




Kevin



--
Best regards,
Vladimir




Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Wouter Verhelst
Hi,

On Mon, Nov 28, 2016 at 06:33:24PM +0100, Wouter Verhelst wrote:
> However, I'm arguing that if we're going to provide information about
> snapshots, we should be able to properly refer to these snapshots from
> within an NBD context. My previous mail suggested adding a negotiation
> message that would essentially ask the server "tell me about the
> snapshots you know about", giving them an NBD identifier in the process
> (accompanied by a "foreign" identifier that is decidedly *not* an NBD
> identifier and that could be used to match the NBD identifier to
> something implementation-defined). This would be read-only information;
> the client cannot ask the server to create new snapshots. We can then
> later in the protocol refer to these snapshots by way of that NBD
> identifier.

To make this a bit more concrete, I've changed the proposal like so:

>From bc6d0df4156e670be7b6adea4f2813f44ffa7202 Mon Sep 17 00:00:00 2001
From: Wouter Verhelst 
Date: Tue, 29 Nov 2016 11:46:04 +0100
Subject: [PATCH] Update with allocation contexts etc

Signed-off-by: Wouter Verhelst 
---
 doc/proto.md | 210 +--
 1 file changed, 145 insertions(+), 65 deletions(-)

diff --git a/doc/proto.md b/doc/proto.md
index dfdd06d..fe7ae53 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -768,8 +768,6 @@ The field has the following format:
   to that command to the client. In the absense of this flag, clients
   SHOULD NOT multiplex their commands over more than one connection to
   the export.
-- bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`; defined by the experimental
-  `BLOCK_STATUS` extension; see below.
 
 Clients SHOULD ignore unknown flags.
 
@@ -871,6 +869,46 @@ of the newstyle negotiation.
 
 Defined by the experimental `INFO` 
[extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
+- `NBD_OPT_ALLOC_CONTEXT` (10)
+
+Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context,
+followed by an `NBD_REP_ACK`. If a server replies to such a request
+with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
+commands during the transmission phase.
+
+If the query string is syntactically invalid, the server SHOULD send
+`NBD_REP_ERR_INVALID`. If the query string is syntactically valid
+but finds no allocation contexts, the server MUST send a single
+reply of type `NBD_REP_ACK`.
+
+This option MUST NOT be requested unless structured replies have
+been negotiated first. If a client attempts to do so, a server
+SHOULD send `NBD_REP_ERR_INVALID`.
+
+Data:
+- 32 bits, type
+- String, query to select a subset of the available allocation
+  contexts. If this is not specified (i.e., length is 4 and no
+  command is sent), then the server MUST send all the allocation
+  contexts it knows about. If specified, this query string MUST
+  start with a name that uniquely identifies a server
+  implementation; e.g., the reference implementation that
+  accompanies this document would support query strings starting
+  with 'nbd-server:'
+
+The type may be one of:
+- `NBD_ALLOC_LIST_CONTEXT` (1): the list of allocation contexts
+  selected by the query string is returned to the client without
+  changing any state (i.e., this does not add allocation contexts
+  for further usage).
+- `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts
+  selected by the query string is added to the list of existing
+  allocation contexts.
+- `NBD_ALLOC_DEL_CONTEXT` (3): the list of allocation contexts
+  selected by the query string is removed from the list of used
+  allocation contexts. Servers SHOULD NOT reuse existing allocation
+  context IDs.
+
  Option reply types
 
 These values are used in the "reply type" field, sent by the server
@@ -901,6 +939,18 @@ during option haggling in the fixed newstyle negotiation.
 
 Defined by the experimental `INFO` 
[extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
+- `NBD_REP_ALLOC_CONTEXT` (4)
+
+A description of an allocation context. Data:
+
+- 32 bits, NBD allocation context ID. If the request was NOT of type
+  `NBD_ALLOC_LIST_CONTEXT`, this field MUST NOT be zero.
+- String, name of the allocation context. This is not required to be
+  a human-readable string, but it MUST be valid UTF-8 data.
+
+Allocation context ID 0 is implied, and always exists. It cannot be
+removed.
+
 There are a number of error reply types, all of which are denoted by
 having bit 31 set. All error replies MAY have some data set, in which
 case that data is an error message string suitable for display to the user.
@@ -938,15 +988,47 @@ case that data is an error message string suitable for 
display to the user.
 
 Defined by the experimental `INFO` 

Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Kevin Wolf
Am 27.11.2016 um 20:17 hat Wouter Verhelst geschrieben:
> > 3. Q: selecting of dirty bitmap to export
> >A: several variants:
> >   1: id of bitmap is in flags field of request
> >   pros: - simple
> >   cons: - it's a hack. flags field is for other uses.
> > - we'll have to map bitmap names to these "ids"
> >   2: introduce extended nbd requests with variable length and exploit 
> > this
> >  feature for BLOCK_STATUS command, specifying bitmap identifier.
> >  pros: - looks like a true way
> >  cons: - we have to create additional extension
> >- possible we have to create a map,
> >  { <=> }
> >   3: exteranl tool should select which bitmap to export. So, in case of 
> > Qemu
> >  it should be something like qmp command block-export-dirty-bitmap.
> >  pros: - simple
> >- we can extend it to behave like (2) later
> >  cons: - additional qmp command to implement (possibly, the lesser 
> > evil)
> >  note: Hmm, external tool can make chose between allocated/dirty 
> > data too,
> >so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all.
> 
> Downside of 3, though, is that it moves the definition of what the
> different states mean outside of the NBD protocol (i.e., the protocol
> messages are not entirely defined anymore, and their meaning depends on
> the clients and servers in use).

Another point to consider is that option 3 doesn't allow you to access
two (or more) different bitmaps from the client without using the side
channel all the time to switch back and forth between them and having to
drain the request queue each time to avoid races.

In general, if we have something that "looks like the true way", I'd
advocate to choose this option. Experience tells that we'd regret
anything simpler in a year or two, and then we'll have to do the real
thing anyway, but still need to support the quick hack for
compatibility.

> > 5. Number of status descriptors, sent by server, should be restricted
> >variants:
> >1: just allow server to restrict this as it wants (which was done in v3)
> >2: (not excluding 1). Client specifies somehow the maximum for number
> >   of descriptors.
> >   2.1: add command flag, which will request only one descriptor
> >(otherwise, no restrictions from the client)
> >   2.2: again, introduce extended nbd requests, and add field to
> >specify this maximum
> 
> I think having a flag which requests just one descriptor can be useful,
> but I'm hesitant to add it unless it's actually going to be used; so in
> other words, I'll leave the decision on that bit to you.

The native qemu block layer interface returns the status of only one
contiguous chunks, so the easiest way to implement the NBD block driver
in qemu would always use that bit.

On the other hand, it would be possible for the NBD block driver in qemu
to cache the rest of the response internally and answer the next request
from the cache instead of sending a new request over the network. Maybe
that's what it should be doing anyway for good performance.

> > Also, an idea on 2-4:
> > 
> > As we say, that dirtiness is unknown for NBD, and external tool
> > should specify, manage and understand, which data is actually
> > transmitted, why not just call it user_data and leave status field
> > of reply chunk unspecified in this case?
> > 
> > So, I propose one flag for NBD_CMD_BLOCK_STATUS:
> > NBD_FLAG_STATUS_USER. If it is clear, than behaviour is defined by
> > Eric's 'Block provisioning status' paragraph.  If it is set, we just
> > leave status field for some external... protocol? Who knows, what is
> > this user data.
> 
> Yes, this sounds like a reasonable approach.

Makes sense to me, too.

However, if we have use for a single NBD_FLAG_STATUS_USER bit, we also
have use for a second one. If we go with one of the options where the
bitmap is selected per command, we're fine because you can simply move
the second bit to a different bitmap and do two requests. If we have
only a single active bitmap at a time, though, this feels like an actual
problem.

> > Another idea, about backups themselves:
> > 
> > Why do we need allocated/zero status for backup? IMHO we don't.
> 
> Well, I've been thinking so all along, but then I don't really know what
> it is, in detail, that you want to do :-)

I think we do. A block can be dirtied by discarding/write_zero-ing it.
Then we want the dirty bit to know that we need to include this block in
the incremental backup, but we also want to know that we don't actually
have to transfer the data in it.

Kevin



Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Stefan Hajnoczi
On Mon, Nov 28, 2016 at 06:33:24PM +0100, Wouter Verhelst wrote:
> Hi Stefan,
> 
> On Mon, Nov 28, 2016 at 11:19:44AM +, Stefan Hajnoczi wrote:
> > On Sun, Nov 27, 2016 at 08:17:14PM +0100, Wouter Verhelst wrote:
> > > Quickly: the reason I haven't merged this yes is twofold:
> > > - I wasn't thrilled with the proposal at the time. It felt a bit
> > >   hackish, and bolted onto NBD so you could use it, but without defining
> > >   everything in the NBD protocol. "We're reading some data, but it's not
> > >   about you". That didn't feel right
> > >
> > > - There were a number of questions still unanswered (you're answering a
> > >   few below, so that's good).
> > > 
> > > For clarity, I have no objection whatsoever to adding more commands if
> > > they're useful, but I would prefer that they're also useful with NBD on
> > > its own, i.e., without requiring an initiation or correlation of some
> > > state through another protocol or network connection or whatever. If
> > > that's needed, that feels like I didn't do my job properly, if you get
> > > my point.
> > 
> > The out-of-band operations you are referring to are for dirty bitmap
> > management.  (The goal is to read out blocks that changed since the last
> > backup.)
> > 
> > The client does not access the live disk, instead it accesses a
> > read-only snapshot and the dirty information (so that it can copy out
> > only blocks that were written).  The client is allowed to read blocks
> > that are not dirty too.
> 
> I understood as much, yes.
> 
> > If you want to implement the whole incremental backup workflow in NBD
> > then the client would first have to connect to the live disk, set up
> > dirty tracking, create a snapshot export, and then connect to that
> > snapshot.
> > 
> > That sounds like a big feature set and I'd argue it's for the control
> > plane (storage API) and not the data plane (NBD).  There were
> > discussions about transferring the dirty information via the control
> > plane but it seems more appropriate to it in the data plane since it is
> > block-level information.
> 
> I agree that creating and managing snapshots is out of scope for NBD. The
> protocol is not set up for that.
> 
> However, I'm arguing that if we're going to provide information about
> snapshots, we should be able to properly refer to these snapshots from
> within an NBD context. My previous mail suggested adding a negotiation
> message that would essentially ask the server "tell me about the
> snapshots you know about", giving them an NBD identifier in the process
> (accompanied by a "foreign" identifier that is decidedly *not* an NBD
> identifier and that could be used to match the NBD identifier to
> something implementation-defined). This would be read-only information;
> the client cannot ask the server to create new snapshots. We can then
> later in the protocol refer to these snapshots by way of that NBD
> identifier.
> 
> My proposal also makes it impossible to get updates of newly created
> snapshots without disconnecting and reconnecting (due to the fact that
> you can't go from transmission back to negotiation), but I'm not sure
> that's a problem.
> 
> Doing so has two advantages:
> - If a client is accidentally (due to misconfiguration or implementation
>   bugs or whatnot) connecting to the wrong server after having created a
>   snapshot through a management protocol, we have an opportunity to
>   detect this error, due to the fact that the "foreign" identifiers
>   passed to the client during negotiation will not match with what the
>   client was expecting.
> - A future version of the protocol could possibly include an extended
>   version of the read command, allowing a client to read information
>   from multiple storage snapshots without requiring a reconnect, and
>   allowing current clients information about allocation status across
>   various snapshots (although a first implementation could very well
>   limit itself to only having one snapshot).

Sorry, I misunderstood you.

Snapshots are not very different from NBD exports.  Especially if the
storage system supports writeable-snapshot (aka cloning).  Should we
just used named exports?

Stefan


signature.asc
Description: PGP signature


Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-28 Thread John Snow

Hi Wouter,

Some of this mess may be partially my fault, but I have not been 
following the NBD extension proposals up until this point.


Are you familiar with the genesis behind this idea and what we are 
trying to accomplish in general?


We had the thought to propose an extension roughly similar to SCSI's 
'get lba status', but the existing status bits there did not correlate 
semantically to what we are hoping to convey. There was some debate over 
if it would be an abuse of protocol to attempt to use them as such.


For a quick recap, get lba status appears to offer three canonical statuses:

0h: "LBA extent is mapped, [...] or has an unknown state"
1h: "[...] LBA extent is deallocated"
2h: "[...] LBA extent is anchored"

My interpretation of mapped was simply that it was physically allocated, 
and 'deallocated' was simply unallocated.


(I uh, am not actually clear on what anchored means exactly.)

Either way, we felt at the time that it would be wrong to propose an 
analogue command for NBD and then immediately abuse the existing 
semantics, hence a new command like -- but not identical to -- the SCSI one.



On 11/27/2016 02:17 PM, Wouter Verhelst wrote:

Hi Vladimir,

Quickly: the reason I haven't merged this yes is twofold:
- I wasn't thrilled with the proposal at the time. It felt a bit
  hackish, and bolted onto NBD so you could use it, but without defining
  everything in the NBD protocol. "We're reading some data, but it's not
  about you". That didn't feel right
- There were a number of questions still unanswered (you're answering a
  few below, so that's good).

For clarity, I have no objection whatsoever to adding more commands if
they're useful, but I would prefer that they're also useful with NBD on
its own, i.e., without requiring an initiation or correlation of some
state through another protocol or network connection or whatever. If
that's needed, that feels like I didn't do my job properly, if you get
my point.

On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:

With the availability of sparse storage formats, it is often needed
to query status of a particular range and read only those blocks of
data that are actually present on the block device.

To provide such information, the patch adds the BLOCK_STATUS
extension with one new NBD_CMD_BLOCK_STATUS command, a new
structured reply chunk format, and a new transmission flag.

There exists a concept of data dirtiness, which is required
during, for example, incremental block device backup. To express
this concept via NBD protocol, this patch also adds a flag to
NBD_CMD_BLOCK_STATUS to request dirtiness information rather than
provisioning information; however, with the current proposal, data
dirtiness is only useful with additional coordination outside of
the NBD protocol (such as a way to start and stop the server from
tracking dirty sectors).  Future NBD extensions may add commands
to control dirtiness through NBD.

Since NBD protocol has no notion of block size, and to mimic SCSI
"GET LBA STATUS" command more closely, it has been chosen to return
a list of extents in the response of NBD_CMD_BLOCK_STATUS command,
instead of a bitmap.

CC: Pavel Borzenkov 
CC: Denis V. Lunev 
CC: Wouter Verhelst 
CC: Paolo Bonzini 
CC: Kevin Wolf 
CC: Stefan Hajnoczi 
Signed-off-by: Eric Blake 
Signed-off-by: Vladimir Sementsov-Ogievskiy 
---

v3:

Hi all. This is almost a resend of v2 (by Eric Blake), The only change is
removing the restriction, that sum of status descriptor lengths must be equal
to requested length. I.e., let's permit the server to replay with less data
than required if it wants.


Reasonable, yes. The length that the client requests should be a maximum (i.e.
"I'm interested in this range"), not an exact request.


Also, bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now
 NBD_FLAG_CAN_MULTI_CONN in master branch.


Right.


And, finally, I've rebased this onto current state of
extension-structured-reply branch (which itself should be rebased on
master IMHO).


Probably a good idea, given the above.


By this resend I just want to continue the diqussion, started about half
a year ago. Here is a summary of some questions and ideas from v2
diqussion:

1. Q: Synchronisation. Is such data (dirty/allocated) reliable?
   A: This all is for read-only disks, so the data is static and unchangeable.


I think we should declare that it's up to the client to ensure no other
writes happen without its knowledge. This may be because the client and
server communicate out of band about state changes, or because the
client somehow knows that it's the only writer, or whatever.

We can easily do that by declaring that the result of that command only
talks about *current* state, and that concurrent writes by different
clients may invalidate the state. 

Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-28 Thread Wouter Verhelst
Hi Stefan,

On Mon, Nov 28, 2016 at 11:19:44AM +, Stefan Hajnoczi wrote:
> On Sun, Nov 27, 2016 at 08:17:14PM +0100, Wouter Verhelst wrote:
> > Quickly: the reason I haven't merged this yes is twofold:
> > - I wasn't thrilled with the proposal at the time. It felt a bit
> >   hackish, and bolted onto NBD so you could use it, but without defining
> >   everything in the NBD protocol. "We're reading some data, but it's not
> >   about you". That didn't feel right
> >
> > - There were a number of questions still unanswered (you're answering a
> >   few below, so that's good).
> > 
> > For clarity, I have no objection whatsoever to adding more commands if
> > they're useful, but I would prefer that they're also useful with NBD on
> > its own, i.e., without requiring an initiation or correlation of some
> > state through another protocol or network connection or whatever. If
> > that's needed, that feels like I didn't do my job properly, if you get
> > my point.
> 
> The out-of-band operations you are referring to are for dirty bitmap
> management.  (The goal is to read out blocks that changed since the last
> backup.)
> 
> The client does not access the live disk, instead it accesses a
> read-only snapshot and the dirty information (so that it can copy out
> only blocks that were written).  The client is allowed to read blocks
> that are not dirty too.

I understood as much, yes.

> If you want to implement the whole incremental backup workflow in NBD
> then the client would first have to connect to the live disk, set up
> dirty tracking, create a snapshot export, and then connect to that
> snapshot.
> 
> That sounds like a big feature set and I'd argue it's for the control
> plane (storage API) and not the data plane (NBD).  There were
> discussions about transferring the dirty information via the control
> plane but it seems more appropriate to it in the data plane since it is
> block-level information.

I agree that creating and managing snapshots is out of scope for NBD. The
protocol is not set up for that.

However, I'm arguing that if we're going to provide information about
snapshots, we should be able to properly refer to these snapshots from
within an NBD context. My previous mail suggested adding a negotiation
message that would essentially ask the server "tell me about the
snapshots you know about", giving them an NBD identifier in the process
(accompanied by a "foreign" identifier that is decidedly *not* an NBD
identifier and that could be used to match the NBD identifier to
something implementation-defined). This would be read-only information;
the client cannot ask the server to create new snapshots. We can then
later in the protocol refer to these snapshots by way of that NBD
identifier.

My proposal also makes it impossible to get updates of newly created
snapshots without disconnecting and reconnecting (due to the fact that
you can't go from transmission back to negotiation), but I'm not sure
that's a problem.

Doing so has two advantages:
- If a client is accidentally (due to misconfiguration or implementation
  bugs or whatnot) connecting to the wrong server after having created a
  snapshot through a management protocol, we have an opportunity to
  detect this error, due to the fact that the "foreign" identifiers
  passed to the client during negotiation will not match with what the
  client was expecting.
- A future version of the protocol could possibly include an extended
  version of the read command, allowing a client to read information
  from multiple storage snapshots without requiring a reconnect, and
  allowing current clients information about allocation status across
  various snapshots (although a first implementation could very well
  limit itself to only having one snapshot).

[...]
> I'm arguing that the NBD protocol doesn't need to support the
> incremental backup workflow since it's a complex control plane concept.
> 
> Being able to read dirty information via NBD is useful for other block
> backup applications, not just QEMU.  It could be used for syncing LVM
> volumes across machines, for example, if someone implements an NBD+LVM
> server.

Indeed, and I was considering adding a basic implementation to go with
the copy-on-write support in stock nbd-server, too.

> Another issue with adding control plane operations is that you need to
> begin considering privilege separation.  Should all NBD clients be able
> to initiate snapshots, dirty tracking, etc or is some kind of access
> control required to limit certain commands?  Not all clients require the
> same privileges and so they shouldn't have access to the same set of
> operations.

Sure, which is why I wasn't suggesting anything of the sorts :-)

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12



Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-28 Thread Stefan Hajnoczi
On Sun, Nov 27, 2016 at 08:17:14PM +0100, Wouter Verhelst wrote:
> Quickly: the reason I haven't merged this yes is twofold:
> - I wasn't thrilled with the proposal at the time. It felt a bit
>   hackish, and bolted onto NBD so you could use it, but without defining
>   everything in the NBD protocol. "We're reading some data, but it's not
>   about you". That didn't feel right
>
> - There were a number of questions still unanswered (you're answering a
>   few below, so that's good).
> 
> For clarity, I have no objection whatsoever to adding more commands if
> they're useful, but I would prefer that they're also useful with NBD on
> its own, i.e., without requiring an initiation or correlation of some
> state through another protocol or network connection or whatever. If
> that's needed, that feels like I didn't do my job properly, if you get
> my point.

The out-of-band operations you are referring to are for dirty bitmap
management.  (The goal is to read out blocks that changed since the last
backup.)

The client does not access the live disk, instead it accesses a
read-only snapshot and the dirty information (so that it can copy out
only blocks that were written).  The client is allowed to read blocks
that are not dirty too.

If you want to implement the whole incremental backup workflow in NBD
then the client would first have to connect to the live disk, set up
dirty tracking, create a snapshot export, and then connect to that
snapshot.

That sounds like a big feature set and I'd argue it's for the control
plane (storage API) and not the data plane (NBD).  There were
discussions about transferring the dirty information via the control
plane but it seems more appropriate to it in the data plane since it is
block-level information.

I'm arguing that the NBD protocol doesn't need to support the
incremental backup workflow since it's a complex control plane concept.

Being able to read dirty information via NBD is useful for other block
backup applications, not just QEMU.  It could be used for syncing LVM
volumes across machines, for example, if someone implements an NBD+LVM
server.

Another issue with adding control plane operations is that you need to
begin considering privilege separation.  Should all NBD clients be able
to initiate snapshots, dirty tracking, etc or is some kind of access
control required to limit certain commands?  Not all clients require the
same privileges and so they shouldn't have access to the same set of
operations.

Stefan


signature.asc
Description: PGP signature


Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-27 Thread Wouter Verhelst
Hi Vladimir,

Quickly: the reason I haven't merged this yes is twofold:
- I wasn't thrilled with the proposal at the time. It felt a bit
  hackish, and bolted onto NBD so you could use it, but without defining
  everything in the NBD protocol. "We're reading some data, but it's not
  about you". That didn't feel right
- There were a number of questions still unanswered (you're answering a
  few below, so that's good).

For clarity, I have no objection whatsoever to adding more commands if
they're useful, but I would prefer that they're also useful with NBD on
its own, i.e., without requiring an initiation or correlation of some
state through another protocol or network connection or whatever. If
that's needed, that feels like I didn't do my job properly, if you get
my point.

On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> With the availability of sparse storage formats, it is often needed
> to query status of a particular range and read only those blocks of
> data that are actually present on the block device.
> 
> To provide such information, the patch adds the BLOCK_STATUS
> extension with one new NBD_CMD_BLOCK_STATUS command, a new
> structured reply chunk format, and a new transmission flag.
> 
> There exists a concept of data dirtiness, which is required
> during, for example, incremental block device backup. To express
> this concept via NBD protocol, this patch also adds a flag to
> NBD_CMD_BLOCK_STATUS to request dirtiness information rather than
> provisioning information; however, with the current proposal, data
> dirtiness is only useful with additional coordination outside of
> the NBD protocol (such as a way to start and stop the server from
> tracking dirty sectors).  Future NBD extensions may add commands
> to control dirtiness through NBD.
> 
> Since NBD protocol has no notion of block size, and to mimic SCSI
> "GET LBA STATUS" command more closely, it has been chosen to return
> a list of extents in the response of NBD_CMD_BLOCK_STATUS command,
> instead of a bitmap.
> 
> CC: Pavel Borzenkov 
> CC: Denis V. Lunev 
> CC: Wouter Verhelst 
> CC: Paolo Bonzini 
> CC: Kevin Wolf 
> CC: Stefan Hajnoczi 
> Signed-off-by: Eric Blake 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
> 
> v3:
> 
> Hi all. This is almost a resend of v2 (by Eric Blake), The only change is
> removing the restriction, that sum of status descriptor lengths must be equal
> to requested length. I.e., let's permit the server to replay with less data
> than required if it wants.

Reasonable, yes. The length that the client requests should be a maximum (i.e.
"I'm interested in this range"), not an exact request.

> Also, bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now
>  NBD_FLAG_CAN_MULTI_CONN in master branch.

Right.

> And, finally, I've rebased this onto current state of
> extension-structured-reply branch (which itself should be rebased on
> master IMHO).

Probably a good idea, given the above.

> By this resend I just want to continue the diqussion, started about half
> a year ago. Here is a summary of some questions and ideas from v2
> diqussion:
> 
> 1. Q: Synchronisation. Is such data (dirty/allocated) reliable? 
>A: This all is for read-only disks, so the data is static and unchangeable.

I think we should declare that it's up to the client to ensure no other
writes happen without its knowledge. This may be because the client and
server communicate out of band about state changes, or because the
client somehow knows that it's the only writer, or whatever.

We can easily do that by declaring that the result of that command only
talks about *current* state, and that concurrent writes by different
clients may invalidate the state. This is true for NBD in general (i.e.,
concurrent read or write commands from other clients may confuse file
systems on top of NBD), so it doesn't change expectations in any way.

> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>A: 1: server replies with status descriptors of any size, granularity
>  is hidden from the client
>   2: dirty/allocated requests are separate and unrelated to each
>  other, so their granularities are not intersecting

Not entirely sure anymore what this is about?

> 3. Q: selecting of dirty bitmap to export
>A: several variants:
>   1: id of bitmap is in flags field of request
>   pros: - simple
>   cons: - it's a hack. flags field is for other uses.
> - we'll have to map bitmap names to these "ids"
>   2: introduce extended nbd requests with variable length and exploit this
>  feature for BLOCK_STATUS command, specifying bitmap identifier.
>  pros: - looks like a true way
>  cons: - we have to create additional extension
>