Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
> On 8 Dec 2016, at 15:59, Eric Blake wrote:
>
> We should use similar wording to whatever we already say about what a client would see when reading data cleared by NBD_CMD_TRIM. After all, the status of STATE_HOLE set and STATE_ZERO clear is what you logically get when TRIM cannot guarantee reads-as-zero.

Yes. It was actually exactly that discussion I was trying to remember.

--
Alex Bligh
Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
On 12/08/2016 08:40 AM, Alex Bligh wrote:
>>> + metadata context is the basic "exists at all" metadata context.
>>>
>>> Disagree. You're saying that if a server supports metadata contexts at all, it must support this one.
>>
>> No, I'm trying to say that this metadata context exposes whether the *block* exists at all (i.e., it exports NBD_STATE_HOLE). I should probably clarify that wording then, if you misunderstood it in that way.
>
> Ah. Perhaps 'exists at all' itself is misleading. 'Occupies storage space'. Or 'is not a hole'?
>
>> No server MUST implement it (although we might want to make it a SHOULD)
>
> Don't see why it should even be a 'SHOULD', to be honest. An nbd server cooked up for a specific purpose, or with a backend that can't provide that (or where there is never a hole), shouldn't be criticised.
>
>>> I think the last sentence is probably meant to read something like:
>>>
>>> If a server supports the "BASE:allocation" metadata context, then writing to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.
>>
>> No, it can't.
>>
>> Other metadata contexts may change the semantics to the extent that if the block is allocated at the BASE:allocation context, it would still need space on another plane. In that case, writing to an area which is not STATE_HOLE might still fail.

Not just that, but it is ALWAYS permissible to report NBD_STATE_HOLE as clear (just not always optimal), to allow servers that can't determine sparseness information but DO know how to communicate extents to the client. Yes, it is boring to communicate a single extent stating that the entire file is data, and clients can't optimize their usage in that case, but it should always be considered semantically correct to do so, since the presence and knowledge of holes is merely an optimization opportunity, not a data correctness issue.
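The four combinations of the two state bits debated above can be summarised in a small sketch. The flag values below match the draft's bit assignments, but the function and its wording are purely illustrative, not spec text:

```python
# Illustrative sketch of the "base:allocation" bit semantics discussed
# in this thread (NBD_STATE_* values per the draft; wording is mine).
NBD_STATE_HOLE = 1 << 0  # extent is not allocated in backend storage
NBD_STATE_ZERO = 1 << 1  # extent is guaranteed to read as all zeroes

def describe_extent(flags: int) -> str:
    """Map the four flag combinations onto their discussed semantics."""
    hole = bool(flags & NBD_STATE_HOLE)
    zero = bool(flags & NBD_STATE_ZERO)
    if hole and zero:
        return "unallocated, reads as zeroes (e.g. a sparse-file hole)"
    if hole and not zero:
        return "unallocated, read contents undefined (TRIM with no zero guarantee)"
    if not hole and zero:
        return "allocated, but known to contain only zeroes"
    return "allocated data (the ordinary case; both bits clear)"
```

Note that, per Eric's point above, a server may always report both bits clear for everything: that is pessimal but never incorrect, since hole knowledge is only an optimization hint.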
>>> Why 512 bytes as opposed to 'minimum block size' (or is it because that is also an experimental extension)?
>>
>> Yes, and this extension does not depend on that one. Hence why it is also a SHOULD and not a MUST.
>
> OK. As a separate discussion I think we should talk about promoting WRITE_ZEROES and INFO. The former has a reference implementation; I think Eric did a qemu implementation, and I did a gonbdserver implementation. The latter I believe lacks a reference implementation.

Yes, I still have reference qemu patches for INFO; they did not make it into qemu 2.8 (while WRITE_ZEROES did), but should make it into 2.9. I also hope to get structured reads into qemu 2.9, but that's a bigger task, as I don't have reference patches complete yet. On the other hand, since BLOCK_STATUS depends on structured reply, I have all the more reason to complete it soon.

>>> +A client SHOULD NOT read from an area that has both `NBD_STATE_HOLE`
>>> +set and `NBD_STATE_ZERO` clear.
>>>
>>> That makes no sense, as normal data has both these bits clear! This also implies that to comply with this SHOULD, a client needs to request block status before any read, which is ridiculous. This should be dropped.
>>
>> No, it should not, although it may need rewording. It clarifies that having STATE_HOLE set (i.e., there's no valid data in the given range) and STATE_ZERO clear (i.e., we don't assert that it would read as all-zeroes) is not an invalid thing for a server to set. The spec here clarifies what a client should do with that information if it gets it (i.e., "don't read it, it doesn't contain anything interesting").
>
> That's fair enough, until the last bit in brackets. Rather than saying a client SHOULD NOT read it, it should simply say that a read on such areas will succeed, but the data read is undefined (and may not be stable).
We should use similar wording to whatever we already say about what a client would see when reading data cleared by NBD_CMD_TRIM. After all, the status of STATE_HOLE set and STATE_ZERO clear is what you logically get when TRIM cannot guarantee reads-as-zero.

--
Eric Blake, eblake@redhat.com, +1-919-301-3266
Libvirt virtualization library: http://libvirt.org
Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
Wouter,

>> +- `NBD_OPT_META_CONTEXT` (10)
>> +
>> +Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
>> +followed by an `NBD_REP_ACK`. If a server replies to such a request
>> +with no error message, clients
>>
>> "*the* server" / "*the* client"
>>
>> Perhaps only an 'NBD_REP_ERR_UNSUP' error should prevent the client querying the server.
>
> I don't think that's necessarily a good idea. I think if the server replies with NBD_REP_ERR_INVALID, that means it understands the option but the user sent something invalid and now we haven't selected any contexts. That means the server won't be able to provide any metadata during transmission, either.
>
> Perhaps it should be made clearer that once contexts have been selected (even if errors occurred later on), NBD_CMD_BLOCK_STATUS MAY be used.
>
>> +MAY send NBD_CMD_BLOCK_STATUS
>> +commands during the transmission phase.
>>
>> Add: "If a server replies to such a request with NBD_REP_ERR_UNSUP, the client MUST NOT send NBD_CMD_BLOCK_STATUS commands during the transmission phase."
>
> That would be counter to the above.

My main point is that the current text does not prohibit sending NBD_CMD_BLOCK_STATUS if it hasn't been established that the server supports it.

>> Why not just define the format of a metadata-context string to be of the form '<namespace>:<string>' (perhaps this definition goes elsewhere), and here just say the returned list is a left-match of all the available metadata-contexts, i.e. all those metadata contexts whose names either start with or consist entirely of the specified string. If an empty string is specified, all metadata contexts are returned.
>
> I also want to make it possible for an implementation to define its own syntax. Say, a "last-snapshot" thing, or something that says "diff:snapshot1-snapshot2", or whatever.

That's a good idea, but it doesn't preclude using a colon as a namespace separator.
>> +The type may be one of:
>> +- `NBD_META_LIST_CONTEXT` (1): the list of metadata contexts
>> + selected by the query string is returned to the client without
>> + changing any state (i.e., this does not add metadata contexts
>> + for further usage).
>>
>> Somewhere it should say this list is returned by sending zero or more NBD_REP_META_CONTEXT records followed by a NBD_REP_ACK.
>
> We do that above already.

Must have missed it.

>> A metadata context represents metadata concerning the selected export in the form of a list of extents, each with a status. The meaning of the metadata and the status is dependent upon the context. Each metadata context has a name which is colon separated, in the form '<namespace>:<name>'. Namespaces that start with "X-" are vendor dependent extensions.
>
> No, I wouldn't do that, since by definition, every namespace is vendor-dependent.
>
> Maybe we could ask that people who want to implement their own metadata context type (rather than be compatible with an existing one) should register their namespace somewhere, though.

What I'm trying to say is that there should be three types of namespace:

* "BASE" (or better "base", as Vladimir points out), which is defined within the document; i.e., it will say "base:allocated" does X, "base:foo" does Y.

* Registered, e.g. "qemu", where the document would say: "The following is a list of registered namespaces: ... qemu: to qemu.org" or whatever.

* Unregistered, e.g. "X-Alex-Experiment", where the document merely mentions that namespaces beginning with "X-" can be used by anyone, like X- headers in SMTP and HTTP.

The purpose of distinguishing between registered and unregistered is that otherwise we might get two vendors with a product called "fastnbd" (or whatever) who both pick the same namespace. I suppose an alternative would be to go the Java-ish way and suggest people use a domain name (so 'qemu' would be 'qemu.org:whatever').
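The three-way namespace split proposed above could be sketched as follows. This is a hypothetical helper, not spec text; the "standard"/"registered"/"unregistered" labels are my own names for the proposal:

```python
def classify_namespace(context_name: str) -> str:
    """Classify a metadata context name per the three proposed namespace types."""
    namespace, sep, _ = context_name.partition(":")
    if not sep:
        raise ValueError("metadata context names take the form '<namespace>:<name>'")
    if namespace.lower() == "base":
        return "standard"      # defined by the spec itself, e.g. base:allocation
    if namespace.upper().startswith("X-"):
        return "unregistered"  # free for experiments, like X- headers in SMTP/HTTP
    return "registered"        # e.g. 'qemu:', registered to avoid vendor collisions
```

For example, `classify_namespace("qemu:dirty-bitmap:bitmap0")` would come out as "registered" because only the first colon separates the namespace from the rest of the name.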
>> + metadata context is the basic "exists at all" metadata context.
>>
>> Disagree. You're saying that if a server supports metadata contexts at all, it must support this one.
>
> No, I'm trying to say that this metadata context exposes whether the *block* exists at all (i.e., it exports NBD_STATE_HOLE). I should probably clarify that wording then, if you misunderstood it in that way.

Ah. Perhaps 'exists at all' itself is misleading. 'Occupies storage space'. Or 'is not a hole'?

> No server MUST implement it (although we might want to make it a SHOULD)

Don't see why it should even be a 'SHOULD', to be honest. An nbd server cooked up for a specific purpose, or with a backend that can't provide that (or where there is never a hole), shouldn't be criticised.

>> I think the last sentence is probably meant to read something like:
>>
>> If a server supports the "BASE:allocation" metadata context, then writing to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.
>
> No, it can't.
>
> Other metadata
Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
On Thu, Dec 08, 2016 at 03:39:19AM +, Alex Bligh wrote:
>> On 2 Dec 2016, at 18:45, Alex Bligh wrote:
>>
>> Thanks. That makes sense - or enough sense for me to carry on commenting!
>
> I finally had some time to go through this extension in detail. Rather than comment on all the individual patches, I squashed them into a single commit, did a 'git format-patch' on it, and commented on that.
>
> diff --git a/doc/proto.md b/doc/proto.md
> index c443494..9c0981f 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -871,6 +869,50 @@ of the newstyle negotiation.
>
> Defined by the experimental `INFO`
> [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
>
> +- `NBD_OPT_META_CONTEXT` (10)
> +
> +Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
> +followed by an `NBD_REP_ACK`. If a server replies to such a request
> +with no error message, clients
>
> "*the* server" / "*the* client"
>
> Perhaps only an 'NBD_REP_ERR_UNSUP' error should prevent the client querying the server.

I don't think that's necessarily a good idea. I think if the server replies with NBD_REP_ERR_INVALID, that means it understands the option but the user sent something invalid and now we haven't selected any contexts. That means the server won't be able to provide any metadata during transmission, either.

Perhaps it should be made clearer that once contexts have been selected (even if errors occurred later on), NBD_CMD_BLOCK_STATUS MAY be used.

> +MAY send NBD_CMD_BLOCK_STATUS
> +commands during the transmission phase.
>
> Add: "If a server replies to such a request with NBD_REP_ERR_UNSUP, the client MUST NOT send NBD_CMD_BLOCK_STATUS commands during the transmission phase."

That would be counter to the above.

> +If the query string is syntactically invalid, the server SHOULD send
> +`NBD_REP_ERR_INVALID`.
>
> 'MUST send', else it implies sending nothing is permissible.
Yes, I had already fixed that locally (but didn't push yet, because that patch is... big, and needs some rework. I'll look at it again later, incorporating your comments).

> +If the query string is syntactically valid
> +but finds no metadata contexts, the server MUST send a single
> +reply of type `NBD_REP_ACK`.
> +
> +This option MUST NOT be requested unless structured replies have
>
> Active voice is better:
>
> "The client MUST NOT send this option unless" ...

Right.

> +been negotiated first. If a client attempts to do so, a server
> +SHOULD send `NBD_REP_ERR_INVALID`.
> +
> +Data:
> +- 32 bits, type
> +- String, query to select a subset of the available metadata
> + contexts. If this is not specified (i.e., length is 4 and no
> + command is sent), then the server MUST send all the metadata
> + contexts it knows about. If specified, this query string MUST
> + start with a name that uniquely identifies a server
> + implementation; e.g., the reference implementation that
> + accompanies this document would support query strings starting
> + with 'nbd-server:'
>
> Why not just define the format of a metadata-context string to be of the form '<namespace>:<string>' (perhaps this definition goes elsewhere), and here just say the returned list is a left-match of all the available metadata-contexts, i.e. all those metadata contexts whose names either start with or consist entirely of the specified string. If an empty string is specified, all metadata contexts are returned.

I also want to make it possible for an implementation to define its own syntax. Say, a "last-snapshot" thing, or something that says "diff:snapshot1-snapshot2", or whatever.

> +The type may be one of:
> +- `NBD_META_LIST_CONTEXT` (1): the list of metadata contexts
> + selected by the query string is returned to the client without
> + changing any state (i.e., this does not add metadata contexts
> + for further usage).
>
> Somewhere it should say this list is returned by sending zero or more NBD_REP_META_CONTEXT records followed by a NBD_REP_ACK.

We do that above already.

> +- `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts
> + selected by the query string is added to the list of existing
> + metadata contexts.
> +- `NBD_META_DEL_CONTEXT` (3): the list of metadata contexts
> + selected by the query string is removed from the list of used
> + metadata contexts. Servers SHOULD NOT reuse existing metadata
> + context IDs.
> +
> +The syntax of the query string is not specified, except that
> +implementations MUST support adding and removing individual metadata
> +contexts by simply listing their names.
>
> This seems slightly over complicated. Rather than have a list held by the server of active metadata contexts, we could simply have two NBD_OPT_ commands, say NBD_OPT_LIST_META_CONTEXTS and NBD_OPT_SET_META_CONTEXTS (which
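For concreteness, the option payload described in the quoted draft (a 32-bit type followed by a query string) could be built as below. This is a sketch of the draft wire layout only; the constants are spelled as in the quoted text and the helper name is my own:

```python
import struct

# Type values as listed in the quoted draft text.
NBD_META_LIST_CONTEXT = 1
NBD_META_ADD_CONTEXT = 2
NBD_META_DEL_CONTEXT = 3

def meta_context_data(op_type: int, query: str = "") -> bytes:
    """Option data for the draft NBD_OPT_META_CONTEXT: 32-bit big-endian
    type, then the query string (no length prefix of its own; the option
    header's length field delimits it).

    Per the draft, a length-4 payload (type only, empty query) asks the
    server to send every metadata context it knows about.
    """
    return struct.pack(">I", op_type) + query.encode("utf-8")
```

Note this models only the option *data*; the surrounding option header (IHAVEOPT magic, option number, length) is unchanged from ordinary newstyle negotiation and is omitted here.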
Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
On 12/06/2016 08:32 AM, Wouter Verhelst wrote:
> Hi John
>
> Sorry for the late reply; weekend was busy, and so was monday.

No problems.

> On Fri, Dec 02, 2016 at 03:39:08PM -0500, John Snow wrote:
>> On 12/02/2016 01:45 PM, Alex Bligh wrote:
>>> John,
>>>
> +Some storage formats and operations over such formats express a
> +concept of data dirtiness. Whether the operation is block device
> +mirroring, incremental block device backup or any other operation with
> +a concept of data dirtiness, they all share a need to provide a list
> +of ranges that this particular operation treats as dirty.
>
> How can data be 'dirty' if it is static and unchangeable? (I thought)
>
> In a simple case, live IO goes to e.g. hda.qcow2. These writes come from the VM and cause the bitmap that QEMU manages to become dirty.
>
> We intend to expose the ability to fleece dirty blocks via NBD. What happens in this scenario would be that a snapshot of the data at the time of the request is exported over NBD in a read-only manner.
>
> In this way, the drive itself is R/W, but the "view" of it from NBD is RO. While a hypothetical backup client is busy copying data out of this temporary view, new writes are coming in to the drive, but are not being exposed through the NBD export.
>
> (This goes into QEMU-specifics, but those new writes are dirtying a version of the bitmap not intended to be exposed via the NBD channel. NBD gets effectively a snapshot of both the bitmap AND the data.)

>>> Thanks. That makes sense - or enough sense for me to carry on commenting!

>> Whew! I'm glad.

> I now think what you are talking about is backing up a *snapshot* of a disk that's running, where the disk itself was not connected using NBD? IE it's not being 'made dirty' by NBD_CMD_WRITE etc. Rather, 'dirtiness' is effectively an opaque state represented in a bitmap, which is binary metadata at some particular level of granularity. It might as well be 'happiness' or 'is coloured blue'.
> The NBD server would (normally) have no way of manipulating this bitmap.
>
> In previous comments, I said 'how come we can set the dirty bit through writes but can't clear it?'. This (my statement) is now I think wrong, as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The state of the bitmap comes from whatever sets the bitmap, which is outside the scope of this protocol to transmit.

You know, this is a fair point. We have not (to my knowledge) yet carefully considered the exact bitmap management scenario when NBD is involved in retrieving dirty blocks.

Humor me for a moment while I talk about a (completely hypothetical, not yet fully discussed) workflow for how I envision this feature.

(1) User sets up a drive in QEMU, a bitmap is initialized, an initial backup is made, etc.

(2) As writes come in, QEMU's bitmap is dirtied.

(3) The user decides they want to root around to see what data has changed and would like to use NBD to do so, in contrast to QEMU's own facilities for dumping dirty blocks.

(4) A command is issued that creates a temporary, lightweight snapshot ('fleecing') and exports this snapshot over NBD. The bitmap is associated with the NBD export at this point at NBD server startup. (For the sake of QEMU discussion, maybe this command is "blockdev-fleece".)

(5) At this moment, the snapshot is static and represents the data at the time the NBD server was started. The bitmap is also forked and represents only this snapshot. The live data and bitmap continue to change.

(6) Dirty blocks are queried and copied out via NBD.

(7) The user closes the NBD instance upon completion of their task, whatever it was. (Making a new incremental backup? Just taking a peek at some changed data? Who knows.)

The point that's interesting here is: what do we do with the two bitmaps at this point? The data delta can be discarded (this was after all just a lightweight read-only point-in-time snapshot), but the bitmap data needs to be dealt with.
(A) In the case of "User made a new incremental backup," the bitmap that got forked off to serve the NBD read should be discarded.

(B) In the case of "User just wanted to look around," the bitmap should be merged back into the bitmap it was forked from.

I don't advise a hybrid where "User copied some data, but not all," where we need to partially clear *and* merge, but conceivably this could happen, because the things we don't want to happen always will.

At this point maybe it's becoming obvious that actually it would be very prudent to allow the NBD client itself to inform QEMU via the NBD
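The fork/discard/merge lifecycle described in steps (5), (A), and (B) can be modelled with a toy sketch. This is purely illustrative: QEMU's real dirty bitmaps are not Python sets of block indices, and the class and method names are mine:

```python
class DirtyBitmap:
    """Toy model of the fleecing bitmap lifecycle (a bitmap is modelled
    here as simply the set of dirty block indices)."""

    def __init__(self, dirty=()):
        self.dirty = set(dirty)

    def fork(self):
        """Step (5): freeze the current state for the NBD export.
        The live bitmap restarts empty and keeps tracking new writes."""
        frozen = DirtyBitmap(self.dirty)
        self.dirty = set()
        return frozen

    def merge(self, other):
        """Case (B): the user only looked around, so fold the frozen
        dirty bits back into the live bitmap."""
        self.dirty |= other.dirty
```

Case (A) is then just dropping the forked object; case (B) is `live.merge(snapshot)` after the export is torn down.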
Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
Hi John

Sorry for the late reply; weekend was busy, and so was monday.

On Fri, Dec 02, 2016 at 03:39:08PM -0500, John Snow wrote:
> On 12/02/2016 01:45 PM, Alex Bligh wrote:
> > John,
> >
> >>> +Some storage formats and operations over such formats express a
> >>> +concept of data dirtiness. Whether the operation is block device
> >>> +mirroring, incremental block device backup or any other operation with
> >>> +a concept of data dirtiness, they all share a need to provide a list
> >>> +of ranges that this particular operation treats as dirty.
> >>>
> >>> How can data be 'dirty' if it is static and unchangeable? (I thought)
> >>
> >> In a simple case, live IO goes to e.g. hda.qcow2. These writes come from the VM and cause the bitmap that QEMU manages to become dirty.
> >>
> >> We intend to expose the ability to fleece dirty blocks via NBD. What happens in this scenario would be that a snapshot of the data at the time of the request is exported over NBD in a read-only manner.
> >>
> >> In this way, the drive itself is R/W, but the "view" of it from NBD is RO. While a hypothetical backup client is busy copying data out of this temporary view, new writes are coming in to the drive, but are not being exposed through the NBD export.
> >>
> >> (This goes into QEMU-specifics, but those new writes are dirtying a version of the bitmap not intended to be exposed via the NBD channel. NBD gets effectively a snapshot of both the bitmap AND the data.)
> >
> > Thanks. That makes sense - or enough sense for me to carry on commenting!
>
> Whew! I'm glad.
>
> >>> I now think what you are talking about backing up a *snapshot* of a disk
> >>> that's running, where the disk itself was not connected using NBD? IE it's
> >>> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is
> >>> effectively an opaque state represented in a bitmap, which is binary
> >>> metadata at some particular level of granularity.
> >>> It might as well be 'happiness'
> >>> or 'is coloured blue'. The NBD server would (normally) have no way of
> >>> manipulating this bitmap.
> >>>
> >>> In previous comments, I said 'how come we can set the dirty bit through
> >>> writes but can't clear it?'. This (my statement) is now I think wrong,
> >>> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
> >>> state of the bitmap comes from whatever sets the bitmap which is outside
> >>> the scope of this protocol to transmit it.
> >>
> >> You know, this is a fair point. We have not (to my knowledge) yet
> >> carefully considered the exact bitmap management scenario when NBD is
> >> involved in retrieving dirty blocks.
> >>
> >> Humor me for a moment while I talk about a (completely hypothetical, not
> >> yet fully discussed) workflow for how I envision this feature.
> >>
> >> (1) User sets up a drive in QEMU, a bitmap is initialized, an initial
> >> backup is made, etc.
> >>
> >> (2) As writes come in, QEMU's bitmap is dirtied.
> >>
> >> (3) The user decides they want to root around to see what data has
> >> changed and would like to use NBD to do so, in contrast to QEMU's own
> >> facilities for dumping dirty blocks.
> >>
> >> (4) A command is issued that creates a temporary, lightweight snapshot
> >> ('fleecing') and exports this snapshot over NBD. The bitmap is
> >> associated with the NBD export at this point at NBD server startup. (For
> >> the sake of QEMU discussion, maybe this command is "blockdev-fleece")
> >>
> >> (5) At this moment, the snapshot is static and represents the data at
> >> the time the NBD server was started. The bitmap is also forked and
> >> represents only this snapshot. The live data and bitmap continue to change.
> >>
> >> (6) Dirty blocks are queried and copied out via NBD.
> >>
> >> (7) The user closes the NBD instance upon completion of their task,
> >> whatever it was. (Making a new incremental backup? Just taking a peek at
> >> some changed data? who knows.)
> >>
> >> The point that's interesting here is what do we do with the two bitmaps
> >> at this point? The data delta can be discarded (this was after all just
> >> a lightweight read-only point-in-time snapshot) but the bitmap data
> >> needs to be dealt with.
> >>
> >> (A) In the case of "User made a new incremental backup," the bitmap that
> >> got forked off to serve the NBD read should be discarded.
> >>
> >> (B) In the case of "User just wanted to look around," the bitmap should
> >> be merged back into the bitmap it was forked from.
> >>
> >> I don't advise a hybrid where "User copied some data, but not all" where
> >> we need to partially clear *and* merge, but conceivably this could
> >> happen, because the things we don't want to happen always will.
> >>
> >> At this point maybe it's becoming obvious that actually it would be very
> >> prudent to allow the NBD client itself to inform QEMU via the NBD
> >> protocol which extents/blocks/(etc) that it is "done" with.
> >>
> >> Maybe it
Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
Hi Vladimir,

On Thu, Dec 01, 2016 at 02:26:28PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 01.12.2016 13:14, Wouter Verhelst wrote:
[...]
> > -- `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts
> > +- `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts
> > selected by the query string is added to the list of existing
> > - allocation contexts.
>
> If I understand correctly, it should be not 'existing', but 'exporting'. So there are several contexts the server knows about. They definitely exist. Some of them may be selected (by the client) for export (to the client, through get_block_status).
>
> So, what about 'list of metadata contexts to export', or something like this?

Yes, good idea. Thanks.

[...]

> > +The "BASE:allocation" metadata context is the basic "exists at all"
> > +metadata context. If an extent is marked with `NBD_STATE_HOLE` at that
> > +context, this means that the given extent is not allocated in the
> > +backend storage, and that writing to the extent MAY result in the ENOSPC
> > +error. This supports sparse file semantics on the server side. If a
> > +server has only one metadata context (the default), then writing to an
> > +extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.
>
> This dependence looks strange. User-defined metadata: why does it affect allocation?

The reference nbd-server implementation (i.e., the one that accompanies this document) has a "copy-on-write" mode in which modifications are written to a separate file. This separate file can easily be a sparse file. Let's say you have an export which is fully allocated; the BASE:allocation state would have STATE_HOLE cleared. However, when writing to a particular block that has not yet been written to during that session, the copy-on-write mode would have to allocate a new block in the copy-on-write file, which may fail with ENOSPC.
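The copy-on-write scenario just described, in which a write can fail with ENOSPC even though BASE:allocation reports no hole, can be illustrated with a toy model. This is entirely hypothetical: the real nbd-server COW mode works on files, and the class and method names below are mine:

```python
import errno

class CowExport:
    """Toy copy-on-write export: the base image is fully allocated (so
    BASE:allocation reports STATE_HOLE clear everywhere), but first
    writes to a block must allocate space in a separate diff file,
    which can run out."""

    def __init__(self, free_cow_blocks: int):
        self.free = free_cow_blocks  # blocks left in the diff file
        self.cow = set()             # blocks already copied to the diff file

    def write(self, block: int) -> int:
        """Return 0 on success or an errno value on failure."""
        if block not in self.cow:
            if self.free == 0:
                # Base says "not a hole", yet the write still fails:
                # the diff file has no space for the COW allocation.
                return errno.ENOSPC
            self.free -= 1
            self.cow.add(block)
        return 0
```

This is why the "MUST NOT fail with ENOSPC" wording has to be scoped to the semantics of the other metadata contexts rather than stated unconditionally.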
Perhaps the above paragraph could be updated in the sense that it SHOULD NOT fail in other cases, but that it may depend on the semantics of the other metadata contexts, whether active (i.e., selected with NBD_OPT_META_CONTEXT) or not.

Side note: that does mean that some metadata contexts may be specific to particular exports (e.g., you may have a list of named snapshots to select, and which ones exist would depend on the specific export chosen). I guess that means the metadata contexts should have the name of the export, too.

> At least, 'only one' is not descriptive; it would be better to mention the 'BASE:allocation' name.

Yes, that would make things clearer, indeed.

> (I hope I can ask the server to export only the dirty_bitmap context, not exporting allocation?

That was the point of naming that context and making it selectable :-)

> In that case this 'only one' would be dirty_bitmap.)

> > For all other cases, this specification requires no specific semantics
>
> What are 'other cases'? All other metadata contexts? Or all cases when we have more than one context?

Other metadata contexts. I'll rephrase it in that sense.

> > -of allocation contexts. Implementations could support allocation
> > +of metadata contexts. Implementations could support metadata
> > contexts with semantics like the following:
> >
> > -- Incremental snapshots; if a block is allocated in one allocation
> > +- Incremental snapshots; if a block is allocated in one metadata
> > context, that implies that it is also allocated in the next level up.
> > - Various bits of data about the backend of the storage; e.g., if the
> > - storage is written on a RAID array, an allocation context could
> > + storage is written on a RAID array, a metadata context could
> > return information about the redundancy level of a given extent
> > - If the backend implements a write-through cache of some sort, or
> > - synchronises with other servers, an allocation context could state
> > - that an extent is "allocated" once it has reached permanent storage
> > + synchronises with other servers, a metadata context could state
> > + that an extent is "active" once it has reached permanent storage
> > and/or is synchronized with other servers.
>
> Incremental snapshots sound strange to me. Snapshots are just snapshots. Backup may be incremental, but it is not about snapshots. I think this example may be safely deleted from the spec.

Yeah, I'll just remove all the examples; they're not really critical, anyway, and might indeed confuse people.

[...]

> > Servers MUST return an `NBD_REPLY_TYPE_BLOCK_STATUS` chunk for every
> > -allocation context ID, except if the semantics of particular
> > -allocation contexts mean that the information for one allocation
> > -context is implied by the information for another.
> > +metadata context ID, except if the semantics of particular
> > +metadata contexts mean that the information for one active metadata
> > +context is implied by the information for
Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
01.12.2016 13:14, Wouter Verhelst wrote:
> Here's another update. Changes since previous version:
>
> - Rename "allocation context" to "metadata context"
> - Stop making metadata context 0 be special; instead, name it "BASE:allocation" and allow it to be selected like all other contexts.
> - Clarify in a bit more detail when a server MAY omit metadata information on a particular metadata context (i.e., only if another metadata context that we actually got information for implies it can't have meaning). This was always meant that way, but the spec could have been a bit more explicit about it.
> - Change one SHOULD to a MUST, where it should not have been a SHOULD in the first place.
>
> (This applies on top of my previous patch.) Applied and visible at the usual place.
>
> diff --git a/doc/proto.md b/doc/proto.md
> index fe7ae53..9c0981f 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -869,16 +869,16 @@ of the newstyle negotiation.
>
> Defined by the experimental `INFO`
> [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
>
> -- `NBD_OPT_ALLOC_CONTEXT` (10)
> +- `NBD_OPT_META_CONTEXT` (10)
>
> -Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context,
> +Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
> followed by an `NBD_REP_ACK`. If a server replies to such a request
> with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
> commands during the transmission phase.
>
> If the query string is syntactically invalid, the server SHOULD send
> `NBD_REP_ERR_INVALID`. If the query string is syntactically valid
> -but finds no allocation contexts, the server MUST send a single
> +but finds no metadata contexts, the server MUST send a single
> reply of type `NBD_REP_ACK`.
>
> This option MUST NOT be requested unless structured replies have
> @@ -887,9 +887,9 @@ of the newstyle negotiation.
>
> Data:
> - 32 bits, type
> -- String, query to select a subset of the available allocation
> +- String, query to select a subset of the available metadata
> contexts.
> If this is not specified (i.e., length is 4 and no
> - command is sent), then the server MUST send all the allocation
> + command is sent), then the server MUST send all the metadata
> contexts it knows about. If specified, this query string MUST
> start with a name that uniquely identifies a server
> implementation; e.g., the reference implementation that
> @@ -897,18 +897,22 @@ of the newstyle negotiation.
> with 'nbd-server:'
>
> The type may be one of:
> -- `NBD_ALLOC_LIST_CONTEXT` (1): the list of allocation contexts
> +- `NBD_META_LIST_CONTEXT` (1): the list of metadata contexts
> selected by the query string is returned to the client without
> - changing any state (i.e., this does not add allocation contexts
> + changing any state (i.e., this does not add metadata contexts
> for further usage).
> -- `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts
> +- `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts
> selected by the query string is added to the list of existing

If I understand correctly, it should be not 'existing', but 'exporting'. So there are several contexts the server knows about. They definitely exist. Some of them may be selected (by the client) for export (to the client, through get_block_status).

So, what about 'list of metadata contexts to export', or something like this?

> - allocation contexts.
> + metadata contexts.
> -- `NBD_ALLOC_DEL_CONTEXT` (3): the list of allocation contexts
> +- `NBD_META_DEL_CONTEXT` (3): the list of metadata contexts
> selected by the query string is removed from the list of used
> - allocation contexts. Servers SHOULD NOT reuse existing allocation
> + metadata contexts. Servers SHOULD NOT reuse existing metadata
> context IDs.
> +
> +The syntax of the query string is not specified, except that
> +implementations MUST support adding and removing individual metadata
> +contexts by simply listing their names.
+ Option reply types These values are used in the "reply type" field, sent by the server @@ -920,7 +924,7 @@ during option haggling in the fixed newstyle negotiation. information is available, or when sending data related to the option (in the case of `NBD_OPT_LIST`) has finished. No data. -* `NBD_REP_SERVER` (2) +- `NBD_REP_SERVER` (2) A description of an export. Data: @@ -935,21 +939,20 @@ during option haggling in the fixed newstyle negotiation. particular client request, this field is defined to be a string suitable for direct display to a human being. -* `NBD_REP_INFO` (3) +- `NBD_REP_INFO` (3) Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md). -- `NBD_REP_ALLOC_CONTEXT` (4) +- `NBD_REP_META_CONTEXT` (4) -
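To make the negotiation data described above concrete, here is a small sketch of how a client might build an `NBD_OPT_META_CONTEXT` request as the draft quoted in this thread words it: a 32-bit type followed by an optional query string. The option number (10), the type values (1/2/3), and the helper name are all taken from this draft or invented for illustration; they are experimental and may change.

```python
import struct

# Draft values quoted in the mail above; experimental, may change.
NBD_OPT_META_CONTEXT = 10
NBD_META_LIST_CONTEXT = 1
NBD_META_ADD_CONTEXT = 2
NBD_META_DEL_CONTEXT = 3

IHAVEOPT = 0x49484156454F5054  # standard client option-request magic

def meta_context_request(qtype, query=""):
    """Build a full option request: 64-bit magic, 32-bit option number,
    32-bit data length, then the data (32-bit type + optional query string)."""
    data = struct.pack(">I", qtype) + query.encode("utf-8")
    return struct.pack(">QII", IHAVEOPT, NBD_OPT_META_CONTEXT, len(data)) + data
```

With no query string the data length is exactly 4 (just the type), which is the "server MUST send all the metadata contexts it knows about" case in the draft.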
Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
Here's another update. Changes since previous version: - Rename "allocation context" to "metadata context" - Stop making metadata context 0 be special; instead, name it "BASE:allocation" and allow it to be selected like all other contexts. - Clarify in a bit more detail when a server MAY omit metadata information on a particular metadata context (i.e., only if another metadata context that we actually got information for implies it can't have meaning). This was always meant that way, but the spec could have been a bit more explicit about it. - Change one SHOULD to a MUST, where it should not have been a SHOULD in the first place. (This applies on top of my previous patch) Applied and visible at the usual place. diff --git a/doc/proto.md b/doc/proto.md index fe7ae53..9c0981f 100644 --- a/doc/proto.md +++ b/doc/proto.md @@ -869,16 +869,16 @@ of the newstyle negotiation. Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md). -- `NBD_OPT_ALLOC_CONTEXT` (10) +- `NBD_OPT_META_CONTEXT` (10) -Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context, +Return a list of `NBD_REP_META_CONTEXT` replies, one per context, followed by an `NBD_REP_ACK`. If a server replies to such a request with no error message, clients MAY send NBD_CMD_BLOCK_STATUS commands during the transmission phase. If the query string is syntactically invalid, the server SHOULD send `NBD_REP_ERR_INVALID`. If the query string is syntactically valid -but finds no allocation contexts, the server MUST send a single +but finds no metadata contexts, the server MUST send a single reply of type `NBD_REP_ACK`. This option MUST NOT be requested unless structured replies have @@ -887,9 +887,9 @@ of the newstyle negotiation. Data: - 32 bits, type -- String, query to select a subset of the available allocation +- String, query to select a subset of the available metadata contexts. 
If this is not specified (i.e., length is 4 and no - command is sent), then the server MUST send all the allocation + command is sent), then the server MUST send all the metadata contexts it knows about. If specified, this query string MUST start with a name that uniquely identifies a server implementation; e.g., the reference implementation that @@ -897,18 +897,22 @@ of the newstyle negotiation. with 'nbd-server:' The type may be one of: -- `NBD_ALLOC_LIST_CONTEXT` (1): the list of allocation contexts +- `NBD_META_LIST_CONTEXT` (1): the list of metadata contexts selected by the query string is returned to the client without - changing any state (i.e., this does not add allocation contexts + changing any state (i.e., this does not add metadata contexts for further usage). -- `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts +- `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts selected by the query string is added to the list of existing - allocation contexts. -- `NBD_ALLOC_DEL_CONTEXT` (3): the list of allocation contexts + metadata contexts. +- `NBD_META_DEL_CONTEXT` (3): the list of metadata contexts selected by the query string is removed from the list of used - allocation contexts. Servers SHOULD NOT reuse existing allocation + metadata contexts. Servers SHOULD NOT reuse existing metadata context IDs. +The syntax of the query string is not specified, except that +implementations MUST support adding and removing individual metadata +contexts by simply listing their names. + Option reply types These values are used in the "reply type" field, sent by the server @@ -920,7 +924,7 @@ during option haggling in the fixed newstyle negotiation. information is available, or when sending data related to the option (in the case of `NBD_OPT_LIST`) has finished. No data. -* `NBD_REP_SERVER` (2) +- `NBD_REP_SERVER` (2) A description of an export. Data: @@ -935,21 +939,20 @@ during option haggling in the fixed newstyle negotiation. 
particular client request, this field is defined to be a string suitable for direct display to a human being. -* `NBD_REP_INFO` (3) +- `NBD_REP_INFO` (3) Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md). -- `NBD_REP_ALLOC_CONTEXT` (4) +- `NBD_REP_META_CONTEXT` (4) -A description of an allocation context. Data: +A description of a metadata context. Data: -- 32 bits, NBD allocation context ID. If the request was NOT of type - `NBD_ALLOC_LIST_CONTEXT`, this field MUST NOT be zero. -- String, name of the allocation context. This is not required to be +- 32 bits, NBD metadata context ID. +- String, name of the metadata context. This is not required to be a
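The `NBD_REP_META_CONTEXT` reply data described above (a 32-bit metadata context ID followed by the context name) can be decoded with a few lines; this is only a sketch of the draft layout, not code from any implementation:

```python
import struct

def parse_meta_context_reply(data):
    """Data of an NBD_REP_META_CONTEXT reply, per the draft: a 32-bit
    metadata context ID followed by the UTF-8 context name occupying
    the rest of the reply data."""
    (ctx_id,) = struct.unpack(">I", data[:4])
    return ctx_id, data[4:].decode("utf-8")
```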
Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
Hi, Wouter! > Actually, come to think of that. What is the exact use case for this thing? I > understand you're trying to create incremental backups of things, which would > imply you don't write from the client that is getting the > block status thingies, right? Overall, the most desired use case for this NBD extension is to allow third-party software to make incremental backups. Acronis (a vendor of backup solutions) would support qemu backup if block status is provided. -Original Message- From: Wouter Verhelst [mailto:w...@uter.be] Sent: Sunday, November 27, 2016 22:17 To: Vladimir Sementsov-Ogievskiy Cc: nbd-gene...@lists.sourceforge.net; kw...@redhat.com; qemu-devel@nongnu.org; pborzen...@virtuozzo.com; stefa...@redhat.com; pbonz...@redhat.com; m...@pengutronix.de; d...@openvz.org Subject: Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension Hi Vladimir, Quickly: the reason I haven't merged this yet is twofold: - I wasn't thrilled with the proposal at the time. It felt a bit hackish, and bolted onto NBD so you could use it, but without defining everything in the NBD protocol. "We're reading some data, but it's not about you". That didn't feel right. - There were a number of questions still unanswered (you're answering a few below, so that's good). For clarity, I have no objection whatsoever to adding more commands if they're useful, but I would prefer that they're also useful with NBD on its own, i.e., without requiring an initiation or correlation of some state through another protocol or network connection or whatever. If that's needed, that feels like I didn't do my job properly, if you get my point. On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote: > With the availability of sparse storage formats, it is often needed to > query the status of a particular range and read only those blocks of data > that are actually present on the block device.

> > To provide such information, the patch adds the BLOCK_STATUS extension > with one new NBD_CMD_BLOCK_STATUS command, a new structured reply > chunk format, and a new transmission flag. > > There exists a concept of data dirtiness, which is required during, > for example, incremental block device backup. To express this concept > via NBD protocol, this patch also adds a flag to NBD_CMD_BLOCK_STATUS > to request dirtiness information rather than provisioning information; > however, with the current proposal, data dirtiness is only useful with > additional coordination outside of the NBD protocol (such as a way to > start and stop the server from tracking dirty sectors). Future NBD > extensions may add commands to control dirtiness through NBD. > > Since NBD protocol has no notion of block size, and to mimic SCSI "GET > LBA STATUS" command more closely, it has been chosen to return a list > of extents in the response of NBD_CMD_BLOCK_STATUS command, instead of > a bitmap. > > CC: Pavel Borzenkov > CC: Denis V. Lunev > CC: Wouter Verhelst > CC: Paolo Bonzini > CC: Kevin Wolf > CC: Stefan Hajnoczi > Signed-off-by: Eric Blake > Signed-off-by: Vladimir Sementsov-Ogievskiy > --- > > v3: > > Hi all. This is almost a resend of v2 (by Eric Blake). The only change > is removing the restriction that the sum of status descriptor lengths > must be equal to the requested length. I.e., let's permit the server to > reply with less data than requested if it wants. Reasonable, yes. The length that the client requests should be a maximum (i.e. "I'm interested in this range"), not an exact request. > Also, the bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now > NBD_FLAG_CAN_MULTI_CONN in the master branch. Right. > And, finally, I've rebased this onto the current state of the > extension-structured-reply branch (which itself should be rebased on > master IMHO). Probably a good idea, given the above. > By this resend I just want to continue the discussion, started about > half a year ago.
Here is a summary of some questions and ideas from the v2 > discussion: > > 1. Q: Synchronisation. Is such data (dirty/allocated) reliable? >A: This all is for read-only disks, so the data is static and unchangeable. I think we should declare that it's up to the client to ensure no other writes happen without its knowledge. This may be because the client and server communicate out of band about state changes, or because the client somehow knows that it's the only writer, or whatever. We can easily do that by declaring that the result of that command only talks about *current* state, and that concurrent writes by different clients may invalidate the state. This is true for NBD in general (i.e., concurrent read or write commands from other clients may confuse file systems on top of NBD), so it doesn't
Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
On Tue, Nov 29, 2016 at 06:07:56PM +0300, Vladimir Sementsov-Ogievskiy wrote: > 29.11.2016 17:52, Alex Bligh wrote: > > Vladimir, > >> if the bitmap is 010101010101 we will have too many descriptors. > >> For example, 16tb disk, 64k granularity -> 2G of descriptors > >> payload. > > Yep. And the cost of consuming and retrying is quite high. One > > option would be for the client to realise this is a possibility, and > > not request the entire extent map for a 16TB disk, as it might be > > very large! Even if the client worked at e.g. a 64MB level (where > > they'd get a maximum of 1024 extents per reply), this isn't going to > > noticeably increase the round trip timing. One issue here is that to > > determine a 'reasonable' size, the client needs to know the minimum > > length of any extent. > > and with this approach we will in turn have overhead of too many > requests for the bitmaps. This is why my proposal suggests the server may abort the sent extents after X extents (sixteen, but that number can certainly change) have been sent. After all, the server will have a better view on what's going to be costly in terms of "number of extents". > > I think the answer is probably a 'maximum number of extents' in the request > > packet. > > > > Of course with statuses in extent, the final extent could be > > represented as 'I don't know, break this bit into a separate > > request' status. > > > > With such a predefined status, we can postpone creating extended requests, > have the number of extents restricted by the server and have the sum of extent > lengths be equal to the request length.
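The "2G of descriptors" figure quoted above is easy to check. Assuming the draft's 8-byte extent descriptors (a 32-bit length plus 32-bit status flags, which is an assumption about the draft format, not settled protocol), a fully alternating 0101... bitmap yields one descriptor per granule:

```python
DESCRIPTOR_BYTES = 8  # assumed: 32-bit extent length + 32-bit status flags

def worst_case_payload(disk_bytes, granularity):
    """Worst-case NBD_CMD_BLOCK_STATUS payload when every granule needs
    its own descriptor (the 0101... bitmap case from the mail)."""
    return (disk_bytes // granularity) * DESCRIPTOR_BYTES
```

For a 16 TiB disk at 64 KiB granularity this gives 2^28 descriptors, i.e. 2 GiB of payload, matching the number in the thread.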
Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
Hi Vladimir, On Tue, Nov 29, 2016 at 03:41:10PM +0300, Vladimir Sementsov-Ogievskiy wrote: > Hi, > > 29.11.2016 13:50, Wouter Verhelst wrote: > > Hi, > > On Mon, Nov 28, 2016 at 06:33:24PM +0100, Wouter Verhelst wrote: > > However, I'm arguing that if we're going to provide information about > snapshots, we should be able to properly refer to these snapshots from > within an NBD context. My previous mail suggested adding a negotiation > message that would essentially ask the server "tell me about the > snapshots you know about", giving them an NBD identifier in the > process > (accompanied by a "foreign" identifier that is decidedly *not* an NBD > identifier and that could be used to match the NBD identifier to > something implementation-defined). This would be read-only > information; > the client cannot ask the server to create new snapshots. We can then > later in the protocol refer to these snapshots by way of that NBD > identifier. > > To make this a bit more concrete, I've changed the proposal like so: > [...] > +# Allocation contexts > + > +Allocation context 0 is the basic "exists at all" allocation context. If > +an extent is not allocated at allocation context 0, it MUST NOT be > +listed as allocated at another allocation context. This supports sparse > > > allocated here is range with unset NBD_STATE_HOLE bit? Eh, yes. I should clarify that a bit further (no time right now though, but patches are certainly welcome). > +file semantics on the server side. If a server has only one allocation > +context (the default), then writing to an extent which is allocated in > +that allocation context 0 MUST NOT fail with ENOSPC. > + > +For all other cases, this specification requires no specific semantics > +of allocation contexts. Implementations could support allocation > +contexts with semantics like the following: > + > +- Incremental snapshots; if a block is allocated in one allocation > + context, that implies that it is also allocated in the next level up. 
> +- Various bits of data about the backend of the storage; e.g., if the > + storage is written on a RAID array, an allocation context could > + return information about the redundancy level of a given extent > +- If the backend implements a write-through cache of some sort, or > + synchronises with other servers, an allocation context could state > + that an extent is "allocated" once it has reached permanent storage > + and/or is synchronized with other servers. > + > +The only requirement of an allocation context is that it MUST be > +representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`. > + > +Likewise, the syntax of query strings is not specified by this document. > + > +Server implementations SHOULD document their syntax for query strings > +and semantics for resulting allocation contexts in a document like this > +one. > > > IMHO, redundant paragraph for this spec. The "SHOULD document" one? I want that there at any rate, just to make clear that it's probably a good thing to have it (but no, it's not part of the formal protocol spec). > + > ### Transmission phase > > Flag fields > @@ -983,6 +1065,9 @@ valid may depend on negotiation during the handshake > phase. > content chunk in reply. MUST NOT be set unless the transmission > flags include `NBD_FLAG_SEND_DF`. Use of this flag MAY trigger an > `EOVERFLOW` error chunk, if the request length is too large. > +- bit 3, `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`. If > + set, the client is interested in only one extent per allocation > + context. > > # Structured reply flags > > @@ -1371,38 +1456,48 @@ adds a new `NBD_CMD_BLOCK_STATUS` command which > returns a list of > ranges with their respective states. This extension is not available > unless the client also negotiates the `STRUCTURED_REPLY` extension. 
> > -* `NBD_FLAG_SEND_BLOCK_STATUS` > - > -The server SHOULD set this transmission flag to 1 if structured > -replies have been negotiated, and the `NBD_CMD_BLOCK_STATUS` > -request is supported. > - > * `NBD_REPLY_TYPE_BLOCK_STATUS` > > -*length* MUST be a positive integer multiple of 8. This reply > +*length* MUST be 4 + (a positive integer multiple of 8). This reply > represents a series of consecutive block descriptors where the sum > of the lengths of the descriptors MUST not be greater than the > -length of the original request. This chunk type MUST appear at most > -once in a structured reply. Valid as a reply to > +length of the original request. This chunk type MUST appear at most > +once per allocation ID
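The length rule in the revised chunk above ("4 + a positive integer multiple of 8") is a simple validity check a client could apply to an `NBD_REPLY_TYPE_BLOCK_STATUS` chunk; this helper is an illustrative sketch of that one rule, not part of any implementation:

```python
def valid_block_status_length(length):
    """Draft rule: 4 bytes of metadata context ID followed by one or
    more 8-byte extent descriptors."""
    return length >= 12 and (length - 4) % 8 == 0
```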
Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
> On 29 Nov 2016, at 10:50, Wouter Verhelst wrote: > > +- `NBD_OPT_ALLOC_CONTEXT` (10) > + > +Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context, > +followed by an `NBD_REP_ACK`. If a server replies to such a request > +with no error message, clients MAY send NBD_CMD_BLOCK_STATUS > +commands during the transmission phase. I haven't read this in detail but this seems to me to be similar to the idea I just posted (sorry - kept getting interrupted whilst writing the email) re multiple bitmaps? But the name 'ALLOC_CONTEXT' is a bit weird. Why not call it 'metadata bitmaps' or 'metadata extents' or something? Metadata seems right as it's data about data. > +# Allocation contexts > + > +Allocation context 0 is the basic "exists at all" allocation context. If > +an extent is not allocated at allocation context 0, it MUST NOT be > +listed as allocated at another allocation context. This supports sparse > +file semantics on the server side. If a server has only one allocation > +context (the default), then writing to an extent which is allocated in > +that allocation context 0 MUST NOT fail with ENOSPC. > + > +For all other cases, this specification requires no specific semantics > +of allocation contexts. Implementations could support allocation > +contexts with semantics like the following: > + > +- Incremental snapshots; if a block is allocated in one allocation > + context, that implies that it is also allocated in the next level up. > +- Various bits of data about the backend of the storage; e.g., if the > + storage is written on a RAID array, an allocation context could > + return information about the redundancy level of a given extent > +- If the backend implements a write-through cache of some sort, or > + synchronises with other servers, an allocation context could state > + that an extent is "allocated" once it has reached permanent storage > + and/or is synchronized with other servers.
> + > +The only requirement of an allocation context is that it MUST be > +representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`. > + > +Likewise, the syntax of query strings is not specified by this document. > + > +Server implementations SHOULD document their syntax for query strings > +and semantics for resulting allocation contexts in a document like this > +one. > + But this seems strange. Whether something is 'allocated' seems orthogonal to me to whether it needs backing up. Even the fact it's been zeroed (now a hole) might need backing up. So don't we need multiple independent lists of extents? Of course a server might *implement* them under the hood with separate bitmaps or one big bitmap, or no bitmap at all (for instance using file extents on POSIX). -- Alex Bligh
Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
Hi, 29.11.2016 13:50, Wouter Verhelst wrote: Hi, On Mon, Nov 28, 2016 at 06:33:24PM +0100, Wouter Verhelst wrote: However, I'm arguing that if we're going to provide information about snapshots, we should be able to properly refer to these snapshots from within an NBD context. My previous mail suggested adding a negotiation message that would essentially ask the server "tell me about the snapshots you know about", giving them an NBD identifier in the process (accompanied by a "foreign" identifier that is decidedly *not* an NBD identifier and that could be used to match the NBD identifier to something implementation-defined). This would be read-only information; the client cannot ask the server to create new snapshots. We can then later in the protocol refer to these snapshots by way of that NBD identifier. To make this a bit more concrete, I've changed the proposal like so: From bc6d0df4156e670be7b6adea4f2813f44ffa7202 Mon Sep 17 00:00:00 2001 From: Wouter Verhelst Date: Tue, 29 Nov 2016 11:46:04 +0100 Subject: [PATCH] Update with allocation contexts etc Signed-off-by: Wouter Verhelst --- doc/proto.md | 210 +-- 1 file changed, 145 insertions(+), 65 deletions(-) diff --git a/doc/proto.md b/doc/proto.md index dfdd06d..fe7ae53 100644 --- a/doc/proto.md +++ b/doc/proto.md @@ -768,8 +768,6 @@ The field has the following format: to that command to the client. In the absense of this flag, clients SHOULD NOT multiplex their commands over more than one connection to the export. -- bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`; defined by the experimental - `BLOCK_STATUS` extension; see below. Clients SHOULD ignore unknown flags. @@ -871,6 +869,46 @@ of the newstyle negotiation. Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md). +- `NBD_OPT_ALLOC_CONTEXT` (10) + +Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context, +followed by an `NBD_REP_ACK`.
If a server replies to such a request +with no error message, clients MAY send NBD_CMD_BLOCK_STATUS +commands during the transmission phase. + +If the query string is syntactically invalid, the server SHOULD send +`NBD_REP_ERR_INVALID`. If the query string is syntactically valid +but finds no allocation contexts, the server MUST send a single +reply of type `NBD_REP_ACK`. + +This option MUST NOT be requested unless structured replies have +been negotiated first. If a client attempts to do so, a server +SHOULD send `NBD_REP_ERR_INVALID`. + +Data: +- 32 bits, type +- String, query to select a subset of the available allocation + contexts. If this is not specified (i.e., length is 4 and no + command is sent), then the server MUST send all the allocation + contexts it knows about. If specified, this query string MUST + start with a name that uniquely identifies a server + implementation; e.g., the reference implementation that + accompanies this document would support query strings starting + with 'nbd-server:' + +The type may be one of: +- `NBD_ALLOC_LIST_CONTEXT` (1): the list of allocation contexts + selected by the query string is returned to the client without + changing any state (i.e., this does not add allocation contexts + for further usage). +- `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts + selected by the query string is added to the list of existing + allocation contexts. +- `NBD_ALLOC_DEL_CONTEXT` (3): the list of allocation contexts + selected by the query string is removed from the list of used + allocation contexts. Servers SHOULD NOT reuse existing allocation + context IDs. + Option reply types These values are used in the "reply type" field, sent by the server @@ -901,6 +939,18 @@ during option haggling in the fixed newstyle negotiation. Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md). +- `NBD_REP_ALLOC_CONTEXT` (4) + +A description of an allocation context. 
Data: + +- 32 bits, NBD allocation context ID. If the request was NOT of type + `NBD_ALLOC_LIST_CONTEXT`, this field MUST NOT be zero. +- String, name of the allocation context. This is not required to be + a human-readable string, but it MUST be valid UTF-8 data. + +Allocation context ID 0 is implied, and always exists. It cannot be +removed. + There are a number of error reply types, all of which are denoted by having bit 31 set. All error replies MAY have some data set, in which case that data is an error message string suitable for display to the user. @@ -938,15 +988,47 @@ case that data is an error message string suitable for display to the user. Defined by the experimental `INFO`
Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
29.11.2016 13:18, Kevin Wolf wrote: Am 27.11.2016 um 20:17 hat Wouter Verhelst geschrieben: 3. Q: selecting of dirty bitmap to export A: several variants: 1: id of bitmap is in flags field of request pros: - simple cons: - it's a hack. flags field is for other uses. - we'll have to map bitmap names to these "ids" 2: introduce extended nbd requests with variable length and exploit this feature for BLOCK_STATUS command, specifying bitmap identifier. pros: - looks like a true way cons: - we have to create additional extension - possibly we have to create a map, { <=> } 3: external tool should select which bitmap to export. So, in case of Qemu it should be something like qmp command block-export-dirty-bitmap. pros: - simple - we can extend it to behave like (2) later cons: - additional qmp command to implement (possibly, the lesser evil) note: Hmm, external tool can choose between allocated/dirty data too, so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all. Downside of 3, though, is that it moves the definition of what the different states mean outside of the NBD protocol (i.e., the protocol messages are not entirely defined anymore, and their meaning depends on the clients and servers in use). Another point to consider is that option 3 doesn't allow you to access two (or more) different bitmaps from the client without using the side channel all the time to switch back and forth between them and having to drain the request queue each time to avoid races. In general, if we have something that "looks like the true way", I'd advocate to choose this option. Experience tells that we'd regret anything simpler in a year or two, and then we'll have to do the real thing anyway, but still need to support the quick hack for compatibility. 5. Number of status descriptors sent by the server should be restricted. variants: 1: just allow server to restrict this as it wants (which was done in v3) 2: (not excluding 1). Client specifies somehow the maximum for number of descriptors.
2.1: add command flag, which will request only one descriptor (otherwise, no restrictions from the client) 2.2: again, introduce extended nbd requests, and add field to specify this maximum I think having a flag which requests just one descriptor can be useful, but I'm hesitant to add it unless it's actually going to be used; so in other words, I'll leave the decision on that bit to you. The native qemu block layer interface returns the status of only one contiguous chunk, so the easiest way to implement the NBD block driver in qemu would always use that bit. On the other hand, it would be possible for the NBD block driver in qemu to cache the rest of the response internally and answer the next request from the cache instead of sending a new request over the network. Maybe that's what it should be doing anyway for good performance. Also, an idea on 2-4: As we say that dirtiness is unknown to NBD, and an external tool should specify, manage and understand which data is actually transmitted, why not just call it user_data and leave the status field of the reply chunk unspecified in this case? So, I propose one flag for NBD_CMD_BLOCK_STATUS: NBD_FLAG_STATUS_USER. If it is clear, then behaviour is defined by Eric's 'Block provisioning status' paragraph. If it is set, we just leave the status field for some external... protocol? Who knows what this user data is. Yes, this sounds like a reasonable approach. Makes sense to me, too. However, if we have use for a single NBD_FLAG_STATUS_USER bit, we also have use for a second one. If we go with one of the options where the bitmap is selected per command, we're fine because you can simply move the second bit to a different bitmap and do two requests. If we have only a single active bitmap at a time, though, this feels like an actual problem. Another idea, about backups themselves: Why do we need allocated/zero status for backup? IMHO we don't.
Well, I've been thinking so all along, but then I don't really know what it is, in detail, that you want to do :-) I think we do. A block can be dirtied by discarding/write_zero-ing it. Then we want the dirty bit to know that we need to include this block in the incremental backup, but we also want to know that we don't actually have to transfer the data in it. And we will know that automatically, by using structured read command, so separate call to get_block_status is not needed. Kevin -- Best regards, Vladimir
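The point being made above (dirty extents must go into the incremental backup, but dirty extents that read as zero need no data transfer) can be sketched as a small planning step. The extent tuples and helper name here are hypothetical, just to illustrate the reasoning:

```python
def plan_backup_reads(extents):
    """Split dirty extents into those that need a data transfer and those
    that read as zero and can simply be recorded as zero on the backup side.
    Each extent is a hypothetical (offset, length, dirty, reads_as_zero) tuple."""
    reads, zeroes = [], []
    for offset, length, dirty, reads_as_zero in extents:
        if not dirty:
            continue  # unchanged since the last backup; skip entirely
        (zeroes if reads_as_zero else reads).append((offset, length))
    return reads, zeroes
```

This is also why a structured read can make a separate status call unnecessary: the reply itself distinguishes data chunks from holes.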
Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
Hi, On Mon, Nov 28, 2016 at 06:33:24PM +0100, Wouter Verhelst wrote: > However, I'm arguing that if we're going to provide information about > snapshots, we should be able to properly refer to these snapshots from > within an NBD context. My previous mail suggested adding a negotiation > message that would essentially ask the server "tell me about the > snapshots you know about", giving them an NBD identifier in the process > (accompanied by a "foreign" identifier that is decidedly *not* an NBD > identifier and that could be used to match the NBD identifier to > something implementation-defined). This would be read-only information; > the client cannot ask the server to create new snapshots. We can then > later in the protocol refer to these snapshots by way of that NBD > identifier. To make this a bit more concrete, I've changed the proposal like so: From bc6d0df4156e670be7b6adea4f2813f44ffa7202 Mon Sep 17 00:00:00 2001 From: Wouter Verhelst Date: Tue, 29 Nov 2016 11:46:04 +0100 Subject: [PATCH] Update with allocation contexts etc Signed-off-by: Wouter Verhelst --- doc/proto.md | 210 +-- 1 file changed, 145 insertions(+), 65 deletions(-) diff --git a/doc/proto.md b/doc/proto.md index dfdd06d..fe7ae53 100644 --- a/doc/proto.md +++ b/doc/proto.md @@ -768,8 +768,6 @@ The field has the following format: to that command to the client. In the absense of this flag, clients SHOULD NOT multiplex their commands over more than one connection to the export. -- bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`; defined by the experimental - `BLOCK_STATUS` extension; see below. Clients SHOULD ignore unknown flags. @@ -871,6 +869,46 @@ of the newstyle negotiation. Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md). +- `NBD_OPT_ALLOC_CONTEXT` (10) + +Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context, +followed by an `NBD_REP_ACK`.
If a server replies to such a request +with no error message, clients MAY send NBD_CMD_BLOCK_STATUS +commands during the transmission phase. + +If the query string is syntactically invalid, the server SHOULD send +`NBD_REP_ERR_INVALID`. If the query string is syntactically valid +but finds no allocation contexts, the server MUST send a single +reply of type `NBD_REP_ACK`. + +This option MUST NOT be requested unless structured replies have +been negotiated first. If a client attempts to do so, a server +SHOULD send `NBD_REP_ERR_INVALID`. + +Data: +- 32 bits, type +- String, query to select a subset of the available allocation + contexts. If this is not specified (i.e., length is 4 and no + command is sent), then the server MUST send all the allocation + contexts it knows about. If specified, this query string MUST + start with a name that uniquely identifies a server + implementation; e.g., the reference implementation that + accompanies this document would support query strings starting + with 'nbd-server:' + +The type may be one of: +- `NBD_ALLOC_LIST_CONTEXT` (1): the list of allocation contexts + selected by the query string is returned to the client without + changing any state (i.e., this does not add allocation contexts + for further usage). +- `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts + selected by the query string is added to the list of existing + allocation contexts. +- `NBD_ALLOC_DEL_CONTEXT` (3): the list of allocation contexts + selected by the query string is removed from the list of used + allocation contexts. Servers SHOULD NOT reuse existing allocation + context IDs. + Option reply types These values are used in the "reply type" field, sent by the server @@ -901,6 +939,18 @@ during option haggling in the fixed newstyle negotiation. Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md). +- `NBD_REP_ALLOC_CONTEXT` (4) + +A description of an allocation context. 
Data:
+
+- 32 bits, NBD allocation context ID. If the request was NOT of type
+  `NBD_ALLOC_LIST_CONTEXT`, this field MUST NOT be zero.
+- String, name of the allocation context. This is not required to be
+  a human-readable string, but it MUST be valid UTF-8 data.
+
+Allocation context ID 0 is implied, and always exists. It cannot be
+removed.
+
 There are a number of error reply types, all of which are denoted by
 having bit 31 set. All error replies MAY have some data set, in which
 case that data is an error message string suitable for display to the user.
@@ -938,15 +988,47 @@ case that data is an error message string suitable for display to the user. Defined by the experimental `INFO`
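For concreteness, here is a sketch (in Python, purely illustrative) of how a client might build the data field of the proposed `NBD_OPT_ALLOC_CONTEXT` option and parse an `NBD_REP_ALLOC_CONTEXT` reply. The numeric values are taken from the proposal in this thread and are not part of any finalized specification; the surrounding option haggling framing is omitted.

```python
import struct

# Option, reply, and request-type numbers from the proposal above
# (not from a finalized spec).
NBD_OPT_ALLOC_CONTEXT = 10
NBD_REP_ALLOC_CONTEXT = 4
NBD_ALLOC_LIST_CONTEXT = 1
NBD_ALLOC_ADD_CONTEXT = 2
NBD_ALLOC_DEL_CONTEXT = 3

def pack_alloc_context_data(request_type, query=""):
    """Build the data field of NBD_OPT_ALLOC_CONTEXT: a 32-bit type
    followed by the (possibly empty) query string.  Integers are
    big-endian, as everywhere else in NBD."""
    return struct.pack(">I", request_type) + query.encode("utf-8")

def parse_alloc_context_reply(data):
    """Parse the data field of an NBD_REP_ALLOC_CONTEXT reply: a 32-bit
    allocation context ID followed by the context's UTF-8 name."""
    (ctx_id,) = struct.unpack(">I", data[:4])
    name = data[4:].decode("utf-8")
    return ctx_id, name
```

An empty query gives a data length of exactly 4, which per the proposal asks the server to report all the allocation contexts it knows about.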
Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
On 27.11.2016 at 20:17, Wouter Verhelst wrote: > > 3. Q: selecting of dirty bitmap to export > >A: several variants: > > 1: id of bitmap is in flags field of request > > pros: - simple > > cons: - it's a hack. flags field is for other uses. > > - we'll have to map bitmap names to these "ids" > > 2: introduce extended nbd requests with variable length and exploit > > this > > feature for BLOCK_STATUS command, specifying bitmap identifier. > > pros: - looks like a true way > > cons: - we have to create additional extension > >- possibly we have to create a map, > > { <=> } > > 3: external tool should select which bitmap to export. So, in case of > > Qemu > > it should be something like qmp command block-export-dirty-bitmap. > > pros: - simple > >- we can extend it to behave like (2) later > > cons: - additional qmp command to implement (possibly, the lesser > > evil) > > note: Hmm, external tool can choose between allocated/dirty > > data too, > >so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all. > > Downside of 3, though, is that it moves the definition of what the > different states mean outside of the NBD protocol (i.e., the protocol > messages are not entirely defined anymore, and their meaning depends on > the clients and servers in use). Another point to consider is that option 3 doesn't allow you to access two (or more) different bitmaps from the client without using the side channel all the time to switch back and forth between them and having to drain the request queue each time to avoid races. In general, if we have something that "looks like the true way", I'd advocate choosing this option. Experience tells us that we'd regret anything simpler in a year or two, and then we'll have to do the real thing anyway, but still need to support the quick hack for compatibility. > > 5. 
Number of status descriptors, sent by server, should be restricted > >variants: > >1: just allow server to restrict this as it wants (which was done in v3) > >2: (not excluding 1). Client specifies somehow the maximum for number > > of descriptors. > > 2.1: add command flag, which will request only one descriptor > >(otherwise, no restrictions from the client) > > 2.2: again, introduce extended nbd requests, and add field to > >specify this maximum > > I think having a flag which requests just one descriptor can be useful, > but I'm hesitant to add it unless it's actually going to be used; so in > other words, I'll leave the decision on that bit to you. The native qemu block layer interface returns the status of only one contiguous chunk, so the easiest way to implement the NBD block driver in qemu would always use that bit. On the other hand, it would be possible for the NBD block driver in qemu to cache the rest of the response internally and answer the next request from the cache instead of sending a new request over the network. Maybe that's what it should be doing anyway for good performance. > > Also, an idea on 2-4: > > > > As we say, that dirtiness is unknown for NBD, and external tool > > should specify, manage and understand, which data is actually > > transmitted, why not just call it user_data and leave status field > > of reply chunk unspecified in this case? > > > > So, I propose one flag for NBD_CMD_BLOCK_STATUS: > > NBD_FLAG_STATUS_USER. If it is clear, than behaviour is defined by > > Eric's 'Block provisioning status' paragraph. If it is set, we just > > leave status field for some external... protocol? Who knows, what is > > this user data. > > Yes, this sounds like a reasonable approach. Makes sense to me, too. However, if we have use for a single NBD_FLAG_STATUS_USER bit, we also have use for a second one. 
If we go with one of the options where the bitmap is selected per command, we're fine because you can simply move the second bit to a different bitmap and do two requests. If we have only a single active bitmap at a time, though, this feels like an actual problem. > > Another idea, about backups themselves: > > > > Why do we need allocated/zero status for backup? IMHO we don't. > > Well, I've been thinking so all along, but then I don't really know what > it is, in detail, that you want to do :-) I think we do. A block can be dirtied by discarding/write_zero-ing it. Then we want the dirty bit to know that we need to include this block in the incremental backup, but we also want to know that we don't actually have to transfer the data in it. Kevin
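Kevin's caching idea above can be sketched as follows: a hypothetical client-side cache that keeps all the (offset, length, status) descriptors from one NBD_CMD_BLOCK_STATUS reply, so that later single-extent queries can be answered locally instead of going over the network. The names and structure are illustrative, not from any implementation.

```python
class ExtentCache:
    """Hypothetical cache of block-status extents on the client side.

    The block layer asks about one contiguous chunk at a time, but a
    server may return several descriptors per request; keep the extras
    and answer later lookups from the cache where possible.
    """

    def __init__(self):
        self.extents = []  # sorted list of (offset, length, status)

    def insert(self, extents):
        # Merge newly received descriptors; a real implementation would
        # also coalesce overlaps and bound the cache size.
        self.extents = sorted(self.extents + list(extents))

    def lookup(self, offset):
        """Return (remaining_length, status) for the extent covering
        offset, or None if the server must be asked."""
        for start, length, status in self.extents:
            if start <= offset < start + length:
                # Clip to the queried offset so the caller sees how much
                # of the contiguous chunk is left from that point.
                return start + length - offset, status
        return None
```

A lookup that misses simply falls back to issuing a fresh NBD_CMD_BLOCK_STATUS request; whether the added bookkeeping beats the extra round trips is exactly the performance trade-off discussed above.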
Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
On Mon, Nov 28, 2016 at 06:33:24PM +0100, Wouter Verhelst wrote: > Hi Stefan, > > On Mon, Nov 28, 2016 at 11:19:44AM +, Stefan Hajnoczi wrote: > > On Sun, Nov 27, 2016 at 08:17:14PM +0100, Wouter Verhelst wrote: > > > Quickly: the reason I haven't merged this yet is twofold: > > > - I wasn't thrilled with the proposal at the time. It felt a bit > > > hackish, and bolted onto NBD so you could use it, but without defining > > > everything in the NBD protocol. "We're reading some data, but it's not > > > about you". That didn't feel right > > > > > > - There were a number of questions still unanswered (you're answering a > > > few below, so that's good). > > > > > > For clarity, I have no objection whatsoever to adding more commands if > > > they're useful, but I would prefer that they're also useful with NBD on > > > its own, i.e., without requiring an initiation or correlation of some > > > state through another protocol or network connection or whatever. If > > > that's needed, that feels like I didn't do my job properly, if you get > > > my point. > > > > The out-of-band operations you are referring to are for dirty bitmap > > management. (The goal is to read out blocks that changed since the last > > backup.) > > > > The client does not access the live disk, instead it accesses a > > read-only snapshot and the dirty information (so that it can copy out > > only blocks that were written). The client is allowed to read blocks > > that are not dirty too. > > I understood as much, yes. > > > If you want to implement the whole incremental backup workflow in NBD > > then the client would first have to connect to the live disk, set up > > dirty tracking, create a snapshot export, and then connect to that > > snapshot. > > > > That sounds like a big feature set and I'd argue it's for the control > > plane (storage API) and not the data plane (NBD). 
There were > > discussions about transferring the dirty information via the control > > plane but it seems more appropriate to do it in the data plane since it is > > block-level information. > > I agree that creating and managing snapshots is out of scope for NBD. The > protocol is not set up for that. > > However, I'm arguing that if we're going to provide information about > snapshots, we should be able to properly refer to these snapshots from > within an NBD context. My previous mail suggested adding a negotiation > message that would essentially ask the server "tell me about the > snapshots you know about", giving them an NBD identifier in the process > (accompanied by a "foreign" identifier that is decidedly *not* an NBD > identifier and that could be used to match the NBD identifier to > something implementation-defined). This would be read-only information; > the client cannot ask the server to create new snapshots. We can then > later in the protocol refer to these snapshots by way of that NBD > identifier. > > My proposal also makes it impossible to get updates of newly created > snapshots without disconnecting and reconnecting (due to the fact that > you can't go from transmission back to negotiation), but I'm not sure > that's a problem. > > Doing so has two advantages: > - If a client is accidentally (due to misconfiguration or implementation > bugs or whatnot) connecting to the wrong server after having created a > snapshot through a management protocol, we have an opportunity to > detect this error, due to the fact that the "foreign" identifiers > passed to the client during negotiation will not match with what the > client was expecting. 
> - A future version of the protocol could possibly include an extended > version of the read command, allowing a client to read information > from multiple storage snapshots without requiring a reconnect, and > allowing current clients information about allocation status across > various snapshots (although a first implementation could very well > limit itself to only having one snapshot). Sorry, I misunderstood you. Snapshots are not very different from NBD exports. Especially if the storage system supports writeable-snapshot (aka cloning). Should we just use named exports? Stefan
Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
Hi Wouter, Some of this mess may be partially my fault, but I have not been following the NBD extension proposals up until this point. Are you familiar with the genesis behind this idea and what we are trying to accomplish in general? We had the thought to propose an extension roughly similar to SCSI's 'get lba status', but the existing status bits there did not correlate semantically to what we are hoping to convey. There was some debate over whether it would be an abuse of protocol to attempt to use them as such. For a quick recap, get lba status appears to offer three canonical statuses: 0h: "LBA extent is mapped, [...] or has an unknown state" 1h: "[...] LBA extent is deallocated" 2h: "[...] LBA extent is anchored" My interpretation of mapped was simply that it was physically allocated, and 'deallocated' was simply unallocated. (I uh, am not actually clear on what anchored means exactly.) Either way, we felt at the time that it would be wrong to propose an analogue command for NBD and then immediately abuse the existing semantics, hence a new command like -- but not identical to -- the SCSI one. On 11/27/2016 02:17 PM, Wouter Verhelst wrote: Hi Vladimir, Quickly: the reason I haven't merged this yet is twofold: - I wasn't thrilled with the proposal at the time. It felt a bit hackish, and bolted onto NBD so you could use it, but without defining everything in the NBD protocol. "We're reading some data, but it's not about you". That didn't feel right - There were a number of questions still unanswered (you're answering a few below, so that's good). For clarity, I have no objection whatsoever to adding more commands if they're useful, but I would prefer that they're also useful with NBD on its own, i.e., without requiring an initiation or correlation of some state through another protocol or network connection or whatever. If that's needed, that feels like I didn't do my job properly, if you get my point. 
On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote: With the availability of sparse storage formats, it is often needed to query status of a particular range and read only those blocks of data that are actually present on the block device. To provide such information, the patch adds the BLOCK_STATUS extension with one new NBD_CMD_BLOCK_STATUS command, a new structured reply chunk format, and a new transmission flag. There exists a concept of data dirtiness, which is required during, for example, incremental block device backup. To express this concept via NBD protocol, this patch also adds a flag to NBD_CMD_BLOCK_STATUS to request dirtiness information rather than provisioning information; however, with the current proposal, data dirtiness is only useful with additional coordination outside of the NBD protocol (such as a way to start and stop the server from tracking dirty sectors). Future NBD extensions may add commands to control dirtiness through NBD. Since NBD protocol has no notion of block size, and to mimic SCSI "GET LBA STATUS" command more closely, it has been chosen to return a list of extents in the response of NBD_CMD_BLOCK_STATUS command, instead of a bitmap.

CC: Pavel Borzenkov
CC: Denis V. Lunev
CC: Wouter Verhelst
CC: Paolo Bonzini
CC: Kevin Wolf
CC: Stefan Hajnoczi
Signed-off-by: Eric Blake
Signed-off-by: Vladimir Sementsov-Ogievskiy
---

v3: Hi all. This is almost a resend of v2 (by Eric Blake); the only change is removing the restriction that the sum of status descriptor lengths must be equal to requested length. I.e., let's permit the server to reply with less data than required if it wants. Reasonable, yes. The length that the client requests should be a maximum (i.e. "I'm interested in this range"), not an exact request. Also, bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now NBD_FLAG_CAN_MULTI_CONN in master branch. Right. 
And, finally, I've rebased this onto current state of extension-structured-reply branch (which itself should be rebased on master IMHO). Probably a good idea, given the above. By this resend I just want to continue the discussion, started about half a year ago. Here is a summary of some questions and ideas from v2 discussion: 1. Q: Synchronisation. Is such data (dirty/allocated) reliable? A: This all is for read-only disks, so the data is static and unchangeable. I think we should declare that it's up to the client to ensure no other writes happen without its knowledge. This may be because the client and server communicate out of band about state changes, or because the client somehow knows that it's the only writer, or whatever. We can easily do that by declaring that the result of that command only talks about *current* state, and that concurrent writes by different clients may invalidate the state.
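Since the v3 proposal returns a list of status descriptors rather than a bitmap, the reply body is just a concatenation of fixed-size records. The following sketch assumes each descriptor is a 32-bit extent length plus a 32-bit status field in network byte order; that layout is one plausible reading of the proposal, and the exact field widths were still under discussion in this thread.

```python
import struct

def pack_extents(extents):
    """Encode a BLOCK_STATUS reply body from (length, status) pairs,
    one 8-byte big-endian record per extent.  The field widths here are
    an assumption, not a finalized wire format."""
    return b"".join(struct.pack(">II", length, status)
                    for length, status in extents)

def unpack_extents(data):
    """Decode a reply body back into a list of (length, status) pairs.
    Extents are cumulative: each one starts where the previous ended,
    beginning at the requested offset."""
    return [struct.unpack(">II", data[i:i + 8])
            for i in range(0, len(data), 8)]
```

Because the server may legitimately cover less than the requested length, a client must treat the decoded list as partial and be prepared to issue a follow-up request starting after the last extent.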
Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
Hi Stefan, On Mon, Nov 28, 2016 at 11:19:44AM +, Stefan Hajnoczi wrote: > On Sun, Nov 27, 2016 at 08:17:14PM +0100, Wouter Verhelst wrote: > > Quickly: the reason I haven't merged this yet is twofold: > > - I wasn't thrilled with the proposal at the time. It felt a bit > > hackish, and bolted onto NBD so you could use it, but without defining > > everything in the NBD protocol. "We're reading some data, but it's not > > about you". That didn't feel right > > > > - There were a number of questions still unanswered (you're answering a > > few below, so that's good). > > > > For clarity, I have no objection whatsoever to adding more commands if > > they're useful, but I would prefer that they're also useful with NBD on > > its own, i.e., without requiring an initiation or correlation of some > > state through another protocol or network connection or whatever. If > > that's needed, that feels like I didn't do my job properly, if you get > > my point. > > The out-of-band operations you are referring to are for dirty bitmap > management. (The goal is to read out blocks that changed since the last > backup.) > > The client does not access the live disk, instead it accesses a > read-only snapshot and the dirty information (so that it can copy out > only blocks that were written). The client is allowed to read blocks > that are not dirty too. I understood as much, yes. > If you want to implement the whole incremental backup workflow in NBD > then the client would first have to connect to the live disk, set up > dirty tracking, create a snapshot export, and then connect to that > snapshot. > > That sounds like a big feature set and I'd argue it's for the control > plane (storage API) and not the data plane (NBD). There were > discussions about transferring the dirty information via the control > plane but it seems more appropriate to do it in the data plane since it is > block-level information. I agree that creating and managing snapshots is out of scope for NBD. 
The protocol is not set up for that. However, I'm arguing that if we're going to provide information about snapshots, we should be able to properly refer to these snapshots from within an NBD context. My previous mail suggested adding a negotiation message that would essentially ask the server "tell me about the snapshots you know about", giving them an NBD identifier in the process (accompanied by a "foreign" identifier that is decidedly *not* an NBD identifier and that could be used to match the NBD identifier to something implementation-defined). This would be read-only information; the client cannot ask the server to create new snapshots. We can then later in the protocol refer to these snapshots by way of that NBD identifier. My proposal also makes it impossible to get updates of newly created snapshots without disconnecting and reconnecting (due to the fact that you can't go from transmission back to negotiation), but I'm not sure that's a problem. Doing so has two advantages: - If a client is accidentally (due to misconfiguration or implementation bugs or whatnot) connecting to the wrong server after having created a snapshot through a management protocol, we have an opportunity to detect this error, due to the fact that the "foreign" identifiers passed to the client during negotiation will not match with what the client was expecting. - A future version of the protocol could possibly include an extended version of the read command, allowing a client to read information from multiple storage snapshots without requiring a reconnect, and allowing current clients information about allocation status across various snapshots (although a first implementation could very well limit itself to only having one snapshot). [...] > I'm arguing that the NBD protocol doesn't need to support the > incremental backup workflow since it's a complex control plane concept. 
> > Being able to read dirty information via NBD is useful for other block > backup applications, not just QEMU. It could be used for syncing LVM > volumes across machines, for example, if someone implements an NBD+LVM > server. Indeed, and I was considering adding a basic implementation to go with the copy-on-write support in stock nbd-server, too. > Another issue with adding control plane operations is that you need to > begin considering privilege separation. Should all NBD clients be able > to initiate snapshots, dirty tracking, etc or is some kind of access > control required to limit certain commands? Not all clients require the > same privileges and so they shouldn't have access to the same set of > operations. Sure, which is why I wasn't suggesting anything of the sorts :-) -- < ron> I mean, the main *practical* problem with C++, is there's like a dozen people in the world who think they really understand all of its rules, and pretty much all of them are just lying to themselves too. -- #debian-devel, OFTC, 2016-02-12
Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
On Sun, Nov 27, 2016 at 08:17:14PM +0100, Wouter Verhelst wrote: > Quickly: the reason I haven't merged this yet is twofold: > - I wasn't thrilled with the proposal at the time. It felt a bit > hackish, and bolted onto NBD so you could use it, but without defining > everything in the NBD protocol. "We're reading some data, but it's not > about you". That didn't feel right > > - There were a number of questions still unanswered (you're answering a > few below, so that's good). > > For clarity, I have no objection whatsoever to adding more commands if > they're useful, but I would prefer that they're also useful with NBD on > its own, i.e., without requiring an initiation or correlation of some > state through another protocol or network connection or whatever. If > that's needed, that feels like I didn't do my job properly, if you get > my point. The out-of-band operations you are referring to are for dirty bitmap management. (The goal is to read out blocks that changed since the last backup.) The client does not access the live disk, instead it accesses a read-only snapshot and the dirty information (so that it can copy out only blocks that were written). The client is allowed to read blocks that are not dirty too. If you want to implement the whole incremental backup workflow in NBD then the client would first have to connect to the live disk, set up dirty tracking, create a snapshot export, and then connect to that snapshot. That sounds like a big feature set and I'd argue it's for the control plane (storage API) and not the data plane (NBD). There were discussions about transferring the dirty information via the control plane but it seems more appropriate to do it in the data plane since it is block-level information. I'm arguing that the NBD protocol doesn't need to support the incremental backup workflow since it's a complex control plane concept. Being able to read dirty information via NBD is useful for other block backup applications, not just QEMU. 
It could be used for syncing LVM volumes across machines, for example, if someone implements an NBD+LVM server. Another issue with adding control plane operations is that you need to begin considering privilege separation. Should all NBD clients be able to initiate snapshots, dirty tracking, etc or is some kind of access control required to limit certain commands? Not all clients require the same privileges and so they shouldn't have access to the same set of operations. Stefan
Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
Hi Vladimir, Quickly: the reason I haven't merged this yet is twofold: - I wasn't thrilled with the proposal at the time. It felt a bit hackish, and bolted onto NBD so you could use it, but without defining everything in the NBD protocol. "We're reading some data, but it's not about you". That didn't feel right - There were a number of questions still unanswered (you're answering a few below, so that's good). For clarity, I have no objection whatsoever to adding more commands if they're useful, but I would prefer that they're also useful with NBD on its own, i.e., without requiring an initiation or correlation of some state through another protocol or network connection or whatever. If that's needed, that feels like I didn't do my job properly, if you get my point. On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote: > With the availability of sparse storage formats, it is often needed > to query status of a particular range and read only those blocks of > data that are actually present on the block device. > > To provide such information, the patch adds the BLOCK_STATUS > extension with one new NBD_CMD_BLOCK_STATUS command, a new > structured reply chunk format, and a new transmission flag. > > There exists a concept of data dirtiness, which is required > during, for example, incremental block device backup. To express > this concept via NBD protocol, this patch also adds a flag to > NBD_CMD_BLOCK_STATUS to request dirtiness information rather than > provisioning information; however, with the current proposal, data > dirtiness is only useful with additional coordination outside of > the NBD protocol (such as a way to start and stop the server from > tracking dirty sectors). Future NBD extensions may add commands > to control dirtiness through NBD. 
> > Since NBD protocol has no notion of block size, and to mimic SCSI > "GET LBA STATUS" command more closely, it has been chosen to return > a list of extents in the response of NBD_CMD_BLOCK_STATUS command, > instead of a bitmap. > > CC: Pavel Borzenkov > CC: Denis V. Lunev > CC: Wouter Verhelst > CC: Paolo Bonzini > CC: Kevin Wolf > CC: Stefan Hajnoczi > Signed-off-by: Eric Blake > Signed-off-by: Vladimir Sementsov-Ogievskiy > --- > > v3: > > Hi all. This is almost a resend of v2 (by Eric Blake); the only change is > removing the restriction that the sum of status descriptor lengths must be equal > to requested length. I.e., let's permit the server to reply with less data > than required if it wants. Reasonable, yes. The length that the client requests should be a maximum (i.e. "I'm interested in this range"), not an exact request. > Also, bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now > NBD_FLAG_CAN_MULTI_CONN in master branch. Right. > And, finally, I've rebased this onto current state of > extension-structured-reply branch (which itself should be rebased on > master IMHO). Probably a good idea, given the above. > By this resend I just want to continue the discussion, started about half > a year ago. Here is a summary of some questions and ideas from v2 > discussion: > > 1. Q: Synchronisation. Is such data (dirty/allocated) reliable? >A: This all is for read-only disks, so the data is static and unchangeable. I think we should declare that it's up to the client to ensure no other writes happen without its knowledge. This may be because the client and server communicate out of band about state changes, or because the client somehow knows that it's the only writer, or whatever. We can easily do that by declaring that the result of that command only talks about *current* state, and that concurrent writes by different clients may invalidate the state. 
This is true for NBD in general (i.e., concurrent read or write commands from other clients may confuse file systems on top of NBD), so it doesn't change expectations in any way. > 2. Q: different granularities of dirty/allocated bitmaps. Any problems? >A: 1: server replies with status descriptors of any size, granularity > is hidden from the client > 2: dirty/allocated requests are separate and unrelated to each > other, so their granularities are not intersecting Not entirely sure anymore what this is about? > 3. Q: selecting of dirty bitmap to export >A: several variants: > 1: id of bitmap is in flags field of request > pros: - simple > cons: - it's a hack. flags field is for other uses. > - we'll have to map bitmap names to these "ids" > 2: introduce extended nbd requests with variable length and exploit this > feature for BLOCK_STATUS command, specifying bitmap identifier. > pros: - looks like a true way > cons: - we have to create additional extension >