Re: [Nbd] [RFC PATCH] doc: Add new NBD_FLAG_BLOCK_SIZE extension

Eric Blake Tue, 12 Apr 2016 07:23:40 -0700

On 04/12/2016 04:34 AM, Alex Bligh wrote:
> 
> On 12 Apr 2016, at 05:04, Eric Blake <[email protected]> wrote:
> 
>> Existing NBD servers often have limitations, such as requiring
>> actions to be aligned to block sizes or limiting maximum
>> transactions to avoid denial of service attacks; for example,
>> qemu's NBD server refuses any transaction larger than 32M.  But
>> to date, clients have to learn these limitations via out-of-band
>> means.
> 
> I fully support solving this.
> 
>> This adds a new negotiation flag to let client and server agree
>> to alter the response to NBD_OPT_EXPORT_NAME and NBD_OPT_GO to
>> advertise the server's limitations, as well as documenting sane
>> defaults that a client should follow if the information was not
>> documented.
>>
>> This is done as a global flag rather than a new NBD_OPT, for a
>> couple of reasons.  First, block size information can be
>> implemented even without support for NBD_FLAG_FIXED_NEWSTYLE,
>> because of the reserved block of zeroes sent in response to
>> NBD_CMD_EXPORT_NAME.
> 
> This is a bad reason. Clients & servers without support
> for NBD_FLAG_FIXED_NEWSTYLE should be burned with fire.
> We should do nothing to help them beyond help them upgrade
> to sanity (I'm serious).
> 
>>  Second, negotiating the option changes
>> the NBD_REP_SERVER message size sent by the server for
>> NBD_OPT_INFO; although this reply is structured and the length
>> will be reliable, it would still be awkward for out-of-order
>> option replies from the server to ping-pong in size within a
>> single session.  Making the negotiation part of the initial
>> global flag handshake means the rest of the session has
>> consistent message sizing.
> 
> I don't think this is a great reason either. The INFO
> extension is still experimental. I would prefer to fix the
> INFO statement to transmit information about the export.


In other words, you're proposing that we limit the advertisement of
block size ONLY via NBD_CMD_INFO/GO, so that gives people a reason to
upgrade. I can buy that, if Wouter likes it.

I think it's probably time to promote NBD_CMD_INFO/GO to stable, since
we have several implementations coded up (I'm too late for qemu 2.6 to
contain it, but qemu 2.7 will have the code merged in as soon as 2.6 is
released) - you already mentioned that you looked at what it would take
to get the reference implementation to also support it (with the tricky
part being how to get the export size).

> 
> Perhaps instead of the name (before the other bit that looks
> like NBD_OPT_EXPORT_NAME's reply), we should *now* specify
> a structured selection of export information, e.g.
> 
> 0000: 32 bits: information element length
> 0004: 32 bits: information element type
> 0008 ... length: information element data

If we're going to send something other than the name, then it is no
longer NBD_REP_SERVER, and we'd be better off inventing a new NBD_REP_.
 But if we keep it NBD_REP_SERVER (start with name like always, and end
with transmission flags like NBD_OPT_EXPORT_NAME), we can still make the
middle as structured as we want, as in this example:

S: 64 bits magic (0x3e889045565a9)
S: 32 bits NBD_OPT_INFO (6)
S: 32 bits NBD_REP_SERVER (2)
S: 32 bits length
S: 32 bits name length (3)
S: name length bytes ("foo")
S: 16 bits element type NBD_INFO_BLOCK_SIZE (1)
S: 16 bits element length (12)
S: 32 bits minimum block size
S: 32 bits preferred block size
S: 32 bits maximum block size
S: 16 bits element type NBD_INFO_...
S: 16 bits element length ...
S: length bytes for next element...
S: 16 bits element type NBD_INFO_END (0)
S: 64 bits export size
S: 16 bits transmission flags

where we can have as many additional information elements defined in the
middle, and where the client doesn't have to negotiate anything (the
server just goes ahead and provides everything that it knows, and the
client is free to use the element types it knows and ignore via the
length the ones it does not, until reaching NBD_INFO_END).

> 
> We can then put what we like in there, and we have a future-proof
> way of sending mount information.
> 
> The name then can be the first of these, i.e.
> 
> Element type 0001, Data: length of name (32 bits, followed by name).
> 
> We can then transmit the other data as an element type.

If we reuse NBD_REP_SERVER, then name can't be any different from how
NBD_OPT_LIST uses it. Otherwise, we want a new NBD_REP_ type (but then,
we could even make the name optional, as for NBD_OPT_INFO it is only
useful information if the export has a canonical name that differs from
the name the user asked about).


> 
>> +The three block size fields represent constraints that the client
>> +SHOULD obey during transmission phase, as further detailed in
>> +individual transmission requests.  If `NBD_FLAG_C_BLOCK_SIZE` was not
>> +negotiated, the server MUST use zero for all three fields (or omit
>> +them, when `NBD_FLAG_C_NO_ZEROES` is negotiated), and the client
>> +SHOULD assume a minimum block size of 512 bytes, a maximum block size
>> +of 33554432 (32M) bytes (unless the export size is smaller), and no
>> +hint on a preferred block size.  Otherwise, the three values MUST each
>> +be non-zero, with the preferred size at least as large as the minimum,
>> +and the maximum at least as large as the preferred size, and the
>> +export size at least as large as the maximum.
> 
> It seems a bit harsh that I can't specify maximum block size as -1
> to mean 'send what you like' given that's current behaviour.

I can try and permit that - or maybe we make the sizes separately
specified (that is, in your structured reply layout, maximum block size
is optional, omitted for unlimited (in reality limited by export size),
separate from minimum block size.  (Qemu will always advertise a maximum
of 32M, but that doesn't mean we have to forbid implementations that
want to support larger transactions by default).

> 
>> +
>> +The minimum block size MUST be a power of 2, and represents the
>> +smallest addressable length and alignment within the export, even if
>> +writing a block that small requires the server to use a less-efficient
>> +read-modify-write action.  This value MAY be as small as 1 if the
>> +export is backed by a regular file, although the values of 512 or 4096
>> +are more typical for an export backed by a block device.  When block
>> +sizes are negotiated, the server MUST return an export size that is an
>> +integer multiple of the minimum block size.
> 
> I think you should make it clear that ALL requests MUST have an
> offset AND a length which are multiples of the minimum blocksize.
> Ah, you handle this below, but it might be worth saying here
> what this means.
> 
> +1 on export size.
> 
> Should this not be larger than 32M?

True. In fact, minimum block size should probably not be larger than
32k, since the NBD_CMD_FLAG_DF says that we only guarantee that the
server won't fragment at 32k or smaller, but at the same time, the
server should not be fragmenting smaller than minimum block size.  If
the minimum block size is greater than 32k, then the client can't
usefully set NBD_CMD_FLAG_DF, but also can't usefully guarantee the
result will be unfragmented.  And while 4k physical blocks are becoming
more prominent, I seriously doubt we have block devices where minimum
block size larger than 32k are going to be in heavy use any time soon.

I still strongly think that export size MUST be an integer multiple of
minimum block size (otherwise, you have the funky situation of the tail
of the file being inaccessible via NBD) - perhaps this means we should
add a statement that if an NBD server will be visiting a file that is
not a multiple of the minimum block size it is willing to let the client
use, then the server either has to truncate the export size down (the
client can't access the last few bytes of the file), or round it up (the
last sector of the file will have a tail that always reads as zeroes and
where writes into the tail are ignored), so that the client always sees
an export size that is a sane multiple.


> 
>> +The preferred block size MUST be an integer multiple of the minimum
>> +block size, SHOULD be either a power of 2 or equal to the export size,
>> +and MUST NOT be larger than 33554432 (32M) nor the export size.
> 
> See below re export size.
> 
>>  It
>> +represents the minimal length and alignment required for optimal I/O
>> +access; a typical size might be 65536.  The export size MAY be other
>> +than an integer multiple of the preferred size.
>> +
>> +The maximum block size represents the largest length that the server
>> +is willing to handle in a single transaction.  It MUST be an integer
>> +multiple of the minimum block size, MUST be at least the smaller of
>> +33554432 (32M) or the export size, and MUST NOT be larger than the
>> +export size.
> 
> Wording is a bit difficult. I think better (modulo the -1 thing):
> "it MUST be at least 32MB".
> 
> Obviously they can't send offsets or lengths greater than the export
> size, but that's handled elsewhere. In practical terms it's
> easier to get this information from the driver, and leave the export
> size out of it.

Fair enough - so the wording should be tweaked to state that preferred
and maximum sizes MAY be larger than export size, at which point export
size becomes the effectual limit. I'm also starting to think I should
add block sizes as a separate sub-heading up front, particularly if we
DON'T make it possible to get export sizes except by NBD_OPT_GO (that
is, older clients using NBD_OPT_EXPORT_NAME should still read the text
about the client SHOULD NOT assume larger than 32M maximum transactions
without word from the server, because they will be unable to get word
from the server).

> 
> IE I'd prefer what happened is the driver got the native export size
> (length of file or whatever), and adjusted THAT with its own
> minimum block size (rounded it down to the next multiple), then passed
> through that (rounded) export size, together with the minimum, maximum
> and preferred block sizes (each unaltered by the export size).
> 

>> +    If the client negotiated `NBD_FLAG_C_BLOCK_SIZE`, it MUST ensure
>> +    that *offset* and *length* are integer multiples of the minimum
>> +    block size, and SHOULD ensure that they are integer multiples of
>> +    the preferred block size where possible.  The client SHOULD NOT
>> +    request a length larger than the maximum block size.
> 
> MUST NOT if this is a maximum block size, else what's the point?
> 
> I think I would handle this in one place - just say 'on all requests
> that specify a non-zero offset or length, ....'

Okay on both points (again, an argument for making block sizes be its
own subsection in the introduction).

I'll wait to see if Wouter has any comments before posting a v2.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

signature.asc
Description: OpenPGP digital signature

------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z

_______________________________________________
Nbd-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nbd-general

Re: [Nbd] [RFC PATCH] doc: Add new NBD_FLAG_BLOCK_SIZE extension

Reply via email to