On 5/30/23 20:48, Eric Blake wrote:
> On Tue, May 30, 2023 at 05:48:25PM +0200, Laszlo Ersek wrote:
>> On 5/30/23 16:56, Wouter Verhelst wrote:
>>> On Tue, May 30, 2023 at 01:50:59PM +0200, Laszlo Ersek wrote:
>>>> On 5/25/23 15:00, Eric Blake wrote:
>>
>>>>> +struct nbd_structured_reply_block_status_ext_hdr {
>>>>> +        uint32_t context_id; /* metadata context ID */
>>>>> +        uint32_t count;      /* length of following array */
>>>>
>>>> (1) I think that "length of following array" is confusing here. Per
>>>> spec, this is "descriptor count" -- that's clearer. "Length" could
>>>> be mistaken for "number of bytes".
>>>>
>>>> Also, what was the justification for *not* making "count" uint64_t?
>>>> (Just asking.) I do understand a server is permitted to respond
>>>> with a block status reply that covers less than the requested
>>>> range, so I understand a server, if it needs to, *can* truncate its
>>>> response for the sake of fitting "count" into a uint32_t.
>>>
>>> This is, as you say, the number of block descriptor messages we are
>>> going to send to the client. Each such message is 8 bytes long.
>>
>> 16 bytes, isn't it? (uint64_t length, uint64_t status_flags)
>
> Existing block status was 8 bytes per extent (uint32_t * 2), but yes,
> the extended extent info is 16 bytes (uint64_t * 2).
>
>>
>>> That would mean that by the time you overflow a uint32_t with the
>>> number of extents that are to be sent, you'll be promising to send
>>> 2^32 * 8 (i.e., 2^35) bytes of data, or 32 GiB.
>>
>> (Or 64 GiB if we agree that sizeof(nbd_block_descriptor_ext)=16.)
>>
>> But... yes, I'm aware this is exorbitant, practically speaking.
>
> We already coded into the NBD spec (commit 926a51df) that a server
> SHOULD truncate its response rather than sending more than 2M extent
> entries; I picked that number because 32M (the existing payload cap
> in qemu) / 16 bytes is 2M extents (for non-extended headers, it
> actually caps things at 16M instead of 32M). In that commit, I
> pointed out that qemu actually caps things at 128k entries (payload
> cap 1M without extended headers), and nbdkit caps things at 1M
> entries.

[*]
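(To make the size math in this subthread concrete -- a sketch only;
the 16-byte descriptor layout below is my understanding from this
discussion, not necessarily the final field names:)

  /* Extended block descriptor, 16 bytes per extent (my reading):
   * sized so that 2^32 of them amount to 2^32 * 16 = 2^36 bytes
   * = 64 GiB of reply payload, versus 2^32 * 8 = 2^35 bytes
   * = 32 GiB with the existing 8-byte descriptors. */
  struct nbd_block_descriptor_ext {
          uint64_t length;       /* length of extent, in bytes */
          uint64_t status_flags; /* per-metacontext status bits */
  } NBD_ATTRIBUTE_PACKED;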
>>
>> My concern was that "practical considerations" must have been what
>> led to the original 32-bit-only:
>>
>>   struct nbd_block_descriptor {
>>           uint32_t length;       /* length of block */
>>           uint32_t status_flags; /* block type (hole etc) */
>>   } NBD_ATTRIBUTE_PACKED;
>>
>> which is now proving too restrictive (being the basis for this entire
>> work, IIUC).
>>
>> A 2^(32+9) B = 2 TB image is not unthinkable. If the image used 512B
>> sectors, and sectors alternated between being a hole and being
>> allocated, then 2^32 extended descriptors would be justified.
>>
>> May not be practical, but that's "policy"; the "mechanism" could
>> still exist (if it doesn't come with undesirable costs).
>>
>>> That would be a ridiculous amount of data, and no user is going to
>>> wait for a client to finish parsing that amount of data without a
>>> hard reboot of their client.
>>
>> Over a 10gbit/s connection, transferring 64GB should take on the
>> order of a minute.
>>
>> ... *parsing* 4.3 billion entries is a different matter, of course ;)
>>
>> OK!
>>
>>> The only change that would be reasonable here is to reduce the size
>>> of this field to 16 bits, rather than increasing it (but then we
>>> lose alignment, so that's also not a good idea)
>>
>> Even putting alignment aside, I don't understand why reducing "count"
>> to uint16_t would be reasonable. With the current 32-bit-only block
>> descriptor, we already need to write loops in libnbd clients, because
>> we can't cover the entire remote image in one API call [*]. If I
>> understood Eric right earlier, the 64-bit extensions were supposed to
>> remedy that -- but as it stands, clients will still need loops
>> ("chunking") around block status fetching; is that right?
>
> While the larger extents reduce the need for looping, they do not
> entirely eliminate it. For example, just because the server can now
> tell you that an image is entirely data in just one reply does not
> mean that it will actually do so - qemu in particular limits block
> status of a qcow2 file to reporting just one cluster at a time for
> consistency reasons, where even if you use the maximum size of 2M
> clusters, you can never get more than (2M/16)*2M = 256G status
> reported in a single request.

I don't understand the calculation. I can imagine the following
interpretation:

- QEMU never sends more than 128K block descriptors, and each
  descriptor covers one 2MB sized cluster --> 256 GB of the disk
  covered in one go.

But I don't understand where the (2M/16) division comes from, even
though the quotient is 128K. I can connect the constant "128K", and
<https://github.com/NetworkBlockDevice/nbd/commit/926a51df>, to your
paragraph [*] above, but not the division.

> But without 64-bit lengths, you are guaranteed to need at least 2
> (and possibly more) round trips to map out an entire 6G image, while
> with 64-bit lengths, you can often get an entire image map in one
> round trip.
>
> Reducing 'count' to uint16_t for 64-bit responses would be possible
> if we wanted to hard-code a server limit of never sending more than
> 64k extents in one go; but above, I argued that existing servers
> currently do exceed that cap even for 32-bit responses.

OK, understood. Looping is still required, but we expect to see it
less exercised in practice, with extended headers.
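(To show what I mean by "chunking", this is roughly the client-side
loop I have in mind -- a sketch only; fetch_extents() is a
hypothetical helper standing in for one block status round trip, not
any real libnbd call:)

  #include <stdint.h>

  /* Hypothetical helper: performs one block status round trip and
   * returns how many bytes of the requested range the server actually
   * covered (possibly fewer than requested), or 0 on error. */
  extern uint64_t fetch_extents (void *handle, uint64_t offset,
                                 uint64_t length);

  static int
  map_whole_image (void *handle, uint64_t export_size)
  {
          uint64_t offset = 0;

          while (offset < export_size) {
                  uint64_t covered;

                  covered = fetch_extents (handle, offset,
                                           export_size - offset);
                  if (covered == 0)
                          return -1;  /* no forward progress; give up */
                  offset += covered;  /* must advance strictly forward */
          }
          return 0;
  }

(With 64-bit lengths the loop body should run far fewer times, but --
as you say -- it cannot be dropped.)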
>>
>> Let me emphasize again that I'm not challenging the spec; also I
>> don't secretly wish for the patches to do more than required by the
>> spec. I guess I'd like to understand the intent of the spec, plus the
>> consequences for client applications.
>>
>> [*] References (in this order!):
>>
>> - https://github.com/libguestfs/virt-v2v/commit/27c056cdc6aa0
>> - https://gitlab.com/nbdkit/libnbd/-/commit/0e714a6e06e6
>> - https://github.com/libguestfs/virt-v2v/commit/a2afed32d8b1f
>>
>> To be less cryptic, the first commit added "chunked" block status
>> fetching to virt-v2v. Because our OCaml language libnbd bindings
>> weren't proper at the time, that loop could move backwards and get
>> stuck. We fixed that in the second commit (regarding the bindings)
>> and then adapted virt-v2v's loop to the fixed bindings in the third
>> commit. I've been hoping that such complexities could be eliminated
>> in the future by virtue of not having to "chunk" the block status
>> fetching.
>
> We'll always have to deal with servers that send shorter replies than
> we asked for, and where consecutive replies may have the same status.
> The best the spec was able to do was recommend that the server return
> as much as it can, and without consecutive status, where feasible.
>
>>
>> (
>>
>> BTW I'm foreseeing a problem: if the extended block descriptor can
>> provide an unsigned 64-bit length, we're going to have trouble
>> exposing that in OCaml, because OCaml only has signed 64-bit
>> integers. So that's going to reproduce the same issue, only for
>> OCaml callers of the *new* API.
>
> At one point, I was playing with an idea to add a patch to the NBD
> spec to REQUIRE that an export be capped at 2^63-1 bytes (that is,
> the maximum of 'off_t'). It doesn't change any existing
> implementations, and actually frees us up to use signed 64-bit
> numbers where negative values are indeed error cases. I'll probably
> try to revisit my thoughts on that patch front, but don't think it
> holds up this series.
>
>>
>> I can see Eric's series includes patches like "ocaml: Add example
>> for 64-bit extents" -- I've not looked at those yet; for now I'm
>> just wondering what tricks we might need in the bindings generator.
>> The method seen in the "middle patch" above won't work; we don't
>> have a native OCaml "i128" type, for example, that we could use as
>> an escape hatch for representing C's uint64_t.
>
> I _did_ go with the (currently implicit) assumption that no server
> will ever expose larger than off_t can represent, and therefore a
> signed 64-bit size is good enough.

I'll have to see that patch later, but for now I think making that
assumption explicit would be nice (assert, abort, etc).

> Flags has to be unsigned, but flags is representing something
> different than image size.

I take it you mean "nbd_block_descriptor_ext.status_flags", which is
a "uint64_t".

If it's only used as a bitmap, that should be safe; first because
some spec (dependent on metacontext) will introduce each bit one by
one, and probably not start with bit#63; second because it makes no
sense to sum bitmaps.

However, Wouter mentions up-thread that (dependent on metacontext)
status_flags could actually stand for a length too:

  'Some metadata namespaces treat the "flags" field as an unsigned
  integer, to specify offsets or counters. In that context, the flags
  field should indeed be extended to 64 bits.'

and then dealing with those particular metacontexts in OCaml could be
a problem.

It's probably best to catch unrepresentable values when converting
between OCaml's int64 and C's uint64_t (both ways) in the generator,
as centrally as possible.
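(For example, in the generated C stubs, something along these lines --
a sketch; the helper name is made up, but caml_copy_int64() and
caml_failwith() are the usual OCaml C API entry points:)

  #include <stdint.h>
  #include <caml/mlvalues.h>
  #include <caml/alloc.h>
  #include <caml/fail.h>

  /* Hypothetical central helper for the generated bindings: refuse
   * any uint64_t that OCaml's (signed) int64 cannot represent,
   * instead of silently wrapping to a negative value. The reverse
   * direction would symmetrically reject negative int64 inputs. */
  static value
  uint64_to_int64_checked (uint64_t v)
  {
          if (v > INT64_MAX)
                  caml_failwith ("value not representable in OCaml int64");
          return caml_copy_int64 ((int64_t) v);
  }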
Laszlo