On 04/10/2018 09:07 AM, Nir Soffer wrote:
> On Tue, Apr 10, 2018 at 4:48 PM Kevin Wolf <kw...@redhat.com> wrote:
>
>> On 10.04.2018 at 15:03, Nir Soffer wrote:
>>> On Tue, Apr 10, 2018 at 1:44 PM Richard W.M. Jones <rjo...@redhat.com>
>>> wrote:
>>>
>>>> We now have true zeroing support in oVirt imageio, thanks for that.
>>>>
>>>> However a problem is that ‘qemu-img convert’ issues zero requests for
>>>> the whole disk before starting the transfer.  It does this using 32 MB
>>>> requests which take approx. 1 second each to execute on the oVirt side.
>>>
>>>
>>>> Two problems therefore:
>>>>
>>>> (1) Zeroing the disk can take a long time (eg. 40 GB is approx.
>>>>     20 minutes).  Furthermore there is no progress indication while this
>>>>     is happening.

This is going to be true whether you write zeroes in 32M chunks or in
2G chunks - it takes a long time to write actual zeroes to a block
device if you are unsure of whether the device already contains zeroes.
There is more overhead for sending 64 requests of 32M each than 1
request for 2G; the question is whether that's in the noise (slightly
more data sent over the wire) or impactful (because you have to wait for
more round trips, where the time spent waiting for traffic is on par
with the time spent writing zeroes for a single request).

The only way that a write zeroes request is not going to be slower than
a normal write is if the block device itself supports an efficient way
to guarantee that the sectors of the disk will read as zero (for
example, using things like WRITE_SAME on iSCSI devices).

>>>>
>>>
>>>>     Nothing bad happens: because it is making frequent requests there
>>>>     is no timeout.
>>>>
>>>> (2) I suspect that because we don't have trim support that this is
>>>>     actually causing the disk to get fully allocated on the target.
>>>>
>>>>     The NBD requests are sent with may_trim=1 so we could turn these
>>>>     into trim requests, but obviously cannot do that while there is no
>>>>     trim support.

In fact, if a trim request guarantees that you can read back zeroes
regardless of what was previously on the block device, then that is
precisely what you SHOULD be doing to make write zeroes more efficient
(but only when may_trim=1).

>>>>
>>>
>>> It sounds like nbdkit is emulating trim with zero instead of noop.

No, qemu-img is NOT requesting trim, it is requesting write zeroes.  You
can implement write zeroes with a trim if the trim will read back as
zeroes.  But while trim is advisory, write zeroes has mandatory
semantics on what you read back (where may_trim=1 is a determining
factor on whether the write MUST allocate, or MAY trim.
Ignoring may_trim and always allocating is semantically correct but may
be slower, while trimming is correct only when may_trim=1).

>>>
>>> I'm not sure why qemu-img is trying to do, I hope the nbd maintainer on
>>> qemu side can explain this.
>>
>> qemu-img tries to efficiently zero out the whole device at once so that
>> it doesn't have to use individual small write requests for unallocated
>> parts of the image later on.

At one point, there was a proposal to have the NBD protocol add
something where the server could advertise to the client if it is known
at initial connection time that the export is starting life with ALL
sectors zeroed.  (Easy to prove for a just-created sparse file, a bit
harder to prove for a block device, although at least some iSCSI devices
do have queries to learn if the entire device is unallocated.)  This has
not yet been implemented in the NBD protocol, but may be worth doing.
It is slightly redundant with the NBD_CMD_BLOCK_STATUS that qemu 2.12 is
introducing (in that the client can perform that sort of query itself
rather than the server advertising it at initial connection), but it may
be easy enough to implement even where NBD_CMD_BLOCK_STATUS is difficult
that it would still allow qemu-img to operate more efficiently in some
situations.

But qemu-img DOES know how to skip zeroing a block device if it knows up
front that the device already reads as all zeroes, so the missing piece
of information is getting NBD to tell that to qemu-img.

Meanwhile, NBD_CMD_BLOCK_STATUS is still quite a ways from being
supported in nbdkit, so that's not anything that rhv-upload can exploit
any time soon.

>>
>
> This makes sense if the device is backed by a block device on oVirt side,
> and the NBD support efficient zeroing. But in this case the device is backed
> by an empty sparse file on NFS, and oVirt does not support yet efficient
> zeroing, we just write zeros manually.
>
> I think should be handled on virt-v2v plugin side.
> When zeroing a file raw image,
> you can ignore zero requests after the highest write offset, since the plugin
> created a new image, and we know that the image is empty.

Didn't Rich already try to do that?

+def emulate_zero(h, count, offset):
+    # qemu-img convert starts by trying to zero/trim the whole device.
+    # Since we've just created a new disk it's safe to ignore these
+    # requests as long as they are smaller than the highest write seen.
+    # After that we must emulate them with writes.
+    if offset+count < h['highestwrite']:

Or is the problem that emulate_zero() is only being called if:

+    # Unlike the trim and flush calls, there is no 'can_zero' method
+    # so nbdkit could call this even if the server doesn't support
+    # zeroing.  If this is the case we must emulate.
+    if not h['can_zero']:
+        emulate_zero(h, count, offset)
+        return

rather than doing the 'highestwrite' check unconditionally even when
oVirt supports zero requests?

>
> When the destination is a block device we cannot avoid zeroing since a block
> device may contain junk data (we usually get dirty empty images from our local
> xtremio server).

And that's why qemu-img is starting life with write zeroes requests -
because it needs to guarantee that the image either already started as
all zeroes, or that zeroes are written to overwrite junk data.

>> The problem is that the NBD block driver has max_pwrite_zeroes = 32 MB,
>> so it's not that efficient after all. I'm not sure if there is a real
>> reason for this, but Eric should know.
>>

Yes, I do know.  But it missed qemu 2.12; it's another NBD spec proposal
where I'm also going to submit a qemu patch:

https://lists.debian.org/nbd/2018/03/msg00017.html

Right now, the NBD protocol has no clean distinction between the maximum
data request (hard limit of 32M for NBD_CMD_WRITE in qemu-img) and the
maximum length of a request with no accompanying data
(NBD_CMD_WRITE_ZEROES).
Once we add NBD_INFO_ZERO_SIZE, then it becomes obvious that sending a
2G NBD_CMD_WRITE_ZEROES request makes sense, even when 32M is the
maximum for a normal write; but until that point, qemu is being
conservative and capping EVERYTHING to the 32M limit.

There's also talk about enhancing NBD to support requests larger than 4G
by adding an extension that permits 64-bit lengths, but that's further
off in the "nice idea, but not yet documented or implemented" category.

>
> We support zero with unlimited size without sending any payload to oVirt, so
> there is no reason to limit zero request by max_pwrite_zeros. This limit may
> make sense when zero is emulated using pwrite.

Even when write zeroes is emulated by falling back to pwrite, the pwrite
can be done in a loop (however, then you get into the game of whether
writing 2G of zeroes takes long enough that you really DO want to
enforce a write zero maximum smaller than 4G, if only to guarantee more
frequent traffic to avoid timing out).

>
>
>>
>>> However, since you suggest that we could use "trim" request for these
>>> requests, it means that these requests are advisory (since trim is), and
>>> we can just ignore them if the server does not support trim.
>>
>> What qemu-img sends shouldn't be a NBD_CMD_TRIM request (which is indeed
>> advisory), but a NBD_CMD_WRITE_ZEROES request. qemu-img relies on the
>> image actually being zeroed after this.
>>
>
> So it seems that may_trim=1 is wrong, since trim cannot replace zero.

No, 'may_trim=1' means you may trim, IF you can guarantee that you can
read back as zero.  If trim can't guarantee a read back as zero, then
may_trim=1 must be ignored and the server must do a write instead.  The
client should always be able to request may_trim=1, whether or not the
server can actually do a trim as an optimization.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
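
For illustration only, here is a minimal sketch of the rule described
above: a write zeroes request may be served by a trim only when
may_trim=1 AND trimmed ranges are guaranteed to read back as zero;
otherwise the server falls back to writing zeroes in a loop.  It is a
toy in-memory model, not nbdkit or oVirt imageio code.  FakeDisk,
write_zeroes, trim_reads_zero and the 4096-byte chunk size are all
hypothetical names and values chosen for the example.

```python
class FakeDisk:
    """Toy in-memory export; every name here is hypothetical."""

    def __init__(self, size, trim_reads_zero):
        # Start full of junk, like a reused block device.
        self.data = bytearray(b"\xff" * size)
        # Assumption: the backend knows whether trimmed ranges read as zero.
        self.trim_reads_zero = trim_reads_zero
        self.writes = 0  # number of pwrite calls issued
        self.trims = 0   # number of trim calls issued

    def pwrite(self, buf, offset):
        self.data[offset:offset + len(buf)] = buf
        self.writes += 1

    def trim(self, count, offset):
        # Trim is advisory: only a backend whose trimmed ranges read back
        # as zero actually clears the data in this model.
        if self.trim_reads_zero:
            self.data[offset:offset + count] = bytes(count)
        self.trims += 1


def write_zeroes(disk, count, offset, may_trim, chunk=4096):
    """Guarantee that [offset, offset + count) reads back as zero."""
    if may_trim and disk.trim_reads_zero:
        # Allowed optimization: the trim is known to read back as zero.
        disk.trim(count, offset)
        return
    # Otherwise may_trim is ignored and zeroes are written in a loop,
    # one chunk per request so that traffic stays frequent.
    zeroes = bytes(chunk)
    end = offset + count
    while offset < end:
        n = min(chunk, end - offset)
        disk.pwrite(zeroes[:n], offset)
        offset += n
```

With trim_reads_zero=False the function never trims, even when the
client passes may_trim=True.  That matches the point that the client can
always request may_trim=1 and leave the decision to the server.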