Using -o preallocation=falloc works great on NFS 4.2 and on local file systems, where fallocate() is supported. When it is not supported, posix_fallocate() falls back to a very inefficient way:
https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96

This reads the last byte of every 4k block and, if the byte is null, writes one null byte. This minimizes the amount of data sent over the wire, but is very slow. In file-posix we optimize this flow by not truncating the file to the final size, so this only writes one null byte for every 4k block, but this is still very slow.

Apart from the poor performance, we have a bug showing that, for some reason, this does not work well with OFD locking:
https://bugzilla.redhat.com/1851097

In oVirt 4.4.2 we avoid the issue by not using -o preallocation=falloc. Instead we use our own fallocate helper:
https://github.com/oVirt/vdsm/blob/master/helpers/fallocate

(We got feedback that the name of this helper is confusing, since it performs a destructive operation when fallocate() is not supported. We will change the name.)

This helper is similar to posix_fallocate(), but instead of falling back to writing one byte per 4k block, it falls back to writing zeros in large blocks. Testing shows that this improves fallocation time by 385% for one disk, and by 468% for 10 concurrent disk preallocations:
https://bugzilla.redhat.com/1850267#c25

I think the next step is to move this change into qemu, so all users can benefit from it. I think the way to do this is to replace posix_fallocate() with fallocate(), and fall back to "full" preallocation if fallocate() is not supported.

However, with the current code, qemu-img create gives us no way to force O_DIRECT for the preallocation, and in qemu-img convert the preallocation step does not respect the -t none flag. Not using O_DIRECT in oVirt is very bad, and is likely to cause timeouts in sanlock when the kernel flushes the page cache.

So the needed changes are:

1. Add a way to control cache in qemu-img create (-t none? -o cache=none?)

2. Respect -t none in qemu-img convert -o preallocation=falloc

3. Replace posix_fallocate() with fallocate()
   https://github.com/qemu/qemu/blob/152be6de9100e58b5d896272e951d4c910bd735a/block/file-posix.c#L1868

4. Fall back to full zeroing if fallocate() is not supported
   https://github.com/qemu/qemu/blob/152be6de9100e58b5d896272e951d4c910bd735a/block/file-posix.c#L1891

5. Probably use a larger zero buffer, 64k is not efficient
   https://github.com/qemu/qemu/blob/152be6de9100e58b5d896272e951d4c910bd735a/block/file-posix.c#L1907

What do you think?

Nir
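
For illustration only, the fallback described in changes 3 to 5 might look roughly like the sketch below. It is not the qemu code and not the vdsm helper. The names preallocate(), write_zeros(), ZERO_BUF_SIZE and BUF_ALIGN are hypothetical, the 1 MiB buffer size is an arbitrary guess at "larger than 64k", and treating EOPNOTSUPP and ENOSYS as "fallocate() is not supported" is an assumption. The buffer is aligned to 4096 bytes on the assumption that this is enough for an fd opened with O_DIRECT.

```c
#define _GNU_SOURCE             /* needed for fallocate() in <fcntl.h> */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical values, chosen for the sketch only. */
#define ZERO_BUF_SIZE ((size_t)1024 * 1024)
#define BUF_ALIGN 4096

/*
 * Write zeros over [offset, offset + len) in large blocks.
 * This is destructive: existing data in the range is overwritten.
 * Returns 0 on success or a negative errno value.
 */
int write_zeros(int fd, off_t offset, off_t len)
{
    void *buf = NULL;
    int ret;

    assert(offset >= 0 && len >= 0);

    /* posix_memalign() returns the error number instead of setting errno. */
    ret = posix_memalign(&buf, BUF_ALIGN, ZERO_BUF_SIZE);
    if (ret != 0) {
        return -ret;
    }
    memset(buf, 0, ZERO_BUF_SIZE);

    while (len > 0) {
        size_t n = len < (off_t)ZERO_BUF_SIZE ? (size_t)len : ZERO_BUF_SIZE;
        ssize_t done = pwrite(fd, buf, n, offset);

        if (done < 0) {
            if (errno == EINTR) {
                continue;       /* interrupted, retry the same range */
            }
            ret = -errno;
            break;
        }
        if (done == 0) {
            ret = -EIO;         /* no progress, avoid looping forever */
            break;
        }
        offset += done;         /* short writes just advance the range */
        len -= done;
    }

    free(buf);
    return ret;
}

/*
 * Try fallocate() first. Only when the file system or kernel does not
 * support it, fall back to writing zeros in large blocks.
 */
int preallocate(int fd, off_t offset, off_t len)
{
    if (fallocate(fd, 0, offset, len) == 0) {
        return 0;
    }
    if (errno != EOPNOTSUPP && errno != ENOSYS) {
        return -errno;          /* real error, e.g. ENOSPC */
    }
    return write_zeros(fd, offset, len);
}
```

With O_DIRECT the caller would also have to pass an offset and length aligned to the block size. The sketch does not check that, and it does not try to protect existing data in the range when it falls back to zeroing.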
