Re: [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
Am 15.11.2018 um 23:27 hat Nir Soffer geschrieben:
> On Sun, Nov 11, 2018 at 6:11 PM Nir Soffer wrote:
>
> > On Wed, Nov 7, 2018 at 7:55 PM Nir Soffer wrote:
> >
> >> On Wed, Nov 7, 2018 at 7:27 PM Kevin Wolf wrote:
> >>
> >>> Am 07.11.2018 um 15:56 hat Nir Soffer geschrieben:
> >>> > Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones wrote:
> >>> >
> >>> > > Another thing I tried was to change the NBD server (nbdkit) so that it
> >>> > > doesn't advertise zero support to the client:
> >>> > >
> >>> > > $ nbdkit --filter=log --filter=nozero memory size=6G logfile=/tmp/log \
> >>> > >     --run './qemu-img convert ./fedora-28.img -n $nbd'
> >>> > > $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c
> >>> > >    2154 Write
> >>> > >
> >>> > > Not surprisingly no zero commands are issued.  The size of the write
> >>> > > commands is very uneven -- it appears to be send one command per block
> >>> > > of zeroes or data.
> >>> > >
> >>> > > Nir: If we could get information from imageio about whether zeroing is
> >>> > > implemented efficiently or not by the backend, we could change
> >>> > > virt-v2v / nbdkit to advertise this back to qemu.
> >>> >
> >>> > There is no way to detect the capability, ioctl(BLKZEROOUT) always
> >>> > succeeds, falling back to manual zeroing in the kernel silently
> >>> >
> >>> > Even if we could, sending zero on the wire from qemu may be even
> >>> > slower, and it looks like qemu send even more requests in this case
> >>> > (2154 vs ~1300).
> >>> >
> >>> > Looks like this optimization in qemu side leads to worse performance,
> >>> > so it should not be enabled by default.
> >>>
> >>> Well, that's overgeneralising your case a bit. If the backend does
> >>> support efficient zero writes (which file systems, the most common case,
> >>> generally do), doing one big write_zeroes request at the start can
> >>> improve performance quite a bit.
> >>>
> >>> It seems the problem is that we can't really know whether the operation
> >>> will be efficient because the backends generally don't tell us. Maybe
> >>> NBD could introduce a flag for this, but in the general case it appears
> >>> to me that we'll have to have a command line option.
> >>>
> >>> However, I'm curious what your exact use case and the backend used in it
> >>> is? Can something be improved there to actually get efficient zero
> >>> writes and get even better performance than by just disabling the big
> >>> zero write?
> >>
> >> The backend is some NetApp storage connected via FC. I don't have
> >> more info on this. We get zero rate of about 1G/s on this storage, which
> >> is quite slow compared with other storage we tested.
> >>
> >> One option we check now is if this is the kernel silent fallback to manual
> >> zeroing when the server advertise wrong value of write_same_max_bytes.
> >
> > We eliminated this using blkdiscard. This is what we get on with this
> > storage zeroing 100G LV:
> >
> > for i in 1 2 4 8 16 32; do time blkdiscard -z -p ${i}m
> > /dev/6e1d84f9-f939-46e9-b108-0427a08c280c/2d5c06ce-6536-4b3c-a7b6-13c6d8e55ade;
> > done
> >
> > real 4m50.851s
> > user 0m0.065s
> > sys 0m1.482s
> >
> > real 4m30.504s
> > user 0m0.047s
> > sys 0m0.870s
> >
> > real 4m19.443s
> > user 0m0.029s
> > sys 0m0.508s
> >
> > real 4m13.016s
> > user 0m0.020s
> > sys 0m0.284s
> >
> > real 2m45.888s
> > user 0m0.011s
> > sys 0m0.162s
> >
> > real 2m10.153s
> > user 0m0.003s
> > sys 0m0.100s
> >
> > We are investigating why we get low throughput on this server, and also
> > will check several other servers.
> >
> >> Having a command line option to control this behavior sounds good. I don't
> >> have enough data to tell what should be the default, but I think the safe
> >> way would be to keep old behavior.
> >
> > We file this bug:
> > https://bugzilla.redhat.com/1648622
>
> More data from even slower storage - zeroing 10G lv on Kaminario K2
>
> # time blkdiscard -z -p 32m /dev/test_vg/test_lv2
>
> real    50m12.425s
> user    0m0.018s
> sys     2m6.785s
>
> Maybe something is wrong with this storage, since we see this:
>
> # grep -s "" /sys/block/dm-29/queue/* | grep write_same_max_bytes
> /sys/block/dm-29/queue/write_same_max_bytes:512
>
> Since BLKZEROOUT always fallback to manual slow zeroing silently,
> maybe we can disable the aggressive pre-zero of the entire device
> for block devices, and keep this optimization for files when fallocate()
> is supported?

I'm not sure what the detour through NBD changes, but qemu-img directly
on a block device doesn't use BLKZEROOUT first, but FALLOC_FL_PUNCH_HOLE.
Maybe we can add a flag that avoids anything that could be slow, such as
BLKZEROOUT, as a fallback (and also the slow emulation that QEMU itself
would do if all kernel calls fail).

Kevin
Re: [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
On Sun, Nov 11, 2018 at 6:11 PM Nir Soffer wrote:

> On Wed, Nov 7, 2018 at 7:55 PM Nir Soffer wrote:
>
>> On Wed, Nov 7, 2018 at 7:27 PM Kevin Wolf wrote:
>>
>>> Am 07.11.2018 um 15:56 hat Nir Soffer geschrieben:
>>> > Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones wrote:
>>> >
>>> > > Another thing I tried was to change the NBD server (nbdkit) so that it
>>> > > doesn't advertise zero support to the client:
>>> > >
>>> > > $ nbdkit --filter=log --filter=nozero memory size=6G logfile=/tmp/log \
>>> > >     --run './qemu-img convert ./fedora-28.img -n $nbd'
>>> > > $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c
>>> > >    2154 Write
>>> > >
>>> > > Not surprisingly no zero commands are issued.  The size of the write
>>> > > commands is very uneven -- it appears to be send one command per block
>>> > > of zeroes or data.
>>> > >
>>> > > Nir: If we could get information from imageio about whether zeroing is
>>> > > implemented efficiently or not by the backend, we could change
>>> > > virt-v2v / nbdkit to advertise this back to qemu.
>>> >
>>> > There is no way to detect the capability, ioctl(BLKZEROOUT) always
>>> > succeeds, falling back to manual zeroing in the kernel silently
>>> >
>>> > Even if we could, sending zero on the wire from qemu may be even
>>> > slower, and it looks like qemu send even more requests in this case
>>> > (2154 vs ~1300).
>>> >
>>> > Looks like this optimization in qemu side leads to worse performance,
>>> > so it should not be enabled by default.
>>>
>>> Well, that's overgeneralising your case a bit. If the backend does
>>> support efficient zero writes (which file systems, the most common case,
>>> generally do), doing one big write_zeroes request at the start can
>>> improve performance quite a bit.
>>>
>>> It seems the problem is that we can't really know whether the operation
>>> will be efficient because the backends generally don't tell us.
>>> Maybe
>>> NBD could introduce a flag for this, but in the general case it appears
>>> to me that we'll have to have a command line option.
>>>
>>> However, I'm curious what your exact use case and the backend used in it
>>> is? Can something be improved there to actually get efficient zero
>>> writes and get even better performance than by just disabling the big
>>> zero write?
>>
>> The backend is some NetApp storage connected via FC. I don't have
>> more info on this. We get zero rate of about 1G/s on this storage, which
>> is quite slow compared with other storage we tested.
>>
>> One option we check now is if this is the kernel silent fallback to manual
>> zeroing when the server advertise wrong value of write_same_max_bytes.
>
> We eliminated this using blkdiscard. This is what we get on with this
> storage zeroing 100G LV:
>
> for i in 1 2 4 8 16 32; do time blkdiscard -z -p ${i}m
> /dev/6e1d84f9-f939-46e9-b108-0427a08c280c/2d5c06ce-6536-4b3c-a7b6-13c6d8e55ade;
> done
>
> real 4m50.851s
> user 0m0.065s
> sys 0m1.482s
>
> real 4m30.504s
> user 0m0.047s
> sys 0m0.870s
>
> real 4m19.443s
> user 0m0.029s
> sys 0m0.508s
>
> real 4m13.016s
> user 0m0.020s
> sys 0m0.284s
>
> real 2m45.888s
> user 0m0.011s
> sys 0m0.162s
>
> real 2m10.153s
> user 0m0.003s
> sys 0m0.100s
>
> We are investigating why we get low throughput on this server, and also
> will check several other servers.
>
>> Having a command line option to control this behavior sounds good. I don't
>> have enough data to tell what should be the default, but I think the safe
>> way would be to keep old behavior.
>
> We file this bug:
> https://bugzilla.redhat.com/1648622

More data from even slower storage - zeroing 10G lv on Kaminario K2

# time blkdiscard -z -p 32m /dev/test_vg/test_lv2

real    50m12.425s
user    0m0.018s
sys     2m6.785s

Maybe something is wrong with this storage, since we see this:

# grep -s "" /sys/block/dm-29/queue/* | grep write_same_max_bytes
/sys/block/dm-29/queue/write_same_max_bytes:512

Since BLKZEROOUT always fallback to manual slow zeroing silently,
maybe we can disable the aggressive pre-zero of the entire device
for block devices, and keep this optimization for files when fallocate()
is supported?

Nir
Re: [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
On Wed, Nov 7, 2018 at 7:55 PM Nir Soffer wrote:

> On Wed, Nov 7, 2018 at 7:27 PM Kevin Wolf wrote:
>
>> Am 07.11.2018 um 15:56 hat Nir Soffer geschrieben:
>> > Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones wrote:
>> >
>> > > Another thing I tried was to change the NBD server (nbdkit) so that it
>> > > doesn't advertise zero support to the client:
>> > >
>> > > $ nbdkit --filter=log --filter=nozero memory size=6G logfile=/tmp/log \
>> > >     --run './qemu-img convert ./fedora-28.img -n $nbd'
>> > > $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c
>> > >    2154 Write
>> > >
>> > > Not surprisingly no zero commands are issued.  The size of the write
>> > > commands is very uneven -- it appears to be send one command per block
>> > > of zeroes or data.
>> > >
>> > > Nir: If we could get information from imageio about whether zeroing is
>> > > implemented efficiently or not by the backend, we could change
>> > > virt-v2v / nbdkit to advertise this back to qemu.
>> >
>> > There is no way to detect the capability, ioctl(BLKZEROOUT) always
>> > succeeds, falling back to manual zeroing in the kernel silently
>> >
>> > Even if we could, sending zero on the wire from qemu may be even
>> > slower, and it looks like qemu send even more requests in this case
>> > (2154 vs ~1300).
>> >
>> > Looks like this optimization in qemu side leads to worse performance,
>> > so it should not be enabled by default.
>>
>> Well, that's overgeneralising your case a bit. If the backend does
>> support efficient zero writes (which file systems, the most common case,
>> generally do), doing one big write_zeroes request at the start can
>> improve performance quite a bit.
>>
>> It seems the problem is that we can't really know whether the operation
>> will be efficient because the backends generally don't tell us. Maybe
>> NBD could introduce a flag for this, but in the general case it appears
>> to me that we'll have to have a command line option.
>>
>> However, I'm curious what your exact use case and the backend used in it
>> is? Can something be improved there to actually get efficient zero
>> writes and get even better performance than by just disabling the big
>> zero write?
>
> The backend is some NetApp storage connected via FC. I don't have
> more info on this. We get zero rate of about 1G/s on this storage, which
> is quite slow compared with other storage we tested.
>
> One option we check now is if this is the kernel silent fallback to manual
> zeroing when the server advertise wrong value of write_same_max_bytes.

We eliminated this using blkdiscard. This is what we get on with this
storage zeroing 100G LV:

for i in 1 2 4 8 16 32; do time blkdiscard -z -p ${i}m
/dev/6e1d84f9-f939-46e9-b108-0427a08c280c/2d5c06ce-6536-4b3c-a7b6-13c6d8e55ade;
done

real 4m50.851s
user 0m0.065s
sys 0m1.482s

real 4m30.504s
user 0m0.047s
sys 0m0.870s

real 4m19.443s
user 0m0.029s
sys 0m0.508s

real 4m13.016s
user 0m0.020s
sys 0m0.284s

real 2m45.888s
user 0m0.011s
sys 0m0.162s

real 2m10.153s
user 0m0.003s
sys 0m0.100s

We are investigating why we get low throughput on this server, and also
will check several other servers.

> Having a command line option to control this behavior sounds good. I don't
> have enough data to tell what should be the default, but I think the safe
> way would be to keep old behavior.

We file this bug:
https://bugzilla.redhat.com/1648622

Nir
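For reference, the grep/sed/uniq pipeline quoted in this thread can be reproduced in a few lines of Python, which makes it easier to later extend (e.g. summing bytes per command type). Two assumptions to note: the exact nbdkit `--filter=log` line format is assumed rather than verified here, and totals are counted per command, whereas `uniq -c` counts runs of consecutive identical commands.

```python
import re
from collections import Counter

def count_commands(log_lines):
    r"""Tally nbdkit --filter=log entries per command (Write, Zero, ...).

    Roughly mirrors:
        grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c
    except that it counts totals instead of consecutive runs.
    """
    counts = Counter()
    for line in log_lines:
        if not line.rstrip().endswith("..."):
            continue  # keep only submission lines, as the grep '\.\.\.$' does
        m = re.search(r"\b([A-Z][a-z]+)\b", line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

With a log in hand, `count_commands(open("/tmp/log"))` gives the same kind of breakdown as the shell pipeline (193 Zero / 1246 Write in Richard's qemu 2.12 run).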
Re: [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
On Wed, Nov 7, 2018 at 6:42 PM Eric Blake wrote:
> On 11/7/18 6:13 AM, Richard W.M. Jones wrote:
> > (I'm not going to claim this is a bug, but it causes a large, easily
> > measurable performance regression in virt-v2v).
>
> I haven't closely looked at this email thread yet, but a quick first
> impression:
>
> > In qemu 2.12 this behaviour changed:
> >
> >   $ nbdkit --filter=log memory size=6G logfile=/tmp/log \
> >       --run './qemu-img convert ./fedora-28.img -n $nbd'
> >   $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c
> >    193 Zero
> >   1246 Write
> >
> > It now zeroes the whole disk up front and then writes data over the
> > top of the zeroed blocks.
>
> There was talk on the NBD list a while ago about the idea of letting the
> server advertise to the client when the image is known to start in an
> all-zero state, so that the client doesn't have to waste time writing
> zeroes (or relying on repeated NBD_CMD_BLOCK_STATUS calls to learn the
> same). This may be justification for reviving that topic.

This is a good idea in general, since in some cases we know that a
volume is already zeroed (e.g. a new file on NFS/Gluster storage). But
with block storage we typically don't have any guarantee about the
storage content, and qemu needs to zero or write the entire device, so
this does not solve the issue discussed in this thread.

Nir
Re: [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
On Wed, Nov 7, 2018 at 7:27 PM Kevin Wolf wrote: > Am 07.11.2018 um 15:56 hat Nir Soffer geschrieben: > > Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones > wrote: > > > > > Another thing I tried was to change the NBD server (nbdkit) so that it > > > doesn't advertise zero support to the client: > > > > > > $ nbdkit --filter=log --filter=nozero memory size=6G > logfile=/tmp/log \ > > > --run './qemu-img convert ./fedora-28.img -n $nbd' > > > $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c > > >2154 Write > > > > > > Not surprisingly no zero commands are issued. The size of the write > > > commands is very uneven -- it appears to be send one command per block > > > of zeroes or data. > > > > > > Nir: If we could get information from imageio about whether zeroing is > > > implemented efficiently or not by the backend, we could change > > > virt-v2v / nbdkit to advertise this back to qemu. > > > > There is no way to detect the capability, ioctl(BLKZEROOUT) always > > succeeds, falling back to manual zeroing in the kernel silently > > > > Even if we could, sending zero on the wire from qemu may be even > > slower, and it looks like qemu send even more requests in this case > > (2154 vs ~1300). > > > > Looks like this optimization in qemu side leads to worse performance, > > so it should not be enabled by default. > > Well, that's overgeneralising your case a bit. If the backend does > support efficient zero writes (which file systems, the most common case, > generally do), doing one big write_zeroes request at the start can > improve performance quite a bit. > > It seems the problem is that we can't really know whether the operation > will be efficient because the backends generally don't tell us. Maybe > NBD could introduce a flag for this, but in the general case it appears > to me that we'll have to have a command line option. > > However, I'm curious what your exact use case and the backend used in it > is? 
> Can something be improved there to actually get efficient zero
> writes and get even better performance than by just disabling the big
> zero write?

The backend is some NetApp storage connected via FC. I don't have more
info on this. We get a zero rate of about 1G/s on this storage, which is
quite slow compared with other storage we tested.

One option we are checking now is whether this is the kernel's silent
fallback to manual zeroing when the server advertises a wrong value of
write_same_max_bytes.

Having a command line option to control this behavior sounds good. I
don't have enough data to tell what should be the default, but I think
the safe way would be to keep the old behavior.

Nir
Re: [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
Am 07.11.2018 um 15:56 hat Nir Soffer geschrieben:
> Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones wrote:
> > Another thing I tried was to change the NBD server (nbdkit) so that it
> > doesn't advertise zero support to the client:
> >
> >   $ nbdkit --filter=log --filter=nozero memory size=6G logfile=/tmp/log \
> >       --run './qemu-img convert ./fedora-28.img -n $nbd'
> >   $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c
> >   2154 Write
> >
> > Not surprisingly no zero commands are issued. The size of the write
> > commands is very uneven -- it appears to be send one command per block
> > of zeroes or data.
> >
> > Nir: If we could get information from imageio about whether zeroing is
> > implemented efficiently or not by the backend, we could change
> > virt-v2v / nbdkit to advertise this back to qemu.
>
> There is no way to detect the capability, ioctl(BLKZEROOUT) always
> succeeds, falling back to manual zeroing in the kernel silently
>
> Even if we could, sending zero on the wire from qemu may be even
> slower, and it looks like qemu send even more requests in this case
> (2154 vs ~1300).
>
> Looks like this optimization in qemu side leads to worse performance,
> so it should not be enabled by default.

Well, that's overgeneralising your case a bit. If the backend does
support efficient zero writes (which file systems, the most common case,
generally do), doing one big write_zeroes request at the start can
improve performance quite a bit.

It seems the problem is that we can't really know whether the operation
will be efficient because the backends generally don't tell us. Maybe
NBD could introduce a flag for this, but in the general case it appears
to me that we'll have to have a command line option.

However, I'm curious what your exact use case and the backend used in it
is? Can something be improved there to actually get efficient zero
writes and get even better performance than by just disabling the big
zero write?

Kevin
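Kevin's file-system case can be demonstrated locally: on most Linux
filesystems a range can be "zeroed" by deallocating it rather than
writing zero bytes, so one big zero request is nearly free. A minimal
sketch (assumes util-linux fallocate and a filesystem that supports hole
punching, e.g. ext4, xfs, or tmpfs):

```shell
#!/bin/sh
# Efficient zeroing on a filesystem: punch a hole instead of writing
# zeroes.  The file still reads back as all zeroes afterwards, but the
# blocks are deallocated, so the "zero write" costs almost no I/O.
f=$(mktemp)
dd if=/dev/urandom of="$f" bs=1M count=4 status=none
before=$(du -k "$f" | cut -f1)     # allocated KiB before
fallocate -p -o 0 -l 4M "$f"       # FALLOC_FL_PUNCH_HOLE over the file
after=$(du -k "$f" | cut -f1)      # allocated KiB after
echo "allocated: ${before}K -> ${after}K"
```

This is the asymmetry the thread is about: a backend with such an
offload benefits from qemu's big up-front zero, while one without it
pays for writing every zero byte.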
Re: [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
On 11/7/18 6:13 AM, Richard W.M. Jones wrote:
> (I'm not going to claim this is a bug, but it causes a large, easily
> measurable performance regression in virt-v2v).

I haven't closely looked at this email thread yet, but a quick first
impression:

> In qemu 2.12 this behaviour changed:
>
>   $ nbdkit --filter=log memory size=6G logfile=/tmp/log \
>       --run './qemu-img convert ./fedora-28.img -n $nbd'
>   $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c
>    193 Zero
>   1246 Write
>
> It now zeroes the whole disk up front and then writes data over the
> top of the zeroed blocks.
>
> The reason for the performance regression is that in the first case we
> write 6G in total. In the second case we write 6G of zeroes up front,
> followed by the amount of data in the disk image (in this case the
> test disk image contains 1G of non-sparse data, so we write about 7G
> in total).

There was talk on the NBD list a while ago about the idea of letting the
server advertise to the client when the image is known to start in an
all-zero state, so that the client doesn't have to waste time writing
zeroes (or relying on repeated NBD_CMD_BLOCK_STATUS calls to learn the
same). This may be justification for reviving that topic.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
Re: [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
On Wed, Nov 07, 2018 at 04:56:48PM +0200, Nir Soffer wrote:
> Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones wrote:
> > Another thing I tried was to change the NBD server (nbdkit) so that it
> > doesn't advertise zero support to the client:
> >
> >   $ nbdkit --filter=log --filter=nozero memory size=6G logfile=/tmp/log \
> >       --run './qemu-img convert ./fedora-28.img -n $nbd'
> >   $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c
> >   2154 Write
> >
> > Nir: If we could get information from imageio about whether zeroing is
> > implemented efficiently or not by the backend, we could change
> > virt-v2v / nbdkit to advertise this back to qemu.
>
> There is no way to detect the capability, ioctl(BLKZEROOUT) always
> succeeds, falling back to manual zeroing in the kernel silently
>
> Even if we could, sending zero on the wire from qemu may be even
> slower,

Yes this is a very good point. Sending zeroes would be terrible.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine. Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
Re: [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones wrote:
> Another thing I tried was to change the NBD server (nbdkit) so that it
> doesn't advertise zero support to the client:
>
>   $ nbdkit --filter=log --filter=nozero memory size=6G logfile=/tmp/log \
>       --run './qemu-img convert ./fedora-28.img -n $nbd'
>   $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c
>   2154 Write
>
> Not surprisingly no zero commands are issued. The size of the write
> commands is very uneven -- it appears to send one command per block
> of zeroes or data.
>
> Nir: If we could get information from imageio about whether zeroing is
> implemented efficiently or not by the backend, we could change
> virt-v2v / nbdkit to advertise this back to qemu.

There is no way to detect the capability: ioctl(BLKZEROOUT) always
succeeds, silently falling back to manual zeroing in the kernel.

Even if we could, sending zeroes on the wire from qemu may be even
slower, and it looks like qemu sends even more requests in this case
(2154 vs ~1300).

It looks like this optimization on the qemu side leads to worse
performance, so it should not be enabled by default.

Nir
Re: [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
Another thing I tried was to change the NBD server (nbdkit) so that it
doesn't advertise zero support to the client:

  $ nbdkit --filter=log --filter=nozero memory size=6G logfile=/tmp/log \
      --run './qemu-img convert ./fedora-28.img -n $nbd'
  $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c
  2154 Write

Not surprisingly no zero commands are issued. The size of the write
commands is very uneven -- it appears to send one command per block
of zeroes or data.

Nir: If we could get information from imageio about whether zeroing is
implemented efficiently or not by the backend, we could change
virt-v2v / nbdkit to advertise this back to qemu.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
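The grep/sed pipeline used throughout this thread can be replayed on any
captured log. A self-contained sketch (the sample log lines below are
fabricated for illustration; all the pipeline relies on is that request
lines end in "..." and contain one capitalized verb):

```shell
#!/bin/sh
# Tally request types from an nbdkit --filter=log style logfile.
# Request-start lines end with "..."; completion lines do not, so
# the grep keeps exactly one line per request.
log=$(mktemp)
cat > "$log" <<'EOF'
connection=1 Write id=1 offset=0x0 count=0x10000 ...
connection=1 Zero id=2 offset=0x10000 count=0x20000 ...
connection=1 Write id=3 offset=0x30000 count=0x10000 ...
connection=1 Write id=1 return 0 (Success)
EOF
# The sed keeps the last Capitalized word on each line (the verb).
result=$(grep '\.\.\.$' "$log" | sed 's/.*\([A-Z][a-z]*\).*/\1/' | sort | uniq -c)
echo "$result"
```

The emails above pipe straight into `uniq -c` because the log is already
grouped; the `sort` here makes the sketch order-independent.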