On 11/1/19 4:09 PM, Vladimir Sementsov-Ogievskiy wrote:
> 01.11.2019 15:34, Max Reitz wrote:
>> On 01.11.19 12:20, Max Reitz wrote:
>>> On 01.11.19 12:16, Vladimir Sementsov-Ogievskiy wrote:
>>>> 01.11.2019 14:12, Max Reitz wrote:
>>>>> On 01.11.19 11:28, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> 01.11.2019 13:20, Max Reitz wrote:
>>>>>>> On 01.11.19 11:00, Max Reitz wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> This series builds on the previous RFC. The workaround is now applied regardless of AIO mode and filesystem, because we don’t know those things for remote filesystems. Furthermore, bdrv_co_get_self_request() has been moved to block/io.c.
>>>>>>>>
>>>>>>>> Applying the workaround unconditionally is fine from a performance standpoint, because it should actually be dead code, thanks to patch 1 (the elephant in the room). As far as I know, there is no block driver but qcow2 (in handle_alloc_space()) that would submit zero writes as part of normal I/O, so that they could occur concurrently to other write requests. It still makes sense to keep the workaround in file-posix, because we can’t really prevent some other block driver from submitting zero writes as part of normal I/O in the future.
>>>>>>>>
>>>>>>>> Anyway, let’s get to the elephant.
>>>>>>>>
>>>>>>>> From input by XFS developers (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) it seems clear that c8bb23cbdbe causes fundamental performance problems on XFS with aio=native that cannot be fixed. In other cases, c8bb23cbdbe improves performance, or we wouldn’t have it.
>>>>>>>>
>>>>>>>> In general, avoiding performance regressions is more important than improving performance, unless the regressions are just a minor corner case or insignificant when compared to the improvement. The XFS regression is no minor corner case, and it isn’t insignificant. Laurent Vivier has found performance to decrease by as much as 88 % (on ppc64le, fio in a guest with 4k blocks, iodepth=8: 1662 kB/s, down from 13.9 MB/s).
>>>>>>>
>>>>>>> Ah, crap.
>>>>>>>
>>>>>>> I wanted to send this series as early today as possible to get as much feedback as possible, so I’ve only started doing benchmarks now.
>>>>>>>
>>>>>>> The obvious
>>>>>>>
>>>>>>> $ qemu-img bench -t none -n -w -S 65536 test.qcow2
>>>>>>>
>>>>>>> on XFS takes something like 6 seconds on master, and 50 to 80 seconds with c8bb23cbdbe reverted. So now on to guest tests...
>>>>>>
>>>>>> Aha, that's very interesting) What about aio=native, which should be slowed down? Could it be tested like this?
>>>>>
>>>>> That is aio=native (-n).
>>>>>
>>>>> But so far I don’t see any significant difference in guest tests (i.e., fio --rw=write --bs=4k --iodepth=8 --runtime=1m --direct=1 --ioengine=libaio --thread --numjobs=16 --size=2G --time_based), neither with 64 kB nor with 2 MB clusters. (But only on XFS; I’ll still have to look at ext4.)
>>>>
>>>> Hmm, this possibly mostly tests writes to already allocated clusters. Does fio have an option to behave like qemu-img bench with -S 65536, i.e. write once into each cluster?
>>>
>>> Maybe, but is that a realistic depiction of whether this change is worth it? That is why I’m doing the guest test: to see whether it actually has much impact on the guest.
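
As an aside on the "write once into each cluster" question: the following is only a sketch, not a command from the thread, and the filename is a placeholder. If I read the fio documentation correctly, a sequential pattern with holes (rw=write:<skip>) approximates qemu-img bench with -S 65536, i.e. one small write per 64k cluster:

  # each 4k write is followed by a 60k skip, so consecutive writes land
  # 64k apart and every write has to allocate a fresh 64k cluster
  $ fio --name=cluster-touch --filename=/path/to/guest-test-file \
        --rw=write:60k --bs=4k --iodepth=8 --direct=1 \
        --ioengine=libaio --size=2G

Whether such a pattern is a realistic guest workload is exactly the question raised above.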
>> I’ve changed the above fio invocation to use --rw=randwrite and added --fallocate=none. The performance went down, but it went down both with and without c8bb23cbdbe.
>>
>> So on my XFS system (XFS on LUKS on an SSD), I see:
>> - with c8bb23cbdbe: 26.0 - 27.9 MB/s
>> - without c8bb23cbdbe: 25.6 - 27 MB/s
>>
>> On my ext4 system (native on an SSD), I see:
>> - with: 39.4 - 41.5 MB/s
>> - without: 39.4 - 42.0 MB/s
>>
>> So basically no difference for XFS, and really no difference for ext4. (I ran these tests with 2 MB clusters.)
>
> Hmm. I don't know. To me it seems obvious that zeroing a 2M cluster is slow, and simple tests with qemu-img bench show that fallocate is faster than zeroing most of the cluster.
>
> So, if some guest test doesn't show the difference, that means "small write into a new cluster" is effectively a rare case in this test. It doesn't prove that it's always rare and insignificant.
>
> I'm not sure that we have a real-world example that proves the necessity of this optimization, or whether there was some original bug about low performance that was fixed by it. Den, Anton, do we have something about it?
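
For illustration, a rough sketch of the kind of qemu-img bench comparison Vladimir refers to; the image name, size and exact options are made up, not taken from the thread:

  # create an image with 2M clusters and touch each cluster with a
  # single 4k write; -S sets the step between writes, -s the buffer size
  $ qemu-img create -f qcow2 -o cluster_size=2M test.qcow2 16G
  $ qemu-img bench -t none -n -w -s 4096 -S 2097152 test.qcow2

As far as I understand the commit, with c8bb23cbdbe the untouched remainder of each freshly allocated cluster is submitted as an efficient zero write, while with it reverted qcow2 writes the (zeroed) COW areas out as ordinary data, which is what makes the allocation path slow.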
Sorry, I have missed the beginning of the thread.

Which driver is used for the virtual disk, i.e. is cached or non-cached I/O used in QEMU? We use non-cached by default, and that could make a significant difference. Max, can you please share the domain.xml of your guest config and the fio file used in the guest? I will recheck to be 120% sure.

Thank you in advance,
Den
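
For reference, and purely as an illustration rather than Max's actual setup: in libvirt the distinction Den asks about is the cache='none'/io='native' attributes on the disk's <driver> element, which end up on the QEMU command line roughly like this hypothetical -drive fragment:

  # cache=none implies O_DIRECT on the image file (the non-cached case),
  # cache=writeback would be the cached case; aio=native selects Linux AIO
  -drive file=/path/to/test.qcow2,format=qcow2,if=virtio,cache=none,aio=native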
