On 26/11/2015 10:36, Christian Borntraeger wrote:
> For some unknown reason, this seems to be slightly slower than 2.5-rc1 on my
> old z196. (have not yet tested the z13)
>
> your branch is certainly better regarding malloc, but worse regarding others.

Thanks for taking the time to test this! This is correct, see the cover
letter:

    "[Patches 14 to 16 remove] the duplicate dataplane-specific
    implementation of virtio in favor of the regular one that is already
    used for non-dataplane. While the dataplane implementation is slightly
    more optimized, I chose to keep the other one to avoid another "touch
    all virtio devices" series. Patch 10 alone mostly brings performance
    on par between the two. The remaining 7-8% can be recovered by mostly
    getting rid of tiny address_space_* operations, keeping the rings
    always mapped. Note that the rest of this big series does bring a
    little performance improvement, and already makes up for the lost
    performance."

The profile shows that the culprit is the repeated access to the virtio
ring:

     3.99%  qemu-system-s39  libc-2.18.so       [.] __memcpy_z196
     2.66%  qemu-system-s39  qemu-system-s390x  [.] address_space_lduw_le
     2.51%  qemu-system-s39  qemu-system-s390x  [.] address_space_map
     2.51%  qemu-system-s39  qemu-system-s390x  [.] phys_page_find
     2.24%  qemu-system-s39  qemu-system-s390x  [.] qemu_get_ram_ptr
     2.18%  qemu-system-s39  qemu-system-s390x  [.] address_space_translate_internal
     1.91%  qemu-system-s39  qemu-system-s390x  [.] qemu_coroutine_switch
     1.66%  qemu-system-s39  qemu-system-s390x  [.] address_space_rw
     1.63%  qemu-system-s39  qemu-system-s390x  [.] address_space_stw_le
     1.57%  qemu-system-s39  qemu-system-s390x  [.] address_space_stl_le
     1.57%  qemu-system-s39  qemu-system-s390x  [.] address_space_translate
     1.45%  qemu-system-s39  qemu-system-s390x  [.] virtqueue_pop
     0.91%  qemu-system-s39  qemu-system-s390x  [.] qemu_ram_block_from_host
     0.79%  qemu-system-s39  qemu-system-s390x  [.] vring_desc_read
     0.76%  qemu-system-s39  qemu-system-s390x  [.] qemu_get_ram_block
    -----------
    28.33%

     3.30%  qemu-system-s39  libc-2.18.so       [.] __memcpy_z196
     2.83%  qemu-system-s39  qemu-system-s390x  [.] memory_region_find_rcu
     2.72%  qemu-system-s39  qemu-system-s390x  [.] vring_pop
     1.37%  qemu-system-s39  qemu-system-s390x  [.] address_space_rw
     1.37%  qemu-system-s39  qemu-system-s390x  [.] qemu_get_ram_ptr
     1.18%  qemu-system-s39  qemu-system-s390x  [.] memory_region_find
     0.92%  qemu-system-s39  qemu-system-s390x  [.] get_desc.isra.11
     0.92%  qemu-system-s39  qemu-system-s390x  [.] qemu_ram_block_from_host
     0.84%  qemu-system-s39  qemu-system-s390x  [.] vring_push
    -----------
    15.45%

I would really prefer to get rid of vring.c as soon as the infrastructure
makes it possible, even if it's faster. We know what makes virtio.c slower,
and it's simpler to fix virtio.c than to convert all the other models to
vring.c _plus_ make vring.c safe for migration.

Paolo
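
P.S. To illustrate the "tiny address_space_* operations" versus "keeping the rings always mapped" point, here is a toy sketch in C. It is not QEMU code: ToyGuest, toy_translate, ToyRingMap and the other toy_* names are hypothetical stand-ins, and the translation counter only models the per-access lookup cost seen in the profile. A real implementation would also have to drop the mapping when the guest memory layout changes, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy guest memory; hypothetical, not QEMU's AddressSpace. */
#define TOY_RAM_SIZE 4096u

typedef struct ToyGuest {
    uint8_t ram[TOY_RAM_SIZE];
    unsigned long translations; /* number of address lookups performed */
} ToyGuest;

/* Stand-in for the translate/lookup work done before touching guest RAM. */
static uint8_t *toy_translate(ToyGuest *g, uint32_t addr, size_t len)
{
    g->translations++;
    if (len > TOY_RAM_SIZE || addr > TOY_RAM_SIZE - len) {
        return NULL; /* access would fall outside toy RAM */
    }
    return &g->ram[addr];
}

/* Per-access style: every 16-bit little-endian load pays for a lookup. */
static uint16_t toy_lduw_le(ToyGuest *g, uint32_t addr)
{
    uint8_t *p = toy_translate(g, addr, 2);
    assert(p != NULL);
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Mapped style: look the ring up once and keep the host pointer around. */
typedef struct ToyRingMap {
    uint8_t *base;
    size_t len;
} ToyRingMap;

static int toy_ring_map(ToyGuest *g, ToyRingMap *m, uint32_t addr, size_t len)
{
    m->base = toy_translate(g, addr, len);
    m->len = m->base ? len : 0;
    return m->base ? 0 : -1;
}

/* Load through the cached mapping; no lookup, only a bounds check. */
static uint16_t toy_ring_lduw_le(const ToyRingMap *m, size_t off)
{
    assert(m->len >= 2 && off <= m->len - 2);
    return (uint16_t)(m->base[off] | (m->base[off + 1] << 8));
}

/* Sum n consecutive 16-bit entries, one lookup per entry. */
static uint32_t toy_sum_unmapped(ToyGuest *g, uint32_t addr, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += toy_lduw_le(g, addr + (uint32_t)(2 * i));
    }
    return sum;
}

/* Same sum, but with a single lookup for the whole ring. */
static uint32_t toy_sum_mapped(ToyGuest *g, uint32_t addr, size_t n)
{
    ToyRingMap m;
    uint32_t sum = 0;
    if (toy_ring_map(g, &m, addr, 2 * n) != 0) {
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        sum += toy_ring_lduw_le(&m, 2 * i);
    }
    return sum;
}
```

Both functions return the same sum for the same ring contents. The difference is that toy_sum_unmapped bumps the translation counter once per entry, while toy_sum_mapped bumps it once per ring, which is the kind of saving the cover letter has in mind.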