On Thu, Jun 05, 2025 at 09:34:01AM +0100, Daniel P. Berrangé wrote:
> On Wed, Jun 04, 2025 at 03:18:43PM -0400, Stefan Hajnoczi wrote:
> > Since commit 7ff9ff039380 ("meson: mitigate against use of uninitialize
> > stack for exploits") the -ftrivial-auto-var-init=zero compiler option is
> > used to zero local variables. While this reduces security risks
> > associated with uninitialized stack data, it introduced a measurable
> > bottleneck in the virtqueue_split_pop() and virtqueue_packed_pop()
> > functions.
> > 
> > These virtqueue functions are in the hot path. They are called for each
> > element (request) that is popped from a VIRTIO device's virtqueue. Using
> > __attribute__((uninitialized)) on large stack variables in these
> > functions improves fio randread bs=4k iodepth=64 performance from 304k
> > to 332k IOPS (+9%).
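For illustration, here is a minimal sketch of the opt-out pattern. It assumes GCC 12+ or Clang, which accept `__attribute__((uninitialized))`, and a build that uses -ftrivial-auto-var-init=zero. The names MAYBE_UNINITIALIZED, MAX_ELEMS and gather() are hypothetical and are not QEMU code:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>   /* struct iovec (POSIX) */

/* Hypothetical macro: expands to the GCC/Clang attribute when the compiler
   supports it, otherwise to nothing. */
#if defined(__has_attribute)
# if __has_attribute(uninitialized)
#  define MAYBE_UNINITIALIZED __attribute__((uninitialized))
# endif
#endif
#ifndef MAYBE_UNINITIALIZED
# define MAYBE_UNINITIALIZED
#endif

#define MAX_ELEMS 1024  /* assumption: mirrors a 1024-entry virtqueue limit */

/* Hypothetical hot-path function: fills n iovecs pointing into buf and
   returns the total length. Without the attribute, -ftrivial-auto-var-init=zero
   would clear the whole array on every call, even though only the first n
   entries are ever written. */
size_t gather(const unsigned char *buf, size_t n, size_t chunk)
{
    struct iovec iov[MAX_ELEMS] MAYBE_UNINITIALIZED;
    size_t total = 0;

    assert(n <= MAX_ELEMS);
    for (size_t i = 0; i < n; i++) {
        iov[i].iov_base = (void *)(buf + i * chunk);
        iov[i].iov_len = chunk;
    }
    /* Read back only the entries written above; the rest stay uninitialized. */
    for (size_t i = 0; i < n; i++) {
        total += iov[i].iov_len;
    }
    return total;
}
```

The attribute only suppresses the compiler-inserted zeroing for that one variable. The function itself must still avoid reading entries it has not written, which is what the second loop does.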
> 
> IIUC, the 'hwaddr addr' variable is 8k in size, and the 'struct iovec iov'
> array is 16k in size, so we have 24k on the stack that we're clearing and
> then later writing the real value. Makes sense that this would have a
> perf impact in a hotpath.
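The 24k figure can be checked with sizeof. This sketch assumes that hwaddr is a 64-bit integer, that VIRTQUEUE_MAX_SIZE is 1024, and an LP64 target where struct iovec is 16 bytes. The helper zeroed_stack_bytes() is hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>   /* struct iovec (POSIX) */

typedef uint64_t hwaddr;         /* assumption: 64-bit guest physical address type */
#define VIRTQUEUE_MAX_SIZE 1024  /* assumption: maximum descriptors per element */

static_assert(sizeof(hwaddr) == 8, "hwaddr is expected to be 64-bit");

/* Bytes of stack that -ftrivial-auto-var-init=zero clears per call when a
   function declares hwaddr addr[VIRTQUEUE_MAX_SIZE] and
   struct iovec iov[VIRTQUEUE_MAX_SIZE] as locals. */
size_t zeroed_stack_bytes(void)
{
    return sizeof(hwaddr) * VIRTQUEUE_MAX_SIZE          /* 8 * 1024  = 8 KiB */
         + sizeof(struct iovec) * VIRTQUEUE_MAX_SIZE;   /* 16 * 1024 = 16 KiB on LP64 */
}
```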
> 
> > This issue was found using perf-top(1). virtqueue_split_pop() was one of
> > the top CPU consumers and the "annotate" feature showed that the memory
> > zeroing instructions at the beginning of the functions were hot.
> 
> When you say you found it with 'perf-top', was that just discovered by
> accident, or was this usage of perf-top in response to users reporting
> a performance degradation vs earlier QEMU?

By accident. I was looking for ways to optimize my work-in-progress QEMU
io_uring patches and virtqueue_split_pop() stood out in the CPU time
profile.
