On Tue, Mar 24, 2026 at 10:23:45AM +0100, Boris Brezillon wrote:
> On Mon, 23 Mar 2026 11:38:06 -0700
> Matthew Brost <[email protected]> wrote:
>
> >
> > Ok, getting stats is easier than I thought...
> >
> > ./perf stat -a -e
> > context-switches,cpu-migrations,task-clock,cycles,instructions
> > /home/mbrost/xe/source/drivers.gpu.i915.igt-gpu-tools/build/tests/xe_exec_threads
> > --r threads-basic
> >
> > This test creates one thread per engine instance (7 instances this BMG
> > device) and submits 1k exec IOCTLs per thread, each performing a DW
> > write. Each exec IOCTL typically does not have unsignaled input
> > dependencies.
> >
> > With IRQ putting of jobs off + no bypass (drm_dep_queue_flags = 0):
> >
> >              8,449      context-switches
> >                412      cpu-migrations
> >           2,531.43 msec task-clock
> >      1,847,846,588      cpu_atom/cycles/
> >      1,847,856,947      cpu_core/cycles/
> >    <not supported>      cpu_atom/instructions/
> >        460,744,020      cpu_core/instructions/
> >
> > With IRQ putting of jobs off + bypass (drm_dep_queue_flags =
> > DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED):
> >
> >              8,655      context-switches
> >                229      cpu-migrations
> >           2,571.33 msec task-clock
> >        855,900,607      cpu_atom/cycles/
> >        855,900,272      cpu_core/cycles/
> >    <not supported>      cpu_atom/instructions/
> >        403,651,469      cpu_core/instructions/
> >
> > With IRQ putting of jobs on + bypass (drm_dep_queue_flags =
> > DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED |
> > DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE):
> >
> >              5,361      context-switches
> >                169      cpu-migrations
> >           2,577.44 msec task-clock
> >        685,769,153      cpu_atom/cycles/
> >        685,768,407      cpu_core/cycles/
> >    <not supported>      cpu_atom/instructions/
> >        321,336,297      cpu_core/instructions/
>
> Thanks for sharing those numbers. For completeness, can you also add the
> "With IRQ putting of jobs on + no bypass" case?
>

Yes, and I will also share a DRM sched baseline. I figured out that power
can be measured as well; initial results confirm what I expected: less
power.

I'm putting together a doc based on running glxgears and another benchmark
on top of Ubuntu 24.10 + Wayland, which has explicit sync
(linux-drm-syncobj, behaves like SurfaceFlinger when rendering flag to not
pass in fences to draw jobs). I almost have all the data and will share it
here once I do.

> I'm a bit surprised by the difference in number of context switches
> given I'd expect the local-CPU to be picked in priority, and so queuing
> work items on the same wq from another work item to be almost free in
> term on scheduling. But I guess there's some load-balancing happening
> when you execute jobs at such a high rate.
>
> Also, I don't know if that's just noise or if it's reproducible, but
> task-clock seems to be ~40usec lower with the deferred cleanup and
> no-bypass (higher throughput because you're not blocking the dequeuing
> of the next job on the cleanup of the previous one, I suspect).

I think that is just noise from what the test is doing in user space;
that bounces around a bit.

Matt
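
For anyone who wants to compare the three quoted runs at a glance, the
relative deltas against the first run (drm_dep_queue_flags = 0) can be
computed with a short script. This is only a sketch: the counter values are
copied from the perf stat output quoted above (cpu_core events only), while
the run labels, dictionary keys and helper names are made up for
illustration and do not come from the series.

```python
# Counters copied from the quoted perf stat runs (cpu_core events only).
# Run labels and key names are illustrative, not taken from the patches.
RUNS = {
    "irq_put_off_no_bypass": {
        "context_switches": 8_449,
        "cpu_migrations": 412,
        "task_clock_msec": 2_531.43,
        "cycles": 1_847_856_947,
        "instructions": 460_744_020,
    },
    "irq_put_off_bypass": {
        "context_switches": 8_655,
        "cpu_migrations": 229,
        "task_clock_msec": 2_571.33,
        "cycles": 855_900_272,
        "instructions": 403_651_469,
    },
    "irq_put_on_bypass": {
        "context_switches": 5_361,
        "cpu_migrations": 169,
        "task_clock_msec": 2_577.44,
        "cycles": 685_768_407,
        "instructions": 321_336_297,
    },
}


def percent_change(baseline, value):
    """Signed change of value relative to baseline, in percent."""
    return (value - baseline) / baseline * 100.0


def compare(runs, baseline_name):
    """Return {run: {counter: percent change vs. the baseline run}}."""
    base = runs[baseline_name]
    return {
        name: {key: percent_change(base[key], val) for key, val in counters.items()}
        for name, counters in runs.items()
        if name != baseline_name
    }


if __name__ == "__main__":
    for name, deltas in compare(RUNS, "irq_put_off_no_bypass").items():
        print(name)
        for key, delta in deltas.items():
            print(f"  {key:18s} {delta:+7.1f}%")
```

Once the "IRQ putting of jobs on + no bypass" and DRM sched baseline
numbers are posted, they can be added as further entries in RUNS without
touching the rest of the script.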
