On Mon, 22 Jun 2026 14:49:49 +0200 Boris Brezillon <[email protected]> wrote:
> On Wed, 20 May 2026 15:15:54 -0700
> Chia-I Wu <[email protected]> wrote:
>
> > > > > I collected
> > > > > some numbers with baseline, with this series, and with patch 9
> > > > > reverted at
> > > > > https://gitlab.freedesktop.org/panfrost/linux/-/work_items/85#note_3481308.
> > > > > Reposting the numbers here for reference
> > > > >
> > > > > | | baseline | entire series | patch 9 reverted |
> > > > > | - | - | - | - |
> > > > > | frag job median | 2.8ms | 2.2ms | 2.2ms |
> > > > > | frag job 95% | 4.5ms | 2.8ms | 2.8ms |
> > > > > | frag job 99% | 4.9ms | 2.8ms | 2.8ms |
> > > > > | panthor-job median | 0.8us | 6.2us | 0.9us |
> > > > > | panthor-job 95% | 1.5us | 16.6us | 1.5us |
> > > > > | panthor-job 99% | 1.6us | 28.0us | 1.8us |
> > > > >
> > > >
> > > > panthor-job rows are the durations of the raw irq handlers, collected
> > > > from irq/irq_handler_{entry,exit}.
> > > >
> > > > frag job rows are the durations from frag jobs, collected from
> > > > gpu_scheduler/drm_sched_job_{run,done}.
> > > >
> > > > The fence signaling paths of them are
> > > >
> > > > - baseline: raw handler -> rt threaded handler -> wq job -> wq job ->
> > > > fence signal
> > > > - entire series: raw handler -> fence signal
> > > > - patch 9 reverted: raw handler -> rt threaded handler -> fence signal
> > > >
> > >
> > > Just did another set of throughput tests, and I confirm the gains are
> > > noticeable only with patch 9 applied (that's on rk3588, which embeds a
> > > G610, so not the exact same setup). As an example, on
> > > gfxbench/gl_manhattan, I get the following score bump 2391 -> 2457.
> > >
> > > Now I need to set things up to measure latency like you did and make
> > > sure I'm observing the same thing: threaded handlers providing roughly
> > > the same latency as hardirq handlers. If not it probably has to do with
> > > some config options that differ and change the preemptability of the
> > > system.
> > >
> > > I'll hold off on the submission of v3 until this is done, because if
> > > threaded handlers are roughly as efficient as hardirq ones, we probably
> > > want to stick to threaded handlers.
>
> Sorry for the delay, I only got back to this on Friday.
>
> So, I've been using ftrace/function-graph with some noinline added to
> get a sense of where most of the time was spent in the hardirq handler
> after the transition to hardirqs, and unlike what I thought, it's not
> coming from the accesses to uncached mappings of the FW
> interface/syncobjs, but instead the various queue[_delayed]_work()
> and/or wake_up_all() on panthor_fw::req_waitqueue. I don't expect us to
> be able to optimize that anytime soon, so I guess we should just keep
> everything in the threaded handler for now and accept the extra delay
> (assuming 20+ usec for the hardirq handler is too long). This also
> means that a lot of the things I do in this series are moot
> (irqsave/restore, using spinlocks instead of mutexes, ...), but before
> I go and rework that, I'd like to get some feedback from Steve and
> Liviu to make sure this is okay with Arm.

I ended up sending a v3 doing that. I can easily go back to the previous
version if needed.
