On Wed, 20 May 2026 15:15:54 -0700
Chia-I Wu <[email protected]> wrote:

> > > > I collected
> > > > some numbers with baseline, with this series, and with patch 9
> > > > reverted at 
> > > > https://gitlab.freedesktop.org/panfrost/linux/-/work_items/85#note_3481308.
> > > > Reposting the numbers here for reference
> > > >
> > > > |                    | baseline | entire series | patch 9 reverted |
> > > > | -                  | -        | -             | -                |
> > > > | frag job median    | 2.8ms    | 2.2ms         | 2.2ms            |
> > > > | frag job 95%       | 4.5ms    | 2.8ms         | 2.8ms            |
> > > > | frag job 99%       | 4.9ms    | 2.8ms         | 2.8ms            |
> > > > | panthor-job median | 0.8us    | 6.2us         | 0.9us            |
> > > > | panthor-job 95%    | 1.5us    | 16.6us        | 1.5us            |
> > > > | panthor-job 99%    | 1.6us    | 28.0us        | 1.8us            |  
> > >
> > > panthor-job rows are the durations of the raw irq handlers, collected
> > > from irq/irq_handler_{entry,exit}.
> > >
> > > frag job rows are the durations from frag jobs, collected from
> > > gpu_scheduler/drm_sched_job_{run,done}.
> > >
> > > The fence signaling paths of them are
> > >
> > >  - baseline: raw handler -> rt threaded handler -> wq job -> wq job ->
> > > fence signal
> > >  - entire series: raw handler -> fence signal
> > >  - patch 9 reverted: raw handler -> rt threaded handler -> fence signal  
> >
> > Just did another set of throughput tests, and I confirm the gains are
> > noticeable only with patch 9 applied (that's on rk3588, which embeds a
> > G610, so not the exact same setup). As an example, on
> > gfxbench/gl_manhattan, I get the following score bump 2391 -> 2457.
> >
> > Now I need to set things up to measure latency like you did and make
> > sure I'm observing the same thing: threaded handlers providing roughly
> > the same latency as hardirq handlers. If not it probably has to do with
> > some config options that differ and change the preemptability of the
> > system.
> >
> > I'll hold off on the submission of v3 until this is done, because if
> > threaded handlers are roughly as efficient as hardirq ones, we probably
> > want to stick to threaded handlers. 

Sorry for the delay, I only got back to this on Friday.

So, I've been using ftrace/function-graph with some noinline added to
get a sense of where most of the time is spent in the hardirq handler
after the transition to hardirqs. Contrary to what I expected, it's not
coming from the accesses to uncached mappings of the FW
interface/syncobjs, but from the various queue[_delayed]_work() calls
and/or the wake_up_all() on panthor_fw::req_waitqueue. I don't expect
us to be able to optimize that anytime soon, so I guess we should just
keep everything in the threaded handler for now and accept the extra
delay (assuming 20+ usec for the hardirq handler is too long). This also
means that a lot of the things I do in this series are moot
(irqsave/restore, using spinlocks instead of mutexes, ...), but before
I go and rework that, I'd like to get some feedback from Steve and
Liviu to make sure this is okay with Arm.
