On Tue, 19 May 2026 14:04:47 -0700
Chia-I Wu <[email protected]> wrote:

> On Tue, May 19, 2026 at 1:45 PM Chia-I Wu <[email protected]> wrote:
> >
> > On Tue, May 19, 2026 at 11:26 AM Boris Brezillon
> > <[email protected]> wrote:  
> > >
> > > On Tue, 19 May 2026 10:16:26 -0700
> > > Chia-I Wu <[email protected]> wrote:
> > >  
> > > > On Tue, May 19, 2026 at 12:53 AM Boris Brezillon
> > > > <[email protected]> wrote:  
> > > > >
> > > > > On Mon, 18 May 2026 16:33:20 -0700
> > > > > Chia-I Wu <[email protected]> wrote:
> > > > >
> > > > >  
> > > > > > > >
> > > > > > > >  
> > > > > > > > >  
> > > > > > > > > >  
> > > > > > > > > > >         if (!ptdev->scheduler)
> > > > > > > > > > >                 return;
> > > > > > > > > > >
> > > > > > > > > > > -       atomic_or(events, &ptdev->scheduler->fw_events);
> > > > > > > > > > > -       sched_queue_work(ptdev->scheduler, fw_events);
> > > > > > > > > > > +       
> > > > > > > > > > > guard(spinlock_irqsave)(&ptdev->scheduler->events_lock);
> > > > > > > > > > > +
> > > > > > > > > > > +       if (events & JOB_INT_GLOBAL_IF) {
> > > > > > > > > > > +               sched_process_global_irq_locked(ptdev);
> > > > > > > > > > > +               events &= ~JOB_INT_GLOBAL_IF;
> > > > > > > > > > > +       }
> > > > > > > > > > > +
> > > > > > > > > > > +       while (events) {
> > > > > > > > > > > +               u32 csg_id = ffs(events) - 1;
> > > > > > > > > > > +
> > > > > > > > > > > +               sched_process_csg_irq_locked(ptdev, 
> > > > > > > > > > > csg_id);
> > > > > > > > > > > +               events &= ~BIT(csg_id);
> > > > > > > > > > > +       }  
> > > > > > > > > > > This handles all fw events in the irq context. Are there concerns that
> > > > > > > > > > > it may take too long? I might be wrong, but it seems possible to
> > > > > > > > > > > handle only CSG_SYNC_UPDATE and defer the rest as before.
> > > > > > > > >
> > > > > > > > > > I started with just the SYNC_UPDATE processing done in the hard-irq
> > > > > > > > > > context, but after auditing the other stuff done in the handler, I
> > > > > > > > > > realized it's basically just deferring all actual processing to work
> > > > > > > > > > items. Yes, there's the overhead of demuxing the events from the
> > > > > > > > > > ack/req regs, but part of this is already done to get to SYNC_UPDATE
> > > > > > > > > > anyway, so at this point we're probably better off demuxing everything
> > > > > > > > > > and scheduling works for all kind of events.
> > > > > > > > >
> > > > > > > > > > I also compared the perfs between the two approaches (though I didn't
> > > > > > > > > > do as much testing as I did with the new version, so I might have
> > > > > > > > > > missed something), and it didn't seem to matter at all, because the
> > > > > > > > > > interrupts we receive the most are SYNC_UPDATE and IDLE events, and
> > > > > > > > > > those are at the same level.
> > > > > > > > Looking at ftrace irq events, when there is one active csg,
> > > > > > > > panthor-job takes 6us (median) / 17us (95%) / 27us (slowest).
> > > > > > > >
> > > > > > > > > I don't have a good sense if that's considered normal in hardirq. But
> > > > > > > > > if that is ever an issue, and if the majority of the time is spent in
> > > > > > > > > CSG_SYNC_UPDATE anyway, we can always revert the last patch to move
> > > > > > > > > processing to threaded handler.
> > > > > > >
> > > > > > > > Actually, the threaded -> hard transition (patch 9) is where the perf
> > > > > > > > gain is.
> > > > > > hardirq is even more timely for sure. For our use case, the threaded
> > > > > > handler is RT and is also good enough.  
> > > > >
> > > > > > Yeah, true. I forgot you were forcing RT priority on threaded handlers.
> > > > > > Anyway, let's stick to hardirqs for now, and revisit it if it proves to
> > > > > > be too much work done in irq context.
> > > > Just want to clarify that irq_thread calls sched_set_fifo to make the
> > > > task RT. The behavior is universal and is not specific to any
> > > > downstream kernel.  

There's a difference in what RT means depending on whether the system
is configured with PREEMPT or PREEMPT_RT though. But I assume you're
using PREEMPT not PREEMPT_RT.

> > >
> > > Hm, interesting. In my testing, any of the changes before patch 9
> > > didn't make a huge difference in term of perf, patch 9 is where the perf
> > > gains happen. For the record, patch 6 is where we get rid of the
> > > threaded -> work round-trip for job completion/fence signaling, and it
> > > didn't seem to reflect in the benchmark results, but I'll do another
> > > round of tests before posting v3, just to confirm.  
> > We care the most about signaling latency for this series.

Yes, I know. It's just that it also seemed to help throughput, which
I initially checked to make sure we were not regressing performance
significantly by interrupting the system aggressively. I guess the
reason is that, by reducing the latency, we also unblock the job
submitter: if you get signaled early, and jobs tend to be serialized
because of deps, you can submit more.
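
To put rough numbers on the reasoning above, here is a back-of-the-envelope
model. It is an assumption for illustration, not something measured: if jobs
are fully serialized by dependencies, the next job can only start once the
previous fence has signaled, so the submission rate is bounded by
1 / (execution time + signaling latency). The function name and the
parameters are hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical model (not taken from the driver): every job depends on the
 * previous one, so a new job can only start after the previous job's fence
 * has been signaled. The upper bound on the job rate is then
 * 1000 / (exec_ms + signal_latency_ms) jobs per second.
 */
static double serialized_jobs_per_sec(double exec_ms, double signal_latency_ms)
{
	double period_ms = exec_ms + signal_latency_ms;

	/* A zero or negative period makes no sense in this model. */
	assert(period_ms > 0.0);

	return 1000.0 / period_ms;
}
```

With a made-up 2.0ms execution time, cutting the signaling latency from
0.5ms to 0.1ms raises that bound from 400 to about 476 jobs/s, which is the
kind of effect I suspect is behind the throughput change.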

> > I collected
> > some numbers with baseline, with this series, and with patch 9
> > reverted at 
> > https://gitlab.freedesktop.org/panfrost/linux/-/work_items/85#note_3481308.
> > Reposting the numbers here for reference
> >
> > |                    | baseline | entire series | patch 9 reverted |
> > | -                  | -        | -             | -                |
> > | frag job median    | 2.8ms    | 2.2ms         | 2.2ms            |
> > | frag job 95%       | 4.5ms    | 2.8ms         | 2.8ms            |
> > | frag job 99%       | 4.9ms    | 2.8ms         | 2.8ms            |
> > | panthor-job median | 0.8us    | 6.2us         | 0.9us            |
> > | panthor-job 95%    | 1.5us    | 16.6us        | 1.5us            |
> > | panthor-job 99%    | 1.6us    | 28.0us        | 1.8us            |  
> 
> panthor-job rows are the durations of the raw irq handlers, collected
> from irq/irq_handler_{entry,exit}.
> 
> frag job rows are the durations from frag jobs, collected from
> gpu_scheduler/drm_sched_job_{run,done}.
> 
> The fence signaling paths of them are
> 
>  - baseline: raw handler -> rt threaded handler -> wq job -> wq job ->
> fence signal
>  - entire series: raw handler -> fence signal
>  - patch 9 reverted: raw handler -> rt threaded handler -> fence signal

Just did another set of throughput tests, and I confirm the gains are
noticeable only with patch 9 applied (that's on rk3588, which embeds a
G610, so not the exact same setup). As an example, on
gfxbench/gl_manhattan, the score goes from 2391 to 2457.

Now I need to set things up to measure latency like you did and make
sure I'm observing the same thing: threaded handlers providing roughly
the same latency as hardirq handlers. If not, it probably has to do with
some config options that differ and change the preemptibility of the
system.
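
For the post-processing side of that measurement, a minimal sketch, assuming
the per-interrupt durations have already been extracted by subtracting each
irq_handler_entry timestamp from the matching irq_handler_exit one. The
helper names and the nearest-rank method are my assumptions, and not
necessarily how the numbers quoted above were computed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort() comparator for 64-bit durations, written to avoid overflow. */
static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/*
 * Nearest-rank percentile of handler durations (in ns).
 * Sorts the array in place. pct must be in the 1..100 range.
 */
static uint64_t percentile_ns(uint64_t *durations, size_t n, unsigned int pct)
{
	size_t rank;

	assert(n > 0 && pct > 0 && pct <= 100);

	qsort(durations, n, sizeof(*durations), cmp_u64);

	/* 1-based rank = ceil(pct * n / 100) */
	rank = ((size_t)pct * n + 99) / 100;

	return durations[rank - 1];
}
```

Calling percentile_ns() with pct set to 50, 95 and 99 gives the median/95%/99%
rows for a given trace, so the same helper can be used for the hardirq and
the threaded variants.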

I'll hold off on the submission of v3 until this is done, because if
threaded handlers are roughly as efficient as hardirq ones, we probably
want to stick to threaded handlers.
