sched: Convert drm scheduler to use a work queue rather than kthread

John Harrison Thu, 12 Jan 2023 16:40:02 -0800

On 1/11/2023 14:56, Jason Ekstrand wrote:

On Wed, Jan 11, 2023 at 4:32 PM Matthew Brost<matthew.br...@intel.com> wrote:


    On Wed, Jan 11, 2023 at 04:18:01PM -0600, Jason Ekstrand wrote:
    > On Wed, Jan 11, 2023 at 2:50 AM Tvrtko Ursulin <
    > tvrtko.ursu...@linux.intel.com> wrote:
    >
    > >
    [snip]
    > >
    > > Typically is the key here. But I am not sure it is good
    enough. Consider
    > > this example - Intel Flex 170:
    > >
    > >   * Delivers up to 36 streams 1080p60 transcode throughput per
    card.
    > >   * When scaled to 10 cards in a 4U server configuration, it
    can support
    > > up to 360 streams of HEVC/HEVC 1080p60 transcode throughput.
    > >
    >
    > I had a feeling it was going to be media.... 😅
    >

    Yea wondering the media UMD can be rewritten to use less
    xe_engines, it
    is massive rewrite for VM bind + no implicit dependencies so let's
    just
    pile on some more work?

It could probably use fewer than it does today. It currently createsand throws away contexts like crazy, or did last I looked at it. However, the nature of media encode is that it often spreads acrosstwo or three different types of engines. There's not much you can doto change that.

And as per Tvrtko's example, you get media servers that transcode hugenumbers of tiny streams in parallel. Almost no work per frame but 100sof independent streams being run concurrently. That means many 100s ofcontexts all trying to run at 30fps. I recall a specific bug aboutthundering herds - hundreds (thousands?) of waiting threads all beingwoken up at once because some request had completed.

    >
    > > One transcode stream from my experience typically is 3-4 GPU
    contexts
    > > (buffer travels from vcs -> rcs -> vcs, maybe vecs) used from
    a single
    > > CPU thread. 4 contexts * 36 streams = 144 active contexts.
    Multiply by
    > > 60fps = 8640 jobs submitted and completed per second.
    > >
    > > 144 active contexts in the proposed scheme means possibly
    means 144
    > > kernel worker threads spawned (driven by 36 transcode CPU
    threads). (I
    > > don't think the pools would scale down given all are
    constantly pinged
    > > at 60fps.)
    > >
    > > And then each of 144 threads goes to grab the single GuC CT
    mutex. First
    > > threads are being made schedulable, then put to sleep as mutex
    > > contention is hit, then woken again as mutexes are getting
    released,
    > > rinse, repeat.
    > >
    >
    > Why is every submission grabbing the GuC CT mutex? I've not read
    the GuC
    > back-end yet but I was under the impression that most run_job()
    would be
    > just shoving another packet into a ring buffer.  If we have to
    send the GuC
    > a message on the control ring every single time we submit a job,
    that's
    > pretty horrible.
    >

    Run job writes the ring buffer and moves the tail as the first
    step (no
    lock required). Next it needs to tell the GuC the xe_engine LRC
    tail has
    moved, this is done from a single Host to GuC channel which is
    circular
    buffer, the writing of the channel protected by the mutex. There are
    little more nuances too but in practice there is always space in the
    channel so the time mutex needs to held is really, really small
    (check cached credits, write 3 dwords in payload, write 1 dword to
    move
    tail). I also believe mutexes in Linux are hybrid where they spin
    for a
    little bit before sleeping and certainly if there is space in the
    channel we shouldn't sleep mutex contention.

Ok, that makes sense. It's maybe a bit clunky and it'd be nice if wehad some way to batch things up a bit so we only have to poke the GuCchannel once for every batch of things rather than once per job. That's maybe something we can look into as a future improvement; notfundamental.

Generally, though, it sounds like contention could be a real problemif we end up ping-ponging that lock between cores. It's going todepend on how much work it takes to get the next ready thing vs. thecost of that atomic. But, also, anything we do is going topotentially run into contention problems. *shrug* If we were goingto go for one-per-HW-engine, we may as well go one-per-device and thenwe wouldn't need the lock. Off the top of my head, that doesn't soundgreat either but IDK.


    As far as this being horrible, well didn't design the GuC and this how
    it is implemented for KMD based submission. We also have 256 doorbells
    so we wouldn't need a lock but I think are other issues with that
    design
    too which need to be worked out in the Xe2 / Xe3 timeframe.

Yeah, not blaming you. Just surprised, that's all. How does it workfor userspace submission? What would it look like if the kernelemulated userspace submission? Is that even possible?


What are these doorbell things?  How do they play into it?

Basically a bank of MMIO space reserved per 'entity' where a write tothat MMIO space becomes an named interrupt to GuC. You can assign eachdoorbell to a specific GuC context. So writing to that doorbell addressis effectively the same as sending a SCHEDULE_CONTEXT H2G message fromthe KMD for that context. But the advantage is you ring the doorbellfrom user land with no call into the kernel at all. Or from within thekernel, you can do it without needing any locks at all. Problem is, wehave 64K contexts in GuC but only 256 doorbells in the hardware. Less ifusing SRIOV. So the "per 'entity'" part because somewhat questionable asto exactly what the 'entity' is. And hence we just haven't botheredsupporting them in Linux because a) no direct submission from user landyet, and b) as Matthew says entire chain of IOCTL from UMD to kernel toacquiring a lock and sending the H2G has generally been fast enough. Thelatency only becomes an issue for ULLS people but for them, even thedoorbells from user space are too high a latency because that stillpotentially involves the GuC having to do some scheduling and contextswitch type action.


John.

    Also if you see my follow up response Xe is ~33k execs per second with
    the current implementation on a 8 core (or maybe 8 thread) TGL which
    seems to fine to me.

33k exec/sec is about 500/frame which should be fine. 500 is a lot fora single frame. I typically tell game devs to shoot for dozens perframe. The important thing is that it stays low even with hundreds ofmemory objects bound. (Xe should be just fine there.)


--Jason

    Matt

    > --Jason
    >
    >
    > (And yes this backend contention is there regardless of 1:1:1,
    it would
    > > require a different re-design to solve that. But it is just a
    question
    > > whether there are 144 contending threads, or just 6 with the
    thread per
    > > engine class scheme.)
    > >
    > > Then multiply all by 10 for a 4U server use case and you get
    1440 worker
    > > kthreads, yes 10 more CT locks, but contending on how many CPU
    cores?
    > > Just so they can grab a timeslice and maybe content on a mutex
    as the
    > > next step.
    > >
    > > This example is where it would hurt on large systems. Imagine
    only an
    > > even wider media transcode card...
    > >
    > > Second example is only a single engine class used (3d
    desktop?) but with
    > > a bunch of not-runnable jobs queued and waiting on a fence to
    signal.
    > > Implicit or explicit dependencies doesn't matter. Then the
    fence signals
    > > and call backs run. N work items get scheduled, but they all
    submit to
    > > the same HW engine. So we end up with:
    > >
    > >          /-- wi1 --\
    > >         / ..     .. \
    > >   cb --+---  wi.. ---+-- rq1 -- .. -- rqN
    > >         \ ..    ..  /
    > >          \-- wiN --/
    > >
    > >
    > > All that we have achieved is waking up N CPUs to contend on
    the same
    > > lock and effectively insert the job into the same single HW
    queue. I
    > > don't see any positives there.
    > >
    > > This example I think can particularly hurt small / low power
    devices
    > > because of needless waking up of many cores for no benefit.
    Granted, I
    > > don't have a good feel on how common this pattern is in practice.
    > >
    > > >
    > > >     That
    > > >     is the number which drives the maximum number of
    not-runnable jobs
    > > that
    > > >     can become runnable at once, and hence spawn that many
    work items,
    > > and
    > > >     in turn unbound worker threads.
    > > >
    > > >     Several problems there.
    > > >
    > > >     It is fundamentally pointless to have potentially that
    many more
    > > >     threads
    > > >     than the number of CPU cores - it simply creates a
    scheduling storm.
    > > >
    > > >     Unbound workers have no CPU / cache locality either and
    no connection
    > > >     with the CPU scheduler to optimize scheduling patterns.
    This may
    > > matter
    > > >     either on large systems or on small ones. Whereas the
    current design
    > > >     allows for scheduler to notice userspace CPU thread
    keeps waking up
    > > the
    > > >     same drm scheduler kernel thread, and so it can keep
    them on the same
    > > >     CPU, the unbound workers lose that ability and so 2nd
    CPU might be
    > > >     getting woken up from low sleep for every submission.
    > > >
    > > >     Hence, apart from being a bit of a impedance mismatch,
    the proposal
    > > has
    > > >     the potential to change performance and power patterns
    and both large
    > > >     and small machines.
    > > >
    > > >
    > > > Ok, thanks for explaining the issue you're seeing in more
    detail.  Yes,
    > > > deferred kwork does appear to mismatch somewhat with what
    the scheduler
    > > > needs or at least how it's worked in the past.  How much
    impact will
    > > > that mismatch have?  Unclear.
    > > >
    > > >      >      >>> Secondly, it probably demands separate
    workers (not
    > > >     optional),
    > > >      >     otherwise
    > > >      >      >>> behaviour of shared workqueues has either
    the potential
    > > to
    > > >      >     explode number
    > > >      >      >>> kernel threads anyway, or add latency.
    > > >      >      >>>
    > > >      >      >>
    > > >      >      >> Right now the system_unbound_wq is used which
    does have a
    > > >     limit
    > > >      >     on the
    > > >      >      >> number of threads, right? I do have a FIXME
    to allow a
    > > >     worker to be
    > > >      >      >> passed in similar to TDR.
    > > >      >      >>
    > > >      >      >> WRT to latency, the 1:1 ratio could actually
    have lower
    > > >     latency
    > > >      >     as 2 GPU
    > > >      >      >> schedulers can be pushing jobs into the backend /
    > > cleaning up
    > > >      >     jobs in
    > > >      >      >> parallel.
    > > >      >      >>
    > > >      >      >
    > > >      >      > Thought of one more point here where why in Xe we
    > > >     absolutely want
    > > >      >     a 1 to
    > > >      >      > 1 ratio between entity and scheduler - the way
    we implement
    > > >      >     timeslicing
    > > >      >      > for preempt fences.
    > > >      >      >
    > > >      >      > Let me try to explain.
    > > >      >      >
    > > >      >      > Preempt fences are implemented via the generic
    messaging
    > > >      >     interface [1]
    > > >      >      > with suspend / resume messages. If a suspend
    messages is
    > > >     received to
    > > >      >      > soon after calling resume (this is per entity)
    we simply
    > > >     sleep in the
    > > >      >      > suspend call thus giving the entity a
    timeslice. This
    > > >     completely
    > > >      >     falls
    > > >      >      > apart with a many to 1 relationship as now a
    entity
    > > >     waiting for a
    > > >      >      > timeslice blocks the other entities. Could we
    work aroudn
    > > >     this,
    > > >      >     sure but
    > > >      >      > just another bunch of code we'd have to add in
    Xe. Being to
    > > >      >     freely sleep
    > > >      >      > in backend without affecting other entities is
    really,
    > > really
    > > >      >     nice IMO
    > > >      >      > and I bet Xe isn't the only driver that is
    going to feel
    > > >     this way.
    > > >      >      >
    > > >      >      > Last thing I'll say regardless of how anyone
    feels about
    > > >     Xe using
    > > >      >     a 1 to
    > > >      >      > 1 relationship this patch IMO makes sense as I
    hope we can
    > > all
    > > >      >     agree a
    > > >      >      > workqueue scales better than kthreads.
    > > >      >
    > > >      >     I don't know for sure what will scale better and
    for what use
    > > >     case,
    > > >      >     combination of CPU cores vs number of GPU engines
    to keep
    > > >     busy vs other
    > > >      >     system activity. But I wager someone is bound to
    ask for some
    > > >      >     numbers to
    > > >      >     make sure proposal is not negatively affecting
    any other
    > > drivers.
    > > >      >
    > > >      >
    > > >      > Then let them ask.  Waving your hands vaguely in the
    direction of
    > > >     the
    > > >      > rest of DRM and saying "Uh, someone (not me) might
    object" is
    > > >     profoundly
    > > >      > unhelpful.  Sure, someone might. That's why it's on
    dri-devel.
    > > >     If you
    > > >      > think there's someone in particular who might have a
    useful
    > > >     opinion on
    > > >      > this, throw them in the CC so they don't miss the
    e-mail thread.
    > > >      >
    > > >      > Or are you asking for numbers?  If so, what numbers
    are you
    > > >     asking for?
    > > >
    > > >     It was a heads up to the Xe team in case people weren't
    appreciating
    > > >     how
    > > >     the proposed change has the potential influence power
    and performance
    > > >     across the board. And nothing in the follow up
    discussion made me
    > > think
    > > >     it was considered so I don't think it was redundant to
    raise it.
    > > >
    > > >     In my experience it is typical that such core changes
    come with some
    > > >     numbers. Which is in case of drm scheduler is tricky and
    probably
    > > >     requires explicitly asking everyone to test (rather than
    count on
    > > >     "don't
    > > >     miss the email thread"). Real products can fail to ship
    due ten mW
    > > here
    > > >     or there. Like suddenly an extra core prevented from
    getting into
    > > deep
    > > >     sleep.
    > > >
    > > >     If that was "profoundly unhelpful" so be it.
    > > >
    > > >
    > > > With your above explanation, it makes more sense what you're
    asking.
    > > > It's still not something Matt is likely to be able to
    provide on his
    > > > own.  We need to tag some other folks and ask them to test
    it out.  We
    > > > could play around a bit with it on Xe but it's not exactly
    production
    > > > grade yet and is going to hit this differently from most. 
    Likely
    > > > candidates are probably AMD and Freedreno.
    > >
    > > Whoever is setup to check out power and performance would be
    good to
    > > give it a spin, yes.
    > >
    > > PS. I don't think I was asking Matt to test with other
    devices. To start
    > > with I think Xe is a team effort. I was asking for more
    background on
    > > the design decision since patch 4/20 does not say anything on that
    > > angle, nor later in the thread it was IMO sufficiently addressed.
    > >
    > > >      > Also, If we're talking about a design that might
    paint us into an
    > > >      > Intel-HW-specific hole, that would be one thing.  But
    we're not.
    > > >     We're
    > > >      > talking about switching which kernel threading/task
    mechanism to
    > > >     use for
    > > >      > what's really a very generic problem.  The core Xe
    design works
    > > >     without
    > > >      > this patch (just with more kthreads).  If we land
    this patch or
    > > >      > something like it and get it wrong and it causes a
    performance
    > > >     problem
    > > >      > for someone down the line, we can revisit it.
    > > >
    > > >     For some definition of "it works" - I really wouldn't
    suggest
    > > >     shipping a
    > > >     kthread per user context at any point.
    > > >
    > > >
    > > > You have yet to elaborate on why. What resources is it
    consuming that's
    > > > going to be a problem? Are you anticipating CPU affinity
    problems? Or
    > > > does it just seem wasteful?
    > >
    > > Well I don't know, commit message says the approach does not
    scale. :)
    > >
    > > > I think I largely agree that it's probably
    unnecessary/wasteful but
    > > > reducing the number of kthreads seems like a tractable
    problem to solve
    > > > regardless of where we put the gpu_scheduler object.  Is
    this the right
    > > > solution?  Maybe not.  It was also proposed at one point
    that we could
    > > > split the scheduler into two pieces: A scheduler which owns
    the kthread,
    > > > and a back-end which targets some HW ring thing where you
    can have
    > > > multiple back-ends per scheduler.  That's certainly more
    invasive from a
    > > > DRM scheduler internal API PoV but would solve the kthread
    problem in a
    > > > way that's more similar to what we have now.
    > > >
    > > >      >     In any case that's a low level question caused by
    the high
    > > >     level design
    > > >      >     decision. So I'd think first focus on the high
    level - which
    > > >     is the 1:1
    > > >      >     mapping of entity to scheduler instance proposal.
    > > >      >
    > > >      >     Fundamentally it will be up to the DRM
    maintainers and the
    > > >     community to
    > > >      >     bless your approach. And it is important to
    stress 1:1 is
    > > about
    > > >      >     userspace contexts, so I believe unlike any other
    current
    > > >     scheduler
    > > >      >     user. And also important to stress this
    effectively does not
    > > >     make Xe
    > > >      >     _really_ use the scheduler that much.
    > > >      >
    > > >      >
    > > >      > I don't think this makes Xe nearly as much of a
    one-off as you
    > > >     think it
    > > >      > does.  I've already told the Asahi team working on
    Apple M1/2
    > > >     hardware
    > > >      > to do it this way and it seems to be a pretty good
    mapping for
    > > >     them. I
    > > >      > believe this is roughly the plan for nouveau as
    well.  It's not
    > > >     the way
    > > >      > it currently works for anyone because most other
    groups aren't
    > > >     doing FW
    > > >      > scheduling yet.  In the world of FW scheduling and
    hardware
    > > >     designed to
    > > >      > support userspace direct-to-FW submit, I think the
    design makes
    > > >     perfect
    > > >      > sense (see below) and I expect we'll see more drivers
    move in this
    > > >      > direction as those drivers evolve. (AMD is doing some
    customish
    > > >     thing
    > > >      > for how with gpu_scheduler on the front-end somehow.
    I've not dug
    > > >     into
    > > >      > those details.)
    > > >      >
    > > >      >     I can only offer my opinion, which is that the
    two options
    > > >     mentioned in
    > > >      >     this thread (either improve drm scheduler to cope
    with what is
    > > >      >     required,
    > > >      >     or split up the code so you can use just the parts of
    > > >     drm_sched which
    > > >      >     you want - which is frontend dependency tracking)
    shouldn't
    > > be so
    > > >      >     readily dismissed, given how I think the idea was
    for the new
    > > >     driver to
    > > >      >     work less in a silo and more in the community
    (not do kludges
    > > to
    > > >      >     workaround stuff because it is thought to be too
    hard to
    > > >     improve common
    > > >      >     code), but fundamentally, "goto previous
    paragraph" for what
    > > I am
    > > >      >     concerned.
    > > >      >
    > > >      >
    > > >      > Meta comment:  It appears as if you're falling into
    the standard
    > > >     i915
    > > >      > team trap of having an internal discussion about what the
    > > community
    > > >      > discussion might look like instead of actually having the
    > > community
    > > >      > discussion.  If you are seriously concerned about
    interactions
    > > with
    > > >      > other drivers or whether or setting common direction,
    the right
    > > >     way to
    > > >      > do that is to break a patch or two out into a
    separate RFC series
    > > >     and
    > > >      > tag a handful of driver maintainers.  Trying to
    predict the
    > > >     questions
    > > >      > other people might ask is pointless. Cc them and
    asking for their
    > > >     input
    > > >      > instead.
    > > >
    > > >     I don't follow you here. It's not an internal discussion
    - I am
    > > raising
    > > >     my concerns on the design publicly. I am supposed to
    write a patch to
    > > >     show something, but am allowed to comment on a RFC series?
    > > >
    > > >
    > > > I may have misread your tone a bit.  It felt a bit like too many
    > > > discussions I've had in the past where people are trying to
    predict what
    > > > others will say instead of just asking them. Reading it
    again, I was
    > > > probably jumping to conclusions a bit.  Sorry about that.
    > >
    > > Okay no problem, thanks. In any case we don't have to keep
    discussing
    > > it, since I wrote one or two emails ago it is fundamentally on the
    > > maintainers and community to ack the approach. I only felt
    like RFC did
    > > not explain the potential downsides sufficiently so I wanted
    to probe
    > > that area a bit.
    > >
    > > >     It is "drm/sched: Convert drm scheduler to use a work
    queue rather
    > > than
    > > >     kthread" which should have Cc-ed _everyone_ who use drm
    scheduler.
    > > >
    > > >
    > > > Yeah, it probably should have.  I think that's mostly what
    I've been
    > > > trying to say.
    > > >
    > > >      >
    > > >      >     Regards,
    > > >      >
    > > >      >     Tvrtko
    > > >      >
    > > >      >     P.S. And as a related side note, there are more
    areas where
    > > >     drm_sched
    > > >      >     could be improved, like for instance priority
    handling.
    > > >      >     Take a look at msm_submitqueue_create /
    > > >     msm_gpu_convert_priority /
    > > >      >     get_sched_entity to see how msm works around the
    drm_sched
    > > >     hardcoded
    > > >      >     limit of available priority levels, in order to
    avoid having
    > > >     to leave a
    > > >      >     hw capability unused. I suspect msm would be
    happier if they
    > > >     could have
    > > >      >     all priority levels equal in terms of whether
    they apply only
    > > >     at the
    > > >      >     frontend level or completely throughout the pipeline.
    > > >      >
    > > >      >      > [1]
    > > >      >
    > > >
    https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
    <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1>
    > > >   
     <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
    <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1>
    > > >
    > > >      >
    > > >       <
    > >
    https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
    <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1>
    <
    > >
    https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
    <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1>>>
    > > >      >      >
    > > >      >      >>> What would be interesting to learn is
    whether the option
    > > of
    > > >      >     refactoring
    > > >      >      >>> drm_sched to deal with out of order
    completion was
    > > >     considered
    > > >      >     and what were
    > > >      >      >>> the conclusions.
    > > >      >      >>>
    > > >      >      >>
    > > >      >      >> I coded this up a while back when trying to
    convert the
    > > >     i915 to
    > > >      >     the DRM
    > > >      >      >> scheduler it isn't all that hard either. The
    free flow
    > > >     control
    > > >      >     on the
    > > >      >      >> ring (e.g. set job limit == SIZE OF RING /
    MAX JOB SIZE)
    > > is
    > > >      >     really what
    > > >      >      >> sold me on the this design.
    > > >      >
    > > >      >
    > > >      > You're not the only one to suggest supporting
    out-of-order
    > > >     completion.
    > > >      > However, it's tricky and breaks a lot of internal
    assumptions of
    > > the
    > > >      > scheduler. It also reduces functionality a bit
    because it can no
    > > >     longer
    > > >      > automatically rate-limit HW/FW queues which are often
    > > >     fixed-size.  (Ok,
    > > >      > yes, it probably could but it becomes a substantially
    harder
    > > >     problem.)
    > > >      >
    > > >      > It also seems like a worse mapping to me.  The goal
    here is to
    > > turn
    > > >      > submissions on a userspace-facing engine/queue into
    submissions
    > > >     to a FW
    > > >      > queue submissions, sorting out any dma_fence
    dependencies.  Matt's
    > > >      > description of saying this is a 1:1 mapping between
    sched/entity
    > > >     doesn't
    > > >      > tell the whole story. It's a 1:1:1 mapping between
    xe_engine,
    > > >      > gpu_scheduler, and GuC FW engine. Why make it a
    1:something:1
    > > >     mapping?
    > > >      > Why is that better?
    > > >
    > > >     As I have stated before, what I think what would fit
    well for Xe is
    > > one
    > > >     drm_scheduler per engine class. In specific terms on our
    current
    > > >     hardware, one drm scheduler instance for render,
    compute, blitter,
    > > >     video
    > > >     and video enhance. Userspace contexts remain scheduler
    entities.
    > > >
    > > >
    > > > And this is where we fairly strongly disagree.  More in a bit.
    > > >
    > > >     That way you avoid the whole kthread/kworker story and
    you have it
    > > >     actually use the entity picking code in the scheduler,
    which may be
    > > >     useful when the backend is congested.
    > > >
    > > >
    > > > What back-end congestion are you referring to here?  Running
    out of FW
    > > > queue IDs?  Something else?
    > >
    > > CT channel, number of context ids.
    > >
    > > >
    > > >     Yes you have to solve the out of order problem so in my
    mind that is
    > > >     something to discuss. What the problem actually is (just
    TDR?), how
    > > >     tricky and why etc.
    > > >
    > > >     And yes you lose the handy LRCA ring buffer size
    management so you'd
    > > >     have to make those entities not runnable in some other way.
    > > >
    > > >     Regarding the argument you raise below - would any of
    that make the
    > > >     frontend / backend separation worse and why? Do you
    think it is less
    > > >     natural? If neither is true then all remains is that it
    appears extra
    > > >     work to support out of order completion of entities has been
    > > discounted
    > > >     in favour of an easy but IMO inelegant option.
    > > >
    > > >
    > > > Broadly speaking, the kernel needs to stop thinking about
    GPU scheduling
    > > > in terms of scheduling jobs and start thinking in terms of
    scheduling
    > > > contexts/engines.  There is still some need for scheduling
    individual
    > > > jobs but that is only for the purpose of delaying them as
    needed to
    > > > resolve dma_fence dependencies.  Once dependencies are
    resolved, they
    > > > get shoved onto the context/engine queue and from there the
    kernel only
    > > > really manages whole contexts/engines.  This is a major
    architectural
    > > > shift, entirely different from the way i915 scheduling
    works.  It's also
    > > > different from the historical usage of DRM scheduler which I
    think is
    > > > why this all looks a bit funny.
    > > >
    > > > To justify this architectural shift, let's look at where
    we're headed.
    > > > In the glorious future...
    > > >
    > > >   1. Userspace submits directly to firmware queues.  The
    kernel has no
    > > > visibility whatsoever into individual jobs. At most it can
    pause/resume
    > > > FW contexts as needed to handle eviction and memory management.
    > > >
    > > >   2. Because of 1, apart from handing out the FW queue IDs
    at the
    > > > beginning, the kernel can't really juggle them that much. 
    Depending on
    > > > FW design, it may be able to pause a client, give its IDs to
    another,
    > > > and then resume it later when IDs free up. What it's not
    doing is
    > > > juggling IDs on a job-by-job basis like i915 currently is.
    > > >
    > > >   3. Long-running compute jobs may not complete for days. 
    This means
    > > > that memory management needs to happen in terms of
    pause/resume of
    > > > entire contexts/engines using the memory rather than based
    on waiting
    > > > for individual jobs to complete or pausing individual jobs
    until the
    > > > memory is available.
    > > >
    > > >   4. Synchronization happens via userspace memory fences
    (UMF) and the
    > > > kernel is mostly unaware of most dependencies and when a
    context/engine
    > > > is or is not runnable.  Instead, it keeps as many of them
    minimally
    > > > active (memory is available, even if it's in system RAM) as
    possible and
    > > > lets the FW sort out dependencies.  (There may need to be
    some facility
    > > > for sleeping a context until a memory change similar to
    futex() or
    > > > poll() for userspace threads.  There are some details TBD.)
    > > >
    > > > Are there potential problems that will need to be solved
    here?  Yes.  Is
    > > > it a good design?  Well, Microsoft has been living in this
    future for
    > > > half a decade or better and it's working quite well for
    them.  It's also
    > > > the way all modern game consoles work.  It really is just
    Linux that's
    > > > stuck with the same old job model we've had since the
    monumental shift
    > > > to DRI2.
    > > >
    > > > To that end, one of the core goals of the Xe project was to
    make the
    > > > driver internally behave as close to the above model as
    possible while
    > > > keeping the old-school job model as a very thin layer on
    top.  As the
    > > > broader ecosystem problems (window-system support for UMF,
    for instance)
    > > > are solved, that layer can be peeled back. The core driver
    will already
    > > > be ready for it.
    > > >
    > > > To that end, the point of the DRM scheduler in Xe isn't to
    schedule
    > > > jobs.  It's to resolve syncobj and dma-buf implicit sync
    dependencies
    > > > and stuff jobs into their respective context/engine queue
    once they're
    > > > ready.  All the actual scheduling happens in firmware and
    any scheduling
    > > > the kernel does to deal with contention, oversubscriptions,
    too many
    > > > contexts, etc. is between contexts/engines, not individual
    jobs.  Sure,
    > > > the individual job visibility is nice, but if we design
    around it, we'll
    > > > never get to the glorious future.
    > > >
    > > > I really need to turn the above (with a bit more detail)
    into a blog
    > > > post.... Maybe I'll do that this week.
    > > >
    > > > In any case, I hope that provides more insight into why Xe
    is designed
    > > > the way it is and why I'm pushing back so hard on trying to
    make it more
    > > > of a "classic" driver as far as scheduling is concerned. 
    Are there
    > > > potential problems here?  Yes, that's why Xe has been labeled a
    > > > prototype.  Are such radical changes necessary to get to
    said glorious
    > > > future?  Yes, I think they are.  Will it be worth it?  I
    believe so.
    > >
    > > Right, that's all solid I think. My takeaway is that frontend
    priority
    > > sorting and that stuff isn't needed and that is okay. And that
    there are
    > > multiple options to maybe improve drm scheduler, like the fore
    mentioned
    > > making it deal with out of order, or split into functional
    components,
    > > or split frontend/backend what you suggested. For most of them
    cost vs
    > > benefit is more or less not completely clear, neither how much
    effort
    > > was invested to look into them.
    > >
    > > One thing I missed from this explanation is how drm_scheduler
    per engine
    > > class interferes with the high level concepts. And I did not
    manage to
    > > pick up on what exactly is the TDR problem in that case. Maybe
    the two
    > > are one and the same.
    > >
    > > Bottom line is I still have the concern that conversion to
    kworkers has
    > > an opportunity to regress. Possibly more opportunity for some
    Xe use
    > > cases than to affect other vendors, since they would still be
    using per
    > > physical engine / queue scheduler instances.
    > >
    > > And to put my money where my mouth is I will try to put testing Xe
    > > inside the full blown ChromeOS environment in my team plans.
    It would
    > > probably also be beneficial if Xe team could take a look at
    real world
    > > behaviour of the extreme transcode use cases too. If the stack
    is ready
    > > for that and all. It would be better to know earlier rather
    than later
    > > if there is a fundamental issue.
    > >
    > > For the patch at hand, and the cover letter, it certainly
    feels it would
    > > benefit to record the past design discussion had with AMD
    folks, to
    > > explicitly copy other drivers, and to record the theoretical
    pros and
    > > cons of threads vs unbound workers as I have tried to
    highlight them.
    > >
    > > Regards,
    > >
    > > Tvrtko
    > >

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

Reply via email to