On Tue, May 20, 2025 at 12:54 PM Danilo Krummrich <d...@kernel.org> wrote:
>
> On Tue, May 20, 2025 at 09:07:05AM -0700, Rob Clark wrote:
> > On Tue, May 20, 2025 at 12:06 AM Danilo Krummrich <d...@kernel.org> wrote:
> > >
> > > On Thu, May 15, 2025 at 12:56:38PM -0700, Rob Clark wrote:
> > > > On Thu, May 15, 2025 at 11:56 AM Danilo Krummrich <d...@kernel.org> wrote:
> > > > >
> > > > > On Thu, May 15, 2025 at 10:40:15AM -0700, Rob Clark wrote:
> > > > > > On Thu, May 15, 2025 at 10:30 AM Danilo Krummrich <d...@kernel.org> wrote:
> > > > > > >
> > > > > > > (Cc: Boris)
> > > > > > >
> > > > > > > On Thu, May 15, 2025 at 12:22:18PM -0400, Connor Abbott wrote:
> > > > > > > > For some context, other drivers have the concept of a "synchronous"
> > > > > > > > VM_BIND ioctl which completes immediately, and drivers implement it
> > > > > > > > by waiting for the whole thing to finish before returning.
> > > > > > >
> > > > > > > Nouveau implements sync by issuing a normal async VM_BIND and
> > > > > > > subsequently waiting for the out-fence synchronously.
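
[For reference, that sync-on-top-of-async pattern looks roughly like the
sketch below. submit_async_vm_bind() and struct bind_op are hypothetical
placeholders for a driver-specific async VM_BIND wrapper, not actual uapi;
drmSyncobjWait() is the real libdrm syncobj wait call.]

    #include <stdint.h>
    #include <xf86drm.h>

    struct bind_op;                                      /* hypothetical */
    int submit_async_vm_bind(int fd, struct bind_op *op,
                             uint32_t out_syncobj);      /* hypothetical */

    /* Emulate a synchronous bind on top of the async path: submit the
     * async VM_BIND with an out-fence syncobj, then wait on it.
     */
    int vm_bind_sync(int fd, struct bind_op *op, uint32_t out_syncobj)
    {
            int ret = submit_async_vm_bind(fd, op, out_syncobj);
            if (ret)
                    return ret;

            /* Block until the bind job's out-fence signals. */
            return drmSyncobjWait(fd, &out_syncobj, 1, INT64_MAX,
                                  DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL, NULL);
    }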
> > > > > >
> > > > > > As Connor mentioned, we'd prefer it to be async rather than blocking
> > > > > > in normal cases; otherwise, with drm native context for using native
> > > > > > UMD in a guest VM, you'd be blocking the single host/VMM virglrender
> > > > > > thread.
> > > > > >
> > > > > > The key is we want to keep it async in the normal cases, and not have
> > > > > > weird edge case CTS tests blow up from being _too_ async ;-)
> > > > >
> > > > > I really wonder why they don't blow up in Nouveau, which also supports
> > > > > full asynchronous VM_BIND. Mind sharing which tests blow up? :)
> > > >
> > > > Maybe it was 
> > > > dEQP-VK.sparse_resources.buffer.ssbo.sparse_residency.buffer_size_2_24,
> > >
> > > The test above is part of the smoke testing I do for nouveau, but I
> > > haven't seen such issues yet for nouveau.
> >
> > nouveau is probably not using async binds for everything?  Or maybe
> > I'm just pointing to the wrong test.
>
> Let me double check later on.
>
> > > > but I might be mixing that up, I'd have to back out this patch and see
> > > > where things blow up, which would take many hours.
> > >
> > > Well, you said that you never had this issue with "real" workloads, but
> > > only with VK CTS, so I really think we should know what we are trying to
> > > fix here.
> > >
> > > We can't just add new generic infrastructure without reasonable and
> > > *well understood* justification.
> >
> > What is not well understood about this?  We need to pre-allocate
> > pagetable memory that we likely won't end up needing.
> >
> > In the worst case, with a large # of async PAGE_SIZE binds, you end up
> > needing to pre-allocate 3 pgtable pages (for a 4 lvl pgtable) per single
> > page of mapping.  Queue up enough of those and you can explode your
> > memory usage.
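
[To put a rough number on that worst case, a minimal back-of-the-envelope
sketch, assuming 4K pages and ~5k queued single-page binds; the job count is
illustrative, taken from the >5k figure mentioned further down:]

    #include <stdio.h>

    int main(void)
    {
            const unsigned long page_size      = 4096; /* 4K pages */
            const unsigned long tables_per_map = 3;    /* 4-lvl pgtable, root exists */
            const unsigned long queued_binds   = 5000; /* ~5k single-page binds */

            /* worst case: every bind needs a fresh table at each non-root level */
            unsigned long prealloc = queued_binds * tables_per_map * page_size;

            printf("worst-case pre-alloc: ~%lu MiB\n", prealloc >> 20); /* ~58 MiB */
            return 0;
    }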
>
> Well, the general principle of how this can OOM is well understood, sure.
> What's not well understood is how we run into this case. I think we should
> also understand what test causes the issue and why other drivers are not
> affected (yet).

Once again, it's well understood why other drivers aren't affected.
They have both synchronous and asynchronous VM_BINDs in the uabi, and
the userspace driver uses synchronous VM_BIND for everything except
sparse mappings. For freedreno we tried to change that, because async
works better for native context; with more mappings going through the
async path, that exposed the pre-existing issue of async VM_BINDs
hanging the whole system when we run out of memory.

I think it would be possible in theory for other drivers to forward
synchronous VM_BINDs asynchronously to the host as long as the host
kernel executes them synchronously, so maybe other drivers won't have
a problem with native context support. But it will still be possible
to make them fall over if you poke them the right way.

Connor

>
> > > > There definitely was one where I was seeing >5k VM_BIND jobs pile up,
> > > > so throttling like this is absolutely needed.
> > >
> > > I still don't understand why the kernel must throttle this? If userspace
> > > uses async VM_BIND, it obviously can't spam the kernel infinitely without
> > > running into an OOM case.
> >
> > It is a valid question whether the kernel or userspace should be
> > the one to do the throttling.
> >
> > I went for doing it in the kernel because the kernel has better
> > knowledge of how much it needs to pre-allocate.
> >
> > (There is also the side point that this pre-allocated memory is not
> > charged to the calling process from a memory accounting PoV.  With
> > that in mind, it seems like a good idea for the kernel to throttle
> > memory usage.)
>
> That's a very valid point; maybe we should investigate in the direction of
> addressing this, rather than trying to work around it in the scheduler,
> where we can only set an arbitrary credit limit.
>
> > > But let's assume we agree that we want to prevent userspace from ever
> > > OOMing itself through async VM_BIND; then the proposed solution seems
> > > wrong:
> > >
> > > Do we really want the driver developer to set an arbitrary boundary on
> > > the number of jobs that can be submitted before *async* VM_BIND blocks
> > > and becomes semi-sync?
> > >
> > > How do we choose this number of jobs? A very small number to be safe,
> > > which scales badly on powerful machines? A large number that scales well
> > > on powerful machines, but OOMs on weaker ones?
> >
> > The way I am using it in msm, the credit amount and limit are in units
> > of pre-allocated pages in-flight.  I set the enqueue_credit_limit to
> > 1024 pages; once there are jobs queued up exceeding that limit, they
> > start blocking.
> >
> > The number of _jobs_ is irrelevant; it is the # of pre-alloc'd pages in
> > flight that matters.
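
[A minimal sketch of that kind of page-based throttle, assuming a waitqueue
plus an atomic counter of in-flight pre-allocated pages; apart from
enqueue_credit_limit, the struct/field/function names here are placeholders
rather than what the actual patch does:]

    #include <linux/atomic.h>
    #include <linux/wait.h>

    /* Placeholder state; in a real patch this would live in (or next to)
     * struct drm_gpu_scheduler.
     */
    struct enqueue_throttle {
            wait_queue_head_t wq;
            atomic_t          in_flight;            /* pre-alloc'd pages in flight */
            unsigned int      enqueue_credit_limit; /* e.g. 1024 pages */
    };

    /* Called before queuing an async VM_BIND job; job_credits is the number
     * of pagetable pages pre-allocated for this job.  (The check and the add
     * are not one atomic step -- good enough for a sketch of the idea.)
     */
    static int throttle_enqueue(struct enqueue_throttle *t, unsigned int job_credits)
    {
            int ret = wait_event_interruptible(t->wq,
                            atomic_read(&t->in_flight) + job_credits <=
                            t->enqueue_credit_limit);
            if (ret)
                    return ret;     /* interrupted by a signal */

            atomic_add(job_credits, &t->in_flight);
            return 0;
    }

    /* Called once the job has run and its unused pre-allocations are returned. */
    static void throttle_retire(struct enqueue_throttle *t, unsigned int job_credits)
    {
            atomic_sub(job_credits, &t->in_flight);
            wake_up_all(&t->wq);
    }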
>
> That doesn't make a difference for my question. How do you know 1024 pages
> is a good value? How do we scale for different machines with different
> capabilities?
>
> If you have a powerful machine with lots of memory, we might throttle
> userspace for no reason, no?
>
> If the machine has very limited resources, it might already be too much?
