On Thu, May 22, 2025 at 4:00 AM Danilo Krummrich <d...@kernel.org> wrote:
>
> On Tue, May 20, 2025 at 10:22:54AM -0700, Rob Clark wrote:
> > On Tue, May 20, 2025 at 9:54 AM Danilo Krummrich <d...@kernel.org> wrote:
> > > On Tue, May 20, 2025 at 09:07:05AM -0700, Rob Clark wrote:
> > > > On Tue, May 20, 2025 at 12:06 AM Danilo Krummrich <d...@kernel.org>
> > > > wrote:
> > > > > But let's assume we agree that we want to avoid that userspace can
> > > > > ever OOM itself through async VM_BIND, then the proposed solution
> > > > > seems wrong:
> > > > >
> > > > > Do we really want the driver developer to set an arbitrary boundary
> > > > > of a number of jobs that can be submitted before *async* VM_BIND
> > > > > blocks and becomes semi-sync?
> > > > >
> > > > > How do we choose this number of jobs? A very small number to be
> > > > > safe, which scales badly on powerful machines? A large number that
> > > > > scales well on powerful machines, but OOMs on weaker ones?
> > > >
> > > > The way I am using it in msm, the credit amount and limit are in
> > > > units of pre-allocated pages in-flight. I set the
> > > > enqueue_credit_limit to 1024 pages, once there are jobs queued up
> > > > exceeding that limit, they start blocking.
> > > >
> > > > The number of _jobs_ is irrelevant, it is # of pre-alloc'd pages in
> > > > flight.
> > >
> > > That doesn't make a difference for my question. How do you know 1024
> > > pages is a good value? How do we scale for different machines with
> > > different capabilities?
> > >
> > > If you have a powerful machine with lots of memory, we might throttle
> > > userspace for no reason, no?
> > >
> > > If the machine has very limited resources, it might already be too much?
> >
> > It may be a bit arbitrary, but then again I'm not sure that userspace
> > is in any better position to pick an appropriate limit.
> >
> > 4MB of in-flight pages isn't going to be too much for anything that is
> > capable enough to run vk, but still allows for a lot of in-flight maps.
>
> Ok, but what about the other way around? What's the performance impact if
> the limit is chosen rather small, but we're running on a very powerful
> machine?
>
> Since you already have the implementation for hardware you have access to,
> can you please check if and how performance degrades when you use a very
> small threshold?
I mean, considering that some drivers (asahi, at least) _only_ implement
synchronous VM_BIND, I guess blocking in extreme cases isn't so bad. But I
think you are overthinking this. 4MB of pagetables is enough to map ~8GB of
buffers. Perhaps drivers would want to set their limit based on the amount
of memory the GPU could map, which might land them on a # larger than 1024,
but still not an order of magnitude more.

I don't really have a good setup for testing games that use this atm;
fex-emu isn't working for me. But I think Connor has a setup with proton
working?

But, flip it around. It is pretty simple to create a test program that
submits a flood of 4k (or whatever your min page size is) VM_BINDs, and
watch how prealloc memory usage blows up. That is really the thing this
patch is trying to protect against.

> Also, I think we should probably put this throttle mechanism in a separate
> component, that just wraps a counter of bytes or rather pages that can be
> increased and decreased through an API and the increase just blocks at a
> certain threshold.

Maybe? I don't see why we need to explicitly define the units for the
credit. That wasn't done for the existing credit mechanism.. which, it
seems like, could also have been implemented externally if you used some
extra fences.

> This component can then be called by a driver from the job submit IOCTL
> and the corresponding place where the pre-allocated memory is actually
> used / freed.
>
> Depending on the driver, this might not necessarily be in the scheduler's
> run_job() callback.
>
> We could call the component something like drm_throttle or
> drm_submit_throttle.

Maybe? This still runs into the same complaint I had about just
implementing this in msm.. it would have to reach in and use the
scheduler's job_scheduled wait-queue. Which, to me at least, seems like
more of an internal detail of how the scheduler works.

BR,
-R
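
PS: just so we're looking at the same shape of thing, below is a rough
sketch of the sort of standalone page-count throttle being discussed. To be
clear, this is purely illustrative -- the names and API are made up, it is
not code from this series, from drm_sched, or from msm. The idea is just:
the submit path charges pre-allocated pages and blocks past a threshold,
and the driver uncharges wherever that memory actually gets consumed or
freed.

/*
 * Hypothetical sketch only -- not code from this series, drm_sched, or msm.
 * A standalone "pages in flight" throttle: the submit path charges the
 * number of pre-allocated pages and blocks once a threshold is exceeded;
 * the driver uncharges wherever the pre-allocated memory is actually
 * consumed or freed.
 */
#include <linux/atomic.h>
#include <linux/wait.h>

struct submit_throttle {
	long limit;              /* max pre-allocated pages in flight */
	atomic_long_t charged;   /* pages currently charged */
	wait_queue_head_t wq;    /* submitters blocked over the limit */
};

static void submit_throttle_init(struct submit_throttle *t, long limit)
{
	t->limit = limit;
	atomic_long_set(&t->charged, 0);
	init_waitqueue_head(&t->wq);
}

/* Submit path (e.g. the VM_BIND ioctl): charge and possibly block. */
static int submit_throttle_charge(struct submit_throttle *t, long pages)
{
	int ret;

	/* Charge optimistically, then wait until we are back under the limit. */
	atomic_long_add(pages, &t->charged);
	ret = wait_event_interruptible(t->wq,
			atomic_long_read(&t->charged) <= t->limit);
	if (ret)	/* interrupted by a signal: undo the charge */
		atomic_long_sub(pages, &t->charged);
	return ret;
}

/* Completion path: the pre-allocated pages were used or freed. */
static void submit_throttle_uncharge(struct submit_throttle *t, long pages)
{
	if (atomic_long_sub_return(pages, &t->charged) <= t->limit)
		wake_up_all(&t->wq);
}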