On Mon, 14 Apr 2025 16:34:47 +0200
Simona Vetter <simona.vet...@ffwll.ch> wrote:
> On Mon, Apr 14, 2025 at 02:08:25PM +0100, Liviu Dudau wrote:
> > On Mon, Apr 14, 2025 at 01:22:06PM +0200, Boris Brezillon wrote:
> > > Hi Sima,
> > >
> > > On Fri, 11 Apr 2025 14:01:16 +0200
> > > Simona Vetter <simona.vet...@ffwll.ch> wrote:
> > >
> > > > On Thu, Apr 10, 2025 at 08:41:55PM +0200, Boris Brezillon wrote:
> > > > > On Thu, 10 Apr 2025 14:01:03 -0400
> > > > > Alyssa Rosenzweig <aly...@rosenzweig.io> wrote:
> > > > >
> > > > > > > > > In Panfrost and Lima, we don't have this concept of
> > > > > > > > > "incremental rendering", so when we fail the
> > > > > > > > > allocation, we just fail the GPU job with an
> > > > > > > > > unhandled GPU fault.
> > > > > > > >
> > > > > > > > To be honest I think that this is enough to mark those
> > > > > > > > two drivers as broken. It's documented that this
> > > > > > > > approach is a no-go for upstream drivers.
> > > > > > > >
> > > > > > > > How widely is that used?
> > > > > > >
> > > > > > > It exists in lima and panfrost, and I wouldn't be
> > > > > > > surprised if a similar mechanism was used in other
> > > > > > > drivers for tiler-based GPUs (etnaviv, freedreno,
> > > > > > > powervr, ...), because ultimately that's how tilers
> > > > > > > work: the amount of memory needed to store per-tile
> > > > > > > primitives (and metadata) depends on what the geometry
> > > > > > > pipeline feeds the tiler with, and that can't be
> > > > > > > predicted. If you over-provision, that's memory the
> > > > > > > system won't be able to use while rendering takes place,
> > > > > > > even though only a small portion might actually be used
> > > > > > > by the GPU. If your allocation is too small, it will
> > > > > > > either trigger a GPU fault (for HW not supporting an
> > > > > > > "incremental rendering" mode) or under-perform (because
> > > > > > > flushing primitives has a huge cost on tilers).
> > > > > >
> > > > > > Yes and no.
> > > > > >
> > > > > > Although we can't allocate more memory for /this/ frame,
> > > > > > we know the required size is probably constant across its
> > > > > > lifetime. That gives a simple heuristic to manage the
> > > > > > tiler heap efficiently without allocations - even fallible
> > > > > > ones - in the fence signal path:
> > > > > >
> > > > > > * Start with a small fixed size tiler heap
> > > > > > * Try to render, let incremental rendering kick in when
> > > > > >   it's too small.
> > > > > > * When cleaning up the job, check if we used incremental
> > > > > >   rendering.
> > > > > > * If we did - double the size of the heap the next time we
> > > > > >   submit work.
> > > > > >
> > > > > > The tiler heap still grows dynamically - it just does so
> > > > > > over the span of a couple frames. In practice that means a
> > > > > > tiny hit to startup time as we dynamically figure out the
> > > > > > right size, incurring extra flushing at the start, without
> > > > > > needing any "grow-on-page-fault" heroics.
> > > > > >
> > > > > > This should solve the problem completely for CSF/panthor.
> > > > > > So it's only hardware that architecturally cannot do
> > > > > > incremental rendering (older Mali: panfrost/lima) where we
> > > > > > need this mess.
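For illustration, the heuristic above boils down to something like
this on the submission side (a minimal sketch; all names are made up
for illustration, this is not actual panvk/panthor code):

#include <stdbool.h>
#include <stddef.h>

#define TILER_HEAP_MIN (2u << 20)    /* small starting point: 2 MiB */
#define TILER_HEAP_MAX (128u << 20)  /* arbitrary cap */

struct tiler_heap {
        size_t size;          /* heap size for the next submission */
        bool incremental_hit; /* last job fell back to incremental
                               * rendering because the heap was too
                               * small */
};

/* Job cleanup: record whether the HW had to flush primitives early. */
static void tiler_heap_job_done(struct tiler_heap *h, bool used_incremental)
{
        h->incremental_hit = used_incremental;
}

/* Next submission: double the heap while we keep hitting the
 * incremental-rendering fallback, so the size converges over a few
 * frames without any allocation in the fence-signalling path. */
static size_t tiler_heap_next_size(struct tiler_heap *h)
{
        if (h->incremental_hit && h->size < TILER_HEAP_MAX) {
                h->size *= 2;
                if (h->size > TILER_HEAP_MAX)
                        h->size = TILER_HEAP_MAX;
                h->incremental_hit = false;
        }
        return h->size;
}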
> > > > >
> > > > > OTOH, if we need something for
> > > > > Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use
> > > > > the same thing for CSF, since CSF is arguably the sanest of
> > > > > all the HW architectures listed above: allocation can
> > > > > fail/be non-blocking, because there's a fallback to
> > > > > incremental rendering when it fails.
> > > >
> > > > So this is a really horrible idea to sort this out for
> > > > panfrost hardware, which doesn't have a tiler cache flush as a
> > > > fallback. It's roughly three stages:
> > > >
> > > > 1. A pile of clever tricks to make the chances of running out
> > > > of memory really low. Most of these also make sense for
> > > > panthor platforms, just as a performance optimization.
> > > >
> > > > 2. A terrible way to handle the unavoidable VK_DEVICE_LOST,
> > > > but in a way such that the impact should be minimal. This is
> > > > nasty, and we really want to avoid that for panthor.
> > > >
> > > > 3. Mesa quirks so that 2 doesn't actually ever happen in
> > > > practice.
> > > >
> > > > 1. Clever tricks
> > > > ----------------
> > > >
> > > > This is a cascade of tricks we can pull in the gpu fault
> > > > handler:
> > > >
> > > > 1a. Allocate with GFP_NORECLAIM. We want this first because
> > > > that triggers background reclaim, and that might be enough to
> > > > get us through and free some easy caches (like clean fs cache
> > > > and stuff like that which can just be dropped).
> > >
> > > There's no GFP_NORECLAIM, and given the discussions we had
> > > before, I guess you meant GFP_NOWAIT. Otherwise it's the
> > > __GFP_NOWARN | __GFP_NORETRY I used in this series, and it
> > > probably doesn't try hard enough, as pointed out by you and
> > > Christian.
> > >
> > > > 1b. Userspace needs to come up with a good guess for how much
> > > > we'll need. I'm hoping that between render target size and
> > > > maybe counting the total amount of vertices we can do a decent
> > > > guesstimate for many workloads.
> > >
> > > There are extra parameters to take into account, like the tile
> > > hierarchy mask (number of binning lists instantiated) and
> > > probably other things I forget, but for simple vertex+fragment
> > > pipelines and direct draws, guessing the worst-case memory usage
> > > is probably doable. Throw indirect draws into the mix, and it
> > > suddenly becomes a lot more complicated. Not even talking about
> > > GEOM/TESS stages, which make the guessing even harder AFAICT.
> > >
> > > > Note that the goal here is not to ensure success, but just to
> > > > get the rough ballpark. The actual starting number here should
> > > > aim fairly low, so that we avoid wasting memory, since this is
> > > > memory wasted on every context (that uses a feature which
> > > > needs dynamic memory allocation, which I guess for pan* is
> > > > everything, but for apple it would be more limited).
> > >
> > > Ack.
> > >
> > > > 1c. The kernel then keeps an additional global memory pool.
> > > > Note this would not have the same semantics as mempool.h,
> > > > which is aimed at GFP_NOIO forward-progress guarantees, but is
> > > > more of a preallocation pool. In every CS ioctl we'll make
> > > > sure the pool is filled, and we probably want to size the pool
> > > > relative to the context with the biggest dynamic memory usage.
> > > > So probably this thing needs a shrinker, so we can reclaim it
> > > > when you don't run an app with a huge buffer need on the gpu
> > > > anymore.
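As a rough sketch of how 1a and 1c could combine in the fault handler
(hypothetical names, not actual panthor code; the pool would be
refilled with GFP_KERNEL in the CS ioctl, outside the pool lock):

#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

/* Hypothetical global preallocation pool, refilled with GFP_KERNEL
 * in the CS ioctl where sleeping is allowed. */
struct heap_prealloc_pool {
        spinlock_t lock;
        struct list_head pages;
        unsigned long nr_pages;
};

static struct page *heap_grow_alloc_page(struct heap_prealloc_pool *pool)
{
        struct page *p;

        /*
         * 1a. Non-blocking attempt first: GFP_NOWAIT wakes kswapd
         * for background reclaim but never sleeps, which is the most
         * we can afford in the fence-signalling path.
         */
        p = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
        if (p)
                return p;

        /*
         * 1c. Fall back to the preallocation pool. Note this is not
         * a mempool.h mempool: there is no forward-progress
         * guarantee, it is just memory set aside ahead of time.
         */
        spin_lock(&pool->lock);
        p = list_first_entry_or_null(&pool->pages, struct page, lru);
        if (p) {
                list_del(&p->lru);
                pool->nr_pages--;
        }
        spin_unlock(&pool->lock);

        /* NULL here means moving on to the next trick (1d), or
         * ultimately failing the job. */
        return p;
}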
> > >
> > > Okay, that's a technique Arm has been using in their downstream
> > > driver (it's named JIT allocation there).
> > >
> > > > Note that we're still not sizing this to guarantee success,
> > > > but together with the userspace heuristics it should be big
> > > > enough to almost always work out. And since it's a global
> > > > reserve we can afford to waste a bit more memory on this one.
> > > > We might also want to scale this pool by the total memory
> > > > available, like the watermarks core mm computes. We'll only
> > > > hang onto this memory when the gpu is in active usage, so this
> > > > should be fine.
> > >
> > > Sounds like a good idea.
> > >
> > > > Also the preallocation would need to happen without holding
> > > > the memory pool lock, so that we can use GFP_KERNEL.
> > > >
> > > > Up to this point I think it's all tricks that panthor also
> > > > wants to employ.
> > > >
> > > > 1d. Next up is scratch dynamic memory. If we can assume that
> > > > the memory does not need to survive a batchbuffer (hopefully
> > > > the case with vulkan render passes) we could steal such memory
> > > > from other contexts. We could even do that for contexts which
> > > > are queued but not yet running on the hardware (might need
> > > > unloading them to be doable with fw renderers like
> > > > panthor/CSF) as long as we keep such stolen dynamic memory on
> > > > a separate free list. Because if we'd entirely free this, or
> > > > release it into the memory pool, we'll make things worse for
> > > > these other contexts; we need to be able to guarantee that any
> > > > context can always get all the stolen dynamic pages back
> > > > before we start running it on the gpu.
> > >
> > > Actually, CSF stands in the way of re-allocating memory to other
> > > contexts, because once we've allocated memory to a tiler heap,
> > > the FW manages this pool of chunks and recycles them. Mesa can
> > > intercept the "returned chunks" and collect those chunks instead
> > > of re-assigning them to the tiler heap through a CS instruction
> > > (which goes through the FW internally), but that involves extra
> > > collaboration between the UMD, KMD and FW which we don't have at
> > > the moment. Not saying never, but I'd rather fix things
> > > gradually (first the blocking alloc in the fence-signalling
> > > path, then the optimization to share the extra mem reservation
> > > cost among contexts by returning the chunks to the global kernel
> > > pool rather than directly to the heap).
> >
> > The additional issue with borrowing memory from idle contexts is
> > that it will involve MMU operations, as we will have to move the
> > memory into the active context's address space. CSF GPUs have a
> > limitation that they can only work with one address space for the
> > active job when it comes to memory used internally by the job, so
> > we either have to map the scratch dynamic memory in all the jobs
> > before we submit them, or we will have to do MMU maintenance
> > operations in the OOM path in order to borrow memory from other
> > contexts.
>
> Hm, this could be tricky. So mmu operations shouldn't be an issue,
> because they must work for GFP_NOFS contexts for i/o writeback. You
> might need to manage this much more carefully and make sure the
> iommu has a big enough range of pagetables preallocated. This also
> holds for the other games we're playing here, at least for gpu
> pagetables.
> But since pagetables are really small overhead, it might be good to
> somewhat aggressively preallocate them.
>
> But yeah this is possibly an issue if you need iommu wrangling, I
> have honestly not looked at the exact rules in there.

We already have a mechanism to pre-allocate page tables in panthor
(for async VM_BIND requests where we're not allowed to allocate in
the run_job() path), but as said before, I probably won't try this
global mem pool thing on Panthor, since Panthor can do without it for
now. The page table pre-allocation mechanism is something we can
easily transpose to panfrost though.
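The idea behind that kind of pre-allocation can be sketched as
follows (illustrative only, not the actual panthor code): compute the
worst-case number of page-table pages a VA range can require,
allocate them with GFP_KERNEL at bind time, and let the page-table
walker draw from that per-VM cache in the non-blocking path.

/* Worst-case page-table page count for mapping [va, va + size) with
 * a 4-level, 4K-granule format: each non-root level needs at most
 * one table per aligned span it covers (the root always exists).
 * Numbers and names are illustrative. */

#define PTES_PER_TABLE 512ULL /* 4K table / 8-byte descriptors */
#define GPU_PAGE_SIZE  4096ULL

static unsigned int pt_pages_needed(unsigned long long va,
                                    unsigned long long size)
{
        /* VA covered by one last-level table: 512 * 4K = 2M */
        unsigned long long span = PTES_PER_TABLE * GPU_PAGE_SIZE;
        unsigned int pages = 0;
        int lvl;

        for (lvl = 0; lvl < 3; lvl++) {
                unsigned long long first = va / span;
                unsigned long long last = (va + size - 1) / span;

                pages += last - first + 1;
                span *= PTES_PER_TABLE;
        }
        return pages;
}

Since page tables are a tiny fraction of the memory they map (one 4K
table per 2M of VA at the last level), over-provisioning here is
cheap, which is what makes the aggressive preallocation above viable.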