On Mon, 14 Apr 2025 16:34:47 +0200
Simona Vetter <simona.vet...@ffwll.ch> wrote:
> On Mon, Apr 14, 2025 at 02:08:25PM +0100, Liviu Dudau wrote:
> > On Mon, Apr 14, 2025 at 01:22:06PM +0200, Boris Brezillon wrote:
> > > Hi Sima,
> > >
> > > On Fri, 11 Apr 2025 14:01:16 +0200
> > > Simona Vetter <simona.vet...@ffwll.ch> wrote:
> > >
> > > > On Thu, Apr 10, 2025 at 08:41:55PM +0200, Boris Brezillon wrote:
> > > > > On Thu, 10 Apr 2025 14:01:03 -0400
> > > > > Alyssa Rosenzweig <aly...@rosenzweig.io> wrote:
> > > > >
> > > > > > > > > In Panfrost and Lima, we don't have this concept of
> > > > > > > > > "incremental rendering", so when we fail the
> > > > > > > > > allocation, we just fail the GPU job with an
> > > > > > > > > unhandled GPU fault.
> > > > > > > >
> > > > > > > > To be honest I think that this is enough to mark those
> > > > > > > > two drivers as broken. It's documented that this
> > > > > > > > approach is a no-go for upstream drivers.
> > > > > > > >
> > > > > > > > How widely is that used?
> > > > > > >
> > > > > > > It exists in lima and panfrost, and I wouldn't be
> > > > > > > surprised if a similar mechanism was used in other
> > > > > > > drivers for tiler-based GPUs (etnaviv, freedreno,
> > > > > > > powervr, ...), because ultimately that's how tilers
> > > > > > > work: the amount of memory needed to store per-tile
> > > > > > > primitives (and metadata) depends on what the geometry
> > > > > > > pipeline feeds the tiler with, and that can't be
> > > > > > > predicted. If you over-provision, that's memory the
> > > > > > > system won't be able to use while rendering takes place,
> > > > > > > even though only a small portion might actually be used
> > > > > > > by the GPU. If your allocation is too small, it will
> > > > > > > either trigger a GPU fault (for HW not supporting an
> > > > > > > "incremental rendering" mode) or under-perform (because
> > > > > > > flushing primitives has a huge cost on tilers).
> > > > > >
> > > > > > Yes and no.
> > > > > >
> > > > > > Although we can't allocate more memory for /this/ frame,
> > > > > > we know the required size is probably constant across its
> > > > > > lifetime. That gives a simple heuristic to manage the
> > > > > > tiler heap efficiently without allocations - even fallible
> > > > > > ones - in the fence signal path:
> > > > > >
> > > > > > * Start with a small fixed size tiler heap
> > > > > > * Try to render, let incremental rendering kick in when
> > > > > >   it's too small.
> > > > > > * When cleaning up the job, check if we used incremental
> > > > > >   rendering.
> > > > > > * If we did - double the size of the heap the next time we
> > > > > >   submit work.
> > > > > >
> > > > > > The tiler heap still grows dynamically - it just does so
> > > > > > over the span of a couple frames. In practice that means a
> > > > > > tiny hit to startup time as we dynamically figure out the
> > > > > > right size, incurring extra flushing at the start, without
> > > > > > needing any "grow-on-page-fault" heroics.
> > > > > >
> > > > > > This should solve the problem completely for CSF/panthor.
> > > > > > So it's only hardware that architecturally cannot do
> > > > > > incremental rendering (older Mali: panfrost/lima) where we
> > > > > > need this mess.
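For illustration, the heuristic above boils down to something like
this on the submission side (a minimal sketch; all names are made up
for illustration, this is not actual panvk/panthor code):

#include <stdbool.h>
#include <stddef.h>

#define TILER_HEAP_MIN (2u << 20)    /* small starting point: 2 MiB */
#define TILER_HEAP_MAX (128u << 20)  /* arbitrary cap */

struct tiler_heap {
        size_t size;          /* heap size for the next submission */
        bool incremental_hit; /* last job fell back to incremental
                               * rendering because the heap was too
                               * small */
};

/* Job cleanup: record whether the HW had to flush primitives early. */
static void tiler_heap_job_done(struct tiler_heap *h, bool used_incremental)
{
        h->incremental_hit = used_incremental;
}

/* Next submission: double the heap while we keep hitting the
 * incremental-rendering fallback, so the size converges over a few
 * frames without any allocation in the fence-signalling path. */
static size_t tiler_heap_next_size(struct tiler_heap *h)
{
        if (h->incremental_hit && h->size < TILER_HEAP_MAX) {
                h->size *= 2;
                if (h->size > TILER_HEAP_MAX)
                        h->size = TILER_HEAP_MAX;
                h->incremental_hit = false;
        }
        return h->size;
}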
> > > > >
> > > > > OTOH, if we need something for
> > > > > Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use
> > > > > the same thing for CSF, since CSF is arguably the sanest of
> > > > > all the HW architectures listed above: allocation can
> > > > > fail/be non-blocking, because there's a fallback to
> > > > > incremental rendering when it fails.
> > > >
> > > > So this is a really horrible idea to sort this out for
> > > > panfrost hardware, which doesn't have a tiler cache flush as a
> > > > fallback. It's roughly three stages:
> > > >
> > > > 1. A pile of clever tricks to make the chances of running out
> > > > of memory really low. Most of these also make sense for
> > > > panthor platforms, just as a performance optimization.
> > > >
> > > > 2. A terrible way to handle the unavoidable VK_DEVICE_LOST,
> > > > but in a way such that the impact should be minimal. This is
> > > > nasty, and we really want to avoid that for panthor.
> > > >
> > > > 3. Mesa quirks so that 2 doesn't actually ever happen in
> > > > practice.
> > > >
> > > > 1. Clever tricks
> > > > ----------------
> > > >
> > > > This is a cascade of tricks we can pull in the gpu fault
> > > > handler:
> > > >
> > > > 1a. Allocate with GFP_NORECLAIM. We want this first because
> > > > that triggers background reclaim, and that might be enough to
> > > > get us through and free some easy caches (like clean fs cache
> > > > and stuff like that which can just be dropped).
> > >
> > > There's no GFP_NORECLAIM, and given the discussions we had
> > > before, I guess you meant GFP_NOWAIT. Otherwise it's the
> > > __GFP_NOWARN | __GFP_NORETRY I used in this series, and it
> > > probably doesn't try hard enough, as pointed out by you and
> > > Christian.
> > >
> > > > 1b. Userspace needs to come up with a good guess for how much
> > > > we'll need. I'm hoping that between render target size and
> > > > maybe counting the total amount of vertices we can do a decent
> > > > guesstimate for many workloads.
> > >
> > > There are extra parameters to take into account, like the tile
> > > hierarchy mask (number of binning lists instantiated) and
> > > probably other things I forget, but for simple vertex+fragment
> > > pipelines and direct draws, guessing the worst-case memory usage
> > > is probably doable. Throw indirect draws into the mix, and it
> > > suddenly becomes a lot more complicated. Not even talking about
> > > GEOM/TESS stages, which make the guessing even harder AFAICT.
> > >
> > > > Note that the goal here is not to ensure success, but just to
> > > > get the rough ballpark. The actual starting number here should
> > > > aim fairly low, so that we avoid wasting memory, since this is
> > > > memory wasted on every context (that uses a feature which
> > > > needs dynamic memory allocation, which I guess for pan* is
> > > > everything, but for apple it would be more limited).
> > >
> > > Ack.
> > >
> > > > 1c. The kernel then keeps an additional global memory pool.
> > > > Note this would not have the same semantics as mempool.h,
> > > > which is aimed at GFP_NOIO forward-progress guarantees, but is
> > > > more of a preallocation pool. In every CS ioctl we'll make
> > > > sure the pool is filled, and we probably want to size the pool
> > > > relative to the context with the biggest dynamic memory usage.
> > > > So probably this thing needs a shrinker, so we can reclaim it
> > > > when you don't run an app with a huge buffer need on the gpu
> > > > anymore.
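As a rough sketch of how 1a and 1c could combine in the fault handler
(hypothetical names, not actual panthor code; the pool would be
refilled with GFP_KERNEL in the CS ioctl, outside the pool lock):

#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

/* Hypothetical global preallocation pool, refilled with GFP_KERNEL
 * in the CS ioctl where sleeping is allowed. */
struct heap_prealloc_pool {
        spinlock_t lock;
        struct list_head pages;
        unsigned long nr_pages;
};

static struct page *heap_grow_alloc_page(struct heap_prealloc_pool *pool)
{
        struct page *p;

        /*
         * 1a. Non-blocking attempt first: GFP_NOWAIT wakes kswapd
         * for background reclaim but never sleeps, which is the most
         * we can afford in the fence-signalling path.
         */
        p = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
        if (p)
                return p;

        /*
         * 1c. Fall back to the preallocation pool. Note this is not
         * a mempool.h mempool: there is no forward-progress
         * guarantee, it is just memory set aside ahead of time.
         */
        spin_lock(&pool->lock);
        p = list_first_entry_or_null(&pool->pages, struct page, lru);
        if (p) {
                list_del(&p->lru);
                pool->nr_pages--;
        }
        spin_unlock(&pool->lock);

        /* NULL here means moving on to the next trick (1d), or
         * ultimately failing the job. */
        return p;
}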
> > >
> > > Okay, that's a technique Arm has been using in their downstream
> > > driver (it's named JIT allocation there).
> > >
> > > > Note that we're still not sizing this to guarantee success,
> > > > but together with the userspace heuristics it should be big
> > > > enough to almost always work out. And since it's a global
> > > > reserve we can afford to waste a bit more memory on this one.
> > > > We might also want to scale this pool by the total memory
> > > > available, like the watermarks core mm computes. We'll only
> > > > hang onto this memory when the gpu is in active usage, so this
> > > > should be fine.
> > >
> > > Sounds like a good idea.
> > >
> > > > Also the preallocation would need to happen without holding
> > > > the memory pool lock, so that we can use GFP_KERNEL.
> > > >
> > > > Up to this point I think it's all tricks that panthor also
> > > > wants to employ.
> > > >
> > > > 1d. Next up is scratch dynamic memory. If we can assume that
> > > > the memory does not need to survive a batchbuffer (hopefully
> > > > the case with vulkan render passes) we could steal such memory
> > > > from other contexts. We could even do that for contexts which
> > > > are queued but not yet running on the hardware (might need
> > > > unloading them to be doable with fw renderers like
> > > > panthor/CSF) as long as we keep such stolen dynamic memory on
> > > > a separate free list. Because if we'd entirely free this, or
> > > > release it into the memory pool, we'll make things worse for
> > > > these other contexts; we need to be able to guarantee that any
> > > > context can always get all the stolen dynamic pages back
> > > > before we start running it on the gpu.
> > >
> > > Actually, CSF stands in the way of re-allocating memory to other
> > > contexts, because once we've allocated memory to a tiler heap,
> > > the FW manages this pool of chunks and recycles them. Mesa can
> > > intercept the "returned chunks" and collect those chunks instead
> > > of re-assigning them to the tiler heap through a CS instruction
> > > (which goes through the FW internally), but that involves extra
> > > collaboration between the UMD, KMD and FW which we don't have at
> > > the moment. Not saying never, but I'd rather fix things
> > > gradually (first the blocking alloc in the fence-signalling
> > > path, then the optimization to share the extra mem reservation
> > > cost among contexts by returning the chunks to the global kernel
> > > pool rather than directly to the heap).
> >
> > The additional issue with borrowing memory from idle contexts is
> > that it will involve MMU operations, as we will have to move the
> > memory into the active context's address space. CSF GPUs have a
> > limitation that they can only work with one address space for the
> > active job when it comes to memory used internally by the job, so
> > we either have to map the scratch dynamic memory in all the jobs
> > before we submit them, or we will have to do MMU maintenance
> > operations in the OOM path in order to borrow memory from other
> > contexts.
>
> Hm, this could be tricky. So mmu operations shouldn't be an issue,
> because they must work for GFP_NOFS contexts for i/o writeback. You
> might need to manage this much more carefully and make sure the
> iommu has a big enough range of pagetables preallocated. This also
> holds for the other games we're playing here, at least for gpu
> pagetables.
> But since pagetables are really small overhead, it might be good to
> somewhat aggressively preallocate them.
>
> But yeah this is possibly an issue if you need iommu wrangling, I
> have honestly not looked at the exact rules in there.

We already have a mechanism to pre-allocate page tables in panthor
(for async VM_BIND requests where we're not allowed to allocate in
the run_job() path), but as said before, I probably won't try this
global mem pool thing on Panthor, since Panthor can do without it for
now. The page table pre-allocation mechanism is something we can
easily transpose to panfrost though.
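The idea behind that kind of pre-allocation can be sketched as
follows (illustrative only, not the actual panthor code): compute the
worst-case number of page-table pages a VA range can require,
allocate them with GFP_KERNEL at bind time, and let the page-table
walker draw from that per-VM cache in the non-blocking path.

/* Worst-case page-table page count for mapping [va, va + size) with
 * a 4-level, 4K-granule format: each non-root level needs at most
 * one table per aligned span it covers (the root always exists).
 * Numbers and names are illustrative. */

#define PTES_PER_TABLE 512ULL /* 4K table / 8-byte descriptors */
#define GPU_PAGE_SIZE  4096ULL

static unsigned int pt_pages_needed(unsigned long long va,
                                    unsigned long long size)
{
        /* VA covered by one last-level table: 512 * 4K = 2M */
        unsigned long long span = PTES_PER_TABLE * GPU_PAGE_SIZE;
        unsigned int pages = 0;
        int lvl;

        for (lvl = 0; lvl < 3; lvl++) {
                unsigned long long first = va / span;
                unsigned long long last = (va + size - 1) / span;

                pages += last - first + 1;
                span *= PTES_PER_TABLE;
        }
        return pages;
}

Since page tables are a tiny fraction of the memory they map (one 4K
table per 2M of VA at the last level), over-provisioning here is
cheap, which is what makes the aggressive preallocation above viable.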