On 5/13/26 21:43, David Hildenbrand (Arm) wrote:
> On 5/13/26 13:54, Christian König wrote:
>> On 5/13/26 10:37, David Hildenbrand (Arm) wrote:
>>> On 5/13/26 09:47, Christian König wrote:
>>>> Hi David & Thomas,
>>>>
>>>> ...
>>>>
>>>> Exactly that is one of the major reasons why we aren't using a shmem as
>>>> backing store for TTM buffers in the first place.
>>>
>>> What was the problem with that the last time this was considered?
>>>
>>> shmem nowadays supports THP (e.g., 2M) and even mTHP (e.g., 64K).
>>>
>>> For internal mounts, it must be enabled accordingly
>>> (/sys/kernel/mm/transparent_hugepage/.../shmem_enabled).
>>>
>>> Some distributions still default to "never". I guess if an admin enables
>>> it, you would just get THPs.
>>
>> Yeah, exactly that is not acceptable. We have some customers who already use
>> that approach through udmabuf, so we already have some experience with it.
>>
>> And I can't count how often I had to explain that it's a configuration issue
>> and that the admin has to enable THP to get decent performance.
>
> A lot of concern with shmem and THP is around memory waste.
>
> But the DRM use case of shmem is really different than ordinary shmem usage,
> in particular when it can steer the folio size used (and better control
> memory waste I assume?).

Yeah, memory waste is absolutely no concern for this use case.

The reason THP often causes memory waste is that applications tend to over-allocate, and the kernel compensates for that by doing on-demand allocation. In other words, you only allocate the memory on the first page fault, and when you switch to 2MiB allocations you can easily end up wasting a lot of it.

But here we explicitly allocate X amount of memory. So when drivers get a 49MiB allocation request, they allocate 24x2MiB and then fill the remaining 1MiB with the biggest order available, or just with 4KiB pages.

So for this use case there is no memory waste at all.

>>> If "distro default" is the only problem, I guess we could think about how to
>>> improve that. For example, just let internal GPU DRM objects allocate any
>>> folio size available and supported etc.
>>
>> Mhm, that sounds not so bad.
>>
>> I think what drivers really need is that they can give the order to
>> shmem_read_folio_gfp() and get a folio with that order or -ENOMEM as return.
>
> Is it really helpful to fail and not make progress on a fragmented system?

Yeah, that is pretty much a must-have.

Background is that THP is a best-effort feature, i.e. you try to achieve the best performance possible, but it's OK if you can't because of fragmentation.

But in this use case the caller does want to kick off defragmentation and wait for it to finish before failing or falling back to smaller orders.

> I'd assume you'd want the largest folio up to a specific order. Not larger,
> because it would waste memory. But maybe smaller to make progress.

What would be possible is to pass a more complex structure than just order+gfp flags, one which completely describes what to do before returning memory. But I think that just makes the interface more complicated for a rather narrow use case.
The gfp flags already give the ability to say that a certain allocation can fail without a warning, and that seems to be sufficient.

>>
>> In other words we need to enforce it and if the desired page size doesn't
>> work we can then still decide if we want a fallback or not based on the use
>> case the driver tries to implement.
>
> In which use case would you not want to fallback? Would it be something to
> tackle separately later?

Well, basically every use case which wants to guarantee rendering performance, e.g. everything real-time.

As a simple example, just think of virtual reality: you would rather have a black screen than laggy rendering which causes motion sickness.

Regards,
Christian.

>>
>>> Would that make it possible to just use shmem natively? (e.g., how would
>>> this interact with shmem features like folio migration, would that be
>>> workable with DRM objects?).
>>
>> Mostly, I mean there is still the use case for UC and USWC memory but at
>> least for AMD GPUs that is mostly negligible (we need it for a handful of
>> workarounds for HW bugs etc...).
>
> Thanks!
>
