On Mon, 2025-08-11 at 13:51 +0200, Christian König wrote:
> On 07.08.25 18:47, Thomas Hellström wrote:
> > > Well it's surprising that even modern CPUs do stuff like that.
> > > That could explain some of the problems we had with uncached
> > > mappings on ARM and RISC-V.
> > 
> > Yeah. I agree. Need to double-check with HW people whether that is
> > gone with Panther Lake. Don't have a confirmation yet on that.
> 
> Going to ask around AMD internally as well.
> 
> > If it wasn't for the writeback of speculative prefetches we
> > could've settled for only having TTM map the user-space mappings
> > without changing the kernel map, just like the i915 driver does
> > for older GPUs.
> 
> We should probably come up with a CPU whitelist or blacklist where
> that is actually needed.
> 
> > > > Second, IIRC vmf_insert_pfn_prot() on X86 will override the
> > > > given caching mode with the last caching mode set for the
> > > > kernel linear map, so if you try to set up a write-combined
> > > > GPU mapping without a previous call to set_pages_xxxxx it
> > > > will actually end up cached. See track_pfn_insert().
> > > 
> > > That is exactly the same incorrect assumption I made as well.
> > > 
> > > It's not the linear mapping where that comes from but a separate
> > > page attribute table, see /sys/kernel/debug/x86/pat_memtype_list.
> > > 
> > > Question is why the heck should we do this? I mean we keep an
> > > extra rb tree around to overwrite something the driver knows in
> > > the first place?
> > > 
> > > That is basically just tons of extra overhead for nothing as far
> > > as I can see.
> > 
> > IIRC it was the PAT people enforcing the x86 documentation that
> > aliased mappings with conflicting caching attributes are not
> > allowed. But it has proven to work at least on those CPUs not
> > suffering from the clean cache-line writeback mentioned above.
> 
> Makes sense. With the PAT handling even accessing things through
> /dev/mem gives you the right caching.
> 
> Do you have a list of Intel CPUs where it works?

AFAIK LNL is the only one so far where it doesn't work. I need to get
more information internally.
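To make the override discussed above concrete, this is roughly where
it kicks in on the fault path. Untested illustration only;
my_bo_fault() and my_bo_pfn() are made-up names:

#include <linux/mm.h>

static vm_fault_t my_bo_fault(struct vm_fault *vmf)
{
	unsigned long pfn = my_bo_pfn(vmf);	/* driver-specific lookup */
	pgprot_t prot = pgprot_writecombine(vmf->vma->vm_page_prot);

	/*
	 * On x86, vmf_insert_pfn_prot() goes through
	 * track_pfn_insert(), which looks up the pfn in the PAT
	 * memtype rb-tree and replaces the requested WC attribute
	 * with whatever was last reserved for that range, typically
	 * WB unless set_pages_array_wc() or similar ran first.
	 */
	return vmf_insert_pfn_prot(vmf->vma, vmf->address, pfn, prot);
}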
> > FWIW If I understand the code correctly, i915 bypasses this by
> > setting up user-space mappings not by vmf_insert_pfn_prot() but
> > using apply_to_page_range(), mapping the whole bo.
> 
> Yeah, that's probably not something we can do. Even filling in 2MiB
> of address space at a time caused performance problems for TTM.

Wasn't that because of repeated calls to vmf_insert_pfn_prot(),
repeating the caching checks and the page-table walk all the time? I
think apply_to_page_range() should be pretty fast.
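For reference, a simplified and untested sketch of the i915 approach,
modeled on remap_io_mapping() in drivers/gpu/drm/i915/i915_mm.c;
remap_pfn_data, remap_pfn_cb() and remap_whole_bo() are made-up
names. The pte callback just writes the entries, with no per-page
memtype lookup, and the range is walked only once:

#include <linux/mm.h>
#include <linux/pgtable.h>

struct remap_pfn_data {
	struct mm_struct *mm;
	unsigned long pfn;
	pgprot_t prot;
};

static int remap_pfn_cb(pte_t *pte, unsigned long addr, void *data)
{
	struct remap_pfn_data *r = data;

	/* Special PTEs are not associated with any struct page. */
	set_pte_at(r->mm, addr, pte,
		   pte_mkspecial(pfn_pte(r->pfn++, r->prot)));
	return 0;
}

static int remap_whole_bo(struct vm_area_struct *vma, unsigned long pfn)
{
	struct remap_pfn_data r = {
		.mm = vma->vm_mm,
		.pfn = pfn,
		.prot = pgprot_writecombine(vma->vm_page_prot),
	};

	/* One page-table walk for the whole range. */
	return apply_to_page_range(vma->vm_mm, vma->vm_start,
				   vma->vm_end - vma->vm_start,
				   remap_pfn_cb, &r);
}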
Also, to avoid regressions due to changing the number of prefaulted
pages, we could perhaps honor the MADV_RANDOM and MADV_SEQUENTIAL
advice hints for UMDs to use; one faulting a single page only, the
other faulting the whole bo (a rough sketch of what I mean is in the
PS at the end of this mail), but also see below:

> We should probably just drop overriding the attributes in
> vmf_insert_pfn_prot().

I think either solution will see resistance from the arch people. We
should probably involve them in the discussion.

Thanks,
/Thomas

> 
> Regards,
> Christian.
> 
> > 
> > /Thomas
> > 
> > 
> > 
> > > Thanks for taking a look,
> > > Christian.
> > > 
> > > > 
> > > > /Thomas
> > > > 
> > > > 
> > > > > ---
> > > > >  drivers/gpu/drm/ttm/ttm_pool.c | 16 +++++++++++-----
> > > > >  1 file changed, 11 insertions(+), 5 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> > > > > index baf27c70a419..7487eac29398 100644
> > > > > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > > > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > > > > @@ -38,7 +38,7 @@
> > > > >  #include <linux/highmem.h>
> > > > >  #include <linux/sched/mm.h>
> > > > >  
> > > > > -#ifdef CONFIG_X86
> > > > > +#ifdef CONFIG_X86_32
> > > > >  #include <asm/set_memory.h>
> > > > >  #endif
> > > > >  
> > > > > @@ -46,6 +46,7 @@
> > > > >  #include <drm/ttm/ttm_pool.h>
> > > > >  #include <drm/ttm/ttm_tt.h>
> > > > >  #include <drm/ttm/ttm_bo.h>
> > > > > +#include <drm/drm_cache.h>
> > > > >  
> > > > >  #include "ttm_module.h"
> > > > >  
> > > > > @@ -192,7 +193,7 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
> > > > >  	struct ttm_pool_dma *dma;
> > > > >  	void *vaddr;
> > > > >  
> > > > > -#ifdef CONFIG_X86
> > > > > +#ifdef CONFIG_X86_32
> > > > >  	/* We don't care that set_pages_wb is inefficient here. This is only
> > > > >  	 * used when we have to shrink and CPU overhead is irrelevant then.
> > > > >  	 */
> > > > > @@ -218,7 +219,7 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
> > > > >  /* Apply any cpu-caching deferred during page allocation */
> > > > >  static int ttm_pool_apply_caching(struct ttm_pool_alloc_state *alloc)
> > > > >  {
> > > > > -#ifdef CONFIG_X86
> > > > > +#ifdef CONFIG_X86_32
> > > > >  	unsigned int num_pages = alloc->pages - alloc->caching_divide;
> > > > >  
> > > > >  	if (!num_pages)
> > > > > @@ -232,6 +233,11 @@ static int ttm_pool_apply_caching(struct ttm_pool_alloc_state *alloc)
> > > > >  	case ttm_uncached:
> > > > >  		return set_pages_array_uc(alloc->caching_divide, num_pages);
> > > > >  	}
> > > > > +
> > > > > +#elif defined(CONFIG_X86_64)
> > > > > +	unsigned int num_pages = alloc->pages - alloc->caching_divide;
> > > > > +
> > > > > +	drm_clflush_pages(alloc->caching_divide, num_pages);
> > > > >  #endif
> > > > >  	alloc->caching_divide = alloc->pages;
> > > > >  	return 0;
> > > > > @@ -342,7 +348,7 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
> > > > >  	if (pool->use_dma_alloc)
> > > > >  		return &pool->caching[caching].orders[order];
> > > > >  
> > > > > -#ifdef CONFIG_X86
> > > > > +#ifdef CONFIG_X86_32
> > > > >  	switch (caching) {
> > > > >  	case ttm_write_combined:
> > > > >  		if (pool->nid != NUMA_NO_NODE)
> > > > > @@ -980,7 +986,7 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
> > > > >  	    pool->use_dma_alloc || ttm_tt_is_backed_up(tt))
> > > > >  		return -EBUSY;
> > > > >  
> > > > > -#ifdef CONFIG_X86
> > > > > +#ifdef CONFIG_X86_32
> > > > >  	/* Anything returned to the system needs to be cached. */
> > > > >  	if (tt->caching != ttm_cached)
> > > > >  		set_pages_array_wb(tt->pages, tt->num_pages);
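PS: Regarding the madvise idea above, this is roughly what I have in
mind. Untested sketch; bo_prefault_count() is a made-up helper:

#include <linux/mm.h>
#include <drm/ttm/ttm_bo.h>

static unsigned long bo_prefault_count(struct vm_fault *vmf,
				       unsigned long bo_pages)
{
	if (vmf->vma->vm_flags & VM_RAND_READ)	/* MADV_RANDOM */
		return 1;
	if (vmf->vma->vm_flags & VM_SEQ_READ)	/* MADV_SEQUENTIAL */
		return bo_pages;

	return TTM_BO_VM_NUM_PREFAULT;		/* current default */
}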