On 07.08.25 18:47, Thomas Hellström wrote:
>> Well it's surprising that even modern CPUs do stuff like that. That
>> could explain some of the problems we had with uncached mappings on
>> ARM and RISC-V.
>
> Yeah, I agree. Need to double-check with HW people whether that is gone
> with Panther Lake. Don't have a confirmation yet on that.

Going to ask around AMD internally as well.

> If it wasn't for the writeback of speculative prefetches we could've
> settled for only having TTM map the user-space mappings without
> changing the kernel map, just like the i915 driver does for older GPUs.

We should probably come up with a CPU whitelist or blacklist for where
that is actually needed.

>>> Second, IIRC vm_insert_pfn_prot() on X86 will override the given
>>> caching mode with the last caching mode set for the kernel linear
>>> map, so if you try to set up a write-combined GPU mapping without a
>>> previous call to set_pages_xxxxx() it will actually end up cached;
>>> see track_pfn_insert().
>>
>> That is exactly the same incorrect assumption I made as well.
>>
>> It's not the linear mapping where that comes from, but a separate page
>> attribute table, see /sys/kernel/debug/x86/pat_memtype_list.
>>
>> Question is why the heck should we do this? I mean we keep an extra
>> rb tree around to overwrite something the driver knows in the first
>> place?
>>
>> That is basically just tons of extra overhead for nothing as far as I
>> can see.
>
> IIRC it was the PAT people enforcing the x86 documentation that aliased
> mappings with conflicting caching attributes are not allowed. But it
> has proven to work, at least on those CPUs not suffering from the clean
> cache-line writeback mentioned above.

Makes sense. With the PAT handling, even accessing things through
/dev/mem gives you the right caching.
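To make the override concrete, here is a minimal sketch of a fault
handler requesting a write-combined user-space mapping; my_bo_fault(),
struct my_bo and its pages array are made-up stand-ins for a driver's
BO handling, not anything from TTM or this patch:

struct my_bo {
        struct page **pages;    /* hypothetical backing pages of the BO */
};

static vm_fault_t my_bo_fault(struct vm_fault *vmf)
{
        struct my_bo *bo = vmf->vma->vm_private_data;
        unsigned long pfn = page_to_pfn(bo->pages[vmf->pgoff]);
        pgprot_t prot = pgprot_writecombine(vmf->vma->vm_page_prot);

        /*
         * On x86 vmf_insert_pfn_prot() goes through track_pfn_insert(),
         * which consults the PAT memtype tracking and may replace the
         * caching mode requested in 'prot'.  Without a previous
         * set_pages_array_wc() on these pages the mapping can therefore
         * silently end up write-back instead of write-combined.
         */
        return vmf_insert_pfn_prot(vmf->vma, vmf->address, pfn, prot);
}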
Do you have a list of Intel CPUs where it works?

> FWIW, if I understand the code correctly, i915 bypasses this by setting
> up user-space mappings not with vm_insert_pfn_prot() but with
> apply_to_page_range(), mapping the whole bo.

Yeah, that's probably not something we can do. Even filling in 2MiB of
address space at a time caused performance problems for TTM.
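For reference, if I read i915 correctly the bypass looks roughly like
the sketch below (see remap_io_mapping()/remap_io_sg() in i915 for the
real thing; remap_one_pte(), map_bo_wc() and struct remap_ctx are
made-up names). apply_to_page_range() writes the PTEs directly with
whatever pgprot the caller hands in, so track_pfn_insert() never gets a
chance to override it:

struct remap_ctx {
        struct mm_struct *mm;
        unsigned long pfn;      /* next pfn to map */
        pgprot_t prot;
};

static int remap_one_pte(pte_t *pte, unsigned long addr, void *data)
{
        struct remap_ctx *ctx = data;

        /* The PTE is written as-is, no PAT memtype lookup happens here. */
        set_pte_at(ctx->mm, addr, pte,
                   pte_mkspecial(pfn_pte(ctx->pfn++, ctx->prot)));
        return 0;
}

static int map_bo_wc(struct vm_area_struct *vma, unsigned long pfn)
{
        struct remap_ctx ctx = {
                .mm   = vma->vm_mm,
                .pfn  = pfn,
                .prot = pgprot_writecombine(vma->vm_page_prot),
        };

        return apply_to_page_range(vma->vm_mm, vma->vm_start,
                                   vma->vm_end - vma->vm_start,
                                   remap_one_pte, &ctx);
}

And the downside is exactly what makes it unattractive for us: it
populates the whole range up front instead of faulting pages in on
demand.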
We should probably just drop overriding the attributes in
vmf_insert_pfn_prot().

Regards,
Christian.

>
> /Thomas
>
>
>>
>> Thanks for taking a look,
>> Christian.
>>
>>>
>>> /Thomas
>>>
>>>
>>>> ---
>>>>  drivers/gpu/drm/ttm/ttm_pool.c | 16 +++++++++++-----
>>>>  1 file changed, 11 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
>>>> index baf27c70a419..7487eac29398 100644
>>>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
>>>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
>>>> @@ -38,7 +38,7 @@
>>>>  #include <linux/highmem.h>
>>>>  #include <linux/sched/mm.h>
>>>>
>>>> -#ifdef CONFIG_X86
>>>> +#ifdef CONFIG_X86_32
>>>>  #include <asm/set_memory.h>
>>>>  #endif
>>>>
>>>> @@ -46,6 +46,7 @@
>>>>  #include <drm/ttm/ttm_pool.h>
>>>>  #include <drm/ttm/ttm_tt.h>
>>>>  #include <drm/ttm/ttm_bo.h>
>>>> +#include <drm/drm_cache.h>
>>>>
>>>>  #include "ttm_module.h"
>>>>
>>>> @@ -192,7 +193,7 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
>>>>  	struct ttm_pool_dma *dma;
>>>>  	void *vaddr;
>>>>
>>>> -#ifdef CONFIG_X86
>>>> +#ifdef CONFIG_X86_32
>>>>  	/* We don't care that set_pages_wb is inefficient here. This is only
>>>>  	 * used when we have to shrink and CPU overhead is irrelevant then.
>>>>  	 */
>>>> @@ -218,7 +219,7 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
>>>>  /* Apply any cpu-caching deferred during page allocation */
>>>>  static int ttm_pool_apply_caching(struct ttm_pool_alloc_state *alloc)
>>>>  {
>>>> -#ifdef CONFIG_X86
>>>> +#ifdef CONFIG_X86_32
>>>>  	unsigned int num_pages = alloc->pages - alloc->caching_divide;
>>>>
>>>>  	if (!num_pages)
>>>> @@ -232,6 +233,11 @@ static int ttm_pool_apply_caching(struct ttm_pool_alloc_state *alloc)
>>>>  	case ttm_uncached:
>>>>  		return set_pages_array_uc(alloc->caching_divide, num_pages);
>>>>  	}
>>>> +
>>>> +#elif defined(CONFIG_X86_64)
>>>> +	unsigned int num_pages = alloc->pages - alloc->caching_divide;
>>>> +
>>>> +	drm_clflush_pages(alloc->caching_divide, num_pages);
>>>>  #endif
>>>>  	alloc->caching_divide = alloc->pages;
>>>>  	return 0;
>>>> @@ -342,7 +348,7 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
>>>>  	if (pool->use_dma_alloc)
>>>>  		return &pool->caching[caching].orders[order];
>>>>
>>>> -#ifdef CONFIG_X86
>>>> +#ifdef CONFIG_X86_32
>>>>  	switch (caching) {
>>>>  	case ttm_write_combined:
>>>>  		if (pool->nid != NUMA_NO_NODE)
>>>> @@ -980,7 +986,7 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
>>>>  	pool->use_dma_alloc || ttm_tt_is_backed_up(tt))
>>>>  		return -EBUSY;
>>>>
>>>> -#ifdef CONFIG_X86
>>>> +#ifdef CONFIG_X86_32
>>>>  	/* Anything returned to the system needs to be cached. */
>>>>  	if (tt->caching != ttm_cached)
>>>>  		set_pages_array_wb(tt->pages, tt->num_pages);
>>>
>>
>