On 07.08.25 18:47, Thomas Hellström wrote:
>> Well it's surprising that even modern CPUs do stuff like that. That
>> could explain some of the problems we had with uncached mappings on
>> ARM and RISC-V.
> 
> Yeah. I agree. Need to double-check with HW people whether that is gone
> with Panther Lake. Don't have a confirmation yet on that.

Going to ask around AMD internally as well.

> If it wasn't for the writeback of speculative prefetches we could've
> settled for only having TTM map the user-space mappings without changing
> the kernel map, just like the i915 driver does for older GPUs.

We should probably come up with a whitelist or blacklist of the CPUs where that 
is actually needed.
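
Totally untested sketch of what such a check could look like; the actual
vendor/model conditions would still have to come from the HW people, so
everything below is a placeholder and not verified against any errata list:

#include <asm/processor.h>

/* Hypothetical helper: return true when the caching attributes of the
 * kernel linear map need to be changed in addition to the user mapping.
 */
static bool ttm_pool_needs_kernel_map_change(void)
{
	/* Placeholder policy: assume it is needed until a CPU is
	 * confirmed not to write back clean cache lines after
	 * speculative prefetches.
	 */
	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
		return true; /* e.g. until Panther Lake is confirmed */
	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
		return true; /* pending the AMD-internal answer */
	return false;
}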

>>> Second, IIRC vmf_insert_pfn_prot() on X86 will override the given
>>> caching mode with the last caching mode set for the kernel linear map,
>>> so if you try to set up a write-combined GPU mapping without a previous
>>> call to set_pages_xxxxx it will actually end up cached, see
>>> track_pfn_insert().
>>
>> That is exactly the same incorrect assumption I made as well.
>>
>> It's not the linear mapping where that comes from but a separate page
>> attribute table, see /sys/kernel/debug/x86/pat_memtype_list.
>>
>> Question is why the heck should we do this? I mean we keep an extra
>> rb tree around to overwrite something the driver knows in the first
>> place?
>>
>> That is basically just tons of extra overhead for nothing as far as I
>> can see.
> 
> IIRC it was PAT people enforcing the x86 documentation that aliased
> mappings with conflicting caching attributes were not allowed. But it
> has proven to work at least on those CPUs not suffering from the clean
> cache-line writeback mentioned above.

Makes sense. With the PAT handling, even accessing things through /dev/mem gives 
you the right caching.
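
For reference, as far as I remember the override boils down to roughly this
on the x86 side (paraphrased from memory of arch/x86/mm/pat/memtype.c,
details may differ between kernel versions):

void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
		      pfn_t pfn)
{
	enum page_cache_mode pcm;

	if (!pat_enabled())
		return;

	/* Whatever caching the driver asked for in *prot gets replaced
	 * by the memtype recorded for this pfn in the PAT rb tree.
	 */
	pcm = lookup_memtype(pfn_t_to_phys(pfn));
	*prot = __pgprot((pgprot_val(*prot) & ~_PAGE_CACHE_MASK) |
			 cachemode2protval(pcm));
}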

Do you have a list of Intel CPUs where it works?

> FWIW, if I understand the code correctly, i915 bypasses this by setting
> up user-space mappings not with vmf_insert_pfn_prot() but with
> apply_to_page_range(), mapping the whole bo.

Yeah, that's probably not something we can do. Even filling in 2MiB of address 
space at a time caused performance problems for TTM.
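
For reference, the i915 path looks roughly like this (condensed from memory
of i915's remap_io_mapping() helper, names and details may not match the
actual code exactly):

#include <linux/mm.h>

struct remap_pfn {
	struct mm_struct *mm;
	unsigned long pfn;
	pgprot_t prot;
};

static int remap_pfn_fn(pte_t *pte, unsigned long addr, void *data)
{
	struct remap_pfn *r = data;

	/* Write the PTE with exactly the requested caching attributes,
	 * bypassing the PAT override in vmf_insert_pfn_prot().
	 */
	set_pte_at(r->mm, addr, pte, pte_mkspecial(pfn_pte(r->pfn, r->prot)));
	r->pfn++;
	return 0;
}

static int remap_whole_bo(struct vm_area_struct *vma, unsigned long pfn)
{
	struct remap_pfn r = {
		.mm = vma->vm_mm,
		.pfn = pfn,
		.prot = vma->vm_page_prot,
	};

	/* One walk over the whole VMA up front, which is exactly the
	 * part that is too expensive for TTM.
	 */
	return apply_to_page_range(vma->vm_mm, vma->vm_start,
				   vma->vm_end - vma->vm_start,
				   remap_pfn_fn, &r);
}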

We should probably just drop overriding the attributes in vmf_insert_pfn_prot().

Regards,
Christian.

> 
> /Thomas
> 
> 
>>
>> Thanks for taking a look,
>> Christian.
>>
>>>
>>> /Thomas
>>>
>>>
>>>> ---
>>>>  drivers/gpu/drm/ttm/ttm_pool.c | 16 +++++++++++-----
>>>>  1 file changed, 11 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
>>>> index baf27c70a419..7487eac29398 100644
>>>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
>>>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
>>>> @@ -38,7 +38,7 @@
>>>>  #include <linux/highmem.h>
>>>>  #include <linux/sched/mm.h>
>>>>  
>>>> -#ifdef CONFIG_X86
>>>> +#ifdef CONFIG_X86_32
>>>>  #include <asm/set_memory.h>
>>>>  #endif
>>>>  
>>>> @@ -46,6 +46,7 @@
>>>>  #include <drm/ttm/ttm_pool.h>
>>>>  #include <drm/ttm/ttm_tt.h>
>>>>  #include <drm/ttm/ttm_bo.h>
>>>> +#include <drm/drm_cache.h>
>>>>  
>>>>  #include "ttm_module.h"
>>>>  
>>>> @@ -192,7 +193,7 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
>>>>  	struct ttm_pool_dma *dma;
>>>>  	void *vaddr;
>>>>  
>>>> -#ifdef CONFIG_X86
>>>> +#ifdef CONFIG_X86_32
>>>>  	/* We don't care that set_pages_wb is inefficient here. This is only
>>>>  	 * used when we have to shrink and CPU overhead is irrelevant then.
>>>>  	 */
>>>> @@ -218,7 +219,7 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
>>>>  /* Apply any cpu-caching deferred during page allocation */
>>>>  static int ttm_pool_apply_caching(struct ttm_pool_alloc_state *alloc)
>>>>  {
>>>> -#ifdef CONFIG_X86
>>>> +#ifdef CONFIG_X86_32
>>>>  	unsigned int num_pages = alloc->pages - alloc->caching_divide;
>>>>  
>>>>  	if (!num_pages)
>>>> @@ -232,6 +233,11 @@ static int ttm_pool_apply_caching(struct ttm_pool_alloc_state *alloc)
>>>>  	case ttm_uncached:
>>>>  		return set_pages_array_uc(alloc->caching_divide, num_pages);
>>>>  	}
>>>> +
>>>> +#elif defined(CONFIG_X86_64)
>>>> +	unsigned int num_pages = alloc->pages - alloc->caching_divide;
>>>> +
>>>> +	drm_clflush_pages(alloc->caching_divide, num_pages);
>>>>  #endif
>>>>  	alloc->caching_divide = alloc->pages;
>>>>  	return 0;
>>>> @@ -342,7 +348,7 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
>>>>  	if (pool->use_dma_alloc)
>>>>  		return &pool->caching[caching].orders[order];
>>>>  
>>>> -#ifdef CONFIG_X86
>>>> +#ifdef CONFIG_X86_32
>>>>  	switch (caching) {
>>>>  	case ttm_write_combined:
>>>>  		if (pool->nid != NUMA_NO_NODE)
>>>> @@ -980,7 +986,7 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
>>>>  	    pool->use_dma_alloc || ttm_tt_is_backed_up(tt))
>>>>  		return -EBUSY;
>>>>  
>>>> -#ifdef CONFIG_X86
>>>> +#ifdef CONFIG_X86_32
>>>>  	/* Anything returned to the system needs to be cached. */
>>>>  	if (tt->caching != ttm_cached)
>>>>  		set_pages_array_wb(tt->pages, tt->num_pages);
>>>
>>
> 
