You write mails faster than I can answer :) I will try to answer all questions and comments here, but please ping me if I miss something.
On 28.08.25 23:32, David Hildenbrand wrote:
> On 28.08.25 23:28, David Hildenbrand wrote:
>> On 28.08.25 23:18, David Hildenbrand wrote:
>>> On 26.08.25 18:09, Christian König wrote:
>>>> On 26.08.25 14:07, David Hildenbrand wrote:

[SNIP]

>>>> Well that was nearly 30 years ago. PCI, AGP and the front side bus are
>>>> long gone, but the concept of putting video controller (GPU) stuff into
>>>> uncached system memory has prevailed.
>>>>
>>>> So for example even modern AMD CPU based laptops need uncached system
>>>> memory if their local memory is not large enough to contain the picture
>>>> to display on the monitor. And with modern 8k monitors that can actually
>>>> happen quite fast...
>>>>
>>>> What drivers do today is to call vmf_insert_pfn_prot() either with the
>>>> PFN of their local memory (iomem) or uncached/wc system memory.
>>>
>>> That makes perfect sense. I assume we might or might not have "struct
>>> page" (pfn_valid) for the iomem, depending on where these areas reside,
>>> correct?

Exactly that, yes.

>>>> To summarize: that we have an interface to fill in the page tables with
>>>> either iomem or system memory is actually part of the design. That's how
>>>> the HW driver is expected to work.
>>>>
>>>>>> That drivers need to call set_pages_wc/uc() for the linear mapping on
>>>>>> x86 manually is correct and checking that is clearly a good idea for
>>>>>> debug kernels.
>>>>>
>>>>> I'll have to think about this a bit: assuming only vmf_insert_pfn()
>>>>> calls pfnmap_setup_cachemode_pfn() but vmf_insert_pfn_prot() doesn't,
>>>>> how could we sanity check that somebody is doing something against the
>>>>> will of PAT?
>>>>
>>>> I think the most defensive approach for a quick fix is this change here:
>>>>
>>>> static inline void pgprot_set_cachemode(pgprot_t *prot, enum page_cache_mode pcm)
>>>> {
>>>> -	*prot = __pgprot((pgprot_val(*prot) & ~_PAGE_CACHE_MASK) |
>>>> -			 cachemode2protval(pcm));
>>>> +	if (pcm != _PAGE_CACHE_MODE_WB)
>>>> +		*prot = __pgprot((pgprot_val(*prot) & ~_PAGE_CACHE_MASK) |
>>>> +				 cachemode2protval(pcm));
>>>> }
>>>>
>>>> This applies the PAT value if it's anything else than _PAGE_CACHE_MODE_WB
>>>> but still allows callers to use something different on normal WB system
>>>> memory.
>>>>
>>>> What do you think?
>>>
>>> This feels like too big of a hammer. In particular, it changes things
>>> like phys_mem_access_prot_allowed(), which requires more care.
>>>
>>> First, I thought we should limit what we do to vmf_insert_pfn_prot()
>>> only. But then I realized that we have stuff like
>>>
>>> vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
>>>
>>> I'm still trying to find an easy way out that is not a complete hack.

Well, I think when the change is limited to only vmf_insert_pfn_prot() for now we can limit the risk quite a bit as well. Background is that only a handful of callers use vmf_insert_pfn_prot(), and it looks like all of those actually do know what they are doing and use the right flags.

>>> Will the iomem ever be mapped by the driver again with a different cache
>>> mode? (e.g., WB -> UC -> WB)

Yes, that can absolutely happen. But for iomem we would have an explicit call to ioremap(), ioremap_wc() or ioremap_cache() for that before anybody would map anything into userspace page tables.

But thinking more about it, I just had an OMFG moment! Is it possible that the PAT currently already has a problem with that? We had customer projects where BARs of different PCIe devices ended up at different physical addresses after a hot remove/re-add.
Is it possible that the PAT keeps enforcing certain caching attributes for a physical address, e.g. because a driver doesn't clean up properly on hot remove? If yes, then that would explain a massive number of problems we had with hot add/remove.

>> What I am currently wondering is: assume we get a
>> pfnmap_setup_cachemode_pfn() call and we could reliably identify whether
>> there was a previous registration, then we could do
>>
>> (a) No previous registration: don't modify pgprot. Hopefully the driver
>>     knows what it is doing. Maybe we can add sanity checks that the
>>     direct map was already updated etc.
>> (b) A previous registration: modify pgprot like we do today.

That would work for me.

>> System RAM is the problem. I wonder how many of these registrations we
>> really get and if we could just store them in the same tree as !system
>> RAM instead of abusing page flags.
>
> commit 9542ada803198e6eba29d3289abb39ea82047b92
> Author: Suresh Siddha <suresh.b.sid...@intel.com>
> Date:   Wed Sep 24 08:53:33 2008 -0700
>
>     x86: track memtype for RAM in page struct
>
>     Track the memtype for RAM pages in page struct instead of using the
>     memtype list. This avoids the explosion in the number of entries in
>     memtype list (of the order of 20,000 with AGP) and makes the PAT
>     tracking simpler.
>
>     We are using PG_arch_1 bit in page->flags.
>
>     We still use the memtype list for non RAM pages.
>
> I do wonder if that explosion is still an issue today.

Yes it is. That is exactly the issue I'm working on here. It's just that AGP was replaced by internal GPU MMUs over time, so we don't use the old AGP code any more but just call get_free_pages() (or similar) directly.

Thanks a lot for the help,
Christian.