On 28.08.25 23:28, David Hildenbrand wrote:
On 28.08.25 23:18, David Hildenbrand wrote:
On 26.08.25 18:09, Christian König wrote:
On 26.08.25 14:07, David Hildenbrand wrote:

2) We add another interface that consumes PFNs, but explicitly states
       that it is only for ordinary system RAM, and that the user is
       required to update the direct map.

       We could sanity-check the direct map in debug kernels.
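To make the "sanity-check in debug kernels" idea concrete, here is a rough sketch of what such a check might look like on x86. lookup_address(), pfn_to_kaddr() and _PAGE_CACHE_MASK are real x86 interfaces; the helper name and its exact placement are invented for illustration only.

```c
#ifdef CONFIG_DEBUG_VM
/* Hypothetical debug check: warn if the caller inserts a PFN with
 * cache attributes that disagree with the kernel direct map, i.e.
 * the caller forgot to update the direct map first.
 */
static void sanity_check_directmap(unsigned long pfn, pgprot_t prot)
{
	unsigned int level;
	pte_t *kpte;

	/* Look up the direct-map (linear mapping) PTE for this PFN. */
	kpte = lookup_address((unsigned long)pfn_to_kaddr(pfn), &level);
	if (!kpte)
		return;

	/* The linear-map cache attributes should match what we insert. */
	VM_WARN_ON_ONCE((pte_val(*kpte) & _PAGE_CACHE_MASK) !=
			(pgprot_val(prot) & _PAGE_CACHE_MASK));
}
#endif
```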

I would rather like to see vmf_insert_pfn_prot() fixed instead.

That function was explicitly added to insert the PFN with the given attributes 
and as far as I can see all users of that function expect exactly that.

It's all a bit tricky :(

I would rather say horribly complicated :(


3) We teach PAT code in pfnmap_setup_cachemode_pfn() about treating this
       system RAM differently.


There is also the option for a mixture between 1 and 2, where we get pages, but 
we map them non-refcounted in a VM_PFNMAP.

In general, having pages makes it easier to assert that they are likely 
ordinary system RAM pages, and that the interface is not being abused for 
something else.

Well, that's exactly the use case here, and it is not abuse at all as far as 
I can see.

What drivers want is to insert a PFN with a certain set of caching attributes, 
regardless of whether it's system memory or iomem. That's why 
vmf_insert_pfn_prot() was created in the first place.

I mean, the use case of "allocate pages from the buddy and fixup the linear 
map" sounds perfectly reasonable to me. Absolutely no reason to get PAT involved. 
Nobody else should be messing with that memory after all.

As soon as we are talking about other memory ranges (iomem) that are not from 
the buddy, it gets weird to bypass PAT, and the question I am asking myself is, 
when is it okay, and when not.

Ok let me try to explain parts of the history and the big picture for at least 
the graphics use case on x86.

In 1996/97 Intel came up with the idea of AGP: 
https://en.wikipedia.org/wiki/Accelerated_Graphics_Port

At that time the CPUs, PCI bus and system memory were all connected together 
through the north bridge: https://en.wikipedia.org/wiki/Northbridge_(computing)

The problem was that AGP also introduced the concept of putting large amounts 
of data for the video controller (PCI device) into system memory when you don't 
have enough local device memory (VRAM).

But that meant that when this memory was cached, the north bridge always had to 
snoop the CPU cache over the front-side bus for every access the video 
controller made. That was a huge performance bottleneck, so the idea was born 
to access that data uncached.

Ack.



Well, that was nearly 30 years ago. PCI, AGP and the front-side bus are long 
gone, but the concept of putting video controller (GPU) stuff into uncached 
system memory has prevailed.

So, for example, even modern AMD-based laptops need uncached system memory if 
their local memory is not large enough to contain the picture to display on the 
monitor. And with modern 8k monitors that can actually happen quite quickly...

What drivers do today is call vmf_insert_pfn_prot() either with the PFN of 
their local memory (iomem) or with the PFN of uncached/WC system memory.
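As a sketch of that pattern, a driver fault handler might look roughly like the following. vmf_insert_pfn_prot() and pgprot_writecombine() are the real kernel interfaces; the "drv_bo" structure and its fields are invented for illustration.

```c
/* Hypothetical fault handler: insert either device memory (iomem) or
 * WC system memory with explicit caching attributes.
 */
static vm_fault_t drv_gem_fault(struct vm_fault *vmf)
{
	struct drv_bo *bo = vmf->vma->vm_private_data;	/* made-up type */
	pgprot_t prot = pgprot_writecombine(vmf->vma->vm_page_prot);
	unsigned long pfn;

	if (bo->in_vram) {
		/* Local device memory: iomem, no struct page required. */
		pfn = (bo->vram_base >> PAGE_SHIFT) + vmf->pgoff;
	} else {
		/* System RAM; the driver already fixed up the direct map. */
		pfn = page_to_pfn(bo->pages[vmf->pgoff]);
	}

	return vmf_insert_pfn_prot(vmf->vma, vmf->address, pfn, prot);
}
```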

That makes perfect sense. I assume we might or might not have "struct
page" (pfn_valid) for the iomem, depending on where these areas reside,
correct?



To summarize: that we have an interface to fill in the page tables with either 
iomem or system memory is actually part of the design. That's how the HW driver 
is expected to work.

That drivers need to manually call set_pages_wc/uc() for the linear mapping on 
x86 is correct, and checking that is clearly a good idea for debug kernels.
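For reference, the whole lifecycle described here might look roughly like this. set_memory_wc()/set_memory_wb() are the real x86 interfaces (the per-page set_pages_*() variants mentioned above are thin wrappers around the same machinery); error handling is omitted and a lowmem page is assumed.

```c
/* Allocate ordinary system RAM and fix up the x86 linear mapping
 * before handing the PFN to userspace.
 */
struct page *page = alloc_page(GFP_KERNEL);

/* Change the direct-map attributes so kernel and user mappings agree. */
set_memory_wc((unsigned long)page_address(page), 1);

/* ... later, in the fault handler ... */
vmf_insert_pfn_prot(vmf->vma, vmf->address, page_to_pfn(page),
		    pgprot_writecombine(vmf->vma->vm_page_prot));

/* On free, restore WB before giving the page back to the buddy. */
set_memory_wb((unsigned long)page_address(page), 1);
__free_page(page);
```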

I'll have to think about this a bit: assuming only vmf_insert_pfn() calls 
pfnmap_setup_cachemode_pfn() but vmf_insert_pfn_prot() doesn't, how could we 
sanity-check that somebody is doing something against the will of PAT?

I think the most defensive approach for a quick fix is this change here:

    static inline void pgprot_set_cachemode(pgprot_t *prot,
                                            enum page_cache_mode pcm)
    {
-       *prot = __pgprot((pgprot_val(*prot) & ~_PAGE_CACHE_MASK) |
-                        cachemode2protval(pcm));
+       if (pcm != _PAGE_CACHE_MODE_WB)
+               *prot = __pgprot((pgprot_val(*prot) & ~_PAGE_CACHE_MASK) |
+                                cachemode2protval(pcm));
    }

This applies the PAT value if it's anything other than _PAGE_CACHE_MODE_WB, but 
still allows callers to use something different on normal WB system memory.

What do you think?

This feels like too big of a hammer. In particular, it changes things
like phys_mem_access_prot_allowed(), which requires more care.

First, I thought we should limit what we do to vmf_insert_pfn_prot()
only. But then I realized that we have stuff like

        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

I'm still trying to find out the easy way out that is not a complete hack.

Will the iomem ever be mapped by the driver again with a different cache
mode? (e.g., WB -> UC -> WB)

What I am currently wondering is: assume we get a
pfnmap_setup_cachemode_pfn() call and we could reliably identify whether
there was a previous registration, then we could do

(a) No previous registration: don't modify pgprot. Hopefully the driver
      knows what it is doing. Maybe we can add sanity checks that the
      direct map was already updated etc.

(b) A previous registration: modify pgprot like we do today.
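In code, the (a)/(b) distinction might look roughly like this. The signature matches pfnmap_setup_cachemode_pfn() as described in the thread, but memtype_lookup_registered() is an invented helper standing in for "was there a previous PAT registration for this range".

```c
/* Rough sketch of the registration-based idea above. */
void pfnmap_setup_cachemode_pfn(unsigned long pfn, pgprot_t *prot)
{
	enum page_cache_mode pcm;

	/* Hypothetical helper: look up a prior PAT registration. */
	if (!memtype_lookup_registered(PFN_PHYS(pfn), &pcm)) {
		/* (a) No registration: trust the caller's pgprot.
		 * Debug kernels could sanity-check the direct map here.
		 */
		return;
	}

	/* (b) Previous registration: enforce the tracked cache mode. */
	pgprot_set_cachemode(prot, pcm);
}
```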

System RAM is the problem. I wonder how many of these registrations we
really get and if we could just store them in the same tree as !system
RAM instead of abusing page flags.

commit 9542ada803198e6eba29d3289abb39ea82047b92
Author: Suresh Siddha <suresh.b.sid...@intel.com>
Date:   Wed Sep 24 08:53:33 2008 -0700

    x86: track memtype for RAM in page struct

    Track the memtype for RAM pages in page struct instead of using the
    memtype list. This avoids the explosion in the number of entries in
    memtype list (of the order of 20,000 with AGP) and makes the PAT
    tracking simpler.

    We are using PG_arch_1 bit in page->flags.

    We still use the memtype list for non RAM pages.


I do wonder if that explosion is still an issue today.


--
Cheers

David / dhildenb
