On 26.08.25 10:38, Christian König wrote:
On 25.08.25 21:10, David Hildenbrand wrote:
On 21.08.25 10:10, Christian König wrote:
On 20.08.25 17:23, David Hildenbrand wrote:
CCing Lorenzo
On 20.08.25 16:33, Christian König wrote:
Hi everyone,
sorry for CCing so many people, but that rabbit hole turned out to be
deeper than originally thought.
TTM has always had problems with UC/WC mappings on 32-bit systems, and drivers
often had to resort to hacks like using GFP_DMA32 to get things working,
without any rational explanation for why that helped (see the TTM AGP,
radeon and nouveau driver code for that).
It turned out that the PAT implementation we use on x86 not only enforces
the same caching attributes for pages in the linear kernel mapping, but
also for highmem pages through a separate R/B tree.
That was unexpected, and TTM never updated that R/B tree for highmem pages,
so the function pgprot_set_cachemode() just overwrote the caching
attributes drivers passed in to vmf_insert_pfn_prot(), and that essentially
caused all kinds of random trouble.
An R/B tree is potentially not a good data structure to hold individual
attributes for thousands if not millions of pages, so updating that tree is
probably not the way to solve this issue.
Thomas pointed out that the i915 driver uses apply_page_range()
instead of vmf_insert_pfn_prot() to circumvent the PAT implementation and
just fills in the page tables with what the driver thinks is the right
caching attribute.
I assume you mean apply_to_page_range() -- same issue in patch subjects.
Oh yes, of course. Sorry.
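Roughly, what i915 does looks like this (untested sketch from memory; the
struct and callback names are made up):

    struct remap_ctx {
            struct mm_struct *mm;
            unsigned long pfn;      /* next PFN to map */
            pgprot_t prot;          /* caching attributes the driver wants */
    };

    static int remap_one_pte(pte_t *pte, unsigned long addr, void *data)
    {
            struct remap_ctx *ctx = data;

            /* Write the PTE with exactly the prot the driver chose;
             * no PAT lookup, no pgprot_set_cachemode() override. */
            set_pte_at(ctx->mm, addr, pte,
                       pte_mkspecial(pfn_pte(ctx->pfn++, ctx->prot)));
            return 0;
    }

    err = apply_to_page_range(vma->vm_mm, vma->vm_start,
                              vma->vm_end - vma->vm_start,
                              remap_one_pte, &ctx);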
Oh this sounds horrible. Why oh why do we have these hacks in core-mm and have
drivers abuse them :(
Yeah, I was also a bit hesitant to use that, but the performance advantage is
so high that we probably can't avoid the general approach.
Honestly, apply_to_pte_range() is just the entry point for doing all kinds of weird crap to page
tables because "you know better".
Exactly, that's the problem I'm pointing out: drivers *do* know it better. The
core memory management applied incorrect values, which caused all kinds of
trouble.
The problem is not a bug in PAT nor TTM/drivers but rather how they interact
with each other.
What I don't understand is why do we have the PAT in the first place? No other
architecture does it this way.
Probably because no other architecture has these weird glitches, I assume ...
skimming over memtype_reserve() and friends, there are quite a few corner cases
the code is handling (BIOS, ACPI, low ISA, system RAM, ...)
I did a lot of work on the higher PAT level functions, but I am no expert on
the lower level management functions, and in particular all the special cases
with different memory types.
IIRC, the goal of the PAT subsystem is to make sure that no two page tables map
the same PFN with different caching attributes.
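IIUC, in practice the invariant looks like this (sketch):

    /* The second call requests a conflicting caching mode for the same
     * range; memtype_reserve() notices the conflict and the ioremap
     * fails with a "conflicting memory types" warning. */
    void __iomem *a = ioremap_wc(base, size);    /* reserves range as WC */
    void __iomem *b = ioremap_cache(base, size); /* conflict -> NULL */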
Yeah, that actually makes sense. Thomas from Intel recently explained the
technical background to me:
Some x86 CPUs write back cache lines even if they aren't dirty, and what can
happen is that, because of the linear mapping, the CPU speculatively loads a
cache line which is elsewhere mapped uncached.
So the end result is that the writeback of non-dirty cache lines potentially
corrupts the data in the otherwise uncached system memory.
But that a) only applies to memory in the linear mapping and b) only to a
handful of x86 CPU types (e.g. recently Intel's Lunar Lake, AMD Athlons produced
before 2004, maybe others).
It usually treats ordinary system RAM (IORESOURCE_SYSTEM_RAM) in a special way:
no special caching mode.
For everything else, it expects that someone first reserves a memory range for
a specific caching mode.
For example, remap_pfn_range()...->pfnmap_track()->memtype_reserve() will make
sure that there are no conflicts, then call memtype_kernel_map_sync() to make sure
the identity mapping is updated to the new type.
In case someone ends up calling pfnmap_setup_cachemode(), the expectation is
that there was a previous call to memtype_reserve_io() or similar, such that
pfnmap_setup_cachemode() will find that caching mode.
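For MMIO the expected flow would then be roughly this (sketch; IIRC it is the
arch_io_reserve_memtype_wc() path that TTM drivers already use for VRAM):

    /* Reserve the caching mode for the whole range up front, so that a
     * later pfnmap_setup_cachemode() lookup finds WC instead of the
     * write-back default. */
    err = arch_io_reserve_memtype_wc(base, size);
    if (err)
            return err;

    /* ... later, in the fault handler, PAT agrees with the prot ... */
    ret = vmf_insert_pfn_prot(vma, vmf->address, pfn,
                              pgprot_writecombine(vma->vm_page_prot));

    /* ... and on teardown ... */
    arch_io_free_memtype_wc(base, size);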
So my assumption would be that that is missing for the drivers here?
Well yes and no.
See, the PAT is optimized for applying specific caching attributes to ranges
[A..B] (e.g. it uses an R/B tree). But what drivers do here is that they have
single pages (usually from get_free_page() or similar) and want to apply a certain
caching attribute to them.
So what would happen is that we would completely clutter the R/B tree used by the PAT
with thousands if not millions of entries.
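For reference, what TTM does on x86 for such pages today is roughly this
(sketch from memory):

    struct page *pages[16];

    /* ... filled from alloc_page() or similar ... */

    /* Rewrites the linear-map PTEs to WC and flushes; IIRC highmem
     * pages are simply skipped here, which matches the "R/B tree never
     * updated for highmem pages" problem above. */
    set_pages_array_wc(pages, 16);

    /* Has to be set back to write-back before freeing to the buddy. */
    set_pages_array_wb(pages, 16);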
Hm, above you're saying that there is no direct map, but now you are
saying that the pages were obtained through get_free_page()?
I agree that what you describe here sounds suboptimal. But if the pages
were obtained from the buddy, there surely is a direct map -- unless we
explicitly remove it :(
If we're talking about individual pages without a direct map, I would
wonder if they are actually part of a bigger memory region that can just
be reserved in one go (similar to how remap_pfn_range() would handle it).
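I.e. something like this (sketch; region_pfn is a hypothetical name for the
start of that bigger region), which reserves, maps and syncs the identity
mapping in one go:

    /* remap_pfn_range() -> pfnmap_track() -> memtype_reserve() covers
     * the whole range with a single reservation and then calls
     * memtype_kernel_map_sync() for the identity mapping. */
    vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
    err = remap_pfn_range(vma, vma->vm_start, region_pfn,
                          vma->vm_end - vma->vm_start,
                          vma->vm_page_prot);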
Can you briefly describe how your use case obtains these PFNs, and how
scattered they + their caching attributes might be?
--
Cheers
David / dhildenb