Hi everyone,

sorry for CCing so many people, but this rabbit hole turned out to be deeper than originally thought.

TTM has always had problems with UC/WC mappings on 32-bit systems, and drivers often had to resort to hacks like using GFP_DMA32 to get things working, with no rational explanation of why that helped (see the TTM AGP, radeon and nouveau driver code for that).

It turned out that the PAT implementation we use on x86 enforces the same caching attributes not only for pages in the linear kernel mapping, but also for highmem pages through a separate R/B tree.

That was unexpected, and TTM never updated that R/B tree for highmem pages, so the function pgprot_set_cachemode() just overwrote the caching attributes drivers passed in to vmf_insert_pfn_prot(), which essentially caused all kinds of random trouble.

An R/B tree is probably not a good data structure for holding thousands, if not millions, of different per-page attributes, so updating it is probably not the way to solve this issue.

Thomas pointed out that the i915 driver is using apply_to_page_range() instead of vmf_insert_pfn_prot() to circumvent the PAT implementation and just fill in the page tables with what the driver thinks is the right caching attribute.

This patch set implements this, and it turns out to be much *faster* than the old implementation. Together with another change, mapping 1GiB of memory through TTM on my test system improved by nearly a factor of 10 (197ms -> 20ms)!

Please review the general idea and/or comment on the patches.

Thanks,
Christian.