On 26.08.25 11:17, David Hildenbrand wrote:
> On 26.08.25 11:00, Christian König wrote:
>> On 26.08.25 10:46, David Hildenbrand wrote:
>>>>> So my assumption would be that that is missing for the drivers here?
>>>>
>>>> Well yes and no.
>>>>
>>>> See, the PAT is optimized for applying specific caching attributes to
>>>> ranges [A..B] (e.g. it uses an R/B tree). But what drivers do here is that
>>>> they have single pages (usually from get_free_page or similar) and want to
>>>> apply a certain caching attribute to them.
>>>>
>>>> So what would happen is that we completely clutter the R/B tree used by
>>>> the PAT with thousands if not millions of entries.
>>>>
>>>
>>> Hm, above you're saying that there is no direct map, but now you are saying
>>> that the pages were obtained through get_free_page()?
>>
>> The problem only happens with highmem pages on 32bit kernels. Those pages
>> are not in the linear mapping.
>
> Right, in the common case there is a direct map.
>
>>
>>> I agree that what you describe here sounds suboptimal. But if the pages
>>> were obtained from the buddy, there surely is a direct map -- unless we
>>> explicitly remove it :(
>>>
>>> If we're talking about individual pages without a direct map, I would wonder
>>> if they are actually part of a bigger memory region that can just be
>>> reserved in one go (similar to how remap_pfn_range() would handle it).
>>>
>>> Can you briefly describe how your use case obtains these PFNs, and how
>>> scattered they + their caching attributes might be?
>>
>> What drivers do is to call get_free_page() or alloc_pages_node() with the
>> GFP_HIGHUSER flag set.
>>
>> For non-highmem pages drivers then call set_pages_wc/uc(), which changes the
>> caching of the linear mapping, but for highmem pages there is no linear
>> mapping, so set_pages_wc() or set_pages_uc() doesn't work and drivers avoid
>> calling it.
>>
>> Those are basically just random system memory pages. So they are potentially
>> scattered over the whole memory address space.
>
> Thanks, that's valuable information.
>
> So essentially these drivers maintain their own consistency and PAT is not
> aware of that.
>
> And the real problem is ordinary system RAM.
>
> There are various ways forward.
>
> 1) We use another interface that consumes pages instead of PFNs, like a
>    vm_insert_pages_pgprot() we would be adding.
>
>    Is there any strong requirement for inserting non-refcounted PFNs?
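To make the allocation pattern quoted above concrete, roughly what such a
driver does today looks like this (a minimal sketch for x86, not lifted from
any particular driver; uncached is just the example attribute and the helper
names drv_alloc_uc_page()/drv_free_uc_page() are made up):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <asm/set_memory.h>	/* set_pages_uc()/set_pages_wb() on x86 */

static struct page *drv_alloc_uc_page(void)
{
	struct page *page = alloc_page(GFP_HIGHUSER);

	if (!page)
		return NULL;

	/*
	 * Highmem pages (32-bit kernels) have no linear mapping, so there
	 * is nothing to change there. Lowmem pages must have their linear
	 * mapping switched as well, otherwise the same physical page would
	 * be mapped with conflicting caching attributes.
	 */
	if (!PageHighMem(page))
		set_pages_uc(page, 1);

	return page;
}

static void drv_free_uc_page(struct page *page)
{
	/* Restore write-back caching on the linear mapping before freeing. */
	if (!PageHighMem(page))
		set_pages_wb(page, 1);
	__free_page(page);
}
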
Yes, there is a strong requirement to insert non-refcounted PFNs. We had a lot
of trouble with KVM people trying to grab a reference to those pages even if
the VMA had the VM_PFNMAP flag set.

> 2) We add another interface that consumes PFNs, but explicitly states
>    that it is only for ordinary system RAM, and that the user is
>    required to update the direct map.
>
>    We could sanity-check the direct map in debug kernels.

I would rather like to see vmf_insert_pfn_prot() fixed instead.

That function was explicitly added to insert the PFN with the given attributes,
and as far as I can see all users of that function expect exactly that.

> 3) We teach PAT code in pfnmap_setup_cachemode_pfn() about treating this
>    system RAM differently.
>
> There is also the option for a mixture between 1 and 2, where we get pages,
> but we map them non-refcounted in a VM_PFNMAP.
>
> In general, having pages makes it easier to assert that they are likely
> ordinary system RAM pages, and that the interface is not getting abused for
> something else.

Well, exactly that's the use case here, and it is not abusive at all as far
as I can see.

What drivers want is to insert a PFN with a certain set of caching attributes,
regardless of whether it's system memory or iomem. That's why
vmf_insert_pfn_prot() was created in the first place (a rough sketch of that
fault-handler usage is appended below).

That drivers need to call set_pages_wc/uc() for the linear mapping on x86
manually is correct, and checking that is clearly a good idea for debug
kernels.

> We could also perform the set_pages_wc/uc() from inside that function, but
> maybe it depends on the use case whether we want to do that whenever we map
> them into a process?

It sounds like a good idea in theory, but I think it is potentially too much
overhead to be applicable.

Thanks,
Christian.
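
P.S.: for reference, a minimal sketch of the vmf_insert_pfn_prot() usage
referred to above: a VM_PFNMAP fault handler that maps one of the
pre-allocated pages with the caching attribute the driver chose. Simplified
and not from any specific driver; struct drv_buffer, drv_vma_fault() and
drv_mmap() are made-up names, and whether the passed pgprot is honored for
plain system RAM is exactly what is being discussed here.

#include <linux/fs.h>
#include <linux/mm.h>

struct drv_buffer {
	struct page *page;	/* allocated as in the sketch further up */
};

static vm_fault_t drv_vma_fault(struct vm_fault *vmf)
{
	struct drv_buffer *buf = vmf->vma->vm_private_data;

	/* Single-page buffer for simplicity; only offset 0 is valid. */
	if (vmf->pgoff)
		return VM_FAULT_SIGBUS;

	/*
	 * The driver already keeps the linear mapping consistent (see the
	 * set_pages_uc()/set_pages_wb() calls earlier); what it expects
	 * here is that the PTE gets exactly the passed caching attribute,
	 * regardless of whether the PFN is system memory or iomem.
	 */
	return vmf_insert_pfn_prot(vmf->vma, vmf->address,
				   page_to_pfn(buf->page),
				   pgprot_noncached(vmf->vma->vm_page_prot));
}

static const struct vm_operations_struct drv_vm_ops = {
	.fault = drv_vma_fault,
};

static int drv_mmap(struct file *file, struct vm_area_struct *vma)
{
	vm_flags_set(vma, VM_PFNMAP | VM_IO | VM_DONTEXPAND | VM_DONTDUMP);
	vma->vm_ops = &drv_vm_ops;
	vma->vm_private_data = file->private_data;	/* struct drv_buffer */
	return 0;
}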