On 26.08.25 11:17, David Hildenbrand wrote:
> On 26.08.25 11:00, Christian König wrote:
>> On 26.08.25 10:46, David Hildenbrand wrote:
>>>>> So my assumption would be that that is missing for the drivers here?
>>>>
>>>> Well yes and no.
>>>>
>>>> See, PAT is optimized for applying specific caching attributes to
>>>> ranges [A..B] (e.g. it uses an R/B tree). But what drivers do here is that
>>>> they have single pages (usually from get_free_page() or similar) and want to
>>>> apply a certain caching attribute to them.
>>>>
>>>> So what would happen is that we completely clutter the R/B tree used by 
>>>> the PAT with thousands if not millions of entries.
>>>>
>>>
>>> Hm, above you're saying that there is no direct map, but now you are saying 
>>> that the pages were obtained through get_free_page()?
>>
>> The problem only happens with highmem pages on 32bit kernels. Those pages 
>> are not in the linear mapping.
> 
> Right, in the common case there is a direct map.
> 
>>
>>> I agree that what you describe here sounds suboptimal. But if the pages
>>> were obtained from the buddy, there surely is a direct map -- unless we
>>> explicitly remove it :(
>>>
>>> If we're talking about individual pages without a direct map, I would wonder
>>> if they are actually part of a bigger memory region that can just be
>>> reserved in one go (similar to how remap_pfn_range() would handle it).
>>>
>>> Can you briefly describe how your use case obtains these PFNs, and how
>>> scattered they + their caching attributes might be?
>>
>> What drivers do is to call get_free_page() or alloc_pages_node() with the 
>> GFP_HIGHUSER flag set.
>>
>> For non-highmem pages, drivers then call set_pages_wc/uc(), which changes the
>> caching of the linear mapping. For highmem pages there is no linear mapping,
>> so set_pages_wc()/set_pages_uc() don't work and drivers avoid calling them.
>>
>> Those are basically just random system memory pages. So they are potentially 
>> scattered over the whole memory address space.
> 
> Thanks, that's valuable information.
> 
> So essentially these drivers maintain their own consistency and PAT is not 
> aware of that.
> 
> And the real problem is ordinary system RAM.
> 
> There are various ways forward.
> 
> 1) We use another interface that consumes pages instead of PFNs, like a
>    vm_insert_pages_pgprot() we would be adding.
> 
>    Is there any strong requirement for inserting non-refcounted PFNs?

Yes, there is a strong requirement to insert non-refcounted PFNs.

We had a lot of trouble with KVM people trying to grab a reference to those 
pages even if the VMA had the VM_PFNMAP flag set.
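
For reference, the way these drivers set up the VMA so that nothing takes
references on the inserted PFNs looks roughly like this (hypothetical names; a
minimal sketch assuming a recent kernel with vm_flags_set(), older kernels
modify vma->vm_flags directly):

#include <linux/fs.h>
#include <linux/mm.h>

static const struct vm_operations_struct example_vm_ops; /* .fault etc. elided */

static int example_mmap(struct file *file, struct vm_area_struct *vma)
{
        /* VM_PFNMAP tells the core MM not to refcount what we insert later. */
        vm_flags_set(vma, VM_PFNMAP | VM_IO | VM_DONTEXPAND | VM_DONTDUMP);
        vma->vm_ops = &example_vm_ops;
        return 0;
}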

> 2) We add another interface that consumes PFNs, but explicitly states
>    that it is only for ordinary system RAM, and that the user is
>    responsible for updating the direct map.
> 
>    We could sanity-check the direct map in debug kernels.

I would rather see vmf_insert_pfn_prot() fixed instead.

That function was explicitly added to insert the PFN with the given attributes,
and as far as I can see all users of that function expect exactly that.
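
As a minimal sketch of what those users look like (made-up driver names, bounds
checks omitted), a typical VM_PFNMAP fault handler boils down to:

#include <linux/mm.h>

/* Hypothetical per-buffer object; only what the sketch needs. */
struct example_bo {
        struct page **pages;
};

static vm_fault_t example_fault(struct vm_fault *vmf)
{
        struct example_bo *bo = vmf->vma->vm_private_data;
        pgprot_t prot = pgprot_writecombine(vmf->vma->vm_page_prot);
        unsigned long pfn = page_to_pfn(bo->pages[vmf->pgoff]);

        /* Insert the PFN with exactly the caching attributes we asked for. */
        return vmf_insert_pfn_prot(vmf->vma, vmf->address, pfn, prot);
}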

> 
> 3) We teach the PAT code in pfnmap_setup_cachemode_pfn() to treat this
>    system RAM differently.
> 
> 
> There is also the option of a mix of 1 and 2, where we get pages, but map
> them non-refcounted in a VM_PFNMAP.
> 
> In general, having pages makes it easier to assert that they are likely
> ordinary system RAM pages, and that the interface is not getting abused for
> something else.

Well, that's exactly the use case here, and it is not abuse at all as far as I
can see.

What drivers want is to insert a PFN with a certain set of caching attributes,
regardless of whether it's system memory or iomem. That's why
vmf_insert_pfn_prot() was created in the first place.

That drivers need to manually call set_pages_wc/uc() for the linear mapping on
x86 is correct, and checking that in debug kernels is clearly a good idea.
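
For illustration, a minimal sketch of that x86 pattern (hypothetical helpers,
using set_memory_wc()/set_memory_wb() on the linear address of a single page;
real drivers usually batch this for whole arrays of pages):

#include <linux/gfp.h>
#include <linux/highmem.h>
#include <asm/set_memory.h>

static struct page *example_alloc_wc_page(void)
{
        struct page *page = alloc_page(GFP_HIGHUSER);

        if (!page)
                return NULL;

        /* Only lowmem pages have a linear mapping whose caching must match. */
        if (!PageHighMem(page))
                set_memory_wc((unsigned long)page_address(page), 1);

        return page;
}

static void example_free_wc_page(struct page *page)
{
        /* Restore write-back caching before the page goes back to the buddy. */
        if (!PageHighMem(page))
                set_memory_wb((unsigned long)page_address(page), 1);

        __free_page(page);
}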

> We could also perform the set_pages_wc/uc() from inside that function, but 
> maybe it depends on the use case whether we want to do that whenever we map 
> them into a process?

It sounds like a good idea in theory, but I think it is potentially too much
overhead to be practical: changing the caching attributes of the linear mapping
can mean splitting huge mappings and flushing TLBs and caches, and we don't
want to pay that every time the pages are mapped into a process.

Thanks,
Christian.
