On 8/30/21 2:07 PM, Jason Gunthorpe wrote:
> On Fri, Aug 27, 2021 at 07:34:54PM +0100, Joao Martins wrote:
>> On 8/27/21 5:25 PM, Jason Gunthorpe wrote:
>>> On Fri, Aug 27, 2021 at 03:58:13PM +0100, Joao Martins wrote:
>>>
>>>> #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) &&
>>>> defined(CONFIG_TRANSPARENT_HUGEPAGE)
>>>> static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>>>> unsigned long end, unsigned int flags,
>>>> struct page **pages, int *nr)
>>>> {
>>>> - int nr_start = *nr;
>>>> + int refs, nr_start = *nr;
>>>> struct dev_pagemap *pgmap = NULL;
>>>> int ret = 1;
>>>>
>>>> do {
>>>> - struct page *page = pfn_to_page(pfn);
>>>> + struct page *head, *page = pfn_to_page(pfn);
>>>> + unsigned long next = addr + PAGE_SIZE;
>>>>
>>>> pgmap = get_dev_pagemap(pfn, pgmap);
>>>> if (unlikely(!pgmap)) {
>>>> @@ -2252,16 +2265,25 @@ static int __gup_device_huge(unsigned long pfn,
>>>> unsigned long addr,
>>>> ret = 0;
>>>> break;
>>>> }
>>>> - SetPageReferenced(page);
>>>> - pages[*nr] = page;
>>>> - if (unlikely(!try_grab_page(page, flags))) {
>>>> - undo_dev_pagemap(nr, nr_start, flags, pages);
>>>> +
>>>> + head = compound_head(page);
>>>> + /* @end is assumed to be limited at most one compound page */
>>>> + if (PageHead(head))
>>>> + next = end;
>>>> + refs = record_subpages(page, addr, next, pages + *nr);
>>>> +
>>>> + SetPageReferenced(head);
>>>> + if (unlikely(!try_grab_compound_head(head, refs, flags))) {
>>>> + if (PageHead(head))
>>>> + ClearPageReferenced(head);
>>>> + else
>>>> + undo_dev_pagemap(nr, nr_start, flags, pages);
>>>> ret = 0;
>>>> break;
>>>
>>> Why is this special cased for devmap?
>>>
>>> Shouldn't everything processing pud/pmds/etc use the same basic loop
>>> that is similar in idea to the 'for_each_compound_head' scheme in
>>> unpin_user_pages_dirty_lock()?
>>>
>>> Doesn't that work for all the special page type cases here?
>>
>> We are iterating over PFNs to create an array of base pages (regardless of
>> page table
>> type), rather than iterating over an array of pages to work on.
>
> That is part of it, yes, but the slow bit here is to minimally find
> the head pages and do the atomics on them, much like the
> unpin_user_pages_dirty_lock()
>
> I would think this should be designed similar to how things work on
> the unpin side.
>
I don't think it's the same thing. The bit you say 'minimally find the
head pages' carries a visible overhead in unpin_user_pages() as we are
checking each of the pages belongs to the same head page -- because you
can pass an arbritary set of pages. This does have a cost which is not
in gup-fast right now AIUI. Whereas in our gup-fast 'handler' you
already know that you are processing a contiguous chunk of pages.
If anything, we are closer to unpin_user_page_range*()
than unpin_user_pages().
> Sweep the page tables to find a proper start/end - eg even if a
> compound is spread across multiple pte/pmd/pud/etc we should find a
> linear range of starting PFN (ie starting page*) and npages across as
> much of the page tables as we can manage. This is the same as where
> things end up in the unpin case where all the contiguous PFNs are
> grouped togeher into a range.
>
> Then 'assign' that range to the output array which requires walking
> over each compount_head in the range and pinning it, then writing out
> the tail pages to the output struct page array.
>
> And this approach should apply universally no matter what is under the
> pte's - ie huge pages, THPs and devmaps should all be treated the same
> way. Currently each case is different, like above which is unique to
> device_huge.
>
Only devmap gup-fast is different IIUC.
Switching to similar iteration logic to unpin would look something like
this (still untested):
for_each_compound_range(index, &page, npages, head, refs) {
pgmap = get_dev_pagemap(pfn + *nr, pgmap);
if (unlikely(!pgmap)) {
undo_dev_pagemap(nr, nr_start, flags, pages);
ret = 0;
break;
}
SetPageReferenced(head);
if (unlikely(!try_grab_compound_head(head, refs, flags))) {
if (PageHead(head))
ClearPageReferenced(head);
else
undo_dev_pagemap(nr, nr_start, flags, pages);
ret = 0;
break;
}
record_subpages(page + *nr, addr,
addr + (refs << PAGE_SHIFT), pages + *nr);
*(nr) += refs;
addr += (refs << PAGE_SHIFT);
}
But it looks to be a tidbit more complex and not really aligning with the
rest of gup-fast.
All in all, I am dealing with the fact that 1) devmap pmds/puds may not
be represented with compound pages and 2) we temporarily grab dev_pagemap
reference
prior to pinning the page. Those two items is what makes this different than
THPs/HugeTLB
(which do have the same logic). And thus it's what lead me to *slightly* improve
gup_device_huge().