from:"Peter Xu"

Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-17 Thread Peter Xu

 this page and
> + * replace them with special swap entries to grant a device exclusive access 
> to
> + * the page. Caller must hold the page lock.
> + *
> + * Returns false if the page is still mapped, or if it could not be unmapped
> + * from the expected address. Otherwise returns true (success).
> + */
> +static bool try_to_protect(struct page *page, struct mm_struct *mm,
> +unsigned long address, void *arg)
> +{
> + struct ttp_args ttp = {
> + .mm = mm,
> + .address = address,
> + .arg = arg,
> + .valid = false,
> + };
> + struct rmap_walk_control rwc = {
> + .rmap_one = try_to_protect_one,
> + .done = page_not_mapped,
> + .anon_lock = page_lock_anon_vma_read,
> + .arg = &ttp,
> + };
> +
> + /*
> +  * Restrict to anonymous pages for now to avoid potential writeback
> +  * issues.
> +  */
> + if (!PageAnon(page))
> + return false;
> +
> + /*
> +  * During exec, a temporary VMA is setup and later moved.
> +  * The VMA is moved under the anon_vma lock but not the
> +  * page tables leading to a race where migration cannot
> +  * find the migration ptes. Rather than increasing the
> +  * locking requirements of exec(), migration skips
> +  * temporary VMAs until after exec() completes.
> +  */
> + if (!PageKsm(page) && PageAnon(page))

I think we can drop the PageAnon() check as it's just done above.

I feel like this chunk was copied over from try_to_unmap(), however is that
necessary?  Is it possible that the caller of make_device_exclusive_range()
pass in a temp stack vma during exec()?

> + rwc.invalid_vma = invalid_migration_vma;
> +
> + rmap_walk(page, &rwc);
> +
> + return ttp.valid && !page_mapcount(page);
> +}
> +
> +/**
> + * make_device_exclusive_range() - Mark a range for exclusive use by a device
> + * @mm: mm_struct of assoicated target process
> + * @start: start of the region to mark for exclusive device access
> + * @end: end address of region
> + * @pages: returns the pages which were successfully marked for exclusive 
> access
> + * @arg: passed to MMU_NOTIFY_EXCLUSIVE range notifier too allow filtering

s/too/to/?

> + *
> + * Returns: number of pages successfully marked for exclusive access

Hmm, I see that try_to_protect() can fail even if npages returned from GUP, so
perhaps "number of pages successfully GUPed, however the page is marked for
exclusive access only if the page pointer is non-NULL", or something like that?

> + *
> + * This function finds ptes mapping page(s) to the given address range, locks
> + * them and replaces mappings with special swap entries preventing userspace 
> CPU

s/userspace//?  As same for kernel access?

(I don't think I fully read all the codes in this patch, but I'll stop here for
 today..)

Thanks,

> + * access. On fault these entries are replaced with the original mapping 
> after
> + * calling MMU notifiers.
> + *
> + * A driver using this to program access from a device must use a mmu 
> notifier
> + * critical section to hold a device specific lock during programming. Once
> + * programming is complete it should drop the page lock and reference after
> + * which point CPU access to the page will revoke the exclusive access.
> + */
> +int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
> + unsigned long end, struct page **pages,
> + void *arg)
> +{
> + unsigned long npages = (end - start) >> PAGE_SHIFT;
> + unsigned long i;
> +
> + npages = get_user_pages_remote(mm, start, npages,
> +FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
> +pages, NULL, NULL);
> + for (i = 0; i < npages; i++, start += PAGE_SIZE) {
> + if (!trylock_page(pages[i])) {
> + put_page(pages[i]);
> + pages[i] = NULL;
> + continue;
> + }
> +
> + if (!try_to_protect(pages[i], mm, start, arg)) {
> + unlock_page(pages[i]);
> + put_page(pages[i]);
> + pages[i] = NULL;
> + }
> + }
> +
> + return npages;
> +}
> +EXPORT_SYMBOL_GPL(make_device_exclusive_range);
> +
>  void __put_anon_vma(struct anon_vma *anon_vma)
>  {
>   struct anon_vma *root = anon_vma->root;
> -- 
> 2.20.1
> 
> 

-- 
Peter Xu

Re: [PATCH v8 1/8] mm: Remove special swap entry functions

2021-05-17 Thread Peter Xu

On Wed, Apr 07, 2021 at 06:42:31PM +1000, Alistair Popple wrote:
> +static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
> +{
> + struct page *p = pfn_to_page(swp_offset(entry));
> +
> + /*
> +  * Any use of migration entries may only occur while the
> +  * corresponding page is locked
> +  */
> + BUG_ON(is_migration_entry(entry) && !PageLocked(p));
> +
> + return p;
> +}

Would swap_pfn_entry_to_page() be slightly better?

The thing is it's very easy to read pfn_*() as a function to take a pfn as
parameter...

Since I'm also recently working on some swap-related new ptes [1], I'm thinking
whether we could name these swap entries as "swap XXX entries".  Say, "swap
hwpoison entry", "swap pfn entry" (which is a superset of "swap migration
entry", "swap device exclusive entry", ...).  That's where I came with the
above swap_pfn_entry_to_page(), then below will be naturally 
is_swap_pfn_entry().

No strong opinion as this is already a v8 series (and sorry to chim in this
late), just to raise this up.

[1] https://lore.kernel.org/lkml/20210427161317.50682-1-pet...@redhat.com/

Thanks,

> +
> +/*
> + * A pfn swap entry is a special type of swap entry that always has a pfn 
> stored
> + * in the swap offset. They are used to represent unaddressable device memory
> + * and to restrict access to a page undergoing migration.
> + */
> +static inline bool is_pfn_swap_entry(swp_entry_t entry)
> +{
> + return is_migration_entry(entry) || is_device_private_entry(entry);
> +}

-- 
Peter Xu

Re: [PATCH v8 1/8] mm: Remove special swap entry functions

2021-05-18 Thread Peter Xu

On Tue, May 18, 2021 at 09:58:09PM +1000, Alistair Popple wrote:
> On Tuesday, 18 May 2021 12:17:32 PM AEST Peter Xu wrote:
> > On Wed, Apr 07, 2021 at 06:42:31PM +1000, Alistair Popple wrote:
> > > +static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
> > > +{
> > > + struct page *p = pfn_to_page(swp_offset(entry));
> > > +
> > > + /*
> > > +  * Any use of migration entries may only occur while the
> > > +  * corresponding page is locked
> > > +  */
> > > + BUG_ON(is_migration_entry(entry) && !PageLocked(p));
> > > +
> > > + return p;
> > > +}
> > 
> > Would swap_pfn_entry_to_page() be slightly better?
> > 
> > The thing is it's very easy to read pfn_*() as a function to take a pfn as
> > parameter...
> > 
> > Since I'm also recently working on some swap-related new ptes [1], I'm
> > thinking whether we could name these swap entries as "swap XXX entries". 
> > Say, "swap hwpoison entry", "swap pfn entry" (which is a superset of "swap
> > migration entry", "swap device exclusive entry", ...).  That's where I came
> > with the above swap_pfn_entry_to_page(), then below will be naturally
> > is_swap_pfn_entry().
> 
> Equally though "hwpoison swap entry", "pfn swap entry", "migration swap 
> entry", etc. also makes sense (at least to me), but does that not fit in as 
> well with your series? I haven't looked too deeply at your series but have 
> been meaning to so thanks for the pointer.

Yeah it looks good too.  It's just to avoid starting with "pfn_" I guess, so at
least we know that's something related to swap not one specific pfn.

I found the naming is important as e.g. in my series I introduced another idea
called "swap special pte" or "special swap pte" (yeah so indeed my naming is
not that coherent as well... :), then I noticed if without a good naming I'm
afraid we can get lost easier.

I have a description here in the commit message re the new swap special pte:

https://lore.kernel.org/lkml/20210427161317.50682-5-pet...@redhat.com/

In which the uffd-wp marker pte will be the first swap special pte.  Feel free
to ignore the details too, just want to mention some context, while it should
be orthogonal to this series.

I think yet-another swap entry suites for my case too but it'll be a waste as
in that pte I don't need to keep pfn information, but only a marker (one single
bit would suffice), so I chose one single pte value (one for each arch, I only
defined that on x86 in my series which is a special pte with only one bit set)
to not pollute the swap entry address spaces.

> 
> > No strong opinion as this is already a v8 series (and sorry to chim in this
> > late), just to raise this up.
> 
> No worries, it's good timing as I was about to send a v9 which was just a 
> rebase anyway. I am hoping to try and get this accepted for the next merge 
> window but I will wait before sending v9 to see if anyone else has thoughts 
> on 
> the naming here.
> 
> I don't have a particularly strong opinion either, and your justification is 
> more thought than I gave to naming these originally so am happy to rename if 
> it's more readable or fits better with your series.

I'll leave the decision to you, especially in case you still prefer the old
naming.  Or feel free to wait too until someone else shares the thoughts so as
to avoid unnecessary rebase work.

Thanks,

-- 
Peter Xu

Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Peter Xu

; > > vma->vm_mm,
> > > + vmf->address & PAGE_MASK,
> > > + (vmf->address & PAGE_MASK) + PAGE_SIZE);
> > > + mmu_notifier_invalidate_range_start(&range);
> > 
> > I looked at MMU_NOTIFIER_CLEAR document and it tells me:
> > 
> >  * @MMU_NOTIFY_CLEAR: clear page table entry (many reasons for this like
> >  * madvise() or replacing a page by another one, ...).
> > 
> > Does MMU_NOTIFIER_CLEAR suite for this case?  Normally I think for such a
> > case (existing pte is invalid) we don't need to notify at all.  However
> > from what I read from the whole series, this seems to be a critical point
> > where we'd like to kick the owner/driver to synchronously stop doing atomic
> > operations from the device.  Not sure whether we'd like a new notifier
> > type, or maybe at least comment on why to use CLEAR?
> 
> Right, notifying the owner/driver when it no longer has exclusive access to 
> the page and allowing it to stop atomic operations is the critical point and 
> why it notifies when we ordinarily wouldn't (ie. invalid -> valid).
> 
> I did consider adding a new type, but in the driver implementation it ends up 
> being treated the same as a CLEAR notification anyway so didn't think it was 
> necessary. But I suppose adding a different type would allow other listening 
> notifiers to filter these which might be worthwhile.

Sounds good to me.

[...]

> > > + /*
> > > +  * Check that our target page is still mapped at the
> > > expected
> > > +  * address.
> > > +  */
> > > + if (ttp->mm == mm && ttp->address == address &&
> > > + pte_write(pteval))
> > > + ttp->valid = true;
> > 
> > I think I get the point of doing this (as after GUP the pte could have been
> > changed to point to another page), however it smells a bit odd to me (or
> > it's also possible that I'm not familiar enough with the code base..). 
> > IIUC this is the _only_ reason that we passed in "address" into
> > try_to_protect() too, and further into the ttp_args.
> 
> Yes, this is why "address" is passed up to ttp_args.
> 
> > The odd part is the remote GUP should have walked the page table already, so
> > since the target here is the vaddr to replace, the 1st page table walk
> > should be able to both trylock/lock the page, then modify the pte with
> > pgtable lock held, return the locked page, then walk the rmap again to
> > remove all the rest of the ptes that are mapping to this page.  In that
> > case before we call the rmap_walk() we know this must be the page we want
> > to take care of, and no one will be able to restore the original mm pte
> > either (as we're with the page lock).  Then we don't need this check,
> > neither do we need ttp->address.
> 
> If I am understanding you correctly I think this would be similar to the 
> approach that was taken in v2. However it pretty much ended up being just an 
> open-coded version of gup which is useful anyway to fault the page in.

I see.  For easier reference this is v2 patch 1:

https://lore.kernel.org/lkml/20210219020750.16444-2-apop...@nvidia.com/

Indeed that looks like it, it's just that instead of grabbing the page only in
hmm_exclusive_pmd() we can do the pte modification along the way to seal the
whole thing (address/pte & page).  I saw Christoph and Jason commented in that
patch, but not regarding to this approach.  So is there a reason that you
switched?  Do you think it'll work?

I have no strong opinion either, it's just not crystal clear why we'd need that
ttp->address at all for a rmap walk along with that "valid" field. Meanwhile it
should be slightly less efficient too to go with current approach, especially
when the page array gets huge, I think: since there'll be longer time we do GUP
before doing the rmap walk, so higher possibility that the GUPed pages got
replaced for whatever reason.  Then the call to make_device_exclusive_range()
will fail as a whole just for a single page replacement within the range.

Thanks,

-- 
Peter Xu

Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Peter Xu

On Tue, May 18, 2021 at 02:33:34PM -0300, Jason Gunthorpe wrote:
> On Tue, May 18, 2021 at 01:27:42PM -0400, Peter Xu wrote:
> 
> > I also have a pure and high level question regarding a process fork() when
> > there're device exclusive ptes: would the two processes then own the device
> > together?  Is this a real usecase?
> 
> If the pages are MAP_SHARED then yes. All VMAs should point at the
> same device_exclusive page and all VMA should migrate back to CPU
> pages together.

Makes sense.  If we keep the anonymous-only in this series (I think it's good
to separate these), maybe we can drop the !COW case, plus some proper installed
WARN_ON_ONCE()s.

> 
> > Indeed it'll be odd for a COW page since for COW page then it means after
> > parent/child writting to the page it'll clone into two, then it's a mistery 
> > on
> > which one will be the one that "exclusived owned" by the device..
> 
> For COW pages it is like every other fork case.. We can't reliably
> write-protect the device_exclusive page during fork so we must copy it
> at fork time.
> 
> Thus three reasonable choices:
>  - Copy to a new CPU page
>  - Migrate back to a CPU page and write protect it
>  - Copy to a new device exclusive page

IMHO the ownership question would really help us to answer this one..

If the device ownership should be kept in parent IMHO option (1) is the best
approach. To be explicit on page copy: we can do best-effort, even if the copy
is during a device atomic operation, perhaps?

If the ownership will be shared, seems option (3) will be easier as I don't see
a strong reason to do immediate restorinng of ptes; as long as we're careful on
the refcounting.

Thanks,

-- 
Peter Xu

Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Peter Xu

On Tue, May 18, 2021 at 04:45:09PM -0300, Jason Gunthorpe wrote:
> On Tue, May 18, 2021 at 02:01:36PM -0400, Peter Xu wrote:
> > > > Indeed it'll be odd for a COW page since for COW page then it means 
> > > > after
> > > > parent/child writting to the page it'll clone into two, then it's a 
> > > > mistery on
> > > > which one will be the one that "exclusived owned" by the device..
> > > 
> > > For COW pages it is like every other fork case.. We can't reliably
> > > write-protect the device_exclusive page during fork so we must copy it
> > > at fork time.
> > > 
> > > Thus three reasonable choices:
> > >  - Copy to a new CPU page
> > >  - Migrate back to a CPU page and write protect it
> > >  - Copy to a new device exclusive page
> > 
> > IMHO the ownership question would really help us to answer this one..
> 
> I'm confused about what device ownership you are talking about

My question was more about the user scenario rather than anything related to
the kernel code, nor does it related to page struct at all.

Let me try to be a little bit more verbose...

Firstly, I think one simple solution to handle fork() of device exclusive ptes
is to do just like device private ptes: if COW we convert writable ptes into
readable ptes.  Then when CPU access happens (in either parent/child) page
restore triggers which will convert those readable ptes into read-only present
ptes (with the original page backing it).  Then do_wp_page() will take care of
page copy.

However... if you see that also means parent/child have the equal opportunity
to reuse that original page: who access first will do COW because refcount>1
for that page (note! it's possible that mapcount==1 here, as we drop mapcount
when converting to device exclusive ptes; however with the most recent
do_wp_page change from Linus where we'll also check page_count(), we'll still
do COW just like when this page was GUPed by someone else).  While that matters
because the device is writting to that original page only, not the COWed one.

Then here comes the ownership question: If we still want to have the parent
process behave like before it fork()ed, IMHO we must make sure that original
page (that exclusively owned by the device once) still belongs to the parent
process not the child.  That's why I think if that's the case we'd do early cow
in fork(), because it guarantees that.

I can't say I fully understand the whole picture, so sorry if I missed
something important there.

Thanks,

-- 
Peter Xu

Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Peter Xu

On Wed, Apr 07, 2021 at 06:42:35PM +1000, Alistair Popple wrote:

[...]

> +static bool try_to_protect(struct page *page, struct mm_struct *mm,
> +unsigned long address, void *arg)
> +{
> + struct ttp_args ttp = {
> + .mm = mm,
> + .address = address,
> + .arg = arg,
> + .valid = false,
> + };
> + struct rmap_walk_control rwc = {
> + .rmap_one = try_to_protect_one,
> + .done = page_not_mapped,
> + .anon_lock = page_lock_anon_vma_read,
> + .arg = &ttp,
> + };
> +
> + /*
> +  * Restrict to anonymous pages for now to avoid potential writeback
> +  * issues.
> +  */
> + if (!PageAnon(page))
> + return false;
> +
> + /*
> +  * During exec, a temporary VMA is setup and later moved.
> +  * The VMA is moved under the anon_vma lock but not the
> +  * page tables leading to a race where migration cannot
> +  * find the migration ptes. Rather than increasing the
> +  * locking requirements of exec(), migration skips
> +  * temporary VMAs until after exec() completes.
> +  */
> + if (!PageKsm(page) && PageAnon(page))
> + rwc.invalid_vma = invalid_migration_vma;
> +
> + rmap_walk(page, &rwc);
> +
> + return ttp.valid && !page_mapcount(page);
> +}

I raised a question in the other thread regarding fork():

https://lore.kernel.org/lkml/YKQjmtMo+YQGx%2FwZ@t490s/

While I suddenly noticed that we may have similar issues even if we fork()
before creating the ptes.

In that case, we may see multiple read-only ptes pointing to the same page.  We
will convert all of them into device exclusive read ptes in rmap_walk() above,
however how do we guarantee after all COW done in the parent and all the childs
processes, the device owned page will be returned to the parent?

E.g., if parent accesses the page earlier than the children processes
(actually, as long as not the last one), do_wp_page() will do COW for parent on
this page because refcount(page)>1, then the page seems to get lost to a random
child too..

To resolve all these complexity, not sure whether try_to_protect() could
enforce VM_DONTCOPY (needs madvise MADV_DONTFORK in the user app), meanwhile
make sure mapcount(page)==1 before granting the page to the device, so that
this will guarantee this mm owns this page forever, I think?  It'll simplify
fork() too as a side effect, since VM_DONTCOPY vma go away when fork.

-- 
Peter Xu

Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Peter Xu

On Tue, May 18, 2021 at 08:03:27PM -0300, Jason Gunthorpe wrote:
> Logically during fork all these device exclusive pages should be
> reverted back to their CPU pages, write protected and the CPU page PTE
> copied to the fork.
> 
> We should not copy the device exclusive page PTE to the fork. I think
> I pointed to this on an earlier rev..

Agreed.  Though please see the question I posted in the other thread: now I am
not very sure whether we'll be able to mark a page as device exclusive if that
page has mapcount>1.

> 
> We can optimize this into the various variants above, but logically
> device exclusive stop existing during fork.

Makes sense, I think that's indeed what this patch did at least for the COW
case, so I think Alistair did address that comment.  It's just that I think we
need to drop the other !COW case (imho that should correspond to the changes in
copy_nonpresent_pte()) in this patch to guarantee it.

I also hope we don't make copy_pte_range() even more complicated just to do the
lock_page() right, so we could fail the fork() if the lock is hard to take.

-- 
Peter Xu

Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Peter Xu

On Wed, May 19, 2021 at 09:04:53PM +1000, Alistair Popple wrote:
> Failing fork() because we couldn't take a lock doesn't seem like the right 
> approach though, especially as there is already existing code that retries. I 
> get this adds complexity though, so would be happy to take a look at cleaning 
> copy_pte_range() up in future.

Yes, I proposed that as this one won't affect any existing applications (unlike
the existing ones) but only new userspace driver apps that will use this new
atomic feature.

IMHO it'll be a pity to add extra complexity and maintainance burden into
fork() if only for keeping the "logical correctness of fork()" however the code
never triggers. If we start with trylock we'll know whether people will use it,
since people will complain with a reason when needed; however I still doubt
whether a sane userspace device driver should fork() within busy interaction
with the device underneath..

In all cases, please still consider to keep them in copy_nonpresent_pte() (and
if to rework, separating patches would be great).

Thanks,

-- 
Peter Xu

Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Peter Xu

On Wed, May 19, 2021 at 09:35:10PM +1000, Alistair Popple wrote:
> I think the approach you are describing is similar to what 
> migrate_vma_collect()/migrate_vma_unamp() does now and I think it could be 
> made to work. I ended up going with the GUP+unmap approach in part because 
> Christoph suggested it but primarily because it required less code especially 
> given we needed to call something to fault the page in/break COW anyway (or 
> extend what was there to call handle_mm_fault(), etc.).

I see, thank. Would you mind add a short paragraph in the commit message
talking about these two solutions and why we choose the GUP way?

-- 
Peter Xu

Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Peter Xu

On Wed, May 19, 2021 at 08:49:01PM +1000, Alistair Popple wrote:
> On Wednesday, 19 May 2021 7:16:38 AM AEST Peter Xu wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Wed, Apr 07, 2021 at 06:42:35PM +1000, Alistair Popple wrote:
> > 
> > [...]
> > 
> > > +static bool try_to_protect(struct page *page, struct mm_struct *mm,
> > > +unsigned long address, void *arg)
> > > +{
> > > + struct ttp_args ttp = {
> > > + .mm = mm,
> > > + .address = address,
> > > + .arg = arg,
> > > + .valid = false,
> > > + };
> > > + struct rmap_walk_control rwc = {
> > > + .rmap_one = try_to_protect_one,
> > > + .done = page_not_mapped,
> > > + .anon_lock = page_lock_anon_vma_read,
> > > + .arg = &ttp,
> > > + };
> > > +
> > > + /*
> > > +  * Restrict to anonymous pages for now to avoid potential writeback
> > > +  * issues.
> > > +  */
> > > + if (!PageAnon(page))
> > > + return false;
> > > +
> > > + /*
> > > +  * During exec, a temporary VMA is setup and later moved.
> > > +  * The VMA is moved under the anon_vma lock but not the
> > > +  * page tables leading to a race where migration cannot
> > > +  * find the migration ptes. Rather than increasing the
> > > +  * locking requirements of exec(), migration skips
> > > +  * temporary VMAs until after exec() completes.
> > > +  */
> > > + if (!PageKsm(page) && PageAnon(page))
> > > + rwc.invalid_vma = invalid_migration_vma;
> > > +
> > > + rmap_walk(page, &rwc);
> > > +
> > > + return ttp.valid && !page_mapcount(page);
> > > +}
> > 
> > I raised a question in the other thread regarding fork():
> > 
> > https://lore.kernel.org/lkml/YKQjmtMo+YQGx%2FwZ@t490s/
> > 
> > While I suddenly noticed that we may have similar issues even if we fork()
> > before creating the ptes.
> > 
> > In that case, we may see multiple read-only ptes pointing to the same page. 
> > We will convert all of them into device exclusive read ptes in rmap_walk()
> > above, however how do we guarantee after all COW done in the parent and all
> > the childs processes, the device owned page will be returned to the parent?
> 
> I assume you are talking about a fork() followed by a call to 
> make_device_exclusive()? I think this should be ok because 
> make_device_exclusive() always calls GUP with FOLL_WRITE both to break COW 
> and 
> because a device performing atomic operations needs to write to the page. I 
> suppose a comment here highlighting the need to break COW to avoid this 
> scenario would be useful though.

Indeed, sorry for the false alarm!  Yes it would be great to mention that too.

-- 
Peter Xu

Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Peter Xu

On Wed, May 19, 2021 at 11:11:55PM +1000, Alistair Popple wrote:
> On Wednesday, 19 May 2021 10:15:41 PM AEST Peter Xu wrote:
> > External email: Use caution opening links or attachments
> > 
> > On Wed, May 19, 2021 at 09:04:53PM +1000, Alistair Popple wrote:
> > > Failing fork() because we couldn't take a lock doesn't seem like the right
> > > approach though, especially as there is already existing code that
> > > retries. I get this adds complexity though, so would be happy to take a
> > > look at cleaning copy_pte_range() up in future.
> > 
> > Yes, I proposed that as this one won't affect any existing applications
> > (unlike the existing ones) but only new userspace driver apps that will use
> > this new atomic feature.
> > 
> > IMHO it'll be a pity to add extra complexity and maintainance burden into
> > fork() if only for keeping the "logical correctness of fork()" however the
> > code never triggers. If we start with trylock we'll know whether people
> > will use it, since people will complain with a reason when needed; however
> > I still doubt whether a sane userspace device driver should fork() within
> > busy interaction with the device underneath..
> 
> I will refrain from commenting on the sanity or otherwise of doing that :-)
> 
> Agree such a scenario seems unlikely in practice (and possibly unreasonable). 
> Keeping the "logical correctness of fork()" still seems worthwhile to me, but 
> if the added complexity/maintenance burden for an admittedly fairly specific 
> feature is going to stop progress here I am happy to take the fail fork 
> approach. I could then possibly fix it up as a future clean up to 
> copy_pte_range(). Perhaps others have thoughts?

Yes, it's more about making this series easier to be accepted, and it'll be
great to have others' input.

Btw, just to mention that I don't even think fail fork() on failed trylock() is
against "logical correctness of fork()": IMHO it's still 100% correct just like
most syscalls can return with -EAGAIN, that suggests the userspace to try again
the syscall, and I hope that also works for fork().  I'd be more than glad to
be corrected too.

-- 
Peter Xu

Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Peter Xu

On Wed, May 19, 2021 at 10:28:42AM -0300, Jason Gunthorpe wrote:
> On Tue, May 18, 2021 at 07:45:05PM -0400, Peter Xu wrote:
> > On Tue, May 18, 2021 at 08:03:27PM -0300, Jason Gunthorpe wrote:
> > > Logically during fork all these device exclusive pages should be
> > > reverted back to their CPU pages, write protected and the CPU page PTE
> > > copied to the fork.
> > > 
> > > We should not copy the device exclusive page PTE to the fork. I think
> > > I pointed to this on an earlier rev..
> > 
> > Agreed.  Though please see the question I posted in the other thread: now I 
> > am
> > not very sure whether we'll be able to mark a page as device exclusive if 
> > that
> > page has mapcount>1.
> 
> IMHO it is similar to write protect done by filesystems on shared
> mappings - all VMAs with a copy of the CPU page have to get switched
> to the device exclusive PTE. This is why the rmap stuff is involved in
> the migration helpers

Right, I think Alistair corrected me there that I missed the early COW
happening in GUP.

Actually even without that GUP triggering early COW it won't be a problem,
because as long as one child mm restored the pte from exclusive to normal
(before any further COW happens) device exclusiveness is broken in the mmu
notifiers, and after that point all previous-exclusive ptes actually becomes
the same as a very normal PageAnon.  Then it's very sane to even not have the
original page in parent process, because we know each COWed page will contain
all the device atomic modifications (so we don't really have the requirement to
return the original page to parent).

Sorry for the noise.

-- 
Peter Xu

Re: [PATCH v9 07/10] mm: Device exclusive memory access

2021-05-26 Thread Peter Xu

On Mon, May 24, 2021 at 11:27:22PM +1000, Alistair Popple wrote:
> Some devices require exclusive write access to shared virtual
> memory (SVM) ranges to perform atomic operations on that memory. This
> requires CPU page tables to be updated to deny access whilst atomic
> operations are occurring.
> 
> In order to do this introduce a new swap entry
> type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for
> exclusive access by a device all page table mappings for the particular
> range are replaced with device exclusive swap entries. This causes any
> CPU access to the page to result in a fault.
> 
> Faults are resovled by replacing the faulting entry with the original
> mapping. This results in MMU notifiers being called which a driver uses
> to update access permissions such as revoking atomic access. After
> notifiers have been called the device will no longer have exclusive
> access to the region.
> 
> Walking of the page tables to find the target pages is handled by
> get_user_pages() rather than a direct page table walk. A direct page
> table walk similar to what migrate_vma_collect()/unmap() does could also
> have been utilised. However this resulted in more code similar in
> functionality to what get_user_pages() provides as page faulting is
> required to make the PTEs present and to break COW.
> 
> Signed-off-by: Alistair Popple 
> Reviewed-by: Christoph Hellwig 
> 
> ---
> 
> v9:
> * Split rename of migrate_pgmap_owner into a separate patch.
> * Added comments explaining SWP_DEVICE_EXCLUSIVE_* entries.
> * Renamed try_to_protect{_one} to page_make_device_exclusive{_one} based
>   somewhat on a suggestion from Peter Xu. I was never particularly happy
>   with try_to_protect() as a name so think this is better.
> * Removed unneccesary code and reworded some comments based on feedback
>   from Peter Xu.
> * Removed the VMA walk when restoring PTEs for device-exclusive entries.
> * Simplified implementation of copy_pte_range() to fail if the page
>   cannot be locked. This might lead to occasional fork() failures but at
>   this stage we don't think that will be an issue.
> 
> v8:
> * Remove device exclusive entries on fork rather than copy them.
> 
> v7:
> * Added Christoph's Reviewed-by.
> * Minor cosmetic cleanups suggested by Christoph.
> * Replace mmu_notifier_range_init_migrate/exclusive with
>   mmu_notifier_range_init_owner as suggested by Christoph.
> * Replaced lock_page() with lock_page_retry() when handling faults.
> * Restrict to anonymous pages for now.
> 
> v6:
> * Fixed a bisectablity issue due to incorrectly applying the rename of
>   migrate_pgmap_owner to the wrong patches for Nouveau and hmm_test.
> 
> v5:
> * Renamed range->migrate_pgmap_owner to range->owner.
> * Added MMU_NOTIFY_EXCLUSIVE to allow passing of a driver cookie which
>   allows notifiers called as a result of make_device_exclusive_range() to
>   be ignored.
> * Added a check to try_to_protect_one() to detect if the pages originally
>   returned from get_user_pages() have been unmapped or not.
> * Removed check_device_exclusive_range() as it is no longer required with
>   the other changes.
> * Documentation update.
> 
> v4:
> * Add function to check that mappings are still valid and exclusive.
> * s/long/unsigned long/ in make_device_exclusive_entry().
> ---
>  Documentation/vm/hmm.rst |  17 
>  include/linux/mmu_notifier.h |   6 ++
>  include/linux/rmap.h |   4 +
>  include/linux/swap.h |   7 +-
>  include/linux/swapops.h  |  44 -
>  mm/hmm.c |   5 +
>  mm/memory.c  | 128 +++-
>  mm/mprotect.c|   8 ++
>  mm/page_vma_mapped.c |   9 +-
>  mm/rmap.c| 186 +++
>  10 files changed, 405 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
> index 3df79307a797..a14c2938e7af 100644
> --- a/Documentation/vm/hmm.rst
> +++ b/Documentation/vm/hmm.rst
> @@ -405,6 +405,23 @@ between device driver specific code and shared common 
> code:
>  
> The lock can now be released.
>  
> +Exclusive access memory
> +===
> +
> +Some devices have features such as atomic PTE bits that can be used to 
> implement
> +atomic access to system memory. To support atomic operations to a shared 
> virtual
> +memory page such a device needs access to that page which is exclusive of any
> +userspace access from the CPU. The ``make_device_exclusive_range()`` function
> +can be used to make a memory range inaccessible from userspace.
> +
> +This replaces all mappings for page

Re: [PATCH v9 05/10] mm: Rename migrate_pgmap_owner

2021-05-26 Thread Peter Xu

On Mon, May 24, 2021 at 11:27:20PM +1000, Alistair Popple wrote:
> @@ -521,14 +521,14 @@ static inline void mmu_notifier_range_init(struct 
> mmu_notifier_range *range,
>   range->flags = flags;
>  }
>  
> -static inline void mmu_notifier_range_init_migrate(
> - struct mmu_notifier_range *range, unsigned int flags,
> +static inline void mmu_notifier_range_init_owner(
> + struct mmu_notifier_range *range,
> + enum mmu_notifier_event event, unsigned int flags,
>   struct vm_area_struct *vma, struct mm_struct *mm,
> - unsigned long start, unsigned long end, void *pgmap)
> + unsigned long start, unsigned long end, void *owner)
>  {
> - mmu_notifier_range_init(range, MMU_NOTIFY_MIGRATE, flags, vma, mm,
> - start, end);
> - range->migrate_pgmap_owner = pgmap;
> + mmu_notifier_range_init(range, event, flags, vma, mm, start, end);
> + range->owner = owner;
>  }

mmu_notifier_range_init_migrate() can even be kept to just call the new helper,
then existing callers are unaffected.  Not a big deal, though:

Reviewed-by: Peter Xu 

Thanks,

-- 
Peter Xu

Re: [PATCH v9 06/10] mm/memory.c: Allow different return codes for copy_nonpresent_pte()

2021-05-26 Thread Peter Xu

On Mon, May 24, 2021 at 11:27:21PM +1000, Alistair Popple wrote:
> Currently if copy_nonpresent_pte() returns a non-zero value it is
> assumed to be a swap entry which requires further processing outside the
> loop in copy_pte_range() after dropping locks. This prevents other
> values being returned to signal conditions such as failure which a
> subsequent change requires.
> 
> Instead make copy_nonpresent_pte() return an error code if further
> processing is required and read the value for the swap entry in the main
> loop under the ptl.
> 
> Signed-off-by: Alistair Popple 
> 
> ---
> 
> v9:
> 
> New for v9 to allow device exclusive handling to occur in
> copy_nonpresent_pte().
> ---
>  mm/memory.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 2fb455c365c2..e061cfa18c11 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -718,7 +718,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct 
> mm_struct *src_mm,
>  
>   if (likely(!non_swap_entry(entry))) {
>   if (swap_duplicate(entry) < 0)
> - return entry.val;
> + return -EAGAIN;
>  
>   /* make sure dst_mm is on swapoff's mmlist. */
>   if (unlikely(list_empty(&dst_mm->mmlist))) {
> @@ -974,11 +974,13 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct 
> vm_area_struct *src_vma,
>   continue;
>   }
>   if (unlikely(!pte_present(*src_pte))) {
> - entry.val = copy_nonpresent_pte(dst_mm, src_mm,
> - dst_pte, src_pte,
> - src_vma, addr, rss);
> - if (entry.val)
> + ret = copy_nonpresent_pte(dst_mm, src_mm,
> + dst_pte, src_pte,
> + src_vma, addr, rss);
> + if (ret == -EAGAIN) {
> + entry = pte_to_swp_entry(*src_pte);
>   break;
> + }
>   progress += 8;
>   continue;
>   }

Note that -EAGAIN was previously used by copy_present_page() for early cow
use.  Here later although we check entry.val first:

if (entry.val) {
if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
ret = -ENOMEM;
goto out;
}
entry.val = 0;
} else if (ret) {
WARN_ON_ONCE(ret != -EAGAIN);
prealloc = page_copy_prealloc(src_mm, src_vma, addr);
if (!prealloc)
return -ENOMEM;
/* We've captured and resolved the error. Reset, try again. */
ret = 0;
}

We didn't reset "ret" in entry.val case (maybe we should?). Then in the next
round of "goto again" if "ret" is unluckily untouched, it could reach the 2nd
if check, and I think it could cause an unexpected page_copy_prealloc().

-- 
Peter Xu

Re: [PATCH v9 06/10] mm/memory.c: Allow different return codes for copy_nonpresent_pte()

2021-05-26 Thread Peter Xu

On Thu, May 27, 2021 at 11:20:36AM +1000, Alistair Popple wrote:
> On Thursday, 27 May 2021 5:50:05 AM AEST Peter Xu wrote:
> > On Mon, May 24, 2021 at 11:27:21PM +1000, Alistair Popple wrote:
> > > Currently if copy_nonpresent_pte() returns a non-zero value it is
> > > assumed to be a swap entry which requires further processing outside the
> > > loop in copy_pte_range() after dropping locks. This prevents other
> > > values being returned to signal conditions such as failure which a
> > > subsequent change requires.
> > > 
> > > Instead make copy_nonpresent_pte() return an error code if further
> > > processing is required and read the value for the swap entry in the main
> > > loop under the ptl.
> > > 
> > > Signed-off-by: Alistair Popple 
> > > 
> > > ---
> > > 
> > > v9:
> > > 
> > > New for v9 to allow device exclusive handling to occur in
> > > copy_nonpresent_pte().
> > > ---
> > > 
> > >  mm/memory.c | 12 +++-
> > >  1 file changed, 7 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 2fb455c365c2..e061cfa18c11 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -718,7 +718,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct
> > > mm_struct *src_mm,> 
> > >   if (likely(!non_swap_entry(entry))) {
> > >   
> > >   if (swap_duplicate(entry) < 0)
> > > 
> > > - return entry.val;
> > > + return -EAGAIN;
> > > 
> > >   /* make sure dst_mm is on swapoff's mmlist. */
> > >   if (unlikely(list_empty(&dst_mm->mmlist))) {
> > > 
> > > @@ -974,11 +974,13 @@ copy_pte_range(struct vm_area_struct *dst_vma,
> > > struct vm_area_struct *src_vma,> 
> > >   continue;
> > >   
> > >   }
> > >   if (unlikely(!pte_present(*src_pte))) {
> > > 
> > > - entry.val = copy_nonpresent_pte(dst_mm, src_mm,
> > > - dst_pte, src_pte,
> > > - src_vma, addr, rss);
> > > - if (entry.val)
> > > + ret = copy_nonpresent_pte(dst_mm, src_mm,
> > > + dst_pte, src_pte,
> > > + src_vma, addr, rss);
> > > + if (ret == -EAGAIN) {
> > > + entry = pte_to_swp_entry(*src_pte);
> > > 
> > >   break;
> > > 
> > > + }
> > > 
> > >   progress += 8;
> > >   continue;
> > >   
> > >   }
> > 
> > Note that -EAGAIN was previously used by copy_present_page() for early cow
> > use.  Here later although we check entry.val first:
> > 
> > if (entry.val) {
> > if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
> > ret = -ENOMEM;
> > goto out;
> > }
> > entry.val = 0;
> > } else if (ret) {
> > WARN_ON_ONCE(ret != -EAGAIN);
> > prealloc = page_copy_prealloc(src_mm, src_vma, addr);
> > if (!prealloc)
> > return -ENOMEM;
> > /* We've captured and resolved the error. Reset, try again.
> > */ ret = 0;
> > }
> > 
> > We didn't reset "ret" in entry.val case (maybe we should?). Then in the next
> > round of "goto again" if "ret" is unluckily untouched, it could reach the
> > 2nd if check, and I think it could cause an unexpected
> > page_copy_prealloc().
> 
> Thanks, I had considered that but saw "ret" was always set either by 
> copy_nonpresent_pte() or copy_present_pte(). However missed the "unlucky" 
> case 
> at the start of the loop:
> 
>   if (progress >= 32) {
>   progress = 0;
>   if (need_resched() ||
>   spin_needbreak(src_ptl) || 
> pin_needbreak(dst_ptl))
>   break;
> 
> Looking at this again though checking different variables to figure out what 
> to do outside the locks and reusing error codes seems error prone. I reused -
> EAGAIN for copy_nonpresent_pte() simply because that seemed the most sensible 
> error code, but I don't think that aids readability and it might be better to 
> use a unique error code for each case needing extra handling.
> 
> So it might be better if I update this patch to:
> 1) Use unique error codes for each case requiring special handling outside 
> the 
> lock.
> 2) Only check "ret" to determine what to do outside locks (ie. not entry.val)
> 3) Document these.
> 4) Always reset ret after handling.
> 
> Thoughts?

Looks good to me.  Thanks,

-- 
Peter Xu

Re: [PATCH v9 07/10] mm: Device exclusive memory access

2021-05-27 Thread Peter Xu

On Thu, May 27, 2021 at 01:35:39PM +1000, Alistair Popple wrote:
> > > + *
> > > + * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will
> > > no + * longer have exclusive access to the page. May ignore the
> > > invalidation that's + * part of make_device_exclusive_range() if the
> > > owner field
> > > + * matches the value passed to make_device_exclusive_range().
> > 
> > Perhaps s/matches/does not match/?
> 
> No, "matches" is correct. The MMU_NOTIFY_EXCLUSIVE notifier is to notify a 
> listener that a range is being invalidated for the purpose of making the 
> range 
> available for some device to have exclusive access to. Which does also mean a 
> device getting the notification no longer has exclusive access if it already 
> did.
> 
> A unique type is needed because when creating the range a driver needs to 
> form 
> a mmu critical section (with mmu_interval_read_begin()/
> mmu_interval_read_end()) to ensure the entry remains valid long enough to 
> program the device pte and hasn't been invalidated.
> 
> However without a way of filtering any invalidations will result in a retry, 
> but make_device_exclusive_range() needs to do an invalidation during 
> installation of the entry. To avoid this causing infinite retries the driver 
> ignores specific invalidation events that it knows don't apply, ie. the 
> invalidations that are a result of that driver asking for device exclusive 
> entries.

OK I think I get it now.. so the driver checks both EXCLUSIVE and owner, if all
match it skips the notify, otherwise it's treated like all the rest.  Thanks.

However then it's still confusing (as I raised it too in previous comment) that
we use CLEAR when re-installing the valid pte.  It's merely against what CLEAR
means.

How about sending EXCLUSIVE for both mark/restore?  Just that when restore we
notify with owner==NULL telling that no one is owning it anymore so driver
needs to drop the ownership.  I assume your driver patch does not need change
too.  Would that be much cleaner than CLEAR?  I bet it also makes commenting
the new notify easier.

What do you think?

[...]

> > > +   vma->vm_mm, address, min(vma->vm_end,
> > > +   address + page_size(page)),
> > > args->owner); + mmu_notifier_invalidate_range_start(&range);
> > > +
> > > + while (page_vma_mapped_walk(&pvmw)) {
> > > + /* Unexpected PMD-mapped THP? */
> > > + VM_BUG_ON_PAGE(!pvmw.pte, page);
> > > +
> > > + if (!pte_present(*pvmw.pte)) {
> > > + ret = false;
> > > + page_vma_mapped_walk_done(&pvmw);
> > > + break;
> > > + }
> > > +
> > > + subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
> > 
> > I see that all pages passed in should be done after FOLL_SPLIT_PMD, so is
> > this needed?  Or say, should subpage==page always be true?
> 
> Not always, in the case of a thp there are small ptes which will get device 
> exclusive entries.

FOLL_SPLIT_PMD will first split the huge thp into smaller pages, then do
follow_page_pte() on them (in follow_pmd_mask):

if (flags & FOLL_SPLIT_PMD) {
int ret;
page = pmd_page(*pmd);
if (is_huge_zero_page(page)) {
spin_unlock(ptl);
ret = 0;
split_huge_pmd(vma, pmd, address);
if (pmd_trans_unstable(pmd))
ret = -EBUSY;
} else {
        spin_unlock(ptl);
split_huge_pmd(vma, pmd, address);
ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
}

return ret ? ERR_PTR(ret) :
follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
}

So I thought all pages are small pages?

-- 
Peter Xu

Re: [PATCH v9 07/10] mm: Device exclusive memory access

2021-05-28 Thread Peter Xu

On Fri, May 28, 2021 at 11:48:40AM +1000, Alistair Popple wrote:

[...]

> > > > > + while (page_vma_mapped_walk(&pvmw)) {
> > > > > + /* Unexpected PMD-mapped THP? */
> > > > > + VM_BUG_ON_PAGE(!pvmw.pte, page);
> > > > > +
> > > > > + if (!pte_present(*pvmw.pte)) {
> > > > > + ret = false;
> > > > > + page_vma_mapped_walk_done(&pvmw);
> > > > > + break;
> > > > > + }
> > > > > +
> > > > > + subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
> > > > 
> > > > I see that all pages passed in should be done after FOLL_SPLIT_PMD, so
> > > > is
> > > > this needed?  Or say, should subpage==page always be true?
> > > 
> > > Not always, in the case of a thp there are small ptes which will get
> > > device
> > > exclusive entries.
> > 
> > FOLL_SPLIT_PMD will first split the huge thp into smaller pages, then do
> > follow_page_pte() on them (in follow_pmd_mask):
> > 
> > if (flags & FOLL_SPLIT_PMD) {
> > int ret;
> > page = pmd_page(*pmd);
> > if (is_huge_zero_page(page)) {
> > spin_unlock(ptl);
> > ret = 0;
> > split_huge_pmd(vma, pmd, address);
> > if (pmd_trans_unstable(pmd))
> > ret = -EBUSY;
> > } else {
> > spin_unlock(ptl);
> > split_huge_pmd(vma, pmd, address);
> > ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
> > }
> > 
> > return ret ? ERR_PTR(ret) :
> > follow_page_pte(vma, address, pmd, flags,
> > &ctx->pgmap); }
> > 
> > So I thought all pages are small pages?
> 
> The page will remain as a transparent huge page though (at least as I 
> understand things). FOLL_SPLIT_PMD turns it into a pte mapped thp by 
> splitting 
> the pmd and creating pte's mapping the subpages but doesn't split the page 
> itself. For comparison FOLL_SPLIT (which has been removed in v5.13 due to 
> lack 
> of use) is what would be used to split the page in the above GUP code by 
> calling split_huge_page() rather than split_huge_pmd().

But shouldn't FOLL_SPLIT_PMD filled in small pfns for each pte?  See
__split_huge_pmd_locked():

for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
...
} else {
entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
...
}
...
set_pte_at(mm, addr, pte, entry);
}

Then iiuc the coming follow_page_pte() will directly fetch the small pages?

-- 
Peter Xu

Re: [PATCH v9 07/10] mm: Device exclusive memory access

2021-06-02 Thread Peter Xu

On Wed, Jun 02, 2021 at 06:50:37PM +1000, Balbir Singh wrote:
> On Wed, May 26, 2021 at 12:17:18AM -0700, John Hubbard wrote:
> > On 5/25/21 4:51 AM, Balbir Singh wrote:
> > ...
> > > > How beneficial is this code to nouveau users?  I see that it permits a
> > > > part of OpenCL to be implemented, but how useful/important is this in
> > > > the real world?
> > > 
> > > That is a very good question! I've not reviewed the code, but a sample
> > > program with the described use case would make things easy to parse.
> > > I suspect that is not easy to build at the moment?
> > > 
> > 
> > The cover letter says this:
> > 
> > This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
> > which checks that GPU atomic accesses to system memory are atomic. Without
> > this series the test fails as there is no way of write-protecting the page
> > mapping which results in the device clobbering CPU writes. For reference
> > the test is available at https://ozlabs.org/~apopple/opencl_svm_atomics/
> > 
> > Further testing has been performed by adding support for testing exclusive
> > access to the hmm-tests kselftests.
> > 
> > ...so that seems to cover the "sample program" request, at least.
> 
> Thanks, I'll take a look
> 
> > 
> > > I wonder how we co-ordinate all the work the mm is doing, page migration,
> > > reclaim with device exclusive access? Do we have any numbers for the worst
> > > case page fault latency when something is marked away for exclusive 
> > > access?
> > 
> > CPU page fault latency is approximately "terrible", if a page is resident on
> > the GPU. We have to spin up a DMA engine on the GPU and have it copy the 
> > page
> > over the PCIe bus, after all.
> > 
> > > I presume for now this is anonymous memory only? SWP_DEVICE_EXCLUSIVE 
> > > would
> > 
> > Yes, for now.
> > 
> > > only impact the address space of programs using the GPU. Should the 
> > > exclusively
> > > marked range live in the unreclaimable list and recycled back to 
> > > active/in-active
> > > to account for the fact that
> > > 
> > > 1. It is not reclaimable and reclaim will only hurt via page faults?
> > > 2. It ages the page correctly or at-least allows for that possibility 
> > > when the
> > > page is used by the GPU.
> > 
> > I'm not sure that that is *necessarily* something we can conclude. It 
> > depends upon
> > access patterns of each program. For example, a "reduction" parallel 
> > program sends
> > over lots of data to the GPU, and only a tiny bit of (reduced!) data comes 
> > back
> > to the CPU. In that case, freeing the physical page on the CPU is actually 
> > the
> > best decision for the OS to make (if the OS is sufficiently prescient).
> >
> 
> With a shared device or a device exclusive range, it would be good to get the 
> device
> usage pattern and update the mm with that knowledge, so that the LRU can be 
> better
> maintained. With your comment you seem to suggest that a page used by the GPU 
> might
> be a good candidate for reclaim based on the CPU's understanding of the age of
> the page should not account for use by the device
> (are GPU workloads - access once and discard?) 

Hmm, besides the aging info, this reminded me: do we need to isolate the page
from lru too when marking device exclusive access?

Afaict the current patch didn't do that so I think it's reclaimable.  If we
still have the rmap then we'll get a mmu notify CLEAR when unmapping that
special pte, so device driver should be able to drop the ownership.  However we
dropped the rmap when marking exclusive.  Now I don't know whether and how
it'll work if page reclaim runs with the page being exclusively owned if
without isolating the page..

-- 
Peter Xu

Re: [PATCH v9 07/10] mm: Device exclusive memory access

2021-06-03 Thread Peter Xu

On Thu, Jun 03, 2021 at 09:39:32PM +1000, Alistair Popple wrote:
> Reclaim won't run on the page due to the extra references from the special 
> swap entries.

That sounds reasonable, but I didn't find the point that stops it, probably due
to my limited knowledge on the reclaim code.  Could you elaborate?

-- 
Peter Xu

Re: [PATCH v9 07/10] mm: Device exclusive memory access

2021-06-04 Thread Peter Xu

On Fri, Jun 04, 2021 at 11:07:42AM +1000, Alistair Popple wrote:
> On Friday, 4 June 2021 12:47:40 AM AEST Peter Xu wrote:
> > External email: Use caution opening links or attachments
> > 
> > On Thu, Jun 03, 2021 at 09:39:32PM +1000, Alistair Popple wrote:
> > > Reclaim won't run on the page due to the extra references from the special
> > > swap entries.
> > 
> > That sounds reasonable, but I didn't find the point that stops it, probably
> > due to my limited knowledge on the reclaim code.  Could you elaborate?
> 
> Sure, it isn't immediately obvious but it ends up being detected at the start 
> of is_page_cache_freeable() in the pageout code:
> 
> 
> static pageout_t pageout(struct page *page, struct address_space *mapping)
> {
> 
> [...]
> 
>   if (!is_page_cache_freeable(page))
>   return PAGE_KEEP;

I did look at pageout() but still missed this small helper indeed (while it's
so important to know..), thanks!

-- 
Peter Xu

Re: [PATCH v10 05/10] mm: Rename migrate_pgmap_owner

2021-06-08 Thread Peter Xu

On Mon, Jun 07, 2021 at 05:58:50PM +1000, Alistair Popple wrote:
> MMU notifier ranges have a migrate_pgmap_owner field which is used by
> drivers to store a pointer. This is subsequently used by the driver
> callback to filter MMU_NOTIFY_MIGRATE events. Other notifier event types
> can also benefit from this filtering, so rename the
> 'migrate_pgmap_owner' field to 'owner' and create a new notifier
> initialisation function to initialise this field.
> 
> Signed-off-by: Alistair Popple 
> Suggested-by: Peter Xu 

Reviewed-by: Peter Xu 

-- 
Peter Xu

Re: [PATCH v10 06/10] mm/memory.c: Allow different return codes for copy_nonpresent_pte()

2021-06-08 Thread Peter Xu

On Mon, Jun 07, 2021 at 05:58:51PM +1000, Alistair Popple wrote:
> Currently if copy_nonpresent_pte() returns a non-zero value it is
> assumed to be a swap entry which requires further processing outside the
> loop in copy_pte_range() after dropping locks. This prevents other
> values being returned to signal conditions such as failure which a
> subsequent change requires.
> 
> Instead make copy_nonpresent_pte() return an error code if further
> processing is required and read the value for the swap entry in the main
> loop under the ptl.
> 
> Signed-off-by: Alistair Popple 
> 
> ---
> 
> v10:
> 
> Use a unique error code and only check return codes for handling.
> 
> v9:
> 
> New for v9 to allow device exclusive handling to occur in
> copy_nonpresent_pte().
> ---
>  mm/memory.c | 26 --
>  1 file changed, 16 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 2fb455c365c2..0982cab37ecb 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -718,7 +718,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct 
> mm_struct *src_mm,
>  
>   if (likely(!non_swap_entry(entry))) {
>   if (swap_duplicate(entry) < 0)
> - return entry.val;
> + return -EIO;
>  
>   /* make sure dst_mm is on swapoff's mmlist. */
>   if (unlikely(list_empty(&dst_mm->mmlist))) {
> @@ -974,11 +974,13 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct 
> vm_area_struct *src_vma,
>   continue;
>   }
>   if (unlikely(!pte_present(*src_pte))) {
> - entry.val = copy_nonpresent_pte(dst_mm, src_mm,
> - dst_pte, src_pte,
> - src_vma, addr, rss);
> - if (entry.val)
> + ret = copy_nonpresent_pte(dst_mm, src_mm,
> + dst_pte, src_pte,
> + src_vma, addr, rss);
> + if (ret == -EIO) {
> + entry = pte_to_swp_entry(*src_pte);
>   break;
> + }
>   progress += 8;
>   continue;
>   }
> @@ -1011,20 +1013,24 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct 
> vm_area_struct *src_vma,
>   pte_unmap_unlock(orig_dst_pte, dst_ptl);
>   cond_resched();
>  
> - if (entry.val) {
> + if (ret == -EIO) {
> + VM_WARN_ON_ONCE(!entry.val);
>   if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
>   ret = -ENOMEM;
>   goto out;
>   }
>   entry.val = 0;
> - } else if (ret) {
> - WARN_ON_ONCE(ret != -EAGAIN);
> + } else if (ret ==  -EAGAIN) {
  ^
  |- one more space here

>   prealloc = page_copy_prealloc(src_mm, src_vma, addr);
>   if (!prealloc)
>   return -ENOMEM;
> - /* We've captured and resolved the error. Reset, try again. */
> - ret = 0;
> + } else if (ret) {
> + VM_WARN_ON_ONCE(1);
>   }
> +
> + /* We've captured and resolved the error. Reset, try again. */

Maybe better as:

  /*
   * We've resolved all error even if there is, reset error code and try
   * again if necessary.
   */

as it also covers the no-error path.  But I guess not a big deal..

Reviewed-by: Peter Xu 

Thanks,

> + ret = 0;
> +
>   if (addr != end)
>   goto again;
>  out:
> -- 
> 2.20.1
> 

-- 
Peter Xu

Re: [PATCH v10 07/10] mm: Device exclusive memory access

2021-06-08 Thread Peter Xu

fier callbacks
> + *
> + * Tries to remove all the page table entries which are mapping this page and
> + * replace them with special device exclusive swap entries to grant a device
> + * exclusive access to the page. Caller must hold the page lock.
> + *
> + * Returns false if the page is still mapped, or if it could not be unmapped
> + * from the expected address. Otherwise returns true (success).
> + */
> +static bool page_make_device_exclusive(struct page *page, struct mm_struct 
> *mm,
> + unsigned long address, void *owner)
> +{
> + struct make_exclusive_args args = {
> + .mm = mm,
> + .address = address,
> + .owner = owner,
> + .valid = false,
> + };
> + struct rmap_walk_control rwc = {
> + .rmap_one = page_make_device_exclusive_one,
> + .done = page_not_mapped,
> + .anon_lock = page_lock_anon_vma_read,
> + .arg = &args,
> + };
> +
> + /*
> +  * Restrict to anonymous pages for now to avoid potential writeback
> +  * issues.
> +  */
> + if (!PageAnon(page))
> + return false;
> +
> + rmap_walk(page, &rwc);

Here we call rmap_walk() on each page we've got.  If it was thp then IIUC it'll
become the tail pages to walk as the outcome of FOLL_SPLIT_PMD gup (please
refer to the last reply of mine).  However now I'm uncertain whether we can do
rmap_walk on tail page at all...  As rmap_walk_anon() has thp_nr_pages() which
has:

VM_BUG_ON_PGFLAGS(PageTail(page), page);

So... for thp mappings, wondering whether we should do normal GUP (without
SPLIT), pass in always normal or head pages into rmap_walk(), but then
unconditionally split_huge_pmd_address() in page_make_device_exclusive_one()?

Please correct me if I made silly mistakes on above, as I am looking at the
code when/during trying to review the patch, so it's possible I missed
something again.  Neither does this code a huge matter since it's not in
general mm path, but still raise this question up.

Thanks,

> +
> + return args.valid && !page_mapcount(page);
> +}
> +
> +/**
> + * make_device_exclusive_range() - Mark a range for exclusive use by a device
> + * @mm: mm_struct of assoicated target process
> + * @start: start of the region to mark for exclusive device access
> + * @end: end address of region
> + * @pages: returns the pages which were successfully marked for exclusive 
> access
> + * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier to allow filtering
> + *
> + * Returns: number of pages found in the range by GUP. A page is marked for
> + * exclusive access only if the page pointer is non-NULL.
> + *
> + * This function finds ptes mapping page(s) to the given address range, locks
> + * them and replaces mappings with special swap entries preventing userspace 
> CPU
> + * access. On fault these entries are replaced with the original mapping 
> after
> + * calling MMU notifiers.
> + *
> + * A driver using this to program access from a device must use a mmu 
> notifier
> + * critical section to hold a device specific lock during programming. Once
> + * programming is complete it should drop the page lock and reference after
> + * which point CPU access to the page will revoke the exclusive access.
> + */
> +int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
> + unsigned long end, struct page **pages,
> + void *owner)
> +{
> + long npages = (end - start) >> PAGE_SHIFT;
> + unsigned long i;
> +
> + npages = get_user_pages_remote(mm, start, npages,
> +FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
> +pages, NULL, NULL);
> + for (i = 0; i < npages; i++, start += PAGE_SIZE) {
> + if (!trylock_page(pages[i])) {
> + put_page(pages[i]);
> + pages[i] = NULL;
> + continue;
> + }
> +
> + if (!page_make_device_exclusive(pages[i], mm, start, owner)) {
> + unlock_page(pages[i]);
> + put_page(pages[i]);
> + pages[i] = NULL;
> + }
> + }
> +
> + return npages;
> +}
> +EXPORT_SYMBOL_GPL(make_device_exclusive_range);
> +#endif
> +
>  void __put_anon_vma(struct anon_vma *anon_vma)
>  {
>   struct anon_vma *root = anon_vma->root;
> -- 
> 2.20.1
> 

-- 
Peter Xu

Re: [PATCH v10 07/10] mm: Device exclusive memory access

2021-06-09 Thread Peter Xu

On Wed, Jun 09, 2021 at 07:38:04PM +1000, Alistair Popple wrote:
> On Wednesday, 9 June 2021 4:33:52 AM AEST Peter Xu wrote:
> > On Mon, Jun 07, 2021 at 05:58:52PM +1000, Alistair Popple wrote:
> > 
> > [...]
> > 
> > > +static bool page_make_device_exclusive_one(struct page *page,
> > > + struct vm_area_struct *vma, unsigned long address, void 
> > > *priv)
> > > +{
> > > + struct mm_struct *mm = vma->vm_mm;
> > > + struct page_vma_mapped_walk pvmw = {
> > > + .page = page,
> > > + .vma = vma,
> > > + .address = address,
> > > + };
> > > + struct make_exclusive_args *args = priv;
> > > + pte_t pteval;
> > > + struct page *subpage;
> > > + bool ret = true;
> > > + struct mmu_notifier_range range;
> > > + swp_entry_t entry;
> > > + pte_t swp_pte;
> > > +
> > > + mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
> > > +   vma->vm_mm, address, min(vma->vm_end,
> > > +   address + page_size(page)), 
> > > args->owner);
> > > + mmu_notifier_invalidate_range_start(&range);
> > > +
> > > + while (page_vma_mapped_walk(&pvmw)) {
> > > + /* Unexpected PMD-mapped THP? */
> > > + VM_BUG_ON_PAGE(!pvmw.pte, page);
> > 
> > [1]
> > 
> > > +
> > > + if (!pte_present(*pvmw.pte)) {
> > > + ret = false;
> > > + page_vma_mapped_walk_done(&pvmw);
> > > + break;
> > > + }
> > > +
> > > + subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
> > > + address = pvmw.address;
> > 
> > I raised a question here previously and didn't get an answer...
> > 
> > https://lore.kernel.org/linux-mm/YLDr%2FRyAdUR4q0kk@t490s/
> 
> Sorry, I had overlooked that. Will continue the discussion here.

No problem.  I also didn't really express clearly last time, I'm happy we can
discuss this more thoroughly, even if it may be a corner case only.

> 
> > I think I get your point now and it does look possible that the split page 
> > can
> > still be mapped somewhere else as thp, then having some subpage maintainance
> > looks necessary.  The confusing part is above [1] you've also got that
> > VM_BUG_ON_PAGE() assuming it must not be a mapped pmd at all..
> 
> Going back I thought your original question was whether subpage != page is
> possible. My main point was it's possible if we get a thp head. In that case 
> we
> need to replace all pte's with exclusive entries because I haven't (yet)
> defined a pmd version of device exclusive entries and also rmap_walk won't 
> deal
> with tail pages (see below).
> 
> > Then I remembered these code majorly come from the try_to_unmap() so I 
> > looked
> > there.  I _think_ what's missing here is something like:
> > 
> > if (flags & TTU_SPLIT_HUGE_PMD)
> > split_huge_pmd_address(vma, address, false, page);
> > 
> > at the entry of page_make_device_exclusive_one()?
> >
> > That !pte assertion in try_to_unmap() makes sense to me as long as it has 
> > split
> > the thp page first always.  However seems not the case for FOLL_SPLIT_PMD as
> > you previously mentioned.
> 
> At present this is limited to PageAnon pages which have had CoW broken, which 
> I
> think means there shouldn't be other mappings so I expect the PMD will always
> have been split into small PTEs mapping subpages by GUP which is what that
> assertion [1] is checking. I could call split_huge_pmd_address() 
> unconditionally
> as suggested but see the discussion below.

Yes, I think calling that unconditionally should be enough.

> 
> > Meanwhile, I also started to wonder whether it's even right to call 
> > rmap_walk()
> > with tail pages...  Please see below.
> > 
> > > +
> > > + /* Nuke the page table entry. */
> > > + flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> > > + pteval = ptep_clear_flush(vma, address, pvmw.pte);
> > > +
> > > + /* Move the dirty bit to the page. Now the pte is gone. */
> > > + if (pte_dirty(pteval))
> > > +

Re: [PATCH v10 07/10] mm: Device exclusive memory access

2021-06-10 Thread Peter Xu

On Thu, Jun 10, 2021 at 10:18:25AM +1000, Alistair Popple wrote:
> > > The main problem is split_huge_pmd_address() unconditionally calls a mmu
> > > notifier so I would need to plumb in passing an owner everywhere which 
> > > could
> > > get messy.
> > 
> > Could I ask why?  split_huge_pmd_address() will notify with CLEAR, so I'm a 
> > bit
> > confused why we need to pass over the owner.
> 
> Sure, it is the same reason we need to pass it for the exclusive notifier.
> Any invalidation during the make exclusive operation will break the mmu read
> side critical section forcing a retry of the operation. The owner field is 
> what
> is used to filter out invalidations (such as the exclusive invalidation) that
> don't need to be retried.

Do you mean the mmu_interval_read_begin|retry() calls?

Hmm, the thing is.. to me FOLL_SPLIT_PMD should have similar effect to explicit
call split_huge_pmd_address(), afaict.  Since both of them use 
__split_huge_pmd()
internally which will generate that unwanted CLEAR notify.

If that's the case, I think it fails because split_huge_pmd_address() will
trigger that CLEAR notify unconditionally (even if it's not a thp; not sure
whether it should be optimized to not notify at all... definitely another
story), while FOLL_SPLIT_PMD will skip the notify as it calls split_huge_pmd()
instead, who checks the pmd before calling __split_huge_pmd().

Does it also mean that if there's a real THP it won't really work?  As then
FOLL_SPLIT_PMD will start to trigger that CLEAR notify too, I think..

-- 
Peter Xu

Re: [PATCH v10 07/10] mm: Device exclusive memory access

2021-06-10 Thread Peter Xu

On Fri, Jun 11, 2021 at 12:21:26AM +1000, Alistair Popple wrote:
> > Hmm, the thing is.. to me FOLL_SPLIT_PMD should have similar effect to 
> > explicit
> > call split_huge_pmd_address(), afaict.  Since both of them use 
> > __split_huge_pmd()
> > internally which will generate that unwanted CLEAR notify.
> 
> Agree that gup calls __split_huge_pmd() via split_huge_pmd_address()
> which will always CLEAR. However gup only calls split_huge_pmd_address() if it
> finds a thp pmd. In follow_pmd_mask() we have:
> 
>   if (likely(!pmd_trans_huge(pmdval)))
>   return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> 
> So I don't think we have a problem here.

Sorry I didn't follow here..  We do FOLL_SPLIT_PMD after this check, right?  I
mean, if it's a thp for the current mm, afaict pmd_trans_huge() should return
true above, so we'll skip follow_page_pte(); then we'll check FOLL_SPLIT_PMD
and do the split, then the CLEAR notify.  Hmm.. Did I miss something?

-- 
Peter Xu

Re: [PATCH v10 07/10] mm: Device exclusive memory access

2021-06-10 Thread Peter Xu

On Fri, Jun 11, 2021 at 09:17:14AM +1000, Alistair Popple wrote:
> On Friday, 11 June 2021 9:04:19 AM AEST Peter Xu wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Fri, Jun 11, 2021 at 12:21:26AM +1000, Alistair Popple wrote:
> > > > Hmm, the thing is.. to me FOLL_SPLIT_PMD should have similar effect to 
> > > > explicit
> > > > call split_huge_pmd_address(), afaict.  Since both of them use 
> > > > __split_huge_pmd()
> > > > internally which will generate that unwanted CLEAR notify.
> > >
> > > Agree that gup calls __split_huge_pmd() via split_huge_pmd_address()
> > > which will always CLEAR. However gup only calls split_huge_pmd_address() 
> > > if it
> > > finds a thp pmd. In follow_pmd_mask() we have:
> > >
> > >   if (likely(!pmd_trans_huge(pmdval)))
> > >   return follow_page_pte(vma, address, pmd, flags, 
> > > &ctx->pgmap);
> > >
> > > So I don't think we have a problem here.
> > 
> > Sorry I didn't follow here..  We do FOLL_SPLIT_PMD after this check, right? 
> >  I
> > mean, if it's a thp for the current mm, afaict pmd_trans_huge() should 
> > return
> > true above, so we'll skip follow_page_pte(); then we'll check FOLL_SPLIT_PMD
> > and do the split, then the CLEAR notify.  Hmm.. Did I miss something?
> 
> That seems correct - if the thp is not mapped with a pmd we won't split and we
> won't CLEAR. If there is a thp pmd we will split and CLEAR, but in that case 
> it
> is fine - we will retry, but the retry will won't CLEAR because the pmd has
> already been split.

Aha!

> 
> The issue arises with doing it unconditionally in make device exclusive is 
> that
> you *always* CLEAR even if there is no thp pmd to split. Or at least that's my
> understanding, please let me know if it doesn't make sense.

Exactly.  But if you see what I meant here, even if it can work like this, it
sounds still fragile, isn't it?  I just feel something is slightly off there..

IMHO split_huge_pmd() checked pmd before calling __split_huge_pmd() for
performance, afaict, because if it's not a thp even without locking, then it
won't be, so further __split_huge_pmd() is not necessary.

IOW, it's very legal if someday we'd like to let split_huge_pmd() call
__split_huge_pmd() directly, then AFAIU device exclusive API will be the 1st
one to be broken with that seems-to-be-irrelevant change I'm afraid..

This lets me goes back a step to think about why do we need this notifier at
all to cover this whole range of make_device_exclusive() procedure..

What I am thinking is, we're afraid some CPU accesses this page so the pte got
quickly restored when device atomic operation is carrying on.  Then with this
notifier we'll be able to cancel it.  Makes perfect sense.

However do we really need to register this notifier so early?  The thing is the
GPU driver still has all the page locks, so even if there's a race to restore
the ptes, they'll block at taking the page lock until the driver releases it.

IOW, I'm wondering whether the "non-fragile" way to do this is not do
mmu_interval_notifier_insert() that early: what if we register that notifier
after make_device_exclusive_range() returns but before page_unlock() somehow?
So before page_unlock(), race is protected fully by the lock itself; after
that, it's done by mmu notifier.  Then maybe we don't need to worry about all
these notifications during marking exclusive (while we shouldn't)?

Sorry in advance if I overlooked anything as I know little on device side (even
less than mm itself).  Also sorry to know that this series got marked
to-be-update in -mm; hopefully it'll still land soon even if it still needs
some rebase to other more important bugfixes - I definitely jumped in too late
even if to mess this all up. :-)

-- 
Peter Xu

Re: [PATCH v10 07/10] mm: Device exclusive memory access

2021-06-11 Thread Peter Xu

On Fri, Jun 11, 2021 at 01:43:20PM +1000, Alistair Popple wrote:
> On Friday, 11 June 2021 11:00:34 AM AEST Peter Xu wrote:
> > On Fri, Jun 11, 2021 at 09:17:14AM +1000, Alistair Popple wrote:
> > > On Friday, 11 June 2021 9:04:19 AM AEST Peter Xu wrote:
> > > > On Fri, Jun 11, 2021 at 12:21:26AM +1000, Alistair Popple wrote:
> > > > > > Hmm, the thing is.. to me FOLL_SPLIT_PMD should have similar effect 
> > > > > > to explicit
> > > > > > call split_huge_pmd_address(), afaict.  Since both of them use 
> > > > > > __split_huge_pmd()
> > > > > > internally which will generate that unwanted CLEAR notify.
> > > > >
> > > > > Agree that gup calls __split_huge_pmd() via split_huge_pmd_address()
> > > > > which will always CLEAR. However gup only calls 
> > > > > split_huge_pmd_address() if it
> > > > > finds a thp pmd. In follow_pmd_mask() we have:
> > > > >
> > > > >   if (likely(!pmd_trans_huge(pmdval)))
> > > > >   return follow_page_pte(vma, address, pmd, flags, 
> > > > > &ctx->pgmap);
> > > > >
> > > > > So I don't think we have a problem here.
> > > >
> > > > Sorry I didn't follow here..  We do FOLL_SPLIT_PMD after this check, 
> > > > right?  I
> > > > mean, if it's a thp for the current mm, afaict pmd_trans_huge() should 
> > > > return
> > > > true above, so we'll skip follow_page_pte(); then we'll check 
> > > > FOLL_SPLIT_PMD
> > > > and do the split, then the CLEAR notify.  Hmm.. Did I miss something?
> > >
> > > That seems correct - if the thp is not mapped with a pmd we won't split 
> > > and we
> > > won't CLEAR. If there is a thp pmd we will split and CLEAR, but in that 
> > > case it
> > > is fine - we will retry, but the retry will won't CLEAR because the pmd 
> > > has
> > > already been split.
> > 
> > Aha!
> > 
> > >
> > > The issue arises with doing it unconditionally in make device exclusive 
> > > is that
> > > you *always* CLEAR even if there is no thp pmd to split. Or at least 
> > > that's my
> > > understanding, please let me know if it doesn't make sense.
> > 
> > Exactly.  But if you see what I meant here, even if it can work like this, 
> > it
> > sounds still fragile, isn't it?  I just feel something is slightly off 
> > there..
> > 
> > IMHO split_huge_pmd() checked pmd before calling __split_huge_pmd() for
> > performance, afaict, because if it's not a thp even without locking, then it
> > won't be, so further __split_huge_pmd() is not necessary.
> > 
> > IOW, it's very legal if someday we'd like to let split_huge_pmd() call
> > __split_huge_pmd() directly, then AFAIU device exclusive API will be the 1st
> > one to be broken with that seems-to-be-irrelevant change I'm afraid..
> 
> Well I would argue the performance of memory notifiers is becoming 
> increasingly
> important, and a change that causes them to be called unnecessarily is
> therefore not very legal. Likely the correct fix here is to optimise
> __split_huge_pmd() to only call the notifier if it's actually going to split a
> pmd. As you said though that's a completely different story which I think 
> would
> be best done as a separate series.

Right, maybe I can look a bit more into that later; but my whole point was to
express that one functionality shouldn't depend on such a trivial detail of
implementation of other modules (thp split in this case).

> 
> > This lets me goes back a step to think about why do we need this notifier at
> > all to cover this whole range of make_device_exclusive() procedure..
> > 
> > What I am thinking is, we're afraid some CPU accesses this page so the pte 
> > got
> > quickly restored when device atomic operation is carrying on.  Then with 
> > this
> > notifier we'll be able to cancel it.  Makes perfect sense.
> > 
> > However do we really need to register this notifier so early?  The thing is 
> > the
> > GPU driver still has all the page locks, so even if there's a race to 
> > restore
> > the ptes, they'll block at taking the page lock until the driver releases 
> > it.
> > 
> > IOW, I'm wondering whether the "non-fragile" way to do this is not do
> > mmu_interval_notifier_insert() that early: wh

Re: [PATCH v10 07/10] mm: Device exclusive memory access

2021-06-15 Thread Peter Xu

On Tue, Jun 15, 2021 at 01:08:11PM +1000, Alistair Popple wrote:
> On Saturday, 12 June 2021 1:01:42 AM AEST Peter Xu wrote:
> > On Fri, Jun 11, 2021 at 01:43:20PM +1000, Alistair Popple wrote:
> > > On Friday, 11 June 2021 11:00:34 AM AEST Peter Xu wrote:
> > > > On Fri, Jun 11, 2021 at 09:17:14AM +1000, Alistair Popple wrote:
> > > > > On Friday, 11 June 2021 9:04:19 AM AEST Peter Xu wrote:
> > > > > > On Fri, Jun 11, 2021 at 12:21:26AM +1000, Alistair Popple wrote:
> > > > > > > > Hmm, the thing is.. to me FOLL_SPLIT_PMD should have similar 
> > > > > > > > effect to explicit
> > > > > > > > call split_huge_pmd_address(), afaict.  Since both of them use 
> > > > > > > > __split_huge_pmd()
> > > > > > > > internally which will generate that unwanted CLEAR notify.
> > > > > > >
> > > > > > > Agree that gup calls __split_huge_pmd() via 
> > > > > > > split_huge_pmd_address()
> > > > > > > which will always CLEAR. However gup only calls 
> > > > > > > split_huge_pmd_address() if it
> > > > > > > finds a thp pmd. In follow_pmd_mask() we have:
> > > > > > >
> > > > > > >   if (likely(!pmd_trans_huge(pmdval)))
> > > > > > >   return follow_page_pte(vma, address, pmd, flags, 
> > > > > > > &ctx->pgmap);
> > > > > > >
> > > > > > > So I don't think we have a problem here.
> > > > > >
> > > > > > Sorry I didn't follow here..  We do FOLL_SPLIT_PMD after this 
> > > > > > check, right?  I
> > > > > > mean, if it's a thp for the current mm, afaict pmd_trans_huge() 
> > > > > > should return
> > > > > > true above, so we'll skip follow_page_pte(); then we'll check 
> > > > > > FOLL_SPLIT_PMD
> > > > > > and do the split, then the CLEAR notify.  Hmm.. Did I miss 
> > > > > > something?
> > > > >
> > > > > That seems correct - if the thp is not mapped with a pmd we won't 
> > > > > split and we
> > > > > won't CLEAR. If there is a thp pmd we will split and CLEAR, but in 
> > > > > that case it
> > > > > is fine - we will retry, but the retry will won't CLEAR because the 
> > > > > pmd has
> > > > > already been split.
> > > >
> > > > Aha!
> > > >
> > > > >
> > > > > The issue arises with doing it unconditionally in make device 
> > > > > exclusive is that
> > > > > you *always* CLEAR even if there is no thp pmd to split. Or at least 
> > > > > that's my
> > > > > understanding, please let me know if it doesn't make sense.
> > > >
> > > > Exactly.  But if you see what I meant here, even if it can work like 
> > > > this, it
> > > > sounds still fragile, isn't it?  I just feel something is slightly off 
> > > > there..
> > > >
> > > > IMHO split_huge_pmd() checked pmd before calling __split_huge_pmd() for
> > > > performance, afaict, because if it's not a thp even without locking, 
> > > > then it
> > > > won't be, so further __split_huge_pmd() is not necessary.
> > > >
> > > > IOW, it's very legal if someday we'd like to let split_huge_pmd() call
> > > > __split_huge_pmd() directly, then AFAIU device exclusive API will be 
> > > > the 1st
> > > > one to be broken with that seems-to-be-irrelevant change I'm afraid..
> > >
> > > Well I would argue the performance of memory notifiers is becoming 
> > > increasingly
> > > important, and a change that causes them to be called unnecessarily is
> > > therefore not very legal. Likely the correct fix here is to optimise
> > > __split_huge_pmd() to only call the notifier if it's actually going to 
> > > split a
> > > pmd. As you said though that's a completely different story which I think 
> > > would
> > > be best done as a separate series.
> > 
> > Right, maybe I can look a bit more into that later; but my whole point was 
> > to
> > express that one functionality shouldn't depend on such a trivial detail of
> > implementation of

Re: [PATCH v9 02/14] mm: move page zone helpers from mm.h to mmzone.h

2023-06-15 Thread Peter Xu

Hello, all,

On Fri, Jul 15, 2022 at 10:05:09AM -0500, Alex Sierra wrote:
> +static inline enum zone_type page_zonenum(const struct page *page)
> +{
> + ASSERT_EXCLUSIVE_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT);
> + return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
> +}

Sorry to hijack this patch - not directly relevant to the movement, but
relevant to this helper, so maybe I can leverage the cc list..

My question is whether page_zonenum() is ready for taking all kinds of tail
pages?

Zone device tail pages all look fine, per memmap_init_zone_device().  The
question was other kinds of usual compound pages, like either thp or
hugetlb.  IIUC page->flags can be uninitialized for those tail pages.

Asking because I noticed it seems possible that page_zonenum() can just
take any random tail page as input, e.g.:

try_grab_folio -> is_pci_p2pdma_page -> is_zone_device_page -> page_zonenum

I'm worried it'll just read fake things, but maybe I just missed something?

Thanks,

-- 
Peter Xu

Re: [PATCH v9 02/14] mm: move page zone helpers from mm.h to mmzone.h

2023-06-15 Thread Peter Xu

On Thu, Jun 15, 2023 at 09:15:26PM +0100, Matthew Wilcox wrote:
> On Thu, Jun 15, 2023 at 03:33:12PM -0400, Peter Xu wrote:
> > My question is whether page_zonenum() is ready for taking all kinds of tail
> > pages?
> > 
> > Zone device tail pages all look fine, per memmap_init_zone_device().  The
> > question was other kinds of usual compound pages, like either thp or
> > hugetlb.  IIUC page->flags can be uninitialized for those tail pages.
> 
> I don't think that's true.  It's my understanding that page->flags is
> initialised for all pages in memmap at boot / hotplug / delayed-init
> time.  So you can check things like zone, node, etc on literally any
> page.  Contrariwise, those flags are not available in tail pages for
> use by the entity that has allocated a compound page / large folio.

Oh so the zone mask is special.  Fair enough.

> 
> Also, I don't believe zone device pages support compound allocation.
> I think they're always allocated as order-0.

Totally not familiar with zone device pages, but memmap_init_zone_device()
has pfns_per_compound which can be >1.  From there, memmap_init_compound()
does go ahead and setup pages as compound ones.

Thanks!

-- 
Peter Xu

Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages

2023-06-23 Thread Peter Xu

ttachment().

It seems the previous concern on using gup was majorly fork(), if this is it:

https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2#comment_414213

Could it also be guarded by just making sure the pages are MAP_SHARED when
creating the udmabuf, if fork() is a requirement of the feature?

I had a feeling that the userspace still needs to always do the right thing
to make it work, even using pure PFN mappings.

For instance, what if the userapp just punchs a hole in the shmem/hugetlbfs
file after creating the udmabuf (I see that F_SEAL_SHRINK is required, but
at least not F_SEAL_WRITE with current impl), and fault a new page into the
page cache?

Thanks,

> 
> > 
> > 
> > There are *probably* more issues on the QEMU side when udmabuf is
> > paired
> > with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for
> > virtio-balloon, virtio-mem, postcopy live migration, ... for example, in
> > the vfio/vdpa case we make sure that we disallow most of these, because
> > otherwise there can be an accidental "disconnect" between the pages
> > mapped into the VM (guest view) and the pages mapped into the IOMMU
> > (device view), for example, after a reboot.
> Ok; I am not sure if I can figure out if there is any acceptable way to 
> address
> these issues but given the current constraints associated with udmabuf, what
> do you suggest is the most reasonable way to deal with these problems you
> have identified?
> 
> Thanks,
> Vivek
> 
> > 
> > --
> > Cheers,
> > 
> > David / dhildenb
> 

-- 
Peter Xu

Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages

2023-06-23 Thread Peter Xu

On Fri, Jun 23, 2023 at 01:37:58PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 23, 2023 at 12:35:45PM -0400, Peter Xu wrote:
> 
> > It seems the previous concern on using gup was majorly fork(), if this is 
> > it:
> > 
> > https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2#comment_414213
> 
> Fork and GUP have been fixed since that comment anyhow there is no
> longer a problem using GUP and fork together.

Ah, I read it previously as a requirement that the child will also be able
the see the same / coherent page when manipulating the dmabuf later after
fork(), e.g., the udmabuf can be created before fork().

> 
> > Could it also be guarded by just making sure the pages are MAP_SHARED when
> > creating the udmabuf, if fork() is a requirement of the feature?
> 
> Also a resaonable thing to do
> 
> > For instance, what if the userapp just punchs a hole in the shmem/hugetlbfs
> > file after creating the udmabuf (I see that F_SEAL_SHRINK is required, but
> > at least not F_SEAL_WRITE with current impl), and fault a new page into the
> > page cache?
> 
> It becomes incoherent just like all other GUP users if userspace
> explicitly manipulates the VMAs. It is no different to what happens
> with VFIO, qemu should treat this memory the same as it does for VFIO
> memory.

Agreed.

-- 
Peter Xu

Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages

2023-06-26 Thread Peter Xu

ple, pairing udmabuf with vfio (which pins pages using
> > > > pin_user_pages(FOLL_LONGTERM)) in QEMU will most probably not work
> > in
> > > > all cases: if udmabuf longterm pinned the pages "the wrong way", vfio
> > > > will fail to migrate them during FOLL_LONGTERM and consequently fail
> > > > pin_user_pages(). As long as udmabuf holds a reference on these pages,
> > > > that will never succeed.
> > > Dma-buf rules (for exporters) indicate that the pages only need to be
> > pinned
> > > during the map_attachment phase (and until unmap attachment happens).
> > > In other words, only when the sg_table is created by udmabuf. I guess one
> > > option would be to not hold any references during UDMABUF_CREATE and
> > > only grab references to the pages (as and when it gets used) during this
> > step.
> > > Would this help?
> > 
> > IIUC the refcount is needed, otherwise I don't see what to protect the page
> > from being freed and even reused elsewhere before map_attachment().
> > 
> > It seems the previous concern on using gup was majorly fork(), if this is 
> > it:
> > 
> > https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2#co
> > mment_414213
> > 
> > Could it also be guarded by just making sure the pages are MAP_SHARED
> > when
> > creating the udmabuf, if fork() is a requirement of the feature?
> > 
> > I had a feeling that the userspace still needs to always do the right thing
> > to make it work, even using pure PFN mappings.
> > 
> > For instance, what if the userapp just punchs a hole in the shmem/hugetlbfs
> > file after creating the udmabuf (I see that F_SEAL_SHRINK is required, but
> > at least not F_SEAL_WRITE with current impl), and fault a new page into the
> > page cache?
> IIUC, Qemu creates and owns the memfd that is associated with Guest memory.
> And if it punches a hole in its own memfd that goes through Guest pinned 
> pages 
> associated with buffers/resources, then I think the proper way to fix this is 
> to
> somehow notify the Guest driver (virtio-gpu in this case) that the backing 
> store
> of the affected resources is no longer valid and that the resources need to be
> destroyed and re-created again.
> 
> Having said that, one option I could think of is to probably install a 
> mmu_notifier
> associated with the relevant pages in udmabuf and once we get notified about
> any invalidation event concerning any of the pages, we'd fail any subsequent
> attempt to access these pages and propagate the error across the stack. 

Sounds right, maybe it needs to go back to the old GUP solution, though, as
mmu notifiers are also mm-based not fd-based. Or to be explicit, I think
it'll be pin_user_pages(FOLL_LONGTERM) with the new API.  It'll also solve
the movable pages issue on pinning.

> 
> However, it feels like udmabuf is not the right place to handle this issue 
> because
> there are very limited options for taking proper corrective action if Qemu 
> decides
> to punch a hole in Guest memory that takes out pages that are pinned.

I'm not familiar with the use case that much, but IMHO it's fine if the
driver relies on proper behaviors of userapp to work.

IIUC the worst case here is the udmabuf holds some pages that are not the
pages of the guest mem anymore, but it only happens on misbehaved
userspace, then it looks all fine as long as they can at least be properly
released when releasing the udmabuf.  It'll be better if the udmabuf can
fail hard when detecting this, but IMHO even that can be optional depending
on the need, while any corrective action will be even one step further.

Thanks,

-- 
Peter Xu

Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages

2023-06-26 Thread Peter Xu

On Mon, Jun 26, 2023 at 03:18:48PM -0300, Jason Gunthorpe wrote:
> On Mon, Jun 26, 2023 at 08:14:27PM +0200, David Hildenbrand wrote:
> 
> > So we might have to implement the same page migration as gup does on
> > FOLL_LONGTERM here ... maybe there are more such cases/drivers that actually
> > require that handling when simply taking pages out of the memfd, believing
> > they can hold on to them forever.
> 
> In general I would like to see an interface to FOLL_LONGTERM pin pages
> from a memfd. I would quite happily use that in iommufd as well.
> 
> It solves some problems we have there with fork/exec/etc if the pages
> are not linked to a mm_struct.

Afaiu any fd based approach should mean it'll never work with private
memories, while mm-based should be able to work on any kind.  Maybe that's
not a problem - I assume at least udmabuf should just only work with shared
memories; not sure on iommufd, though.

-- 
Peter Xu

Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages

2023-06-27 Thread Peter Xu

On Tue, Jun 27, 2023 at 12:52:34PM -0300, Jason Gunthorpe wrote:
> On Mon, Jun 26, 2023 at 03:04:21PM -0400, Peter Xu wrote:
> > On Mon, Jun 26, 2023 at 03:18:48PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Jun 26, 2023 at 08:14:27PM +0200, David Hildenbrand wrote:
> > > 
> > > > So we might have to implement the same page migration as gup does on
> > > > FOLL_LONGTERM here ... maybe there are more such cases/drivers that 
> > > > actually
> > > > require that handling when simply taking pages out of the memfd, 
> > > > believing
> > > > they can hold on to them forever.
> > > 
> > > In general I would like to see an interface to FOLL_LONGTERM pin pages
> > > from a memfd. I would quite happily use that in iommufd as well.
> > > 
> > > It solves some problems we have there with fork/exec/etc if the pages
> > > are not linked to a mm_struct.
> > 
> > Afaiu any fd based approach should mean it'll never work with private
> > memories, while mm-based should be able to work on any kind.  
> 
> Is there a significant use case to open a memfd and then use
> MAP_PRIVATE? Why would anyone want to do that instead of just using
> normal mmap anonymous memory?

I remember David Hildenbrand somewhere mentioned the use case where one
wants to snapshot a VM RAM into a file, then start multiple instances by
loading that VM RAM with MAP_PRIVATE, so it clones a bunch of snapshoted VM
running with a single RAM file shared as a template.  Not a generic use
case, I guess.

My question applies not only memfd but also in general - qemu by default
doesn't use memfd afaict, so it boils down to e.g. whether you'll target
the iommufd project to work in that case, where qemu uses anonymous memory.
Privately mapped file memory is only one of those kinds.

-- 
Peter Xu

Re: [PATCH v2 2/2] udmabuf: Add back support for mapping hugetlb pages (v2)

2023-07-27 Thread Peter Xu

On Wed, Jul 19, 2023 at 05:42:54AM +, Kasireddy, Vivek wrote:
> > > + } else {
> > > + page =
> > shmem_read_mapping_page(mapping, pgoff + pgidx);
> > 
> > It may not matter to your users, but the semantics for hugetlb and shmem
> > pages is different.  hugetlb requires the pages exist in the page cache
> > while shmem will create/add pages to the cache if necessary.
> Ok; got it.

Would it be nice to do the same for both (perhaps always try to allocate)?

Thanks,

-- 
Peter Xu

Re: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-07-27 Thread Peter Xu

Hi, Vivek,

On Tue, Jul 25, 2023 at 10:24:21PM +, Kasireddy, Vivek wrote:
> Hi Hugh,
> 
> > 
> > On Mon, 24 Jul 2023, Kasireddy, Vivek wrote:
> > > Hi Jason,
> > > > On Mon, Jul 24, 2023 at 07:54:38AM +, Kasireddy, Vivek wrote:
> > > >
> > > > > > I'm not at all familiar with the udmabuf use case but that sounds
> > > > > > brittle and effectively makes this notifier udmabuf specific right?
> > > > > Oh, Qemu uses the udmabuf driver to provide Host Graphics
> > components
> > > > > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> > > > > buffers. In other words, from a core mm standpoint, udmabuf just
> > > > > collects a bunch of pages (associated with buffers) scattered inside
> > > > > the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> > > > > them in a dmabuf fd. And, since we provide zero-copy access, we
> > > > > use DMA fences to ensure that the components on the Host and
> > > > > Guest do not access the buffer simultaneously.
> > > >
> > > > So why do you need to track updates proactively like this?
> > > As David noted in the earlier series, if Qemu punches a hole in its memfd
> > > that goes through pages that are registered against a udmabuf fd, then
> > > udmabuf needs to update its list with new pages when the hole gets
> > > filled after (guest) writes. Otherwise, we'd run into the coherency
> > > problem (between udmabuf and memfd) as demonstrated in the selftest
> > > (patch #3 in this series).
> > 
> > Wouldn't this all be very much better if Qemu stopped punching holes there?
> I think holes can be punched anywhere in the memfd for various reasons. Some

I just start to read this thread, even haven't finished all of them.. but
so far I'm not sure whether this is right at all..

udmabuf is a file, it means it should follow the file semantics. mmu
notifier is per-mm, otoh.

Imagine for some reason QEMU mapped the guest pages twice, udmabuf is
created with vma1, so udmabuf registers the mm changes over vma1 only.

However the shmem/hugetlb page cache can be populated in either vma1, or
vma2.  It means when populating on vma2 udmabuf won't get update notify at
all, udmabuf pages can still be obsolete.  Same thing to when multi-process
QEMU is used, where we can have vma1 in QEMU while vma2 in the other
process like vhost-user.

I think the trick here is we tried to "hide" the fact that these are
actually normal file pages, but we're doing PFNMAP on them... then we want
the file features back, like hole punching..

If we used normal file operations, everything will just work fine; TRUNCATE
will unmap the host mapped frame buffers when needed, and when accessed
it'll fault on demand from the page cache.  We seem to be trying to
reinvent "truncation" for pfnmap but mmu notifier doesn't sound right to
this at least..

> of the use-cases where this would be done were identified by David. Here is 
> what
> he said in an earlier discussion:
> "There are *probably* more issues on the QEMU side when udmabuf is paired 
> with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for 
> virtio-balloon, virtio-mem, postcopy live migration, ... for example, in"

Now after seething this, I'm truly wondering whether we can still simply
use the file semantics we already have (for either shmem/hugetlb/...), or
is it a must we need to use a single fd to represent all?

Say, can we just use a tuple (fd, page_array) rather than the udmabuf
itself to do host zero-copy mapping?  the page_array can be e.g. a list of
file offsets that points to the pages (rather than pinning the pages using
FOLL_GET).  The good thing is then the fd can be the guest memory file
itself.  With that, we can mmap() over the shmem/hugetlb in whatever vma
and whatever process.  Truncation (and actually everything... e.g. page
migration, swapping, ... which will be disabled if we use PFNMAP pins) will
just all start to work, afaiu.

Thanks,

-- 
Peter Xu

Re: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-07-31 Thread Peter Xu

  break;
> }
> if (addr != -EFAULT)
> break;
> }
> mmap_read_unlock(vmm_mm);
> 
> return vma;
> }

This is hackish to me, and not working when across mm (multi-proc qemu).

> 
> > QEMU is used, where we can have vma1 in QEMU while vma2 in the other
> > process like vhost-user.
> > 
> > I think the trick here is we tried to "hide" the fact that these are
> > actually normal file pages, but we're doing PFNMAP on them... then we want
> > the file features back, like hole punching..
> > 
> > If we used normal file operations, everything will just work fine; TRUNCATE
> > will unmap the host mapped frame buffers when needed, and when
> > accessed
> > it'll fault on demand from the page cache.  We seem to be trying to
> > reinvent "truncation" for pfnmap but mmu notifier doesn't sound right to
> > this at least..
> If we can figure out the VMA ranges where the guest buffer pages are mapped,
> we should be able to register mmu notifiers for those ranges right?

In general, sorry to say that, but, mmu notifiers still do not sound like
the right approach here.

> 
> > 
> > > of the use-cases where this would be done were identified by David. Here
> > is what
> > > he said in an earlier discussion:
> > > "There are *probably* more issues on the QEMU side when udmabuf is
> > paired
> > > with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for
> > > virtio-balloon, virtio-mem, postcopy live migration, ... for example, in"
> > 
> > Now after seething this, I'm truly wondering whether we can still simply
> > use the file semantics we already have (for either shmem/hugetlb/...), or
> > is it a must we need to use a single fd to represent all?
> > 
> > Say, can we just use a tuple (fd, page_array) rather than the udmabuf
> > itself to do host zero-copy mapping?  the page_array can be e.g. a list of
> That (tuple) is essentially what we are doing (with udmabuf) but in a
> standardized way that follows convention using the dmabuf buffer sharing
> framework that all the importers (other drivers and userspace components)
> know and understand.
> 
> > file offsets that points to the pages (rather than pinning the pages using
> If we are using the dmabuf framework, the pages must be pinned when the
> importers map them.

Oh so the pages are for DMAs from hardwares, rather than accessed by the
host programs?

I really have merely zero knowledge from that aspect, sorry.  If so I don't
know how truncation can work with that, while keeping the page coherent.

Hugh asked why not QEMU just doesn't do that truncation, I'll then ask the
same.  Probably virtio-mem will not be able to work. I think postcopy will
not be affected - postcopy only drops pages at very early stage of dest
QEMU, not after VM started there, so either not affected or maybe there's
chance it'll work.

IIUC it's then the same as VFIO attached then we try to blow some pages
away from anything like virtio-balloon - AFAIR qemu just explicitly don't
allow that to happen.  See vfio_ram_block_discard_disable().

> 
> > FOLL_GET).  The good thing is then the fd can be the guest memory file
> > itself.  With that, we can mmap() over the shmem/hugetlb in whatever vma
> > and whatever process.  Truncation (and actually everything... e.g. page
> > migration, swapping, ... which will be disabled if we use PFNMAP pins) will
> > just all start to work, afaiu.
> IIUC, we'd not be able to use the fd of the guest memory file because the
> dmabuf fds are expected to have constant size that reflects the size of the
> buffer that is being shared. I just don't think it'd be feasible given all the
> other restrictions:
> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html?highlight=dma_buf#userspace-interface-notes

Yeah I also don't know well on the dmabuf APIs, but I think if the page
must be pinned for real world DMA then it's already another story to me..
what I said on the [guest_mem_fd, offset_array] tuple idea could only (if
still possible..) work if the udmabuf access is only from the processor
side, never from the device.

Thanks,

-- 
Peter Xu

Re: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-08-01 Thread Peter Xu

On Tue, Aug 01, 2023 at 07:11:09AM +, Kasireddy, Vivek wrote:
> Ok, I'll keep your use-case in mind but AFAICS, the process that creates
> the udmabuf can be considered the owner. So, I think it makes sense that
> the owner's VMA range can be registered (via mmu_notifiers) for updates.

No need to have your special attention on this; my use case is not anything
useful with details, just wanted to show the idea that virtual address
range based notification might not work.

[...]

> What limitation do you see with the usage of mmu notifiers for this use-case?
> And, if using mmu notifiers is not the right approach, how do you suggest we
> can solve this problem?

AFAIU, even if there'll be a notification chanism, it needs to be at least
in per-file address space (probably in file offsets) rather than per-mm for
a shmem backend, so that any mapping of the file should notify that.

Isn't it already too late though to wait that notification until page is
installed?  Because here you pinned the page for DMA, I think it means
before a new page installed (but after the page is invalidated) the device
can DMA to an invalid buffer.

To come back to the original question: I don't know how that could work at
all, the userapp should just never do that invalidation, because right
after it does, the dma buffer will be invalid, and the device can update
data into trash.  So.. I don't have an easy way to do this right.. besides
disabling ram discard just like what vfio does already.

Thanks,

-- 
Peter Xu

Re: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-08-03 Thread Peter Xu

Hi, Vivek,

On Thu, Aug 03, 2023 at 08:08:41AM +, Kasireddy, Vivek wrote:
> > Isn't it already too late though to wait that notification until page is
> > installed?  Because here you pinned the page for DMA, I think it means
> > before a new page installed (but after the page is invalidated) the device
> > can DMA to an invalid buffer.
>
> The page is only invalidated in the memfd. Until the hole is written to,
> we (udmabuf) can choose to handle any reads (or DMA) using old pages
> if needed.

But what happens if there's DMA writes?  I don't see anything that will
stop the device from doing so - the whole design looks fully transparent, I
just still don't see how it can be done without synchronizing with the
device.

IIUC, we need to e.g. quiesce the device when any page got invalidated in
some way like hole punching, and should happen right before it happens
(comparing to the notification of new page update which should be right
after the installation OTOH).

I think the vfio use case currently face the same condition and challenge,
assuming there's currently no easy solution so that was just prohibited.  I
guess people are just waiting for hardware vendors to support device page
faults, like processors - then we can synchronize with the device using the
device IOMMU page tables (by clearing it at proper time and blocks DMA
writes).

Thanks,

-- 
Peter Xu

Re: [PATCH] mm: do not rely on mm == current->mm in __get_user_pages_locked

2020-09-28 Thread Peter Xu

On Mon, Sep 28, 2020 at 12:35:07PM +0200, Jason A. Donenfeld wrote:
> It seems likely this block was pasted from internal_get_user_pages_fast,
> which is not passed an mm struct and therefore uses current's. But
> __get_user_pages_locked is passed an explicit mm, and current->mm is not
> always valid. This was hit when being called from i915, which uses:
> 
>   pin_user_pages_remote->
> __get_user_pages_remote->
>   __gup_longterm_locked->
> __get_user_pages_locked

Afaict it's not only an "current->mm can be NULL" issue - because this flag is
used to mark "whether the mm pinned any page", so for remote pinning we
definitely should mark the remote mm rather than the current mm, simply because
it's the target mm page table that we'd want to stablize rather than the
current->mm (even if current->mm always existed).

> 
> Before, this would lead to an OOPS:
> 
>   BUG: kernel NULL pointer dereference, address: 0064
>   #PF: supervisor write access in kernel mode
>   #PF: error_code(0x0002) - not-present page
>   PGD 0 P4D 0
>   Oops: 0002 [#1] SMP
>   CPU: 10 PID: 1431 Comm: kworker/u33:1 Tainted: P S   U O  
> 5.9.0-rc7+ #140
>   Hardware name: LENOVO 20QTCTO1WW/20QTCTO1WW, BIOS N2OET47W (1.34 ) 
> 08/06/2020
>   Workqueue: i915-userptr-acquire __i915_gem_userptr_get_pages_worker [i915]
>   RIP: 0010:__get_user_pages_remote+0xd7/0x310
>   Code: f5 01 00 00 83 7d 00 01 0f 85 ed 01 00 00 f7 c1 00 00 04 00 0f 84 58 
> 01 00 00 65 48 8b 04 25 00 6d 01 00 48 8b 80 40 03 00 00  40 64 01 00 00 
> 00 65 48 8b 04 25 00 6d 01 00 48 c7 44 24 18 00
>   RSP: 0018:888fdfe47de0 EFLAGS: 00010206
>   RAX:  RBX: 7fe188531000 RCX: 00040001
>   RDX: 0001 RSI: 7fe188531000 RDI: 888ff0748f00
>   RBP: 888fdfe47e54 R08: 888fedc7d7c8 R09: 
>   R10: 0018 R11: fefefefefefefeff R12: 888ff0748f00
>   R13: 888fedc7d7c8 R14: 888f81fe3a40 R15: 00042003
>   FS:  () GS:888ffc48() knlGS:
>   CS:  0010 DS:  ES:  CR0: 80050033
>   CR2: 0064 CR3: 02009003 CR4: 003706e0
>   DR0:  DR1:  DR2: 
>   DR3:  DR6: fffe0ff0 DR7: 0400
>   Call Trace:
>__i915_gem_userptr_get_pages_worker+0xc8/0x260 [i915]
>process_one_work+0x1ca/0x390
>worker_thread+0x48/0x3c0
>? rescuer_thread+0x3d0/0x3d0
>kthread+0x114/0x130
>? kthread_create_worker_on_cpu+0x40/0x40
>ret_from_fork+0x1f/0x30
>   CR2: 0064
> 
> This commit fixes the problem by using the mm pointer passed to the
> function rather than the bogus one in current.
> 
> Fixes: 008cfe4418b3 ("mm: Introduce mm_struct.has_pinned")
> Cc: Jason Gunthorpe 
> Cc: Peter Xu 
> Signed-off-by: Jason A. Donenfeld 
> ---
>  mm/gup.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index dfe781d2ad4c..e869c634cc9a 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1256,7 +1256,7 @@ static __always_inline long 
> __get_user_pages_locked(struct mm_struct *mm,
>   }
>  
>   if (flags & FOLL_PIN)
> - atomic_set(¤t->mm->has_pinned, 1);
> + atomic_set(&mm->has_pinned, 1);
>  
>   /*
>* FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
> -- 
> 2.28.0
> 

Thanks!  And sorry for this silly mistake.  I even didn't understand how it was
written, because the normal gup change should have come earlier, anyway...

Reviewed-by: Peter Xu 

-- 
Peter Xu

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()

2025-04-28 Thread Peter Xu

On Mon, Apr 28, 2025 at 04:58:46PM +0200, David Hildenbrand wrote:
> 
> > > What it does on PAT (only implementation so far ...) is looking up the
> > > memory type to select the caching mode that can be use.
> > > 
> > > "sanitize" was IMHO a good fit, because we must make sure that we don't 
> > > use
> > > the wrong caching mode.
> > > 
> > > update/setup/... don't make that quite clear. Any other suggestions?
> > 
> > I'm very poor on naming.. :( So far anything seems slightly better than
> > sanitize to me, as the word "sanitize" is actually also used in memtype.c
> > for other purpose.. see sanitize_phys().
> 
> Sure, one can sanitize a lot of things. Here it's the cachemode/pgrpot, in
> the other functions it's an address.
> 
> Likely we should just call it pfnmap_X_cachemode()/
> 
> Set/update don't really fit for X in case pfnmap_X_cachemode() is a NOP.
> 
> pfnmap_setup_cachemode() ? Hm.

Sounds good here.

> 
> > 
> > > 
> > > > 
> > > > > + * @pfn: the start of the pfn range
> > > > > + * @size: the size of the pfn range
> > > > > + * @prot: the pgprot to sanitize
> > > > > + *
> > > > > + * Sanitize the given pgprot for a pfn range, for example, adjusting 
> > > > > the
> > > > > + * cachemode.
> > > > > + *
> > > > > + * This function cannot fail for a single page, but can fail for 
> > > > > multiple
> > > > > + * pages.
> > > > > + *
> > > > > + * Returns 0 on success and -EINVAL on error.
> > > > > + */
> > > > > +int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
> > > > > + pgprot_t *prot);
> > > > >extern int track_pfn_copy(struct vm_area_struct *dst_vma,
> > > > >   struct vm_area_struct *src_vma, unsigned long *pfn);
> > > > >extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > index fdcf0a6049b9f..b8ae5e1493315 100644
> > > > > --- a/mm/huge_memory.c
> > > > > +++ b/mm/huge_memory.c
> > > > > @@ -1455,7 +1455,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault 
> > > > > *vmf, pfn_t pfn, bool write)
> > > > >   return VM_FAULT_OOM;
> > > > >   }
> > > > > - track_pfn_insert(vma, &pgprot, pfn);
> > > > > + if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, 
> > > > > &pgprot))
> > > > > + return VM_FAULT_FALLBACK;
> > > > 
> > > > Would "pgtable" leak if it fails?  If it's PAGE_SIZE, IIUC it won't ever
> > > > trigger, though.
> > > > 
> > > > Maybe we could have a "void pfnmap_sanitize_pgprot_pfn(&pgprot, pfn)" to
> > > > replace track_pfn_insert() and never fail?  Dropping vma ref is 
> > > > definitely
> > > > a win already in all cases.
> > > 
> > > It could be a simple wrapper around pfnmap_sanitize_pgprot(), yes. That's
> > > certainly helpful for the single-page case.
> > > 
> > > Regarding never failing here: we should check the whole range. We have to
> > > make sure that none of the pages has a memory type / caching mode that is
> > > incompatible with what we setup.
> > 
> > Would it happen in real world?
> > > IIUC per-vma registration needs to happen first, which checks for
> memtype
> > conflicts in the first place, or reserve_pfn_range() could already have
> > failed.
> > > Here it's the fault path looking up the memtype, so I would expect it is
> > guaranteed all pfns under the same vma is following the verified (and same)
> > memtype?
> 
> The whole point of track_pfn_insert() is that it is used when we *don't* use
> reserve_pfn_range()->track_pfn_remap(), no?
> 
> track_pfn_remap() would check the whole range that gets mapped, so
> track_pfn_insert() user must similarly check the whole range that gets
> mapped.
> 
> Note that even track_pfn_insert() is already pretty clear on the intended
> usage: "called when a _new_ single pfn is established"

We need to define "new" then..  But I agree it's not crystal clear at
least.  I think I just wasn't the first to assume it was reserved, see this
(especially, the "Expectation" part..):

commit 5180da410db6369d1f95c9014da1c9bc33fb043e
Author: Suresh Siddha 
Date:   Mon Oct 8 16:28:29 2012 -0700

x86, pat: separate the pfn attribute tracking for remap_pfn_range and 
vm_insert_pfn

With PAT enabled, vm_insert_pfn() looks up the existing pfn memory
attribute and uses it.  Expectation is that the driver reserves the
memory attributes for the pfn before calling vm_insert_pfn().

-- 
Peter Xu

Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()

2025-04-28 Thread Peter Xu

On Mon, Apr 28, 2025 at 06:16:21PM +0200, David Hildenbrand wrote:
> > Probably due to what config you have.  E.g., when I'm looking mine it's
> > much bigger and already consuming 256B, but it's because I enabled more
> > things (userfaultfd, lockdep, etc.).
> 
> Note that I enabled everything that you would expect on a production system
> (incld. userfaultfd, mempolicy, per-vma locks), so I didn't enable lockep.

I still doubt whether you at least enabled userfaultfd, e.g., your previous
paste has:

  struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176 0 */

Not something that matters.. but just in case you didn't use the expected
config file you wanted to use..

-- 
Peter Xu

Re: [PATCH v1 04/11] mm/memremap: convert to pfnmap_track() + pfnmap_untrack()

2025-04-25 Thread Peter Xu

On Fri, Apr 25, 2025 at 10:17:08AM +0200, David Hildenbrand wrote:
> Let's use the new, cleaner interface.
> 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/memremap.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 2aebc1b192da9..c417c843e9b1f 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -130,7 +130,7 @@ static void pageunmap_range(struct dev_pagemap *pgmap, 
> int range_id)
>   }
>   mem_hotplug_done();
>  
> - untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
> + pfnmap_untrack(PHYS_PFN(range->start), range_len(range));
>   pgmap_array_delete(range);
>  }
>  
> @@ -211,8 +211,8 @@ static int pagemap_range(struct dev_pagemap *pgmap, 
> struct mhp_params *params,
>   if (nid < 0)
>   nid = numa_mem_id();
>  
> - error = track_pfn_remap(NULL, ¶ms->pgprot, PHYS_PFN(range->start), 
> 0,
> - range_len(range));
> + error = pfnmap_track(PHYS_PFN(range->start), range_len(range),
> +  ¶ms->pgprot);
>   if (error)
>   goto err_pfn_remap;
>  
> @@ -277,7 +277,7 @@ static int pagemap_range(struct dev_pagemap *pgmap, 
> struct mhp_params *params,
>   if (!is_private)
>   kasan_remove_zero_shadow(__va(range->start), range_len(range));
>  err_kasan:
> - untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
> + pfnmap_untrack(PHYS_PFN(range->start), range_len(range));

Not a huge deal, but maybe we could merge this and previous patch?  It
might be easier to reference the impl when reading the call site changes.

>  err_pfn_remap:
>   pgmap_array_delete(range);
>   return error;
> -- 
> 2.49.0
> 

-- 
Peter Xu

Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()

2025-04-25 Thread Peter Xu

t_t *prot);
>  extern int track_pfn_copy(struct vm_area_struct *dst_vma,
>   struct vm_area_struct *src_vma, unsigned long *pfn);
>  extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fdcf0a6049b9f..b8ae5e1493315 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1455,7 +1455,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, 
> pfn_t pfn, bool write)
>   return VM_FAULT_OOM;
>   }
>  
> - track_pfn_insert(vma, &pgprot, pfn);
> + if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
> + return VM_FAULT_FALLBACK;

Would "pgtable" leak if it fails?  If it's PAGE_SIZE, IIUC it won't ever
trigger, though.

Maybe we could have a "void pfnmap_sanitize_pgprot_pfn(&pgprot, pfn)" to
replace track_pfn_insert() and never fail?  Dropping vma ref is definitely
a win already in all cases.

> +
>   ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>   error = insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write,
>   pgtable);
> @@ -1577,7 +1579,8 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, 
> pfn_t pfn, bool write)
>   if (addr < vma->vm_start || addr >= vma->vm_end)
>   return VM_FAULT_SIGBUS;
>  
> - track_pfn_insert(vma, &pgprot, pfn);
> + if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
> + return VM_FAULT_FALLBACK;
>  
>   ptl = pud_lock(vma->vm_mm, vmf->pud);
>   insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
> diff --git a/mm/memory.c b/mm/memory.c
> index 424420349bd3c..c737a8625866a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2563,7 +2563,7 @@ vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct 
> *vma, unsigned long addr,
>   if (!pfn_modify_allowed(pfn, pgprot))
>   return VM_FAULT_SIGBUS;
>  
> - track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
> + pfnmap_sanitize_pgprot(pfn, PAGE_SIZE, &pgprot);
>  
>   return insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
>   false);
> @@ -2626,7 +2626,7 @@ static vm_fault_t __vm_insert_mixed(struct 
> vm_area_struct *vma,
>   if (addr < vma->vm_start || addr >= vma->vm_end)
>   return VM_FAULT_SIGBUS;
>  
> - track_pfn_insert(vma, &pgprot, pfn);
> + pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot);
>  
>   if (!pfn_modify_allowed(pfn_t_to_pfn(pfn), pgprot))
>   return VM_FAULT_SIGBUS;
> -- 
> 2.49.0
> 

-- 
Peter Xu

Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()

2025-04-25 Thread Peter Xu

On Fri, Apr 25, 2025 at 10:17:09AM +0200, David Hildenbrand wrote:
> Let's use our new interface. In remap_pfn_range(), we'll now decide
> whether we have to track (full VMA covered) or only sanitize the pgprot
> (partial VMA covered).
> 
> Remember what we have to untrack by linking it from the VMA. When
> duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it similar
> to anon VMA names, and use a kref to share the tracking.
> 
> Once the last VMA un-refs our tracking data, we'll do the untracking,
> which simplifies things a lot and should sort our various issues we saw
> recently, for example, when partially unmapping/zapping a tracked VMA.
> 
> This change implies that we'll keep tracking the original PFN range even
> after splitting + partially unmapping it: not too bad, because it was
> not working reliably before. The only thing that kind-of worked before
> was shrinking such a mapping using mremap(): we managed to adjust the
> reservation in a hacky way, now we won't adjust the reservation but
> leave it around until all involved VMAs are gone.
> 
> Signed-off-by: David Hildenbrand 
> ---
>  include/linux/mm_inline.h |  2 +
>  include/linux/mm_types.h  | 11 ++
>  kernel/fork.c | 54 --
>  mm/memory.c   | 81 +++
>  mm/mremap.c   |  4 --
>  5 files changed, 128 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index f9157a0c42a5c..89b518ff097e6 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name 
> *anon_name1,
>  
>  #endif  /* CONFIG_ANON_VMA_NAME */
>  
> +void pfnmap_track_ctx_release(struct kref *ref);
> +
>  static inline void init_tlb_flush_pending(struct mm_struct *mm)
>  {
>   atomic_set(&mm->tlb_flush_pending, 0);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 56d07edd01f91..91124761cfda8 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -764,6 +764,14 @@ struct vma_numab_state {
>   int prev_scan_seq;
>  };
>  
> +#ifdef __HAVE_PFNMAP_TRACKING
> +struct pfnmap_track_ctx {
> + struct kref kref;
> + unsigned long pfn;
> + unsigned long size;
> +};
> +#endif
> +
>  /*
>   * This struct describes a virtual memory area. There is one of these
>   * per VM-area/task. A VM area is any part of the process virtual memory
> @@ -877,6 +885,9 @@ struct vm_area_struct {
>   struct anon_vma_name *anon_name;
>  #endif
>   struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> +#ifdef __HAVE_PFNMAP_TRACKING
> + struct pfnmap_track_ctx *pfnmap_track_ctx;
> +#endif

So this was originally the small concern (or is it small?) that this will
grow every vma on x86, am I right?

After all pfnmap vmas are the minority, I was thinking whether we could
work it out without extending vma struct.

I had a quick thought quite a while ago, but never tried out (it was almost
off-track since vfio switched away from remap_pfn_range..), which is to
have x86 maintain its own mapping of vma <-> pfn tracking using a global
stucture.  After all, the memtype code did it already with the
memtype_rbroot, so I was thinking if vma info can be memorized as well, so
as to get rid of get_pat_info() too.

Maybe it also needs the 2nd layer like what you did with the track ctx, but
the tree maintains the mapping instead of adding the ctx pointer into vma.

Maybe it could work with squashing the two layers (or say, extending
memtype rbtree), but maybe not..

It could make it slightly slower than vma->pfnmap_track_ctx ref when
looking up pfn when holding a vma ref, but I assume it's ok considering
that track/untrack should be slow path for pfnmaps, and pfnmaps shouldn't
be a huge lot.

I didn't think further, but if that'll work it'll definitely avoids the
additional fields on x86 vmas.  I'm curious whether you explored that
direction, or maybe it's a known decision that the 8 bytes isn't a concern.

Thanks,

-- 
Peter Xu

Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()

2025-04-25 Thread Peter Xu

On Fri, Apr 25, 2025 at 09:48:50PM +0200, David Hildenbrand wrote:
> On 25.04.25 21:31, Peter Xu wrote:
> > On Fri, Apr 25, 2025 at 10:17:06AM +0200, David Hildenbrand wrote:
> > > ... by factoring it out from track_pfn_remap().
> > > 
> > > For PMDs/PUDs, actually check the full range, and trigger a fallback
> > > if we run into this "different memory types / cachemodes" scenario.
> > 
> > The current patch looks like to still pass PAGE_SIZE into the new helper at
> > all track_pfn_insert() call sites, so it seems this comment does not 100%
> > match with the code?  Or I may have misread somewhere.
> 
> No, you're right, while reshuffling the patches I forgot to add the actual
> PMD/PUD size.
> 
> > 
> > Maybe it's still easier to keep the single-pfn lookup to never fail..  more
> > below.
> > 
> 
> [...]
> 
> > >   /*
> > > @@ -1556,8 +1553,23 @@ static inline void untrack_pfn_clear(struct 
> > > vm_area_struct *vma)
> > >   extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> > >  unsigned long pfn, unsigned long addr,
> > >  unsigned long size);
> > > -extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
> > > -  pfn_t pfn);
> > > +
> > > +/**
> > > + * pfnmap_sanitize_pgprot - sanitize the pgprot for a pfn range
> > 
> > Nit: s/sanitize/update|setup|.../?
> > 
> > But maybe you have good reason to use sanitize.  No strong opinions.
> 
> What it does on PAT (only implementation so far ...) is looking up the
> memory type to select the caching mode that can be use.
> 
> "sanitize" was IMHO a good fit, because we must make sure that we don't use
> the wrong caching mode.
> 
> update/setup/... don't make that quite clear. Any other suggestions?

I'm very poor on naming.. :( So far anything seems slightly better than
sanitize to me, as the word "sanitize" is actually also used in memtype.c
for other purpose.. see sanitize_phys().

> 
> > 
> > > + * @pfn: the start of the pfn range
> > > + * @size: the size of the pfn range
> > > + * @prot: the pgprot to sanitize
> > > + *
> > > + * Sanitize the given pgprot for a pfn range, for example, adjusting the
> > > + * cachemode.
> > > + *
> > > + * This function cannot fail for a single page, but can fail for multiple
> > > + * pages.
> > > + *
> > > + * Returns 0 on success and -EINVAL on error.
> > > + */
> > > +int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
> > > + pgprot_t *prot);
> > >   extern int track_pfn_copy(struct vm_area_struct *dst_vma,
> > >   struct vm_area_struct *src_vma, unsigned long *pfn);
> > >   extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index fdcf0a6049b9f..b8ae5e1493315 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -1455,7 +1455,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, 
> > > pfn_t pfn, bool write)
> > >   return VM_FAULT_OOM;
> > >   }
> > > - track_pfn_insert(vma, &pgprot, pfn);
> > > + if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
> > > + return VM_FAULT_FALLBACK;
> > 
> > Would "pgtable" leak if it fails?  If it's PAGE_SIZE, IIUC it won't ever
> > trigger, though.
> > 
> > Maybe we could have a "void pfnmap_sanitize_pgprot_pfn(&pgprot, pfn)" to
> > replace track_pfn_insert() and never fail?  Dropping vma ref is definitely
> > a win already in all cases.
> 
> It could be a simple wrapper around pfnmap_sanitize_pgprot(), yes. That's
> certainly helpful for the single-page case.
> 
> Regarding never failing here: we should check the whole range. We have to
> make sure that none of the pages has a memory type / caching mode that is
> incompatible with what we setup.

Would it happen in real world?

IIUC per-vma registration needs to happen first, which checks for memtype
conflicts in the first place, or reserve_pfn_range() could already have
failed.

Here it's the fault path looking up the memtype, so I would expect it is
guaranteed all pfns under the same vma is following the verified (and same)
memtype?

Thanks,

-- 
Peter Xu

Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()

2025-04-29 Thread Peter Xu

On Mon, Apr 28, 2025 at 10:37:49PM +0200, David Hildenbrand wrote:
> On 28.04.25 18:21, Peter Xu wrote:
> > On Mon, Apr 28, 2025 at 04:58:46PM +0200, David Hildenbrand wrote:
> > > 
> > > > > What it does on PAT (only implementation so far ...) is looking up the
> > > > > memory type to select the caching mode that can be use.
> > > > > 
> > > > > "sanitize" was IMHO a good fit, because we must make sure that we 
> > > > > don't use
> > > > > the wrong caching mode.
> > > > > 
> > > > > update/setup/... don't make that quite clear. Any other suggestions?
> > > > 
> > > > I'm very poor on naming.. :( So far anything seems slightly better than
> > > > sanitize to me, as the word "sanitize" is actually also used in 
> > > > memtype.c
> > > > for other purpose.. see sanitize_phys().
> > > 
> > > Sure, one can sanitize a lot of things. Here it's the cachemode/pgrpot, in
> > > the other functions it's an address.
> > > 
> > > Likely we should just call it pfnmap_X_cachemode()/
> > > 
> > > Set/update don't really fit for X in case pfnmap_X_cachemode() is a NOP.
> > > 
> > > pfnmap_setup_cachemode() ? Hm.
> > 
> > Sounds good here.
> 
> Okay, I'll use that one. If ever something else besides PAT would require
> different semantics, they can bother with finding a better name :)
> 
> > 
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > > + * @pfn: the start of the pfn range
> > > > > > > + * @size: the size of the pfn range
> > > > > > > + * @prot: the pgprot to sanitize
> > > > > > > + *
> > > > > > > + * Sanitize the given pgprot for a pfn range, for example, 
> > > > > > > adjusting the
> > > > > > > + * cachemode.
> > > > > > > + *
> > > > > > > + * This function cannot fail for a single page, but can fail for 
> > > > > > > multiple
> > > > > > > + * pages.
> > > > > > > + *
> > > > > > > + * Returns 0 on success and -EINVAL on error.
> > > > > > > + */
> > > > > > > +int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
> > > > > > > + pgprot_t *prot);
> > > > > > > extern int track_pfn_copy(struct vm_area_struct *dst_vma,
> > > > > > >   struct vm_area_struct *src_vma, unsigned long 
> > > > > > > *pfn);
> > > > > > > extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > > > index fdcf0a6049b9f..b8ae5e1493315 100644
> > > > > > > --- a/mm/huge_memory.c
> > > > > > > +++ b/mm/huge_memory.c
> > > > > > > @@ -1455,7 +1455,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct 
> > > > > > > vm_fault *vmf, pfn_t pfn, bool write)
> > > > > > >   return VM_FAULT_OOM;
> > > > > > >   }
> > > > > > > - track_pfn_insert(vma, &pgprot, pfn);
> > > > > > > + if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, 
> > > > > > > &pgprot))
> > > > > > > + return VM_FAULT_FALLBACK;
> > > > > > 
> > > > > > Would "pgtable" leak if it fails?  If it's PAGE_SIZE, IIUC it won't 
> > > > > > ever
> > > > > > trigger, though.
> > > > > > 
> > > > > > Maybe we could have a "void pfnmap_sanitize_pgprot_pfn(&pgprot, 
> > > > > > pfn)" to
> > > > > > replace track_pfn_insert() and never fail?  Dropping vma ref is 
> > > > > > definitely
> > > > > > a win already in all cases.
> > > > > 
> > > > > It could be a simple wrapper around pfnmap_sanitize_pgprot(), yes. 
> > > > > That's
> > > > > certainly helpful for the single-page case.
> > > > > 
> > > > > Regarding never failing here: we should check the whole range. We 
> > > > > have to
> > > > > make sure that none of the

Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()

2025-04-28 Thread Peter Xu

On Fri, Apr 25, 2025 at 10:36:55PM +0200, David Hildenbrand wrote:
> On 25.04.25 22:23, Peter Xu wrote:
> > On Fri, Apr 25, 2025 at 10:17:09AM +0200, David Hildenbrand wrote:
> > > Let's use our new interface. In remap_pfn_range(), we'll now decide
> > > whether we have to track (full VMA covered) or only sanitize the pgprot
> > > (partial VMA covered).
> > > 
> > > Remember what we have to untrack by linking it from the VMA. When
> > > duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it similar
> > > to anon VMA names, and use a kref to share the tracking.
> > > 
> > > Once the last VMA un-refs our tracking data, we'll do the untracking,
> > > which simplifies things a lot and should sort our various issues we saw
> > > recently, for example, when partially unmapping/zapping a tracked VMA.
> > > 
> > > This change implies that we'll keep tracking the original PFN range even
> > > after splitting + partially unmapping it: not too bad, because it was
> > > not working reliably before. The only thing that kind-of worked before
> > > was shrinking such a mapping using mremap(): we managed to adjust the
> > > reservation in a hacky way, now we won't adjust the reservation but
> > > leave it around until all involved VMAs are gone.
> > > 
> > > Signed-off-by: David Hildenbrand 
> > > ---
> > >   include/linux/mm_inline.h |  2 +
> > >   include/linux/mm_types.h  | 11 ++
> > >   kernel/fork.c | 54 --
> > >   mm/memory.c   | 81 +++
> > >   mm/mremap.c   |  4 --
> > >   5 files changed, 128 insertions(+), 24 deletions(-)
> > > 
> > > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > > index f9157a0c42a5c..89b518ff097e6 100644
> > > --- a/include/linux/mm_inline.h
> > > +++ b/include/linux/mm_inline.h
> > > @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct 
> > > anon_vma_name *anon_name1,
> > >   #endif  /* CONFIG_ANON_VMA_NAME */
> > > +void pfnmap_track_ctx_release(struct kref *ref);
> > > +
> > >   static inline void init_tlb_flush_pending(struct mm_struct *mm)
> > >   {
> > >   atomic_set(&mm->tlb_flush_pending, 0);
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 56d07edd01f91..91124761cfda8 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -764,6 +764,14 @@ struct vma_numab_state {
> > >   int prev_scan_seq;
> > >   };
> > > +#ifdef __HAVE_PFNMAP_TRACKING
> > > +struct pfnmap_track_ctx {
> > > + struct kref kref;
> > > + unsigned long pfn;
> > > + unsigned long size;
> > > +};
> > > +#endif
> > > +
> > >   /*
> > >* This struct describes a virtual memory area. There is one of these
> > >* per VM-area/task. A VM area is any part of the process virtual memory
> > > @@ -877,6 +885,9 @@ struct vm_area_struct {
> > >   struct anon_vma_name *anon_name;
> > >   #endif
> > >   struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> > > +#ifdef __HAVE_PFNMAP_TRACKING
> > > + struct pfnmap_track_ctx *pfnmap_track_ctx;
> > > +#endif
> > 
> > So this was originally the small concern (or is it small?) that this will
> > grow every vma on x86, am I right?
> 
> Yeah, and last time I looked into this, it would have grown it such that it 
> would
> require a bigger slab. Right now:

Probably due to what config you have.  E.g., when I'm looking mine it's
much bigger and already consuming 256B, but it's because I enabled more
things (userfaultfd, lockdep, etc.).

> 
> Before this change:
> 
> struct vm_area_struct {
>   union {
>   struct {
>   long unsigned int vm_start;  /* 0 8 */
>   long unsigned int vm_end;/* 8 8 */
>   };   /* 016 */
>   freeptr_t  vm_freeptr;   /* 0 8 */
>   };   /* 016 */
>   struct mm_struct * vm_mm;/*16 8 */
>   pgprot_t   vm_page_prot; /*24 8 */
>   union {
>   const vm_flags_t   vm_flags; /*32

Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()

2025-04-29 Thread Peter Xu

On Tue, Apr 29, 2025 at 06:25:06PM +0200, David Hildenbrand wrote:
> I do wonder why we even have to lookup the memtype again if the caller
> apparently reserved it (which implied checking it). All a bit weird.

Maybe it's because the memtype info isn't always visible to the upper
layers, e.g. default pci_iomap() for MMIOs doesn't need to specify anything
on cache mode.  There's some pci_iomap_wc() variance, but still looks like
only the internal has full control..

-- 
Peter Xu

53 matches

Mail list logo