Re: [PATCH] mm: Remove double faults once write a device pfn

2024-01-24 Thread Alistair Popple


"Zhou, Xianrong"  writes:

>> > The vmf_insert_pfn_prot could cause unnecessary double faults on a
>> > device pfn, because currently vmf_insert_pfn_prot does not make the
>> > pfn writable, so the pte entry is normally read-only or set up for
>> > dirty tracking.
>>  What? How did you get to this conclusion?
>> >>> Sorry, I did not mention that this problem only exists on the arm64 platform.
>> >> Ok, that makes at least a little bit more sense.
>> >>
>> >>> Because on the arm64 platform PTE_RDONLY is automatically attached
>> >>> to userspace pte entries even with VM_WRITE + VM_SHARED.
>> >>> PTE_RDONLY needs to be cleared in vmf_insert_pfn_prot. However
>> >>> vmf_insert_pfn_prot does not make the pte writable because it
>> >>> passes false for @mkwrite to insert_pfn.
>> >> Question is why is arm64 doing this? As far as I can see they must
>> >> have some hardware reason for that.
>> >>
>> >> The mkwrite parameter to insert_pfn() was added by commit
>> >> b2770da642540 to make insert_pfn() look more like insert_pfn_pmd() so
>> >> that the DAX code can insert PTEs which are writable and dirty at the same
>> time.
>> >>
>> > This is one scenario for doing so. In fact on arm64 there are many
>> > scenarios where this could be done. So we can let vmf_insert_pfn_prot
>> > support @mkwrite for drivers at the core layer and let drivers
>> > decide whether or not to make the pte writable and dirty at the same
>> > time. The patch does this. Otherwise there are double faults on arm64
>> when calling vmf_insert_pfn_prot.
>>
>> Well, that doesn't answer my question why arm64 is double faulting in the
>> first place.
>>
>
>
> Eh.
>
> On arm64, when userspace mmap()s with PROT_WRITE and MAP_SHARED, the
> vma->vm_page_prot has both PTE_RDONLY and PTE_WRITE within
> PAGE_SHARED_EXEC (see the arm64 protection_map).
>
> When the userspace virtual address is written the first fault happens and
> calls into the driver's
> .fault->ttm_bo_vm_fault_reserved->vmf_insert_pfn_prot->insert_pfn.
> insert_pfn will establish the pte entry. However vmf_insert_pfn_prot
> passes false for @mkwrite to insert_pfn by default, so insert_pfn cannot
> make the pfn writable and does not call maybe_mkwrite(pte_mkdirty(entry), vma)
> to clear the PTE_RDONLY bit. So the pte entry is actually write-protected
> as far as the MMU is concerned.
> So when the first fault returns and the CPU re-executes the store
> instruction, a second fault happens. This second fault only does
> pte_mkdirty(entry), which clears PTE_RDONLY.
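(For reference, the insert_pfn() behaviour described above is roughly the
following; a simplified paraphrase of mm/memory.c, not the exact upstream
code.)

static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
			pfn_t pfn, pgprot_t prot, bool mkwrite)
{
	...
	/* Build the PTE from the pfn and the vma's page protection. */
	entry = pte_mkspecial(pfn_t_pte(pfn, prot));
	if (mkwrite) {
		/*
		 * Only when @mkwrite is true is the entry made dirty and
		 * writable, which on arm64 is what clears PTE_RDONLY.
		 */
		entry = pte_mkyoung(entry);
		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
	}
	set_pte_at(vma->vm_mm, addr, pte, entry);
	...
}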

It depends whether the ARM64 CPU in question supports hardware dirty bit
management (DBM). If that is the case and PTE_DBM (ie. PTE_WRITE) is set,
HW will automatically clear the PTE_RDONLY bit to mark the entry dirty
instead of raising a write fault. So you shouldn't see a double fault if
PTE_DBM/WRITE is set.

On ARM64 you can kind of think of PTE_RDONLY as the HW dirty bit and
PTE_DBM as the read/write permission bit with SW being responsible for
updating PTE_RDONLY via the fault handler if DBM is not supported by HW.

At least that's my understanding from having hacked on this in the
past. You can see all this weirdness happening in the definitions of
pte_dirty() and pte_write() for ARM64.
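For reference, a simplified sketch of those arm64 helpers (paraphrased from
arch/arm64/include/asm/pgtable.h; the exact macros vary between kernel
versions, so treat this as illustrative rather than authoritative):

/* Simplified arm64 helpers - illustrative only, not the exact upstream code. */
#define pte_write(pte)		(!!(pte_val(pte) & PTE_WRITE))	/* PTE_WRITE == PTE_DBM */
#define pte_sw_dirty(pte)	(!!(pte_val(pte) & PTE_DIRTY))	/* SW-managed dirty bit */
#define pte_hw_dirty(pte)	(pte_write(pte) && !(pte_val(pte) & PTE_RDONLY))
#define pte_dirty(pte)		(pte_sw_dirty(pte) || pte_hw_dirty(pte))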

> I think so and hope I am not wrong.
>
>> So as long as this isn't sorted out I'm going to reject this patch.
>>
>> Regards,
>> Christian.
>>
>> >
>> >> This is a completely different use case to what you try to use it
>> >> here for and that looks extremely fishy to me.
>> >>
>> >> Regards,
>> >> Christian.
>> >>
>> > The first fault only sets up the pte entry, which is actually doing
>> > dirty tracking. The second fault to the pfn follows immediately, due
>> > to that dirty tracking, when the cpu re-executes the store instruction.
>>  It could be that this is done to work around some hw behavior, but
>>  not because of dirty tracking.
>> 
>> > Normally if a driver calls vmf_insert_pfn_prot and also supplies a
>> > 'pfn_mkwrite' callback within vm_operations_struct, which requires
>> > the pte to do dirty tracking, then the vmf_insert_pfn_prot call and
>> > the double fault are reasonable. It is not a problem.
>>  Well, as far as I can see that behavior absolutely doesn't make sense.
>> 
>>  When pfn_mkwrite is requested then the driver should use PAGE_COPY,
>>  which is exactly what VMWGFX (the only driver using dirty tracking)
>>  is
>> >> doing.
>>  Everybody else uses PAGE_SHARED which should make the pte writeable
>>  immediately.
>> 
>>  Regards,
>>  Christian.
>> 
>> > However most of the drivers calling vmf_insert_pfn_prot do not
>> > supply the 'pfn_mkwrite' callback, so the second fault is
>> unnecessary.
>> >
>> > So just like the vmf_insert_mixed and vmf_insert_mixed_mkwrite pair,
>> > we should also supply vmf_insert_pfn_mkwrite for drivers.
>> >
>> > Signed-off-by: Xianrong Zhou 
>> > ---
>> > arch/x86/entry/vdso/vma.c  |  3 

Re: [Nouveau] [PATCH v2] mm: Take a page reference when removing device exclusive entries

2023-03-29 Thread Alistair Popple


Christoph Hellwig  writes:

> s/page/folio/ in the entire commit log?

I debated that but settled on leaving it as is because device exclusive
entries only deal with non-compound pages for now and didn't want to
give any other impression. Happy to change that though if people think
it would be better/clearer.


[Nouveau] [PATCH v2] mm: Take a page reference when removing device exclusive entries

2023-03-29 Thread Alistair Popple
Device exclusive page table entries are used to prevent CPU access to
a page whilst it is being accessed from a device. Typically this is
used to implement atomic operations when the underlying bus does not
support atomic access. When a CPU thread encounters a device exclusive
entry it locks the page and restores the original entry after calling
mmu notifiers to signal drivers that exclusive access is no longer
available.

The device exclusive entry holds a reference to the page making it
safe to access the struct page whilst the entry is present. However
the fault handling code does not hold the PTL when taking the page
lock. This means if there are multiple threads faulting concurrently
on the device exclusive entry one will remove the entry whilst others
will wait on the page lock without holding a reference.

This can lead to threads locking or waiting on a folio with a zero
refcount. Whilst mmap_lock prevents the pages getting freed via
munmap() they may still be freed by a migration. This leads to
warnings such as PAGE_FLAGS_CHECK_AT_FREE due to the page being locked
when the refcount drops to zero.

Fix this by trying to take a reference on the folio before locking
it. The code already checks the PTE under the PTL and aborts if the
entry is no longer there. It is also possible the folio has been
unmapped, freed and re-allocated allowing a reference to be taken on
an unrelated folio. This case is also detected by the PTE check and
the folio is unlocked without further changes.

Signed-off-by: Alistair Popple 
Reviewed-by: Ralph Campbell 
Reviewed-by: John Hubbard 
Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Cc: sta...@vger.kernel.org

---

Changes for v2:

 - Rebased to Linus master
 - Reworded commit message
 - Switched to using folios (thanks Matthew!)
 - Added Reviewed-by's
---
 mm/memory.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index f456f3b5049c..01a23ad48a04 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3563,8 +3563,21 @@ static vm_fault_t remove_device_exclusive_entry(struct 
vm_fault *vmf)
struct vm_area_struct *vma = vmf->vma;
struct mmu_notifier_range range;
 
-   if (!folio_lock_or_retry(folio, vma->vm_mm, vmf->flags))
+   /*
+* We need a reference to lock the folio because we don't hold
+* the PTL so a racing thread can remove the device-exclusive
+* entry and unmap it. If the folio is free the entry must
+* have been removed already. If it happens to have already
+* been re-allocated after being freed all we do is lock and
+* unlock it.
+*/
+   if (!folio_try_get(folio))
+   return 0;
+
+   if (!folio_lock_or_retry(folio, vma->vm_mm, vmf->flags)) {
+   folio_put(folio);
return VM_FAULT_RETRY;
+   }
mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0,
vma->vm_mm, vmf->address & PAGE_MASK,
(vmf->address & PAGE_MASK) + PAGE_SIZE, NULL);
@@ -3577,6 +3590,7 @@ static vm_fault_t remove_device_exclusive_entry(struct 
vm_fault *vmf)
 
pte_unmap_unlock(vmf->pte, vmf->ptl);
folio_unlock(folio);
+   folio_put(folio);
 
mmu_notifier_invalidate_range_end(&range);
return 0;
-- 
2.39.2



Re: [Nouveau] [PATCH] mm: Take a page reference when removing device exclusive entries

2023-03-29 Thread Alistair Popple


Alistair Popple  writes:

> John Hubbard  writes:
>
>> On 3/28/23 20:16, Matthew Wilcox wrote:
>> ...
>>>> +  if (!get_page_unless_zero(vmf->page))
>>>> +  return 0;
>>>  From a folio point of view: what the hell are you doing here?  Tail
>>> pages don't have individual refcounts; all the refcounts are actually
>
> I had stuck with using the page because none of this stuff (yet)
> supports compound pages anyway so we shouldn't see a tail page
> anyway. But point taken, I admit I need to find some time to get a
> deeper internalised understanding of folios than just s/page/folio.

And looking at this while updating it made the mixed usage of page/folio
look really weird/wrong, so thanks for pointing that out. Will send an
update.


Re: [Nouveau] [PATCH] mm: Take a page reference when removing device exclusive entries

2023-03-29 Thread Alistair Popple


John Hubbard  writes:

> On 3/28/23 20:16, Matthew Wilcox wrote:
> ...
>>> +   if (!get_page_unless_zero(vmf->page))
>>> +   return 0;
>>  From a folio point of view: what the hell are you doing here?  Tail
>> pages don't have individual refcounts; all the refcounts are actually

I had stuck with using the page because none of this stuff (yet)
supports compound pages anyway so we shouldn't see a tail page
anyway. But point taken, I admit I need to find some time to get a
deeper internalised understanding of folios than just s/page/folio.

> ohh, and I really should have caught that too. I plead spending too much
> time recently in a somewhat more driver-centric mindset, and failing to
> mentally shift gears properly for this case.
>
> Sorry for missing that!
>
> thanks,



Re: [Nouveau] [PATCH] mm: Take a page reference when removing device exclusive entries

2023-03-28 Thread Alistair Popple


John Hubbard  writes:

>> warnings such as PAGE_FLAGS_CHECK_AT_FREE due to the page being locked
>> when the refcount drops to zero. Note that during removal of the
>> device exclusive entry the PTE is currently re-checked under the PTL
>> so no further bad page accesses occur once it is locked.
>
> Maybe change that last sentence to something like this:
>
> "Fix this by taking a page reference before starting to remove a device
> exclusive pte. This is done safely in a lock-free way by first getting a
> reference via get_page_unless_zero(), and then re-checking after
> acquiring the PTL, that the page is the correct one."
>
> ?
>
> ...well, maybe that's not all that much help. But it does at least
> provide the traditional description of what the patch *does*, at
> the end of the commit description. But please treat this as just
> an optional suggestion.

My wording was probably a little awkward. The intent was to point out
the existing code subsequent to taking the page lock was already
correct/safe. I figured the patch itself does a pretty good job of
describing the actual fix so am inclined to leave it.

Andrew Morton  writes:

> On Mon, 27 Mar 2023 23:25:49 -0700 John Hubbard  wrote:
>
>> On the patch process, I see that this applies to linux-stable's 6.1.y
>> branch. I'd suggest two things:
>> 
>> 1) Normally, what I've seen done is to post against either the current
>> top of tree linux.git, or else against one of the mm-stable branches.
>> And then after it's accepted, create a version for -stable. 
>
> Yup.  I had to jiggle the patch a bit because
> mmu_notifier_range_init_owner()'s arguments have changed.  Once this
> hits mainline, the -stable maintainers will probably ask for a version
> which suits the relevant kernel version(s).

Thanks Andrew. That's my bad, I was developing on top of v6.1 and
neglected to rebase. Happy to provide versions for -stable as required.


[Nouveau] [PATCH] mm: Take a page reference when removing device exclusive entries

2023-03-27 Thread Alistair Popple
Device exclusive page table entries are used to prevent CPU access to
a page whilst it is being accessed from a device. Typically this is
used to implement atomic operations when the underlying bus does not
support atomic access. When a CPU thread encounters a device exclusive
entry it locks the page and restores the original entry after calling
mmu notifiers to signal drivers that exclusive access is no longer
available.

The device exclusive entry holds a reference to the page making it
safe to access the struct page whilst the entry is present. However
the fault handling code does not hold the PTL when taking the page
lock. This means if there are multiple threads faulting concurrently
on the device exclusive entry one will remove the entry whilst others
will wait on the page lock without holding a reference.

This can lead to threads locking or waiting on a page with a zero
refcount. Whilst mmap_lock prevents the pages getting freed via
munmap() they may still be freed by a migration. This leads to
warnings such as PAGE_FLAGS_CHECK_AT_FREE due to the page being locked
when the refcount drops to zero. Note that during removal of the
device exclusive entry the PTE is currently re-checked under the PTL
so no further bad page accesses occur once it is locked.

Signed-off-by: Alistair Popple 
Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Cc: sta...@vger.kernel.org
---
 mm/memory.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8c8420934d60..b499bd283d8e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3623,8 +3623,19 @@ static vm_fault_t remove_device_exclusive_entry(struct 
vm_fault *vmf)
struct vm_area_struct *vma = vmf->vma;
struct mmu_notifier_range range;
 
-   if (!folio_lock_or_retry(folio, vma->vm_mm, vmf->flags))
+   /*
+* We need a page reference to lock the page because we don't
+* hold the PTL so a racing thread can remove the
+* device-exclusive entry and unmap the page. If the page is
+* free the entry must have been removed already.
+*/
+   if (!get_page_unless_zero(vmf->page))
+   return 0;
+
+   if (!folio_lock_or_retry(folio, vma->vm_mm, vmf->flags)) {
+   put_page(vmf->page);
return VM_FAULT_RETRY;
+   }
mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
vma->vm_mm, vmf->address & PAGE_MASK,
(vmf->address & PAGE_MASK) + PAGE_SIZE, NULL);
@@ -3637,6 +3648,7 @@ static vm_fault_t remove_device_exclusive_entry(struct 
vm_fault *vmf)
 
pte_unmap_unlock(vmf->pte, vmf->ptl);
folio_unlock(folio);
+   put_page(vmf->page);
 
mmu_notifier_invalidate_range_end(&range);
return 0;
-- 
2.39.2



Re: [Nouveau] [PATCH v2 0/8] Fix several device private page reference counting issues

2022-10-25 Thread Alistair Popple


"Vlastimil Babka (SUSE)"  writes:

> On 9/28/22 14:01, Alistair Popple wrote:
>> This series aims to fix a number of page reference counting issues in
>> drivers dealing with device private ZONE_DEVICE pages. These result in
>> use-after-free type bugs, either from accessing a struct page which no
>> longer exists because it has been removed or accessing fields within the
>> struct page which are no longer valid because the page has been freed.
>>
>> During normal usage it is unlikely these will cause any problems. However
>> without these fixes it is possible to crash the kernel from userspace.
>> These crashes can be triggered either by unloading the kernel module or
>> unbinding the device from the driver prior to a userspace task exiting. In
>> modules such as Nouveau it is also possible to trigger some of these issues
>> by explicitly closing the device file-descriptor prior to the task exiting
>> and then accessing device private memory.
>
> Hi, as this series was noticed to create a CVE [1], do you think a stable
> backport is warranted? I think the "It is possible to launch the attack
> remotely." in [1] is incorrect though, right?

Right, I don't see how this could be exploited remotely. And I'm pretty
sure you need root as well because in practice the pgmap needs to be
freed, and for Nouveau at least that only happens on device removal.

> It looks to me that patch 1 would be needed since the CONFIG_DEVICE_PRIVATE
> introduction, while the following few only to kernels with 27674ef6c73f
> (probably not so critical as that includes no LTS)?

Patch 3 already has a fixes tag for 27674ef6c73f. Patch 1 would need to
go back to CONFIG_DEVICE_PRIVATE introduction. I think patches 4-8 would
also need to go back to introduction of CONFIG_DEVICE_PRIVATE, but there
isn't as much impact there and they would be harder to backport I think.
Without them device removal can loop indefinitely in kernel mode (if
patch 3 is present or the kernel is older than 27674ef6c73f).

 - Alistair

> Thanks,
> Vlastimil
>
> [1] https://nvd.nist.gov/vuln/detail/CVE-2022-3523
>
>> This involves some minor changes to both PowerPC and AMD GPU code.
>> Unfortunately I lack hardware to test either of those so any help there
>> would be appreciated. The changes mimic what is done for both Nouveau
>> and hmm-tests though so I doubt they will cause problems.
>>
>> To: Andrew Morton 
>> To: linux...@kvack.org
>> Cc: linux-ker...@vger.kernel.org
>> Cc: amd-...@lists.freedesktop.org
>> Cc: nouveau@lists.freedesktop.org
>> Cc: dri-de...@lists.freedesktop.org
>>
>> Alistair Popple (8):
>>   mm/memory.c: Fix race when faulting a device private page
>>   mm: Free device private pages have zero refcount
>>   mm/memremap.c: Take a pgmap reference on page allocation
>>   mm/migrate_device.c: Refactor migrate_vma and 
>> migrate_device_coherent_page()
>>   mm/migrate_device.c: Add migrate_device_range()
>>   nouveau/dmem: Refactor nouveau_dmem_fault_copy_one()
>>   nouveau/dmem: Evict device private memory during release
>>   hmm-tests: Add test for migrate_device_range()
>>
>>  arch/powerpc/kvm/book3s_hv_uvmem.c   |  17 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  19 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |   2 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c |  11 +-
>>  drivers/gpu/drm/nouveau/nouveau_dmem.c   | 108 +++
>>  include/linux/memremap.h |   1 +-
>>  include/linux/migrate.h  |  15 ++-
>>  lib/test_hmm.c   | 129 ++---
>>  lib/test_hmm_uapi.h  |   1 +-
>>  mm/memory.c  |  16 +-
>>  mm/memremap.c|  30 ++-
>>  mm/migrate.c |  34 +--
>>  mm/migrate_device.c  | 239 +---
>>  mm/page_alloc.c  |   8 +-
>>  tools/testing/selftests/vm/hmm-tests.c   |  49 +-
>>  15 files changed, 516 insertions(+), 163 deletions(-)
>>
>> base-commit: 088b8aa537c2c767765f1c19b555f21ffe555786


[Nouveau] [PATCH] nouveau: Fix migrate_to_ram() for faulting page

2022-10-19 Thread Alistair Popple
Commit 16ce101db85d ("mm/memory.c: fix race when faulting a device private
page") changed the migrate_to_ram() callback to take a reference on the
device page to ensure it can't be freed while handling the fault.
Unfortunately the corresponding update to Nouveau to accommodate this
change was inadvertently dropped from that patch causing GPU to CPU
migration to fail so add it here.

Signed-off-by: Alistair Popple 
Fixes: 16ce101db85d ("mm/memory.c: fix race when faulting a device private 
page")
Cc: John Hubbard 
Cc: Ralph Campbell 
Cc: Lyude Paul 
Cc: Ben Skeggs 

---

Hi Andrew/Ben, apologies I must have accidentally dropped this small hunk
when rebasing prior to sending v2 of the original series. Without this
migration from GPU to CPU won't work in Nouveau so hopefully one of you can
take this for v6.1-rcX. Thanks.
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 5fe209107246..20fe53815b20 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -176,6 +176,7 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct 
vm_fault *vmf)
.src= &src,
.dst= &dst,
.pgmap_owner= drm->dev,
+   .fault_page = vmf->page,
.flags  = MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
};
 
-- 
2.35.1



Re: [Nouveau] [PATCH v2 1/8] mm/memory.c: Fix race when faulting a device private page

2022-10-02 Thread Alistair Popple


Felix Kuehling  writes:

> On 2022-09-28 08:01, Alistair Popple wrote:
>> When the CPU tries to access a device private page the migrate_to_ram()
>> callback associated with the pgmap for the page is called. However no
>> reference is taken on the faulting page. Therefore a concurrent
>> migration of the device private page can free the page and possibly the
>> underlying pgmap. This results in a race which can crash the kernel due
>> to the migrate_to_ram() function pointer becoming invalid. It also means
>> drivers can't reliably read the zone_device_data field because the page
>> may have been freed with memunmap_pages().
>>
>> Close the race by getting a reference on the page while holding the ptl
>> to ensure it has not been freed. Unfortunately the elevated reference
>> count will cause the migration required to handle the fault to fail. To
>> avoid this failure pass the faulting page into the migrate_vma functions
>> so that if an elevated reference count is found it can be checked to see
>> if it's expected or not.
>
> Do we really have to drag the fault_page all the way into the migrate 
> structure?
> Is the elevated refcount still needed at that time? Maybe it would be easier 
> to
> drop the refcount early in the ops->migrate_to_ram callbacks, so we won't have
> to deal with it in all the migration code.

That would also work. Honestly I don't really like either solution :-)

I didn't like having to plumb it all through the migration code
but I ended up going this way because I felt it was easier to explain
the lifetime of vmf->page for the migrate_to_ram() callback. This way
vmf->page is guaranteed to be valid for the duration of the
migrate_to_ram() callback.

As you suggest we could instead have drivers call put_page(vmf->page)
somewhere in migrate_to_ram() before calling migrate_vma_setup(). The
reason I didn't go this way is IMHO it's more subtle because in general
the page will remain valid after that put_page() anyway. So it would be
easy for drivers to introduce a bug assuming the vmf->page is still
valid and not notice because most of the time it is.

Let me know if you disagree with my reasoning though - would appreciate
any review here.
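For illustration, the alternative would look roughly like this in a driver's
migrate_to_ram() callback (a hypothetical sketch loosely modelled on the
Nouveau callback; the function name and the elided copy step are placeholders,
not the actual patch):

static vm_fault_t example_migrate_to_ram(struct vm_fault *vmf)
{
	unsigned long src = 0, dst = 0;
	struct migrate_vma args = {
		.vma	= vmf->vma,
		.start	= vmf->address,
		.end	= vmf->address + PAGE_SIZE,
		.src	= &src,
		.dst	= &dst,
		/* .pgmap_owner etc. omitted for brevity */
		.flags	= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
	};

	/*
	 * Drop the reference taken by the CPU fault handler up front so
	 * the elevated refcount never reaches the migration code. After
	 * this point vmf->page is only valid while the device-private
	 * PTE still maps it, which is the subtlety described above.
	 */
	put_page(vmf->page);

	if (migrate_vma_setup(&args))
		return VM_FAULT_SIGBUS;

	/* ... allocate a system page, copy the data back, finalize ... */
	return 0;
}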

> Regards,
>   Felix
>
>
>>
>> Signed-off-by: Alistair Popple 
>> Cc: Jason Gunthorpe 
>> Cc: John Hubbard 
>> Cc: Ralph Campbell 
>> Cc: Michael Ellerman 
>> Cc: Felix Kuehling 
>> Cc: Lyude Paul 
>> ---
>>   arch/powerpc/kvm/book3s_hv_uvmem.c   | 15 ++-
>>   drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17 +++--
>>   drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |  2 +-
>>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 11 +---
>>   include/linux/migrate.h  |  8 ++-
>>   lib/test_hmm.c   |  7 ++---
>>   mm/memory.c  | 16 +++-
>>   mm/migrate.c | 34 ++---
>>   mm/migrate_device.c  | 18 +
>>   9 files changed, 87 insertions(+), 41 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
>> b/arch/powerpc/kvm/book3s_hv_uvmem.c
>> index 5980063..d4eacf4 100644
>> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
>> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
>> @@ -508,10 +508,10 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
>>   static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
>>  unsigned long start,
>>  unsigned long end, unsigned long page_shift,
>> -struct kvm *kvm, unsigned long gpa)
>> +struct kvm *kvm, unsigned long gpa, struct page *fault_page)
>>   {
>>  unsigned long src_pfn, dst_pfn = 0;
>> -struct migrate_vma mig;
>> +struct migrate_vma mig = { 0 };
>>  struct page *dpage, *spage;
>>  struct kvmppc_uvmem_page_pvt *pvt;
>>  unsigned long pfn;
>> @@ -525,6 +525,7 @@ static int __kvmppc_svm_page_out(struct vm_area_struct 
>> *vma,
>>  mig.dst = &dst_pfn;
>>  mig.pgmap_owner = &kvmppc_uvmem_pgmap;
>>  mig.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
>> +mig.fault_page = fault_page;
>>  /* The requested page is already paged-out, nothing to do */
>>  if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
>> @@ -580,12 +581,14 @@ static int __kvmppc_svm_page_out(struct vm_area_struct 
>> *vma,
>>   static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
>>unsigned long start, unsigned long end,
>>unsigned long page_shift,
>> -   

Re: [Nouveau] [PATCH 2/7] mm: Free device private pages have zero refcount

2022-09-29 Thread Alistair Popple


Dan Williams  writes:

> Alistair Popple wrote:
>>
>> Jason Gunthorpe  writes:
>>
>> > On Mon, Sep 26, 2022 at 04:03:06PM +1000, Alistair Popple wrote:
>> >> Since 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page
>> >> refcount") device private pages have no longer had an extra reference
>> >> count when the page is in use. However before handing them back to the
>> >> owning device driver we add an extra reference count such that free
>> >> pages have a reference count of one.
>> >>
>> >> This makes it difficult to tell if a page is free or not because both
>> >> free and in use pages will have a non-zero refcount. Instead we should
>> >> return pages to the drivers page allocator with a zero reference count.
>> >> Kernel code can then safely use kernel functions such as
>> >> get_page_unless_zero().
>> >>
>> >> Signed-off-by: Alistair Popple 
>> >> ---
>> >>  arch/powerpc/kvm/book3s_hv_uvmem.c   | 1 +
>> >>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 +
>> >>  drivers/gpu/drm/nouveau/nouveau_dmem.c   | 1 +
>> >>  lib/test_hmm.c   | 1 +
>> >>  mm/memremap.c| 5 -
>> >>  mm/page_alloc.c  | 6 ++
>> >>  6 files changed, 10 insertions(+), 5 deletions(-)
>> >
>> > I think this is a great idea, but I'm surprised no dax stuff is
>> > touched here?
>>
>> free_zone_device_page() shouldn't be called for pgmap->type ==
>> MEMORY_DEVICE_FS_DAX so I don't think we should have to worry about DAX
>> there. Except that the folio code looks like it might have introduced a
>> bug. AFAICT put_page() always calls
>> put_devmap_managed_page(&folio->page) but folio_put() does not (although
>> folios_put() does!). So it seems folio_put() won't end up calling
>> __put_devmap_managed_page_refs() as I think it should.
>>
>> I think you're right about the change to __init_zone_device_page() - I
>> should limit it to DEVICE_PRIVATE/COHERENT pages only. But I need to
>> look at Dan's patch series more closely as I suspect it might be better
>> to rebase this patch on top of that.
>
> Apologies for the delay I was travelling the past few days. Yes, I think
> this patch slots in nicely to avoid the introduction of an init_mode
> [1]:
>
> https://lore.kernel.org/nvdimm/166329940343.2786261.6047770378829215962.st...@dwillia2-xfh.jf.intel.com/
>
> Mind if I steal it into my series?

No problem, although I notice Andrew has already merged it into
mm-unstable. If you end up rebasing your series on top of mine I think
all that's needed is a patch somewhere in your series to drop the
various `if (pgmap->type == MEMORY_DEVICE_*)` I added to (hopefully)
avoid breaking DAX. Assuming DAX takes a pagemap reference on struct
page allocation something like below.

---

diff --git a/mm/memremap.c b/mm/memremap.c
index 421bec3a29ee..da1a0e0abb8b 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -507,15 +507,7 @@ void free_zone_device_page(struct page *page)
page->mapping = NULL;
page->pgmap->ops->page_free(page);

-   if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
-   page->pgmap->type != MEMORY_DEVICE_COHERENT)
-   /*
-* Reset the page count to 1 to prepare for handing out the page
-* again.
-*/
-   set_page_count(page, 1);
-   else
-   put_dev_pagemap(page->pgmap);
+   put_dev_pagemap(page->pgmap);
 }

 void zone_device_page_init(struct page *page)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 014dbdf54d62..3e5ff06700ca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6816,9 +6816,7 @@ static void __ref __init_zone_device_page(struct page 
*page, unsigned long pfn,
 * ZONE_DEVICE pages are released directly to the driver page allocator
 * which will set the page count to 1 when allocating the page.
 */
-   if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
-   pgmap->type == MEMORY_DEVICE_COHERENT)
-   set_page_count(page, 0);
+   set_page_count(page, 0);
 }

 /*


Re: [Nouveau] [PATCH v2 8/8] hmm-tests: Add test for migrate_device_range()

2022-09-29 Thread Alistair Popple


Andrew Morton  writes:

> On Wed, 28 Sep 2022 22:01:22 +1000 Alistair Popple  wrote:
>
>> @@ -1401,22 +1494,7 @@ static int dmirror_device_init(struct dmirror_device 
>> *mdevice, int id)
>>
>>  static void dmirror_device_remove(struct dmirror_device *mdevice)
>>  {
>> -unsigned int i;
>> -
>> -if (mdevice->devmem_chunks) {
>> -for (i = 0; i < mdevice->devmem_count; i++) {
>> -struct dmirror_chunk *devmem =
>> -mdevice->devmem_chunks[i];
>> -
>> -memunmap_pages(&devmem->pagemap);
>> -if (devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
>> -release_mem_region(devmem->pagemap.range.start,
>> -   
>> range_len(&devmem->pagemap.range));
>> -kfree(devmem);
>> -}
>> -kfree(mdevice->devmem_chunks);
>> -}
>> -
>> +dmirror_device_remove_chunks(mdevice);
>>  cdev_del(&mdevice->cdevice);
>>  }
>
> Needed a bit of rework due to
> https://lkml.kernel.org/r/20220826050631.25771-1-mpent...@redhat.com.
> Please check my resolution.

Thanks. Rework looks good to me.

> --- a/lib/test_hmm.c~hmm-tests-add-test-for-migrate_device_range
> +++ a/lib/test_hmm.c
> @@ -100,6 +100,7 @@ struct dmirror {
>  struct dmirror_chunk {
>   struct dev_pagemap  pagemap;
>   struct dmirror_device   *mdevice;
> + bool remove;
>  };
>
>  /*
> @@ -192,11 +193,15 @@ static int dmirror_fops_release(struct i
>   return 0;
>  }
>
> +static struct dmirror_chunk *dmirror_page_to_chunk(struct page *page)
> +{
> + return container_of(page->pgmap, struct dmirror_chunk, pagemap);
> +}
> +
>  static struct dmirror_device *dmirror_page_to_device(struct page *page)
>
>  {
> - return container_of(page->pgmap, struct dmirror_chunk,
> - pagemap)->mdevice;
> + return dmirror_page_to_chunk(page)->mdevice;
>  }
>
>  static int dmirror_do_fault(struct dmirror *dmirror, struct hmm_range *range)
> @@ -1218,6 +1223,85 @@ static int dmirror_snapshot(struct dmirr
>   return ret;
>  }
>
> +static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
> +{
> + unsigned long start_pfn = chunk->pagemap.range.start >> PAGE_SHIFT;
> + unsigned long end_pfn = chunk->pagemap.range.end >> PAGE_SHIFT;
> + unsigned long npages = end_pfn - start_pfn + 1;
> + unsigned long i;
> + unsigned long *src_pfns;
> + unsigned long *dst_pfns;
> +
> + src_pfns = kcalloc(npages, sizeof(*src_pfns), GFP_KERNEL);
> + dst_pfns = kcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL);
> +
> + migrate_device_range(src_pfns, start_pfn, npages);
> + for (i = 0; i < npages; i++) {
> + struct page *dpage, *spage;
> +
> + spage = migrate_pfn_to_page(src_pfns[i]);
> + if (!spage || !(src_pfns[i] & MIGRATE_PFN_MIGRATE))
> + continue;
> +
> + if (WARN_ON(!is_device_private_page(spage) &&
> + !is_device_coherent_page(spage)))
> + continue;
> + spage = BACKING_PAGE(spage);
> + dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
> + lock_page(dpage);
> + copy_highpage(dpage, spage);
> + dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
> + if (src_pfns[i] & MIGRATE_PFN_WRITE)
> + dst_pfns[i] |= MIGRATE_PFN_WRITE;
> + }
> + migrate_device_pages(src_pfns, dst_pfns, npages);
> + migrate_device_finalize(src_pfns, dst_pfns, npages);
> + kfree(src_pfns);
> + kfree(dst_pfns);
> +}
> +
> +/* Removes free pages from the free list so they can't be re-allocated */
> +static void dmirror_remove_free_pages(struct dmirror_chunk *devmem)
> +{
> + struct dmirror_device *mdevice = devmem->mdevice;
> + struct page *page;
> +
> + for (page = mdevice->free_pages; page; page = page->zone_device_data)
> + if (dmirror_page_to_chunk(page) == devmem)
> + mdevice->free_pages = page->zone_device_data;
> +}
> +
> +static void dmirror_device_remove_chunks(struct dmirror_device *mdevice)
> +{
> + unsigned int i;
> +
> + mutex_lock(&mdevice->devmem_lock);
> + if (mdevice->devmem_chunks) {
> + for (i = 0; i < mdevice->devmem_count; i++) {
> + struct dmirror_

Re: [Nouveau] [PATCH 1/7] mm/memory.c: Fix race when faulting a device private page

2022-09-28 Thread Alistair Popple


Michael Ellerman  writes:

> Alistair Popple  writes:
>> When the CPU tries to access a device private page the migrate_to_ram()
>> callback associated with the pgmap for the page is called. However no
>> reference is taken on the faulting page. Therefore a concurrent
>> migration of the device private page can free the page and possibly the
>> underlying pgmap. This results in a race which can crash the kernel due
>> to the migrate_to_ram() function pointer becoming invalid. It also means
>> drivers can't reliably read the zone_device_data field because the page
>> may have been freed with memunmap_pages().
>>
>> Close the race by getting a reference on the page while holding the ptl
>> to ensure it has not been freed. Unfortunately the elevated reference
>> count will cause the migration required to handle the fault to fail. To
>> avoid this failure pass the faulting page into the migrate_vma functions
>> so that if an elevated reference count is found it can be checked to see
>> if it's expected or not.
>>
>> Signed-off-by: Alistair Popple 
>> ---
>>  arch/powerpc/kvm/book3s_hv_uvmem.c   | 15 ++-
>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17 +++--
>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |  2 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 11 +---
>>  include/linux/migrate.h  |  8 ++-
>>  lib/test_hmm.c   |  7 ++---
>>  mm/memory.c  | 16 +++-
>>  mm/migrate.c | 34 ++---
>>  mm/migrate_device.c  | 18 +
>>  9 files changed, 87 insertions(+), 41 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
>> b/arch/powerpc/kvm/book3s_hv_uvmem.c
>> index 5980063..d4eacf4 100644
>> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
>> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
>> @@ -508,10 +508,10 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
>>  static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
>>  unsigned long start,
>>  unsigned long end, unsigned long page_shift,
>> -struct kvm *kvm, unsigned long gpa)
>> +struct kvm *kvm, unsigned long gpa, struct page *fault_page)
>>  {
>>  unsigned long src_pfn, dst_pfn = 0;
>> -struct migrate_vma mig;
>> +struct migrate_vma mig = { 0 };
>>  struct page *dpage, *spage;
>>  struct kvmppc_uvmem_page_pvt *pvt;
>>  unsigned long pfn;
>> @@ -525,6 +525,7 @@ static int __kvmppc_svm_page_out(struct vm_area_struct 
>> *vma,
>>  mig.dst = &dst_pfn;
>>  mig.pgmap_owner = &kvmppc_uvmem_pgmap;
>>  mig.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
>> +mig.fault_page = fault_page;
>>
>>  /* The requested page is already paged-out, nothing to do */
>>  if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
>> @@ -580,12 +581,14 @@ static int __kvmppc_svm_page_out(struct vm_area_struct 
>> *vma,
>>  static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
>>unsigned long start, unsigned long end,
>>unsigned long page_shift,
>> -  struct kvm *kvm, unsigned long gpa)
>> +  struct kvm *kvm, unsigned long gpa,
>> +  struct page *fault_page)
>>  {
>>  int ret;
>>
>>  mutex_lock(&kvm->arch.uvmem_lock);
>> -ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa);
>> +ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa,
>> +fault_page);
>>  mutex_unlock(&kvm->arch.uvmem_lock);
>>
>>  return ret;
>> @@ -736,7 +739,7 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma,
>>  bool pagein)
>>  {
>>  unsigned long src_pfn, dst_pfn = 0;
>> -struct migrate_vma mig;
>> +struct migrate_vma mig = { 0 };
>>  struct page *spage;
>>  unsigned long pfn;
>>  struct page *dpage;
>> @@ -994,7 +997,7 @@ static vm_fault_t kvmppc_uvmem_migrate_to_ram(struct 
>> vm_fault *vmf)
>>
>>  if (kvmppc_svm_page_out(vmf->vma, vmf->address,
>>  vmf->address + PAGE_SIZE, PAGE_SHIFT,
>> -pvt->kvm, pvt->gpa))
>> +pvt->kvm, pvt->gpa, vmf->page))
>>  return VM_FAULT_SIG

[Nouveau] [PATCH v2 8/8] hmm-tests: Add test for migrate_device_range()

2022-09-28 Thread Alistair Popple
Signed-off-by: Alistair Popple 
Cc: Jason Gunthorpe 
Cc: Ralph Campbell 
Cc: John Hubbard 
Cc: Alex Sierra 
Cc: Felix Kuehling 
---
 lib/test_hmm.c | 120 +-
 lib/test_hmm_uapi.h|   1 +-
 tools/testing/selftests/vm/hmm-tests.c |  49 +++-
 3 files changed, 149 insertions(+), 21 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 688c15d..6c2fc85 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -100,6 +100,7 @@ struct dmirror {
 struct dmirror_chunk {
struct dev_pagemap  pagemap;
struct dmirror_device   *mdevice;
+   bool remove;
 };
 
 /*
@@ -192,11 +193,15 @@ static int dmirror_fops_release(struct inode *inode, 
struct file *filp)
return 0;
 }
 
+static struct dmirror_chunk *dmirror_page_to_chunk(struct page *page)
+{
+   return container_of(page->pgmap, struct dmirror_chunk, pagemap);
+}
+
 static struct dmirror_device *dmirror_page_to_device(struct page *page)
 
 {
-   return container_of(page->pgmap, struct dmirror_chunk,
-   pagemap)->mdevice;
+   return dmirror_page_to_chunk(page)->mdevice;
 }
 
 static int dmirror_do_fault(struct dmirror *dmirror, struct hmm_range *range)
@@ -1218,6 +1223,85 @@ static int dmirror_snapshot(struct dmirror *dmirror,
return ret;
 }
 
+static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
+{
+   unsigned long start_pfn = chunk->pagemap.range.start >> PAGE_SHIFT;
+   unsigned long end_pfn = chunk->pagemap.range.end >> PAGE_SHIFT;
+   unsigned long npages = end_pfn - start_pfn + 1;
+   unsigned long i;
+   unsigned long *src_pfns;
+   unsigned long *dst_pfns;
+
+   src_pfns = kcalloc(npages, sizeof(*src_pfns), GFP_KERNEL);
+   dst_pfns = kcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL);
+
+   migrate_device_range(src_pfns, start_pfn, npages);
+   for (i = 0; i < npages; i++) {
+   struct page *dpage, *spage;
+
+   spage = migrate_pfn_to_page(src_pfns[i]);
+   if (!spage || !(src_pfns[i] & MIGRATE_PFN_MIGRATE))
+   continue;
+
+   if (WARN_ON(!is_device_private_page(spage) &&
+   !is_device_coherent_page(spage)))
+   continue;
+   spage = BACKING_PAGE(spage);
+   dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
+   lock_page(dpage);
+   copy_highpage(dpage, spage);
+   dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
+   if (src_pfns[i] & MIGRATE_PFN_WRITE)
+   dst_pfns[i] |= MIGRATE_PFN_WRITE;
+   }
+   migrate_device_pages(src_pfns, dst_pfns, npages);
+   migrate_device_finalize(src_pfns, dst_pfns, npages);
+   kfree(src_pfns);
+   kfree(dst_pfns);
+}
+
+/* Removes free pages from the free list so they can't be re-allocated */
+static void dmirror_remove_free_pages(struct dmirror_chunk *devmem)
+{
+   struct dmirror_device *mdevice = devmem->mdevice;
+   struct page *page;
+
+   for (page = mdevice->free_pages; page; page = page->zone_device_data)
+   if (dmirror_page_to_chunk(page) == devmem)
+   mdevice->free_pages = page->zone_device_data;
+}
+
+static void dmirror_device_remove_chunks(struct dmirror_device *mdevice)
+{
+   unsigned int i;
+
+   mutex_lock(&mdevice->devmem_lock);
+   if (mdevice->devmem_chunks) {
+   for (i = 0; i < mdevice->devmem_count; i++) {
+   struct dmirror_chunk *devmem =
+   mdevice->devmem_chunks[i];
+
+   spin_lock(&mdevice->lock);
+   devmem->remove = true;
+   dmirror_remove_free_pages(devmem);
+   spin_unlock(&mdevice->lock);
+
+   dmirror_device_evict_chunk(devmem);
+   memunmap_pages(&devmem->pagemap);
+   if (devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
+   release_mem_region(devmem->pagemap.range.start,
+  
range_len(&devmem->pagemap.range));
+   kfree(devmem);
+   }
+   mdevice->devmem_count = 0;
+   mdevice->devmem_capacity = 0;
+   mdevice->free_pages = NULL;
+   kfree(mdevice->devmem_chunks);
+   mdevice->devmem_chunks = NULL;
+   }
+   mutex_unlock(&mdevice->devmem_lock);
+}
+
 static long dmirror_fops_unlocked_ioctl(struct file *filp,
unsigned int command,
unsigned long arg)
@@ -1272,6 +1356,11 @@ static long dmirror_fops_unlocked_ioctl(struct file 
*filp,
ret = dmirror_snapshot(dmirror, &cmd);
  

[Nouveau] [PATCH v2 7/8] nouveau/dmem: Evict device private memory during release

2022-09-28 Thread Alistair Popple
When the module is unloaded or a GPU is unbound from the module it is
possible for device private pages to still be mapped in currently
running processes. This can lead to a hangs and RCU stall warnings when
unbinding the device as memunmap_pages() will wait in an uninterruptible
state until all device pages have been freed which may never happen.

Fix this by migrating device mappings back to normal CPU memory prior to
freeing the GPU memory chunks and associated device private pages.

Signed-off-by: Alistair Popple 
Cc: Lyude Paul 
Cc: Ben Skeggs 
Cc: Ralph Campbell 
Cc: John Hubbard 
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 48 +++-
 1 file changed, 48 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 65f51fb..5fe2091 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -367,6 +367,52 @@ nouveau_dmem_suspend(struct nouveau_drm *drm)
mutex_unlock(&drm->dmem->mutex);
 }
 
+/*
+ * Evict all pages mapping a chunk.
+ */
+static void
+nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
+{
+   unsigned long i, npages = range_len(&chunk->pagemap.range) >> 
PAGE_SHIFT;
+   unsigned long *src_pfns, *dst_pfns;
+   dma_addr_t *dma_addrs;
+   struct nouveau_fence *fence;
+
+   src_pfns = kcalloc(npages, sizeof(*src_pfns), GFP_KERNEL);
+   dst_pfns = kcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL);
+   dma_addrs = kcalloc(npages, sizeof(*dma_addrs), GFP_KERNEL);
+
+   migrate_device_range(src_pfns, chunk->pagemap.range.start >> PAGE_SHIFT,
+   npages);
+
+   for (i = 0; i < npages; i++) {
+   if (src_pfns[i] & MIGRATE_PFN_MIGRATE) {
+   struct page *dpage;
+
+   /*
+* __GFP_NOFAIL because the GPU is going away and there
+* is nothing sensible we can do if we can't copy the
+* data back.
+*/
+   dpage = alloc_page(GFP_HIGHUSER | __GFP_NOFAIL);
+   dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
+   nouveau_dmem_copy_one(chunk->drm,
+   migrate_pfn_to_page(src_pfns[i]), dpage,
+   &dma_addrs[i]);
+   }
+   }
+
+   nouveau_fence_new(chunk->drm->dmem->migrate.chan, false, &fence);
+   migrate_device_pages(src_pfns, dst_pfns, npages);
+   nouveau_dmem_fence_done(&fence);
+   migrate_device_finalize(src_pfns, dst_pfns, npages);
+   kfree(src_pfns);
+   kfree(dst_pfns);
+   for (i = 0; i < npages; i++)
+   dma_unmap_page(chunk->drm->dev->dev, dma_addrs[i], PAGE_SIZE, 
DMA_BIDIRECTIONAL);
+   kfree(dma_addrs);
+}
+
 void
 nouveau_dmem_fini(struct nouveau_drm *drm)
 {
@@ -378,8 +424,10 @@ nouveau_dmem_fini(struct nouveau_drm *drm)
mutex_lock(&drm->dmem->mutex);
 
list_for_each_entry_safe(chunk, tmp, &drm->dmem->chunks, list) {
+   nouveau_dmem_evict_chunk(chunk);
nouveau_bo_unpin(chunk->bo);
nouveau_bo_ref(NULL, &chunk->bo);
+   WARN_ON(chunk->callocated);
list_del(&chunk->list);
memunmap_pages(&chunk->pagemap);
release_mem_region(chunk->pagemap.range.start,
-- 
git-series 0.9.1


[Nouveau] [PATCH v2 5/8] mm/migrate_device.c: Add migrate_device_range()

2022-09-28 Thread Alistair Popple
Device drivers can use the migrate_vma family of functions to migrate
existing private anonymous mappings to device private pages. These pages
are backed by memory on the device with drivers being responsible for
copying data to and from device memory.

Device private pages are freed via the pgmap->page_free() callback when
they are unmapped and their refcount drops to zero. Alternatively they
may be freed indirectly via migration back to CPU memory in response to
a pgmap->migrate_to_ram() callback called whenever the CPU accesses
an address mapped to a device private page.

In other words drivers cannot control the lifetime of data allocated on
the devices and must wait until these pages are freed from userspace.
This causes issues when memory needs to reclaimed on the device, either
because the device is going away due to a ->release() callback or
because another user needs to use the memory.

Drivers could use the existing migrate_vma functions to migrate data off
the device. However this would require them to track the mappings of
each page which is both complicated and not always possible. Instead
drivers need to be able to migrate device pages directly so they can
free up device memory.

To allow that this patch introduces the migrate_device family of
functions which are functionally similar to migrate_vma but which skips
the initial lookup based on mapping.

Signed-off-by: Alistair Popple 
Cc: "Huang, Ying" 
Cc: Zi Yan 
Cc: Matthew Wilcox 
Cc: Yang Shi 
Cc: David Hildenbrand 
Cc: Ralph Campbell 
Cc: John Hubbard 
---
 include/linux/migrate.h |  7 +++-
 mm/migrate_device.c | 89 ++
 2 files changed, 89 insertions(+), 7 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 82ffa47..582cdc7 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -225,6 +225,13 @@ struct migrate_vma {
 int migrate_vma_setup(struct migrate_vma *args);
 void migrate_vma_pages(struct migrate_vma *migrate);
 void migrate_vma_finalize(struct migrate_vma *migrate);
+int migrate_device_range(unsigned long *src_pfns, unsigned long start,
+   unsigned long npages);
+void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
+   unsigned long npages);
+void migrate_device_finalize(unsigned long *src_pfns,
+   unsigned long *dst_pfns, unsigned long npages);
+
 #endif /* CONFIG_MIGRATION */
 
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index ba479b5..824860a 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -681,7 +681,7 @@ static void migrate_vma_insert_page(struct migrate_vma 
*migrate,
*src &= ~MIGRATE_PFN_MIGRATE;
 }
 
-static void migrate_device_pages(unsigned long *src_pfns,
+static void __migrate_device_pages(unsigned long *src_pfns,
unsigned long *dst_pfns, unsigned long npages,
struct migrate_vma *migrate)
 {
@@ -703,6 +703,9 @@ static void migrate_device_pages(unsigned long *src_pfns,
if (!page) {
unsigned long addr;
 
+   if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
+   continue;
+
/*
 * The only time there is no vma is when called from
 * migrate_device_coherent_page(). However this isn't
@@ -710,8 +713,6 @@ static void migrate_device_pages(unsigned long *src_pfns,
 */
VM_BUG_ON(!migrate);
addr = migrate->start + i*PAGE_SIZE;
-   if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
-   continue;
if (!notified) {
notified = true;
 
@@ -767,6 +768,22 @@ static void migrate_device_pages(unsigned long *src_pfns,
 }
 
 /**
+ * migrate_device_pages() - migrate meta-data from src page to dst page
+ * @src_pfns: src_pfns returned from migrate_device_range()
+ * @dst_pfns: array of pfns allocated by the driver to migrate memory to
+ * @npages: number of pages in the range
+ *
+ * Equivalent to migrate_vma_pages(). This is called to migrate struct page
+ * meta-data from source struct page to destination.
+ */
+void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
+   unsigned long npages)
+{
+   __migrate_device_pages(src_pfns, dst_pfns, npages, NULL);
+}
+EXPORT_SYMBOL(migrate_device_pages);
+
+/**
  * migrate_vma_pages() - migrate meta-data from src page to dst page
  * @migrate: migrate struct containing all migration information
  *
@@ -776,12 +793,22 @@ static void migrate_device_pages(unsigned long *src_pfns,
  */
 void migrate_vma_pages(struct migrate_vma *migrate)
 {
-   migrate_device_pages(migrate->src, mi

[Nouveau] [PATCH v2 6/8] nouveau/dmem: Refactor nouveau_dmem_fault_copy_one()

2022-09-28 Thread Alistair Popple
nouveau_dmem_fault_copy_one() is used during handling of CPU faults via
the migrate_to_ram() callback and is used to copy data from GPU to CPU
memory. It is currently specific to fault handling, however a future
patch implementing eviction of data during teardown needs similar
functionality.

Refactor out the core functionality so that it is not specific to fault
handling.

Signed-off-by: Alistair Popple 
Reviewed-by: Lyude Paul 
Cc: Ben Skeggs 
Cc: Ralph Campbell 
Cc: John Hubbard 
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 58 +--
 1 file changed, 28 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index b092988..65f51fb 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -139,44 +139,24 @@ static void nouveau_dmem_fence_done(struct nouveau_fence 
**fence)
}
 }
 
-static vm_fault_t nouveau_dmem_fault_copy_one(struct nouveau_drm *drm,
-   struct vm_fault *vmf, struct migrate_vma *args,
-   dma_addr_t *dma_addr)
+static int nouveau_dmem_copy_one(struct nouveau_drm *drm, struct page *spage,
+   struct page *dpage, dma_addr_t *dma_addr)
 {
struct device *dev = drm->dev->dev;
-   struct page *dpage, *spage;
-   struct nouveau_svmm *svmm;
-
-   spage = migrate_pfn_to_page(args->src[0]);
-   if (!spage || !(args->src[0] & MIGRATE_PFN_MIGRATE))
-   return 0;
 
-   dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
-   if (!dpage)
-   return VM_FAULT_SIGBUS;
lock_page(dpage);
 
*dma_addr = dma_map_page(dev, dpage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
if (dma_mapping_error(dev, *dma_addr))
-   goto error_free_page;
+   return -EIO;
 
-   svmm = spage->zone_device_data;
-   mutex_lock(&svmm->mutex);
-   nouveau_svmm_invalidate(svmm, args->start, args->end);
if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_HOST, *dma_addr,
-   NOUVEAU_APER_VRAM, nouveau_dmem_page_addr(spage)))
-   goto error_dma_unmap;
-   mutex_unlock(&svmm->mutex);
+NOUVEAU_APER_VRAM, 
nouveau_dmem_page_addr(spage))) {
+   dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+   return -EIO;
+   }
 
-   args->dst[0] = migrate_pfn(page_to_pfn(dpage));
return 0;
-
-error_dma_unmap:
-   mutex_unlock(&svmm->mutex);
-   dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
-error_free_page:
-   __free_page(dpage);
-   return VM_FAULT_SIGBUS;
 }
 
 static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
@@ -184,9 +164,11 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct 
vm_fault *vmf)
struct nouveau_drm *drm = page_to_drm(vmf->page);
struct nouveau_dmem *dmem = drm->dmem;
struct nouveau_fence *fence;
+   struct nouveau_svmm *svmm;
+   struct page *spage, *dpage;
unsigned long src = 0, dst = 0;
dma_addr_t dma_addr = 0;
-   vm_fault_t ret;
+   vm_fault_t ret = 0;
struct migrate_vma args = {
.vma= vmf->vma,
.start  = vmf->address,
@@ -207,9 +189,25 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct 
vm_fault *vmf)
if (!args.cpages)
return 0;
 
-   ret = nouveau_dmem_fault_copy_one(drm, vmf, &args, &dma_addr);
-   if (ret || dst == 0)
+   spage = migrate_pfn_to_page(src);
+   if (!spage || !(src & MIGRATE_PFN_MIGRATE))
+   goto done;
+
+   dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
+   if (!dpage)
+   goto done;
+
+   dst = migrate_pfn(page_to_pfn(dpage));
+
+   svmm = spage->zone_device_data;
+   mutex_lock(&svmm->mutex);
+   nouveau_svmm_invalidate(svmm, args.start, args.end);
+   ret = nouveau_dmem_copy_one(drm, spage, dpage, &dma_addr);
+   mutex_unlock(&svmm->mutex);
+   if (ret) {
+   ret = VM_FAULT_SIGBUS;
goto done;
+   }
 
nouveau_fence_new(dmem->migrate.chan, false, &fence);
migrate_vma_pages(&args);
-- 
git-series 0.9.1


[Nouveau] [PATCH v2 4/8] mm/migrate_device.c: Refactor migrate_vma and migrate_device_coherent_page()

2022-09-28 Thread Alistair Popple
migrate_device_coherent_page() reuses the existing migrate_vma family of
functions to migrate a specific page without providing a valid mapping
or vma. This looks a bit odd because it means we are calling
migrate_vma_*() without setting a valid vma, however it was considered
acceptable at the time because the details were internal to
migrate_device.c and there was only a single user.

One of the reasons the details could be kept internal was that this was
strictly for migrating device coherent memory. Such memory can be copied
directly by the CPU without intervention from a driver. However this
isn't true for device private memory, and a future change requires
similar functionality for device private memory. So refactor the code
into something more sensible for migrating device memory without a vma.

Signed-off-by: Alistair Popple 
Cc: "Huang, Ying" 
Cc: Zi Yan 
Cc: Matthew Wilcox 
Cc: Yang Shi 
Cc: David Hildenbrand 
Cc: Ralph Campbell 
Cc: John Hubbard 
---
 mm/migrate_device.c | 150 +
 1 file changed, 85 insertions(+), 65 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index f756c00..ba479b5 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -345,26 +345,20 @@ static bool migrate_vma_check_page(struct page *page, 
struct page *fault_page)
 }
 
 /*
- * migrate_vma_unmap() - replace page mapping with special migration pte entry
- * @migrate: migrate struct containing all migration information
- *
- * Isolate pages from the LRU and replace mappings (CPU page table pte) with a
- * special migration pte entry and check if it has been pinned. Pinned pages 
are
- * restored because we cannot migrate them.
- *
- * This is the last step before we call the device driver callback to allocate
- * destination memory and copy contents of original page over to new page.
+ * Unmaps pages for migration. Returns number of unmapped pages.
  */
-static void migrate_vma_unmap(struct migrate_vma *migrate)
+static unsigned long migrate_device_unmap(unsigned long *src_pfns,
+ unsigned long npages,
+ struct page *fault_page)
 {
-   const unsigned long npages = migrate->npages;
unsigned long i, restore = 0;
bool allow_drain = true;
+   unsigned long unmapped = 0;
 
lru_add_drain();
 
for (i = 0; i < npages; i++) {
-   struct page *page = migrate_pfn_to_page(migrate->src[i]);
+   struct page *page = migrate_pfn_to_page(src_pfns[i]);
struct folio *folio;
 
if (!page)
@@ -379,8 +373,7 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
}
 
if (isolate_lru_page(page)) {
-   migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-   migrate->cpages--;
+   src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
restore++;
continue;
}
@@ -394,34 +387,54 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
try_to_migrate(folio, 0);
 
if (page_mapped(page) ||
-   !migrate_vma_check_page(page, migrate->fault_page)) {
+   !migrate_vma_check_page(page, fault_page)) {
if (!is_zone_device_page(page)) {
get_page(page);
putback_lru_page(page);
}
 
-   migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-   migrate->cpages--;
+   src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
restore++;
continue;
}
+
+   unmapped++;
}
 
for (i = 0; i < npages && restore; i++) {
-   struct page *page = migrate_pfn_to_page(migrate->src[i]);
+   struct page *page = migrate_pfn_to_page(src_pfns[i]);
struct folio *folio;
 
-   if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
+   if (!page || (src_pfns[i] & MIGRATE_PFN_MIGRATE))
continue;
 
folio = page_folio(page);
remove_migration_ptes(folio, folio, false);
 
-   migrate->src[i] = 0;
+   src_pfns[i] = 0;
folio_unlock(folio);
folio_put(folio);
restore--;
}
+
+   return unmapped;
+}
+
+/*
+ * migrate_vma_unmap() - replace page mapping with special migration pte entry
+ * @migrate: migrate struct containing all migration information
+ *
+ * Isolate pages from the LRU and replace mappings (CPU page table pte) with a
+ * special migration pte entry and check if it 

[Nouveau] [PATCH v2 3/8] mm/memremap.c: Take a pgmap reference on page allocation

2022-09-28 Thread Alistair Popple
ZONE_DEVICE pages have a struct dev_pagemap which is allocated by a
driver. When the struct page is first allocated by the kernel in
memremap_pages() a reference is taken on the associated pagemap to
ensure it is not freed prior to the pages being freed.

Prior to 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page
refcount") pages were considered free and returned to the driver when
the reference count dropped to one. However the pagemap reference was
not dropped until the page reference count hit zero. This would occur as
part of the final put_page() in memunmap_pages() which would wait for
all pages to be freed prior to returning.

When the extra refcount was removed the pagemap reference was no longer
being dropped in put_page(). Instead memunmap_pages() was changed to
explicitly drop the pagemap references. This means that memunmap_pages()
can complete even though pages are still mapped by the kernel which can
lead to kernel crashes, particularly if a driver frees the pagemap.

To fix this drivers should take a pagemap reference when allocating the
page. This reference can then be returned when the page is freed.
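
To illustrate, a rough sketch of the pairing a driver sees with this
change. The allocator and free-list helpers below are made up for the
example; only zone_device_page_init() and the page_free() callback are
real interfaces:

static struct page *my_devmem_alloc_page(struct my_device *mdev)
{
	/* hypothetical driver-private free list */
	struct page *page = my_take_from_free_list(mdev);

	if (!page)
		return NULL;
	/*
	 * zone_device_page_init() now takes a reference on page->pgmap
	 * (and warns if the pagemap is already being torn down) in
	 * addition to setting the page refcount and locking the page.
	 */
	zone_device_page_init(page);
	return page;
}

/*
 * When the final page reference is dropped free_zone_device_page() calls
 * pgmap->ops->page_free() and, for DEVICE_PRIVATE/COHERENT pages, returns
 * the pagemap reference taken above via put_dev_pagemap().
 */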

Signed-off-by: Alistair Popple 
Fixes: 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page refcount")
Cc: Jason Gunthorpe 
Cc: Felix Kuehling 
Cc: Alex Deucher 
Cc: Christian König 
Cc: Ben Skeggs 
Cc: Lyude Paul 
Cc: Ralph Campbell 
Cc: Alex Sierra 
Cc: John Hubbard 
Cc: Dan Williams 

---

Again I expect this will conflict with Dan's series. This implements the
first suggestion from Jason at
https://lore.kernel.org/linux-mm/yzly5jjof0jdl...@nvidia.com/ so
whatever we end up doing for DAX we should do the same here.
---
 mm/memremap.c | 25 +++--
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index 1c2c038..421bec3 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -138,8 +138,11 @@ void memunmap_pages(struct dev_pagemap *pgmap)
int i;
 
	percpu_ref_kill(&pgmap->ref);
-   for (i = 0; i < pgmap->nr_range; i++)
-   percpu_ref_put_many(&pgmap->ref, pfn_len(pgmap, i));
+   if (pgmap->type != MEMORY_DEVICE_PRIVATE &&
+   pgmap->type != MEMORY_DEVICE_COHERENT)
+   for (i = 0; i < pgmap->nr_range; i++)
+   percpu_ref_put_many(&pgmap->ref, pfn_len(pgmap, i));
+
	wait_for_completion(&pgmap->done);
 
for (i = 0; i < pgmap->nr_range; i++)
@@ -264,7 +267,9 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
				PHYS_PFN(range->start),
				PHYS_PFN(range_len(range)), pgmap);
-   percpu_ref_get_many(&pgmap->ref, pfn_len(pgmap, range_id));
+   if (pgmap->type != MEMORY_DEVICE_PRIVATE &&
+   pgmap->type != MEMORY_DEVICE_COHERENT)
+   percpu_ref_get_many(&pgmap->ref, pfn_len(pgmap, range_id));
return 0;
 
 err_add_memory:
@@ -502,16 +507,24 @@ void free_zone_device_page(struct page *page)
page->mapping = NULL;
page->pgmap->ops->page_free(page);
 
-   /*
-* Reset the page count to 1 to prepare for handing out the page again.
-*/
if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
page->pgmap->type != MEMORY_DEVICE_COHERENT)
+   /*
+* Reset the page count to 1 to prepare for handing out the page
+* again.
+*/
set_page_count(page, 1);
+   else
+   put_dev_pagemap(page->pgmap);
 }
 
 void zone_device_page_init(struct page *page)
 {
+   /*
+* Drivers shouldn't be allocating pages after calling
+* memunmap_pages().
+*/
+   WARN_ON_ONCE(!percpu_ref_tryget_live(&page->pgmap->ref));
set_page_count(page, 1);
lock_page(page);
 }
-- 
git-series 0.9.1


[Nouveau] [PATCH v2 1/8] mm/memory.c: Fix race when faulting a device private page

2022-09-28 Thread Alistair Popple
When the CPU tries to access a device private page the migrate_to_ram()
callback associated with the pgmap for the page is called. However no
reference is taken on the faulting page. Therefore a concurrent
migration of the device private page can free the page and possibly the
underlying pgmap. This results in a race which can crash the kernel due
to the migrate_to_ram() function pointer becoming invalid. It also means
drivers can't reliably read the zone_device_data field because the page
may have been freed with memunmap_pages().

Close the race by getting a reference on the page while holding the ptl
to ensure it has not been freed. Unfortunately the elevated reference
count will cause the migration required to handle the fault to fail. To
avoid this failure pass the faulting page into the migrate_vma functions
so that if an elevated reference count is found it can be checked to see
if it's expected or not.
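
As a rough sketch (not taken from the patch itself), a driver's
migrate_to_ram() callback now forwards the faulting page so that the
reference taken by the CPU fault handler is treated as expected. The
function name and pgmap_owner below are illustrative only:

static vm_fault_t my_migrate_to_ram(struct vm_fault *vmf)
{
	unsigned long src = 0, dst = 0;
	struct migrate_vma args = {
		.vma		= vmf->vma,
		.start		= vmf->address,
		.end		= vmf->address + PAGE_SIZE,
		.src		= &src,
		.dst		= &dst,
		.pgmap_owner	= &my_pgmap,	/* driver specific */
		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
		/* New field: the page already holding the extra reference */
		.fault_page	= vmf->page,
	};

	if (migrate_vma_setup(&args))
		return VM_FAULT_SIGBUS;

	/* ... allocate a system page, copy the data back, then ... */
	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);
	return 0;
}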

Signed-off-by: Alistair Popple 
Cc: Jason Gunthorpe 
Cc: John Hubbard 
Cc: Ralph Campbell 
Cc: Michael Ellerman 
Cc: Felix Kuehling 
Cc: Lyude Paul 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c   | 15 ++-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17 +++--
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 11 +---
 include/linux/migrate.h  |  8 ++-
 lib/test_hmm.c   |  7 ++---
 mm/memory.c  | 16 +++-
 mm/migrate.c | 34 ++---
 mm/migrate_device.c  | 18 +
 9 files changed, 87 insertions(+), 41 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 5980063..d4eacf4 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -508,10 +508,10 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
 static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
unsigned long start,
unsigned long end, unsigned long page_shift,
-   struct kvm *kvm, unsigned long gpa)
+   struct kvm *kvm, unsigned long gpa, struct page *fault_page)
 {
unsigned long src_pfn, dst_pfn = 0;
-   struct migrate_vma mig;
+   struct migrate_vma mig = { 0 };
struct page *dpage, *spage;
struct kvmppc_uvmem_page_pvt *pvt;
unsigned long pfn;
@@ -525,6 +525,7 @@ static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
	mig.dst = &dst_pfn;
	mig.pgmap_owner = &kvmppc_uvmem_pgmap;
mig.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
+   mig.fault_page = fault_page;
 
/* The requested page is already paged-out, nothing to do */
if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
@@ -580,12 +581,14 @@ static int __kvmppc_svm_page_out(struct vm_area_struct 
*vma,
 static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
  unsigned long start, unsigned long end,
  unsigned long page_shift,
- struct kvm *kvm, unsigned long gpa)
+ struct kvm *kvm, unsigned long gpa,
+ struct page *fault_page)
 {
int ret;
 
	mutex_lock(&kvm->arch.uvmem_lock);
-   ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa);
+   ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa,
+   fault_page);
	mutex_unlock(&kvm->arch.uvmem_lock);
 
return ret;
@@ -736,7 +739,7 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma,
bool pagein)
 {
unsigned long src_pfn, dst_pfn = 0;
-   struct migrate_vma mig;
+   struct migrate_vma mig = { 0 };
struct page *spage;
unsigned long pfn;
struct page *dpage;
@@ -994,7 +997,7 @@ static vm_fault_t kvmppc_uvmem_migrate_to_ram(struct 
vm_fault *vmf)
 
if (kvmppc_svm_page_out(vmf->vma, vmf->address,
vmf->address + PAGE_SIZE, PAGE_SHIFT,
-   pvt->kvm, pvt->gpa))
+   pvt->kvm, pvt->gpa, vmf->page))
return VM_FAULT_SIGBUS;
else
return 0;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index b059a77..776448b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -409,7 +409,7 @@ svm_migrate_vma_to_vram(struct amdgpu_device *adev, struct 
svm_range *prange,
uint64_t npages = (end - start) >> PAGE_SHIFT;
struct kfd_process_device *pdd;
struct dma_fence *mfence = NULL;
-   struct migrate_vma migrate;
+   struct migrate_vma migrate = { 0 };
unsigned long cpages = 0;
	dma_addr_t *scratch;

[Nouveau] [PATCH v2 2/8] mm: Free device private pages have zero refcount

2022-09-28 Thread Alistair Popple
Since 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page
refcount") device private pages have no longer had an extra reference
count when the page is in use. However before handing them back to the
owning device driver we add an extra reference count such that free
pages have a reference count of one.

This makes it difficult to tell if a page is free or not because both
free and in use pages will have a non-zero refcount. Instead we should
return pages to the drivers page allocator with a zero reference count.
Kernel code can then safely use kernel functions such as
get_page_unless_zero().
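
A small illustrative sketch of what this enables (the function name is
made up for the example):

static bool my_try_pin_device_page(struct page *page)
{
	/*
	 * With free pages now having a zero refcount this fails for a page
	 * sitting in the driver's allocator and succeeds, taking a
	 * reference, for a page that is in use.
	 */
	return get_page_unless_zero(page);
}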

Signed-off-by: Alistair Popple 
Cc: Jason Gunthorpe 
Cc: Michael Ellerman 
Cc: Felix Kuehling 
Cc: Alex Deucher 
Cc: Christian König 
Cc: Ben Skeggs 
Cc: Lyude Paul 
Cc: Ralph Campbell 
Cc: Alex Sierra 
Cc: John Hubbard 
Cc: Dan Williams 

---

This will conflict with Dan's series to fix reference counts for DAX[1].
At the moment this only makes changes for device private and coherent
pages, however if DAX is fixed to remove the extra refcount then we
should just be able to drop the checks for private/coherent pages and
treat them the same.

[1] - 
https://lore.kernel.org/linux-mm/166329930818.2786261.6086109734008025807.st...@dwillia2-xfh.jf.intel.com/
---
 arch/powerpc/kvm/book3s_hv_uvmem.c   |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |  2 +-
 include/linux/memremap.h |  1 +
 lib/test_hmm.c   |  2 +-
 mm/memremap.c|  9 +
 mm/page_alloc.c  |  8 
 7 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index d4eacf4..9d8de68 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -718,7 +718,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
 
dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
-   lock_page(dpage);
+   zone_device_page_init(dpage);
return dpage;
 out_clear:
	spin_lock(&kvmppc_uvmem_bitmap_lock);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 776448b..97a6845 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -223,7 +223,7 @@ svm_migrate_get_vram_page(struct svm_range *prange, 
unsigned long pfn)
page = pfn_to_page(pfn);
svm_range_bo_ref(prange->svm_bo);
page->zone_device_data = prange->svm_bo;
-   lock_page(page);
+   zone_device_page_init(page);
 }
 
 static void
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 1635661..b092988 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -326,7 +326,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}
 
-   lock_page(page);
+   zone_device_page_init(page);
return page;
 }
 
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 1901049..f68bf6d 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -182,6 +182,7 @@ static inline bool folio_is_device_coherent(const struct 
folio *folio)
 }
 
 #ifdef CONFIG_ZONE_DEVICE
+void zone_device_page_init(struct page *page);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
 void memunmap_pages(struct dev_pagemap *pgmap);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 89463ff..688c15d 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -627,8 +627,8 @@ static struct page *dmirror_devmem_alloc_page(struct 
dmirror_device *mdevice)
goto error;
}
 
+   zone_device_page_init(dpage);
dpage->zone_device_data = rpage;
-   lock_page(dpage);
return dpage;
 
 error:
diff --git a/mm/memremap.c b/mm/memremap.c
index 25029a4..1c2c038 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -505,8 +505,17 @@ void free_zone_device_page(struct page *page)
/*
 * Reset the page count to 1 to prepare for handing out the page again.
 */
+   if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
+   page->pgmap->type != MEMORY_DEVICE_COHERENT)
+   set_page_count(page, 1);
+}
+
+void zone_device_page_init(struct page *page)
+{
set_page_count(page, 1);
+   lock_page(page);
 }
+EXPORT_SYMBOL_GPL(zone_device_page_init);
 
 #ifdef CONFIG_FS_DAX
 bool __put_devmap_managed_page_refs(struct page *page, int refs)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9d49803..4df1e43 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6744,6 +6744,14 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,

[Nouveau] [PATCH v2 0/8] Fix several device private page reference counting issues

2022-09-28 Thread Alistair Popple
This series aims to fix a number of page reference counting issues in
drivers dealing with device private ZONE_DEVICE pages. These result in
use-after-free type bugs, either from accessing a struct page which no
longer exists because it has been removed or accessing fields within the
struct page which are no longer valid because the page has been freed.

During normal usage it is unlikely these will cause any problems. However
without these fixes it is possible to crash the kernel from userspace.
These crashes can be triggered either by unloading the kernel module or
unbinding the device from the driver prior to a userspace task exiting. In
modules such as Nouveau it is also possible to trigger some of these issues
by explicitly closing the device file-descriptor prior to the task exiting
and then accessing device private memory.

This involves some minor changes to both PowerPC and AMD GPU code.
Unfortunately I lack hardware to test either of those so any help there
would be appreciated. The changes mimic what is done for both Nouveau
and hmm-tests though so I doubt they will cause problems.

To: Andrew Morton 
To: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Cc: amd-...@lists.freedesktop.org
Cc: nouveau@lists.freedesktop.org
Cc: dri-de...@lists.freedesktop.org

Alistair Popple (8):
  mm/memory.c: Fix race when faulting a device private page
  mm: Free device private pages have zero refcount
  mm/memremap.c: Take a pgmap reference on page allocation
  mm/migrate_device.c: Refactor migrate_vma and migrate_device_coherent_page()
  mm/migrate_device.c: Add migrate_device_range()
  nouveau/dmem: Refactor nouveau_dmem_fault_copy_one()
  nouveau/dmem: Evict device private memory during release
  hmm-tests: Add test for migrate_device_range()

 arch/powerpc/kvm/book3s_hv_uvmem.c   |  17 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  19 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |   2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c |  11 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c   | 108 +++
 include/linux/memremap.h |   1 +-
 include/linux/migrate.h  |  15 ++-
 lib/test_hmm.c   | 129 ++---
 lib/test_hmm_uapi.h  |   1 +-
 mm/memory.c  |  16 +-
 mm/memremap.c|  30 ++-
 mm/migrate.c |  34 +--
 mm/migrate_device.c  | 239 +---
 mm/page_alloc.c  |   8 +-
 tools/testing/selftests/vm/hmm-tests.c   |  49 +-
 15 files changed, 516 insertions(+), 163 deletions(-)

base-commit: 088b8aa537c2c767765f1c19b555f21ffe555786
-- 
git-series 0.9.1


Re: [Nouveau] [PATCH 5/7] nouveau/dmem: Refactor nouveau_dmem_fault_copy_one()

2022-09-28 Thread Alistair Popple


Lyude Paul  writes:

> On Mon, 2022-09-26 at 16:03 +1000, Alistair Popple wrote:
>> nouveau_dmem_fault_copy_one() is used during handling of CPU faults via
>> the migrate_to_ram() callback and is used to copy data from GPU to CPU
>> memory. It is currently specific to fault handling, however a future
>> patch implementing eviction of data during teardown needs similar
>> functionality.
>>
>> Refactor out the core functionality so that it is not specific to fault
>> handling.
>>
>> Signed-off-by: Alistair Popple 
>> ---
>>  drivers/gpu/drm/nouveau/nouveau_dmem.c | 59 +--
>>  1 file changed, 29 insertions(+), 30 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
>> b/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> index f9234ed..66ebbd4 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> @@ -139,44 +139,25 @@ static void nouveau_dmem_fence_done(struct 
>> nouveau_fence **fence)
>>  }
>>  }
>>
>> -static vm_fault_t nouveau_dmem_fault_copy_one(struct nouveau_drm *drm,
>> -struct vm_fault *vmf, struct migrate_vma *args,
>> -dma_addr_t *dma_addr)
>> +static int nouveau_dmem_copy_one(struct nouveau_drm *drm, struct page 
>> *spage,
>> +struct page *dpage, dma_addr_t *dma_addr)
>>  {
>>  struct device *dev = drm->dev->dev;
>> -struct page *dpage, *spage;
>> -struct nouveau_svmm *svmm;
>> -
>> -spage = migrate_pfn_to_page(args->src[0]);
>> -if (!spage || !(args->src[0] & MIGRATE_PFN_MIGRATE))
>> -return 0;
>>
>> -dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
>> -if (!dpage)
>> -return VM_FAULT_SIGBUS;
>>  lock_page(dpage);
>>
>>  *dma_addr = dma_map_page(dev, dpage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
>>  if (dma_mapping_error(dev, *dma_addr))
>> -goto error_free_page;
>> +return -EIO;
>>
>> -svmm = spage->zone_device_data;
>> -mutex_lock(&svmm->mutex);
>> -nouveau_svmm_invalidate(svmm, args->start, args->end);
>>  if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_HOST, *dma_addr,
>> -NOUVEAU_APER_VRAM, nouveau_dmem_page_addr(spage)))
>> -goto error_dma_unmap;
>> -mutex_unlock(&svmm->mutex);
>> + NOUVEAU_APER_VRAM,
>> + nouveau_dmem_page_addr(spage))) {
>> +dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
>> +return -EIO;
>> +}
>
> Feel free to just align this with the starting (, as long as it doesn't go
> above 100 characters it doesn't really matter imho and would look nicer that
> way.
>
> Otherwise:
>
> Reviewed-by: Lyude Paul 

Thanks! I'm not sure I precisely understood your alignment comment above
but feel free to let me know if I got it wrong in v2.

> Will look at the other patch in a moment
>
>>
>> -args->dst[0] = migrate_pfn(page_to_pfn(dpage));
>>  return 0;
>> -
>> -error_dma_unmap:
>> -mutex_unlock(&svmm->mutex);
>> -dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
>> -error_free_page:
>> -__free_page(dpage);
>> -return VM_FAULT_SIGBUS;
>>  }
>>
>>  static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
>> @@ -184,9 +165,11 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct 
>> vm_fault *vmf)
>>  struct nouveau_drm *drm = page_to_drm(vmf->page);
>>  struct nouveau_dmem *dmem = drm->dmem;
>>  struct nouveau_fence *fence;
>> +struct nouveau_svmm *svmm;
>> +struct page *spage, *dpage;
>>  unsigned long src = 0, dst = 0;
>>  dma_addr_t dma_addr = 0;
>> -vm_fault_t ret;
>> +vm_fault_t ret = 0;
>>  struct migrate_vma args = {
>>  .vma= vmf->vma,
>>  .start  = vmf->address,
>> @@ -207,9 +190,25 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct 
>> vm_fault *vmf)
>>  if (!args.cpages)
>>  return 0;
>>
>> -ret = nouveau_dmem_fault_copy_one(drm, vmf, &args, &dma_addr);
>> -if (ret || dst == 0)
>> +spage = migrate_pfn_to_page(src);
>> +if (!spage || !(src & MIGRATE_PFN_MIGRATE))
>> +goto done;
>> +
>> +dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
>> +if (!dpage)
>> +goto done;
>> +
>> +dst = migrate_pfn(page_to_pfn(dpage));
>> +
>> +svmm = spage->zone_device_data;
>> +mutex_lock(&svmm->mutex);
>> +nouveau_svmm_invalidate(svmm, args.start, args.end);
>> +ret = nouveau_dmem_copy_one(drm, spage, dpage, &dma_addr);
>> +mutex_unlock(&svmm->mutex);
>> +if (ret) {
>> +ret = VM_FAULT_SIGBUS;
>>  goto done;
>> +}
>>
>>  nouveau_fence_new(dmem->migrate.chan, false, &fence);
>>  migrate_vma_pages(&args);


Re: [Nouveau] [PATCH 2/7] mm: Free device private pages have zero refcount

2022-09-26 Thread Alistair Popple


Jason Gunthorpe  writes:

> On Mon, Sep 26, 2022 at 04:03:06PM +1000, Alistair Popple wrote:
>> Since 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page
>> refcount") device private pages have no longer had an extra reference
>> count when the page is in use. However before handing them back to the
>> owning device driver we add an extra reference count such that free
>> pages have a reference count of one.
>>
>> This makes it difficult to tell if a page is free or not because both
>> free and in use pages will have a non-zero refcount. Instead we should
>> return pages to the drivers page allocator with a zero reference count.
>> Kernel code can then safely use kernel functions such as
>> get_page_unless_zero().
>>
>> Signed-off-by: Alistair Popple 
>> ---
>>  arch/powerpc/kvm/book3s_hv_uvmem.c   | 1 +
>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 +
>>  drivers/gpu/drm/nouveau/nouveau_dmem.c   | 1 +
>>  lib/test_hmm.c   | 1 +
>>  mm/memremap.c| 5 -
>>  mm/page_alloc.c  | 6 ++
>>  6 files changed, 10 insertions(+), 5 deletions(-)
>
> I think this is a great idea, but I'm surprised no dax stuff is
> touched here?

free_zone_device_page() shouldn't be called for pgmap->type ==
MEMORY_DEVICE_FS_DAX so I don't think we should have to worry about DAX
there. Except that the folio code looks like it might have introduced a
bug. AFAICT put_page() always calls
put_devmap_managed_page(&folio->page) but folio_put() does not (although
folios_put() does!). So it seems folio_put() won't end up calling
__put_devmap_managed_page_refs() as I think it should.

I think you're right about the change to __init_zone_device_page() - I
should limit it to DEVICE_PRIVATE/COHERENT pages only. But I need to
look at Dan's patch series more closely as I suspect it might be better
to rebase this patch on top of that.

> Jason


Re: [Nouveau] [PATCH 6/7] nouveau/dmem: Evict device private memory during release

2022-09-26 Thread Alistair Popple


Felix Kuehling  writes:

> On 2022-09-26 17:35, Lyude Paul wrote:
>> On Mon, 2022-09-26 at 16:03 +1000, Alistair Popple wrote:
>>> When the module is unloaded or a GPU is unbound from the module it is
>>> possible for device private pages to be left mapped in currently running
>>> processes. This leads to a kernel crash when the pages are either freed
>>> or accessed from the CPU because the GPU and associated data structures
>>> and callbacks have all been freed.
>>>
>>> Fix this by migrating any mappings back to normal CPU memory prior to
>>> freeing the GPU memory chunks and associated device private pages.
>>>
>>> Signed-off-by: Alistair Popple 
>>>
>>> ---
>>>
>>> I assume the AMD driver might have a similar issue. However I can't see
>>> where device private (or coherent) pages actually get unmapped/freed
>>> during teardown as I couldn't find any relevant calls to
>>> devm_memunmap(), memunmap(), devm_release_mem_region() or
>>> release_mem_region(). So it appears that ZONE_DEVICE pages are not being
>>> properly freed during module unload, unless I'm missing something?
>> I've got no idea, will poke Ben to see if they know the answer to this
>
> I guess we're relying on devm to release the region. Isn't the whole point of
> using devm_request_free_mem_region that we don't have to remember to 
> explicitly
> release it when the device gets destroyed? I believe we had an explicit free
> call at some point by mistake, and that caused a double-free during module
> unload. See this commit for reference:

Argh, thanks for that pointer. I was not so familiar with
devm_request_free_mem_region()/devm_memremap_pages() as currently
Nouveau explicitly manages that itself.

> commit 22f4f4faf337d5fb2d2750aff13215726814273e
> Author: Philip Yang 
> Date:   Mon Sep 20 17:25:52 2021 -0400
>
> drm/amdkfd: fix svm_migrate_fini warning
>  Device manager releases device-specific resources when a driver
> disconnects from a device, devm_memunmap_pages and
> devm_release_mem_region calls in svm_migrate_fini are redundant.
>  It causes below warning trace after patch "drm/amdgpu: Split
> amdgpu_device_fini into early and late", so remove function
> svm_migrate_fini.
>  BUG: https://gitlab.freedesktop.org/drm/amd/-/issues/1718
>  WARNING: CPU: 1 PID: 3646 at drivers/base/devres.c:795
> devm_release_action+0x51/0x60
> Call Trace:
> ? memunmap_pages+0x360/0x360
> svm_migrate_fini+0x2d/0x60 [amdgpu]
> kgd2kfd_device_exit+0x23/0xa0 [amdgpu]
> amdgpu_amdkfd_device_fini_sw+0x1d/0x30 [amdgpu]
> amdgpu_device_fini_sw+0x45/0x290 [amdgpu]
> amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
> drm_dev_release+0x20/0x40 [drm]
> release_nodes+0x196/0x1e0
> device_release_driver_internal+0x104/0x1d0
> driver_detach+0x47/0x90
> bus_remove_driver+0x7a/0xd0
> pci_unregister_driver+0x3d/0x90
> amdgpu_exit+0x11/0x20 [amdgpu]
>  Signed-off-by: Philip Yang 
> Reviewed-by: Felix Kuehling 
> Signed-off-by: Alex Deucher 
>
> Furthermore, I guess we are assuming that nobody is using the GPU when the
> module is unloaded. As long as any processes have /dev/kfd open, you won't be
> able to unload the module (except by force-unload). I suppose with ZONE_DEVICE
> memory, we can have references to device memory pages even when user mode has
> closed /dev/kfd. We do have a cleanup handler that runs in an 
> MMU-free-notifier.
> In theory that should run after all the pages in the mm_struct have been 
> freed.
> It releases all sorts of other device resources and needs the driver to still 
> be
> there. I'm not sure if there is anything preventing a module unload before the
> free-notifier runs. I'll look into that.

Right - module unload (or device unbind) is one of the other ways we can
hit this issue in Nouveau at least. You can end up with ZONE_DEVICE
pages mapped in a running process after the module has unloaded.
Although now you mention it that seems a bit wrong - the pgmap refcount
should provide some protection against that. Will have to look into
that too.

> Regards,
>   Felix
>
>
>>
>>> ---
>>>   drivers/gpu/drm/nouveau/nouveau_dmem.c | 48 +++-
>>>   1 file changed, 48 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
>>> b/drivers/gpu/drm/nouveau/nouveau_dmem.c
>>> index 66ebbd4..3b247b8 100644
>>> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
>>> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c

Re: [Nouveau] [PATCH 6/7] nouveau/dmem: Evict device private memory during release

2022-09-26 Thread Alistair Popple


John Hubbard  writes:

> On 9/26/22 14:35, Lyude Paul wrote:
>>> +   for (i = 0; i < npages; i++) {
>>> +   if (src_pfns[i] & MIGRATE_PFN_MIGRATE) {
>>> +   struct page *dpage;
>>> +
>>> +   /*
>>> +* _GFP_NOFAIL because the GPU is going away and there
>>> +* is nothing sensible we can do if we can't copy the
>>> +* data back.
>>> +*/
>>
>> You'll have to excuse me for a moment since this area of nouveau isn't one of
>> my strongpoints, but are we sure about this? IIRC __GFP_NOFAIL means infinite
>> retry, in the case of a GPU hotplug event I would assume we would rather just
>> stop trying to migrate things to the GPU and just drop the data instead of
>> hanging on infinite retries.
>>

No problem, thanks for taking a look!

> Hi Lyude!
>
> Actually, I really think it's better in this case to keep trying
> (presumably not necessarily infinitely, but only until memory becomes
> available), rather than failing out and corrupting data.
>
> That's because I'm not sure it's completely clear that this memory is
> discardable. And at some point, we're going to make this all work with
> file-backed memory, which will *definitely* not be discardable--I
> realize that we're not there yet, of course.
>
> But here, it's reasonable to commit to just retrying indefinitely,
> really. Memory should eventually show up. And if it doesn't, then
> restarting the machine is better than corrupting data, generally.

The memory is definitely not discardable here if the migration failed
because that implies it is still mapped into some userspace process.

We could avoid restarting the machine by doing something similar to what
happens during memory failure and killing every process that maps the
page(s). But overall I think it's better to retry until memory is
available, because that allows things like reclaim to work and in the
worst case allows the OOM killer to select an appropriate task to kill.
It also won't cause data corruption if/when we have file-backed memory.

> thanks,


[Nouveau] [PATCH 7/7] hmm-tests: Add test for migrate_device_range()

2022-09-26 Thread Alistair Popple
Signed-off-by: Alistair Popple 
---
 lib/test_hmm.c | 119 +-
 lib/test_hmm_uapi.h|   1 +-
 tools/testing/selftests/vm/hmm-tests.c |  49 +++-
 3 files changed, 148 insertions(+), 21 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 2bd3a67..d2821dd 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -100,6 +100,7 @@ struct dmirror {
 struct dmirror_chunk {
struct dev_pagemap  pagemap;
struct dmirror_device   *mdevice;
+   bool remove;
 };
 
 /*
@@ -192,11 +193,15 @@ static int dmirror_fops_release(struct inode *inode, 
struct file *filp)
return 0;
 }
 
+static struct dmirror_chunk *dmirror_page_to_chunk(struct page *page)
+{
+   return container_of(page->pgmap, struct dmirror_chunk, pagemap);
+}
+
 static struct dmirror_device *dmirror_page_to_device(struct page *page)
 
 {
-   return container_of(page->pgmap, struct dmirror_chunk,
-   pagemap)->mdevice;
+   return dmirror_page_to_chunk(page)->mdevice;
 }
 
 static int dmirror_do_fault(struct dmirror *dmirror, struct hmm_range *range)
@@ -1219,6 +1224,84 @@ static int dmirror_snapshot(struct dmirror *dmirror,
return ret;
 }
 
+static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
+{
+   unsigned long start_pfn = chunk->pagemap.range.start >> PAGE_SHIFT;
+   unsigned long end_pfn = chunk->pagemap.range.end >> PAGE_SHIFT;
+   unsigned long npages = end_pfn - start_pfn + 1;
+   unsigned long i;
+   unsigned long *src_pfns;
+   unsigned long *dst_pfns;
+
+   src_pfns = kcalloc(npages, sizeof(*src_pfns), GFP_KERNEL);
+   dst_pfns = kcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL);
+
+   migrate_device_range(src_pfns, start_pfn, npages);
+   for (i = 0; i < npages; i++) {
+   struct page *dpage, *spage;
+
+   spage = migrate_pfn_to_page(src_pfns[i]);
+   if (!spage || !(src_pfns[i] & MIGRATE_PFN_MIGRATE))
+   continue;
+
+   if (WARN_ON(!is_device_private_page(spage) &&
+   !is_device_coherent_page(spage)))
+   continue;
+   spage = BACKING_PAGE(spage);
+   dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
+   lock_page(dpage);
+   copy_highpage(dpage, spage);
+   dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
+   if (src_pfns[i] & MIGRATE_PFN_WRITE)
+   dst_pfns[i] |= MIGRATE_PFN_WRITE;
+   }
+   migrate_device_pages(src_pfns, dst_pfns, npages);
+   migrate_device_finalize(src_pfns, dst_pfns, npages);
+   kfree(src_pfns);
+   kfree(dst_pfns);
+}
+
+/* Removes free pages from the free list so they can't be re-allocated */
+static void dmirror_remove_free_pages(struct dmirror_chunk *devmem)
+{
+   struct dmirror_device *mdevice = devmem->mdevice;
+   struct page *page;
+
+   for (page = mdevice->free_pages; page; page = page->zone_device_data)
+   if (dmirror_page_to_chunk(page) == devmem)
+   mdevice->free_pages = page->zone_device_data;
+}
+
+static void dmirror_device_remove_chunks(struct dmirror_device *mdevice)
+{
+   unsigned int i;
+
+   mutex_lock(&mdevice->devmem_lock);
+   if (mdevice->devmem_chunks) {
+   for (i = 0; i < mdevice->devmem_count; i++) {
+   struct dmirror_chunk *devmem =
+   mdevice->devmem_chunks[i];
+
+   spin_lock(&mdevice->lock);
+   devmem->remove = true;
+   dmirror_remove_free_pages(devmem);
+   spin_unlock(&mdevice->lock);
+
+   dmirror_device_evict_chunk(devmem);
+   memunmap_pages(&devmem->pagemap);
+   if (devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
+   release_mem_region(devmem->pagemap.range.start,
+  range_len(&devmem->pagemap.range));
+   kfree(devmem);
+   }
+   mdevice->devmem_count = 0;
+   mdevice->devmem_capacity = 0;
+   mdevice->free_pages = NULL;
+   kfree(mdevice->devmem_chunks);
+   }
+   mutex_unlock(&mdevice->devmem_lock);
+}
+
 static long dmirror_fops_unlocked_ioctl(struct file *filp,
unsigned int command,
unsigned long arg)
@@ -1273,6 +1356,11 @@ static long dmirror_fops_unlocked_ioctl(struct file 
*filp,
	ret = dmirror_snapshot(dmirror, &cmd);
break;
 
+   case HMM_DMIRROR_RELEASE:
+   dmirror_device_remove_chunks(dmirror->mdevice);
+   ret = 0;

[Nouveau] [PATCH 6/7] nouveau/dmem: Evict device private memory during release

2022-09-26 Thread Alistair Popple
When the module is unloaded or a GPU is unbound from the module it is
possible for device private pages to be left mapped in currently running
processes. This leads to a kernel crash when the pages are either freed
or accessed from the CPU because the GPU and associated data structures
and callbacks have all been freed.

Fix this by migrating any mappings back to normal CPU memory prior to
freeing the GPU memory chunks and associated device private pages.

Signed-off-by: Alistair Popple 

---

I assume the AMD driver might have a similar issue. However I can't see
where device private (or coherent) pages actually get unmapped/freed
during teardown as I couldn't find any relevant calls to
devm_memunmap(), memunmap(), devm_release_mem_region() or
release_mem_region(). So it appears that ZONE_DEVICE pages are not being
properly freed during module unload, unless I'm missing something?
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 48 +++-
 1 file changed, 48 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 66ebbd4..3b247b8 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -369,6 +369,52 @@ nouveau_dmem_suspend(struct nouveau_drm *drm)
	mutex_unlock(&drm->dmem->mutex);
 }
 
+/*
+ * Evict all pages mapping a chunk.
+ */
+void
+nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
+{
+   unsigned long i, npages = range_len(&chunk->pagemap.range) >> PAGE_SHIFT;
+   unsigned long *src_pfns, *dst_pfns;
+   dma_addr_t *dma_addrs;
+   struct nouveau_fence *fence;
+
+   src_pfns = kcalloc(npages, sizeof(*src_pfns), GFP_KERNEL);
+   dst_pfns = kcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL);
+   dma_addrs = kcalloc(npages, sizeof(*dma_addrs), GFP_KERNEL);
+
+   migrate_device_range(src_pfns, chunk->pagemap.range.start >> PAGE_SHIFT,
+   npages);
+
+   for (i = 0; i < npages; i++) {
+   if (src_pfns[i] & MIGRATE_PFN_MIGRATE) {
+   struct page *dpage;
+
+   /*
+* _GFP_NOFAIL because the GPU is going away and there
+* is nothing sensible we can do if we can't copy the
+* data back.
+*/
+   dpage = alloc_page(GFP_HIGHUSER | __GFP_NOFAIL);
+   dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
+   nouveau_dmem_copy_one(chunk->drm,
+   migrate_pfn_to_page(src_pfns[i]), dpage,
+   &dma_addrs[i]);
+   }
+   }
+
+   nouveau_fence_new(chunk->drm->dmem->migrate.chan, false, &fence);
+   migrate_device_pages(src_pfns, dst_pfns, npages);
+   nouveau_dmem_fence_done(&fence);
+   migrate_device_finalize(src_pfns, dst_pfns, npages);
+   kfree(src_pfns);
+   kfree(dst_pfns);
+   for (i = 0; i < npages; i++)
+   dma_unmap_page(chunk->drm->dev->dev, dma_addrs[i], PAGE_SIZE, DMA_BIDIRECTIONAL);
+   kfree(dma_addrs);
+}
+
 void
 nouveau_dmem_fini(struct nouveau_drm *drm)
 {
@@ -380,8 +426,10 @@ nouveau_dmem_fini(struct nouveau_drm *drm)
	mutex_lock(&drm->dmem->mutex);
 
	list_for_each_entry_safe(chunk, tmp, &drm->dmem->chunks, list) {
+   nouveau_dmem_evict_chunk(chunk);
	nouveau_bo_unpin(chunk->bo);
	nouveau_bo_ref(NULL, &chunk->bo);
+   WARN_ON(chunk->callocated);
	list_del(&chunk->list);
	memunmap_pages(&chunk->pagemap);
	release_mem_region(chunk->pagemap.range.start,
			   range_len(&chunk->pagemap.range));
-- 
git-series 0.9.1


[Nouveau] [PATCH 5/7] nouveau/dmem: Refactor nouveau_dmem_fault_copy_one()

2022-09-26 Thread Alistair Popple
nouveau_dmem_fault_copy_one() is used during handling of CPU faults via
the migrate_to_ram() callback and is used to copy data from GPU to CPU
memory. It is currently specific to fault handling, however a future
patch implementing eviction of data during teardown needs similar
functionality.

Refactor out the core functionality so that it is not specific to fault
handling.

Signed-off-by: Alistair Popple 
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 59 +--
 1 file changed, 29 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index f9234ed..66ebbd4 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -139,44 +139,25 @@ static void nouveau_dmem_fence_done(struct nouveau_fence 
**fence)
}
 }
 
-static vm_fault_t nouveau_dmem_fault_copy_one(struct nouveau_drm *drm,
-   struct vm_fault *vmf, struct migrate_vma *args,
-   dma_addr_t *dma_addr)
+static int nouveau_dmem_copy_one(struct nouveau_drm *drm, struct page *spage,
+   struct page *dpage, dma_addr_t *dma_addr)
 {
struct device *dev = drm->dev->dev;
-   struct page *dpage, *spage;
-   struct nouveau_svmm *svmm;
-
-   spage = migrate_pfn_to_page(args->src[0]);
-   if (!spage || !(args->src[0] & MIGRATE_PFN_MIGRATE))
-   return 0;
 
-   dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
-   if (!dpage)
-   return VM_FAULT_SIGBUS;
lock_page(dpage);
 
*dma_addr = dma_map_page(dev, dpage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
if (dma_mapping_error(dev, *dma_addr))
-   goto error_free_page;
+   return -EIO;
 
-   svmm = spage->zone_device_data;
-   mutex_lock(&svmm->mutex);
-   nouveau_svmm_invalidate(svmm, args->start, args->end);
if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_HOST, *dma_addr,
-   NOUVEAU_APER_VRAM, nouveau_dmem_page_addr(spage)))
-   goto error_dma_unmap;
-   mutex_unlock(&svmm->mutex);
+NOUVEAU_APER_VRAM,
+nouveau_dmem_page_addr(spage))) {
+   dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+   return -EIO;
+   }
 
-   args->dst[0] = migrate_pfn(page_to_pfn(dpage));
return 0;
-
-error_dma_unmap:
-   mutex_unlock(&svmm->mutex);
-   dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
-error_free_page:
-   __free_page(dpage);
-   return VM_FAULT_SIGBUS;
 }
 
 static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
@@ -184,9 +165,11 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct 
vm_fault *vmf)
struct nouveau_drm *drm = page_to_drm(vmf->page);
struct nouveau_dmem *dmem = drm->dmem;
struct nouveau_fence *fence;
+   struct nouveau_svmm *svmm;
+   struct page *spage, *dpage;
unsigned long src = 0, dst = 0;
dma_addr_t dma_addr = 0;
-   vm_fault_t ret;
+   vm_fault_t ret = 0;
struct migrate_vma args = {
.vma= vmf->vma,
.start  = vmf->address,
@@ -207,9 +190,25 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct 
vm_fault *vmf)
if (!args.cpages)
return 0;
 
-   ret = nouveau_dmem_fault_copy_one(drm, vmf, &args, &dma_addr);
-   if (ret || dst == 0)
+   spage = migrate_pfn_to_page(src);
+   if (!spage || !(src & MIGRATE_PFN_MIGRATE))
+   goto done;
+
+   dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
+   if (!dpage)
+   goto done;
+
+   dst = migrate_pfn(page_to_pfn(dpage));
+
+   svmm = spage->zone_device_data;
+   mutex_lock(&svmm->mutex);
+   nouveau_svmm_invalidate(svmm, args.start, args.end);
+   ret = nouveau_dmem_copy_one(drm, spage, dpage, &dma_addr);
+   mutex_unlock(&svmm->mutex);
+   if (ret) {
+   ret = VM_FAULT_SIGBUS;
goto done;
+   }
 
	nouveau_fence_new(dmem->migrate.chan, false, &fence);
	migrate_vma_pages(&args);
-- 
git-series 0.9.1


[Nouveau] [PATCH 4/7] mm/migrate_device.c: Add migrate_device_range()

2022-09-26 Thread Alistair Popple
Device drivers can use the migrate_vma family of functions to migrate
existing private anonymous mappings to device private pages. These pages
are backed by memory on the device with drivers being responsible for
copying data to and from device memory.

Device private pages are freed via the pgmap->page_free() callback when
they are unmapped and their refcount drops to zero. Alternatively they
may be freed indirectly via migration back to CPU memory in response to
a pgmap->migrate_to_ram() callback called whenever the CPU accesses
an address mapped to a device private page.

In other words drivers cannot control the lifetime of data allocated on
the devices and must wait until these pages are freed from userspace.
This causes issues when memory needs to be reclaimed on the device, either
because the device is going away due to a ->release() callback or
because another user needs to use the memory.

Drivers could use the existing migrate_vma functions to migrate data off
the device. However this would require them to track the mappings of
each page which is both complicated and not always possible. Instead
drivers need to be able to migrate device pages directly so they can
free up device memory.

To allow that this patch introduces the migrate_device family of
functions which are functionally similar to migrate_vma but which skips
the initial lookup based on mapping.

Signed-off-by: Alistair Popple 
---
 include/linux/migrate.h |  7 +++-
 mm/migrate_device.c | 89 ++
 2 files changed, 89 insertions(+), 7 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 82ffa47..582cdc7 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -225,6 +225,13 @@ struct migrate_vma {
 int migrate_vma_setup(struct migrate_vma *args);
 void migrate_vma_pages(struct migrate_vma *migrate);
 void migrate_vma_finalize(struct migrate_vma *migrate);
+int migrate_device_range(unsigned long *src_pfns, unsigned long start,
+   unsigned long npages);
+void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
+   unsigned long npages);
+void migrate_device_finalize(unsigned long *src_pfns,
+   unsigned long *dst_pfns, unsigned long npages);
+
 #endif /* CONFIG_MIGRATION */
 
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index ba479b5..824860a 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -681,7 +681,7 @@ static void migrate_vma_insert_page(struct migrate_vma 
*migrate,
*src &= ~MIGRATE_PFN_MIGRATE;
 }
 
-static void migrate_device_pages(unsigned long *src_pfns,
+static void __migrate_device_pages(unsigned long *src_pfns,
unsigned long *dst_pfns, unsigned long npages,
struct migrate_vma *migrate)
 {
@@ -703,6 +703,9 @@ static void migrate_device_pages(unsigned long *src_pfns,
if (!page) {
unsigned long addr;
 
+   if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
+   continue;
+
/*
 * The only time there is no vma is when called from
 * migrate_device_coherent_page(). However this isn't
@@ -710,8 +713,6 @@ static void migrate_device_pages(unsigned long *src_pfns,
 */
VM_BUG_ON(!migrate);
addr = migrate->start + i*PAGE_SIZE;
-   if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
-   continue;
if (!notified) {
notified = true;
 
@@ -767,6 +768,22 @@ static void migrate_device_pages(unsigned long *src_pfns,
 }
 
 /**
+ * migrate_device_pages() - migrate meta-data from src page to dst page
+ * @src_pfns: src_pfns returned from migrate_device_range()
+ * @dst_pfns: array of pfns allocated by the driver to migrate memory to
+ * @npages: number of pages in the range
+ *
+ * Equivalent to migrate_vma_pages(). This is called to migrate struct page
+ * meta-data from source struct page to destination.
+ */
+void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
+   unsigned long npages)
+{
+   __migrate_device_pages(src_pfns, dst_pfns, npages, NULL);
+}
+EXPORT_SYMBOL(migrate_device_pages);
+
+/**
  * migrate_vma_pages() - migrate meta-data from src page to dst page
  * @migrate: migrate struct containing all migration information
  *
@@ -776,12 +793,22 @@ static void migrate_device_pages(unsigned long *src_pfns,
  */
 void migrate_vma_pages(struct migrate_vma *migrate)
 {
-   migrate_device_pages(migrate->src, migrate->dst, migrate->npages, migrate);
+   __migrate_device_pages(migrate->src, migrate->dst, migrate->npages, migrate);
 }
 EXPORT_SYMBOL(migrate_vma_pages);

[Nouveau] [PATCH 3/7] mm/migrate_device.c: Refactor migrate_vma and migrate_device_coherent_page()

2022-09-26 Thread Alistair Popple
migrate_device_coherent_page() reuses the existing migrate_vma family of
functions to migrate a specific page without providing a valid mapping
or vma. This looks a bit odd because it means we are calling
migrate_vma_*() without setting a valid vma, however it was considered
acceptable at the time because the details were internal to
migrate_device.c and there was only a single user.

One of the reasons the details could be kept internal was that this was
strictly for migrating device coherent memory. Such memory can be copied
directly by the CPU without intervention from a driver. However this
isn't true for device private memory, and a future change requires
similar functionality for device private memory. So refactor the code
into something more sensible for migrating device memory without a vma.

Signed-off-by: Alistair Popple 
---
 mm/migrate_device.c | 150 +
 1 file changed, 85 insertions(+), 65 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index f756c00..ba479b5 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -345,26 +345,20 @@ static bool migrate_vma_check_page(struct page *page, 
struct page *fault_page)
 }
 
 /*
- * migrate_vma_unmap() - replace page mapping with special migration pte entry
- * @migrate: migrate struct containing all migration information
- *
- * Isolate pages from the LRU and replace mappings (CPU page table pte) with a
- * special migration pte entry and check if it has been pinned. Pinned pages are
- * restored because we cannot migrate them.
- *
- * This is the last step before we call the device driver callback to allocate
- * destination memory and copy contents of original page over to new page.
+ * Unmaps pages for migration. Returns number of unmapped pages.
  */
-static void migrate_vma_unmap(struct migrate_vma *migrate)
+static unsigned long migrate_device_unmap(unsigned long *src_pfns,
+ unsigned long npages,
+ struct page *fault_page)
 {
-   const unsigned long npages = migrate->npages;
unsigned long i, restore = 0;
bool allow_drain = true;
+   unsigned long unmapped = 0;
 
lru_add_drain();
 
for (i = 0; i < npages; i++) {
-   struct page *page = migrate_pfn_to_page(migrate->src[i]);
+   struct page *page = migrate_pfn_to_page(src_pfns[i]);
struct folio *folio;
 
if (!page)
@@ -379,8 +373,7 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
}
 
if (isolate_lru_page(page)) {
-   migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-   migrate->cpages--;
+   src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
restore++;
continue;
}
@@ -394,34 +387,54 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
try_to_migrate(folio, 0);
 
if (page_mapped(page) ||
-   !migrate_vma_check_page(page, migrate->fault_page)) {
+   !migrate_vma_check_page(page, fault_page)) {
if (!is_zone_device_page(page)) {
get_page(page);
putback_lru_page(page);
}
 
-   migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-   migrate->cpages--;
+   src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
restore++;
continue;
}
+
+   unmapped++;
}
 
for (i = 0; i < npages && restore; i++) {
-   struct page *page = migrate_pfn_to_page(migrate->src[i]);
+   struct page *page = migrate_pfn_to_page(src_pfns[i]);
struct folio *folio;
 
-   if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
+   if (!page || (src_pfns[i] & MIGRATE_PFN_MIGRATE))
continue;
 
folio = page_folio(page);
remove_migration_ptes(folio, folio, false);
 
-   migrate->src[i] = 0;
+   src_pfns[i] = 0;
folio_unlock(folio);
folio_put(folio);
restore--;
}
+
+   return unmapped;
+}
+
+/*
+ * migrate_vma_unmap() - replace page mapping with special migration pte entry
+ * @migrate: migrate struct containing all migration information
+ *
+ * Isolate pages from the LRU and replace mappings (CPU page table pte) with a
+ * special migration pte entry and check if it has been pinned. Pinned pages are
+ * restored because we cannot migrate them.
+ *
+ * This is the last step before we call the device driver callback to allocate
+ * destination memory and copy contents of original page over to new page.

[Nouveau] [PATCH 2/7] mm: Free device private pages have zero refcount

2022-09-26 Thread Alistair Popple
Since 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page
refcount") device private pages have no longer had an extra reference
count when the page is in use. However before handing them back to the
owning device driver we add an extra reference count such that free
pages have a reference count of one.

This makes it difficult to tell if a page is free or not because both
free and in use pages will have a non-zero refcount. Instead we should
return pages to the drivers page allocator with a zero reference count.
Kernel code can then safely use kernel functions such as
get_page_unless_zero().

Signed-off-by: Alistair Popple 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c   | 1 +
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 +
 drivers/gpu/drm/nouveau/nouveau_dmem.c   | 1 +
 lib/test_hmm.c   | 1 +
 mm/memremap.c| 5 -
 mm/page_alloc.c  | 6 ++
 6 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index d4eacf4..08d2f7d 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -718,6 +718,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
 
dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
+   set_page_count(dpage, 1);
lock_page(dpage);
return dpage;
 out_clear:
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 776448b..05c2f4d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -223,6 +223,7 @@ svm_migrate_get_vram_page(struct svm_range *prange, 
unsigned long pfn)
page = pfn_to_page(pfn);
svm_range_bo_ref(prange->svm_bo);
page->zone_device_data = prange->svm_bo;
+   set_page_count(page, 1);
lock_page(page);
 }
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 1635661..f9234ed 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -326,6 +326,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}
 
+   set_page_count(page, 1);
lock_page(page);
return page;
 }
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 89463ff..2bd3a67 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -627,6 +627,7 @@ static struct page *dmirror_devmem_alloc_page(struct 
dmirror_device *mdevice)
goto error;
}
 
+   set_page_count(dpage, 1);
dpage->zone_device_data = rpage;
lock_page(dpage);
return dpage;
diff --git a/mm/memremap.c b/mm/memremap.c
index 25029a4..e065171 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -501,11 +501,6 @@ void free_zone_device_page(struct page *page)
 */
page->mapping = NULL;
page->pgmap->ops->page_free(page);
-
-   /*
-* Reset the page count to 1 to prepare for handing out the page again.
-*/
-   set_page_count(page, 1);
 }
 
 #ifdef CONFIG_FS_DAX
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9d49803..67eaab5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6744,6 +6744,12 @@ static void __ref __init_zone_device_page(struct page 
*page, unsigned long pfn,
set_pageblock_migratetype(page, MIGRATE_MOVABLE);
cond_resched();
}
+
+   /*
+* ZONE_DEVICE pages are released directly to the driver page allocator
+* which will set the page count to 1 when allocating the page.
+*/
+   set_page_count(page, 0);
 }
 
 /*
-- 
git-series 0.9.1


[Nouveau] [PATCH 1/7] mm/memory.c: Fix race when faulting a device private page

2022-09-26 Thread Alistair Popple
When the CPU tries to access a device private page the migrate_to_ram()
callback associated with the pgmap for the page is called. However no
reference is taken on the faulting page. Therefore a concurrent
migration of the device private page can free the page and possibly the
underlying pgmap. This results in a race which can crash the kernel due
to the migrate_to_ram() function pointer becoming invalid. It also means
drivers can't reliably read the zone_device_data field because the page
may have been freed with memunmap_pages().

Close the race by getting a reference on the page while holding the ptl
to ensure it has not been freed. Unfortunately the elevated reference
count will cause the migration required to handle the fault to fail. To
avoid this failure pass the faulting page into the migrate_vma functions
so that if an elevated reference count is found it can be checked to see
if it's expected or not.

Signed-off-by: Alistair Popple 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c   | 15 ++-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17 +++--
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 11 +---
 include/linux/migrate.h  |  8 ++-
 lib/test_hmm.c   |  7 ++---
 mm/memory.c  | 16 +++-
 mm/migrate.c | 34 ++---
 mm/migrate_device.c  | 18 +
 9 files changed, 87 insertions(+), 41 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 5980063..d4eacf4 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -508,10 +508,10 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
 static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
unsigned long start,
unsigned long end, unsigned long page_shift,
-   struct kvm *kvm, unsigned long gpa)
+   struct kvm *kvm, unsigned long gpa, struct page *fault_page)
 {
unsigned long src_pfn, dst_pfn = 0;
-   struct migrate_vma mig;
+   struct migrate_vma mig = { 0 };
struct page *dpage, *spage;
struct kvmppc_uvmem_page_pvt *pvt;
unsigned long pfn;
@@ -525,6 +525,7 @@ static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
	mig.dst = &dst_pfn;
	mig.pgmap_owner = &kvmppc_uvmem_pgmap;
mig.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
+   mig.fault_page = fault_page;
 
/* The requested page is already paged-out, nothing to do */
if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
@@ -580,12 +581,14 @@ static int __kvmppc_svm_page_out(struct vm_area_struct 
*vma,
 static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
  unsigned long start, unsigned long end,
  unsigned long page_shift,
- struct kvm *kvm, unsigned long gpa)
+ struct kvm *kvm, unsigned long gpa,
+ struct page *fault_page)
 {
int ret;
 
	mutex_lock(&kvm->arch.uvmem_lock);
-   ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa);
+   ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa,
+   fault_page);
	mutex_unlock(&kvm->arch.uvmem_lock);
 
return ret;
@@ -736,7 +739,7 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma,
bool pagein)
 {
unsigned long src_pfn, dst_pfn = 0;
-   struct migrate_vma mig;
+   struct migrate_vma mig = { 0 };
struct page *spage;
unsigned long pfn;
struct page *dpage;
@@ -994,7 +997,7 @@ static vm_fault_t kvmppc_uvmem_migrate_to_ram(struct 
vm_fault *vmf)
 
if (kvmppc_svm_page_out(vmf->vma, vmf->address,
vmf->address + PAGE_SIZE, PAGE_SHIFT,
-   pvt->kvm, pvt->gpa))
+   pvt->kvm, pvt->gpa, vmf->page))
return VM_FAULT_SIGBUS;
else
return 0;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index b059a77..776448b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -409,7 +409,7 @@ svm_migrate_vma_to_vram(struct amdgpu_device *adev, struct 
svm_range *prange,
uint64_t npages = (end - start) >> PAGE_SHIFT;
struct kfd_process_device *pdd;
struct dma_fence *mfence = NULL;
-   struct migrate_vma migrate;
+   struct migrate_vma migrate = { 0 };
unsigned long cpages = 0;
dma_addr_t *scratch;
void *buf;
@@ -668,7 +668,7 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct svm_range *prange,

[Nouveau] [PATCH 0/7] Fix several device private page reference counting issues

2022-09-26 Thread Alistair Popple
This series aims to fix a number of page reference counting issues in drivers
dealing with device private ZONE_DEVICE pages. These result in use-after-free
type bugs, either from accessing a struct page which no longer exists because it
has been removed or accessing fields within the struct page which are no longer
valid because the page has been freed.

During normal usage it is unlikely these will cause any problems. However
without these fixes it is possible to crash the kernel from userspace. These
crashes can be triggered either by unloading the kernel module or unbinding the
device from the driver prior to a userspace task exiting. In modules such as
Nouveau it is also possible to trigger some of these issues by explicitly
closing the device file-descriptor prior to the task exiting and then accessing
device private memory.

This involves changes to both PowerPC and AMD GPU code. Unfortunately I lack the
hardware to test on either of these so would appreciate it if someone with
access could test those.

Alistair Popple (7):
  mm/memory.c: Fix race when faulting a device private page
  mm: Free device private pages have zero refcount
  mm/migrate_device.c: Refactor migrate_vma and migrate_device_coherent_page()
  mm/migrate_device.c: Add migrate_device_range()
  nouveau/dmem: Refactor nouveau_dmem_fault_copy_one()
  nouveau/dmem: Evict device private memory during release
  hmm-tests: Add test for migrate_device_range()

 arch/powerpc/kvm/book3s_hv_uvmem.c   |  16 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  18 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |   2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c |  11 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c   | 108 +++
 include/linux/migrate.h  |  15 ++-
 lib/test_hmm.c   | 127 ++---
 lib/test_hmm_uapi.h  |   1 +-
 mm/memory.c  |  16 +-
 mm/memremap.c|   5 +-
 mm/migrate.c |  34 +--
 mm/migrate_device.c  | 239 +---
 mm/page_alloc.c  |   6 +-
 tools/testing/selftests/vm/hmm-tests.c   |  49 +-
 14 files changed, 487 insertions(+), 160 deletions(-)

base-commit: 088b8aa537c2c767765f1c19b555f21ffe555786
-- 
git-series 0.9.1


[Nouveau] [PATCH] mm/migrate.c: Remove MIGRATE_PFN_LOCKED

2021-10-25 Thread Alistair Popple
MIGRATE_PFN_LOCKED is used to indicate to migrate_vma_prepare() that a
source page was already locked during migrate_vma_collect(). If it
wasn't then a second attempt is made to lock the page. However if
the first attempt failed it's unlikely a second attempt will succeed,
and the retry adds complexity. So clean this up by removing the retry
and MIGRATE_PFN_LOCKED flag.

Destination pages are also meant to have the MIGRATE_PFN_LOCKED flag
set, but nothing actually checks that.
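
For illustration, after this change a driver's allocate-and-copy step
reduces to something like the sketch below (dpage, args, addr and i are
placeholder names, error handling omitted):

        dpage = alloc_page_vma(GFP_HIGHUSER, args->vma, addr);
        if (dpage) {
                lock_page(dpage);       /* the page still gets locked here... */
                args->dst[i] = migrate_pfn(page_to_pfn(dpage));
                /* ...but MIGRATE_PFN_LOCKED no longer needs to be set. */
        }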

Signed-off-by: Alistair Popple 
---
 Documentation/vm/hmm.rst |   2 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c   |   4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |   2 -
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |   4 +-
 include/linux/migrate.h  |   1 -
 lib/test_hmm.c   |   5 +-
 mm/migrate.c | 145 +--
 7 files changed, 35 insertions(+), 128 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index a14c2938e7af..f2a59ed82ed3 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -360,7 +360,7 @@ between device driver specific code and shared common code:
system memory page, locks the page with ``lock_page()``, and fills in the
``dst`` array entry with::
 
- dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+ dst[i] = migrate_pfn(page_to_pfn(dpage));
 
Now that the driver knows that this page is being migrated, it can
invalidate device private MMU mappings and copy device private memory
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index a7061ee3b157..28c436df9935 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -560,7 +560,7 @@ static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
  gpa, 0, page_shift);
 
if (ret == U_SUCCESS)
-   *mig.dst = migrate_pfn(pfn) | MIGRATE_PFN_LOCKED;
+   *mig.dst = migrate_pfn(pfn);
else {
unlock_page(dpage);
__free_page(dpage);
@@ -774,7 +774,7 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma,
}
}
 
-   *mig.dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+   *mig.dst = migrate_pfn(page_to_pfn(dpage));
migrate_vma_pages(&mig);
 out_finalize:
migrate_vma_finalize(&mig);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 4a16e3c257b9..41d9417f182b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -300,7 +300,6 @@ svm_migrate_copy_to_vram(struct amdgpu_device *adev, struct 
svm_range *prange,
migrate->dst[i] = svm_migrate_addr_to_pfn(adev, dst[i]);
svm_migrate_get_vram_page(prange, migrate->dst[i]);
migrate->dst[i] = migrate_pfn(migrate->dst[i]);
-   migrate->dst[i] |= MIGRATE_PFN_LOCKED;
src[i] = dma_map_page(dev, spage, 0, PAGE_SIZE,
  DMA_TO_DEVICE);
r = dma_mapping_error(dev, src[i]);
@@ -580,7 +579,6 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct 
svm_range *prange,
  dst[i] >> PAGE_SHIFT, page_to_pfn(dpage));
 
migrate->dst[i] = migrate_pfn(page_to_pfn(dpage));
-   migrate->dst[i] |= MIGRATE_PFN_LOCKED;
j++;
}
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 92987daa5e17..3828aafd3ac4 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -166,7 +166,7 @@ static vm_fault_t nouveau_dmem_fault_copy_one(struct 
nouveau_drm *drm,
goto error_dma_unmap;
mutex_unlock(&drm->dmem->mutex);
 
-   args->dst[0] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+   args->dst[0] = migrate_pfn(page_to_pfn(dpage));
return 0;
 
 error_dma_unmap:
@@ -602,7 +602,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct 
nouveau_drm *drm,
((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT);
if (src & MIGRATE_PFN_WRITE)
*pfn |= NVIF_VMM_PFNMAP_V0_W;
-   return migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+   return migrate_pfn(page_to_pfn(dpage));
 
 out_dma_unmap:
dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index c8077e936691..479b861ae490 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -119,7 +119,6 @@ static inline int migrate_misplaced_page(struct page *page,
  */
 #define MIGRATE_PFN_VALID  (1UL << 0)
 #define MIGRATE_PFN_MIGRA

Re: [Nouveau] nouveau: failed to initialise sync

2021-07-06 Thread Alistair Popple
Thanks Christian,

That patch fixed it for me, so I guess it just hasn't quite made it upstream
yet.

 - Alistair

On Tuesday, 6 July 2021 4:58:37 PM AEST Christian König wrote:
> Hi guys,
> 
> yes nouveau was using the same functionality for internal BOs without 
> noticing it. This is fixed by the following commit:
> 
> commit d098775ed44021293b1962dea61efb19297b8d02
> Author: Christian König 
> Date:   Wed Jun 9 19:25:56 2021 +0200
> 
>  drm/nouveau: init the base GEM fields for internal BOs
> 
>  TTMs buffer objects are based on GEM objects for quite a while
>  and rely on initializing those fields before initializing the TTM BO.
> 
>  Nouveau now doesn't init the GEM object for internally allocated BOs,
>  so make sure that we at least initialize some necessary fields.
> 
> Could be that the patch needs to be send to stable as well.
> 
> Regards,
> Christian.
> 
> Am 06.07.21 um 04:44 schrieb Alistair Popple:
> > I am also hitting this with upstream. Reverting d02117f8efaa ("drm/ttm: 
> > remove
> > special handling for non GEM drivers") also fixed it for me.
> >
> > The change log for that commit reads:
> >
> >  drm/ttm: remove special handling for non GEM drivers
> >  
> >  vmwgfx is the only driver actually using this. Move the handling into
> >  the driver instead.
> >
> > I wonder if Nouveau might actually have been using this somehow too?
> >
> >   - Alistair
> >
> > On Sunday, 4 July 2021 5:21:29 AM AEST Corentin Labbe wrote:
> >> Hello
> >>
> >> For some days now on -next, nouveau has failed to load:
> >> [2.754087] nouveau :02:00.0: vgaarb: deactivate vga console
> >> [2.761260] Console: switching to colour dummy device 80x25
> >> [2.766888] nouveau :02:00.0: NVIDIA MCP77/MCP78 (0aa480a2)
> >> [2.783954] nouveau :02:00.0: bios: version 62.77.2a.00.04
> >> [2.810122] nouveau :02:00.0: fb: 256 MiB stolen system memory
> >> [3.484031] nouveau :02:00.0: DRM: VRAM: 256 MiB
> >> [3.488993] nouveau :02:00.0: DRM: GART: 1048576 MiB
> >> [3.494308] nouveau :02:00.0: DRM: TMDS table version 2.0
> >> [3.500052] nouveau :02:00.0: DRM: DCB version 4.0
> >> [3.505192] nouveau :02:00.0: DRM: DCB outp 00: 01000300 001e
> >> [3.511632] nouveau :02:00.0: DRM: DCB outp 01: 01011332 00020010
> >> [3.518074] nouveau :02:00.0: DRM: DCB conn 00: 0100
> >> [3.523728] nouveau :02:00.0: DRM: DCB conn 01: 1261
> >> [3.529455] nouveau :02:00.0: DRM: failed to initialise sync
> > subsystem, -28
> >> [3.545946] nouveau: probe of :02:00.0 failed with error -28
> >>
> >> I bisected it to:
> >> git bisect start
> >> # good: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
> >> git bisect good 62fb9874f5da54fdb243003b386128037319b219
> >> # bad: [fb0ca446157a86b75502c1636b0d81e642fe6bf1] Add linux-next specific
> > files for 20210701
> >> git bisect bad fb0ca446157a86b75502c1636b0d81e642fe6bf1
> >> # good: [f63c4fda987a19b1194cc45cb72fd5bf968d9d90] Merge remote-tracking
> > branch 'rdma/for-next'
> >> git bisect good f63c4fda987a19b1194cc45cb72fd5bf968d9d90
> >> # bad: [49c8769be0b910d4134eba07cae5d9c71b861c4a] Merge remote-tracking
> > branch 'drm/drm-next'
> >> git bisect bad 49c8769be0b910d4134eba07cae5d9c71b861c4a
> >> # good: [4e3db44a242a4e2afe33b59793898ecbb61d478e] Merge tag 'wireless-
> > drivers-next-2021-06-25' of git://git.kernel.org/pub/scm/linux/kernel/git/
> > kvalo/wireless-drivers-next
> >> git bisect good 4e3db44a242a4e2afe33b59793898ecbb61d478e
> >> # bad: [5745d647d5563d3e9d32013ad4e5c629acff04d7] Merge tag 'amd-drm-
> > next-5.14-2021-06-02' of https://gitlab.freedesktop.org/agd5f/linux into drm-
> > next
> >> git bisect bad 5745d647d5563d3e9d32013ad4e5c629acff04d7
> >> # bad: [c99c4d0ca57c978dcc2a2f41ab8449684ea154cc] Merge tag 'amd-drm-
> > next-5.14-2021-05-19' of https://gitlab.freedesktop.org/agd5f/linux into drm-
> > next

Re: [Nouveau] nouveau: failed to initialise sync

2021-07-05 Thread Alistair Popple
I am also hitting this with upstream. Reverting d02117f8efaa ("drm/ttm: remove 
special handling for non GEM drivers") also fixed it for me.

The change log for that commit reads:

drm/ttm: remove special handling for non GEM drivers

vmwgfx is the only driver actually using this. Move the handling into
the driver instead.

I wonder if Nouveau might actually have been using this somehow too?

 - Alistair

On Sunday, 4 July 2021 5:21:29 AM AEST Corentin Labbe wrote:
> Hello
> 
> For some days now on -next, nouveau has failed to load:
> [2.754087] nouveau :02:00.0: vgaarb: deactivate vga console
> [2.761260] Console: switching to colour dummy device 80x25
> [2.766888] nouveau :02:00.0: NVIDIA MCP77/MCP78 (0aa480a2)
> [2.783954] nouveau :02:00.0: bios: version 62.77.2a.00.04
> [2.810122] nouveau :02:00.0: fb: 256 MiB stolen system memory
> [3.484031] nouveau :02:00.0: DRM: VRAM: 256 MiB
> [3.488993] nouveau :02:00.0: DRM: GART: 1048576 MiB
> [3.494308] nouveau :02:00.0: DRM: TMDS table version 2.0
> [3.500052] nouveau :02:00.0: DRM: DCB version 4.0
> [3.505192] nouveau :02:00.0: DRM: DCB outp 00: 01000300 001e
> [3.511632] nouveau :02:00.0: DRM: DCB outp 01: 01011332 00020010
> [3.518074] nouveau :02:00.0: DRM: DCB conn 00: 0100
> [3.523728] nouveau :02:00.0: DRM: DCB conn 01: 1261
> [3.529455] nouveau :02:00.0: DRM: failed to initialise sync 
subsystem, -28
> [3.545946] nouveau: probe of :02:00.0 failed with error -28
> 
> I bisected it to:
> git bisect start
> # good: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
> git bisect good 62fb9874f5da54fdb243003b386128037319b219
> # bad: [fb0ca446157a86b75502c1636b0d81e642fe6bf1] Add linux-next specific 
files for 20210701
> git bisect bad fb0ca446157a86b75502c1636b0d81e642fe6bf1
> # good: [f63c4fda987a19b1194cc45cb72fd5bf968d9d90] Merge remote-tracking 
branch 'rdma/for-next'
> git bisect good f63c4fda987a19b1194cc45cb72fd5bf968d9d90
> # bad: [49c8769be0b910d4134eba07cae5d9c71b861c4a] Merge remote-tracking 
branch 'drm/drm-next'
> git bisect bad 49c8769be0b910d4134eba07cae5d9c71b861c4a
> # good: [4e3db44a242a4e2afe33b59793898ecbb61d478e] Merge tag 'wireless-
drivers-next-2021-06-25' of git://git.kernel.org/pub/scm/linux/kernel/git/
kvalo/wireless-drivers-next
> git bisect good 4e3db44a242a4e2afe33b59793898ecbb61d478e
> # bad: [5745d647d5563d3e9d32013ad4e5c629acff04d7] Merge tag 'amd-drm-
next-5.14-2021-06-02' of https://gitlab.freedesktop.org/agd5f/linux into drm-
next
> git bisect bad 5745d647d5563d3e9d32013ad4e5c629acff04d7
> # bad: [c99c4d0ca57c978dcc2a2f41ab8449684ea154cc] Merge tag 'amd-drm-
next-5.14-2021-05-19' of https://gitlab.freedesktop.org/agd5f/linux into drm-
next
> git bisect bad c99c4d0ca57c978dcc2a2f41ab8449684ea154cc
> # bad: [ae25ec2fc6c5a9e5767bf1922cd648501d0f914c] Merge tag 'drm-misc-
next-2021-05-17' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
> git bisect bad ae25ec2fc6c5a9e5767bf1922cd648501d0f914c
> # bad: [cac80e71cfb0b00202d743c6e90333c45ba77cc5] drm/vkms: rename cursor to 
plane on ops of planes composition
> git bisect bad cac80e71cfb0b00202d743c6e90333c45ba77cc5
> # good: [178bdba84c5f0ad14de384fc7f15fba0e272919d] drm/ttm/ttm_device: 
Demote kernel-doc abuses
> git bisect good 178bdba84c5f0ad14de384fc7f15fba0e272919d
> # bad: [3f3a6524f6065fd3d130515e012f63eac74d96da] drm/dp: Clarify DP AUX 
registration time
> git bisect bad 3f3a6524f6065fd3d130515e012f63eac74d96da
> # bad: [6dd7efc437611db16d432e0030f72d0c7e890127] drm/gud: cleanup coding 
style a bit
> git bisect bad 6dd7efc437611db16d432e0030f72d0c7e890127
> # bad: [13b29cc3a722c2c0bc9ab9f72f9047d55d08a2f9] drm/mxsfb: Don't select 
DRM_KMS_FB_HELPER
> git bisect bad 13b29cc3a722c2c0bc9ab9f72f9047d55d08a2f9
> # bad: [d02117f8efaa5fbc37437df1ae955a147a2a424a] drm/ttm: remove special 
handling for non GEM drivers
> git bisect bad d02117f8efaa5fbc37437df1ae955a147a2a424a
> # good: [13ea9aa1e7d891e950230e82f1dd2c84e5debcff] drm/ttm: fix error 
handling if no BO can be swapped out v4
> git bisect good 13ea9aa1e7d891e950230e82f1dd2c84e5debcff
> # first bad commit: [d02117f8efaa5fbc37437df1ae955a147a2a424a] drm/ttm: 
remove special handling for non GEM drivers
> 
> Reverting the patch permit to have nouveau works again.
> 
> Regards
> 
> 




___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v11 00/10] Add support for SVM atomics in Nouveau

2021-06-16 Thread Alistair Popple
On Thursday, 17 June 2021 9:35:29 AM AEST Andrew Morton wrote:
> On Wed, 16 Jun 2021 20:59:27 +1000 Alistair Popple  
wrote:
> 
> > This is my series to add support for SVM atomics in Nouveau
> 
> Can we please have a nice [0/n] overview for this patchset?
> 

Sorry, I forgot to include that this time. Please see the update I just 
posted. Let me know if you need me to resend the whole series with the updated 
cover letter. Thanks.

 - Alistair



___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


[Nouveau] Update: [PATCH v11 00/10] Add support for SVM atomics in Nouveau

2021-06-16 Thread Alistair Popple
Introduction
============

Some devices have features such as atomic PTE bits that can be used to
implement atomic access to system memory. To support atomic operations to a
shared virtual memory page such a device needs access to that page which is
exclusive of the CPU. This series introduces a mechanism to temporarily
unmap pages granting exclusive access to a device.

These changes are required to support OpenCL atomic operations in Nouveau
to shared virtual memory (SVM) regions allocated with the
CL_MEM_SVM_ATOMICS clSVMAlloc flag. A more complete description of the
OpenCL SVM feature is available at
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/
OpenCL_API.html#_shared_virtual_memory .

Implementation
==============

Exclusive device access is implemented by adding a new swap entry type
(SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main
difference is that on fault the original entry is immediately restored by
the fault handler instead of waiting.

Restoring the entry triggers calls to MMU notifiers which allow a device
driver to revoke the atomic access permission from the GPU prior to the CPU
finalising the entry.
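
As a rough sketch of the intended driver usage (my_owner_cookie and the
device-side mapping step are placeholders, error handling omitted):

        struct page *page = NULL;
        int ret;

        mmap_read_lock(mm);
        /* Replace the CPU mappings of one page with exclusive entries. */
        ret = make_device_exclusive_range(mm, addr, addr + PAGE_SIZE,
                                          &page, my_owner_cookie);
        if (ret == 1 && page) {
                /* Map the page for atomic access in the device page tables
                 * (device specific), then drop the lock and reference. The
                 * exclusive entry remains until a CPU fault restores it. */
                unlock_page(page);
                put_page(page);
        }
        mmap_read_unlock(mm);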

Patches
=======

Patches 1 & 2 refactor existing migration and device private entry
functions.

Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
functionality into separate functions - try_to_migrate_one() and
try_to_munlock_one().

Patch 5 renames some existing code but does not introduce functionality.

Patch 6 is a small clean-up to swap entry handling in copy_pte_range().

Patch 7 contains the bulk of the implementation for device exclusive
memory.

Patch 8 contains some additions to the HMM selftests to ensure everything
works as expected.

Patch 9 is a cleanup for the Nouveau SVM implementation.

Patch 10 contains the implementation of atomic access for the Nouveau
driver.

Testing
=======

This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
which checks that GPU atomic accesses to system memory are atomic. Without
this series the test fails as there is no way of write-protecting the page
mapping which results in the device clobbering CPU writes. For reference
the test is available at https://ozlabs.org/~apopple/opencl_svm_atomics/

Further testing has been performed by adding support for testing exclusive
access to the hmm-tests kselftests.
On Wednesday, 16 June 2021 8:59:27 PM AEST Alistair Popple wrote:

Alistair Popple (10):
  mm: Remove special swap entry functions
  mm/swapops: Rework swap entry manipulation code
  mm/rmap: Split try_to_munlock from try_to_unmap
  mm/rmap: Split migration into its own function
  mm: Rename migrate_pgmap_owner
  mm/memory.c: Allow different return codes for copy_nonpresent_pte()
  mm: Device exclusive memory access
  mm: Selftests for exclusive device memory
  nouveau/svm: Refactor nouveau_range_fault
  nouveau/svm: Implement atomic SVM access

 Documentation/vm/hmm.rst  |  19 +-
 Documentation/vm/unevictable-lru.rst  |  33 +-
 arch/s390/mm/pgtable.c|   2 +-
 drivers/gpu/drm/nouveau/include/nvif/if000c.h |   1 +
 drivers/gpu/drm/nouveau/nouveau_svm.c | 156 -
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h |   1 +
 .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c|   6 +
 fs/proc/task_mmu.c|  23 +-
 include/linux/mmu_notifier.h  |  26 +-
 include/linux/rmap.h  |  11 +-
 include/linux/swap.h  |  13 +-
 include/linux/swapops.h   | 123 ++--
 lib/test_hmm.c| 127 +++-
 lib/test_hmm_uapi.h   |   2 +
 mm/debug_vm_pgtable.c |  12 +-
 mm/hmm.c  |  12 +-
 mm/huge_memory.c  |  48 +-
 mm/hugetlb.c  |  10 +-
 mm/memcontrol.c   |   2 +-
 mm/memory.c   | 175 -
 mm/migrate.c  |  51 +-
 mm/mlock.c|  12 +-
 mm/mprotect.c |  18 +-
 mm/page_vma_mapped.c  |  15 +-
 mm/rmap.c | 612 +++---
 tools/testing/selftests/vm/hmm-tests.c| 158 +
 26 files changed, 1341 insertions(+), 327 deletions(-)



___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


[Nouveau] [PATCH v11 10/10] nouveau/svm: Implement atomic SVM access

2021-06-16 Thread Alistair Popple
Some NVIDIA GPUs do not support direct atomic access to system memory
via PCIe. Instead this must be emulated by granting the GPU exclusive
access to the memory. This is achieved by replacing CPU page table
entries with special swap entries that fault on userspace access.

The driver then grants the GPU permission to update the page undergoing
atomic access via the GPU page tables. When CPU access to the page is
required a CPU fault is raised which calls into the device driver via
MMU notifiers to revoke the atomic access. The original page table
entries are then restored allowing CPU access to proceed.
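
Roughly, the atomic fault path ends up doing the following (a heavily
simplified sketch; see nouveau_atomic_range_fault() in the diff for the
real locking and retry handling, and note drm->dev is just the owner
cookie used to filter notifier callbacks):

        ret = make_device_exclusive_range(mm, start, start + PAGE_SIZE,
                                          &page, drm->dev);
        if (ret == 1 && page) {
                /* Map the page for the GPU with atomic access enabled. */
                args->p.phys[0] = (page_to_pfn(page) <<
                                   NVIF_VMM_PFNMAP_V0_ADDR_SHIFT) |
                                  NVIF_VMM_PFNMAP_V0_V |
                                  NVIF_VMM_PFNMAP_V0_W |
                                  NVIF_VMM_PFNMAP_V0_A;
        }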

Signed-off-by: Alistair Popple 
Reviewed-by: Ben Skeggs 

---

v10:
* Added a fix from Colin King to check the return code of
  make_device_exclusive.

v9:
* Added Ben's Reviewed-By

v7:
* Removed magic values for fault access levels
* Improved readability of fault comparison code

v4:
* Check that page table entries haven't changed before mapping on the
  device
---
 drivers/gpu/drm/nouveau/include/nvif/if000c.h |   1 +
 drivers/gpu/drm/nouveau/nouveau_svm.c | 126 --
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h |   1 +
 .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c|   6 +
 4 files changed, 123 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/include/nvif/if000c.h 
b/drivers/gpu/drm/nouveau/include/nvif/if000c.h
index d6dd40f21eed..9c7ff56831c5 100644
--- a/drivers/gpu/drm/nouveau/include/nvif/if000c.h
+++ b/drivers/gpu/drm/nouveau/include/nvif/if000c.h
@@ -77,6 +77,7 @@ struct nvif_vmm_pfnmap_v0 {
 #define NVIF_VMM_PFNMAP_V0_APER   0x00f0ULL
 #define NVIF_VMM_PFNMAP_V0_HOST   0xULL
 #define NVIF_VMM_PFNMAP_V0_VRAM   0x0010ULL
+#define NVIF_VMM_PFNMAP_V0_A 0x0004ULL
 #define NVIF_VMM_PFNMAP_V0_W  0x0002ULL
 #define NVIF_VMM_PFNMAP_V0_V  0x0001ULL
 #define NVIF_VMM_PFNMAP_V0_NONE   0xULL
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index a195e48c9aee..63a4976b906b 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct nouveau_svm {
struct nouveau_drm *drm;
@@ -67,6 +68,11 @@ struct nouveau_svm {
} buffer[1];
 };
 
+#define FAULT_ACCESS_READ 0
+#define FAULT_ACCESS_WRITE 1
+#define FAULT_ACCESS_ATOMIC 2
+#define FAULT_ACCESS_PREFETCH 3
+
 #define SVM_DBG(s,f,a...) NV_DEBUG((s)->drm, "svm: "f"\n", ##a)
 #define SVM_ERR(s,f,a...) NV_WARN((s)->drm, "svm: "f"\n", ##a)
 
@@ -411,6 +417,24 @@ nouveau_svm_fault_cancel_fault(struct nouveau_svm *svm,
  fault->client);
 }
 
+static int
+nouveau_svm_fault_priority(u8 fault)
+{
+   switch (fault) {
+   case FAULT_ACCESS_PREFETCH:
+   return 0;
+   case FAULT_ACCESS_READ:
+   return 1;
+   case FAULT_ACCESS_WRITE:
+   return 2;
+   case FAULT_ACCESS_ATOMIC:
+   return 3;
+   default:
+   WARN_ON_ONCE(1);
+   return -1;
+   }
+}
+
 static int
 nouveau_svm_fault_cmp(const void *a, const void *b)
 {
@@ -421,9 +445,8 @@ nouveau_svm_fault_cmp(const void *a, const void *b)
return ret;
if ((ret = (s64)fa->addr - fb->addr))
return ret;
-   /*XXX: atomic? */
-   return (fa->access == 0 || fa->access == 3) -
-  (fb->access == 0 || fb->access == 3);
+   return nouveau_svm_fault_priority(fa->access) -
+   nouveau_svm_fault_priority(fb->access);
 }
 
 static void
@@ -487,6 +510,10 @@ static bool nouveau_svm_range_invalidate(struct 
mmu_interval_notifier *mni,
struct svm_notifier *sn =
container_of(mni, struct svm_notifier, notifier);
 
+   if (range->event == MMU_NOTIFY_EXCLUSIVE &&
+   range->owner == sn->svmm->vmm->cli->drm->dev)
+   return true;
+
/*
 * serializes the update to mni->invalidate_seq done by caller and
 * prevents invalidation of the PTE from progressing while HW is being
@@ -555,6 +582,71 @@ static void nouveau_hmm_convert_pfn(struct nouveau_drm 
*drm,
args->p.phys[0] |= NVIF_VMM_PFNMAP_V0_W;
 }
 
+static int nouveau_atomic_range_fault(struct nouveau_svmm *svmm,
+  struct nouveau_drm *drm,
+  struct nouveau_pfnmap_args *args, u32 size,
+  struct svm_notifier *notifier)
+{
+   unsigned long timeout =
+   jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+   str

[Nouveau] [PATCH v11 09/10] nouveau/svm: Refactor nouveau_range_fault

2021-06-16 Thread Alistair Popple
Call mmu_interval_notifier_insert() as part of nouveau_range_fault().
This doesn't introduce any functional change but makes it easier for a
subsequent patch to alter the behaviour of nouveau_range_fault() to
support GPU atomic operations.
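
In other words the call ordering changes roughly like this (sketch of
nouveau_svm_fault(), see the diff below):

        /* Before: the caller managed the notifier around the fault. */
        ret = mmu_interval_notifier_insert(&notifier, mm, ...);
        if (!ret) {
                ret = nouveau_range_fault(svmm, svm->drm, &args, ...);
                mmu_interval_notifier_remove(&notifier);
        }

        /* After: nouveau_range_fault() inserts and removes it internally. */
        ret = nouveau_range_fault(svmm, svm->drm, &args, ...);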

Signed-off-by: Alistair Popple 
Reviewed-by: Ben Skeggs 

---

v9:

* Added Ben's Reviewed-By (thanks!)
---
 drivers/gpu/drm/nouveau/nouveau_svm.c | 34 ---
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 94f841026c3b..a195e48c9aee 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -567,18 +567,27 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
unsigned long hmm_pfns[1];
struct hmm_range range = {
.notifier = &notifier->notifier,
-   .start = notifier->notifier.interval_tree.start,
-   .end = notifier->notifier.interval_tree.last + 1,
.default_flags = hmm_flags,
.hmm_pfns = hmm_pfns,
.dev_private_owner = drm->dev,
};
-   struct mm_struct *mm = notifier->notifier.mm;
+   struct mm_struct *mm = svmm->notifier.mm;
int ret;
 
+   ret = mmu_interval_notifier_insert(&notifier->notifier, mm,
+   args->p.addr, args->p.size,
+   &nouveau_svm_mni_ops);
+   if (ret)
+   return ret;
+
+   range.start = notifier->notifier.interval_tree.start;
+   range.end = notifier->notifier.interval_tree.last + 1;
+
while (true) {
-   if (time_after(jiffies, timeout))
-   return -EBUSY;
+   if (time_after(jiffies, timeout)) {
+   ret = -EBUSY;
+   goto out;
+   }
 
range.notifier_seq = mmu_interval_read_begin(range.notifier);
mmap_read_lock(mm);
@@ -587,7 +596,7 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
if (ret) {
if (ret == -EBUSY)
continue;
-   return ret;
+   goto out;
}
 
mutex_lock(&svmm->mutex);
@@ -606,6 +615,9 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
svmm->vmm->vmm.object.client->super = false;
mutex_unlock(&svmm->mutex);
 
+out:
+   mmu_interval_notifier_remove(&notifier->notifier);
+
return ret;
 }
 
@@ -727,14 +739,8 @@ nouveau_svm_fault(struct nvif_notify *notify)
}
 
notifier.svmm = svmm;
-   ret = mmu_interval_notifier_insert(&notifier, mm,
-  args.i.p.addr, args.i.p.size,
-  &nouveau_svm_mni_ops);
-   if (!ret) {
-   ret = nouveau_range_fault(svmm, svm->drm, &args,
-   sizeof(args), hmm_flags, &notifier);
-   mmu_interval_notifier_remove(&notifier);
-   }
+   ret = nouveau_range_fault(svmm, svm->drm, &args,
+   sizeof(args), hmm_flags, &notifier);
mmput(mm);
 
limit = args.i.p.addr + args.i.p.size;
-- 
2.20.1

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


[Nouveau] [PATCH v11 08/10] mm: Selftests for exclusive device memory

2021-06-16 Thread Alistair Popple
Adds some selftests for exclusive device memory.

Signed-off-by: Alistair Popple 
Acked-by: Jason Gunthorpe 
Tested-by: Ralph Campbell 
Reviewed-by: Ralph Campbell 

---

v10:

* Squashed a fix from Colin King. Not sure what the right attribution
tag for that would be though?
---
 lib/test_hmm.c | 125 +++
 lib/test_hmm_uapi.h|   2 +
 tools/testing/selftests/vm/hmm-tests.c | 158 +
 3 files changed, 285 insertions(+)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index fc7a20bc9b42..8c55c4723692 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "test_hmm_uapi.h"
 
@@ -46,6 +47,7 @@ struct dmirror_bounce {
unsigned long   cpages;
 };
 
+#define DPT_XA_TAG_ATOMIC 1UL
 #define DPT_XA_TAG_WRITE 3UL
 
 /*
@@ -619,6 +621,54 @@ static void dmirror_migrate_alloc_and_copy(struct 
migrate_vma *args,
}
 }
 
+static int dmirror_check_atomic(struct dmirror *dmirror, unsigned long start,
+unsigned long end)
+{
+   unsigned long pfn;
+
+   for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++) {
+   void *entry;
+   struct page *page;
+
+   entry = xa_load(&dmirror->pt, pfn);
+   page = xa_untag_pointer(entry);
+   if (xa_pointer_tag(entry) == DPT_XA_TAG_ATOMIC)
+   return -EPERM;
+   }
+
+   return 0;
+}
+
+static int dmirror_atomic_map(unsigned long start, unsigned long end,
+ struct page **pages, struct dmirror *dmirror)
+{
+   unsigned long pfn, mapped = 0;
+   int i;
+
+   /* Map the migrated pages into the device's page tables. */
+   mutex_lock(&dmirror->mutex);
+
+   for (i = 0, pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); 
pfn++, i++) {
+   void *entry;
+
+   if (!pages[i])
+   continue;
+
+   entry = pages[i];
+   entry = xa_tag_pointer(entry, DPT_XA_TAG_ATOMIC);
+   entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
+   if (xa_is_err(entry)) {
+   mutex_unlock(&dmirror->mutex);
+   return xa_err(entry);
+   }
+
+   mapped++;
+   }
+
+   mutex_unlock(&dmirror->mutex);
+   return mapped;
+}
+
 static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
struct dmirror *dmirror)
 {
@@ -661,6 +711,72 @@ static int dmirror_migrate_finalize_and_map(struct 
migrate_vma *args,
return 0;
 }
 
+static int dmirror_exclusive(struct dmirror *dmirror,
+struct hmm_dmirror_cmd *cmd)
+{
+   unsigned long start, end, addr;
+   unsigned long size = cmd->npages << PAGE_SHIFT;
+   struct mm_struct *mm = dmirror->notifier.mm;
+   struct page *pages[64];
+   struct dmirror_bounce bounce;
+   unsigned long next;
+   int ret;
+
+   start = cmd->addr;
+   end = start + size;
+   if (end < start)
+   return -EINVAL;
+
+   /* Since the mm is for the mirrored process, get a reference first. */
+   if (!mmget_not_zero(mm))
+   return -EINVAL;
+
+   mmap_read_lock(mm);
+   for (addr = start; addr < end; addr = next) {
+   unsigned long mapped;
+   int i;
+
+   if (end < addr + (ARRAY_SIZE(pages) << PAGE_SHIFT))
+   next = end;
+   else
+   next = addr + (ARRAY_SIZE(pages) << PAGE_SHIFT);
+
+   ret = make_device_exclusive_range(mm, addr, next, pages, NULL);
+   mapped = dmirror_atomic_map(addr, next, pages, dmirror);
+   for (i = 0; i < ret; i++) {
+   if (pages[i]) {
+   unlock_page(pages[i]);
+   put_page(pages[i]);
+   }
+   }
+
+   if (addr + (mapped << PAGE_SHIFT) < next) {
+   mmap_read_unlock(mm);
+   mmput(mm);
+   return -EBUSY;
+   }
+   }
+   mmap_read_unlock(mm);
+   mmput(mm);
+
+   /* Return the migrated data for verification. */
+   ret = dmirror_bounce_init(&bounce, start, size);
+   if (ret)
+   return ret;
+   mutex_lock(&dmirror->mutex);
+   ret = dmirror_do_read(dmirror, start, end, &bounce);
+   mutex_unlock(&dmirror->mutex);
+   if (ret == 0) {
+   if (copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr,
+bounce.size))
+   ret = -EFAULT;
+   }
+
+   cmd->cpages = bounce.cpages;
+   dmirror_bounce_fini(&bounce);
+   return ret;
+}
+
 static int dmi

[Nouveau] [PATCH v11 07/10] mm: Device exclusive memory access

2021-06-16 Thread Alistair Popple
Some devices require exclusive write access to shared virtual
memory (SVM) ranges to perform atomic operations on that memory. This
requires CPU page tables to be updated to deny access whilst atomic
operations are occurring.

In order to do this introduce a new swap entry
type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for
exclusive access by a device all page table mappings for the particular
range are replaced with device exclusive swap entries. This causes any
CPU access to the page to result in a fault.

Faults are resolved by replacing the faulting entry with the original
mapping. This results in MMU notifiers being called which a driver uses
to update access permissions such as revoking atomic access. After
notifiers have been called the device will no longer have exclusive
access to the region.

Walking of the page tables to find the target pages is handled by
get_user_pages() rather than a direct page table walk. A direct page
table walk similar to what migrate_vma_collect()/unmap() does could also
have been utilised. However this resulted in more code similar in
functionality to what get_user_pages() provides as page faulting is
required to make the PTEs present and to break COW.
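
The CPU fault side then resolves the entry roughly as follows (a
simplified sketch of the do_swap_page() handling added by this patch;
helper names follow this series):

        entry = pte_to_swp_entry(vmf->orig_pte);
        if (is_device_exclusive_entry(entry)) {
                vmf->page = pfn_swap_entry_to_page(entry);
                /* Restore the original PTE. This fires MMU notifiers with
                 * MMU_NOTIFY_EXCLUSIVE so the driver can revoke the device's
                 * atomic mapping before the CPU is allowed to proceed. */
                ret = remove_device_exclusive_entry(vmf);
        }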

Signed-off-by: Alistair Popple 
Reviewed-by: Christoph Hellwig 

---

v10:
* Make device exclusive code conditional on CONFIG_DEVICE_PRIVATE.
* Updates to code comments and more minor code cleanups.

v9:
* Split rename of migrate_pgmap_owner into a separate patch.
* Added comments explaining SWP_DEVICE_EXCLUSIVE_* entries.
* Renamed try_to_protect{_one} to page_make_device_exclusive{_one} based
  somewhat on a suggestion from Peter Xu. I was never particularly happy
  with try_to_protect() as a name so think this is better.
* Removed unnecessary code and reworded some comments based on feedback
  from Peter Xu.
* Removed the VMA walk when restoring PTEs for device-exclusive entries.
* Simplified implementation of copy_pte_range() to fail if the page
  cannot be locked. This might lead to occasional fork() failures but at
  this stage we don't think that will be an issue.

v8:
* Remove device exclusive entries on fork rather than copy them.

v7:
* Added Christoph's Reviewed-by.
* Minor cosmetic cleanups suggested by Christoph.
* Replace mmu_notifier_range_init_migrate/exclusive with
  mmu_notifier_range_init_owner as suggested by Christoph.
* Replaced lock_page() with lock_page_retry() when handling faults.
* Restrict to anonymous pages for now.

v6:
* Fixed a bisectablity issue due to incorrectly applying the rename of
  migrate_pgmap_owner to the wrong patches for Nouveau and hmm_test.

v5:
* Renamed range->migrate_pgmap_owner to range->owner.
* Added MMU_NOTIFY_EXCLUSIVE to allow passing of a driver cookie which
  allows notifiers called as a result of make_device_exclusive_range() to
  be ignored.
* Added a check to try_to_protect_one() to detect if the pages originally
  returned from get_user_pages() have been unmapped or not.
* Removed check_device_exclusive_range() as it is no longer required with
  the other changes.
* Documentation update.

v4:
* Add function to check that mappings are still valid and exclusive.
* s/long/unsigned long/ in make_device_exclusive_entry().
---
 Documentation/vm/hmm.rst |  17 
 include/linux/mmu_notifier.h |   6 ++
 include/linux/rmap.h |   4 +
 include/linux/swap.h |   9 +-
 include/linux/swapops.h  |  44 -
 mm/hmm.c |   5 +
 mm/memory.c  | 127 +++-
 mm/mprotect.c|   8 ++
 mm/page_vma_mapped.c |   9 +-
 mm/rmap.c| 183 +++
 10 files changed, 402 insertions(+), 10 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 3df79307a797..a14c2938e7af 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -405,6 +405,23 @@ between device driver specific code and shared common code:
 
The lock can now be released.
 
+Exclusive access memory
+===
+
+Some devices have features such as atomic PTE bits that can be used to 
implement
+atomic access to system memory. To support atomic operations to a shared 
virtual
+memory page such a device needs access to that page which is exclusive of any
+userspace access from the CPU. The ``make_device_exclusive_range()`` function
+can be used to make a memory range inaccessible from userspace.
+
+This replaces all mappings for pages in the given range with special swap
+entries. Any attempt to access the swap entry results in a fault which is
+resolved by replacing the entry with the original mapping. A driver gets
+notified that the mapping has been changed by MMU notifiers, after which point
+it will no longer have exclusive access to the page. Exclusive access is
+guaranteed to last until the driver drops the page lock and page reference, at
+which point any CPU faults on the page may p

[Nouveau] [PATCH v11 06/10] mm/memory.c: Allow different return codes for copy_nonpresent_pte()

2021-06-16 Thread Alistair Popple
Currently if copy_nonpresent_pte() returns a non-zero value it is
assumed to be a swap entry which requires further processing outside the
loop in copy_pte_range() after dropping locks. This prevents other
values being returned to signal conditions such as failure which a
subsequent change requires.

Instead make copy_nonpresent_pte() return an error code if further
processing is required and read the value for the swap entry in the main
loop under the ptl.

Signed-off-by: Alistair Popple 
Reviewed-by: Peter Xu 

---

v11:

Rebase on mmots

v10:

Use a unique error code and only check return codes for handling.

v9:

New for v9 to allow device exclusive handling to occur in
copy_nonpresent_pte().
---
 mm/memory.c | 28 +---
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b4588402b777..1ac4d0b07c99 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -717,7 +717,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
 
if (likely(!non_swap_entry(entry))) {
if (swap_duplicate(entry) < 0)
-   return entry.val;
+   return -EIO;
 
/* make sure dst_mm is on swapoff's mmlist. */
if (unlikely(list_empty(&dst_mm->mmlist))) {
@@ -973,12 +973,14 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma,
continue;
}
if (unlikely(!pte_present(*src_pte))) {
-   entry.val = copy_nonpresent_pte(dst_mm, src_mm,
-   dst_pte, src_pte,
-   dst_vma, src_vma,
-   addr, rss);
-   if (entry.val)
+   ret = copy_nonpresent_pte(dst_mm, src_mm,
+ dst_pte, src_pte,
+ dst_vma, src_vma,
+ addr, rss);
+   if (ret == -EIO) {
+   entry = pte_to_swp_entry(*src_pte);
break;
+   }
progress += 8;
continue;
}
@@ -1011,20 +1013,24 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma,
pte_unmap_unlock(orig_dst_pte, dst_ptl);
cond_resched();
 
-   if (entry.val) {
+   if (ret == -EIO) {
+   VM_WARN_ON_ONCE(!entry.val);
if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
ret = -ENOMEM;
goto out;
}
entry.val = 0;
-   } else if (ret) {
-   WARN_ON_ONCE(ret != -EAGAIN);
+   } else if (ret ==  -EAGAIN) {
prealloc = page_copy_prealloc(src_mm, src_vma, addr);
if (!prealloc)
return -ENOMEM;
-   /* We've captured and resolved the error. Reset, try again. */
-   ret = 0;
+   } else if (ret) {
+   VM_WARN_ON_ONCE(1);
}
+
+   /* We've captured and resolved the error. Reset, try again. */
+   ret = 0;
+
if (addr != end)
goto again;
 out:
-- 
2.20.1

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


[Nouveau] [PATCH v11 05/10] mm: Rename migrate_pgmap_owner

2021-06-16 Thread Alistair Popple
MMU notifier ranges have a migrate_pgmap_owner field which is used by
drivers to store a pointer. This is subsequently used by the driver
callback to filter MMU_NOTIFY_MIGRATE events. Other notifier event types
can also benefit from this filtering, so rename the
'migrate_pgmap_owner' field to 'owner' and create a new notifier
initialisation function to initialise this field.
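
A driver is expected to pass itself as the owner when it triggers the
invalidation and then skip its own events in the callback, roughly like
this sketch (my_driver_dev is a placeholder for the driver's cookie):

        /* When starting the operation: */
        mmu_notifier_range_init_owner(&range, MMU_NOTIFY_MIGRATE, 0, vma, mm,
                                      start, end, my_driver_dev);

        /* In the driver's invalidate callback: */
        if (range->event == MMU_NOTIFY_MIGRATE &&
            range->owner == my_driver_dev)
                return true;    /* our own invalidation, nothing to do */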

Signed-off-by: Alistair Popple 
Suggested-by: Peter Xu 
Reviewed-by: Peter Xu 

---

v9:

Previously part of the next patch in the series ('mm: Device exclusive
memory access') but now split out as a separate change as suggested by
Peter Xu.
---
 Documentation/vm/hmm.rst  |  2 +-
 drivers/gpu/drm/nouveau/nouveau_svm.c |  2 +-
 include/linux/mmu_notifier.h  | 20 ++--
 lib/test_hmm.c|  2 +-
 mm/migrate.c  | 10 +-
 5 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 09e28507f5b2..3df79307a797 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -332,7 +332,7 @@ between device driver specific code and shared common code:
walks to fill in the ``args->src`` array with PFNs to be migrated.
The ``invalidate_range_start()`` callback is passed a
``struct mmu_notifier_range`` with the ``event`` field set to
-   ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to
+   ``MMU_NOTIFY_MIGRATE`` and the ``owner`` field set to
the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This is
allows the device driver to skip the invalidation callback and only
invalidate device private MMU mappings that are actually migrating.
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index f18bd53da052..94f841026c3b 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -265,7 +265,7 @@ nouveau_svmm_invalidate_range_start(struct mmu_notifier *mn,
 * the invalidation is handled as part of the migration process.
 */
if (update->event == MMU_NOTIFY_MIGRATE &&
-   update->migrate_pgmap_owner == svmm->vmm->cli->drm->dev)
+   update->owner == svmm->vmm->cli->drm->dev)
goto out;
 
if (limit > svmm->unmanaged.start && start < svmm->unmanaged.limit) {
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index d514e5d06ebf..8eb965ef0028 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -41,7 +41,7 @@ struct mmu_interval_notifier;
  *
  * @MMU_NOTIFY_MIGRATE: used during migrate_vma_collect() invalidate to signal
  * a device driver to possibly ignore the invalidation if the
- * migrate_pgmap_owner field matches the driver's device private pgmap owner.
+ * owner field matches the driver's device private pgmap owner.
  */
 enum mmu_notifier_event {
MMU_NOTIFY_UNMAP = 0,
@@ -269,7 +269,7 @@ struct mmu_notifier_range {
unsigned long end;
unsigned flags;
enum mmu_notifier_event event;
-   void *migrate_pgmap_owner;
+   void *owner;
 };
 
 static inline int mm_has_notifiers(struct mm_struct *mm)
@@ -521,14 +521,14 @@ static inline void mmu_notifier_range_init(struct 
mmu_notifier_range *range,
range->flags = flags;
 }
 
-static inline void mmu_notifier_range_init_migrate(
-   struct mmu_notifier_range *range, unsigned int flags,
+static inline void mmu_notifier_range_init_owner(
+   struct mmu_notifier_range *range,
+   enum mmu_notifier_event event, unsigned int flags,
struct vm_area_struct *vma, struct mm_struct *mm,
-   unsigned long start, unsigned long end, void *pgmap)
+   unsigned long start, unsigned long end, void *owner)
 {
-   mmu_notifier_range_init(range, MMU_NOTIFY_MIGRATE, flags, vma, mm,
-   start, end);
-   range->migrate_pgmap_owner = pgmap;
+   mmu_notifier_range_init(range, event, flags, vma, mm, start, end);
+   range->owner = owner;
 }
 
 #define ptep_clear_flush_young_notify(__vma, __address, __ptep)
\
@@ -655,8 +655,8 @@ static inline void _mmu_notifier_range_init(struct 
mmu_notifier_range *range,
 
 #define mmu_notifier_range_init(range,event,flags,vma,mm,start,end)  \
_mmu_notifier_range_init(range, start, end)
-#define mmu_notifier_range_init_migrate(range, flags, vma, mm, start, end, \
-   pgmap) \
+#define mmu_notifier_range_init_owner(range, event, flags, vma, mm, start, \
+   end, owner) \
_mmu_notifier_range_init(range, start, end)
 
 static inline bool
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 15f2e2db77bc..fc7a20bc9b42 1006

[Nouveau] [PATCH v11 04/10] mm/rmap: Split migration into its own function

2021-06-16 Thread Alistair Popple
Migration is currently implemented as a mode of operation for
try_to_unmap_one() generally specified by passing the TTU_MIGRATION flag
or in the case of splitting a huge anonymous page TTU_SPLIT_FREEZE.

However it does not have much in common with the rest of the unmap
functionality of try_to_unmap_one() and thus splitting it into a
separate function reduces the complexity of try_to_unmap_one() making it
more readable.

Several simplifications can also be made in try_to_migrate_one() based
on the following observations:

 - All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK.
 - No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON.
 - No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH.

TTU_SPLIT_FREEZE is a special case of migration used when splitting an
anonymous page. This is most easily dealt with by calling the correct
function from unmap_page() in mm/huge_memory.c  - either
try_to_migrate() for PageAnon or try_to_unmap().
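
Callers that previously selected migration with a flag now call the new
helper directly, for example (taken from the __unmap_and_move() hunk
below):

        /* Before */
        try_to_unmap(page, TTU_MIGRATION | TTU_IGNORE_MLOCK);
        /* After */
        try_to_migrate(page, 0);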

Signed-off-by: Alistair Popple 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Ralph Campbell 

---

v5:
* Added comments about how PMD splitting works for migration vs.
  unmapping
* Tightened up the flag check in try_to_migrate() to be explicit about
  which TTU_XXX flags are supported.
---
 include/linux/rmap.h |   4 +-
 mm/huge_memory.c |  16 +-
 mm/migrate.c |   9 +-
 mm/rmap.c| 367 ---
 4 files changed, 289 insertions(+), 107 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 69190efbd842..b0ea9d98302f 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -86,8 +86,6 @@ struct anon_vma_chain {
 };
 
 enum ttu_flags {
-   TTU_MIGRATION   = 0x1,  /* migration mode */
-
TTU_SPLIT_HUGE_PMD  = 0x4,  /* split huge PMD if any */
TTU_IGNORE_MLOCK= 0x8,  /* ignore mlock */
TTU_SYNC= 0x10, /* avoid racy checks with PVMW_SYNC */
@@ -97,7 +95,6 @@ enum ttu_flags {
 * do a final flush if necessary */
TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock:
 * caller holds it */
-   TTU_SPLIT_FREEZE= 0x100,/* freeze pte under 
splitting thp */
 };
 
 #ifdef CONFIG_MMU
@@ -194,6 +191,7 @@ static inline void page_dup_rmap(struct page *page, bool 
compound)
 int page_referenced(struct page *, int is_locked,
struct mem_cgroup *memcg, unsigned long *vm_flags);
 
+void try_to_migrate(struct page *page, enum ttu_flags flags);
 void try_to_unmap(struct page *, enum ttu_flags flags);
 
 /* Avoid racy checks */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1c81aa11d618..8b731d53e9f4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2309,16 +2309,20 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
 
 static void unmap_page(struct page *page)
 {
-   enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_SYNC |
-   TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
+   enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
+   TTU_SYNC;
 
VM_BUG_ON_PAGE(!PageHead(page), page);
 
-   /* If TTU_SPLIT_FREEZE is ever extended to file, update remap_page() */
+   /*
+* Anon pages need migration entries to preserve them, but file
+* pages can simply be left unmapped, then faulted back on demand.
+* If that is ever changed (perhaps for mlock), update remap_page().
+*/
if (PageAnon(page))
-   ttu_flags |= TTU_SPLIT_FREEZE;
-
-   try_to_unmap(page, ttu_flags);
+   try_to_migrate(page, ttu_flags);
+   else
+   try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK);
 
VM_WARN_ON_ONCE_PAGE(page_mapped(page), page);
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index de47bdfbd2f8..a69f54bd92b9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1109,7 +1109,7 @@ static int __unmap_and_move(struct page *page, struct 
page *newpage,
/* Establish migration ptes */
VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
page);
-   try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK);
+   try_to_migrate(page, 0);
page_was_mapped = 1;
}
 
@@ -1311,7 +1311,7 @@ static int unmap_and_move_huge_page(new_page_t 
get_new_page,
 
if (page_mapped(hpage)) {
bool mapping_locked = false;
-   enum ttu_flags ttu = TTU_MIGRATION|TTU_IGNORE_MLOCK;
+   enum ttu_flags ttu = 0;
 
if (!PageAnon(hpage)) {
/*
@@ -1328,7 +1328,7 @@ static int unmap_and_move_huge_page(new_page_t 
get_new_page,
ttu |= TTU_RMAP_LOCKED;
}
 
-   try_to_unmap(hpage, ttu);
+   try_to_migrate(hpage, ttu);
pag

[Nouveau] [PATCH v11 02/10] mm/swapops: Rework swap entry manipulation code

2021-06-16 Thread Alistair Popple
Both migration and device private pages use special swap entries that
are manipulated by a range of inline functions. The arguments to these
are somewhat inconsistent so rework them to remove flag type arguments
and to make the arguments similar for both read and write entry
creation.
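
For example, creating a migration entry changes from a boolean flag to
an explicit helper (a sketch of the typical conversion made by this
patch; pteval is a placeholder):

        /* Before */
        entry = make_migration_entry(page, pte_write(pteval));
        /* After */
        if (pte_write(pteval))
                entry = make_writable_migration_entry(page_to_pfn(page));
        else
                entry = make_readable_migration_entry(page_to_pfn(page));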

Signed-off-by: Alistair Popple 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ralph Campbell 
---
 include/linux/swapops.h | 56 ++---
 mm/debug_vm_pgtable.c   | 12 -
 mm/hmm.c|  2 +-
 mm/huge_memory.c| 26 +--
 mm/hugetlb.c| 10 +---
 mm/memory.c | 10 +---
 mm/migrate.c| 26 ++-
 mm/mprotect.c   | 10 +---
 mm/rmap.c   | 10 +---
 9 files changed, 100 insertions(+), 62 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index c24c79812bc1..04d76357aa0c 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -107,35 +107,35 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
 }
 
 #if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
-static inline swp_entry_t make_device_private_entry(struct page *page, bool 
write)
+static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
-   return swp_entry(write ? SWP_DEVICE_WRITE : SWP_DEVICE_READ,
-page_to_pfn(page));
+   return swp_entry(SWP_DEVICE_READ, offset);
 }
 
-static inline bool is_device_private_entry(swp_entry_t entry)
+static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset)
 {
-   int type = swp_type(entry);
-   return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
+   return swp_entry(SWP_DEVICE_WRITE, offset);
 }
 
-static inline void make_device_private_entry_read(swp_entry_t *entry)
+static inline bool is_device_private_entry(swp_entry_t entry)
 {
-   *entry = swp_entry(SWP_DEVICE_READ, swp_offset(*entry));
+   int type = swp_type(entry);
+   return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
 }
 
-static inline bool is_write_device_private_entry(swp_entry_t entry)
+static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
 }
 #else /* CONFIG_DEVICE_PRIVATE */
-static inline swp_entry_t make_device_private_entry(struct page *page, bool 
write)
+static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
return swp_entry(0, 0);
 }
 
-static inline void make_device_private_entry_read(swp_entry_t *entry)
+static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset)
 {
+   return swp_entry(0, 0);
 }
 
 static inline bool is_device_private_entry(swp_entry_t entry)
@@ -143,35 +143,32 @@ static inline bool is_device_private_entry(swp_entry_t 
entry)
return false;
 }
 
-static inline bool is_write_device_private_entry(swp_entry_t entry)
+static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
return false;
 }
 #endif /* CONFIG_DEVICE_PRIVATE */
 
 #ifdef CONFIG_MIGRATION
-static inline swp_entry_t make_migration_entry(struct page *page, int write)
-{
-   BUG_ON(!PageLocked(compound_head(page)));
-
-   return swp_entry(write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ,
-   page_to_pfn(page));
-}
-
 static inline int is_migration_entry(swp_entry_t entry)
 {
return unlikely(swp_type(entry) == SWP_MIGRATION_READ ||
swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
-static inline int is_write_migration_entry(swp_entry_t entry)
+static inline int is_writable_migration_entry(swp_entry_t entry)
 {
return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
-static inline void make_migration_entry_read(swp_entry_t *entry)
+static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
 {
-   *entry = swp_entry(SWP_MIGRATION_READ, swp_offset(*entry));
+   return swp_entry(SWP_MIGRATION_READ, offset);
+}
+
+static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
+{
+   return swp_entry(SWP_MIGRATION_WRITE, offset);
 }
 
 extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
@@ -181,21 +178,28 @@ extern void migration_entry_wait(struct mm_struct *mm, 
pmd_t *pmd,
 extern void migration_entry_wait_huge(struct vm_area_struct *vma,
struct mm_struct *mm, pte_t *pte);
 #else
+static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
+{
+   return swp_entry(0, 0);
+}
+
+static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
+{
+   return swp_entry(0, 0);
+}
 
-#define make_migration_entry(page, write) swp_entry(0, 0)
 static inline int is_migration_entry(swp_entry_t swp)
 {
return 0;
 }
 
-static inline void make_migration_entry_read(swp_entry_t *entryp) { }
 static inline void __migration_entry_wait(struct mm_struct *mm

[Nouveau] [PATCH v11 03/10] mm/rmap: Split try_to_munlock from try_to_unmap

2021-06-16 Thread Alistair Popple
The behaviour of try_to_unmap_one() is difficult to follow because it
performs different operations based on a fairly large set of flags used
in different combinations.

TTU_MUNLOCK is one such flag. However it is exclusively used by
try_to_munlock() which specifies no other flags. Therefore rather than
overload try_to_unmap_one() with unrelated behaviour split this out into
its own function and remove the flag.
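
The caller-visible change is a straight rename (sketch only):

        /* Before: a mode of the unmap walk selected by TTU_MUNLOCK */
        try_to_munlock(page);
        /* After: a dedicated reverse map walk */
        page_mlock(page);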

Signed-off-by: Alistair Popple 
Reviewed-by: Ralph Campbell 
Reviewed-by: Christoph Hellwig 

---

v11:
* Rebased on Hugh's series

v10:
* More comment fixes
* Restored the check of VM_LOCKED under the ptl. This closes a race in
  the unmap path.

v9:
* Improved comments

v8:
* Renamed try_to_munlock to page_mlock to better reflect what the
  function actually does.
* Removed the TODO from the documentation that this patch addresses.

v7:
* Added Christoph's Reviewed-by

v4:
* Removed redundant check for VM_LOCKED
---
 Documentation/vm/unevictable-lru.rst | 33 ++
 include/linux/rmap.h |  3 +-
 mm/mlock.c   | 12 ++---
 mm/rmap.c| 66 +---
 4 files changed, 69 insertions(+), 45 deletions(-)

diff --git a/Documentation/vm/unevictable-lru.rst 
b/Documentation/vm/unevictable-lru.rst
index 0e1490524f53..eae3af17f2d9 100644
--- a/Documentation/vm/unevictable-lru.rst
+++ b/Documentation/vm/unevictable-lru.rst
@@ -389,14 +389,14 @@ mlocked, munlock_vma_page() updates that zone statistics 
for the number of
 mlocked pages.  Note, however, that at this point we haven't checked whether
 the page is mapped by other VM_LOCKED VMAs.
 
-We can't call try_to_munlock(), the function that walks the reverse map to
+We can't call page_mlock(), the function that walks the reverse map to
 check for other VM_LOCKED VMAs, without first isolating the page from the LRU.
-try_to_munlock() is a variant of try_to_unmap() and thus requires that the page
+page_mlock() is a variant of try_to_unmap() and thus requires that the page
 not be on an LRU list [more on these below].  However, the call to
-isolate_lru_page() could fail, in which case we couldn't try_to_munlock().  So,
+isolate_lru_page() could fail, in which case we can't call page_mlock().  So,
 we go ahead and clear PG_mlocked up front, as this might be the only chance we
-have.  If we can successfully isolate the page, we go ahead and
-try_to_munlock(), which will restore the PG_mlocked flag and update the zone
+have.  If we can successfully isolate the page, we go ahead and call
+page_mlock(), which will restore the PG_mlocked flag and update the zone
 page statistics if it finds another VMA holding the page mlocked.  If we fail
 to isolate the page, we'll have left a potentially mlocked page on the LRU.
 This is fine, because we'll catch it later if and if vmscan tries to reclaim
@@ -545,31 +545,24 @@ munlock or munmap system calls, mm teardown 
(munlock_vma_pages_all), reclaim,
 holepunching, and truncation of file pages and their anonymous COWed pages.
 
 
-try_to_munlock() Reverse Map Scan
+page_mlock() Reverse Map Scan
 -
 
-.. warning::
-   [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
-   page_referenced() reverse map walker.
-
 When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call
 Handling ` above] tries to munlock a
 page, it needs to determine whether or not the page is mapped by any
 VM_LOCKED VMA without actually attempting to unmap all PTEs from the
 page.  For this purpose, the unevictable/mlock infrastructure
-introduced a variant of try_to_unmap() called try_to_munlock().
+introduced a variant of try_to_unmap() called page_mlock().
 
-try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
-mapped file and KSM pages with a flag argument specifying unlock versus unmap
-processing.  Again, these functions walk the respective reverse maps looking
-for VM_LOCKED VMAs.  When such a VMA is found, as in the try_to_unmap() case,
-the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK.  This
-undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page.
+page_mlock() walks the respective reverse maps looking for VM_LOCKED VMAs. When
+such a VMA is found the page is mlocked via mlock_vma_page(). This undoes the
+pre-clearing of the page's PG_mlocked done by munlock_vma_page.
 
-Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
+Note that page_mlock()'s reverse map walk must visit every VMA in a page's
 reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
 However, the scan can terminate when it encounters a VM_LOCKED VMA.
-Although try_to_munlock() might be called a great many times when munlocking a
+Although page_mlock() might be called a great many times when munlocking a
 large region or tearing down a large address space that has been mlocked via
 mlockall(), overall

[Nouveau] [PATCH v11 01/10] mm: Remove special swap entry functions

2021-06-16 Thread Alistair Popple
Remove multiple similar inline functions for dealing with different
types of special swap entries.

Both migration and device private swap entries use the swap offset to
store a pfn. Instead of multiple inline functions to obtain a struct
page for each swap entry type use a common function
pfn_swap_entry_to_page(). Also open-code the various entry_to_pfn()
functions as this results in shorter code that is easier to understand.
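
For example, callers no longer need to distinguish the entry type
(pattern taken from the fs/proc/task_mmu.c conversion below):

        /* Before */
        if (is_migration_entry(swpent))
                page = migration_entry_to_page(swpent);
        else if (is_device_private_entry(swpent))
                page = device_private_entry_to_page(swpent);
        /* After */
        if (is_pfn_swap_entry(swpent))
                page = pfn_swap_entry_to_page(swpent);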

Signed-off-by: Alistair Popple 
Reviewed-by: Ralph Campbell 
Reviewed-by: Christoph Hellwig 

---

v11:
* Rebased on mmotm

v9:
* Rebased on v5.13-rc2

v8:
* No changes

v7:
* Reworded commit message to include pfn_swap_entry_to_page()
* Added Christoph's Reviewed-by

v6:
* Removed redundant compound_page() call from inside PageLocked()
* Fixed a minor build issue for s390 reported by kernel test bot

v4:
* Added pfn_swap_entry_to_page()
* Reinstated check that migration entries point to locked pages
* Removed #define swapcache_prepare which isn't needed for CONFIG_SWAP=0
  builds
---
 arch/s390/mm/pgtable.c  |  2 +-
 fs/proc/task_mmu.c  | 23 +-
 include/linux/swap.h|  4 +--
 include/linux/swapops.h | 69 ++---
 mm/hmm.c|  5 ++-
 mm/huge_memory.c|  6 ++--
 mm/memcontrol.c |  2 +-
 mm/memory.c | 10 +++---
 mm/migrate.c|  6 ++--
 mm/page_vma_mapped.c|  6 ++--
 10 files changed, 51 insertions(+), 82 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 18205f851c24..eec3a9d7176e 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -691,7 +691,7 @@ static void ptep_zap_swap_entry(struct mm_struct *mm, 
swp_entry_t entry)
if (!non_swap_entry(entry))
dec_mm_counter(mm, MM_SWAPENTS);
else if (is_migration_entry(entry)) {
-   struct page *page = migration_entry_to_page(entry);
+   struct page *page = pfn_swap_entry_to_page(entry);
 
dec_mm_counter(mm, mm_counter(page));
}
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 95c8f1e8fea6..eb97468dfe4c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -514,10 +514,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
} else {
mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
}
-   } else if (is_migration_entry(swpent))
-   page = migration_entry_to_page(swpent);
-   else if (is_device_private_entry(swpent))
-   page = device_private_entry_to_page(swpent);
+   } else if (is_pfn_swap_entry(swpent))
+   page = pfn_swap_entry_to_page(swpent);
} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
&& pte_none(*pte))) {
page = xa_load(&vma->vm_file->f_mapping->i_pages,
@@ -549,7 +547,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
swp_entry_t entry = pmd_to_swp_entry(*pmd);
 
if (is_migration_entry(entry))
-   page = migration_entry_to_page(entry);
+   page = pfn_swap_entry_to_page(entry);
}
if (IS_ERR_OR_NULL(page))
return;
@@ -694,10 +692,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long 
hmask,
} else if (is_swap_pte(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);
 
-   if (is_migration_entry(swpent))
-   page = migration_entry_to_page(swpent);
-   else if (is_device_private_entry(swpent))
-   page = device_private_entry_to_page(swpent);
+   if (is_pfn_swap_entry(swpent))
+   page = pfn_swap_entry_to_page(swpent);
}
if (page) {
int mapcount = page_mapcount(page);
@@ -1389,11 +1385,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct 
pagemapread *pm,
frame = swp_type(entry) |
(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
flags |= PM_SWAP;
-   if (is_migration_entry(entry))
-   page = migration_entry_to_page(entry);
-
-   if (is_device_private_entry(entry))
-   page = device_private_entry_to_page(entry);
+   if (is_pfn_swap_entry(entry))
+   page = pfn_swap_entry_to_page(entry);
}
 
if (page && !PageAnon(page))
@@ -1454,7 +1447,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long 
addr, unsigned long end,
if (pmd_swp_uffd_wp(pmd))
flags |= PM_UFFD_WP;
VM_BUG_ON(!is_pmd_migration_entry(pmd));
-

[Nouveau] [PATCH v11 00/10] Add support for SVM atomics in Nouveau

2021-06-16 Thread Alistair Popple
Hi Andrew,

This is my series to add support for SVM atomics in Nouveau rebased onto
v5.13-rc6-mmots-2021-06-15-20-28. Functionally there are no changes from
the previous v10, however some changes were made during the rebase. If
anyone would like to see a diff-of-diffs I can post it but my exhaustive
(assuming I didn't miss anything in the diff-of-diff) summary below is
probably easier to read.

Hugh - based on our previous discussion I'm reasonably confident I haven't
missed anything from you in the rebase, but patch 4 ("Split migration into
its own function") might be worth looking at as that contained the largest
conflicts. Thanks.

[PATCH 01/10] mm: Remove special swap entry functions

- Fixed migration_entry_to_page() reference in __split_huge_pmd_locked()
  introduced by Hugh's "mm/thp: fix __split_huge_pmd_locked() on shmem
  migration entry".

- Fixed migration_entry_to_page() reference in page_vma_mapped_walk()
  changed by Hugh in "mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION
  block"

[PATCH 02/10] mm/swapops: Rework swap entry manipulation code

- No changes required.

[PATCH 03/10] mm/rmap: Split try_to_munlock from try_to_unmap

- No changes required.

[PATCH 04/10] mm/rmap: Split migration into its own function

- Updated try_to_migrate_one() and try_to_migrate() to accept TTU_SYNC
  flag.

- Updated mmu notifier range calculation in try_to_migrate_one() to use
  vma_address_end() introduced by Hugh in "mm/thp: fix vma_address() if
  virtual address below file offset".

- Updated unmap_page() in mm/huge_memory.c to pass TTU_SYNC and check
  page_mapped() now that try_to_{unmap|migrate} return void.

[PATCH 05/10] mm: Rename migrate_pgmap_owner

- Added Peter's Reviewed-by.

[PATCH 06/10] mm/memory.c: Allow different return codes for 
copy_nonpresent_pte()

- Updated call to copy_nonpresent_pte() for extra arguments added by Peter
  in "mm/userfaultfd: fix uffd-wp special cases for fork()".

- Added Peter's Reviewed-by.

[PATCH 07/10] mm: Device exclusive memory access

- s/vma/src_vma/ in copy_nonpresent_pte() due to "mm/userfaultfd: fix
  uffd-wp special cases for fork()".

- Skip calling rmap_walk on tail pages.

[PATCH 08/10] mm: Selftests for exclusive device memory

- Squashed in a fix from Colin King (see
  
https://lore.kernel.org/kernel-janitors/20210526170530.3766167-1-colin.k...@canonical.com/).
  Not sure what the best way of attributing that is though given it was
  against next.

[PATCH 09/10] nouveau/svm: Refactor nouveau_range_fault

- No changes.

[PATCH 10/10] nouveau/svm: Implement atomic SVM access

- No changes.

Alistair Popple (10):
  mm: Remove special swap entry functions
  mm/swapops: Rework swap entry manipulation code
  mm/rmap: Split try_to_munlock from try_to_unmap
  mm/rmap: Split migration into its own function
  mm: Rename migrate_pgmap_owner
  mm/memory.c: Allow different return codes for copy_nonpresent_pte()
  mm: Device exclusive memory access
  mm: Selftests for exclusive device memory
  nouveau/svm: Refactor nouveau_range_fault
  nouveau/svm: Implement atomic SVM access

 Documentation/vm/hmm.rst  |  19 +-
 Documentation/vm/unevictable-lru.rst  |  33 +-
 arch/s390/mm/pgtable.c|   2 +-
 drivers/gpu/drm/nouveau/include/nvif/if000c.h |   1 +
 drivers/gpu/drm/nouveau/nouveau_svm.c | 156 -
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h |   1 +
 .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c|   6 +
 fs/proc/task_mmu.c|  23 +-
 include/linux/mmu_notifier.h  |  26 +-
 include/linux/rmap.h  |  11 +-
 include/linux/swap.h  |  13 +-
 include/linux/swapops.h   | 123 ++--
 lib/test_hmm.c| 127 +++-
 lib/test_hmm_uapi.h   |   2 +
 mm/debug_vm_pgtable.c |  12 +-
 mm/hmm.c  |  12 +-
 mm/huge_memory.c  |  48 +-
 mm/hugetlb.c  |  10 +-
 mm/memcontrol.c   |   2 +-
 mm/memory.c   | 175 -
 mm/migrate.c  |  51 +-
 mm/mlock.c|  12 +-
 mm/mprotect.c |  18 +-
 mm/page_vma_mapped.c  |  15 +-
 mm/rmap.c | 612 +++---
 tools/testing/selftests/vm/hmm-tests.c| 158 +
 26 files changed, 1341 insertions(+), 327 deletions(-)

-- 
2.20.1



Re: [Nouveau] [PATCH v10 07/10] mm: Device exclusive memory access

2021-06-15 Thread Alistair Popple
On Wednesday, 16 June 2021 2:25:09 AM AEST Peter Xu wrote:
> On Tue, Jun 15, 2021 at 01:08:11PM +1000, Alistair Popple wrote:
> > On Saturday, 12 June 2021 1:01:42 AM AEST Peter Xu wrote:
> > > On Fri, Jun 11, 2021 at 01:43:20PM +1000, Alistair Popple wrote:

[...]

> > > Do you think we can restore pte right before wr-protect or zap?  Then all
> > > things serializes with page lock (btw: it's already an insane userspace to
> > > either unmap a page or wr-protect a page if it knows the device is using 
> > > it!).
> > > If these are the only two cases, it still sounds a cleaner approach to me 
> > > than
> > > the current approach.
> >
> > Perhaps we could but it would make {zap|change}_pte_range() much more 
> > complex as
> > we can't sleep taking the page lock whilst holding the ptl, so we'd have to
> > implement a retry scheme similar to copy_pte_range() in both those 
> > functions as
> > well.
> 
> Yes, but shouldn't be hard to do so, imho. E.g., see when __tlb_remove_page()
> returns true in zap_pte_range(), so we already did something like that.  IMHO
> it's not uncommon to have such facilities as we do have requirements to sleep
> during a spinlock critical section for a lot of places in mm, so we release
> them when needed and retake.

Agreed, it's not hard to do and it's a common enough pattern. However we decided
that for such a specific application this (trying to take the lock or drop locks
and retry) was too complex for copy_pte_range() so it seems like the same should
apply here.

Admittedly copy_pte_range() already had several other retry paths so perhaps
it was adding yet another that made it relatively more complex. Overall I have
been trying to minimise the impact on core mm code for this feature, and adding
this pattern to zap_pte_range(), etc. would make it more complex for any future
addition that requires locks to be dropped and retried so I guess in that sense
it is no different.

> > Given mmu_interval_read_begin/retry was IMHO added to solve this type of
> > problem (freezing pte's to safely program device pte's) it seems like the
> > better option rather than adding more complex code to generic mm paths.
> >
> > It's also worth noting i915 seems to use mmu_interval_read_begin/retry() 
> > with
> > gup to sync mappings so this isn't an entirely new concept. I'm not an 
> > expert
> > in that driver but I imagine changing gup to generate unconditional mmu 
> > notifier
> > invalidates would also cause issues there. So I think overall this is the
> > cleanest solution as it reduces the amount of code (particularly in generic 
> > mm
> > paths).
> 
> I could be wrong somewhere, but to me depending on mmu notifiers being
> "accurate" in general is fragile..
> 
> Take an example of change_pte_range(), which will generate PROTECTION_VMA
> notifies.  Let's imaging an userspace calls mprotect() e.g. twice or even more
> times with the same PROT_* and upon the same region, we know very possibly the
> 2nd,3rd,... calls will generate those notifies with totally no change to the
> pgtable at all as they're all done on the 1st shot.  However we'll generate 
> mmu
> notifies anyways for the 2nd,3rd,... calls.  It means mmu notifiers should
> really be tolerant of false positives as it does happen, and such thing can be
> triggered even from userspace system calls very easily like this.  That's why 
> I
> think any kernel facility that depends on mmu notifiers being accurate is
> probably not the right approach..

Argh, thanks. I was focused on the specifics of this series but I think I
understand your point better now - that as a more general principle we can't
assume notifiers are accurate.

> But yeah as you said I think it's working as is with the series (I think the
> follow_pmd_mask() checking pmd_trans_huge before calling split_huge_pmd is a
> double safety-net for it, so even if the GUP split_huge_pmd got replaced with
> __split_huge_pmd it should still work with the one-retry logic), not sure
> whether it matters a lot, as it's not common mm path; I think I'll step back 
> so
> Andrew could still pick it up as wish, I'm just still not fully convinced it's
> the best solution to have for a long term to depend on that..

Ok, thanks. I guess you have somewhat convinced me - depending on it for the
long term might be a bit fragile. However as you say the current implementation
does work and I am starting to look at support for PMD and file backed pages
which require changes here anyway. So I am hoping Andrew can still take this
(once rebased) as it would be easier for me to do those changes if the basic
support and clean ups were already in place.
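For readers following along, the mmu_interval_read_begin()/retry() pattern
referred to throughout this thread looks roughly like the sketch below. Only
the two notifier calls are the real API; the driver lock and the
device-update step are placeholders.

	unsigned long seq;

again:
	seq = mmu_interval_read_begin(&notifier);
	/* fault in / collect the CPU mappings, e.g. via hmm_range_fault() */
	mutex_lock(&driver_lock);
	if (mmu_interval_read_retry(&notifier, seq)) {
		/* an invalidation ran concurrently, drop the lock and retry */
		mutex_unlock(&driver_lock);
		goto again;
	}
	/* ptes are stable w.r.t. notifiers: program the device page tables */
	mutex_unlock(&driver_lock);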

> > > This also reminded me th

Re: [Nouveau] [PATCH v10 07/10] mm: Device exclusive memory access

2021-06-14 Thread Alistair Popple
On Saturday, 12 June 2021 1:01:42 AM AEST Peter Xu wrote:
> On Fri, Jun 11, 2021 at 01:43:20PM +1000, Alistair Popple wrote:
> > On Friday, 11 June 2021 11:00:34 AM AEST Peter Xu wrote:
> > > On Fri, Jun 11, 2021 at 09:17:14AM +1000, Alistair Popple wrote:
> > > > On Friday, 11 June 2021 9:04:19 AM AEST Peter Xu wrote:
> > > > > On Fri, Jun 11, 2021 at 12:21:26AM +1000, Alistair Popple wrote:
> > > > > > > Hmm, the thing is.. to me FOLL_SPLIT_PMD should have similar 
> > > > > > > effect to explicit
> > > > > > > call split_huge_pmd_address(), afaict.  Since both of them use 
> > > > > > > __split_huge_pmd()
> > > > > > > internally which will generate that unwanted CLEAR notify.
> > > > > >
> > > > > > Agree that gup calls __split_huge_pmd() via split_huge_pmd_address()
> > > > > > which will always CLEAR. However gup only calls 
> > > > > > split_huge_pmd_address() if it
> > > > > > finds a thp pmd. In follow_pmd_mask() we have:
> > > > > >
> > > > > >   if (likely(!pmd_trans_huge(pmdval)))
> > > > > >   return follow_page_pte(vma, address, pmd, flags, 
> > > > > > &ctx->pgmap);
> > > > > >
> > > > > > So I don't think we have a problem here.
> > > > >
> > > > > Sorry I didn't follow here..  We do FOLL_SPLIT_PMD after this check, 
> > > > > right?  I
> > > > > mean, if it's a thp for the current mm, afaict pmd_trans_huge() 
> > > > > should return
> > > > > true above, so we'll skip follow_page_pte(); then we'll check 
> > > > > FOLL_SPLIT_PMD
> > > > > and do the split, then the CLEAR notify.  Hmm.. Did I miss something?
> > > >
> > > > That seems correct - if the thp is not mapped with a pmd we won't split 
> > > > and we
> > > > won't CLEAR. If there is a thp pmd we will split and CLEAR, but in that 
> > > > case it
> > > > is fine - we will retry, but the retry won't CLEAR because the pmd 
> > > > has
> > > > already been split.
> > >
> > > Aha!
> > >
> > > >
> > > > The issue arises with doing it unconditionally in make device exclusive 
> > > > is that
> > > > you *always* CLEAR even if there is no thp pmd to split. Or at least 
> > > > that's my
> > > > understanding, please let me know if it doesn't make sense.
> > >
> > > Exactly.  But if you see what I meant here, even if it can work like 
> > > this, it
> > > sounds still fragile, isn't it?  I just feel something is slightly off 
> > > there..
> > >
> > > IMHO split_huge_pmd() checked pmd before calling __split_huge_pmd() for
> > > performance, afaict, because if it's not a thp even without locking, then 
> > > it
> > > won't be, so further __split_huge_pmd() is not necessary.
> > >
> > > IOW, it's very legal if someday we'd like to let split_huge_pmd() call
> > > __split_huge_pmd() directly, then AFAIU device exclusive API will be the 
> > > 1st
> > > one to be broken with that seems-to-be-irrelevant change I'm afraid..
> >
> > Well I would argue the performance of memory notifiers is becoming 
> > increasingly
> > important, and a change that causes them to be called unnecessarily is
> > therefore not very legal. Likely the correct fix here is to optimise
> > __split_huge_pmd() to only call the notifier if it's actually going to 
> > split a
> > pmd. As you said though that's a completely different story which I think 
> > would
> > be best done as a separate series.
> 
> Right, maybe I can look a bit more into that later; but my whole point was to
> express that one functionality shouldn't depend on such a trivial detail of
> implementation of other modules (thp split in this case).
> 
> >
> > > This lets me goes back a step to think about why do we need this notifier 
> > > at
> > > all to cover this whole range of make_device_exclusive() procedure..
> > >
> > > What I am thinking is, we're afraid some CPU accesses this page so the 
> > > pte got
> > > quickly restored when device atomic operation is carrying on.  Then with 
> > > this
> > > notifier we'll be able to cancel it.  Makes perfect sense.
> > >
> > > However do we really need to register this no

Re: [Nouveau] [PATCH v10 07/10] mm: Device exclusive memory access

2021-06-10 Thread Alistair Popple
On Friday, 11 June 2021 11:00:34 AM AEST Peter Xu wrote:
> On Fri, Jun 11, 2021 at 09:17:14AM +1000, Alistair Popple wrote:
> > On Friday, 11 June 2021 9:04:19 AM AEST Peter Xu wrote:
> > > On Fri, Jun 11, 2021 at 12:21:26AM +1000, Alistair Popple wrote:
> > > > > Hmm, the thing is.. to me FOLL_SPLIT_PMD should have similar effect 
> > > > > to explicit
> > > > > call split_huge_pmd_address(), afaict.  Since both of them use 
> > > > > __split_huge_pmd()
> > > > > internally which will generate that unwanted CLEAR notify.
> > > >
> > > > Agree that gup calls __split_huge_pmd() via split_huge_pmd_address()
> > > > which will always CLEAR. However gup only calls 
> > > > split_huge_pmd_address() if it
> > > > finds a thp pmd. In follow_pmd_mask() we have:
> > > >
> > > >   if (likely(!pmd_trans_huge(pmdval)))
> > > >   return follow_page_pte(vma, address, pmd, flags, 
> > > > > &ctx->pgmap);
> > > >
> > > > So I don't think we have a problem here.
> > >
> > > Sorry I didn't follow here..  We do FOLL_SPLIT_PMD after this check, 
> > > right?  I
> > > mean, if it's a thp for the current mm, afaict pmd_trans_huge() should 
> > > return
> > > true above, so we'll skip follow_page_pte(); then we'll check 
> > > FOLL_SPLIT_PMD
> > > and do the split, then the CLEAR notify.  Hmm.. Did I miss something?
> >
> > That seems correct - if the thp is not mapped with a pmd we won't split and 
> > we
> > won't CLEAR. If there is a thp pmd we will split and CLEAR, but in that 
> > case it
> > is fine - we will retry, but the retry won't CLEAR because the pmd has
> > already been split.
> 
> Aha!
> 
> >
> > The issue arises with doing it unconditionally in make device exclusive is 
> > that
> > you *always* CLEAR even if there is no thp pmd to split. Or at least that's 
> > my
> > understanding, please let me know if it doesn't make sense.
> 
> Exactly.  But if you see what I meant here, even if it can work like this, it
> sounds still fragile, isn't it?  I just feel something is slightly off there..
> 
> IMHO split_huge_pmd() checked pmd before calling __split_huge_pmd() for
> performance, afaict, because if it's not a thp even without locking, then it
> won't be, so further __split_huge_pmd() is not necessary.
> 
> IOW, it's very legal if someday we'd like to let split_huge_pmd() call
> __split_huge_pmd() directly, then AFAIU device exclusive API will be the 1st
> one to be broken with that seems-to-be-irrelevant change I'm afraid..

Well I would argue the performance of memory notifiers is becoming increasingly
important, and a change that causes them to be called unnecessarily is
therefore not very legal. Likely the correct fix here is to optimise
__split_huge_pmd() to only call the notifier if it's actually going to split a
pmd. As you said though that's a completely different story which I think would
be best done as a separate series.

> This lets me goes back a step to think about why do we need this notifier at
> all to cover this whole range of make_device_exclusive() procedure..
> 
> What I am thinking is, we're afraid some CPU accesses this page so the pte got
> quickly restored when device atomic operation is carrying on.  Then with this
> notifier we'll be able to cancel it.  Makes perfect sense.
> 
> However do we really need to register this notifier so early?  The thing is 
> the
> GPU driver still has all the page locks, so even if there's a race to restore
> the ptes, they'll block at taking the page lock until the driver releases it.
> 
> IOW, I'm wondering whether the "non-fragile" way to do this is not do
> mmu_interval_notifier_insert() that early: what if we register that notifier
> after make_device_exclusive_range() returns but before page_unlock() somehow?
> So before page_unlock(), race is protected fully by the lock itself; after
> that, it's done by mmu notifier.  Then maybe we don't need to worry about all
> these notifications during marking exclusive (while we shouldn't)?

The notifier is needed to protect against races with pte changes. Once a page
has been marked for exclusive access the driver will update its page tables to
allow atomic access to the page. However in the meantime the page could become
unmapped entirely or write protected.

As I understand things the page lock won't protect against these kind of pte
changes, hence the need for mmu_interval_read_begin/retry which allows the
driver to hold a mutex protecting against invalidations via blocking the
notifier until the device page tab

Re: [Nouveau] [PATCH v10 07/10] mm: Device exclusive memory access

2021-06-10 Thread Alistair Popple
On Friday, 11 June 2021 9:04:19 AM AEST Peter Xu wrote:
> On Fri, Jun 11, 2021 at 12:21:26AM +1000, Alistair Popple wrote:
> > > Hmm, the thing is.. to me FOLL_SPLIT_PMD should have similar effect to 
> > > explicit
> > > call split_huge_pmd_address(), afaict.  Since both of them use 
> > > __split_huge_pmd()
> > > internally which will generate that unwanted CLEAR notify.
> >
> > Agree that gup calls __split_huge_pmd() via split_huge_pmd_address()
> > which will always CLEAR. However gup only calls split_huge_pmd_address() if 
> > it
> > finds a thp pmd. In follow_pmd_mask() we have:
> >
> >   if (likely(!pmd_trans_huge(pmdval)))
> >   return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> >
> > So I don't think we have a problem here.
> 
> Sorry I didn't follow here..  We do FOLL_SPLIT_PMD after this check, right?  I
> mean, if it's a thp for the current mm, afaict pmd_trans_huge() should return
> true above, so we'll skip follow_page_pte(); then we'll check FOLL_SPLIT_PMD
> and do the split, then the CLEAR notify.  Hmm.. Did I miss something?

That seems correct - if the thp is not mapped with a pmd we won't split and we
won't CLEAR. If there is a thp pmd we will split and CLEAR, but in that case it
is fine - we will retry, but the retry won't CLEAR because the pmd has
already been split.

The issue arises with doing it unconditionally in make device exclusive is that
you *always* CLEAR even if there is no thp pmd to split. Or at least that's my
understanding, please let me know if it doesn't make sense.

 - Alistair

> --
> Peter Xu
> 






Re: [Nouveau] [PATCH v10 07/10] mm: Device exclusive memory access

2021-06-10 Thread Alistair Popple
On Friday, 11 June 2021 4:04:35 AM AEST Peter Xu wrote:
> On Thu, Jun 10, 2021 at 10:18:25AM +1000, Alistair Popple wrote:
> > > > The main problem is split_huge_pmd_address() unconditionally calls a mmu
> > > > notifier so I would need to plumb in passing an owner everywhere which 
> > > > could
> > > > get messy.
> > >
> > > Could I ask why?  split_huge_pmd_address() will notify with CLEAR, so I'm 
> > > a bit
> > > confused why we need to pass over the owner.
> >
> > Sure, it is the same reason we need to pass it for the exclusive notifier.
> > Any invalidation during the make exclusive operation will break the mmu read
> > side critical section forcing a retry of the operation. The owner field is 
> > what
> > is used to filter out invalidations (such as the exclusive invalidation) 
> > that
> > don't need to be retried.
> 
> Do you mean the mmu_interval_read_begin|retry() calls?

Yep.

> Hmm, the thing is.. to me FOLL_SPLIT_PMD should have similar effect to 
> explicit
> call split_huge_pmd_address(), afaict.  Since both of them use 
> __split_huge_pmd()
> internally which will generate that unwanted CLEAR notify.

Agree that gup calls __split_huge_pmd() via split_huge_pmd_address()
which will always CLEAR. However gup only calls split_huge_pmd_address() if it
finds a thp pmd. In follow_pmd_mask() we have:

if (likely(!pmd_trans_huge(pmdval)))
return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);

So I don't think we have a problem here.

> If that's the case, I think it fails because split_huge_pmd_address() will
> trigger that CLEAR notify unconditionally (even if it's not a thp; not sure
> whether it should be optimized to not notify at all... definitely another
> story), while FOLL_SPLIT_PMD will skip the notify as it calls split_huge_pmd()
> instead, who checks the pmd before calling __split_huge_pmd().
> 
> Does it also mean that if there's a real THP it won't really work?  As then
> FOLL_SPLIT_PMD will start to trigger that CLEAR notify too, I think..
> 
> --
> Peter Xu
> 






Re: [Nouveau] [PATCH v10 07/10] mm: Device exclusive memory access

2021-06-09 Thread Alistair Popple
On Thursday, 10 June 2021 2:05:06 AM AEST Peter Xu wrote:
> On Wed, Jun 09, 2021 at 07:38:04PM +1000, Alistair Popple wrote:
> > On Wednesday, 9 June 2021 4:33:52 AM AEST Peter Xu wrote:
> > > On Mon, Jun 07, 2021 at 05:58:52PM +1000, Alistair Popple wrote:

[...]

> > For thp this means we could end up passing
> > tail pages to rmap_walk(), however it doesn't actually walk them.
> >
> > Based on the results of previous testing I had done I assumed rmap_walk()
> > filtered out tail pages. It does, and I didn't hit the BUG_ON above, but the
> > filtering was not as deliberate as assumed.
> >
> > I've gone back and looked at what was happening in my earlier tests and the
> > tail pages get filtered because the VMA is not getting locked in
> > page_lock_anon_vma_read() due to failing this check:
> >
> >   anon_mapping = (unsigned long)READ_ONCE(page->mapping);
> >   if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> >   goto out;
> >
> > And now I'm not sure it makes sense to read page->mapping of a tail page. So
> > it might be best if we explicitly ignore any tail pages returned from GUP, 
> > at
> > least for now (a future series will improve thp support such as adding a pmd
> > version for exclusive entries).
> 
> I feel like it's illegal to access page->mapping of tail pages; I looked at
> what happens if we call page_anon_vma() on a tail page:
> 
> struct anon_vma *page_anon_vma(struct page *page)
> {
> unsigned long mapping;
> 
> page = compound_head(page);
> mapping = (unsigned long)page->mapping;
> if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> return NULL;
> return __page_rmapping(page);
> }
> 
> It'll just take the head's mapping instead.  It makes sense since the tail 
> page
> shouldn't have a different value against the head page, afaiu.

Right, it makes no sense to look at ->mapping on a tail page because the field
is used for something else. On the 1st tail page it is ->compound_nr and on the
2nd tail page it is ->deferred_list. See the definitions of compound_nr() and
page_deferred_list() respectively. I suppose on the rest of the pages it could
be anything.

I think in practice it is probably ok - IIUC bit 0 won't be set for compound_nr
and certainly not for deferred_list->next (a pointer). But none of that seems
intentional, so it would be better to be explicit and not walk the tail pages.
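A minimal sketch of the explicit check being suggested (the helper name is
made up for illustration; the point is simply to reject tail pages before
they are handed to rmap_walk()):

static bool page_usable_for_exclusive(struct page *page)
{
	/* Only head/base anonymous pages may be given to the rmap walk;
	 * ->mapping of a tail page is repurposed and must not be trusted. */
	return page && !PageTail(page) && PageAnon(page);
}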

> It would be great if thp experts could chime in.  Before that happens, I agree
> with you that a safer approach is to explicitly not walk a tail page for its
> rmap (and I think the rmap of a tail page will be the same of the head
> anyways.. since they seem to share the anon_vma as quoted).
> >
> > > So... for thp mappings, wondering whether we should do normal GUP (without
> > > SPLIT), pass in always normal or head pages into rmap_walk(), but then
> > > unconditionally split_huge_pmd_address() in 
> > > page_make_device_exclusive_one()?
> >
> > That could work (although I think GUP will still return tail pages - see
> > follow_trans_huge_pmd() which is called from follow_pmd_mask() in gup).
> 
> Agreed.
> 
> > The main problem is split_huge_pmd_address() unconditionally calls a mmu
> > notifier so I would need to plumb in passing an owner everywhere which could
> > get messy.
> 
> Could I ask why?  split_huge_pmd_address() will notify with CLEAR, so I'm a 
> bit
> confused why we need to pass over the owner.

Sure, it is the same reason we need to pass it for the exclusive notifier.
Any invalidation during the make exclusive operation will break the mmu read
side critical section forcing a retry of the operation. The owner field is what
is used to filter out invalidations (such as the exclusive invalidation) that
don't need to be retried.
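Concretely, that filtering ends up looking much like the
nouveau_svm_range_invalidate() callback in patch 10/10 of this series; a
generic sketch (my_device stands in for the driver's owner cookie):

static bool driver_interval_invalidate(struct mmu_interval_notifier *mni,
				       const struct mmu_notifier_range *range,
				       unsigned long cur_seq)
{
	/* Our own exclusive notification carries our owner cookie and can
	 * be ignored; anything else bumps the sequence so the caller of
	 * mmu_interval_read_retry() knows to restart. */
	if (range->event == MMU_NOTIFY_EXCLUSIVE && range->owner == my_device)
		return true;

	mmu_interval_set_seq(mni, cur_seq);
	return true;
}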
 
> I thought plumb it right before your EXCLUSIVE notifier init would work?

I did try this just to double check and it doesn't work due to the unconditional
notifier.

> ---8<---
> diff --git a/mm/rmap.c b/mm/rmap.c
> index a94d9aed9d95..360ce86f3822 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2042,6 +2042,12 @@ static bool page_make_device_exclusive_one(struct page 
> *page,
> swp_entry_t entry;
> pte_t swp_pte;
> 
> +   /*
> +* Make sure thps split as device exclusive entries only support pte
> +* level for now.
> +*/
> +   split_huge_pmd_address(vma, address, false, page);
> +
> mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
>   vma->vm_mm, address, min(vma->vm_end,
>   address + page_size(page)), 
> args->owner);
> ---8<---
> 
> Thanks,
> 
> --
> Peter Xu
> 






Re: [Nouveau] [PATCH v10 07/10] mm: Device exclusive memory access

2021-06-09 Thread Alistair Popple
On Wednesday, 9 June 2021 4:33:52 AM AEST Peter Xu wrote:
> On Mon, Jun 07, 2021 at 05:58:52PM +1000, Alistair Popple wrote:
> 
> [...]
> 
> > +static bool page_make_device_exclusive_one(struct page *page,
> > + struct vm_area_struct *vma, unsigned long address, void *priv)
> > +{
> > + struct mm_struct *mm = vma->vm_mm;
> > + struct page_vma_mapped_walk pvmw = {
> > + .page = page,
> > + .vma = vma,
> > + .address = address,
> > + };
> > + struct make_exclusive_args *args = priv;
> > + pte_t pteval;
> > + struct page *subpage;
> > + bool ret = true;
> > + struct mmu_notifier_range range;
> > + swp_entry_t entry;
> > + pte_t swp_pte;
> > +
> > + mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
> > +   vma->vm_mm, address, min(vma->vm_end,
> > +   address + page_size(page)), 
> > args->owner);
> > + mmu_notifier_invalidate_range_start(&range);
> > +
> > + while (page_vma_mapped_walk(&pvmw)) {
> > + /* Unexpected PMD-mapped THP? */
> > + VM_BUG_ON_PAGE(!pvmw.pte, page);
> 
> [1]
> 
> > +
> > + if (!pte_present(*pvmw.pte)) {
> > + ret = false;
> > + page_vma_mapped_walk_done(&pvmw);
> > + break;
> > + }
> > +
> > + subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
> > + address = pvmw.address;
> 
> I raised a question here previously and didn't get an answer...
> 
> https://lore.kernel.org/linux-mm/YLDr%2FRyAdUR4q0kk@t490s/

Sorry, I had overlooked that. Will continue the discussion here.

> I think I get your point now and it does look possible that the split page can
> still be mapped somewhere else as thp, then having some subpage maintainance
> looks necessary.  The confusing part is above [1] you've also got that
> VM_BUG_ON_PAGE() assuming it must not be a mapped pmd at all..

Going back I thought your original question was whether subpage != page is
possible. My main point was it's possible if we get a thp head. In that case we
need to replace all pte's with exclusive entries because I haven't (yet)
defined a pmd version of device exclusive entries and also rmap_walk won't deal
with tail pages (see below).

> Then I remembered these code majorly come from the try_to_unmap() so I looked
> there.  I _think_ what's missing here is something like:
> 
> if (flags & TTU_SPLIT_HUGE_PMD)
> split_huge_pmd_address(vma, address, false, page);
> 
> at the entry of page_make_device_exclusive_one()?
>
> That !pte assertion in try_to_unmap() makes sense to me as long as it has 
> split
> the thp page first always.  However seems not the case for FOLL_SPLIT_PMD as
> you previously mentioned.

At present this is limited to PageAnon pages which have had CoW broken, which I
think means there shouldn't be other mappings so I expect the PMD will always
have been split into small PTEs mapping subpages by GUP which is what that
assertion [1] is checking. I could call split_huge_pmd_address() unconditionally
as suggested but see the discussion below.

> Meanwhile, I also started to wonder whether it's even right to call 
> rmap_walk()
> with tail pages...  Please see below.
> 
> > +
> > + /* Nuke the page table entry. */
> > + flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> > + pteval = ptep_clear_flush(vma, address, pvmw.pte);
> > +
> > + /* Move the dirty bit to the page. Now the pte is gone. */
> > + if (pte_dirty(pteval))
> > + set_page_dirty(page);
> > +
> > + /*
> > +  * Check that our target page is still mapped at the expected
> > +  * address.
> > +  */
> > + if (args->mm == mm && args->address == address &&
> > + pte_write(pteval))
> > + args->valid = true;
> > +
> > + /*
> > +  * Store the pfn of the page in a special migration
> > +  * pte. do_swap_page() will wait until the migration
> > +  * pte is removed and then restart fault handling.
> > +  */
> > + if (pte_write(pteval))
> > + entry = make_writable_device_exclusive_entry(
> > + page_to_pfn(subpage));
> > + 

[Nouveau] [PATCH v10 10/10] nouveau/svm: Implement atomic SVM access

2021-06-07 Thread Alistair Popple
Some NVIDIA GPUs do not support direct atomic access to system memory
via PCIe. Instead this must be emulated by granting the GPU exclusive
access to the memory. This is achieved by replacing CPU page table
entries with special swap entries that fault on userspace access.

The driver then grants the GPU permission to update the page undergoing
atomic access via the GPU page tables. When CPU access to the page is
required a CPU fault is raised which calls into the device driver via
MMU notifiers to revoke the atomic access. The original page table
entries are then restored allowing CPU access to proceed.
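A rough sketch of the CPU fault side described above, modelled on the
do_swap_page() hunk in patch 07/10 of this series (not the verbatim code):

	entry = pte_to_swp_entry(vmf->orig_pte);
	if (is_device_exclusive_entry(entry)) {
		vmf->page = pfn_swap_entry_to_page(entry);
		/* notifies the driver so it can revoke the GPU mapping,
		 * then restores the original pte */
		ret = remove_device_exclusive_entry(vmf);
	}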

Signed-off-by: Alistair Popple 
Reviewed-by: Ben Skeggs 

---

v10:
* Added a fix from Colin King to check the return code of
  make_device_exclusive.

v9:
* Added Ben's Reviewed-By

v7:
* Removed magic values for fault access levels
* Improved readability of fault comparison code

v4:
* Check that page table entries haven't changed before mapping on the
  device
---
 drivers/gpu/drm/nouveau/include/nvif/if000c.h |   1 +
 drivers/gpu/drm/nouveau/nouveau_svm.c | 126 --
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h |   1 +
 .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c|   6 +
 4 files changed, 123 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/include/nvif/if000c.h 
b/drivers/gpu/drm/nouveau/include/nvif/if000c.h
index d6dd40f21eed..9c7ff56831c5 100644
--- a/drivers/gpu/drm/nouveau/include/nvif/if000c.h
+++ b/drivers/gpu/drm/nouveau/include/nvif/if000c.h
@@ -77,6 +77,7 @@ struct nvif_vmm_pfnmap_v0 {
 #define NVIF_VMM_PFNMAP_V0_APER   0x00f0ULL
 #define NVIF_VMM_PFNMAP_V0_HOST   0xULL
 #define NVIF_VMM_PFNMAP_V0_VRAM   0x0010ULL
+#define NVIF_VMM_PFNMAP_V0_A 0x0004ULL
 #define NVIF_VMM_PFNMAP_V0_W  0x0002ULL
 #define NVIF_VMM_PFNMAP_V0_V  0x0001ULL
 #define NVIF_VMM_PFNMAP_V0_NONE   0xULL
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index a195e48c9aee..63a4976b906b 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct nouveau_svm {
struct nouveau_drm *drm;
@@ -67,6 +68,11 @@ struct nouveau_svm {
} buffer[1];
 };
 
+#define FAULT_ACCESS_READ 0
+#define FAULT_ACCESS_WRITE 1
+#define FAULT_ACCESS_ATOMIC 2
+#define FAULT_ACCESS_PREFETCH 3
+
 #define SVM_DBG(s,f,a...) NV_DEBUG((s)->drm, "svm: "f"\n", ##a)
 #define SVM_ERR(s,f,a...) NV_WARN((s)->drm, "svm: "f"\n", ##a)
 
@@ -411,6 +417,24 @@ nouveau_svm_fault_cancel_fault(struct nouveau_svm *svm,
  fault->client);
 }
 
+static int
+nouveau_svm_fault_priority(u8 fault)
+{
+   switch (fault) {
+   case FAULT_ACCESS_PREFETCH:
+   return 0;
+   case FAULT_ACCESS_READ:
+   return 1;
+   case FAULT_ACCESS_WRITE:
+   return 2;
+   case FAULT_ACCESS_ATOMIC:
+   return 3;
+   default:
+   WARN_ON_ONCE(1);
+   return -1;
+   }
+}
+
 static int
 nouveau_svm_fault_cmp(const void *a, const void *b)
 {
@@ -421,9 +445,8 @@ nouveau_svm_fault_cmp(const void *a, const void *b)
return ret;
if ((ret = (s64)fa->addr - fb->addr))
return ret;
-   /*XXX: atomic? */
-   return (fa->access == 0 || fa->access == 3) -
-  (fb->access == 0 || fb->access == 3);
+   return nouveau_svm_fault_priority(fa->access) -
+   nouveau_svm_fault_priority(fb->access);
 }
 
 static void
@@ -487,6 +510,10 @@ static bool nouveau_svm_range_invalidate(struct 
mmu_interval_notifier *mni,
struct svm_notifier *sn =
container_of(mni, struct svm_notifier, notifier);
 
+   if (range->event == MMU_NOTIFY_EXCLUSIVE &&
+   range->owner == sn->svmm->vmm->cli->drm->dev)
+   return true;
+
/*
 * serializes the update to mni->invalidate_seq done by caller and
 * prevents invalidation of the PTE from progressing while HW is being
@@ -555,6 +582,71 @@ static void nouveau_hmm_convert_pfn(struct nouveau_drm 
*drm,
args->p.phys[0] |= NVIF_VMM_PFNMAP_V0_W;
 }
 
+static int nouveau_atomic_range_fault(struct nouveau_svmm *svmm,
+  struct nouveau_drm *drm,
+  struct nouveau_pfnmap_args *args, u32 size,
+  struct svm_notifier *notifier)
+{
+   unsigned long timeout =
+   jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+   str

[Nouveau] [PATCH v10 09/10] nouveau/svm: Refactor nouveau_range_fault

2021-06-07 Thread Alistair Popple
Call mmu_interval_notifier_insert() as part of nouveau_range_fault().
This doesn't introduce any functional change but makes it easier for a
subsequent patch to alter the behaviour of nouveau_range_fault() to
support GPU atomic operations.

Signed-off-by: Alistair Popple 
Reviewed-by: Ben Skeggs 

---

v9:

* Added Ben's Reviewed-By (thanks!)
---
 drivers/gpu/drm/nouveau/nouveau_svm.c | 34 ---
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 94f841026c3b..a195e48c9aee 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -567,18 +567,27 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
unsigned long hmm_pfns[1];
struct hmm_range range = {
.notifier = &notifier->notifier,
-   .start = notifier->notifier.interval_tree.start,
-   .end = notifier->notifier.interval_tree.last + 1,
.default_flags = hmm_flags,
.hmm_pfns = hmm_pfns,
.dev_private_owner = drm->dev,
};
-   struct mm_struct *mm = notifier->notifier.mm;
+   struct mm_struct *mm = svmm->notifier.mm;
int ret;
 
+   ret = mmu_interval_notifier_insert(&notifier->notifier, mm,
+   args->p.addr, args->p.size,
+   &nouveau_svm_mni_ops);
+   if (ret)
+   return ret;
+
+   range.start = notifier->notifier.interval_tree.start;
+   range.end = notifier->notifier.interval_tree.last + 1;
+
while (true) {
-   if (time_after(jiffies, timeout))
-   return -EBUSY;
+   if (time_after(jiffies, timeout)) {
+   ret = -EBUSY;
+   goto out;
+   }
 
range.notifier_seq = mmu_interval_read_begin(range.notifier);
mmap_read_lock(mm);
@@ -587,7 +596,7 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
if (ret) {
if (ret == -EBUSY)
continue;
-   return ret;
+   goto out;
}
 
mutex_lock(&svmm->mutex);
@@ -606,6 +615,9 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
svmm->vmm->vmm.object.client->super = false;
mutex_unlock(&svmm->mutex);
 
+out:
+   mmu_interval_notifier_remove(&notifier->notifier);
+
return ret;
 }
 
@@ -727,14 +739,8 @@ nouveau_svm_fault(struct nvif_notify *notify)
}
 
notifier.svmm = svmm;
-   ret = mmu_interval_notifier_insert(&notifier.notifier, mm,
-  args.i.p.addr, args.i.p.size,
-  &nouveau_svm_mni_ops);
-   if (!ret) {
-   ret = nouveau_range_fault(svmm, svm->drm, &args.i,
-   sizeof(args), hmm_flags, &notifier);
-   mmu_interval_notifier_remove(&notifier.notifier);
-   }
+   ret = nouveau_range_fault(svmm, svm->drm, &args.i,
+   sizeof(args), hmm_flags, &notifier);
mmput(mm);
 
limit = args.i.p.addr + args.i.p.size;
-- 
2.20.1



[Nouveau] [PATCH v10 08/10] mm: Selftests for exclusive device memory

2021-06-07 Thread Alistair Popple
Adds some selftests for exclusive device memory.

Signed-off-by: Alistair Popple 
Acked-by: Jason Gunthorpe 
Tested-by: Ralph Campbell 
Reviewed-by: Ralph Campbell 
---
 lib/test_hmm.c | 124 +++
 lib/test_hmm_uapi.h|   2 +
 tools/testing/selftests/vm/hmm-tests.c | 158 +
 3 files changed, 284 insertions(+)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 5c9f5a020c1d..305a9d9e2b4c 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "test_hmm_uapi.h"
 
@@ -46,6 +47,7 @@ struct dmirror_bounce {
unsigned long   cpages;
 };
 
+#define DPT_XA_TAG_ATOMIC 1UL
 #define DPT_XA_TAG_WRITE 3UL
 
 /*
@@ -619,6 +621,54 @@ static void dmirror_migrate_alloc_and_copy(struct 
migrate_vma *args,
}
 }
 
+static int dmirror_check_atomic(struct dmirror *dmirror, unsigned long start,
+unsigned long end)
+{
+   unsigned long pfn;
+
+   for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++) {
+   void *entry;
+   struct page *page;
+
entry = xa_load(&dmirror->pt, pfn);
+   page = xa_untag_pointer(entry);
+   if (xa_pointer_tag(entry) == DPT_XA_TAG_ATOMIC)
+   return -EPERM;
+   }
+
+   return 0;
+}
+
+static int dmirror_atomic_map(unsigned long start, unsigned long end,
+ struct page **pages, struct dmirror *dmirror)
+{
+   unsigned long pfn, mapped = 0;
+   int i;
+
+   /* Map the migrated pages into the device's page tables. */
+   mutex_lock(&dmirror->mutex);
+
+   for (i = 0, pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); 
pfn++, i++) {
+   void *entry;
+
+   if (!pages[i])
+   continue;
+
+   entry = pages[i];
+   entry = xa_tag_pointer(entry, DPT_XA_TAG_ATOMIC);
+   entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
+   if (xa_is_err(entry)) {
+   mutex_unlock(&dmirror->mutex);
+   return xa_err(entry);
+   }
+
+   mapped++;
+   }
+
+   mutex_unlock(&dmirror->mutex);
+   return mapped;
+}
+
 static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
struct dmirror *dmirror)
 {
@@ -661,6 +711,71 @@ static int dmirror_migrate_finalize_and_map(struct 
migrate_vma *args,
return 0;
 }
 
+static int dmirror_exclusive(struct dmirror *dmirror,
+struct hmm_dmirror_cmd *cmd)
+{
+   unsigned long start, end, addr;
+   unsigned long size = cmd->npages << PAGE_SHIFT;
+   struct mm_struct *mm = dmirror->notifier.mm;
+   struct page *pages[64];
+   struct dmirror_bounce bounce;
+   unsigned long next;
+   int ret;
+
+   start = cmd->addr;
+   end = start + size;
+   if (end < start)
+   return -EINVAL;
+
+   /* Since the mm is for the mirrored process, get a reference first. */
+   if (!mmget_not_zero(mm))
+   return -EINVAL;
+
+   mmap_read_lock(mm);
+   for (addr = start; addr < end; addr = next) {
+   int i, mapped;
+
+   if (end < addr + (ARRAY_SIZE(pages) << PAGE_SHIFT))
+   next = end;
+   else
+   next = addr + (ARRAY_SIZE(pages) << PAGE_SHIFT);
+
+   ret = make_device_exclusive_range(mm, addr, next, pages, NULL);
+   mapped = dmirror_atomic_map(addr, next, pages, dmirror);
+   for (i = 0; i < ret; i++) {
+   if (pages[i]) {
+   unlock_page(pages[i]);
+   put_page(pages[i]);
+   }
+   }
+
+   if (addr + (mapped << PAGE_SHIFT) < next) {
+   mmap_read_unlock(mm);
+   mmput(mm);
+   return -EBUSY;
+   }
+   }
+   mmap_read_unlock(mm);
+   mmput(mm);
+
+   /* Return the migrated data for verification. */
+   ret = dmirror_bounce_init(&bounce, start, size);
+   if (ret)
+   return ret;
+   mutex_lock(&dmirror->mutex);
+   ret = dmirror_do_read(dmirror, start, end, &bounce);
+   mutex_unlock(&dmirror->mutex);
+   if (ret == 0) {
+   if (copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr,
+bounce.size))
+   ret = -EFAULT;
+   }
+
+   cmd->cpages = bounce.cpages;
+   dmirror_bounce_fini(&bounce);
+   return ret;
+}
+
 static int dmirror_migrate(struct dmirror *dmirror,
   struct hmm_dmirror_cmd *cmd)
 {
@@ -949,6 +1064,15 @@ static long dmi

[Nouveau] [PATCH v10 07/10] mm: Device exclusive memory access

2021-06-07 Thread Alistair Popple
Some devices require exclusive write access to shared virtual
memory (SVM) ranges to perform atomic operations on that memory. This
requires CPU page tables to be updated to deny access whilst atomic
operations are occurring.

In order to do this introduce a new swap entry
type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for
exclusive access by a device all page table mappings for the particular
range are replaced with device exclusive swap entries. This causes any
CPU access to the page to result in a fault.

Faults are resolved by replacing the faulting entry with the original
mapping. This results in MMU notifiers being called which a driver uses
to update access permissions such as revoking atomic access. After
notifiers have been called the device will no longer have exclusive
access to the region.

Walking of the page tables to find the target pages is handled by
get_user_pages() rather than a direct page table walk. A direct page
table walk similar to what migrate_vma_collect()/unmap() does could also
have been utilised. However this resulted in more code similar in
functionality to what get_user_pages() provides as page faulting is
required to make the PTEs present and to break COW.
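A hedged sketch of how a driver consumes this API, based on the description
above and the hmm_test/nouveau users in this series; NPAGES, my_owner and
program_device_atomic_pte() are illustrative names only:

	struct page *pages[NPAGES];
	int i, npages;

	npages = make_device_exclusive_range(mm, start, end, pages, my_owner);
	for (i = 0; i < npages; i++) {
		if (!pages[i])
			continue;
		/* hypothetical driver hook: map the page for device atomics */
		program_device_atomic_pte(pages[i]);
		/* exclusive access lasts until the lock and reference drop */
		unlock_page(pages[i]);
		put_page(pages[i]);
	}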

Signed-off-by: Alistair Popple 
Reviewed-by: Christoph Hellwig 

---

v10:
* Make device exclusive code conditional on CONFIG_DEVICE_PRIVATE.
* Updates to code comments and more minor code cleanups.

v9:
* Split rename of migrate_pgmap_owner into a separate patch.
* Added comments explaining SWP_DEVICE_EXCLUSIVE_* entries.
* Renamed try_to_protect{_one} to page_make_device_exclusive{_one} based
  somewhat on a suggestion from Peter Xu. I was never particularly happy
with try_to_protect() as a name so I think this is better.
* Removed unnecessary code and reworded some comments based on feedback
  from Peter Xu.
* Removed the VMA walk when restoring PTEs for device-exclusive entries.
* Simplified implementation of copy_pte_range() to fail if the page
  cannot be locked. This might lead to occasional fork() failures but at
  this stage we don't think that will be an issue.

v8:
* Remove device exclusive entries on fork rather than copy them.

v7:
* Added Christoph's Reviewed-by.
* Minor cosmetic cleanups suggested by Christoph.
* Replace mmu_notifier_range_init_migrate/exclusive with
  mmu_notifier_range_init_owner as suggested by Christoph.
* Replaced lock_page() with lock_page_retry() when handling faults.
* Restrict to anonymous pages for now.

v6:
* Fixed a bisectablity issue due to incorrectly applying the rename of
  migrate_pgmap_owner to the wrong patches for Nouveau and hmm_test.

v5:
* Renamed range->migrate_pgmap_owner to range->owner.
* Added MMU_NOTIFY_EXCLUSIVE to allow passing of a driver cookie which
  allows notifiers called as a result of make_device_exclusive_range() to
  be ignored.
* Added a check to try_to_protect_one() to detect if the pages originally
  returned from get_user_pages() have been unmapped or not.
* Removed check_device_exclusive_range() as it is no longer required with
  the other changes.
* Documentation update.

v4:
* Add function to check that mappings are still valid and exclusive.
* s/long/unsigned long/ in make_device_exclusive_entry().
---
 Documentation/vm/hmm.rst |  17 
 include/linux/mmu_notifier.h |   6 ++
 include/linux/rmap.h |   4 +
 include/linux/swap.h |   9 +-
 include/linux/swapops.h  |  44 -
 mm/hmm.c |   5 +
 mm/memory.c  | 127 +++-
 mm/mprotect.c|   8 ++
 mm/page_vma_mapped.c |   9 +-
 mm/rmap.c| 182 +++
 10 files changed, 401 insertions(+), 10 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 3df79307a797..a14c2938e7af 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -405,6 +405,23 @@ between device driver specific code and shared common code:
 
The lock can now be released.
 
+Exclusive access memory
+===
+
+Some devices have features such as atomic PTE bits that can be used to 
implement
+atomic access to system memory. To support atomic operations to a shared 
virtual
+memory page such a device needs access to that page which is exclusive of any
+userspace access from the CPU. The ``make_device_exclusive_range()`` function
+can be used to make a memory range inaccessible from userspace.
+
+This replaces all mappings for pages in the given range with special swap
+entries. Any attempt to access the swap entry results in a fault which is
+resolved by replacing the entry with the original mapping. A driver gets
+notified that the mapping has been changed by MMU notifiers, after which point
+it will no longer have exclusive access to the page. Exclusive access is
+guaranteed to last until the driver drops the page lock and page reference, at
+which point any CPU faults on the page may p

[Nouveau] [PATCH v10 06/10] mm/memory.c: Allow different return codes for copy_nonpresent_pte()

2021-06-07 Thread Alistair Popple
Currently if copy_nonpresent_pte() returns a non-zero value it is
assumed to be a swap entry which requires further processing outside the
loop in copy_pte_range() after dropping locks. This prevents other
values being returned to signal conditions such as failure which a
subsequent change requires.

Instead make copy_nonpresent_pte() return an error code if further
processing is required and read the value for the swap entry in the main
loop under the ptl.

Signed-off-by: Alistair Popple 

---

v10:

Use a unique error code and only check return codes for handling.

v9:

New for v9 to allow device exclusive handling to occur in
copy_nonpresent_pte().
---
 mm/memory.c | 26 --
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2fb455c365c2..0982cab37ecb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -718,7 +718,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
 
if (likely(!non_swap_entry(entry))) {
if (swap_duplicate(entry) < 0)
-   return entry.val;
+   return -EIO;
 
/* make sure dst_mm is on swapoff's mmlist. */
if (unlikely(list_empty(&dst_mm->mmlist))) {
@@ -974,11 +974,13 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma,
continue;
}
if (unlikely(!pte_present(*src_pte))) {
-   entry.val = copy_nonpresent_pte(dst_mm, src_mm,
-   dst_pte, src_pte,
-   src_vma, addr, rss);
-   if (entry.val)
+   ret = copy_nonpresent_pte(dst_mm, src_mm,
+   dst_pte, src_pte,
+   src_vma, addr, rss);
+   if (ret == -EIO) {
+   entry = pte_to_swp_entry(*src_pte);
break;
+   }
progress += 8;
continue;
}
@@ -1011,20 +1013,24 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma,
pte_unmap_unlock(orig_dst_pte, dst_ptl);
cond_resched();
 
-   if (entry.val) {
+   if (ret == -EIO) {
+   VM_WARN_ON_ONCE(!entry.val);
if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
ret = -ENOMEM;
goto out;
}
entry.val = 0;
-   } else if (ret) {
-   WARN_ON_ONCE(ret != -EAGAIN);
+   } else if (ret == -EAGAIN) {
prealloc = page_copy_prealloc(src_mm, src_vma, addr);
if (!prealloc)
return -ENOMEM;
-   /* We've captured and resolved the error. Reset, try again. */
-   ret = 0;
+   } else if (ret) {
+   VM_WARN_ON_ONCE(1);
}
+
+   /* We've captured and resolved the error. Reset, try again. */
+   ret = 0;
+
if (addr != end)
goto again;
 out:
-- 
2.20.1



[Nouveau] [PATCH v10 05/10] mm: Rename migrate_pgmap_owner

2021-06-07 Thread Alistair Popple
MMU notifier ranges have a migrate_pgmap_owner field which is used by
drivers to store a pointer. This is subsequently used by the driver
callback to filter MMU_NOTIFY_MIGRATE events. Other notifier event types
can also benefit from this filtering, so rename the
'migrate_pgmap_owner' field to 'owner' and create a new notifier
initialisation function to initialise this field.

Signed-off-by: Alistair Popple 
Suggested-by: Peter Xu 

---

v9:

Previously part of the next patch in the series ('mm: Device exclusive
memory access') but now split out as a separate change as suggested by
Peter Xu.
---
 Documentation/vm/hmm.rst  |  2 +-
 drivers/gpu/drm/nouveau/nouveau_svm.c |  2 +-
 include/linux/mmu_notifier.h  | 20 ++--
 lib/test_hmm.c|  2 +-
 mm/migrate.c  | 10 +-
 5 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 09e28507f5b2..3df79307a797 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -332,7 +332,7 @@ between device driver specific code and shared common code:
walks to fill in the ``args->src`` array with PFNs to be migrated.
The ``invalidate_range_start()`` callback is passed a
``struct mmu_notifier_range`` with the ``event`` field set to
-   ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to
+   ``MMU_NOTIFY_MIGRATE`` and the ``owner`` field set to
the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This
allows the device driver to skip the invalidation callback and only
invalidate device private MMU mappings that are actually migrating.
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index f18bd53da052..94f841026c3b 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -265,7 +265,7 @@ nouveau_svmm_invalidate_range_start(struct mmu_notifier *mn,
 * the invalidation is handled as part of the migration process.
 */
if (update->event == MMU_NOTIFY_MIGRATE &&
-   update->migrate_pgmap_owner == svmm->vmm->cli->drm->dev)
+   update->owner == svmm->vmm->cli->drm->dev)
goto out;
 
if (limit > svmm->unmanaged.start && start < svmm->unmanaged.limit) {
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 1a6a9eb6d3fa..8e428eb813b8 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -41,7 +41,7 @@ struct mmu_interval_notifier;
  *
  * @MMU_NOTIFY_MIGRATE: used during migrate_vma_collect() invalidate to signal
  * a device driver to possibly ignore the invalidation if the
- * migrate_pgmap_owner field matches the driver's device private pgmap owner.
+ * owner field matches the driver's device private pgmap owner.
  */
 enum mmu_notifier_event {
MMU_NOTIFY_UNMAP = 0,
@@ -269,7 +269,7 @@ struct mmu_notifier_range {
unsigned long end;
unsigned flags;
enum mmu_notifier_event event;
-   void *migrate_pgmap_owner;
+   void *owner;
 };
 
 static inline int mm_has_notifiers(struct mm_struct *mm)
@@ -521,14 +521,14 @@ static inline void mmu_notifier_range_init(struct 
mmu_notifier_range *range,
range->flags = flags;
 }
 
-static inline void mmu_notifier_range_init_migrate(
-   struct mmu_notifier_range *range, unsigned int flags,
+static inline void mmu_notifier_range_init_owner(
+   struct mmu_notifier_range *range,
+   enum mmu_notifier_event event, unsigned int flags,
struct vm_area_struct *vma, struct mm_struct *mm,
-   unsigned long start, unsigned long end, void *pgmap)
+   unsigned long start, unsigned long end, void *owner)
 {
-   mmu_notifier_range_init(range, MMU_NOTIFY_MIGRATE, flags, vma, mm,
-   start, end);
-   range->migrate_pgmap_owner = pgmap;
+   mmu_notifier_range_init(range, event, flags, vma, mm, start, end);
+   range->owner = owner;
 }
 
 #define ptep_clear_flush_young_notify(__vma, __address, __ptep)
\
@@ -655,8 +655,8 @@ static inline void _mmu_notifier_range_init(struct 
mmu_notifier_range *range,
 
 #define mmu_notifier_range_init(range,event,flags,vma,mm,start,end)  \
_mmu_notifier_range_init(range, start, end)
-#define mmu_notifier_range_init_migrate(range, flags, vma, mm, start, end, \
-   pgmap) \
+#define mmu_notifier_range_init_owner(range, event, flags, vma, mm, start, \
+   end, owner) \
_mmu_notifier_range_init(range, start, end)
 
 static inline bool
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 80a78877bd93..5c9f5a020c1d 100644
--- a/lib/test_hmm.c
+++ 

[Nouveau] [PATCH v10 04/10] mm/rmap: Split migration into its own function

2021-06-07 Thread Alistair Popple
Migration is currently implemented as a mode of operation for
try_to_unmap_one() generally specified by passing the TTU_MIGRATION flag
or in the case of splitting a huge anonymous page TTU_SPLIT_FREEZE.

However it does not have much in common with the rest of the unmap
functionality of try_to_unmap_one() and thus splitting it into a
separate function reduces the complexity of try_to_unmap_one() making it
more readable.

Several simplifications can also be made in try_to_migrate_one() based
on the following observations:

 - All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK.
 - No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON.
 - No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH.

TTU_SPLIT_FREEZE is a special case of migration used when splitting an
anonymous page. This is most easily dealt with by calling the correct
function from unmap_page() in mm/huge_memory.c  - either
try_to_migrate() for PageAnon or try_to_unmap().

Signed-off-by: Alistair Popple 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Ralph Campbell 

---

v5:
* Added comments about how PMD splitting works for migration vs.
  unmapping
* Tightened up the flag check in try_to_migrate() to be explicit about
  which TTU_XXX flags are supported.
---
 include/linux/rmap.h |   4 +-
 mm/huge_memory.c |  15 +-
 mm/migrate.c |   9 +-
 mm/rmap.c| 358 ---
 4 files changed, 280 insertions(+), 106 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 38a746787c2f..0e25d829f742 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -86,8 +86,6 @@ struct anon_vma_chain {
 };
 
 enum ttu_flags {
-   TTU_MIGRATION   = 0x1,  /* migration mode */
-
TTU_SPLIT_HUGE_PMD  = 0x4,  /* split huge PMD if any */
TTU_IGNORE_MLOCK= 0x8,  /* ignore mlock */
TTU_IGNORE_HWPOISON = 0x20, /* corrupted page is recoverable */
@@ -96,7 +94,6 @@ enum ttu_flags {
 * do a final flush if necessary */
TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock:
 * caller holds it */
-   TTU_SPLIT_FREEZE= 0x100,/* freeze pte under 
splitting thp */
 };
 
 #ifdef CONFIG_MMU
@@ -193,6 +190,7 @@ static inline void page_dup_rmap(struct page *page, bool 
compound)
 int page_referenced(struct page *, int is_locked,
struct mem_cgroup *memcg, unsigned long *vm_flags);
 
+bool try_to_migrate(struct page *page, enum ttu_flags flags);
 bool try_to_unmap(struct page *, enum ttu_flags flags);
 
 /* Avoid racy checks */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2ec6dab72217..6dddc75b89ee 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2345,16 +2345,21 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
 
 static void unmap_page(struct page *page)
 {
-   enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK |
-   TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
+   enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
bool unmap_success;
 
VM_BUG_ON_PAGE(!PageHead(page), page);
 
if (PageAnon(page))
-   ttu_flags |= TTU_SPLIT_FREEZE;
-
-   unmap_success = try_to_unmap(page, ttu_flags);
+   unmap_success = try_to_migrate(page, ttu_flags);
+   else
+   /*
+* Don't install migration entries for file backed pages. This
+* helps handle cases when i_size is in the middle of the page
+* as there is no need to unmap pages beyond i_size manually.
+*/
+   unmap_success = try_to_unmap(page, ttu_flags |
+   TTU_IGNORE_MLOCK);
VM_BUG_ON_PAGE(!unmap_success, page);
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 930de919b1f2..05740f816bc4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1103,7 +1103,7 @@ static int __unmap_and_move(struct page *page, struct 
page *newpage,
/* Establish migration ptes */
VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
page);
-   try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK);
+   try_to_migrate(page, 0);
page_was_mapped = 1;
}
 
@@ -1305,7 +1305,7 @@ static int unmap_and_move_huge_page(new_page_t 
get_new_page,
 
if (page_mapped(hpage)) {
bool mapping_locked = false;
-   enum ttu_flags ttu = TTU_MIGRATION|TTU_IGNORE_MLOCK;
+   enum ttu_flags ttu = 0;
 
if (!PageAnon(hpage)) {
/*
@@ -1322,7 +1322,7 @@ static int unmap_and_move_huge_page(new_page_t 
get_new_page,
ttu |= TTU_RMAP_LOCKED;
}
 
-   try_to_unmap(hpage, ttu);
+   try_to_migrate(hpage, ttu);
   

[Nouveau] [PATCH v10 03/10] mm/rmap: Split try_to_munlock from try_to_unmap

2021-06-07 Thread Alistair Popple
The behaviour of try_to_unmap_one() is difficult to follow because it
performs different operations based on a fairly large set of flags used
in different combinations.

TTU_MUNLOCK is one such flag. However, it is exclusively used by
try_to_munlock(), which specifies no other flags. Therefore, rather than
overloading try_to_unmap_one() with unrelated behaviour, split this out into
its own function and remove the flag.
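
A rough sketch of the split-out walker (paraphrased from the mm/rmap.c hunk,
body abbreviated):

/*
 * Walks the VMAs mapping a page and mlocks the page if any VM_LOCKED VMA is
 * found. Once one is found the page is locked and the scan can terminate.
 */
void page_mlock(struct page *page)
{
	struct rmap_walk_control rwc = {
		.rmap_one = page_mlock_one,
		.done = page_not_mapped,
		.anon_lock = page_lock_anon_vma_read,
	};

	VM_BUG_ON_PAGE(!PageLocked(page) || PageLRU(page), page);

	rmap_walk(page, &rwc);
}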

Signed-off-by: Alistair Popple 
Reviewed-by: Ralph Campbell 
Reviewed-by: Christoph Hellwig 

---

v10:
* More comment fixes
* Restored the check of VM_LOCKED under the ptl. This closes a race in
  the unmap path.

v9:
* Improved comments

v8:
* Renamed try_to_munlock to page_mlock to better reflect what the
  function actually does.
* Removed the TODO from the documentation that this patch addresses.

v7:
* Added Christoph's Reviewed-by

v4:
* Removed redundant check for VM_LOCKED
---
 Documentation/vm/unevictable-lru.rst | 33 ++
 include/linux/rmap.h |  3 +-
 mm/mlock.c   | 12 ++---
 mm/rmap.c| 66 +---
 4 files changed, 69 insertions(+), 45 deletions(-)

diff --git a/Documentation/vm/unevictable-lru.rst 
b/Documentation/vm/unevictable-lru.rst
index 0e1490524f53..eae3af17f2d9 100644
--- a/Documentation/vm/unevictable-lru.rst
+++ b/Documentation/vm/unevictable-lru.rst
@@ -389,14 +389,14 @@ mlocked, munlock_vma_page() updates that zone statistics 
for the number of
 mlocked pages.  Note, however, that at this point we haven't checked whether
 the page is mapped by other VM_LOCKED VMAs.
 
-We can't call try_to_munlock(), the function that walks the reverse map to
+We can't call page_mlock(), the function that walks the reverse map to
 check for other VM_LOCKED VMAs, without first isolating the page from the LRU.
-try_to_munlock() is a variant of try_to_unmap() and thus requires that the page
+page_mlock() is a variant of try_to_unmap() and thus requires that the page
 not be on an LRU list [more on these below].  However, the call to
-isolate_lru_page() could fail, in which case we couldn't try_to_munlock().  So,
+isolate_lru_page() could fail, in which case we can't call page_mlock().  So,
 we go ahead and clear PG_mlocked up front, as this might be the only chance we
-have.  If we can successfully isolate the page, we go ahead and
-try_to_munlock(), which will restore the PG_mlocked flag and update the zone
+have.  If we can successfully isolate the page, we go ahead and call
+page_mlock(), which will restore the PG_mlocked flag and update the zone
 page statistics if it finds another VMA holding the page mlocked.  If we fail
 to isolate the page, we'll have left a potentially mlocked page on the LRU.
 This is fine, because we'll catch it later if and if vmscan tries to reclaim
@@ -545,31 +545,24 @@ munlock or munmap system calls, mm teardown 
(munlock_vma_pages_all), reclaim,
 holepunching, and truncation of file pages and their anonymous COWed pages.
 
 
-try_to_munlock() Reverse Map Scan
+page_mlock() Reverse Map Scan
 -
 
-.. warning::
-   [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
-   page_referenced() reverse map walker.
-
 When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call
 Handling ` above] tries to munlock a
 page, it needs to determine whether or not the page is mapped by any
 VM_LOCKED VMA without actually attempting to unmap all PTEs from the
 page.  For this purpose, the unevictable/mlock infrastructure
-introduced a variant of try_to_unmap() called try_to_munlock().
+introduced a variant of try_to_unmap() called page_mlock().
 
-try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
-mapped file and KSM pages with a flag argument specifying unlock versus unmap
-processing.  Again, these functions walk the respective reverse maps looking
-for VM_LOCKED VMAs.  When such a VMA is found, as in the try_to_unmap() case,
-the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK.  This
-undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page.
+page_mlock() walks the respective reverse maps looking for VM_LOCKED VMAs. When
+such a VMA is found the page is mlocked via mlock_vma_page(). This undoes the
+pre-clearing of the page's PG_mlocked done by munlock_vma_page.
 
-Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
+Note that page_mlock()'s reverse map walk must visit every VMA in a page's
 reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
 However, the scan can terminate when it encounters a VM_LOCKED VMA.
-Although try_to_munlock() might be called a great many times when munlocking a
+Although page_mlock() might be called a great many times when munlocking a
 large region or tearing down a large address space that has been mlocked via
 mlockall(), overall this is a fairly rare event

[Nouveau] [PATCH v10 02/10] mm/swapops: Rework swap entry manipulation code

2021-06-07 Thread Alistair Popple
Both migration and device private pages use special swap entries that
are manipulated by a range of inline functions. The arguments to these
are somewhat inconsistent, so rework them to remove flag type arguments
and to make the arguments similar for both read and write entry
creation.
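
As an illustration (not a hunk from the diff below), a typical caller changes
roughly like this:

	/* Before: a flag argument selects a read vs. write entry. */
	entry = make_migration_entry(page, pte_write(pteval));

	/* After: separate read/write helpers, both taking a pfn offset. */
	if (pte_write(pteval))
		entry = make_writable_migration_entry(page_to_pfn(page));
	else
		entry = make_readable_migration_entry(page_to_pfn(page));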

Signed-off-by: Alistair Popple 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ralph Campbell 
---
 include/linux/swapops.h | 56 ++---
 mm/debug_vm_pgtable.c   | 12 -
 mm/hmm.c|  2 +-
 mm/huge_memory.c| 26 +--
 mm/hugetlb.c| 10 +---
 mm/memory.c | 10 +---
 mm/migrate.c| 26 ++-
 mm/mprotect.c   | 10 +---
 mm/rmap.c   | 10 +---
 9 files changed, 100 insertions(+), 62 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 139be8235ad2..4dfd807ae52a 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -100,35 +100,35 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
 }
 
 #if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
-static inline swp_entry_t make_device_private_entry(struct page *page, bool 
write)
+static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
-   return swp_entry(write ? SWP_DEVICE_WRITE : SWP_DEVICE_READ,
-page_to_pfn(page));
+   return swp_entry(SWP_DEVICE_READ, offset);
 }
 
-static inline bool is_device_private_entry(swp_entry_t entry)
+static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset)
 {
-   int type = swp_type(entry);
-   return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
+   return swp_entry(SWP_DEVICE_WRITE, offset);
 }
 
-static inline void make_device_private_entry_read(swp_entry_t *entry)
+static inline bool is_device_private_entry(swp_entry_t entry)
 {
-   *entry = swp_entry(SWP_DEVICE_READ, swp_offset(*entry));
+   int type = swp_type(entry);
+   return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
 }
 
-static inline bool is_write_device_private_entry(swp_entry_t entry)
+static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
 }
 #else /* CONFIG_DEVICE_PRIVATE */
-static inline swp_entry_t make_device_private_entry(struct page *page, bool 
write)
+static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
return swp_entry(0, 0);
 }
 
-static inline void make_device_private_entry_read(swp_entry_t *entry)
+static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset)
 {
+   return swp_entry(0, 0);
 }
 
 static inline bool is_device_private_entry(swp_entry_t entry)
@@ -136,35 +136,32 @@ static inline bool is_device_private_entry(swp_entry_t 
entry)
return false;
 }
 
-static inline bool is_write_device_private_entry(swp_entry_t entry)
+static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
return false;
 }
 #endif /* CONFIG_DEVICE_PRIVATE */
 
 #ifdef CONFIG_MIGRATION
-static inline swp_entry_t make_migration_entry(struct page *page, int write)
-{
-   BUG_ON(!PageLocked(compound_head(page)));
-
-   return swp_entry(write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ,
-   page_to_pfn(page));
-}
-
 static inline int is_migration_entry(swp_entry_t entry)
 {
return unlikely(swp_type(entry) == SWP_MIGRATION_READ ||
swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
-static inline int is_write_migration_entry(swp_entry_t entry)
+static inline int is_writable_migration_entry(swp_entry_t entry)
 {
return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
-static inline void make_migration_entry_read(swp_entry_t *entry)
+static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
 {
-   *entry = swp_entry(SWP_MIGRATION_READ, swp_offset(*entry));
+   return swp_entry(SWP_MIGRATION_READ, offset);
+}
+
+static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
+{
+   return swp_entry(SWP_MIGRATION_WRITE, offset);
 }
 
 extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
@@ -174,21 +171,28 @@ extern void migration_entry_wait(struct mm_struct *mm, 
pmd_t *pmd,
 extern void migration_entry_wait_huge(struct vm_area_struct *vma,
struct mm_struct *mm, pte_t *pte);
 #else
+static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
+{
+   return swp_entry(0, 0);
+}
+
+static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
+{
+   return swp_entry(0, 0);
+}
 
-#define make_migration_entry(page, write) swp_entry(0, 0)
 static inline int is_migration_entry(swp_entry_t swp)
 {
return 0;
 }
 
-static inline void make_migration_entry_read(swp_entry_t *entryp) { }
 static inline void __migration_entry_wait(struct mm_struct *mm

[Nouveau] [PATCH v10 01/10] mm: Remove special swap entry functions

2021-06-07 Thread Alistair Popple
Remove multiple similar inline functions for dealing with different
types of special swap entries.

Both migration and device private swap entries use the swap offset to
store a pfn. Instead of multiple inline functions to obtain a struct
page for each swap entry type use a common function
pfn_swap_entry_to_page(). Also open-code the various entry_to_pfn()
functions as this results in shorter code that is easier to understand.
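
For reference, the common helper introduced here looks roughly like this
(simplified from the full patch):

static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
{
	struct page *p = pfn_to_page(swp_offset(entry));

	/* Migration entries may only be used while the page is locked. */
	BUG_ON(is_migration_entry(entry) && !PageLocked(p));

	return p;
}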

Signed-off-by: Alistair Popple 
Reviewed-by: Ralph Campbell 
Reviewed-by: Christoph Hellwig 

---

v9:
* Rebased on v5.13-rc2

v8:
* No changes

v7:
* Reworded commit message to include pfn_swap_entry_to_page()
* Added Christoph's Reviewed-by

v6:
* Removed redundant compound_page() call from inside PageLocked()
* Fixed a minor build issue for s390 reported by kernel test bot

v4:
* Added pfn_swap_entry_to_page()
* Reinstated check that migration entries point to locked pages
* Removed #define swapcache_prepare which isn't needed for CONFIG_SWAP=0
  builds
---
 arch/s390/mm/pgtable.c  |  2 +-
 fs/proc/task_mmu.c  | 23 +-
 include/linux/swap.h|  4 +--
 include/linux/swapops.h | 69 ++---
 mm/hmm.c|  5 ++-
 mm/huge_memory.c|  4 +--
 mm/memcontrol.c |  2 +-
 mm/memory.c | 10 +++---
 mm/migrate.c|  6 ++--
 mm/page_vma_mapped.c|  6 ++--
 10 files changed, 50 insertions(+), 81 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 18205f851c24..eec3a9d7176e 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -691,7 +691,7 @@ static void ptep_zap_swap_entry(struct mm_struct *mm, 
swp_entry_t entry)
if (!non_swap_entry(entry))
dec_mm_counter(mm, MM_SWAPENTS);
else if (is_migration_entry(entry)) {
-   struct page *page = migration_entry_to_page(entry);
+   struct page *page = pfn_swap_entry_to_page(entry);
 
dec_mm_counter(mm, mm_counter(page));
}
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fc9784544b24..0953732c8ce1 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -514,10 +514,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
} else {
mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
}
-   } else if (is_migration_entry(swpent))
-   page = migration_entry_to_page(swpent);
-   else if (is_device_private_entry(swpent))
-   page = device_private_entry_to_page(swpent);
+   } else if (is_pfn_swap_entry(swpent))
+   page = pfn_swap_entry_to_page(swpent);
} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
&& pte_none(*pte))) {
page = xa_load(>vm_file->f_mapping->i_pages,
@@ -549,7 +547,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
swp_entry_t entry = pmd_to_swp_entry(*pmd);
 
if (is_migration_entry(entry))
-   page = migration_entry_to_page(entry);
+   page = pfn_swap_entry_to_page(entry);
}
if (IS_ERR_OR_NULL(page))
return;
@@ -694,10 +692,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long 
hmask,
} else if (is_swap_pte(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);
 
-   if (is_migration_entry(swpent))
-   page = migration_entry_to_page(swpent);
-   else if (is_device_private_entry(swpent))
-   page = device_private_entry_to_page(swpent);
+   if (is_pfn_swap_entry(swpent))
+   page = pfn_swap_entry_to_page(swpent);
}
if (page) {
int mapcount = page_mapcount(page);
@@ -1384,11 +1380,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct 
pagemapread *pm,
frame = swp_type(entry) |
(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
flags |= PM_SWAP;
-   if (is_migration_entry(entry))
-   page = migration_entry_to_page(entry);
-
-   if (is_device_private_entry(entry))
-   page = device_private_entry_to_page(entry);
+   if (is_pfn_swap_entry(entry))
+   page = pfn_swap_entry_to_page(entry);
}
 
if (page && !PageAnon(page))
@@ -1445,7 +1438,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long 
addr, unsigned long end,
if (pmd_swp_soft_dirty(pmd))
flags |= PM_SOFT_DIRTY;
VM_BUG_ON(!is_pmd_migration_entry(pmd));
-   page = migration

[Nouveau] [PATCH v10 00/10] Add support for SVM atomics in Nouveau

2021-06-07 Thread Alistair Popple
Hi Andrew,

This is an update to address some comments on the previous version of
this series. Most are code comment updates, although there were a couple
of code changes as well. The most significant are:

 - Re-introduce the check of VM_LOCKED under the PTL in
   page_mlock_one(). This was present in an earlier version of the series
   but removed because we thought it was redundant. However Shakeel
   provided some background making it clear it is needed.

 - Reworked the return codes in copy_pte_range() based on suggestions
   from Peter Xu to hopefully make the code clearer and less error-prone.

 - Integrated a fix to the Nouveau code reported by Colin King.

As discussed, to minimise impact I have also made this dependent on
CONFIG_DEVICE_PRIVATE. Hopefully these changes don't break any other series that
may have been based on the previous version. I see there has been some
discussion from Hugh and others around patch order, so if you need me to rebase
these to a different branch let me know.

Introduction


Some devices have features such as atomic PTE bits that can be used to
implement atomic access to system memory. To support atomic operations to a
shared virtual memory page such a device needs access to that page which is
exclusive of the CPU. This series introduces a mechanism to temporarily
unmap pages granting exclusive access to a device.

These changes are required to support OpenCL atomic operations in Nouveau
to shared virtual memory (SVM) regions allocated with the
CL_MEM_SVM_ATOMICS clSVMAlloc flag. A more complete description of the
OpenCL SVM feature is available at
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/
OpenCL_API.html#_shared_virtual_memory .

Implementation
==

Exclusive device access is implemented by adding a new swap entry type
(SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main
difference is that on fault the original entry is immediately restored by
the fault handler instead of waiting.

Restoring the entry triggers calls to MMU notifiers which allow a device
driver to revoke the atomic access permission from the GPU prior to the CPU
finalising the entry.
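
As a rough illustration of how a driver is expected to use this (a sketch
only - the real changes are in the Nouveau patches; "my_dev" and
device_revoke_atomic_access() are placeholders):

/* Drop the device mapping on invalidation, but ignore the invalidation
 * generated by our own call to make_device_exclusive_range(). */
static bool my_invalidate(struct mmu_interval_notifier *mni,
			  const struct mmu_notifier_range *range,
			  unsigned long cur_seq)
{
	if (range->event == MMU_NOTIFY_EXCLUSIVE && range->owner == my_dev)
		return true;

	mmu_interval_set_seq(mni, cur_seq);
	device_revoke_atomic_access(my_dev, range->start, range->end);
	return true;
}

/* Marking a single page for exclusive device access: */
	mmap_read_lock(mm);
	ret = make_device_exclusive_range(mm, addr, addr + PAGE_SIZE,
					  &page, my_dev);
	mmap_read_unlock(mm);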

Patches
===

Patches 1 & 2 refactor existing migration and device private entry
functions.

Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
functionality into separate functions - try_to_migrate_one() and
try_to_munlock_one().

Patch 5 renames some existing code but does not introduce functionality.

Patch 6 is a small clean-up to swap entry handling in copy_pte_range().

Patch 7 contains the bulk of the implementation for device exclusive
memory.

Patch 8 contains some additions to the HMM selftests to ensure everything
works as expected.

Patch 9 is a cleanup for the Nouveau SVM implementation.

Patch 10 contains the implementation of atomic access for the Nouveau
driver.

Testing
===

This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
which checks that GPU atomic accesses to system memory are atomic. Without
this series the test fails as there is no way of write-protecting the page
mapping which results in the device clobbering CPU writes. For reference
the test is available at https://ozlabs.org/~apopple/opencl_svm_atomics/

Further testing has been performed by adding support for testing exclusive
access to the hmm-tests kselftests.


Alistair Popple (10):
  mm: Remove special swap entry functions
  mm/swapops: Rework swap entry manipulation code
  mm/rmap: Split try_to_munlock from try_to_unmap
  mm/rmap: Split migration into its own function
  mm: Rename migrate_pgmap_owner
  mm/memory.c: Allow different return codes for copy_nonpresent_pte()
  mm: Device exclusive memory access
  mm: Selftests for exclusive device memory
  nouveau/svm: Refactor nouveau_range_fault
  nouveau/svm: Implement atomic SVM access

 Documentation/vm/hmm.rst  |  19 +-
 Documentation/vm/unevictable-lru.rst  |  33 +-
 arch/s390/mm/pgtable.c|   2 +-
 drivers/gpu/drm/nouveau/include/nvif/if000c.h |   1 +
 drivers/gpu/drm/nouveau/nouveau_svm.c | 156 -
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h |   1 +
 .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c|   6 +
 fs/proc/task_mmu.c|  23 +-
 include/linux/mmu_notifier.h  |  26 +-
 include/linux/rmap.h  |  11 +-
 include/linux/swap.h  |  13 +-
 include/linux/swapops.h   | 123 ++--
 lib/test_hmm.c| 126 +++-
 lib/test_hmm_uapi.h   |   2 +
 mm/debug_vm_pgtable.c |  12 +-
 mm/hmm.c  |  12 +-
 mm/huge_memory.c  |  45 +-
 mm/hugetlb.c  |  10 +-
 mm/memcontrol.c   |   2 +-
 mm/

Re: [Nouveau] [PATCH v9 03/10] mm/rmap: Split try_to_munlock from try_to_unmap

2021-06-06 Thread Alistair Popple
On Saturday, 5 June 2021 10:41:03 AM AEST Shakeel Butt wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Fri, Jun 4, 2021 at 1:49 PM Liam Howlett  wrote:
> >
> > * Shakeel Butt  [210525 19:45]:
> > > On Tue, May 25, 2021 at 11:40 AM Liam Howlett  
wrote:
> > > >
> > > [...]
> > > > >
> > > > > +/*
> > > > > + * Walks the vma's mapping a page and mlocks the page if any locked 
vma's are
> > > > > + * found. Once one is found the page is locked and the scan can be 
terminated.
> > > > > + */
> > > >
> > > > Can you please add that this requires the mmap_sem() lock to the
> > > > comments?
> > > >
> > >
> > > Why does this require mmap_sem() lock? Also mmap_sem() lock of which 
mm_struct?
> >
> >
> > Doesn't the mlock_vma_page() require the mmap_sem() for reading?  The
> > mm_struct in vma->vm_mm;
> >
> 
> We are traversing all the vmas where this page is mapped of possibly
> different mm_structs. I don't think we want to take mmap_sem() of all
> those mm_structs. The commit b87537d9e2fe ("mm: rmap use pte lock not
> mmap_sem to set PageMlocked") removed exactly that.
> 
> >
> > From what I can see, at least the following paths have mmap_lock held
> > for writing:
> >
> > munlock_vma_pages_range() from __do_munmap()
> > munlokc_vma_pages_range() from remap_file_pages()
> >
> 
> The following path does not hold mmap_sem:
> 
> exit_mmap() -> munlock_vma_pages_all() -> munlock_vma_pages_range().
> 
> I would really suggest all to carefully read the commit message of
> b87537d9e2fe ("mm: rmap use pte lock not mmap_sem to set
> PageMlocked").
> 
> Particularly the following paragraph:
> ...
> Vlastimil Babka points out another race which this patch protects 
against.
>  try_to_unmap_one() might reach its mlock_vma_page() TestSetPageMlocked 
a
> moment after munlock_vma_pages_all() did its Phase 1 
TestClearPageMlocked:
> leaving PageMlocked and unevictable when it should be evictable.  
mmap_sem
> is ineffective because exit_mmap() does not hold it; page lock 
ineffective
> because __munlock_pagevec() only takes it afterwards, in Phase 2; pte 
lock
> is effective because __munlock_pagevec_fill() takes it to get the page,
> after VM_LOCKED was cleared from vm_flags, so visible to 
try_to_unmap_one.
> ...
>
> Alistair, please bring back the VM_LOCKED check with pte lock held and
> the comment "Holding pte lock, we do *not* need mmap_lock here".

Actually thanks for highlighting that paragraph. I have gone back through the 
code again in munlock_vma_pages_range() and think I have a better 
understanding of it now. So now I agree - the check of VM_LOCKED under the PTL 
is important to ensure mlock_vma_page() does not run after VM_LOCKED has been 
cleared and __munlock_pagevec_fill() has run.

Will post v10 to fix this and the try_to_munlock reference pointed out by Liam 
which I missed for v9. Thanks Shakeel for taking the time to point this out.

> One positive outcome of this cleanup patch is the removal of
> unnecessary invalidation (unmapping for kvm case) of secondary mmus.



___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v9 07/10] mm: Device exclusive memory access

2021-06-03 Thread Alistair Popple
On Friday, 4 June 2021 12:47:40 AM AEST Peter Xu wrote:
> External email: Use caution opening links or attachments
> 
> On Thu, Jun 03, 2021 at 09:39:32PM +1000, Alistair Popple wrote:
> > Reclaim won't run on the page due to the extra references from the special
> > swap entries.
> 
> That sounds reasonable, but I didn't find the point that stops it, probably
> due to my limited knowledge on the reclaim code.  Could you elaborate?

Sure, it isn't immediately obvious but it ends up being detected at the start 
of is_page_cache_freeable() in the pageout code:


static pageout_t pageout(struct page *page, struct address_space *mapping)
{

[...]

if (!is_page_cache_freeable(page))
return PAGE_KEEP;
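
For reference the check looks roughly like this (from memory, so treat it as
a sketch):

static int is_page_cache_freeable(struct page *page)
{
	/*
	 * A freeable page cache page is referenced only by the caller that
	 * isolated the page, the page cache and optional buffer heads at
	 * page->private.
	 */
	int page_cache_pins = thp_nr_pages(page);

	return page_count(page) - page_has_private(page) == 1 + page_cache_pins;
}

The extra references taken when installing the special swap entries mean this
comparison fails, so pageout() keeps the page.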

 - Alistair

> --
> Peter Xu




___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v9 07/10] mm: Device exclusive memory access

2021-06-03 Thread Alistair Popple
On Thursday, 3 June 2021 12:37:30 AM AEST Peter Xu wrote:
> External email: Use caution opening links or attachments
> 
> On Wed, Jun 02, 2021 at 06:50:37PM +1000, Balbir Singh wrote:
> > On Wed, May 26, 2021 at 12:17:18AM -0700, John Hubbard wrote:
> > > On 5/25/21 4:51 AM, Balbir Singh wrote:
> > > ...
> > > 
> > > > > How beneficial is this code to nouveau users?  I see that it permits
> > > > > a
> > > > > part of OpenCL to be implemented, but how useful/important is this
> > > > > in
> > > > > the real world?
> > > > 
> > > > That is a very good question! I've not reviewed the code, but a sample
> > > > program with the described use case would make things easy to parse.
> > > > I suspect that is not easy to build at the moment?
> > > 
> > > The cover letter says this:
> > > 
> > > This has been tested with upstream Mesa 21.1.0 and a simple OpenCL
> > > program
> > > which checks that GPU atomic accesses to system memory are atomic.
> > > Without
> > > this series the test fails as there is no way of write-protecting the
> > > page
> > > mapping which results in the device clobbering CPU writes. For reference
> > > the test is available at https://ozlabs.org/~apopple/opencl_svm_atomics/
> > > 
> > > Further testing has been performed by adding support for testing
> > > exclusive
> > > access to the hmm-tests kselftests.
> > > 
> > > ...so that seems to cover the "sample program" request, at least.
> > 
> > Thanks, I'll take a look
> > 
> > > > I wonder how we co-ordinate all the work the mm is doing, page
> > > > migration,
> > > > reclaim with device exclusive access? Do we have any numbers for the
> > > > worst
> > > > case page fault latency when something is marked away for exclusive
> > > > access?
> > > 
> > > CPU page fault latency is approximately "terrible", if a page is
> > > resident on the GPU. We have to spin up a DMA engine on the GPU and
> > > have it copy the page over the PCIe bus, after all.
> > > 
> > > > I presume for now this is anonymous memory only? SWP_DEVICE_EXCLUSIVE
> > > > would
> > > 
> > > Yes, for now.
> > > 
> > > > only impact the address space of programs using the GPU. Should the
> > > > exclusively marked range live in the unreclaimable list and recycled
> > > > back to active/in-active to account for the fact that
> > > > 
> > > > 1. It is not reclaimable and reclaim will only hurt via page faults?
> > > > 2. It ages the page correctly or at-least allows for that possibility
> > > > when the> > > 
> > > > page is used by the GPU.
> > > 
> > > I'm not sure that that is *necessarily* something we can conclude. It
> > > depends upon access patterns of each program. For example, a
> > > "reduction" parallel program sends over lots of data to the GPU, and
> > > only a tiny bit of (reduced!) data comes back to the CPU. In that case,
> > > freeing the physical page on the CPU is actually the best decision for
> > > the OS to make (if the OS is sufficiently prescient).> 
> > With a shared device or a device exclusive range, it would be good to get
> > the device usage pattern and update the mm with that knowledge, so that
> > the LRU can be better maintained. With your comment you seem to suggest
> > that a page used by the GPU might be a good candidate for reclaim based
> > on the CPU's understanding of the age of the page should not account for
> > use by the device
> > (are GPU workloads - access once and discard?)
> 
> Hmm, besides the aging info, this reminded me: do we need to isolate the
> page from lru too when marking device exclusive access?
> 
> Afaict the current patch didn't do that so I think it's reclaimable.  If we
> still have the rmap then we'll get a mmu notify CLEAR when unmapping that
> special pte, so device driver should be able to drop the ownership.  However
> we dropped the rmap when marking exclusive.  Now I don't know whether and
> how it'll work if page reclaim runs with the page being exclusively owned
> if without isolating the page..

Reclaim won't run on the page due to the extra references from the special 
swap entries.

> --
> Peter Xu




___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v9 07/10] mm: Device exclusive memory access

2021-05-27 Thread Alistair Popple
On Thursday, 27 May 2021 11:04:57 PM AEST Peter Xu wrote:
> On Thu, May 27, 2021 at 01:35:39PM +1000, Alistair Popple wrote:
> > > > + *
> > > > + * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device
> > > > will
> > > > no + * longer have exclusive access to the page. May ignore the
> > > > invalidation that's + * part of make_device_exclusive_range() if the
> > > > owner field
> > > > + * matches the value passed to make_device_exclusive_range().
> > > 
> > > Perhaps s/matches/does not match/?
> > 
> > No, "matches" is correct. The MMU_NOTIFY_EXCLUSIVE notifier is to notify a
> > listener that a range is being invalidated for the purpose of making the
> > range available for some device to have exclusive access to. Which does
> > also mean a device getting the notification no longer has exclusive
> > access if it already did.
> > 
> > A unique type is needed because when creating the range a driver needs to
> > form a mmu critical section (with mmu_interval_read_begin()/
> > mmu_interval_read_end()) to ensure the entry remains valid long enough to
> > program the device pte and hasn't been invalidated.
> > 
> > However without a way of filtering any invalidations will result in a
> > retry, but make_device_exclusive_range() needs to do an invalidation
> > during installation of the entry. To avoid this causing infinite retries
> > the driver ignores specific invalidation events that it knows don't
> > apply, ie. the invalidations that are a result of that driver asking for
> > device exclusive entries.
> 
> OK I think I get it now.. so the driver checks both EXCLUSIVE and owner, if
> all match it skips the notify, otherwise it's treated like all the rest. 
> Thanks.
> 
> However then it's still confusing (as I raised it too in previous comment)
> that we use CLEAR when re-installing the valid pte.  It's merely against
> what CLEAR means.

Oh, thanks. I understand where you are coming from now - the pte is already 
invalid so it ordinarily wouldn't need clearing.

> How about sending EXCLUSIVE for both mark/restore?  Just that when restore
> we notify with owner==NULL telling that no one is owning it anymore so
> driver needs to drop the ownership.  I assume your driver patch does not
> need change too.  Would that be much cleaner than CLEAR?  I bet it also
> makes commenting the new notify easier.
> 
> What do you think?

That seems like a good idea and avoids adding another type. And as you say the 
driver patch shouldn't need changing either (will need to confirm though).
 
> [...]
> 
> > > > +   vma->vm_mm, address,
> > > > min(vma->vm_end,
> > > > +   address + page_size(page)),
> > > > args->owner); + mmu_notifier_invalidate_range_start();
> > > > +
> > > > + while (page_vma_mapped_walk()) {
> > > > + /* Unexpected PMD-mapped THP? */
> > > > + VM_BUG_ON_PAGE(!pvmw.pte, page);
> > > > +
> > > > + if (!pte_present(*pvmw.pte)) {
> > > > + ret = false;
> > > > + page_vma_mapped_walk_done();
> > > > + break;
> > > > + }
> > > > +
> > > > + subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
> > > 
> > > I see that all pages passed in should be done after FOLL_SPLIT_PMD, so
> > > is
> > > this needed?  Or say, should subpage==page always be true?
> > 
> > Not always, in the case of a thp there are small ptes which will get
> > device
> > exclusive entries.
> 
> FOLL_SPLIT_PMD will first split the huge thp into smaller pages, then do
> follow_page_pte() on them (in follow_pmd_mask):
> 
> if (flags & FOLL_SPLIT_PMD) {
> int ret;
> page = pmd_page(*pmd);
> if (is_huge_zero_page(page)) {
> spin_unlock(ptl);
> ret = 0;
> split_huge_pmd(vma, pmd, address);
> if (pmd_trans_unstable(pmd))
> ret = -EBUSY;
> } else {
> spin_unlock(ptl);
> split_huge_pmd(vma, pmd, address);
> ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
> }
> 
> return ret ? ERR_PTR(ret) :
> follow_page_pte(vma, address, pmd, flags,
> 

Re: [Nouveau] [PATCH v9 07/10] mm: Device exclusive memory access

2021-05-26 Thread Alistair Popple
On Thursday, 27 May 2021 5:28:32 AM AEST Peter Xu wrote:
> On Mon, May 24, 2021 at 11:27:22PM +1000, Alistair Popple wrote:
> > Some devices require exclusive write access to shared virtual
> > memory (SVM) ranges to perform atomic operations on that memory. This
> > requires CPU page tables to be updated to deny access whilst atomic
> > operations are occurring.
> > 
> > In order to do this introduce a new swap entry
> > type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for
> > exclusive access by a device all page table mappings for the particular
> > range are replaced with device exclusive swap entries. This causes any
> > CPU access to the page to result in a fault.
> > 
> > Faults are resovled by replacing the faulting entry with the original
> > mapping. This results in MMU notifiers being called which a driver uses
> > to update access permissions such as revoking atomic access. After
> > notifiers have been called the device will no longer have exclusive
> > access to the region.
> > 
> > Walking of the page tables to find the target pages is handled by
> > get_user_pages() rather than a direct page table walk. A direct page
> > table walk similar to what migrate_vma_collect()/unmap() does could also
> > have been utilised. However this resulted in more code similar in
> > functionality to what get_user_pages() provides as page faulting is
> > required to make the PTEs present and to break COW.
> > 
> > Signed-off-by: Alistair Popple 
> > Reviewed-by: Christoph Hellwig 
> > 
> > ---
> > 
> > v9:
> > * Split rename of migrate_pgmap_owner into a separate patch.
> > * Added comments explaining SWP_DEVICE_EXCLUSIVE_* entries.
> > * Renamed try_to_protect{_one} to page_make_device_exclusive{_one} based
> > 
> >   somewhat on a suggestion from Peter Xu. I was never particularly happy
> >   with try_to_protect() as a name so think this is better.
> > 
> > * Removed unneccesary code and reworded some comments based on feedback
> > 
> >   from Peter Xu.
> > 
> > * Removed the VMA walk when restoring PTEs for device-exclusive entries.
> > * Simplified implementation of copy_pte_range() to fail if the page
> > 
> >   cannot be locked. This might lead to occasional fork() failures but at
> >   this stage we don't think that will be an issue.
> > 
> > v8:
> > * Remove device exclusive entries on fork rather than copy them.
> > 
> > v7:
> > * Added Christoph's Reviewed-by.
> > * Minor cosmetic cleanups suggested by Christoph.
> > * Replace mmu_notifier_range_init_migrate/exclusive with
> > 
> >   mmu_notifier_range_init_owner as suggested by Christoph.
> > 
> > * Replaced lock_page() with lock_page_retry() when handling faults.
> > * Restrict to anonymous pages for now.
> > 
> > v6:
> > * Fixed a bisectablity issue due to incorrectly applying the rename of
> > 
> >   migrate_pgmap_owner to the wrong patches for Nouveau and hmm_test.
> > 
> > v5:
> > * Renamed range->migrate_pgmap_owner to range->owner.
> > * Added MMU_NOTIFY_EXCLUSIVE to allow passing of a driver cookie which
> > 
> >   allows notifiers called as a result of make_device_exclusive_range() to
> >   be ignored.
> > 
> > * Added a check to try_to_protect_one() to detect if the pages originally
> > 
> >   returned from get_user_pages() have been unmapped or not.
> > 
> > * Removed check_device_exclusive_range() as it is no longer required with
> > 
> >   the other changes.
> > 
> > * Documentation update.
> > 
> > v4:
> > * Add function to check that mappings are still valid and exclusive.
> > * s/long/unsigned long/ in make_device_exclusive_entry().
> > ---
> > 
> >  Documentation/vm/hmm.rst |  17 
> >  include/linux/mmu_notifier.h |   6 ++
> >  include/linux/rmap.h |   4 +
> >  include/linux/swap.h |   7 +-
> >  include/linux/swapops.h  |  44 -
> >  mm/hmm.c |   5 +
> >  mm/memory.c  | 128 +++-
> >  mm/mprotect.c|   8 ++
> >  mm/page_vma_mapped.c |   9 +-
> >  mm/rmap.c| 186 +++
> >  10 files changed, 405 insertions(+), 9 deletions(-)
> > 
> > diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
> > index 3df79307a797..a14c2938e7af 100644
> > --- a/Documentation/vm/hmm.rst
> > +++ b/Documentation/vm/hmm.rst
> > 
> > @@ -405,

Re: [Nouveau] [PATCH v9 06/10] mm/memory.c: Allow different return codes for copy_nonpresent_pte()

2021-05-26 Thread Alistair Popple
On Thursday, 27 May 2021 5:50:05 AM AEST Peter Xu wrote:
> On Mon, May 24, 2021 at 11:27:21PM +1000, Alistair Popple wrote:
> > Currently if copy_nonpresent_pte() returns a non-zero value it is
> > assumed to be a swap entry which requires further processing outside the
> > loop in copy_pte_range() after dropping locks. This prevents other
> > values being returned to signal conditions such as failure which a
> > subsequent change requires.
> > 
> > Instead make copy_nonpresent_pte() return an error code if further
> > processing is required and read the value for the swap entry in the main
> > loop under the ptl.
> > 
> > Signed-off-by: Alistair Popple 
> > 
> > ---
> > 
> > v9:
> > 
> > New for v9 to allow device exclusive handling to occur in
> > copy_nonpresent_pte().
> > ---
> > 
> >  mm/memory.c | 12 +++-
> >  1 file changed, 7 insertions(+), 5 deletions(-)
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 2fb455c365c2..e061cfa18c11 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -718,7 +718,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct
> > mm_struct *src_mm,> 
> >   if (likely(!non_swap_entry(entry))) {
> >   
> >   if (swap_duplicate(entry) < 0)
> > 
> > - return entry.val;
> > + return -EAGAIN;
> > 
> >   /* make sure dst_mm is on swapoff's mmlist. */
> >   if (unlikely(list_empty(_mm->mmlist))) {
> > 
> > @@ -974,11 +974,13 @@ copy_pte_range(struct vm_area_struct *dst_vma,
> > struct vm_area_struct *src_vma,> 
> >   continue;
> >   
> >   }
> >   if (unlikely(!pte_present(*src_pte))) {
> > 
> > - entry.val = copy_nonpresent_pte(dst_mm, src_mm,
> > - dst_pte, src_pte,
> > - src_vma, addr, rss);
> > - if (entry.val)
> > + ret = copy_nonpresent_pte(dst_mm, src_mm,
> > + dst_pte, src_pte,
> > + src_vma, addr, rss);
> > + if (ret == -EAGAIN) {
> > + entry = pte_to_swp_entry(*src_pte);
> > 
> >   break;
> > 
> > + }
> > 
> >   progress += 8;
> >   continue;
> >   
> >   }
> 
> Note that -EAGAIN was previously used by copy_present_page() for early cow
> use.  Here later although we check entry.val first:
> 
> if (entry.val) {
> if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
> ret = -ENOMEM;
> goto out;
> }
> entry.val = 0;
> } else if (ret) {
> WARN_ON_ONCE(ret != -EAGAIN);
> prealloc = page_copy_prealloc(src_mm, src_vma, addr);
> if (!prealloc)
> return -ENOMEM;
> /* We've captured and resolved the error. Reset, try again.
> */ ret = 0;
> }
> 
> We didn't reset "ret" in entry.val case (maybe we should?). Then in the next
> round of "goto again" if "ret" is unluckily untouched, it could reach the
> 2nd if check, and I think it could cause an unexpected
> page_copy_prealloc().

Thanks, I had considered that but saw "ret" was always set either by 
copy_nonpresent_pte() or copy_present_pte(). However I missed the "unlucky" case 
at the start of the loop:

if (progress >= 32) {
progress = 0;
if (need_resched() ||
spin_needbreak(src_ptl) || 
pin_needbreak(dst_ptl))
break;

Looking at this again though, checking different variables to figure out what 
to do outside the locks and reusing error codes seems error prone. I reused 
-EAGAIN for copy_nonpresent_pte() simply because that seemed the most sensible 
error code, but I don't think that aids readability and it might be better to 
use a unique error code for each case needing extra handling.

So it might be better if I update this patch to:
1) Use unique error codes for each case requiring special handling outside the 
lock.
2) Only check "ret" to determine what to do outside locks (ie. not entry.val)
3) Document these.
4) Always reset ret after handling.

Thoughts?

 - Alistair

> --
> Peter Xu




___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH][next] nouveau/svm: Fix missing failure check on call to make_device_exclusive_range

2021-05-26 Thread Alistair Popple
On Thursday, 27 May 2021 12:04:59 AM AEST Colin King wrote:
> The call to make_device_exclusive_range can potentially fail leaving
> pointer page not initialized that leads to an uninitialized pointer
> read issue. Fix this by adding a check to see if the call failed and
> returning the error code.
> 
> Addresses-Coverity: ("Uninitialized pointer read")
> Fixes: c620bba9828c ("nouveau/svm: implement atomic SVM access")
> Signed-off-by: Colin Ian King 
> ---
>  drivers/gpu/drm/nouveau/nouveau_svm.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c
> b/drivers/gpu/drm/nouveau/nouveau_svm.c index 84726a89e665..b913b4907088
> 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
> @@ -609,8 +609,10 @@ static int nouveau_atomic_range_fault(struct
> nouveau_svmm *svmm,
> 
> notifier_seq = mmu_interval_read_begin(>notifier);
> mmap_read_lock(mm);
> -   make_device_exclusive_range(mm, start, start + PAGE_SIZE,
> -   , drm->dev);
> +   ret = make_device_exclusive_range(mm, start, start +
> PAGE_SIZE, + ,
> drm->dev); +   if (ret < 0)
> +   goto out;

Thanks for spotting this. It fixes the case where get_user_pages() inside 
make_device_exclusive_range() returns an error. However the check needs to 
happen after dropping the mmap_lock below:
> mmap_read_unlock(mm);
> if (!page) {
> ret = -EINVAL;
> --
> 2.31.1
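
In other words, roughly (a sketch only, not the final fix):

	ret = make_device_exclusive_range(mm, start, start + PAGE_SIZE,
					  &page, drm->dev);
	mmap_read_unlock(mm);
	if (ret < 0)
		goto out;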




___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v9 07/10] mm: Device exclusive memory access

2021-05-26 Thread Alistair Popple
On Wednesday, 26 May 2021 5:17:18 PM AEST John Hubbard wrote:
> On 5/25/21 4:51 AM, Balbir Singh wrote:
> ...
> 
> >> How beneficial is this code to nouveau users?  I see that it permits a
> >> part of OpenCL to be implemented, but how useful/important is this in
> >> the real world?
> > 
> > That is a very good question! I've not reviewed the code, but a sample
> > program with the described use case would make things easy to parse.
> > I suspect that is not easy to build at the moment?
> 
> The cover letter says this:
> 
> This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
> which checks that GPU atomic accesses to system memory are atomic. Without
> this series the test fails as there is no way of write-protecting the page
> mapping which results in the device clobbering CPU writes. For reference
> the test is available at https://ozlabs.org/~apopple/opencl_svm_atomics/
> 
> Further testing has been performed by adding support for testing exclusive
> access to the hmm-tests kselftests.
> 
> ...so that seems to cover the "sample program" request, at least.

It is also sufficiently easy to build, assuming of course you have the 
appropriate Mesa/LLVM/OpenCL libraries installed :-)

If you are interested I have some scripts which may help with building Mesa, 
etc. Not that that is especially hard either, it's just that there are a couple 
of different dependencies required.

> > I wonder how we co-ordinate all the work the mm is doing, page migration,
> > reclaim with device exclusive access? Do we have any numbers for the worst
> > case page fault latency when something is marked away for exclusive
> > access?
>
> CPU page fault latency is approximately "terrible", if a page is resident on
> the GPU. We have to spin up a DMA engine on the GPU and have it copy the
> page over the PCIe bus, after all.

Although for clarity that describes latency for CPU faults to device private 
pages which are always resident on the GPU. A CPU fault to a page being 
exclusively accessed will be slightly less terrible as it only requires the 
GPU MMU/TLB mappings to be taken down in much the same way as for any other MMU 
notifier callback, as the page is mapped by the GPU rather than resident there.

> > I presume for now this is anonymous memory only? SWP_DEVICE_EXCLUSIVE
> > would
> 
> Yes, for now.
> 
> > only impact the address space of programs using the GPU. Should the
> > exclusively marked range live in the unreclaimable list and recycled back
> > to active/in-active to account for the fact that
> > 
> > 1. It is not reclaimable and reclaim will only hurt via page faults?
> > 2. It ages the page correctly or at-least allows for that possibility when
> > the> 
> > page is used by the GPU.
> 
> I'm not sure that that is *necessarily* something we can conclude. It
> depends upon access patterns of each program. For example, a "reduction"
> parallel program sends over lots of data to the GPU, and only a tiny bit of
> (reduced!) data comes back to the CPU. In that case, freeing the physical
> page on the CPU is actually the best decision for the OS to make (if the OS
> is sufficiently prescient).
> 
> thanks,




___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v9 07/10] mm: Device exclusive memory access

2021-05-25 Thread Alistair Popple
On Tuesday, 25 May 2021 11:31:17 AM AEST John Hubbard wrote:
> On 5/24/21 3:11 PM, Andrew Morton wrote:
> >> ...
> >> 
> >>   Documentation/vm/hmm.rst |  17 
> >>   include/linux/mmu_notifier.h |   6 ++
> >>   include/linux/rmap.h |   4 +
> >>   include/linux/swap.h |   7 +-
> >>   include/linux/swapops.h  |  44 -
> >>   mm/hmm.c |   5 +
> >>   mm/memory.c  | 128 +++-
> >>   mm/mprotect.c|   8 ++
> >>   mm/page_vma_mapped.c |   9 +-
> >>   mm/rmap.c| 186 +++
> >>   10 files changed, 405 insertions(+), 9 deletions(-)
> > 
> > This is quite a lot of code added to core MM for a single driver.
> > 
> > Is there any expectation that other drivers will use this code?
> 
> Yes! This should work for GPUs (and potentially, other devices) that support
> OpenCL SVM atomic accesses on the device. I haven't looked into how amdgpu
> works in any detail, but that's certainly at the top of the list of likely
> additional callers.
> 
> > Is there a way of reducing the impact (code size, at least) for systems
> > which don't need this code?

All of the code added to mm/rmap.c is specific to implementing this feature 
and is not depended on by other core MM code, so it could be put behind something 
like CONFIG_DEVICE_PRIVATE to reduce the code size impact (I realise now it 
currently isn't but should be).

The impact on compiled code size in mm/memory.c also ends up being minimised 
by the compiler because all of it is of the form:

if (is_device_exclusive_entry(...)) {
[...]
}

Meaning it should get thrown away when the feature is not configured, given 
that is_device_exclusive_entry() is a static inline that always returns false in 
that case.
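
ie. when the feature is not configured the swapops.h stub is simply (sketch):

static inline bool is_device_exclusive_entry(swp_entry_t entry)
{
	return false;
}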

> I'll leave this question to others for the moment, in order to answer
> the "do we need it at all" points.
> 
> > How beneficial is this code to nouveau users?  I see that it permits a
> > part of OpenCL to be implemented, but how useful/important is this in
> > the real world?
> 
> So this is interesting. Right now, OpenCL support in Nouveau is rather new
> and so probably not a huge impact yet. However, we've built up enough
> experience with CUDA and OpenCL to learn that atomic operations, as part of
> the user space programming model, are a super big deal. Atomic operations
> are so useful and important that I'd expect many OpenCL SVM users to be
> uninterested in programming models that lack atomic operations for GPU
> compute programs.
> 
> Again, this doesn't rule out future, non-GPU accelerator devices that may
> come along.
> 
> Atomic ops are just a really important piece of high-end multi-threaded
> programming, it turns out. So this is the beginning of support for an
> important building block for general purpose programming on devices that
> have GPU-like memory models.
> 
> 
> thanks,




___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


[Nouveau] [PATCH v9 10/10] nouveau/svm: Implement atomic SVM access

2021-05-24 Thread Alistair Popple
Some NVIDIA GPUs do not support direct atomic access to system memory
via PCIe. Instead this must be emulated by granting the GPU exclusive
access to the memory. This is achieved by replacing CPU page table
entries with special swap entries that fault on userspace access.

The driver then grants the GPU permission to update the page undergoing
atomic access via the GPU page tables. When CPU access to the page is
required a CPU fault is raised which calls into the device driver via
MMU notifiers to revoke the atomic access. The original page table
entries are then restored allowing CPU access to proceed.

Signed-off-by: Alistair Popple 
Reviewed-by: Ben Skeggs 

---

v9:
* Added Ben's Reviewed-By

v7:
* Removed magic values for fault access levels
* Improved readability of fault comparison code

v4:
* Check that page table entries haven't changed before mapping on the
  device
---
 drivers/gpu/drm/nouveau/include/nvif/if000c.h |   1 +
 drivers/gpu/drm/nouveau/nouveau_svm.c | 126 --
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h |   1 +
 .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c|   6 +
 4 files changed, 123 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/include/nvif/if000c.h 
b/drivers/gpu/drm/nouveau/include/nvif/if000c.h
index d6dd40f21eed..9c7ff56831c5 100644
--- a/drivers/gpu/drm/nouveau/include/nvif/if000c.h
+++ b/drivers/gpu/drm/nouveau/include/nvif/if000c.h
@@ -77,6 +77,7 @@ struct nvif_vmm_pfnmap_v0 {
 #define NVIF_VMM_PFNMAP_V0_APER   0x00f0ULL
 #define NVIF_VMM_PFNMAP_V0_HOST   0xULL
 #define NVIF_VMM_PFNMAP_V0_VRAM   0x0010ULL
+#define NVIF_VMM_PFNMAP_V0_A 0x0004ULL
 #define NVIF_VMM_PFNMAP_V0_W  0x0002ULL
 #define NVIF_VMM_PFNMAP_V0_V  0x0001ULL
 #define NVIF_VMM_PFNMAP_V0_NONE   0xULL
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index a195e48c9aee..81526d65b4e2 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct nouveau_svm {
struct nouveau_drm *drm;
@@ -67,6 +68,11 @@ struct nouveau_svm {
} buffer[1];
 };
 
+#define FAULT_ACCESS_READ 0
+#define FAULT_ACCESS_WRITE 1
+#define FAULT_ACCESS_ATOMIC 2
+#define FAULT_ACCESS_PREFETCH 3
+
 #define SVM_DBG(s,f,a...) NV_DEBUG((s)->drm, "svm: "f"\n", ##a)
 #define SVM_ERR(s,f,a...) NV_WARN((s)->drm, "svm: "f"\n", ##a)
 
@@ -411,6 +417,24 @@ nouveau_svm_fault_cancel_fault(struct nouveau_svm *svm,
  fault->client);
 }
 
+static int
+nouveau_svm_fault_priority(u8 fault)
+{
+   switch (fault) {
+   case FAULT_ACCESS_PREFETCH:
+   return 0;
+   case FAULT_ACCESS_READ:
+   return 1;
+   case FAULT_ACCESS_WRITE:
+   return 2;
+   case FAULT_ACCESS_ATOMIC:
+   return 3;
+   default:
+   WARN_ON_ONCE(1);
+   return -1;
+   }
+}
+
 static int
 nouveau_svm_fault_cmp(const void *a, const void *b)
 {
@@ -421,9 +445,8 @@ nouveau_svm_fault_cmp(const void *a, const void *b)
return ret;
if ((ret = (s64)fa->addr - fb->addr))
return ret;
-   /*XXX: atomic? */
-   return (fa->access == 0 || fa->access == 3) -
-  (fb->access == 0 || fb->access == 3);
+   return nouveau_svm_fault_priority(fa->access) -
+   nouveau_svm_fault_priority(fb->access);
 }
 
 static void
@@ -487,6 +510,10 @@ static bool nouveau_svm_range_invalidate(struct 
mmu_interval_notifier *mni,
struct svm_notifier *sn =
container_of(mni, struct svm_notifier, notifier);
 
+   if (range->event == MMU_NOTIFY_EXCLUSIVE &&
+   range->owner == sn->svmm->vmm->cli->drm->dev)
+   return true;
+
/*
 * serializes the update to mni->invalidate_seq done by caller and
 * prevents invalidation of the PTE from progressing while HW is being
@@ -555,6 +582,71 @@ static void nouveau_hmm_convert_pfn(struct nouveau_drm 
*drm,
args->p.phys[0] |= NVIF_VMM_PFNMAP_V0_W;
 }
 
+static int nouveau_atomic_range_fault(struct nouveau_svmm *svmm,
+  struct nouveau_drm *drm,
+  struct nouveau_pfnmap_args *args, u32 size,
+  struct svm_notifier *notifier)
+{
+   unsigned long timeout =
+   jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+   struct mm_struct *mm = svmm->notifier.mm;
+   struct page *page;
+   unsigned long start = args

[Nouveau] [PATCH v9 09/10] nouveau/svm: Refactor nouveau_range_fault

2021-05-24 Thread Alistair Popple
Call mmu_interval_notifier_insert() as part of nouveau_range_fault().
This doesn't introduce any functional change but makes it easier for a
subsequent patch to alter the behaviour of nouveau_range_fault() to
support GPU atomic operations.

Signed-off-by: Alistair Popple 
Reviewed-by: Ben Skeggs 

---

v9:

* Added Ben's Reviewed-By (thanks!)
---
 drivers/gpu/drm/nouveau/nouveau_svm.c | 34 ---
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 94f841026c3b..a195e48c9aee 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -567,18 +567,27 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
unsigned long hmm_pfns[1];
struct hmm_range range = {
.notifier = >notifier,
-   .start = notifier->notifier.interval_tree.start,
-   .end = notifier->notifier.interval_tree.last + 1,
.default_flags = hmm_flags,
.hmm_pfns = hmm_pfns,
.dev_private_owner = drm->dev,
};
-   struct mm_struct *mm = notifier->notifier.mm;
+   struct mm_struct *mm = svmm->notifier.mm;
int ret;
 
+   ret = mmu_interval_notifier_insert(>notifier, mm,
+   args->p.addr, args->p.size,
+   _svm_mni_ops);
+   if (ret)
+   return ret;
+
+   range.start = notifier->notifier.interval_tree.start;
+   range.end = notifier->notifier.interval_tree.last + 1;
+
while (true) {
-   if (time_after(jiffies, timeout))
-   return -EBUSY;
+   if (time_after(jiffies, timeout)) {
+   ret = -EBUSY;
+   goto out;
+   }
 
range.notifier_seq = mmu_interval_read_begin(range.notifier);
mmap_read_lock(mm);
@@ -587,7 +596,7 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
if (ret) {
if (ret == -EBUSY)
continue;
-   return ret;
+   goto out;
}
 
mutex_lock(>mutex);
@@ -606,6 +615,9 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
svmm->vmm->vmm.object.client->super = false;
mutex_unlock(>mutex);
 
+out:
+   mmu_interval_notifier_remove(>notifier);
+
return ret;
 }
 
@@ -727,14 +739,8 @@ nouveau_svm_fault(struct nvif_notify *notify)
}
 
notifier.svmm = svmm;
-   ret = mmu_interval_notifier_insert(, mm,
-  args.i.p.addr, args.i.p.size,
-  _svm_mni_ops);
-   if (!ret) {
-   ret = nouveau_range_fault(svmm, svm->drm, ,
-   sizeof(args), hmm_flags, );
-   mmu_interval_notifier_remove();
-   }
+   ret = nouveau_range_fault(svmm, svm->drm, ,
+   sizeof(args), hmm_flags, );
mmput(mm);
 
limit = args.i.p.addr + args.i.p.size;
-- 
2.20.1

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


[Nouveau] [PATCH v9 07/10] mm: Device exclusive memory access

2021-05-24 Thread Alistair Popple
Some devices require exclusive write access to shared virtual
memory (SVM) ranges to perform atomic operations on that memory. This
requires CPU page tables to be updated to deny access whilst atomic
operations are occurring.

In order to do this introduce a new swap entry
type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for
exclusive access by a device all page table mappings for the particular
range are replaced with device exclusive swap entries. This causes any
CPU access to the page to result in a fault.

Faults are resolved by replacing the faulting entry with the original
mapping. This results in MMU notifiers being called which a driver uses
to update access permissions such as revoking atomic access. After
notifiers have been called the device will no longer have exclusive
access to the region.

Walking of the page tables to find the target pages is handled by
get_user_pages() rather than a direct page table walk. A direct page
table walk similar to what migrate_vma_collect()/unmap() does could also
have been utilised. However, this resulted in more code duplicating
functionality that get_user_pages() already provides, as page faulting is
required to make the PTEs present and to break COW anyway.
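
For context, a minimal sketch of how a driver might use the new interface
(illustrative only, not part of this patch; the my_dev_* naming is made up
and error handling is trimmed):

#include <linux/mm.h>
#include <linux/rmap.h>

/*
 * Sketch: grant the device exclusive access to one page so it can perform
 * atomic operations on it. Assumes the caller holds a reference on @mm and
 * that @addr is page aligned.
 */
static int my_dev_map_atomic(struct mm_struct *mm, unsigned long addr,
			     void *owner)
{
	struct page *page = NULL;
	int found;

	mmap_read_lock(mm);
	/* Replaces the CPU PTE with a device exclusive swap entry. */
	found = make_device_exclusive_range(mm, addr, addr + PAGE_SIZE,
					    &page, owner);
	mmap_read_unlock(mm);
	if (found != 1 || !page)
		return -EBUSY;

	/* ... program the device page tables with an atomic mapping ... */

	/*
	 * Exclusive access is only guaranteed while the page lock and the
	 * page reference returned above are held.
	 */
	unlock_page(page);
	put_page(page);
	return 0;
}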

Signed-off-by: Alistair Popple 
Reviewed-by: Christoph Hellwig 

---

v9:
* Split rename of migrate_pgmap_owner into a separate patch.
* Added comments explaining SWP_DEVICE_EXCLUSIVE_* entries.
* Renamed try_to_protect{_one} to page_make_device_exclusive{_one} based
  somewhat on a suggestion from Peter Xu. I was never particularly happy
  with try_to_protect() as a name so think this is better.
* Removed unnecessary code and reworded some comments based on feedback
  from Peter Xu.
* Removed the VMA walk when restoring PTEs for device-exclusive entries.
* Simplified implementation of copy_pte_range() to fail if the page
  cannot be locked. This might lead to occasional fork() failures but at
  this stage we don't think that will be an issue.

v8:
* Remove device exclusive entries on fork rather than copy them.

v7:
* Added Christoph's Reviewed-by.
* Minor cosmetic cleanups suggested by Christoph.
* Replace mmu_notifier_range_init_migrate/exclusive with
  mmu_notifier_range_init_owner as suggested by Christoph.
* Replaced lock_page() with lock_page_retry() when handling faults.
* Restrict to anonymous pages for now.

v6:
* Fixed a bisectability issue due to incorrectly applying the rename of
  migrate_pgmap_owner to the wrong patches for Nouveau and hmm_test.

v5:
* Renamed range->migrate_pgmap_owner to range->owner.
* Added MMU_NOTIFY_EXCLUSIVE to allow passing of a driver cookie which
  allows notifiers called as a result of make_device_exclusive_range() to
  be ignored.
* Added a check to try_to_protect_one() to detect if the pages originally
  returned from get_user_pages() have been unmapped or not.
* Removed check_device_exclusive_range() as it is no longer required with
  the other changes.
* Documentation update.

v4:
* Add function to check that mappings are still valid and exclusive.
* s/long/unsigned long/ in make_device_exclusive_entry().
---
 Documentation/vm/hmm.rst |  17 
 include/linux/mmu_notifier.h |   6 ++
 include/linux/rmap.h |   4 +
 include/linux/swap.h |   7 +-
 include/linux/swapops.h  |  44 -
 mm/hmm.c |   5 +
 mm/memory.c  | 128 +++-
 mm/mprotect.c|   8 ++
 mm/page_vma_mapped.c |   9 +-
 mm/rmap.c| 186 +++
 10 files changed, 405 insertions(+), 9 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 3df79307a797..a14c2938e7af 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -405,6 +405,23 @@ between device driver specific code and shared common code:
 
The lock can now be released.
 
+Exclusive access memory
+===
+
+Some devices have features such as atomic PTE bits that can be used to 
implement
+atomic access to system memory. To support atomic operations to a shared 
virtual
+memory page such a device needs access to that page which is exclusive of any
+userspace access from the CPU. The ``make_device_exclusive_range()`` function
+can be used to make a memory range inaccessible from userspace.
+
+This replaces all mappings for pages in the given range with special swap
+entries. Any attempt to access the swap entry results in a fault which is
+resolved by replacing the entry with the original mapping. A driver gets
+notified that the mapping has been changed by MMU notifiers, after which point
+it will no longer have exclusive access to the page. Exclusive access is
+guaranteed to last until the driver drops the page lock and page reference, at
+which point any CPU faults on the page may proceed as described.
+
 Memory cgroup (memcg) and rss accounting
 
 
diff --git a/include

[Nouveau] [PATCH v9 08/10] mm: Selftests for exclusive device memory

2021-05-24 Thread Alistair Popple
Adds some selftests for exclusive device memory.

Signed-off-by: Alistair Popple 
Acked-by: Jason Gunthorpe 
Tested-by: Ralph Campbell 
Reviewed-by: Ralph Campbell 
---
 lib/test_hmm.c | 124 +++
 lib/test_hmm_uapi.h|   2 +
 tools/testing/selftests/vm/hmm-tests.c | 158 +
 3 files changed, 284 insertions(+)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 5c9f5a020c1d..305a9d9e2b4c 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "test_hmm_uapi.h"
 
@@ -46,6 +47,7 @@ struct dmirror_bounce {
unsigned long   cpages;
 };
 
+#define DPT_XA_TAG_ATOMIC 1UL
 #define DPT_XA_TAG_WRITE 3UL
 
 /*
@@ -619,6 +621,54 @@ static void dmirror_migrate_alloc_and_copy(struct 
migrate_vma *args,
}
 }
 
+static int dmirror_check_atomic(struct dmirror *dmirror, unsigned long start,
+unsigned long end)
+{
+   unsigned long pfn;
+
+   for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++) {
+   void *entry;
+   struct page *page;
+
+   entry = xa_load(&dmirror->pt, pfn);
+   page = xa_untag_pointer(entry);
+   if (xa_pointer_tag(entry) == DPT_XA_TAG_ATOMIC)
+   return -EPERM;
+   }
+
+   return 0;
+}
+
+static int dmirror_atomic_map(unsigned long start, unsigned long end,
+ struct page **pages, struct dmirror *dmirror)
+{
+   unsigned long pfn, mapped = 0;
+   int i;
+
+   /* Map the migrated pages into the device's page tables. */
+   mutex_lock(&dmirror->mutex);
+
+   for (i = 0, pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); 
pfn++, i++) {
+   void *entry;
+
+   if (!pages[i])
+   continue;
+
+   entry = pages[i];
+   entry = xa_tag_pointer(entry, DPT_XA_TAG_ATOMIC);
+   entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
+   if (xa_is_err(entry)) {
+   mutex_unlock(&dmirror->mutex);
+   return xa_err(entry);
+   }
+
+   mapped++;
+   }
+
+   mutex_unlock(&dmirror->mutex);
+   return mapped;
+}
+
 static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
struct dmirror *dmirror)
 {
@@ -661,6 +711,71 @@ static int dmirror_migrate_finalize_and_map(struct 
migrate_vma *args,
return 0;
 }
 
+static int dmirror_exclusive(struct dmirror *dmirror,
+struct hmm_dmirror_cmd *cmd)
+{
+   unsigned long start, end, addr;
+   unsigned long size = cmd->npages << PAGE_SHIFT;
+   struct mm_struct *mm = dmirror->notifier.mm;
+   struct page *pages[64];
+   struct dmirror_bounce bounce;
+   unsigned long next;
+   int ret;
+
+   start = cmd->addr;
+   end = start + size;
+   if (end < start)
+   return -EINVAL;
+
+   /* Since the mm is for the mirrored process, get a reference first. */
+   if (!mmget_not_zero(mm))
+   return -EINVAL;
+
+   mmap_read_lock(mm);
+   for (addr = start; addr < end; addr = next) {
+   int i, mapped;
+
+   if (end < addr + (ARRAY_SIZE(pages) << PAGE_SHIFT))
+   next = end;
+   else
+   next = addr + (ARRAY_SIZE(pages) << PAGE_SHIFT);
+
+   ret = make_device_exclusive_range(mm, addr, next, pages, NULL);
+   mapped = dmirror_atomic_map(addr, next, pages, dmirror);
+   for (i = 0; i < ret; i++) {
+   if (pages[i]) {
+   unlock_page(pages[i]);
+   put_page(pages[i]);
+   }
+   }
+
+   if (addr + (mapped << PAGE_SHIFT) < next) {
+   mmap_read_unlock(mm);
+   mmput(mm);
+   return -EBUSY;
+   }
+   }
+   mmap_read_unlock(mm);
+   mmput(mm);
+
+   /* Return the migrated data for verification. */
+   ret = dmirror_bounce_init(&bounce, start, size);
+   if (ret)
+   return ret;
+   mutex_lock(&dmirror->mutex);
+   ret = dmirror_do_read(dmirror, start, end, &bounce);
+   mutex_unlock(&dmirror->mutex);
+   if (ret == 0) {
+   if (copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr,
+bounce.size))
+   ret = -EFAULT;
+   }
+
+   cmd->cpages = bounce.cpages;
+   dmirror_bounce_fini(&bounce);
+   return ret;
+}
+
 static int dmirror_migrate(struct dmirror *dmirror,
   struct hmm_dmirror_cmd *cmd)
 {
@@ -949,6 +1064,15 @@ static long dmi

[Nouveau] [PATCH v9 04/10] mm/rmap: Split migration into its own function

2021-05-24 Thread Alistair Popple
Migration is currently implemented as a mode of operation for
try_to_unmap_one() generally specified by passing the TTU_MIGRATION flag
or in the case of splitting a huge anonymous page TTU_SPLIT_FREEZE.

However it does not have much in common with the rest of the unmap
functionality of try_to_unmap_one() and thus splitting it into a
separate function reduces the complexity of try_to_unmap_one() making it
more readable.

Several simplifications can also be made in try_to_migrate_one() based
on the following observations:

 - All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK.
 - No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON.
 - No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH.

TTU_SPLIT_FREEZE is a special case of migration used when splitting an
anonymous page. This is most easily dealt with by calling the correct
function from unmap_page() in mm/huge_memory.c  - either
try_to_migrate() for PageAnon or try_to_unmap().

Signed-off-by: Alistair Popple 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Ralph Campbell 

---

v5:
* Added comments about how PMD splitting works for migration vs.
  unmapping
* Tightened up the flag check in try_to_migrate() to be explicit about
  which TTU_XXX flags are supported.
---
 include/linux/rmap.h |   4 +-
 mm/huge_memory.c |  15 +-
 mm/migrate.c |   9 +-
 mm/rmap.c| 358 ---
 4 files changed, 280 insertions(+), 106 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 38a746787c2f..0e25d829f742 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -86,8 +86,6 @@ struct anon_vma_chain {
 };
 
 enum ttu_flags {
-   TTU_MIGRATION   = 0x1,  /* migration mode */
-
TTU_SPLIT_HUGE_PMD  = 0x4,  /* split huge PMD if any */
TTU_IGNORE_MLOCK= 0x8,  /* ignore mlock */
TTU_IGNORE_HWPOISON = 0x20, /* corrupted page is recoverable */
@@ -96,7 +94,6 @@ enum ttu_flags {
 * do a final flush if necessary */
TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock:
 * caller holds it */
-   TTU_SPLIT_FREEZE= 0x100,/* freeze pte under 
splitting thp */
 };
 
 #ifdef CONFIG_MMU
@@ -193,6 +190,7 @@ static inline void page_dup_rmap(struct page *page, bool 
compound)
 int page_referenced(struct page *, int is_locked,
struct mem_cgroup *memcg, unsigned long *vm_flags);
 
+bool try_to_migrate(struct page *page, enum ttu_flags flags);
 bool try_to_unmap(struct page *, enum ttu_flags flags);
 
 /* Avoid racy checks */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2ec6dab72217..6dddc75b89ee 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2345,16 +2345,21 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
 
 static void unmap_page(struct page *page)
 {
-   enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK |
-   TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
+   enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
bool unmap_success;
 
VM_BUG_ON_PAGE(!PageHead(page), page);
 
if (PageAnon(page))
-   ttu_flags |= TTU_SPLIT_FREEZE;
-
-   unmap_success = try_to_unmap(page, ttu_flags);
+   unmap_success = try_to_migrate(page, ttu_flags);
+   else
+   /*
+* Don't install migration entries for file backed pages. This
+* helps handle cases when i_size is in the middle of the page
+* as there is no need to unmap pages beyond i_size manually.
+*/
+   unmap_success = try_to_unmap(page, ttu_flags |
+   TTU_IGNORE_MLOCK);
VM_BUG_ON_PAGE(!unmap_success, page);
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 930de919b1f2..05740f816bc4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1103,7 +1103,7 @@ static int __unmap_and_move(struct page *page, struct 
page *newpage,
/* Establish migration ptes */
VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
page);
-   try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK);
+   try_to_migrate(page, 0);
page_was_mapped = 1;
}
 
@@ -1305,7 +1305,7 @@ static int unmap_and_move_huge_page(new_page_t 
get_new_page,
 
if (page_mapped(hpage)) {
bool mapping_locked = false;
-   enum ttu_flags ttu = TTU_MIGRATION|TTU_IGNORE_MLOCK;
+   enum ttu_flags ttu = 0;
 
if (!PageAnon(hpage)) {
/*
@@ -1322,7 +1322,7 @@ static int unmap_and_move_huge_page(new_page_t 
get_new_page,
ttu |= TTU_RMAP_LOCKED;
}
 
-   try_to_unmap(hpage, ttu);
+   try_to_migrate(hpage, ttu);
   

[Nouveau] [PATCH v9 06/10] mm/memory.c: Allow different return codes for copy_nonpresent_pte()

2021-05-24 Thread Alistair Popple
Currently if copy_nonpresent_pte() returns a non-zero value it is
assumed to be a swap entry which requires further processing outside the
loop in copy_pte_range() after dropping locks. This prevents other
values being returned to signal conditions such as failure which a
subsequent change requires.

Instead make copy_nonpresent_pte() return an error code if further
processing is required and read the value for the swap entry in the main
loop under the ptl.

Signed-off-by: Alistair Popple 

---

v9:

New for v9 to allow device exclusive handling to occur in
copy_nonpresent_pte().
---
 mm/memory.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2fb455c365c2..e061cfa18c11 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -718,7 +718,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
 
if (likely(!non_swap_entry(entry))) {
if (swap_duplicate(entry) < 0)
-   return entry.val;
+   return -EAGAIN;
 
/* make sure dst_mm is on swapoff's mmlist. */
if (unlikely(list_empty(&dst_mm->mmlist))) {
@@ -974,11 +974,13 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma,
continue;
}
if (unlikely(!pte_present(*src_pte))) {
-   entry.val = copy_nonpresent_pte(dst_mm, src_mm,
-   dst_pte, src_pte,
-   src_vma, addr, rss);
-   if (entry.val)
+   ret = copy_nonpresent_pte(dst_mm, src_mm,
+   dst_pte, src_pte,
+   src_vma, addr, rss);
+   if (ret == -EAGAIN) {
+   entry = pte_to_swp_entry(*src_pte);
break;
+   }
progress += 8;
continue;
}
-- 
2.20.1



[Nouveau] [PATCH v9 05/10] mm: Rename migrate_pgmap_owner

2021-05-24 Thread Alistair Popple
MMU notifier ranges have a migrate_pgmap_owner field which is used by
drivers to store a pointer. This is subsequently used by the driver
callback to filter MMU_NOTIFY_MIGRATE events. Other notifier event types
can also benefit from this filtering, so rename the
'migrate_pgmap_owner' field to 'owner' and create a new notifier
initialisation function to initialise this field.

Signed-off-by: Alistair Popple 
Suggested-by: Peter Xu 

---

v9:

Previously part of the next patch in the series ('mm: Device exclusive
memory access') but now split out as a separate change as suggested by
Peter Xu.
---
 Documentation/vm/hmm.rst  |  2 +-
 drivers/gpu/drm/nouveau/nouveau_svm.c |  2 +-
 include/linux/mmu_notifier.h  | 20 ++--
 lib/test_hmm.c|  2 +-
 mm/migrate.c  | 10 +-
 5 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 09e28507f5b2..3df79307a797 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -332,7 +332,7 @@ between device driver specific code and shared common code:
walks to fill in the ``args->src`` array with PFNs to be migrated.
The ``invalidate_range_start()`` callback is passed a
``struct mmu_notifier_range`` with the ``event`` field set to
-   ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to
+   ``MMU_NOTIFY_MIGRATE`` and the ``owner`` field set to
the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This is
allows the device driver to skip the invalidation callback and only
invalidate device private MMU mappings that are actually migrating.
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index f18bd53da052..94f841026c3b 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -265,7 +265,7 @@ nouveau_svmm_invalidate_range_start(struct mmu_notifier *mn,
 * the invalidation is handled as part of the migration process.
 */
if (update->event == MMU_NOTIFY_MIGRATE &&
-   update->migrate_pgmap_owner == svmm->vmm->cli->drm->dev)
+   update->owner == svmm->vmm->cli->drm->dev)
goto out;
 
if (limit > svmm->unmanaged.start && start < svmm->unmanaged.limit) {
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 1a6a9eb6d3fa..8e428eb813b8 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -41,7 +41,7 @@ struct mmu_interval_notifier;
  *
  * @MMU_NOTIFY_MIGRATE: used during migrate_vma_collect() invalidate to signal
  * a device driver to possibly ignore the invalidation if the
- * migrate_pgmap_owner field matches the driver's device private pgmap owner.
+ * owner field matches the driver's device private pgmap owner.
  */
 enum mmu_notifier_event {
MMU_NOTIFY_UNMAP = 0,
@@ -269,7 +269,7 @@ struct mmu_notifier_range {
unsigned long end;
unsigned flags;
enum mmu_notifier_event event;
-   void *migrate_pgmap_owner;
+   void *owner;
 };
 
 static inline int mm_has_notifiers(struct mm_struct *mm)
@@ -521,14 +521,14 @@ static inline void mmu_notifier_range_init(struct 
mmu_notifier_range *range,
range->flags = flags;
 }
 
-static inline void mmu_notifier_range_init_migrate(
-   struct mmu_notifier_range *range, unsigned int flags,
+static inline void mmu_notifier_range_init_owner(
+   struct mmu_notifier_range *range,
+   enum mmu_notifier_event event, unsigned int flags,
struct vm_area_struct *vma, struct mm_struct *mm,
-   unsigned long start, unsigned long end, void *pgmap)
+   unsigned long start, unsigned long end, void *owner)
 {
-   mmu_notifier_range_init(range, MMU_NOTIFY_MIGRATE, flags, vma, mm,
-   start, end);
-   range->migrate_pgmap_owner = pgmap;
+   mmu_notifier_range_init(range, event, flags, vma, mm, start, end);
+   range->owner = owner;
 }
 
 #define ptep_clear_flush_young_notify(__vma, __address, __ptep)
\
@@ -655,8 +655,8 @@ static inline void _mmu_notifier_range_init(struct 
mmu_notifier_range *range,
 
 #define mmu_notifier_range_init(range,event,flags,vma,mm,start,end)  \
_mmu_notifier_range_init(range, start, end)
-#define mmu_notifier_range_init_migrate(range, flags, vma, mm, start, end, \
-   pgmap) \
+#define mmu_notifier_range_init_owner(range, event, flags, vma, mm, start, \
+   end, owner) \
_mmu_notifier_range_init(range, start, end)
 
 static inline bool
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 80a78877bd93..5c9f5a020c1d 100644
--- a/lib/test_hmm.c
+++ 

[Nouveau] [PATCH v9 03/10] mm/rmap: Split try_to_munlock from try_to_unmap

2021-05-24 Thread Alistair Popple
The behaviour of try_to_unmap_one() is difficult to follow because it
performs different operations based on a fairly large set of flags used
in different combinations.

TTU_MUNLOCK is one such flag. However it is exclusively used by
try_to_munlock() which specifies no other flags. Therefore rather than
overload try_to_unmap_one() with unrelated behaviour, split this out into
its own function and remove the flag.

Signed-off-by: Alistair Popple 
Reviewed-by: Ralph Campbell 
Reviewed-by: Christoph Hellwig 

---

v9:
* Improved comments

v8:
* Renamed try_to_munlock to page_mlock to better reflect what the
  function actually does.
* Removed the TODO from the documentation that this patch addresses.

v7:
* Added Christoph's Reviewed-by

v4:
* Removed redundant check for VM_LOCKED
---
 Documentation/vm/unevictable-lru.rst | 33 ++-
 include/linux/rmap.h |  3 +-
 mm/mlock.c   | 10 ++---
 mm/rmap.c| 61 
 4 files changed, 63 insertions(+), 44 deletions(-)

diff --git a/Documentation/vm/unevictable-lru.rst 
b/Documentation/vm/unevictable-lru.rst
index 0e1490524f53..eae3af17f2d9 100644
--- a/Documentation/vm/unevictable-lru.rst
+++ b/Documentation/vm/unevictable-lru.rst
@@ -389,14 +389,14 @@ mlocked, munlock_vma_page() updates that zone statistics 
for the number of
 mlocked pages.  Note, however, that at this point we haven't checked whether
 the page is mapped by other VM_LOCKED VMAs.
 
-We can't call try_to_munlock(), the function that walks the reverse map to
+We can't call page_mlock(), the function that walks the reverse map to
 check for other VM_LOCKED VMAs, without first isolating the page from the LRU.
-try_to_munlock() is a variant of try_to_unmap() and thus requires that the page
+page_mlock() is a variant of try_to_unmap() and thus requires that the page
 not be on an LRU list [more on these below].  However, the call to
-isolate_lru_page() could fail, in which case we couldn't try_to_munlock().  So,
+isolate_lru_page() could fail, in which case we can't call page_mlock().  So,
 we go ahead and clear PG_mlocked up front, as this might be the only chance we
-have.  If we can successfully isolate the page, we go ahead and
-try_to_munlock(), which will restore the PG_mlocked flag and update the zone
+have.  If we can successfully isolate the page, we go ahead and call
+page_mlock(), which will restore the PG_mlocked flag and update the zone
 page statistics if it finds another VMA holding the page mlocked.  If we fail
 to isolate the page, we'll have left a potentially mlocked page on the LRU.
 This is fine, because we'll catch it later if and if vmscan tries to reclaim
@@ -545,31 +545,24 @@ munlock or munmap system calls, mm teardown 
(munlock_vma_pages_all), reclaim,
 holepunching, and truncation of file pages and their anonymous COWed pages.
 
 
-try_to_munlock() Reverse Map Scan
+page_mlock() Reverse Map Scan
 -
 
-.. warning::
-   [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
-   page_referenced() reverse map walker.
-
 When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call
 Handling ` above] tries to munlock a
 page, it needs to determine whether or not the page is mapped by any
 VM_LOCKED VMA without actually attempting to unmap all PTEs from the
 page.  For this purpose, the unevictable/mlock infrastructure
-introduced a variant of try_to_unmap() called try_to_munlock().
+introduced a variant of try_to_unmap() called page_mlock().
 
-try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
-mapped file and KSM pages with a flag argument specifying unlock versus unmap
-processing.  Again, these functions walk the respective reverse maps looking
-for VM_LOCKED VMAs.  When such a VMA is found, as in the try_to_unmap() case,
-the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK.  This
-undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page.
+page_mlock() walks the respective reverse maps looking for VM_LOCKED VMAs. When
+such a VMA is found the page is mlocked via mlock_vma_page(). This undoes the
+pre-clearing of the page's PG_mlocked done by munlock_vma_page.
 
-Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
+Note that page_mlock()'s reverse map walk must visit every VMA in a page's
 reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
 However, the scan can terminate when it encounters a VM_LOCKED VMA.
-Although try_to_munlock() might be called a great many times when munlocking a
+Although page_mlock() might be called a great many times when munlocking a
 large region or tearing down a large address space that has been mlocked via
 mlockall(), overall this is a fairly rare event.
 
@@ -602,7 +595,7 @@ inactive lists to the appropriate node's unevictable list.
 shrink_inactive_list() should only see

[Nouveau] [PATCH v9 02/10] mm/swapops: Rework swap entry manipulation code

2021-05-24 Thread Alistair Popple
Both migration and device private pages use special swap entries that
are manipulated by a range of inline functions. The arguments to these
are somewhat inconsistent, so rework them to remove flag type arguments
and to make the arguments similar for both read and write entry
creation.

Signed-off-by: Alistair Popple 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ralph Campbell 
---
 include/linux/swapops.h | 56 ++---
 mm/debug_vm_pgtable.c   | 12 -
 mm/hmm.c|  2 +-
 mm/huge_memory.c| 26 +--
 mm/hugetlb.c| 10 +---
 mm/memory.c | 10 +---
 mm/migrate.c| 26 ++-
 mm/mprotect.c   | 10 +---
 mm/rmap.c   | 10 +---
 9 files changed, 100 insertions(+), 62 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 139be8235ad2..4dfd807ae52a 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -100,35 +100,35 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
 }
 
 #if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
-static inline swp_entry_t make_device_private_entry(struct page *page, bool 
write)
+static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
-   return swp_entry(write ? SWP_DEVICE_WRITE : SWP_DEVICE_READ,
-page_to_pfn(page));
+   return swp_entry(SWP_DEVICE_READ, offset);
 }
 
-static inline bool is_device_private_entry(swp_entry_t entry)
+static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset)
 {
-   int type = swp_type(entry);
-   return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
+   return swp_entry(SWP_DEVICE_WRITE, offset);
 }
 
-static inline void make_device_private_entry_read(swp_entry_t *entry)
+static inline bool is_device_private_entry(swp_entry_t entry)
 {
-   *entry = swp_entry(SWP_DEVICE_READ, swp_offset(*entry));
+   int type = swp_type(entry);
+   return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
 }
 
-static inline bool is_write_device_private_entry(swp_entry_t entry)
+static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
 }
 #else /* CONFIG_DEVICE_PRIVATE */
-static inline swp_entry_t make_device_private_entry(struct page *page, bool 
write)
+static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
return swp_entry(0, 0);
 }
 
-static inline void make_device_private_entry_read(swp_entry_t *entry)
+static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset)
 {
+   return swp_entry(0, 0);
 }
 
 static inline bool is_device_private_entry(swp_entry_t entry)
@@ -136,35 +136,32 @@ static inline bool is_device_private_entry(swp_entry_t 
entry)
return false;
 }
 
-static inline bool is_write_device_private_entry(swp_entry_t entry)
+static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
return false;
 }
 #endif /* CONFIG_DEVICE_PRIVATE */
 
 #ifdef CONFIG_MIGRATION
-static inline swp_entry_t make_migration_entry(struct page *page, int write)
-{
-   BUG_ON(!PageLocked(compound_head(page)));
-
-   return swp_entry(write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ,
-   page_to_pfn(page));
-}
-
 static inline int is_migration_entry(swp_entry_t entry)
 {
return unlikely(swp_type(entry) == SWP_MIGRATION_READ ||
swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
-static inline int is_write_migration_entry(swp_entry_t entry)
+static inline int is_writable_migration_entry(swp_entry_t entry)
 {
return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
-static inline void make_migration_entry_read(swp_entry_t *entry)
+static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
 {
-   *entry = swp_entry(SWP_MIGRATION_READ, swp_offset(*entry));
+   return swp_entry(SWP_MIGRATION_READ, offset);
+}
+
+static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
+{
+   return swp_entry(SWP_MIGRATION_WRITE, offset);
 }
 
 extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
@@ -174,21 +171,28 @@ extern void migration_entry_wait(struct mm_struct *mm, 
pmd_t *pmd,
 extern void migration_entry_wait_huge(struct vm_area_struct *vma,
struct mm_struct *mm, pte_t *pte);
 #else
+static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
+{
+   return swp_entry(0, 0);
+}
+
+static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
+{
+   return swp_entry(0, 0);
+}
 
-#define make_migration_entry(page, write) swp_entry(0, 0)
 static inline int is_migration_entry(swp_entry_t swp)
 {
return 0;
 }
 
-static inline void make_migration_entry_read(swp_entry_t *entryp) { }
 static inline void __migration_entry_wait(struct mm_struct *mm

[Nouveau] [PATCH v9 01/10] mm: Remove special swap entry functions

2021-05-24 Thread Alistair Popple
Remove multiple similar inline functions for dealing with different
types of special swap entries.

Both migration and device private swap entries use the swap offset to
store a pfn. Instead of multiple inline functions to obtain a struct
page for each swap entry type use a common function
pfn_swap_entry_to_page(). Also open-code the various entry_to_pfn()
functions, as this results in shorter code that is easier to understand.

Signed-off-by: Alistair Popple 
Reviewed-by: Ralph Campbell 
Reviewed-by: Christoph Hellwig 

---

v9:
* Rebased on v5.13-rc2

v8:
* No changes

v7:
* Reworded commit message to include pfn_swap_entry_to_page()
* Added Christoph's Reviewed-by

v6:
* Removed redundant compound_page() call from inside PageLocked()
* Fixed a minor build issue for s390 reported by kernel test bot

v4:
* Added pfn_swap_entry_to_page()
* Reinstated check that migration entries point to locked pages
* Removed #define swapcache_prepare which isn't needed for CONFIG_SWAP=0
  builds
---
 arch/s390/mm/pgtable.c  |  2 +-
 fs/proc/task_mmu.c  | 23 +-
 include/linux/swap.h|  4 +--
 include/linux/swapops.h | 69 ++---
 mm/hmm.c|  5 ++-
 mm/huge_memory.c|  4 +--
 mm/memcontrol.c |  2 +-
 mm/memory.c | 10 +++---
 mm/migrate.c|  6 ++--
 mm/page_vma_mapped.c|  6 ++--
 10 files changed, 50 insertions(+), 81 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 18205f851c24..eec3a9d7176e 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -691,7 +691,7 @@ static void ptep_zap_swap_entry(struct mm_struct *mm, 
swp_entry_t entry)
if (!non_swap_entry(entry))
dec_mm_counter(mm, MM_SWAPENTS);
else if (is_migration_entry(entry)) {
-   struct page *page = migration_entry_to_page(entry);
+   struct page *page = pfn_swap_entry_to_page(entry);
 
dec_mm_counter(mm, mm_counter(page));
}
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fc9784544b24..0953732c8ce1 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -514,10 +514,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
} else {
mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
}
-   } else if (is_migration_entry(swpent))
-   page = migration_entry_to_page(swpent);
-   else if (is_device_private_entry(swpent))
-   page = device_private_entry_to_page(swpent);
+   } else if (is_pfn_swap_entry(swpent))
+   page = pfn_swap_entry_to_page(swpent);
} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
&& pte_none(*pte))) {
page = xa_load(&vma->vm_file->f_mapping->i_pages,
@@ -549,7 +547,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
swp_entry_t entry = pmd_to_swp_entry(*pmd);
 
if (is_migration_entry(entry))
-   page = migration_entry_to_page(entry);
+   page = pfn_swap_entry_to_page(entry);
}
if (IS_ERR_OR_NULL(page))
return;
@@ -694,10 +692,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long 
hmask,
} else if (is_swap_pte(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);
 
-   if (is_migration_entry(swpent))
-   page = migration_entry_to_page(swpent);
-   else if (is_device_private_entry(swpent))
-   page = device_private_entry_to_page(swpent);
+   if (is_pfn_swap_entry(swpent))
+   page = pfn_swap_entry_to_page(swpent);
}
if (page) {
int mapcount = page_mapcount(page);
@@ -1384,11 +1380,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct 
pagemapread *pm,
frame = swp_type(entry) |
(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
flags |= PM_SWAP;
-   if (is_migration_entry(entry))
-   page = migration_entry_to_page(entry);
-
-   if (is_device_private_entry(entry))
-   page = device_private_entry_to_page(entry);
+   if (is_pfn_swap_entry(entry))
+   page = pfn_swap_entry_to_page(entry);
}
 
if (page && !PageAnon(page))
@@ -1445,7 +1438,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long 
addr, unsigned long end,
if (pmd_swp_soft_dirty(pmd))
flags |= PM_SOFT_DIRTY;
VM_BUG_ON(!is_pmd_migration_entry(pmd));
-   page = migration

[Nouveau] [PATCH v9 00/10] Add support for SVM atomics in Nouveau

2021-05-24 Thread Alistair Popple
This is a repost of the previous series to rebase on v5.13-rc2 and to
address comments.

Outside of some code comment updates the primary change was to split the
renaming of migrate_pgmap_owner into a separate patch and to further
simplify the handling of device exclusive entries in copy_pte_range(). This
may result in temporary fork() failures if the process is using a device
whilst forking, but such usage is unlikely to be practical.

This resulted in a new clean-up patch for the series (patch 6) so that
device exclusive entries can be handled inside copy_nonpresent_pte(),
although more extensive clean-ups of copy_pte_range() are planned as
further development work in future.

Introduction


Some devices have features such as atomic PTE bits that can be used to
implement atomic access to system memory. To support atomic operations to a
shared virtual memory page such a device needs access to that page which is
exclusive of the CPU. This series introduces a mechanism to temporarily
unmap pages granting exclusive access to a device.

These changes are required to support OpenCL atomic operations in Nouveau
to shared virtual memory (SVM) regions allocated with the
CL_MEM_SVM_ATOMICS clSVMAlloc flag. A more complete description of the
OpenCL SVM feature is available at
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/
OpenCL_API.html#_shared_virtual_memory .

Implementation
==

Exclusive device access is implemented by adding a new swap entry type
(SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main
difference is that on fault the original entry is immediately restored by
the fault handler instead of waiting.

Restoring the entry triggers calls to MMU notifiers, which allows a device
driver to revoke the atomic access permission from the GPU prior to the CPU
finalising the entry.
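
To make the driver side of this concrete, the interval notifier callback
ends up looking something like the sketch below (modelled on the Nouveau
change in patch 10; struct my_dev and the function names are placeholders,
not code from this series):

#include <linux/mmu_notifier.h>

struct my_dev {
	struct mmu_interval_notifier notifier;
	/* ... device state ... */
};

static bool my_dev_invalidate(struct mmu_interval_notifier *mni,
			      const struct mmu_notifier_range *range,
			      unsigned long cur_seq)
{
	struct my_dev *dev = container_of(mni, struct my_dev, notifier);

	/*
	 * Invalidations raised while this driver is itself establishing
	 * exclusive entries carry the driver as the range owner and can be
	 * skipped; everything else must revoke the device's mapping.
	 */
	if (range->event == MMU_NOTIFY_EXCLUSIVE && range->owner == dev)
		return true;

	mmu_interval_set_seq(mni, cur_seq);
	/* ... invalidate device mappings for [range->start, range->end) ... */
	return true;
}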

Patches
===

Patches 1 & 2 refactor existing migration and device private entry
functions.

Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
functionality into separate functions - try_to_migrate_one() and
try_to_munlock_one().

Patch 5 renames some existing code but does not introduce functionality.

Patch 6 is a small clean-up to swap entry handling in copy_pte_range().

Patch 7 contains the bulk of the implementation for device exclusive
memory.

Patch 8 contains some additions to the HMM selftests to ensure everything
works as expected.

Patch 9 is a cleanup for the Nouveau SVM implementation.

Patch 10 contains the implementation of atomic access for the Nouveau
driver.

Testing
===

This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
which checks that GPU atomic accesses to system memory are atomic. Without
this series the test fails as there is no way of write-protecting the page
mapping which results in the device clobbering CPU writes. For reference
the test is available at https://ozlabs.org/~apopple/opencl_svm_atomics/

Further testing has been performed by adding support for testing exclusive
access to the hmm-tests kselftests.

Alistair Popple (10):
  mm: Remove special swap entry functions
  mm/swapops: Rework swap entry manipulation code
  mm/rmap: Split try_to_munlock from try_to_unmap
  mm/rmap: Split migration into its own function
  mm: Rename migrate_pgmap_owner
  mm/memory.c: Allow different return codes for copy_nonpresent_pte()
  mm: Device exclusive memory access
  mm: Selftests for exclusive device memory
  nouveau/svm: Refactor nouveau_range_fault
  nouveau/svm: Implement atomic SVM access

 Documentation/vm/hmm.rst  |  19 +-
 Documentation/vm/unevictable-lru.rst  |  33 +-
 arch/s390/mm/pgtable.c|   2 +-
 drivers/gpu/drm/nouveau/include/nvif/if000c.h |   1 +
 drivers/gpu/drm/nouveau/nouveau_svm.c | 156 -
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h |   1 +
 .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c|   6 +
 fs/proc/task_mmu.c|  23 +-
 include/linux/mmu_notifier.h  |  26 +-
 include/linux/rmap.h  |  11 +-
 include/linux/swap.h  |  11 +-
 include/linux/swapops.h   | 123 ++--
 lib/test_hmm.c| 126 +++-
 lib/test_hmm_uapi.h   |   2 +
 mm/debug_vm_pgtable.c |  12 +-
 mm/hmm.c  |  12 +-
 mm/huge_memory.c  |  45 +-
 mm/hugetlb.c  |  10 +-
 mm/memcontrol.c   |   2 +-
 mm/memory.c   | 160 -
 mm/migrate.c  |  51 +-
 mm/mlock.c|  10 +-
 mm/mprotect.c |  18 +-
 mm/page_vma_mapped.c  |  15 +-
 mm/rmap.c | 601 +++---
 tools

Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-21 Thread Alistair Popple
On Tuesday, 18 May 2021 11:19:05 PM AEST Alistair Popple wrote:

[...]

> > > +/*
> > > + * Restore a potential device exclusive pte to a working pte entry
> > > + */
> > > +static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > > +{
> > > + struct page *page = vmf->page;
> > > + struct vm_area_struct *vma = vmf->vma;
> > > + struct page_vma_mapped_walk pvmw = {
> > > + .page = page,
> > > + .vma = vma,
> > > + .address = vmf->address,
> > > + .flags = PVMW_SYNC,
> > > + };
> > > + vm_fault_t ret = 0;
> > > + struct mmu_notifier_range range;
> > > +
> > > + if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
> > > + return VM_FAULT_RETRY;
> > > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma,
> > > vma->vm_mm,
> > > + vmf->address & PAGE_MASK,
> > > + (vmf->address & PAGE_MASK) + PAGE_SIZE);
> > > + mmu_notifier_invalidate_range_start(&range);
> > 
> > I looked at MMU_NOTIFIER_CLEAR document and it tells me:
> > 
> > * @MMU_NOTIFY_CLEAR: clear page table entry (many reasons for this like
> > * madvise() or replacing a page by another one, ...).
> > 
> > Does MMU_NOTIFIER_CLEAR suite for this case?  Normally I think for such a
> > case (existing pte is invalid) we don't need to notify at all.  However
> > from what I read from the whole series, this seems to be a critical point
> > where we'd like to kick the owner/driver to synchronously stop doing
> > atomic
> > operations from the device.  Not sure whether we'd like a new notifier
> > type, or maybe at least comment on why to use CLEAR?
> 
> Right, notifying the owner/driver when it no longer has exclusive access to
> the page and allowing it to stop atomic operations is the critical point and
> why it notifies when we ordinarily wouldn't (ie. invalid -> valid).
> 
> I did consider adding a new type, but in the driver implementation it ends
> up
> being treated the same as a CLEAR notification anyway so didn't think it was
> necessary. But I suppose adding a different type would allow other listening
> notifiers to filter these which might be worthwhile.
>
> > > +
> > > + while (page_vma_mapped_walk(&pvmw)) {
> > 
> > IIUC a while loop of page_vma_mapped_walk() handles thps, however here
> > it's
> > already in do_swap_page() so it's small pte only?  Meanwhile...
> > 
> > > + if (unlikely(!pte_same(*pvmw.pte, vmf->orig_pte))) {
> > > + page_vma_mapped_walk_done(&pvmw);
> > > + break;
> > > + }
> > > +
> > > + restore_exclusive_pte(vma, page, pvmw.address, pvmw.pte);
> > 
> > ... I'm not sure whether passing in page would work for thp after all, as
> > iiuc we may need to pass in the subpage rather than the head page if so.
> 
> Hmm, I need to check this and follow up.

Thanks Peter for pointing that out. I needed to follow this up because I had 
slightly misunderstood page_vma_mapped_walk(). As you say this is actually a 
small pte and restore_exclusive_pte() needs the actual page from the fault. So 
I should be able to drop the page_vma_mapped_walk() and use 
pte_offset_map_lock() to call restore_exclusive_pte directly.
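
For reference, a rough sketch of that simplification (untested, purely to
illustrate the direction; it assumes vmf->pmd is valid, as it is when called
from do_swap_page()):

static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
{
	struct page *page = vmf->page;
	struct vm_area_struct *vma = vmf->vma;
	struct mmu_notifier_range range;
	spinlock_t *ptl;
	pte_t *ptep;

	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
		return VM_FAULT_RETRY;
	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
				vmf->address & PAGE_MASK,
				(vmf->address & PAGE_MASK) + PAGE_SIZE);
	mmu_notifier_invalidate_range_start(&range);

	/* Take the pte lock directly instead of walking the rmap. */
	ptep = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &ptl);
	if (pte_same(*ptep, vmf->orig_pte))
		restore_exclusive_pte(vma, page, vmf->address, ptep);
	pte_unmap_unlock(ptep, ptl);

	unlock_page(page);
	mmu_notifier_invalidate_range_end(&range);
	return 0;
}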

 - Alistair





Re: [Nouveau] [PATCH v8 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

2021-05-20 Thread Alistair Popple
On Friday, 21 May 2021 6:24:27 AM AEST Liam Howlett wrote:

[...]
 
> > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > index 977e70803ed8..f09d522725b9 100644
> > > > --- a/mm/rmap.c
> > > > +++ b/mm/rmap.c
> > > > @@ -1405,10 +1405,6 @@ static bool try_to_unmap_one(struct page *page,
> > > > struct vm_area_struct *vma,>
> > > > 
> > > >   struct mmu_notifier_range range;
> > > >   enum ttu_flags flags = (enum ttu_flags)(long)arg;
> > > > 
> > > > - /* munlock has nothing to gain from examining un-locked vmas */
> > > > - if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
> > > > - return true;
> > > > -
> > > > 
> > > >   if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
> > > >   
> > > >   is_zone_device_page(page) && !is_device_private_page(page))
> > > >   
> > > >   return true;
> > > > 
> > > > @@ -1469,8 +1465,6 @@ static bool try_to_unmap_one(struct page *page,
> > > > struct vm_area_struct *vma,>
> > > > 
> > > >   page_vma_mapped_walk_done(&pvmw);
> > > >   break;
> > > >   
> > > >   }
> > > > 
> > > > - if (flags & TTU_MUNLOCK)
> > > > - continue;
> > > > 
> > > >   }
> > > >   
> > > >   /* Unexpected PMD-mapped THP? */
> > > > 
> > > > @@ -1784,8 +1778,39 @@ bool try_to_unmap(struct page *page, enum
> > > > ttu_flags
> > > > flags)>
> > > > 
> > > >   return !page_mapcount(page) ? true : false;
> > > >  
> > > >  }
> > > 
> > > Please add a comment here, especially around locking.
> 
> Did you miss this comment?  I think the name confusion alone means this
> needs some documentation.  It's also worth mentioning arg is unused.

Ack. Was meant to come back to that after discussing some of the locking 
questions below. The other side effect of splitting this code out is it leaves 
space for more specific documentation which is only a good thing. I will try 
and summarise some of the discussion below into a comment here.

> > > > +static bool page_mlock_one(struct page *page, struct vm_area_struct
> > > > *vma,
> > > > +  unsigned long address, void *arg)
> > > > +{
> > > > + struct page_vma_mapped_walk pvmw = {
> > > > + .page = page,
> > > > + .vma = vma,
> > > > + .address = address,
> > > > + };
> > > > +
> > > > + /* munlock has nothing to gain from examining un-locked vmas */
> > > > + if (!(vma->vm_flags & VM_LOCKED))
> > > > + return true;
> > > 
> > > The logic here doesn't make sense.  You called page_mlock_one() on a VMA
> > > that isn't locked and it returns true?  Is this a check to see if the
> > > VMA has zero mlock'ed pages?
> > 
> > I'm pretty sure the logic is correct. This is used for an rmap_walk, so we
> > return true to continue the page table scan to see if other VMAs have the
> > page locked.
> 
> yes, sorry.  The logic is correct but doesn't read as though it does.
> I cannot see what is going on easily and there are no comments stating
> what is happening.

Thanks for confirming. The documentation in Documentation/vm/unevictable-
lru.rst is helpful for higher level context but I will put some comments here 
around the logic.

> > > > +
> > > > + while (page_vma_mapped_walk(&pvmw)) {
> > > > + /* PTE-mapped THP are never mlocked */
> > > > + if (!PageTransCompound(page)) {
> > > > + /*
> > > > +  * Holding pte lock, we do *not* need
> > > > +  * mmap_lock here
> > > > +  */
> > > 
> > > Are you sure?  I think you at least need to hold the mmap lock for
> > > reading to ensure there's no race here?  mlock_vma_page() eludes to such
> > > a scenario when lazy mlocking.
> > 
> > Not really. I don't claim to be an mlock expert but as this is a clean-up
> > for try_to_unmap() the intent was to not change existing behaviour.
> > 
> > However presenting the function in this simplified form did raise this and
> > some other questions during previous reviews - see
> > https://lore.kernel.org/
> > dri-devel/20210331115746.ga1463...@nvidia.com/ for the previous
> > discussion.
> 
> From what I can see, at least the following paths have mmap_lock held
> for writing:
> 
> munlock_vma_pages_range() from __do_munmap()
> munlock_vma_pages_range() from remap_file_pages()
> 
> > To answer the questions around locking though I did do some git sha1
> > mining. The best explanation seemed to be contained in
> > https://git.kernel.org/pub/scm/
> > linux/kernel/git/torvalds/linux.git/commit/?
> > id=b87537d9e2feb30f6a962f27eb32768682698d3b from Hugh (whom I've added
> > again here in case he can help answer some of these).
> 
> Thanks for the pointer.  That race doesn't make the lock unnecessary.
> It is the exception to the rule because the 

Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Alistair Popple
On Wednesday, 19 May 2021 10:15:41 PM AEST Peter Xu wrote:
> 
> On Wed, May 19, 2021 at 09:04:53PM +1000, Alistair Popple wrote:
> > Failing fork() because we couldn't take a lock doesn't seem like the right
> > approach though, especially as there is already existing code that
> > retries. I get this adds complexity though, so would be happy to take a
> > look at cleaning copy_pte_range() up in future.
> 
> Yes, I proposed that as this one won't affect any existing applications
> (unlike the existing ones) but only new userspace driver apps that will use
> this new atomic feature.
> 
> IMHO it'll be a pity to add extra complexity and maintainance burden into
> fork() if only for keeping the "logical correctness of fork()" however the
> code never triggers. If we start with trylock we'll know whether people
> will use it, since people will complain with a reason when needed; however
> I still doubt whether a sane userspace device driver should fork() within
> busy interaction with the device underneath..

I will refrain from commenting on the sanity or otherwise of doing that :-)

Agree such a scenario seems unlikely in practice (and possibly unreasonable). 
Keeping the "logical correctness of fork()" still seems worthwhile to me, but 
if the added complexity/maintenance burden for an admittedly fairly specific 
feature is going to stop progress here I am happy to take the fail fork 
approach. I could then possibly fix it up as a future clean up to 
copy_pte_range(). Perhaps others have thoughts?

> In all cases, please still consider to keep them in copy_nonpresent_pte()
> (and if to rework, separating patches would be great).
>
> Thanks,
> 
> --
> Peter Xu






Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Alistair Popple
On Wednesday, 19 May 2021 10:21:08 PM AEST Peter Xu wrote:
> 
> On Wed, May 19, 2021 at 09:35:10PM +1000, Alistair Popple wrote:
> > I think the approach you are describing is similar to what
> > migrate_vma_collect()/migrate_vma_unamp() does now and I think it could be
> > made to work. I ended up going with the GUP+unmap approach in part because
> > Christoph suggested it but primarily because it required less code
> > especially given we needed to call something to fault the page in/break
> > COW anyway (or extend what was there to call handle_mm_fault(), etc.).
> 
> I see, thank. Would you mind add a short paragraph in the commit message
> talking about these two solutions and why we choose the GUP way?

Sure.

 - Alistair

> --
> Peter Xu






Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Alistair Popple
On Wednesday, 19 May 2021 10:24:27 PM AEST Peter Xu wrote:
> 
> On Wed, May 19, 2021 at 08:49:01PM +1000, Alistair Popple wrote:
> > On Wednesday, 19 May 2021 7:16:38 AM AEST Peter Xu wrote:
> > > 
> > > 
> > > On Wed, Apr 07, 2021 at 06:42:35PM +1000, Alistair Popple wrote:
> > > 
> > > [...]
> > > 
> > > > +static bool try_to_protect(struct page *page, struct mm_struct *mm,
> > > > +unsigned long address, void *arg)
> > > > +{
> > > > + struct ttp_args ttp = {
> > > > + .mm = mm,
> > > > + .address = address,
> > > > + .arg = arg,
> > > > + .valid = false,
> > > > + };
> > > > + struct rmap_walk_control rwc = {
> > > > + .rmap_one = try_to_protect_one,
> > > > + .done = page_not_mapped,
> > > > + .anon_lock = page_lock_anon_vma_read,
> > > > + .arg = &ttp,
> > > > + };
> > > > +
> > > > + /*
> > > > +  * Restrict to anonymous pages for now to avoid potential
> > > > writeback
> > > > +  * issues.
> > > > +  */
> > > > + if (!PageAnon(page))
> > > > + return false;
> > > > +
> > > > + /*
> > > > +  * During exec, a temporary VMA is setup and later moved.
> > > > +  * The VMA is moved under the anon_vma lock but not the
> > > > +  * page tables leading to a race where migration cannot
> > > > +  * find the migration ptes. Rather than increasing the
> > > > +  * locking requirements of exec(), migration skips
> > > > +  * temporary VMAs until after exec() completes.
> > > > +  */
> > > > + if (!PageKsm(page) && PageAnon(page))
> > > > + rwc.invalid_vma = invalid_migration_vma;
> > > > +
> > > > + rmap_walk(page, &rwc);
> > > > +
> > > > + return ttp.valid && !page_mapcount(page);
> > > > +}
> > > 
> > > I raised a question in the other thread regarding fork():
> > > 
> > > https://lore.kernel.org/lkml/YKQjmtMo+YQGx%2FwZ@t490s/
> > > 
> > > While I suddenly noticed that we may have similar issues even if we
> > > fork()
> > > before creating the ptes.
> > > 
> > > In that case, we may see multiple read-only ptes pointing to the same
> > > page.
> > > We will convert all of them into device exclusive read ptes in
> > > rmap_walk()
> > > above, however how do we guarantee after all COW done in the parent and
> > > all
> > > the childs processes, the device owned page will be returned to the
> > > parent?
> > 
> > I assume you are talking about a fork() followed by a call to
> > make_device_exclusive()? I think this should be ok because
> > make_device_exclusive() always calls GUP with FOLL_WRITE both to break COW
> > and because a device performing atomic operations needs to write to the
> > page. I suppose a comment here highlighting the need to break COW to
> > avoid this scenario would be useful though.
> 
> Indeed, sorry for the false alarm!  Yes it would be great to mention that
> too.

No problem! Thanks for the comments.

> --
> Peter Xu






Re: [Nouveau] [PATCH v8 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

2021-05-19 Thread Alistair Popple
On Wednesday, 19 May 2021 6:04:51 AM AEST Liam Howlett wrote:
> 
> * Alistair Popple  [210407 04:43]:
> > The behaviour of try_to_unmap_one() is difficult to follow because it
> > performs different operations based on a fairly large set of flags used
> > in different combinations.
> > 
> > TTU_MUNLOCK is one such flag. However it is exclusively used by
> > try_to_munlock() which specifies no other flags. Therefore rather than
> > overload try_to_unmap_one() with unrelated behaviour split this out into
> > it's own function and remove the flag.
> > 
> > Signed-off-by: Alistair Popple 
> > Reviewed-by: Ralph Campbell 
> > Reviewed-by: Christoph Hellwig 
> > 
> > ---
> > 
> > v8:
> > * Renamed try_to_munlock to page_mlock to better reflect what the
> > 
> >   function actually does.
> > 
> > * Removed the TODO from the documentation that this patch addresses.
> > 
> > v7:
> > * Added Christoph's Reviewed-by
> > 
> > v4:
> > * Removed redundant check for VM_LOCKED
> > ---
> > 
> >  Documentation/vm/unevictable-lru.rst | 33 ---
> >  include/linux/rmap.h |  3 +-
> >  mm/mlock.c   | 10 +++---
> >  mm/rmap.c| 48 +---
> >  4 files changed, 55 insertions(+), 39 deletions(-)
> > 
> > diff --git a/Documentation/vm/unevictable-lru.rst
> > b/Documentation/vm/unevictable-lru.rst index 0e1490524f53..eae3af17f2d9
> > 100644
> > --- a/Documentation/vm/unevictable-lru.rst
> > +++ b/Documentation/vm/unevictable-lru.rst
> > @@ -389,14 +389,14 @@ mlocked, munlock_vma_page() updates that zone
> > statistics for the number of> 
> >  mlocked pages.  Note, however, that at this point we haven't checked
> >  whether the page is mapped by other VM_LOCKED VMAs.
> > 
> > -We can't call try_to_munlock(), the function that walks the reverse map
> > to
> > +We can't call page_mlock(), the function that walks the reverse map to
> > 
> >  check for other VM_LOCKED VMAs, without first isolating the page from the
> >  LRU.> 
> > -try_to_munlock() is a variant of try_to_unmap() and thus requires that
> > the page +page_mlock() is a variant of try_to_unmap() and thus requires
> > that the page> 
> >  not be on an LRU list [more on these below].  However, the call to
> > 
> > -isolate_lru_page() could fail, in which case we couldn't
> > try_to_munlock().  So, +isolate_lru_page() could fail, in which case we
> > can't call page_mlock().  So,> 
> >  we go ahead and clear PG_mlocked up front, as this might be the only
> >  chance we> 
> > -have.  If we can successfully isolate the page, we go ahead and
> > -try_to_munlock(), which will restore the PG_mlocked flag and update the
> > zone +have.  If we can successfully isolate the page, we go ahead and
> > call +page_mlock(), which will restore the PG_mlocked flag and update the
> > zone> 
> >  page statistics if it finds another VMA holding the page mlocked.  If we
> >  fail to isolate the page, we'll have left a potentially mlocked page on
> >  the LRU. This is fine, because we'll catch it later if and if vmscan
> >  tries to reclaim> 
> > @@ -545,31 +545,24 @@ munlock or munmap system calls, mm teardown
> > (munlock_vma_pages_all), reclaim,> 
> >  holepunching, and truncation of file pages and their anonymous COWed
> >  pages.
> > 
> > -try_to_munlock() Reverse Map Scan
> > +page_mlock() Reverse Map Scan
> > 
> >  -
> > 
> > -.. warning::
> > -   [!] TODO/FIXME: a better name might be page_mlocked() - analogous to
> > the -   page_referenced() reverse map walker.
> > -
> > 
> >  When munlock_vma_page() [see section :ref:`munlock()/munlockall() System
> >  Call Handling ` above] tries to munlock a
> >  page, it needs to determine whether or not the page is mapped by any
> >  VM_LOCKED VMA without actually attempting to unmap all PTEs from the
> >  page.  For this purpose, the unevictable/mlock infrastructure
> > 
> > -introduced a variant of try_to_unmap() called try_to_munlock().
> > +introduced a variant of try_to_unmap() called page_mlock().
> > 
> > -try_to_munlock() calls the same functions as try_to_unmap() for anonymous
> > and -mapped file and KSM pages with a flag argument specifying unlock
> > versus unmap -processing.  Again, these functions walk the respective
> > reverse maps l

Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Alistair Popple
On Wednesday, 19 May 2021 3:27:42 AM AEST Peter Xu wrote:
> > > The odd part is the remote GUP should have walked the page table
> > > already, so since the target here is the vaddr to replace, the 1st page
> > > table walk should be able to both trylock/lock the page, then modify
> > > the pte with pgtable lock held, return the locked page, then walk the
> > > rmap again to remove all the rest of the ptes that are mapping to this
> > > page.  In that case before we call the rmap_walk() we know this must be
> > > the page we want to take care of, and no one will be able to restore
> > > the original mm pte either (as we're with the page lock).  Then we
> > > don't need this check, neither do we need ttp->address.
> > 
> > If I am understanding you correctly I think this would be similar to the
> > approach that was taken in v2. However it pretty much ended up being just
> > an open-coded version of gup which is useful anyway to fault the page in.
> I see.  For easier reference this is v2 patch 1:
> 
> https://lore.kernel.org/lkml/20210219020750.16444-2-apop...@nvidia.com/

Sorry, I should have been clearer and just included that reference for you.

> Indeed that looks like it, it's just that instead of grabbing the page only
> in hmm_exclusive_pmd() we can do the pte modification along the way to seal
> the whole thing (address/pte & page).  I saw Christoph and Jason commented
> in that patch, but not regarding to this approach.  So is there a reason
> that you switched?  Do you think it'll work?

I think the approach you are describing is similar to what
migrate_vma_collect()/migrate_vma_unmap() does now and I think it could be
made to work. I ended up going with the GUP+unmap approach in part because
Christoph suggested it, but primarily because it required less code, especially
given we needed to call something to fault the page in/break COW anyway (or
extend what was there to call handle_mm_fault(), etc.).

> I have no strong opinion either, it's just not crystal clear why we'd need
> that ttp->address at all for a rmap walk along with that "valid" field.

It's purely to ensure the PTE pointing to the GUP page was replaced with an 
exclusive swap entry and that the mapping didn't change between calls.
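
Concretely, the rmap_one callback (try_to_protect_one()) does something
roughly like this (simplified sketch, not the exact code from the patch):

	/* Only the mapping at the address we were asked to protect counts. */
	if (ttp->mm == mm && ttp->address == address)
		ttp->valid = true;

Combined with the !page_mapcount(page) check after rmap_walk() this confirms
that every PTE mapping the page was converted and that the PTE at
ttp->address still points at the page returned by GUP.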

> Meanwhile it should be slightly less efficient too to go with current
> approach, especially when the page array gets huge, I think: since there'll
> be longer time we do GUP before doing the rmap walk, so higher possibility
> that the GUPed pages got replaced for whatever reason.  Then the call to
> make_device_exclusive_range() will fail as a whole just for a single page
> replacement within the range.






Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Alistair Popple
On Wednesday, 19 May 2021 9:45:05 AM AEST Peter Xu wrote:
> 
> On Tue, May 18, 2021 at 08:03:27PM -0300, Jason Gunthorpe wrote:
> > Logically during fork all these device exclusive pages should be
> > reverted back to their CPU pages, write protected and the CPU page PTE
> > copied to the fork.
> > 
> > We should not copy the device exclusive page PTE to the fork. I think
> > I pointed to this on an earlier rev..
> 
> Agreed.  Though please see the question I posted in the other thread: now I
> am not very sure whether we'll be able to mark a page as device exclusive
> if that page has mapcount>1.
>
> > We can optimize this into the various variants above, but logically
> > device exclusive stop existing during fork.
> 
> Makes sense, I think that's indeed what this patch did at least for the COW
> case, so I think Alistair did address that comment.  It's just that I think
> we need to drop the other !COW case (imho that should correspond to the
> changes in copy_nonpresent_pte()) in this patch to guarantee it.

Right. The main change from v7 -> v8 was to remove device exclusive entries on 
fork instead of copying them. The change in copy_nonpresent_pte() is for the
!COW case. I think what you are getting at is that, given exclusive entries
are (currently) only supported for PageAnon pages, is_cow_mapping() will
always be true and therefore the change to copy_nonpresent_pte() can be
dropped. That logic seems reasonable so I will change the exclusive case in
copy_nonpresent_pte() to a VM_WARN_ON.
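
Ie. something along these lines (just a sketch of the intent, not the final
patch):

	} else if (is_device_exclusive_entry(entry)) {
		/*
		 * Exclusive entries are only installed for anonymous pages,
		 * so any mapping containing them must be a COW mapping and
		 * should already have been dealt with before we get here.
		 */
		VM_WARN_ON_ONCE(1);
	}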

> I also hope we don't make copy_pte_range() even more complicated just to do
> the lock_page() right, so we could fail the fork() if the lock is hard to
> take.

Failing fork() because we couldn't take a lock doesn't seem like the right 
approach though, especially as there is already existing code that retries. I 
get this adds complexity though, so would be happy to take a look at cleaning 
copy_pte_range() up in future.

> --
> Peter Xu






Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Alistair Popple
On Wednesday, 19 May 2021 7:16:38 AM AEST Peter Xu wrote:
> 
> 
> On Wed, Apr 07, 2021 at 06:42:35PM +1000, Alistair Popple wrote:
> 
> [...]
> 
> > +static bool try_to_protect(struct page *page, struct mm_struct *mm,
> > +unsigned long address, void *arg)
> > +{
> > + struct ttp_args ttp = {
> > + .mm = mm,
> > + .address = address,
> > + .arg = arg,
> > + .valid = false,
> > + };
> > + struct rmap_walk_control rwc = {
> > + .rmap_one = try_to_protect_one,
> > + .done = page_not_mapped,
> > + .anon_lock = page_lock_anon_vma_read,
> > + .arg = &ttp,
> > + };
> > +
> > + /*
> > +  * Restrict to anonymous pages for now to avoid potential writeback
> > +  * issues.
> > +  */
> > + if (!PageAnon(page))
> > + return false;
> > +
> > + /*
> > +  * During exec, a temporary VMA is setup and later moved.
> > +  * The VMA is moved under the anon_vma lock but not the
> > +  * page tables leading to a race where migration cannot
> > +  * find the migration ptes. Rather than increasing the
> > +  * locking requirements of exec(), migration skips
> > +  * temporary VMAs until after exec() completes.
> > +  */
> > + if (!PageKsm(page) && PageAnon(page))
> > + rwc.invalid_vma = invalid_migration_vma;
> > +
> > + rmap_walk(page, &rwc);
> > +
> > + return ttp.valid && !page_mapcount(page);
> > +}
> 
> I raised a question in the other thread regarding fork():
> 
> https://lore.kernel.org/lkml/YKQjmtMo+YQGx%2FwZ@t490s/
> 
> While I suddenly noticed that we may have similar issues even if we fork()
> before creating the ptes.
> 
> In that case, we may see multiple read-only ptes pointing to the same page. 
> We will convert all of them into device exclusive read ptes in rmap_walk()
> above, however how do we guarantee after all COW done in the parent and all
> the childs processes, the device owned page will be returned to the parent?

I assume you are talking about a fork() followed by a call to 
make_device_exclusive()? I think this should be ok because 
make_device_exclusive() always calls GUP with FOLL_WRITE both to break COW and 
because a device performing atomic operations needs to write to the page. I 
suppose a comment here highlighting the need to break COW to avoid this 
scenario would be useful though.
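
For reference the fault/COW break in make_device_exclusive_range() is done
with something roughly like the below (sketch only - the exact GUP flags in
the patch may differ):

	npages = get_user_pages_remote(mm, start, npages,
				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
				       pages, NULL, NULL);

Because FOLL_WRITE is always passed, any read-only COW page gets broken
before the exclusive entries are installed.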

> E.g., if parent accesses the page earlier than the children processes
> (actually, as long as not the last one), do_wp_page() will do COW for parent
> on this page because refcount(page)>1, then the page seems to get lost to a
> random child too..
>
> To resolve all these complexity, not sure whether try_to_protect() could
> enforce VM_DONTCOPY (needs madvise MADV_DONTFORK in the user app), meanwhile
> make sure mapcount(page)==1 before granting the page to the device, so that
> this will guarantee this mm owns this page forever, I think?  It'll
> simplify fork() too as a side effect, since VM_DONTCOPY vma go away when
> fork.
> 
> --
> Peter Xu






Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Alistair Popple
On Tuesday, 18 May 2021 12:08:34 PM AEST Peter Xu wrote:
> > v6:
> > * Fixed a bisectablity issue due to incorrectly applying the rename of
> > 
> >   migrate_pgmap_owner to the wrong patches for Nouveau and hmm_test.
> > 
> > v5:
> > * Renamed range->migrate_pgmap_owner to range->owner.
> 
> May be nicer to mention this rename in commit message (or make it a separate
> patch)?

Ok, I think if it needs to be explicitly called out in the commit message it 
should be a separate patch. Originally I thought the change was _just_ small 
enough to include here, but this patch has since grown so I'll split it out.
 


> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index 0e25d829f742..3a1ce4ef9276 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -193,6 +193,10 @@ int page_referenced(struct page *, int is_locked,
> > 
> >  bool try_to_migrate(struct page *page, enum ttu_flags flags);
> >  bool try_to_unmap(struct page *, enum ttu_flags flags);
> > 
> > +int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
> > + unsigned long end, struct page **pages,
> > + void *arg);
> > +
> > 
> >  /* Avoid racy checks */
> >  #define PVMW_SYNC (1 << 0)
> >  /* Look for migarion entries rather than present PTEs */
> > 
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 516104b9334b..7a3c260146df 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -63,9 +63,11 @@ static inline int current_is_kswapd(void)
> > 
> >   * to a special SWP_DEVICE_* entry.
> >   */
> 
> Should we add another short description for the newly added two types?
> Otherwise the reader could get confused assuming the above comment is
> explaining all four types, while it is for SWP_DEVICE_{READ|WRITE} only?

Good idea, I can see how that could be confusing. Will add a short 
description.
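
Probably something along these lines above the new definitions (exact wording
still to be written):

	/*
	 * Page has been mapped for exclusive access by a device
	 * (SWP_DEVICE_EXCLUSIVE_{READ,WRITE}). This is a temporary state -
	 * a CPU fault on the entry notifies the driver to revoke the
	 * device's access before restoring the original mapping.
	 */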



> > diff --git a/mm/memory.c b/mm/memory.c
> > index 3a5705cfc891..556ff396f2e9 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -700,6 +700,84 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> >  }
> >  #endif
> > 
> > +static void restore_exclusive_pte(struct vm_area_struct *vma,
> > +   struct page *page, unsigned long address,
> > +   pte_t *ptep)
> > +{
> > + pte_t pte;
> > + swp_entry_t entry;
> > +
> > + pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
> > + if (pte_swp_soft_dirty(*ptep))
> > + pte = pte_mksoft_dirty(pte);
> > +
> > + entry = pte_to_swp_entry(*ptep);
> > + if (pte_swp_uffd_wp(*ptep))
> > + pte = pte_mkuffd_wp(pte);
> > + else if (is_writable_device_exclusive_entry(entry))
> > + pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > +
> > + set_pte_at(vma->vm_mm, address, ptep, pte);
> > +
> > + /*
> > +  * No need to take a page reference as one was already
> > +  * created when the swap entry was made.
> > +  */
> > + if (PageAnon(page))
> > + page_add_anon_rmap(page, vma, address, false);
> > + else
> > + page_add_file_rmap(page, false);
> > +
> > + if (vma->vm_flags & VM_LOCKED)
> > + mlock_vma_page(page);
> > +
> > + /*
> > +  * No need to invalidate - it was non-present before. However
> > +  * secondary CPUs may have mappings that need invalidating.
> > +  */
> > + update_mmu_cache(vma, address, ptep);
> > +}
> > +
> > +/*
> > + * Tries to restore an exclusive pte if the page lock can be acquired without
> > + * sleeping. Returns 0 on success or -EBUSY if the page could not be locked or
> > + * the entry no longer points at locked_page in which case locked_page should be
> > + * locked before retrying the call.
> > + */
> > +static unsigned long
> > +try_restore_exclusive_pte(struct mm_struct *src_mm, pte_t *src_pte,
> > +   struct vm_area_struct *vma, unsigned long addr,
> > +   struct page **locked_page)
> > +{
> > + swp_entry_t entry = pte_to_swp_entry(*src_pte);
> > + struct page *page = pfn_swap_entry_to_page(entry);
> > +
> > + if (*locked_page) {
> > + /* The entry changed, retry */
> > + if (unlikely(*locked_page != page)) {
> > + unlock_page(*locked_page);
> > + put_page(*locked_page);
> > + *locked_page = page;
> > + return -EBUSY;
> > + }
> > + restore_exclusive_pte(vma, page, addr, src_pte);
> > + unlock_page(page);
> > + put_page(page);
> > + *locked_page = NULL;
> > + return 0;
> > + }
> > +
> > + if (trylock_page(page)) {
> > + restore_exclusive_pte(vma, page, addr, src_pte);
> > + unlock_page(page);
> > + return 0;
> > + 

Re: [Nouveau] [PATCH v8 1/8] mm: Remove special swap entry functions

2021-05-18 Thread Alistair Popple
On Tuesday, 18 May 2021 12:17:32 PM AEST Peter Xu wrote:
> On Wed, Apr 07, 2021 at 06:42:31PM +1000, Alistair Popple wrote:
> > +static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
> > +{
> > + struct page *p = pfn_to_page(swp_offset(entry));
> > +
> > + /*
> > +  * Any use of migration entries may only occur while the
> > +  * corresponding page is locked
> > +  */
> > + BUG_ON(is_migration_entry(entry) && !PageLocked(p));
> > +
> > + return p;
> > +}
> 
> Would swap_pfn_entry_to_page() be slightly better?
> 
> The thing is it's very easy to read pfn_*() as a function to take a pfn as
> parameter...
> 
> Since I'm also recently working on some swap-related new ptes [1], I'm
> thinking whether we could name these swap entries as "swap XXX entries". 
> Say, "swap hwpoison entry", "swap pfn entry" (which is a superset of "swap
> migration entry", "swap device exclusive entry", ...).  That's where I came
> with the above swap_pfn_entry_to_page(), then below will be naturally
> is_swap_pfn_entry().

Equally though "hwpoison swap entry", "pfn swap entry", "migration swap 
entry", etc. also makes sense (at least to me), but does that not fit in as 
well with your series? I haven't looked too deeply at your series but have 
been meaning to so thanks for the pointer.

> No strong opinion as this is already a v8 series (and sorry to chim in this
> late), just to raise this up.

No worries, it's good timing as I was about to send a v9 which was just a 
rebase anyway. I am hoping to try and get this accepted for the next merge 
window but I will wait before sending v9 to see if anyone else has thoughts on 
the naming here.

I don't have a particularly strong opinion either, and you have clearly put
more thought into the naming than I did originally, so I am happy to rename
if it's more readable or fits better with your series.

Thanks.

 - Alistair

> [1] https://lore.kernel.org/lkml/20210427161317.50682-1-pet...@redhat.com/
> 
> Thanks,
> 
> > +
> > +/*
> > + * A pfn swap entry is a special type of swap entry that always has a pfn stored
> > + * in the swap offset. They are used to represent unaddressable device memory
> > + * and to restrict access to a page undergoing migration.
> > + */
> > +static inline bool is_pfn_swap_entry(swp_entry_t entry)
> > +{
> > + return is_migration_entry(entry) || is_device_private_entry(entry);
> > +}
> 
> --
> Peter Xu






Re: [Nouveau] [PATCH v8 0/8] Add support for SVM atomics in Nouveau

2021-05-06 Thread Alistair Popple
Hi Andrew,

There is currently no outstanding feedback for this series so I am hoping it 
may be considered for inclusion (or at least the mm portions - I still need 
Reviews/Acks for the Nouveau bits). The main change for v8 was removal of 
entries on fork rather than copying in response to feedback from Jason so any 
follow up comments on patch 5 would also be welcome. The series contains a 
number of general clean-ups suggested by Christoph along with a feature to 
temporarily make selected user page mappings write-protected.

This is needed to support OpenCL atomic operations in Nouveau to shared 
virtual memory (SVM) regions allocated with the CL_MEM_SVM_ATOMICS clSVMAlloc 
flag. A more complete description of the OpenCL SVM feature is available at 
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/
OpenCL_API.html#_shared_virtual_memory .
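
For anyone unfamiliar with the flag, allocating the buffer on the host side
looks roughly like this (illustrative only, not taken from the actual test
program):

	/* Fine-grained SVM buffer the GPU may update with atomic operations. */
	int *counter = clSVMAlloc(context,
				  CL_MEM_READ_WRITE |
				  CL_MEM_SVM_FINE_GRAIN_BUFFER |
				  CL_MEM_SVM_ATOMICS,
				  sizeof(int), 0);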

I have been testing this with Mesa 21.1.0 and a simple OpenCL program which 
checks GPU atomic accesses to system memory are atomic. Without this series 
the test fails as there is no way of write-protecting the userspace page 
mapping which results in the device clobbering CPU writes. For reference the 
test is available at https://ozlabs.org/~apopple/opencl_svm_atomics/ .

 - Alistair

On Wednesday, 7 April 2021 6:42:30 PM AEST Alistair Popple wrote:
> This is the eighth version of a series to add support to Nouveau for atomic
> memory operations on OpenCL shared virtual memory (SVM) regions.
> 
> The main change for this version is a simplification of device exclusive
> entry handling. Instead of copying entries for copy-on-write mappings
> during fork they are now removed. This is safer because there could be
> unique corner cases when copying, particularly for pinned pages which
> should follow the same logic as copy_present_page(). Removing entries
> avoids this possibility by treating them as normal ptes.
> 
> Exclusive device access is implemented by adding a new swap entry type
> (SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main
> difference is that on fault the original entry is immediately restored by
> the fault handler instead of waiting.
> 
> Restoring the entry triggers calls to MMU notifiers which allow a device
> driver to revoke the atomic access permission from the GPU prior to the CPU
> finalising the entry.
> 
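On the driver side this means the notifier callback can distinguish these
invalidations from others and avoid revoking its own access. A hypothetical
sketch, assuming the MMU_NOTIFY_EXCLUSIVE event and range->owner field added
by this series (drv_owner and drv_revoke_atomic_access() are made-up names):

	static bool drv_interval_invalidate(struct mmu_interval_notifier *mni,
					    const struct mmu_notifier_range *range,
					    unsigned long cur_seq)
	{
		/* Skip invalidations caused by our own exclusive entries. */
		if (range->event == MMU_NOTIFY_EXCLUSIVE &&
		    range->owner == drv_owner)
			return true;

		/* Anything else means the device must give up atomic access. */
		mmu_interval_set_seq(mni, cur_seq);
		drv_revoke_atomic_access(mni);
		return true;
	}
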
> Patches 1 & 2 refactor existing migration and device private entry
> functions.
> 
> Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
> functionality into separate functions - try_to_migrate_one() and
> try_to_munlock_one(). These should not change any functionality, but any
> help testing would be much appreciated as I have not been able to test
> every usage of try_to_unmap_one().
> 
> Patch 5 contains the bulk of the implementation for device exclusive
> memory.
> 
> Patch 6 contains some additions to the HMM selftests to ensure everything
> works as expected.
> 
> Patch 7 is a cleanup for the Nouveau SVM implementation.
> 
> Patch 8 contains the implementation of atomic access for the Nouveau
> driver.
> 
> This has been tested using the latest upstream Mesa userspace with a simple
> OpenCL test program which checks the results of atomic GPU operations on a
> SVM buffer whilst also writing to the same buffer from the CPU.
> 
> Alistair Popple (8):
>   mm: Remove special swap entry functions
>   mm/swapops: Rework swap entry manipulation code
>   mm/rmap: Split try_to_munlock from try_to_unmap
>   mm/rmap: Split migration into its own function
>   mm: Device exclusive memory access
>   mm: Selftests for exclusive device memory
>   nouveau/svm: Refactor nouveau_range_fault
>   nouveau/svm: Implement atomic SVM access
> 
>  Documentation/vm/hmm.rst  |  19 +-
>  Documentation/vm/unevictable-lru.rst  |  33 +-
>  arch/s390/mm/pgtable.c|   2 +-
>  drivers/gpu/drm/nouveau/include/nvif/if000c.h |   1 +
>  drivers/gpu/drm/nouveau/nouveau_svm.c | 156 -
>  drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h |   1 +
>  .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c|   6 +
>  fs/proc/task_mmu.c|  23 +-
>  include/linux/mmu_notifier.h  |  26 +-
>  include/linux/rmap.h  |  11 +-
>  include/linux/swap.h  |   8 +-
>  include/linux/swapops.h   | 123 ++--
>  lib/test_hmm.c| 126 +++-
>  lib/test_hmm_uapi.h   |   2 +
>  mm/debug_vm_pgtable.c |  12 +-
>  mm/hmm.c  |  12 +-
>  mm/huge_memory.c  |  45 +-
>  mm/hugetlb.c  |  10 +-
>  mm/me
