Re: [PATCH] mm: Remove double faults once write a device pfn

2024-01-25 Thread Alistair Popple


"Zhou, Xianrong"  writes:

> [AMD Official Use Only - General]
>
>> > The vmf_insert_pfn_prot could cause unnecessary double faults on a
>> > device pfn, because currently vmf_insert_pfn_prot does not make the
>> > pfn writable, so the pte entry is normally read-only or write-protected
>> > for dirty tracking.
>>  What? How did you get to this conclusion?
>> >>> Sorry. I did not mention that this problem only exists on the arm64 platform.
>> >> Ok, that makes at least a little bit more sense.
>> >>
>> >>> Because on the arm64 platform PTE_RDONLY is automatically attached
>> >>> to userspace pte entries even with VM_WRITE + VM_SHARED. The
>> >>> PTE_RDONLY bit needs to be cleared in vmf_insert_pfn_prot. However
>> >>> vmf_insert_pfn_prot does not make the pte writable because it passes
>> >>> false @mkwrite to insert_pfn.
>> >> Question is why is arm64 doing this? As far as I can see they must
>> >> have some hardware reason for that.
>> >>
>> >> The mkwrite parameter to insert_pfn() was added by commit
>> >> b2770da642540 to make insert_pfn() look more like insert_pfn_pmd() so
>> >> that the DAX code can insert PTEs which are writable and dirty at the same
>> time.
>> >>
>> > That is one scenario for doing so. In fact on arm64 there are many
>> > scenarios where it applies. So we could let vmf_insert_pfn_prot
>> > support @mkwrite for drivers at the core layer and let drivers
>> > decide whether to make the entry writable and dirty at the same time.
>> > That is what this patch does. Otherwise a double fault occurs on arm64
>> whenever vmf_insert_pfn_prot is called.
>>
>> Well, that doesn't answer my question of why arm64 is double faulting in the
>> first place.
>>
>
>
> Eh.
>
> On arm64, when userspace mmap()s with PROT_WRITE and MAP_SHARED, the
> vma->vm_page_prot has both PTE_RDONLY and PTE_WRITE within
> PAGE_SHARED_EXEC (see the arm64 protection_map).
>
> When userspace writes to the virtual address, the first fault happens and calls
> into the driver's
> .fault->ttm_bo_vm_fault_reserved->vmf_insert_pfn_prot->insert_pfn.
> insert_pfn establishes the pte entry. However vmf_insert_pfn_prot passes
> false @mkwrite to insert_pfn by default, so insert_pfn cannot make the pfn
> writable and does not call maybe_mkwrite(pte_mkdirty(entry), vma) to clear
> the PTE_RDONLY bit. So the pte entry is actually write-protected as far as
> the MMU is concerned.
>
> When the first fault returns and the store instruction is re-executed, a
> second fault happens. The second fault only does pte_mkdirty(entry), which
> clears PTE_RDONLY.

It depends if the ARM64 CPU in question supports hardware dirty bit
management (DBM). If that is the case and PTE_DBM (ie. PTE_WRITE) is set
HW will automatically clear PTE_RDONLY bit to mark the entry dirty
instead of raising a write fault. So you shouldn't see a double fault if
PTE_DBM/WRITE is set.

On ARM64 you can kind of think of PTE_RDONLY as the HW dirty bit and
PTE_DBM as the read/write permission bit with SW being responsible for
updating PTE_RDONLY via the fault handler if DBM is not supported by HW.

At least that's my understanding from having hacked on this in the
past. You can see all this weirdness happening in the definitions of
pte_dirty() and pte_write() for ARM64.

> I think so, and I hope I am not wrong.
>
>> So as long as this isn't sorted out I'm going to reject this patch.
>>
>> Regards,
>> Christian.
>>
>> >
>> >> This is a completely different use case to what you try to use it
>> >> here for and that looks extremely fishy to me.
>> >>
>> >> Regards,
>> >> Christian.
>> >>
>> > The first fault only sets up the pte entry, which is actually
>> > write-protected for dirty tracking. The second fault on the pfn then
>> > happens immediately, because of that write protection, when the cpu
>> > re-executes the store instruction.
>>  It could be that this is done to work around some hw behavior, but
>>  not because of dirty tracking.
>> 
>> > Normally, if a driver calls vmf_insert_pfn_prot and also supplies a
>> > 'pfn_mkwrite' callback in its vm_operations_struct, which requires
>> > the pte to be write-protected for dirty tracking, then vmf_insert_pfn_prot
>> > and the double fault are reasonable. That is not a problem.
>>  Well, as far as I can see that behavior absolutely doesn't make sense.
>> 
>>  When pfn_mkwrite is requested then the driver should use PAGE_COPY,
>>  which is exactly what VMWGFX (the only driver using dirty tracking) is
>>  doing.
>>  Everybody else uses PAGE_SHARED which should make the pte writable
>>  immediately.
>> 
>>  Regards,
>>  Christian.
>> 
>> > However, most drivers calling vmf_insert_pfn_prot do not supply the
>> > 'pfn_mkwrite' callback, so the second fault is unnecessary.
>> >
>> > So, just like the vmf_insert_mixed and vmf_insert_mixed_mkwrite pair,
>> > we should also supply vmf_insert_pfn_mkwrite for drivers.
>> >
>> > Signed-off-by: Xianrong Zhou 
>> > ---
>> > arch/x86/entry/vdso/vma.c  |  3 

Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices

2023-12-04 Thread Alistair Popple


Christian König  writes:

> Am 01.12.23 um 06:48 schrieb Zeng, Oak:
>> [SNIP]

>> Besides memory eviction/oversubscription, there are a few other pain points
>> when I use hmm:
>>
>> 1) hmm doesn't support file-backed memory, so it is hard to share
> memory between processes in a gpu environment. You mentioned you have a
> plan... How hard is it to support file-backed memory in your approach?
>
> As hard as it is to support it through HMM. That's what I meant when I said
> this approach doesn't integrate well; as far as I know the problem
> isn't inside HMM or any other solution but rather in the file system
> layer.

In what way does HMM not support file-backed memory? I was under the
impression that at least hmm_range_fault() does.
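For reference, a minimal hmm_range_fault() call is roughly the sketch below; it
faults in whatever VMA backs the range, anonymous or file-backed. The
mmu_interval_notifier sequence/retry handling described in hmm.rst is omitted,
and notifier/mm/start are placeholders:

	unsigned long pfns[16];
	struct hmm_range range = {
		.notifier      = &notifier,	/* driver's mmu_interval_notifier */
		.start         = start,
		.end           = start + 16 * PAGE_SIZE,
		.hmm_pfns      = pfns,
		.default_flags = HMM_PFN_REQ_FAULT,
	};
	int ret;

	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);	/* 0 on success, pfns[] describes the pages */
	mmap_read_unlock(mm);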

 - Alistair

> Regards,
> Christian.
>
>> 2) virtual address range based memory attributes/hints: with hmadvise,
> where do you save the memory attribute of a virtual address range? Do
> you need to extend vm_area_struct to save it? With hmm, we have to
> maintain such information in the driver. This ends up with pretty
> complicated logic to split/merge those address ranges. I know core mm
> has similar logic to split/merge vmas...
>>
>> Oak
>>
>>
>>> -Weixi
>>>
>>> -Original Message-
>>> From: Christian König
>>> Sent: Thursday, November 30, 2023 4:28 PM
>>> To: Zeng, Oak; Christian König
>>> ; zhuweixi; linux-
>>> m...@kvack.org;linux-ker...@vger.kernel.org;a...@linux-foundation.org;
>>> Danilo Krummrich; Dave Airlie; Daniel
>>> Vetter
>>> Cc:intel-gvt-...@lists.freedesktop.org;rcampb...@nvidia.com;
>>> mhairgr...@nvidia.com;j...@nvidia.com;weixi@openeuler.sh;
>>> jhubb...@nvidia.com;intel-...@lists.freedesktop.org;apop...@nvidia.com;
>>> xinhui@amd.com;amd-gfx@lists.freedesktop.org;
>>> tvrtko.ursu...@linux.intel.com;ogab...@kernel.org;jgli...@redhat.com; dri-
>>> de...@lists.freedesktop.org;z...@nvidia.com; Vivi, Rodrigo
>>> ;alexander.deuc...@amd.com;leo...@nvidia.com;
>>> felix.kuehl...@amd.com; Wang, Zhi A;
>>> mgor...@suse.de
>>> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory
>>> management) for external memory devices
>>>
>>> Hi Oak,
>>>
>>> yeah, #4 is indeed a really good point and I think Felix will agree to that 
>>> as well.
>>>
>>> HMM is basically still missing a way to advise device attributes for the CPU
>>> address space. Both migration strategy as well as device specific 
>>> information (like
>>> cache preferences) fall into this category.
>>>
>>> Since there is a device specific component in those attributes as well I 
>>> think
>>> device specific IOCTLs still make sense to update them, but HMM should offer
>>> the functionality to manage and store that information.
>>>
>>> Split and merge of VMAs only becomes a problem if you attach that
>>> information to VMAs; if you keep them completely separate then it
>>> doesn't become an issue either. The downside of this approach is that
>>> attribute ranges don't automatically extend for growing VMAs, for example.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 29.11.23 um 23:23 schrieb Zeng, Oak:
 Hi Weixi,

 Even though Christian has listed reasons for rejecting this proposal (yes, they are
>>> very reasonable to me), I would keep an open mind and further explore the
>>> possibility here. Since the current GPU drivers use an hmm based implementation
>>> (AMD and NV have done this; at Intel we are catching up), I want to explore how
>>> much we can benefit from the proposed approach and how your approach can solve
>>> some pain points of our development. So basically what I am questioning here is:
>>> what is the advantage of your approach over hmm?
 To implement a UVM (unified virtual address space between cpu and gpu device)
>>> with hmm, the driver essentially needs to implement the functions below:
 1. device page table update. Your approach requires the same because
 this is device specific code.

 2. Some migration functions to migrate memory between system memory and GPU
>>> local memory. My understanding is, even though you generalized this a bit,
>>> such as the modified cpu page fault path and the provided "general"
>>> gm_dev_fault handler, the device driver still needs to provide migration
>>> functions, because migration functions have to be device specific (i.e.,
>>> using the device dma/copy engine for performance). Right?
 3. GPU physical memory management; this part is now in drm/buddy, shared
>>> by all drivers. I think with your approach the driver still needs to provide
>>> callback functions to allocate/free physical pages. Right? Or do you let the
>>> linux core mm buddy manage device memory directly?
 4. madvise/hints/virtual address range management. This has been a pain point
>>> for us. Right now the device driver has to maintain certain virtual address
>>> range data structures to hold hints and other virtual address range based
>>> memory attributes. The driver needs to sync with linux vmas. The driver needs
>>> to explicitly 

Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices

2023-12-01 Thread Alistair Popple


"Zeng, Oak"  writes:

> See inline comments
>
>> -Original Message-
>> From: dri-devel  On Behalf Of
>> zhuweixi
>> Sent: Thursday, November 30, 2023 5:48 AM
>> To: Christian König ; Zeng, Oak
>> ; Christian König ; linux-
>> m...@kvack.org; linux-ker...@vger.kernel.org; a...@linux-foundation.org;
>> Danilo Krummrich ; Dave Airlie ; Daniel
>> Vetter 
>> Cc: tvrtko.ursu...@linux.intel.com; rcampb...@nvidia.com; apop...@nvidia.com;
>> z...@nvidia.com; weixi@openeuler.sh; jhubb...@nvidia.com; intel-
>> g...@lists.freedesktop.org; mhairgr...@nvidia.com; Wang, Zhi A
>> ; xinhui@amd.com; amd-gfx@lists.freedesktop.org;
>> jgli...@redhat.com; dri-de...@lists.freedesktop.org; j...@nvidia.com; Vivi,
>> Rodrigo ; alexander.deuc...@amd.com;
>> felix.kuehl...@amd.com; intel-gvt-...@lists.freedesktop.org;
>> ogab...@kernel.org; leo...@nvidia.com; mgor...@suse.de
>> Subject: RE: [RFC PATCH 0/6] Supporting GMEM (generalized memory
>> management) for external memory devices
>> 
>> Glad to know that there is a common demand for a new syscall like 
>> hmadvise(). I
>> expect it would also be useful for homogeneous NUMA cases. Credits to
>> cudaMemAdvise() API which brought this idea to GMEM's design.
>> 
>> To answer @Oak's questions about GMEM vs. HMM,
>> 
>> Here is the major difference:
>>   GMEM's main target is to stop drivers from reinventing MM code, while
>> HMM/MMU notifiers provide a compatible struct page solution and a
>> coordination mechanism for existing device driver MMs that requires adding
>> extra code to interact with CPU MM.
>> 
>> A straightforward qualitative result for the main target: after integrating 
>> Huawei's
>> Ascend NPU driver with GMEM's interface, 30,000 lines of MM code were cut,
>> leaving <100 lines invoking GMEM interface and 3,700 lines implementing 
>> vendor-
>> specific functions. Some code from the 3,700 lines should be further moved to
>> GMEM as a generalized feature like device memory oversubscription, but not
>> included in this RFC patch yet.
>> 
>> A list of high-level differences:
>>   1. With HMM/MMU notifiers, drivers need to first implement a full MM
>> subsystem. With GMEM, drivers can reuse Linux's core MM.
>
> A full mm subsystem essentially has the functions below:
>
> Physical memory management: neither your approach nor an hmm-based
> solution provides device physical memory management. You mentioned you
> have a plan, but at least for now the driver needs to manage device
> physical memory.
>
> Virtual address space management: both approaches leverage linux core mm / vma
> for this.
>
> Data eviction, migration: with hmm, the driver needs to implement this. It
> is not clear whether gmem has this function. I guess even if gmem has it,
> it might be a slow cpu data copy, compared to a modern gpu's fast data
> copy engine.
>
> Device page table update, va-pa mapping: I think it is the driver's
> responsibility in both approaches.
>
> So from the point of reusing core MM, I don't see a big difference. Maybe
> you did it more elegantly. I think it is very possible that with your
> approach the driver can be simpler, with less code.
>
>> 
>>   2. HMM encodes device mapping information in the CPU arch-dependent PTEs,
>> while GMEM proposes an abstraction layer in vm_object. Since GMEM's
>> approach further decouples the arch-related stuff, drivers do not need to
>> implement separate code for X86/ARM and etc.
>
> I don't understand this... with hmm, when a virtual address range's
> backing store is in device memory, the cpu pte is encoded to point to
> device memory. The device page table is also encoded to point to the same
> device memory location. But since device memory is not accessible to the
> CPU (DEVICE_PRIVATE), when the cpu accesses this virtual address there
> is a cpu page fault. Device mapping info is still in the device page
> table, not in cpu ptes.
>
> I do not see why, with hmm, the driver needs to implement x86/arm
> code... the driver only takes care of the device page table. Hmm/core mm takes
> care of the cpu page table, right?

I see our replies have crossed, but that is my understanding as well.

>> 
>>   3. MMU notifiers register hooks at certain core MM events, while GMEM
>> declares basic functions and internally invokes them. GMEM requires less from
>> the driver side -- no need to understand what core MM behaves at certain MMU
>> events. GMEM also expects fewer bugs than MMU notifiers: implementing basic
>> operations with standard declarations vs. implementing whatever random device
>> MM logic in MMU notifiers.
>
> This seems true to me. I feel the mmu notifier thing, especially the
> synchronization/lock design (those sequence numbers, interacting with the
> driver lock, and the mmap lock), is very complicated. I indeed spent
> time understanding the specification documented in hmm.rst...

No argument there, but I think that's something we could look at
providing an improved interface for. I don't think it needs a whole new
subsystem to fix. Probably just a version of hmm_range_fault() that
takes the lock and 

Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices

2023-12-01 Thread Alistair Popple


zhuweixi  writes:

> Glad to know that there is a common demand for a new syscall like
> hmadvise(). I expect it would also be useful for homogeneous NUMA
> cases. Credits to cudaMemAdvise() API which brought this idea to
> GMEM's design.

It's not clear to me that this would need to be a new syscall. Scanning
the patches it looks like you're adding a NUMA node anyway, so the
existing interfaces (eg. madvise) with their various options
(MPOL_PREFERRED/PREFERRED_MANY) and
set_mempolicy/set_mempolicy_home_node() could potentially cover this for
both NUMA and hNUMA nodes. The main advantage here would be providing a
common userspace interface for setting these kinds of hints.
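For example, a per-range placement hint with the existing interfaces could look
roughly like this from userspace (sketch only; gpu_node is a hypothetical
(h)NUMA node id):

	#include <numaif.h>	/* mbind(), MPOL_PREFERRED; link with -lnuma */
	#include <stdio.h>

	/* Ask the kernel to prefer placing [addr, addr + len) on gpu_node. */
	static int prefer_node(void *addr, unsigned long len, int gpu_node)
	{
		unsigned long nodemask = 1UL << gpu_node;

		if (mbind(addr, len, MPOL_PREFERRED, &nodemask, gpu_node + 2,
			  MPOL_MF_MOVE)) {
			perror("mbind");
			return -1;
		}
		return 0;
	}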

> To answer @Oak's questions about GMEM vs. HMM,
>
> Here is the major difference:
>   GMEM's main target is to stop drivers from reinventing MM code,
> while HMM/MMU notifiers provide a compatible struct page solution and
> a coordination mechanism for existing device driver MMs that requires
> adding extra code to interact with CPU MM.
>
> A straightforward qualitative result for the main target: after
> integrating Huawei's Ascend NPU driver with GMEM's interface, 30,000
> lines of MM code were cut, leaving <100 lines invoking GMEM interface
> and 3,700 lines implementing vendor-specific functions. Some code from
> the 3,700 lines should be further moved to GMEM as a generalized
> feature like device memory oversubscription, but not included in this
> RFC patch yet.

I think it would be helpful if you could be a bit more specific on what
functionality the current HMM/migrate_vma/MMU notifier interfaces are
missing that every driver has to implement in a common way. Because I'm
not convinced we can't either improve those interfaces to provide what's
needed or add specific components (eg. a physical page allocator)
instead of a whole new framework.

> A list of high-level differences: 
>   1. With HMM/MMU notifiers, drivers need to first implement a full MM 
> subsystem. With GMEM, drivers can reuse Linux's core MM.

What do the common bits of this full MM subsystem look like?
Fundamentally the existing HMM functionality can already make use of
Linux core MM to manage page tables and migrate pages and everything
else seems pretty device specific (ie. actual copying of data,
programming of MMUs, TLBs, etc.)

I can see that there would be scope to have say a generic memory
allocator, which I vaguely recall discussing in relation to
> DEVICE_PRIVATE pages in the past but @Oak suggests something close
already exists (drm/buddy).

Potentially I suppose there is VA allocation that might be common across
devices. However I have not had any experience working with devices with
VA requirements different enough from the CPU to matter. If they are so
different I'm not convinced it would be easy to have a common
implementation anyway.

>   2. HMM encodes device mapping information in the CPU arch-dependent
> PTEs, while GMEM proposes an abstraction layer in vm_object. Since
> GMEM's approach further decouples the arch-related stuff, drivers do
> not need to implement separate code for X86/ARM and etc.

I'm not following this. At present all HMM encodes in CPU PTEs is the
fact a page has been migrated to the device and what permissions it
has. I'm not aware of needing to treat X86 and ARM differently for
example here. Are you saying you want somewhere to store other bits
attached to a particular VA?

>   3. MMU notifiers register hooks at certain core MM events, while
> GMEM declares basic functions and internally invokes them. GMEM
> requires less from the driver side -- no need to understand what core
> MM behaves at certain MMU events. GMEM also expects fewer bugs than
> MMU notifiers: implementing basic operations with standard
> declarations vs. implementing whatever random device MM logic in MMU
> notifiers.

How is this proposal any different though? From what I can see it
replaces MMU notifier callbacks with TLB invalidation callbacks, but
that is essentially what MMU notifier callbacks are anyway. The "random
device MM logic" should just be clearing device TLBs. What other MM
logic has to be implemented in the MMU notifier callbacks that is the
same between devices?
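For comparison, the device-specific part of an interval notifier can be as
small as the sketch below (struct my_dev_mm and dev_tlb_flush() are made-up
driver names):

	static bool my_invalidate(struct mmu_interval_notifier *mni,
				  const struct mmu_notifier_range *range,
				  unsigned long cur_seq)
	{
		struct my_dev_mm *dmm = container_of(mni, struct my_dev_mm, notifier);

		if (mmu_notifier_range_blockable(range))
			mutex_lock(&dmm->lock);
		else if (!mutex_trylock(&dmm->lock))
			return false;

		mmu_interval_set_seq(mni, cur_seq);
		/* The only "device MM logic": drop the device's mappings/TLB entries. */
		dev_tlb_flush(dmm, range->start, range->end);
		mutex_unlock(&dmm->lock);
		return true;
	}

	static const struct mmu_interval_notifier_ops my_notifier_ops = {
		.invalidate = my_invalidate,
	};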

>   4. GMEM plans to support a more lightweight physical memory
> management. The discussion about this part can be found in my cover
> letter. The question is whether struct page should be compatible
> (directly using HMM's ZONE_DEVICE solution) or whether a trimmed, smaller struct
> page that satisfies generalized demands from accelerators is
> preferable?

What is wrong with the current ZONE_DEVICE solution? You mention size of
struct page, but that is already being worked on through the conversion
to folios. Admittedly higher order HMM ZONE_DEVICE folios are not
currently supported, but that is something I'm actively working on at
the moment.

>   5. GMEM has been demonstrated to allow device memory
> oversubscription (a GMEM-based 32GB NPU card can run a GPT model
> 

Re: [PATCH v2 0/8] Fix several device private page reference counting issues

2022-10-26 Thread Alistair Popple


"Vlastimil Babka (SUSE)"  writes:

> On 9/28/22 14:01, Alistair Popple wrote:
>> This series aims to fix a number of page reference counting issues in
>> drivers dealing with device private ZONE_DEVICE pages. These result in
>> use-after-free type bugs, either from accessing a struct page which no
>> longer exists because it has been removed or accessing fields within the
>> struct page which are no longer valid because the page has been freed.
>>
>> During normal usage it is unlikely these will cause any problems. However
>> without these fixes it is possible to crash the kernel from userspace.
>> These crashes can be triggered either by unloading the kernel module or
>> unbinding the device from the driver prior to a userspace task exiting. In
>> modules such as Nouveau it is also possible to trigger some of these issues
>> by explicitly closing the device file-descriptor prior to the task exiting
>> and then accessing device private memory.
>
> Hi, as this series was noticed to create a CVE [1], do you think a stable
> backport is warranted? I think the "It is possible to launch the attack
> remotely." in [1] is incorrect though, right?

Right, I don't see how this could be exploited remotely. And I'm pretty
sure you need root as well because in practice the pgmap needs to be
freed, and for Nouveau at least that only happens on device removal.

> It looks to me that patch 1 would be needed since the CONFIG_DEVICE_PRIVATE
> introduction, while the following few only to kernels with 27674ef6c73f
> (probably not so critical as that includes no LTS)?

Patch 3 already has a fixes tag for 27674ef6c73f. Patch 1 would need to
go back to CONFIG_DEVICE_PRIVATE introduction. I think patches 4-8 would
also need to go back to introduction of CONFIG_DEVICE_PRIVATE, but there
isn't as much impact there and they would be harder to backport I think.
Without them device removal can loop indefinitely in kernel mode (if
patch 3 is present or the kernel is older than 27674ef6c73f).

 - Alistair

> Thanks,
> Vlastimil
>
> [1] https://nvd.nist.gov/vuln/detail/CVE-2022-3523
>
>> This involves some minor changes to both PowerPC and AMD GPU code.
>> Unfortunately I lack hardware to test either of those so any help there
>> would be appreciated. The changes mimic what is done in for both Nouveau
>> and hmm-tests though so I doubt they will cause problems.
>>
>> To: Andrew Morton 
>> To: linux...@kvack.org
>> Cc: linux-ker...@vger.kernel.org
>> Cc: amd-gfx@lists.freedesktop.org
>> Cc: nouv...@lists.freedesktop.org
>> Cc: dri-de...@lists.freedesktop.org
>>
>> Alistair Popple (8):
>>   mm/memory.c: Fix race when faulting a device private page
>>   mm: Free device private pages have zero refcount
>>   mm/memremap.c: Take a pgmap reference on page allocation
>>   mm/migrate_device.c: Refactor migrate_vma and 
>> migrate_deivce_coherent_page()
>>   mm/migrate_device.c: Add migrate_device_range()
>>   nouveau/dmem: Refactor nouveau_dmem_fault_copy_one()
>>   nouveau/dmem: Evict device private memory during release
>>   hmm-tests: Add test for migrate_device_range()
>>
>>  arch/powerpc/kvm/book3s_hv_uvmem.c   |  17 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  19 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |   2 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c |  11 +-
>>  drivers/gpu/drm/nouveau/nouveau_dmem.c   | 108 +++
>>  include/linux/memremap.h |   1 +-
>>  include/linux/migrate.h  |  15 ++-
>>  lib/test_hmm.c   | 129 ++---
>>  lib/test_hmm_uapi.h  |   1 +-
>>  mm/memory.c  |  16 +-
>>  mm/memremap.c|  30 ++-
>>  mm/migrate.c |  34 +--
>>  mm/migrate_device.c  | 239 +---
>>  mm/page_alloc.c  |   8 +-
>>  tools/testing/selftests/vm/hmm-tests.c   |  49 +-
>>  15 files changed, 516 insertions(+), 163 deletions(-)
>>
>> base-commit: 088b8aa537c2c767765f1c19b555f21ffe555786


Re: [PATCH v2 1/8] mm/memory.c: Fix race when faulting a device private page

2022-10-03 Thread Alistair Popple


Felix Kuehling  writes:

> On 2022-09-28 08:01, Alistair Popple wrote:
>> When the CPU tries to access a device private page the migrate_to_ram()
>> callback associated with the pgmap for the page is called. However no
>> reference is taken on the faulting page. Therefore a concurrent
>> migration of the device private page can free the page and possibly the
>> underlying pgmap. This results in a race which can crash the kernel due
>> to the migrate_to_ram() function pointer becoming invalid. It also means
>> drivers can't reliably read the zone_device_data field because the page
>> may have been freed with memunmap_pages().
>>
>> Close the race by getting a reference on the page while holding the ptl
>> to ensure it has not been freed. Unfortunately the elevated reference
>> count will cause the migration required to handle the fault to fail. To
>> avoid this failure pass the faulting page into the migrate_vma functions
>> so that if an elevated reference count is found it can be checked to see
>> if it's expected or not.
>
> Do we really have to drag the fault_page all the way into the migrate
> structure? Is the elevated refcount still needed at that time? Maybe it would
> be easier to drop the refcount early in the ops->migrate_to_ram callbacks, so
> we won't have to deal with it in all the migration code.

That would also work. Honestly I don't really like either solution :-)

I didn't like having to plumb it all through the migration code
but I ended up going this way because I felt it was easier to explain
the lifetime of vmf->page for the migrate_to_ram() callback. This way
vmf->page is guaranteed to be valid for the duration of the
migrate_to_ram() callback.

As you suggest we could instead have drivers call put_page(vmf->page)
somewhere in migrate_to_ram() before calling migrate_vma_setup(). The
reason I didn't go this way is IMHO it's more subtle because in general
the page will remain valid after that put_page() anyway. So it would be
easy for drivers to introduce a bug assuming the vmf->page is still
valid and not notice because most of the time it is.
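Roughly, that alternative would look something like this in each driver (sketch
only; the names and pgmap_owner are placeholders and error handling is
trimmed):

	static vm_fault_t my_migrate_to_ram(struct vm_fault *vmf)
	{
		unsigned long src = 0, dst = 0;
		struct migrate_vma args = {
			.vma         = vmf->vma,
			.start       = vmf->address,
			.end         = vmf->address + PAGE_SIZE,
			.src         = &src,
			.dst         = &dst,
			.pgmap_owner = my_pgmap_owner,
			.flags       = MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
		};

		/*
		 * Drop the reference the fault path took. After this point
		 * vmf->page may be freed by a concurrent migration, so it must
		 * not be dereferenced again.
		 */
		put_page(vmf->page);

		if (migrate_vma_setup(&args))
			return VM_FAULT_SIGBUS;
		/* allocate a system page, copy the data into it, set dst, then: */
		migrate_vma_pages(&args);
		migrate_vma_finalize(&args);
		return 0;
	}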

Let me know if you disagree with my reasoning though - would appreciate
any review here.

> Regards,
>   Felix
>
>
>>
>> Signed-off-by: Alistair Popple 
>> Cc: Jason Gunthorpe 
>> Cc: John Hubbard 
>> Cc: Ralph Campbell 
>> Cc: Michael Ellerman 
>> Cc: Felix Kuehling 
>> Cc: Lyude Paul 
>> ---
>>   arch/powerpc/kvm/book3s_hv_uvmem.c   | 15 ++-
>>   drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17 +++--
>>   drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |  2 +-
>>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 11 +---
>>   include/linux/migrate.h  |  8 ++-
>>   lib/test_hmm.c   |  7 ++---
>>   mm/memory.c  | 16 +++-
>>   mm/migrate.c | 34 ++---
>>   mm/migrate_device.c  | 18 +
>>   9 files changed, 87 insertions(+), 41 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
>> b/arch/powerpc/kvm/book3s_hv_uvmem.c
>> index 5980063..d4eacf4 100644
>> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
>> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
>> @@ -508,10 +508,10 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
>>   static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
>>  unsigned long start,
>>  unsigned long end, unsigned long page_shift,
>> -struct kvm *kvm, unsigned long gpa)
>> +struct kvm *kvm, unsigned long gpa, struct page *fault_page)
>>   {
>>  unsigned long src_pfn, dst_pfn = 0;
>> -struct migrate_vma mig;
>> +struct migrate_vma mig = { 0 };
>>  struct page *dpage, *spage;
>>  struct kvmppc_uvmem_page_pvt *pvt;
>>  unsigned long pfn;
>> @@ -525,6 +525,7 @@ static int __kvmppc_svm_page_out(struct vm_area_struct 
>> *vma,
>>  mig.dst = &dst_pfn;
>>  mig.pgmap_owner = &kvmppc_uvmem_pgmap;
>>  mig.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
>> +mig.fault_page = fault_page;
>>  /* The requested page is already paged-out, nothing to do */
>>  if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
>> @@ -580,12 +581,14 @@ static int __kvmppc_svm_page_out(struct vm_area_struct 
>> *vma,
>>   static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
>>unsigned long start, unsigned long end,
>>unsigned long page_shift,
>> -   

Re: [PATCH 2/7] mm: Free device private pages have zero refcount

2022-09-30 Thread Alistair Popple


Dan Williams  writes:

> Alistair Popple wrote:
>>
>> Jason Gunthorpe  writes:
>>
>> > On Mon, Sep 26, 2022 at 04:03:06PM +1000, Alistair Popple wrote:
>> >> Since 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page
>> >> refcount") device private pages have no longer had an extra reference
>> >> count when the page is in use. However before handing them back to the
>> >> owning device driver we add an extra reference count such that free
>> >> pages have a reference count of one.
>> >>
>> >> This makes it difficult to tell if a page is free or not because both
>> >> free and in use pages will have a non-zero refcount. Instead we should
>> >> return pages to the drivers page allocator with a zero reference count.
>> >> Kernel code can then safely use kernel functions such as
>> >> get_page_unless_zero().
>> >>
>> >> Signed-off-by: Alistair Popple 
>> >> ---
>> >>  arch/powerpc/kvm/book3s_hv_uvmem.c   | 1 +
>> >>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 +
>> >>  drivers/gpu/drm/nouveau/nouveau_dmem.c   | 1 +
>> >>  lib/test_hmm.c   | 1 +
>> >>  mm/memremap.c| 5 -
>> >>  mm/page_alloc.c  | 6 ++
>> >>  6 files changed, 10 insertions(+), 5 deletions(-)
>> >
>> > I think this is a great idea, but I'm surprised no dax stuff is
>> > touched here?
>>
>> free_zone_device_page() shouldn't be called for pgmap->type ==
>> MEMORY_DEVICE_FS_DAX so I don't think we should have to worry about DAX
>> there. Except that the folio code looks like it might have introduced a
>> bug. AFAICT put_page() always calls
>> put_devmap_managed_page(&folio->page) but folio_put() does not (although
>> folios_put() does!). So it seems folio_put() won't end up calling
>> __put_devmap_managed_page_refs() as I think it should.
>>
>> I think you're right about the change to __init_zone_device_page() - I
>> should limit it to DEVICE_PRIVATE/COHERENT pages only. But I need to
>> look at Dan's patch series more closely as I suspect it might be better
>> to rebase this patch on top of that.
>
> Apologies for the delay I was travelling the past few days. Yes, I think
> this patch slots in nicely to avoid the introduction of an init_mode
> [1]:
>
> https://lore.kernel.org/nvdimm/166329940343.2786261.6047770378829215962.st...@dwillia2-xfh.jf.intel.com/
>
> Mind if I steal it into my series?

No problem, although I notice Andrew has already merged it into
mm-unstable. If you end up rebasing your series on top of mine I think
all that's needed is a patch somewhere in your series to drop the
various `if (pgmap->type == MEMORY_DEVICE_*)` I added to (hopefully)
avoid breaking DAX. Assuming DAX takes a pagemap reference on struct
page allocation something like below.

---

diff --git a/mm/memremap.c b/mm/memremap.c
index 421bec3a29ee..da1a0e0abb8b 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -507,15 +507,7 @@ void free_zone_device_page(struct page *page)
page->mapping = NULL;
page->pgmap->ops->page_free(page);

-   if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
-   page->pgmap->type != MEMORY_DEVICE_COHERENT)
-   /*
-* Reset the page count to 1 to prepare for handing out the page
-* again.
-*/
-   set_page_count(page, 1);
-   else
-   put_dev_pagemap(page->pgmap);
+   put_dev_pagemap(page->pgmap);
 }

 void zone_device_page_init(struct page *page)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 014dbdf54d62..3e5ff06700ca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6816,9 +6816,7 @@ static void __ref __init_zone_device_page(struct page 
*page, unsigned long pfn,
 * ZONE_DEVICE pages are released directly to the driver page allocator
 * which will set the page count to 1 when allocating the page.
 */
-   if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
-   pgmap->type == MEMORY_DEVICE_COHERENT)
-   set_page_count(page, 0);
+   set_page_count(page, 0);
 }

 /*


Re: [PATCH v2 8/8] hmm-tests: Add test for migrate_device_range()

2022-09-29 Thread Alistair Popple


Andrew Morton  writes:

> On Wed, 28 Sep 2022 22:01:22 +1000 Alistair Popple  wrote:
>
>> @@ -1401,22 +1494,7 @@ static int dmirror_device_init(struct dmirror_device 
>> *mdevice, int id)
>>
>>  static void dmirror_device_remove(struct dmirror_device *mdevice)
>>  {
>> -unsigned int i;
>> -
>> -if (mdevice->devmem_chunks) {
>> -for (i = 0; i < mdevice->devmem_count; i++) {
>> -struct dmirror_chunk *devmem =
>> -mdevice->devmem_chunks[i];
>> -
>> -memunmap_pages(&devmem->pagemap);
>> -if (devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
>> -release_mem_region(devmem->pagemap.range.start,
>> -   range_len(&devmem->pagemap.range));
>> -kfree(devmem);
>> -}
>> -kfree(mdevice->devmem_chunks);
>> -}
>> -
>> +dmirror_device_remove_chunks(mdevice);
>>  cdev_del(&mdevice->cdevice);
>>  }
>
> Needed a bit of rework due to
> https://lkml.kernel.org/r/20220826050631.25771-1-mpent...@redhat.com.
> Please check my resolution.

Thanks. Rework looks good to me.

> --- a/lib/test_hmm.c~hmm-tests-add-test-for-migrate_device_range
> +++ a/lib/test_hmm.c
> @@ -100,6 +100,7 @@ struct dmirror {
>  struct dmirror_chunk {
>   struct dev_pagemap  pagemap;
>   struct dmirror_device   *mdevice;
> + bool remove;
>  };
>
>  /*
> @@ -192,11 +193,15 @@ static int dmirror_fops_release(struct i
>   return 0;
>  }
>
> +static struct dmirror_chunk *dmirror_page_to_chunk(struct page *page)
> +{
> + return container_of(page->pgmap, struct dmirror_chunk, pagemap);
> +}
> +
>  static struct dmirror_device *dmirror_page_to_device(struct page *page)
>
>  {
> - return container_of(page->pgmap, struct dmirror_chunk,
> - pagemap)->mdevice;
> + return dmirror_page_to_chunk(page)->mdevice;
>  }
>
>  static int dmirror_do_fault(struct dmirror *dmirror, struct hmm_range *range)
> @@ -1218,6 +1223,85 @@ static int dmirror_snapshot(struct dmirr
>   return ret;
>  }
>
> +static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
> +{
> + unsigned long start_pfn = chunk->pagemap.range.start >> PAGE_SHIFT;
> + unsigned long end_pfn = chunk->pagemap.range.end >> PAGE_SHIFT;
> + unsigned long npages = end_pfn - start_pfn + 1;
> + unsigned long i;
> + unsigned long *src_pfns;
> + unsigned long *dst_pfns;
> +
> + src_pfns = kcalloc(npages, sizeof(*src_pfns), GFP_KERNEL);
> + dst_pfns = kcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL);
> +
> + migrate_device_range(src_pfns, start_pfn, npages);
> + for (i = 0; i < npages; i++) {
> + struct page *dpage, *spage;
> +
> + spage = migrate_pfn_to_page(src_pfns[i]);
> + if (!spage || !(src_pfns[i] & MIGRATE_PFN_MIGRATE))
> + continue;
> +
> + if (WARN_ON(!is_device_private_page(spage) &&
> + !is_device_coherent_page(spage)))
> + continue;
> + spage = BACKING_PAGE(spage);
> + dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
> + lock_page(dpage);
> + copy_highpage(dpage, spage);
> + dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
> + if (src_pfns[i] & MIGRATE_PFN_WRITE)
> + dst_pfns[i] |= MIGRATE_PFN_WRITE;
> + }
> + migrate_device_pages(src_pfns, dst_pfns, npages);
> + migrate_device_finalize(src_pfns, dst_pfns, npages);
> + kfree(src_pfns);
> + kfree(dst_pfns);
> +}
> +
> +/* Removes free pages from the free list so they can't be re-allocated */
> +static void dmirror_remove_free_pages(struct dmirror_chunk *devmem)
> +{
> + struct dmirror_device *mdevice = devmem->mdevice;
> + struct page *page;
> +
> + for (page = mdevice->free_pages; page; page = page->zone_device_data)
> + if (dmirror_page_to_chunk(page) == devmem)
> + mdevice->free_pages = page->zone_device_data;
> +}
> +
> +static void dmirror_device_remove_chunks(struct dmirror_device *mdevice)
> +{
> + unsigned int i;
> +
> + mutex_lock(&mdevice->devmem_lock);
> + if (mdevice->devmem_chunks) {
> + for (i = 0; i < mdevice->devmem_count; i++) {
> + struct dmirror_

Re: [PATCH 1/7] mm/memory.c: Fix race when faulting a device private page

2022-09-28 Thread Alistair Popple


Michael Ellerman  writes:

> Alistair Popple  writes:
>> When the CPU tries to access a device private page the migrate_to_ram()
>> callback associated with the pgmap for the page is called. However no
>> reference is taken on the faulting page. Therefore a concurrent
>> migration of the device private page can free the page and possibly the
>> underlying pgmap. This results in a race which can crash the kernel due
>> to the migrate_to_ram() function pointer becoming invalid. It also means
>> drivers can't reliably read the zone_device_data field because the page
>> may have been freed with memunmap_pages().
>>
>> Close the race by getting a reference on the page while holding the ptl
>> to ensure it has not been freed. Unfortunately the elevated reference
>> count will cause the migration required to handle the fault to fail. To
>> avoid this failure pass the faulting page into the migrate_vma functions
>> so that if an elevated reference count is found it can be checked to see
>> if it's expected or not.
>>
>> Signed-off-by: Alistair Popple 
>> ---
>>  arch/powerpc/kvm/book3s_hv_uvmem.c   | 15 ++-
>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17 +++--
>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |  2 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 11 +---
>>  include/linux/migrate.h  |  8 ++-
>>  lib/test_hmm.c   |  7 ++---
>>  mm/memory.c  | 16 +++-
>>  mm/migrate.c | 34 ++---
>>  mm/migrate_device.c  | 18 +
>>  9 files changed, 87 insertions(+), 41 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
>> b/arch/powerpc/kvm/book3s_hv_uvmem.c
>> index 5980063..d4eacf4 100644
>> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
>> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
>> @@ -508,10 +508,10 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
>>  static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
>>  unsigned long start,
>>  unsigned long end, unsigned long page_shift,
>> -struct kvm *kvm, unsigned long gpa)
>> +struct kvm *kvm, unsigned long gpa, struct page *fault_page)
>>  {
>>  unsigned long src_pfn, dst_pfn = 0;
>> -struct migrate_vma mig;
>> +struct migrate_vma mig = { 0 };
>>  struct page *dpage, *spage;
>>  struct kvmppc_uvmem_page_pvt *pvt;
>>  unsigned long pfn;
>> @@ -525,6 +525,7 @@ static int __kvmppc_svm_page_out(struct vm_area_struct 
>> *vma,
>>  mig.dst = &dst_pfn;
>>  mig.pgmap_owner = &kvmppc_uvmem_pgmap;
>>  mig.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
>> +mig.fault_page = fault_page;
>>
>>  /* The requested page is already paged-out, nothing to do */
>>  if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
>> @@ -580,12 +581,14 @@ static int __kvmppc_svm_page_out(struct vm_area_struct 
>> *vma,
>>  static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
>>unsigned long start, unsigned long end,
>>unsigned long page_shift,
>> -  struct kvm *kvm, unsigned long gpa)
>> +  struct kvm *kvm, unsigned long gpa,
>> +  struct page *fault_page)
>>  {
>>  int ret;
>>
>>  mutex_lock(&kvm->arch.uvmem_lock);
>> -ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa);
>> +ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa,
>> +fault_page);
>>  mutex_unlock(&kvm->arch.uvmem_lock);
>>
>>  return ret;
>> @@ -736,7 +739,7 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma,
>>  bool pagein)
>>  {
>>  unsigned long src_pfn, dst_pfn = 0;
>> -struct migrate_vma mig;
>> +struct migrate_vma mig = { 0 };
>>  struct page *spage;
>>  unsigned long pfn;
>>  struct page *dpage;
>> @@ -994,7 +997,7 @@ static vm_fault_t kvmppc_uvmem_migrate_to_ram(struct 
>> vm_fault *vmf)
>>
>>  if (kvmppc_svm_page_out(vmf->vma, vmf->address,
>>  vmf->address + PAGE_SIZE, PAGE_SHIFT,
>> -pvt->kvm, pvt->gpa))
>> +pvt->kvm, pvt->gpa, vmf->page))
>>  return VM_FAULT_SIG

[PATCH v2 1/8] mm/memory.c: Fix race when faulting a device private page

2022-09-28 Thread Alistair Popple
When the CPU tries to access a device private page the migrate_to_ram()
callback associated with the pgmap for the page is called. However no
reference is taken on the faulting page. Therefore a concurrent
migration of the device private page can free the page and possibly the
underlying pgmap. This results in a race which can crash the kernel due
to the migrate_to_ram() function pointer becoming invalid. It also means
drivers can't reliably read the zone_device_data field because the page
may have been freed with memunmap_pages().

Close the race by getting a reference on the page while holding the ptl
to ensure it has not been freed. Unfortunately the elevated reference
count will cause the migration required to handle the fault to fail. To
avoid this failure pass the faulting page into the migrate_vma functions
so that if an elevated reference count is found it can be checked to see
if it's expected or not.
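The mm/memory.c hunk is what closes the race; the idea is roughly the following
(a sketch of the logic in do_swap_page(), not the verbatim hunk):

	} else if (is_device_private_entry(entry)) {
		vmf->page = pfn_swap_entry_to_page(entry);
		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
					       vmf->address, &vmf->ptl);
		if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
			pte_unmap_unlock(vmf->pte, vmf->ptl);
			goto out;
		}

		/*
		 * Take a reference while the ptl guarantees the page has not
		 * been freed, and drop it once migrate_to_ram() returns.
		 */
		get_page(vmf->page);
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
		put_page(vmf->page);
	}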

Signed-off-by: Alistair Popple 
Cc: Jason Gunthorpe 
Cc: John Hubbard 
Cc: Ralph Campbell 
Cc: Michael Ellerman 
Cc: Felix Kuehling 
Cc: Lyude Paul 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c   | 15 ++-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17 +++--
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 11 +---
 include/linux/migrate.h  |  8 ++-
 lib/test_hmm.c   |  7 ++---
 mm/memory.c  | 16 +++-
 mm/migrate.c | 34 ++---
 mm/migrate_device.c  | 18 +
 9 files changed, 87 insertions(+), 41 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 5980063..d4eacf4 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -508,10 +508,10 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
 static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
unsigned long start,
unsigned long end, unsigned long page_shift,
-   struct kvm *kvm, unsigned long gpa)
+   struct kvm *kvm, unsigned long gpa, struct page *fault_page)
 {
unsigned long src_pfn, dst_pfn = 0;
-   struct migrate_vma mig;
+   struct migrate_vma mig = { 0 };
struct page *dpage, *spage;
struct kvmppc_uvmem_page_pvt *pvt;
unsigned long pfn;
@@ -525,6 +525,7 @@ static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
mig.dst = &dst_pfn;
mig.pgmap_owner = &kvmppc_uvmem_pgmap;
mig.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
+   mig.fault_page = fault_page;
 
/* The requested page is already paged-out, nothing to do */
if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
@@ -580,12 +581,14 @@ static int __kvmppc_svm_page_out(struct vm_area_struct 
*vma,
 static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
  unsigned long start, unsigned long end,
  unsigned long page_shift,
- struct kvm *kvm, unsigned long gpa)
+ struct kvm *kvm, unsigned long gpa,
+ struct page *fault_page)
 {
int ret;
 
mutex_lock(&kvm->arch.uvmem_lock);
-   ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa);
+   ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa,
+   fault_page);
mutex_unlock(&kvm->arch.uvmem_lock);
 
return ret;
@@ -736,7 +739,7 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma,
bool pagein)
 {
unsigned long src_pfn, dst_pfn = 0;
-   struct migrate_vma mig;
+   struct migrate_vma mig = { 0 };
struct page *spage;
unsigned long pfn;
struct page *dpage;
@@ -994,7 +997,7 @@ static vm_fault_t kvmppc_uvmem_migrate_to_ram(struct 
vm_fault *vmf)
 
if (kvmppc_svm_page_out(vmf->vma, vmf->address,
vmf->address + PAGE_SIZE, PAGE_SHIFT,
-   pvt->kvm, pvt->gpa))
+   pvt->kvm, pvt->gpa, vmf->page))
return VM_FAULT_SIGBUS;
else
return 0;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index b059a77..776448b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -409,7 +409,7 @@ svm_migrate_vma_to_vram(struct amdgpu_device *adev, struct 
svm_range *prange,
uint64_t npages = (end - start) >> PAGE_SHIFT;
struct kfd_process_device *pdd;
struct dma_fence *mfence = NULL;
-   struct migrate_vma migrate;
+   struct migrate_vma migrate = { 0 };
unsigned long cpages = 0;
dma_add

[PATCH v2 5/8] mm/migrate_device.c: Add migrate_device_range()

2022-09-28 Thread Alistair Popple
Device drivers can use the migrate_vma family of functions to migrate
existing private anonymous mappings to device private pages. These pages
are backed by memory on the device with drivers being responsible for
copying data to and from device memory.

Device private pages are freed via the pgmap->page_free() callback when
they are unmapped and their refcount drops to zero. Alternatively they
may be freed indirectly via migration back to CPU memory in response to
a pgmap->migrate_to_ram() callback called whenever the CPU accesses
an address mapped to a device private page.

In other words drivers cannot control the lifetime of data allocated on
the devices and must wait until these pages are freed from userspace.
This causes issues when memory needs to be reclaimed on the device, either
because the device is going away due to a ->release() callback or
because another user needs to use the memory.

Drivers could use the existing migrate_vma functions to migrate data off
the device. However this would require them to track the mappings of
each page which is both complicated and not always possible. Instead
drivers need to be able to migrate device pages directly so they can
free up device memory.

To allow that, this patch introduces the migrate_device family of
functions, which are functionally similar to migrate_vma but skip
the initial lookup based on the mapping.
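A driver-side eviction loop using these functions then looks roughly like the
sketch below (modelled on the hmm-tests change later in this series; start_pfn
and npages are placeholders and the copy step is device specific):

	unsigned long *src, *dst;
	unsigned long i;

	src = kcalloc(npages, sizeof(*src), GFP_KERNEL);
	dst = kcalloc(npages, sizeof(*dst), GFP_KERNEL);

	/* Collect and unmap any still-mapped device private pages in the range. */
	migrate_device_range(src, start_pfn, npages);

	for (i = 0; i < npages; i++) {
		struct page *dpage, *spage = migrate_pfn_to_page(src[i]);

		if (!spage || !(src[i] & MIGRATE_PFN_MIGRATE))
			continue;
		dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
		lock_page(dpage);
		/* device specific copy of spage's data into dpage goes here */
		dst[i] = migrate_pfn(page_to_pfn(dpage));
	}

	migrate_device_pages(src, dst, npages);
	migrate_device_finalize(src, dst, npages);
	kfree(src);
	kfree(dst);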

Signed-off-by: Alistair Popple 
Cc: "Huang, Ying" 
Cc: Zi Yan 
Cc: Matthew Wilcox 
Cc: Yang Shi 
Cc: David Hildenbrand 
Cc: Ralph Campbell 
Cc: John Hubbard 
---
 include/linux/migrate.h |  7 +++-
 mm/migrate_device.c | 89 ++
 2 files changed, 89 insertions(+), 7 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 82ffa47..582cdc7 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -225,6 +225,13 @@ struct migrate_vma {
 int migrate_vma_setup(struct migrate_vma *args);
 void migrate_vma_pages(struct migrate_vma *migrate);
 void migrate_vma_finalize(struct migrate_vma *migrate);
+int migrate_device_range(unsigned long *src_pfns, unsigned long start,
+   unsigned long npages);
+void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
+   unsigned long npages);
+void migrate_device_finalize(unsigned long *src_pfns,
+   unsigned long *dst_pfns, unsigned long npages);
+
 #endif /* CONFIG_MIGRATION */
 
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index ba479b5..824860a 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -681,7 +681,7 @@ static void migrate_vma_insert_page(struct migrate_vma 
*migrate,
*src &= ~MIGRATE_PFN_MIGRATE;
 }
 
-static void migrate_device_pages(unsigned long *src_pfns,
+static void __migrate_device_pages(unsigned long *src_pfns,
unsigned long *dst_pfns, unsigned long npages,
struct migrate_vma *migrate)
 {
@@ -703,6 +703,9 @@ static void migrate_device_pages(unsigned long *src_pfns,
if (!page) {
unsigned long addr;
 
+   if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
+   continue;
+
/*
 * The only time there is no vma is when called from
 * migrate_device_coherent_page(). However this isn't
@@ -710,8 +713,6 @@ static void migrate_device_pages(unsigned long *src_pfns,
 */
VM_BUG_ON(!migrate);
addr = migrate->start + i*PAGE_SIZE;
-   if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
-   continue;
if (!notified) {
notified = true;
 
@@ -767,6 +768,22 @@ static void migrate_device_pages(unsigned long *src_pfns,
 }
 
 /**
+ * migrate_device_pages() - migrate meta-data from src page to dst page
+ * @src_pfns: src_pfns returned from migrate_device_range()
+ * @dst_pfns: array of pfns allocated by the driver to migrate memory to
+ * @npages: number of pages in the range
+ *
+ * Equivalent to migrate_vma_pages(). This is called to migrate struct page
+ * meta-data from source struct page to destination.
+ */
+void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
+   unsigned long npages)
+{
+   __migrate_device_pages(src_pfns, dst_pfns, npages, NULL);
+}
+EXPORT_SYMBOL(migrate_device_pages);
+
+/**
  * migrate_vma_pages() - migrate meta-data from src page to dst page
  * @migrate: migrate struct containing all migration information
  *
@@ -776,12 +793,22 @@ static void migrate_device_pages(unsigned long *src_pfns,
  */
 void migrate_vma_pages(struct migrate_vma *migrate)
 {
-   migrate_device_pages(migrate->src, mi

[PATCH v2 4/8] mm/migrate_device.c: Refactor migrate_vma and migrate_deivce_coherent_page()

2022-09-28 Thread Alistair Popple
migrate_device_coherent_page() reuses the existing migrate_vma family of
functions to migrate a specific page without providing a valid mapping
or vma. This looks a bit odd because it means we are calling
migrate_vma_*() without setting a valid vma, however it was considered
acceptable at the time because the details were internal to
migrate_device.c and there was only a single user.

One of the reasons the details could be kept internal was that this was
strictly for migrating device coherent memory. Such memory can be copied
directly by the CPU without intervention from a driver. However this
isn't true for device private memory, and a future change requires
similar functionality for device private memory. So refactor the code
into something more sensible for migrating device memory without a vma.

Signed-off-by: Alistair Popple 
Cc: "Huang, Ying" 
Cc: Zi Yan 
Cc: Matthew Wilcox 
Cc: Yang Shi 
Cc: David Hildenbrand 
Cc: Ralph Campbell 
Cc: John Hubbard 
---
 mm/migrate_device.c | 150 +
 1 file changed, 85 insertions(+), 65 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index f756c00..ba479b5 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -345,26 +345,20 @@ static bool migrate_vma_check_page(struct page *page, 
struct page *fault_page)
 }
 
 /*
- * migrate_vma_unmap() - replace page mapping with special migration pte entry
- * @migrate: migrate struct containing all migration information
- *
- * Isolate pages from the LRU and replace mappings (CPU page table pte) with a
- * special migration pte entry and check if it has been pinned. Pinned pages 
are
- * restored because we cannot migrate them.
- *
- * This is the last step before we call the device driver callback to allocate
- * destination memory and copy contents of original page over to new page.
+ * Unmaps pages for migration. Returns number of unmapped pages.
  */
-static void migrate_vma_unmap(struct migrate_vma *migrate)
+static unsigned long migrate_device_unmap(unsigned long *src_pfns,
+ unsigned long npages,
+ struct page *fault_page)
 {
-   const unsigned long npages = migrate->npages;
unsigned long i, restore = 0;
bool allow_drain = true;
+   unsigned long unmapped = 0;
 
lru_add_drain();
 
for (i = 0; i < npages; i++) {
-   struct page *page = migrate_pfn_to_page(migrate->src[i]);
+   struct page *page = migrate_pfn_to_page(src_pfns[i]);
struct folio *folio;
 
if (!page)
@@ -379,8 +373,7 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
}
 
if (isolate_lru_page(page)) {
-   migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-   migrate->cpages--;
+   src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
restore++;
continue;
}
@@ -394,34 +387,54 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
try_to_migrate(folio, 0);
 
if (page_mapped(page) ||
-   !migrate_vma_check_page(page, migrate->fault_page)) {
+   !migrate_vma_check_page(page, fault_page)) {
if (!is_zone_device_page(page)) {
get_page(page);
putback_lru_page(page);
}
 
-   migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-   migrate->cpages--;
+   src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
restore++;
continue;
}
+
+   unmapped++;
}
 
for (i = 0; i < npages && restore; i++) {
-   struct page *page = migrate_pfn_to_page(migrate->src[i]);
+   struct page *page = migrate_pfn_to_page(src_pfns[i]);
struct folio *folio;
 
-   if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
+   if (!page || (src_pfns[i] & MIGRATE_PFN_MIGRATE))
continue;
 
folio = page_folio(page);
remove_migration_ptes(folio, folio, false);
 
-   migrate->src[i] = 0;
+   src_pfns[i] = 0;
folio_unlock(folio);
folio_put(folio);
restore--;
}
+
+   return unmapped;
+}
+
+/*
+ * migrate_vma_unmap() - replace page mapping with special migration pte entry
+ * @migrate: migrate struct containing all migration information
+ *
+ * Isolate pages from the LRU and replace mappings (CPU page table pte) with a
+ * special migration pte entry and check if it 

[PATCH v2 8/8] hmm-tests: Add test for migrate_device_range()

2022-09-28 Thread Alistair Popple
Signed-off-by: Alistair Popple 
Cc: Jason Gunthorpe 
Cc: Ralph Campbell 
Cc: John Hubbard 
Cc: Alex Sierra 
Cc: Felix Kuehling 
---
 lib/test_hmm.c | 120 +-
 lib/test_hmm_uapi.h|   1 +-
 tools/testing/selftests/vm/hmm-tests.c |  49 +++-
 3 files changed, 149 insertions(+), 21 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 688c15d..6c2fc85 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -100,6 +100,7 @@ struct dmirror {
 struct dmirror_chunk {
struct dev_pagemap  pagemap;
struct dmirror_device   *mdevice;
+   bool remove;
 };
 
 /*
@@ -192,11 +193,15 @@ static int dmirror_fops_release(struct inode *inode, 
struct file *filp)
return 0;
 }
 
+static struct dmirror_chunk *dmirror_page_to_chunk(struct page *page)
+{
+   return container_of(page->pgmap, struct dmirror_chunk, pagemap);
+}
+
 static struct dmirror_device *dmirror_page_to_device(struct page *page)
 
 {
-   return container_of(page->pgmap, struct dmirror_chunk,
-   pagemap)->mdevice;
+   return dmirror_page_to_chunk(page)->mdevice;
 }
 
 static int dmirror_do_fault(struct dmirror *dmirror, struct hmm_range *range)
@@ -1218,6 +1223,85 @@ static int dmirror_snapshot(struct dmirror *dmirror,
return ret;
 }
 
+static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
+{
+   unsigned long start_pfn = chunk->pagemap.range.start >> PAGE_SHIFT;
+   unsigned long end_pfn = chunk->pagemap.range.end >> PAGE_SHIFT;
+   unsigned long npages = end_pfn - start_pfn + 1;
+   unsigned long i;
+   unsigned long *src_pfns;
+   unsigned long *dst_pfns;
+
+   src_pfns = kcalloc(npages, sizeof(*src_pfns), GFP_KERNEL);
+   dst_pfns = kcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL);
+
+   migrate_device_range(src_pfns, start_pfn, npages);
+   for (i = 0; i < npages; i++) {
+   struct page *dpage, *spage;
+
+   spage = migrate_pfn_to_page(src_pfns[i]);
+   if (!spage || !(src_pfns[i] & MIGRATE_PFN_MIGRATE))
+   continue;
+
+   if (WARN_ON(!is_device_private_page(spage) &&
+   !is_device_coherent_page(spage)))
+   continue;
+   spage = BACKING_PAGE(spage);
+   dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
+   lock_page(dpage);
+   copy_highpage(dpage, spage);
+   dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
+   if (src_pfns[i] & MIGRATE_PFN_WRITE)
+   dst_pfns[i] |= MIGRATE_PFN_WRITE;
+   }
+   migrate_device_pages(src_pfns, dst_pfns, npages);
+   migrate_device_finalize(src_pfns, dst_pfns, npages);
+   kfree(src_pfns);
+   kfree(dst_pfns);
+}
+
+/* Removes free pages from the free list so they can't be re-allocated */
+static void dmirror_remove_free_pages(struct dmirror_chunk *devmem)
+{
+   struct dmirror_device *mdevice = devmem->mdevice;
+   struct page *page;
+
+   for (page = mdevice->free_pages; page; page = page->zone_device_data)
+   if (dmirror_page_to_chunk(page) == devmem)
+   mdevice->free_pages = page->zone_device_data;
+}
+
+static void dmirror_device_remove_chunks(struct dmirror_device *mdevice)
+{
+   unsigned int i;
+
+   mutex_lock(&mdevice->devmem_lock);
+   if (mdevice->devmem_chunks) {
+   for (i = 0; i < mdevice->devmem_count; i++) {
+   struct dmirror_chunk *devmem =
+   mdevice->devmem_chunks[i];
+
+   spin_lock(&mdevice->lock);
+   devmem->remove = true;
+   dmirror_remove_free_pages(devmem);
+   spin_unlock(&mdevice->lock);
+
+   dmirror_device_evict_chunk(devmem);
+   memunmap_pages(&devmem->pagemap);
+   if (devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
+   release_mem_region(devmem->pagemap.range.start,
+  range_len(&devmem->pagemap.range));
+   kfree(devmem);
+   }
+   mdevice->devmem_count = 0;
+   mdevice->devmem_capacity = 0;
+   mdevice->free_pages = NULL;
+   kfree(mdevice->devmem_chunks);
+   mdevice->devmem_chunks = NULL;
+   }
+   mutex_unlock(&mdevice->devmem_lock);
+}
+
 static long dmirror_fops_unlocked_ioctl(struct file *filp,
unsigned int command,
unsigned long arg)
@@ -1272,6 +1356,11 @@ static long dmirror_fops_unlocked_ioctl(struct file 
*filp,
ret = dmirror_snapshot(dmirror, &cmd);
  

[PATCH v2 6/8] nouveau/dmem: Refactor nouveau_dmem_fault_copy_one()

2022-09-28 Thread Alistair Popple
nouveau_dmem_fault_copy_one() is used during handling of CPU faults via
the migrate_to_ram() callback and is used to copy data from GPU to CPU
memory. It is currently specific to fault handling, however a future
patch implementing eviction of data during teardown needs similar
functionality.

Refactor out the core functionality so that it is not specific to fault
handling.

Signed-off-by: Alistair Popple 
Reviewed-by: Lyude Paul 
Cc: Ben Skeggs 
Cc: Ralph Campbell 
Cc: John Hubbard 
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 58 +--
 1 file changed, 28 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index b092988..65f51fb 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -139,44 +139,24 @@ static void nouveau_dmem_fence_done(struct nouveau_fence 
**fence)
}
 }
 
-static vm_fault_t nouveau_dmem_fault_copy_one(struct nouveau_drm *drm,
-   struct vm_fault *vmf, struct migrate_vma *args,
-   dma_addr_t *dma_addr)
+static int nouveau_dmem_copy_one(struct nouveau_drm *drm, struct page *spage,
+   struct page *dpage, dma_addr_t *dma_addr)
 {
struct device *dev = drm->dev->dev;
-   struct page *dpage, *spage;
-   struct nouveau_svmm *svmm;
-
-   spage = migrate_pfn_to_page(args->src[0]);
-   if (!spage || !(args->src[0] & MIGRATE_PFN_MIGRATE))
-   return 0;
 
-   dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
-   if (!dpage)
-   return VM_FAULT_SIGBUS;
lock_page(dpage);
 
*dma_addr = dma_map_page(dev, dpage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
if (dma_mapping_error(dev, *dma_addr))
-   goto error_free_page;
+   return -EIO;
 
-   svmm = spage->zone_device_data;
-   mutex_lock(&svmm->mutex);
-   nouveau_svmm_invalidate(svmm, args->start, args->end);
if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_HOST, *dma_addr,
-   NOUVEAU_APER_VRAM, nouveau_dmem_page_addr(spage)))
-   goto error_dma_unmap;
-   mutex_unlock(&svmm->mutex);
+NOUVEAU_APER_VRAM, nouveau_dmem_page_addr(spage))) {
+   dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+   return -EIO;
+   }
 
-   args->dst[0] = migrate_pfn(page_to_pfn(dpage));
return 0;
-
-error_dma_unmap:
-   mutex_unlock(&svmm->mutex);
-   dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
-error_free_page:
-   __free_page(dpage);
-   return VM_FAULT_SIGBUS;
 }
 
 static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
@@ -184,9 +164,11 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct 
vm_fault *vmf)
struct nouveau_drm *drm = page_to_drm(vmf->page);
struct nouveau_dmem *dmem = drm->dmem;
struct nouveau_fence *fence;
+   struct nouveau_svmm *svmm;
+   struct page *spage, *dpage;
unsigned long src = 0, dst = 0;
dma_addr_t dma_addr = 0;
-   vm_fault_t ret;
+   vm_fault_t ret = 0;
struct migrate_vma args = {
.vma= vmf->vma,
.start  = vmf->address,
@@ -207,9 +189,25 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct 
vm_fault *vmf)
if (!args.cpages)
return 0;
 
-   ret = nouveau_dmem_fault_copy_one(drm, vmf, &args, &dma_addr);
-   if (ret || dst == 0)
+   spage = migrate_pfn_to_page(src);
+   if (!spage || !(src & MIGRATE_PFN_MIGRATE))
+   goto done;
+
+   dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
+   if (!dpage)
+   goto done;
+
+   dst = migrate_pfn(page_to_pfn(dpage));
+
+   svmm = spage->zone_device_data;
+   mutex_lock(&svmm->mutex);
+   nouveau_svmm_invalidate(svmm, args.start, args.end);
+   ret = nouveau_dmem_copy_one(drm, spage, dpage, &dma_addr);
+   mutex_unlock(&svmm->mutex);
+   if (ret) {
+   ret = VM_FAULT_SIGBUS;
goto done;
+   }
 
nouveau_fence_new(dmem->migrate.chan, false, &fence);
migrate_vma_pages(&args);
-- 
git-series 0.9.1


[PATCH v2 3/8] mm/memremap.c: Take a pgmap reference on page allocation

2022-09-28 Thread Alistair Popple
ZONE_DEVICE pages have a struct dev_pagemap which is allocated by a
driver. When the struct page is first allocated by the kernel in
memremap_pages() a reference is taken on the associated pagemap to
ensure it is not freed prior to the pages being freed.

Prior to 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page
refcount") pages were considered free and returned to the driver when
the reference count dropped to one. However the pagemap reference was
not dropped until the page reference count hit zero. This would occur as
part of the final put_page() in memunmap_pages() which would wait for
all pages to be freed prior to returning.

When the extra refcount was removed the pagemap reference was no longer
being dropped in put_page(). Instead memunmap_pages() was changed to
explicitly drop the pagemap references. This means that memunmap_pages()
can complete even though pages are still mapped by the kernel which can
lead to kernel crashes, particularly if a driver frees the pagemap.

To fix this drivers should take a pagemap reference when allocating the
page. This reference can then be returned when the page is freed.
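
As a rough sketch of the pairing this implies (illustrative only, the
helper names below are made up and the real changes are in the hunks
that follow):

static struct page *driver_get_free_device_page(struct dev_pagemap *pgmap,
						unsigned long pfn)
{
	/* Allocation side: pin the pagemap so memunmap_pages() can't finish. */
	if (!percpu_ref_tryget_live(&pgmap->ref))
		return NULL;	/* pagemap is already being torn down */

	return pfn_to_page(pfn);
}

static void driver_free_device_page(struct page *page)
{
	/* Free side: return the reference taken at allocation time. */
	put_dev_pagemap(page->pgmap);
}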

Signed-off-by: Alistair Popple 
Fixes: 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page refcount")
Cc: Jason Gunthorpe 
Cc: Felix Kuehling 
Cc: Alex Deucher 
Cc: Christian König 
Cc: Ben Skeggs 
Cc: Lyude Paul 
Cc: Ralph Campbell 
Cc: Alex Sierra 
Cc: John Hubbard 
Cc: Dan Williams 

---

Again I expect this will conflict with Dan's series. This implements the
first suggestion from Jason at
https://lore.kernel.org/linux-mm/yzly5jjof0jdl...@nvidia.com/ so
whatever we end up doing for DAX we should do the same here.
---
 mm/memremap.c | 25 +++--
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index 1c2c038..421bec3 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -138,8 +138,11 @@ void memunmap_pages(struct dev_pagemap *pgmap)
int i;
 
percpu_ref_kill(&pgmap->ref);
-   for (i = 0; i < pgmap->nr_range; i++)
-   percpu_ref_put_many(&pgmap->ref, pfn_len(pgmap, i));
+   if (pgmap->type != MEMORY_DEVICE_PRIVATE &&
+   pgmap->type != MEMORY_DEVICE_COHERENT)
+   for (i = 0; i < pgmap->nr_range; i++)
+   percpu_ref_put_many(&pgmap->ref, pfn_len(pgmap, i));
+
wait_for_completion(&pgmap->done);
 
for (i = 0; i < pgmap->nr_range; i++)
@@ -264,7 +267,9 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct 
mhp_params *params,
memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
PHYS_PFN(range->start),
PHYS_PFN(range_len(range)), pgmap);
-   percpu_ref_get_many(&pgmap->ref, pfn_len(pgmap, range_id));
+   if (pgmap->type != MEMORY_DEVICE_PRIVATE &&
+   pgmap->type != MEMORY_DEVICE_COHERENT)
+   percpu_ref_get_many(&pgmap->ref, pfn_len(pgmap, range_id));
return 0;
 
 err_add_memory:
@@ -502,16 +507,24 @@ void free_zone_device_page(struct page *page)
page->mapping = NULL;
page->pgmap->ops->page_free(page);
 
-   /*
-* Reset the page count to 1 to prepare for handing out the page again.
-*/
if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
page->pgmap->type != MEMORY_DEVICE_COHERENT)
+   /*
+* Reset the page count to 1 to prepare for handing out the page
+* again.
+*/
set_page_count(page, 1);
+   else
+   put_dev_pagemap(page->pgmap);
 }
 
 void zone_device_page_init(struct page *page)
 {
+   /*
+* Drivers shouldn't be allocating pages after calling
+* memunmap_pages().
+*/
+   WARN_ON_ONCE(!percpu_ref_tryget_live(&page->pgmap->ref));
set_page_count(page, 1);
lock_page(page);
 }
-- 
git-series 0.9.1


[PATCH v2 7/8] nouveau/dmem: Evict device private memory during release

2022-09-28 Thread Alistair Popple
When the module is unloaded or a GPU is unbound from the module it is
possible for device private pages to still be mapped in currently
running processes. This can lead to hangs and RCU stall warnings when
unbinding the device as memunmap_pages() will wait in an uninterruptible
state until all device pages have been freed, which may never happen.

Fix this by migrating device mappings back to normal CPU memory prior to
freeing the GPU memory chunks and associated device private pages.

Signed-off-by: Alistair Popple 
Cc: Lyude Paul 
Cc: Ben Skeggs 
Cc: Ralph Campbell 
Cc: John Hubbard 
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 48 +++-
 1 file changed, 48 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 65f51fb..5fe2091 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -367,6 +367,52 @@ nouveau_dmem_suspend(struct nouveau_drm *drm)
mutex_unlock(&drm->dmem->mutex);
 }
 
+/*
+ * Evict all pages mapping a chunk.
+ */
+static void
+nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
+{
+   unsigned long i, npages = range_len(&chunk->pagemap.range) >> PAGE_SHIFT;
+   unsigned long *src_pfns, *dst_pfns;
+   dma_addr_t *dma_addrs;
+   struct nouveau_fence *fence;
+
+   src_pfns = kcalloc(npages, sizeof(*src_pfns), GFP_KERNEL);
+   dst_pfns = kcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL);
+   dma_addrs = kcalloc(npages, sizeof(*dma_addrs), GFP_KERNEL);
+
+   migrate_device_range(src_pfns, chunk->pagemap.range.start >> PAGE_SHIFT,
+   npages);
+
+   for (i = 0; i < npages; i++) {
+   if (src_pfns[i] & MIGRATE_PFN_MIGRATE) {
+   struct page *dpage;
+
+   /*
+* __GFP_NOFAIL because the GPU is going away and there
+* is nothing sensible we can do if we can't copy the
+* data back.
+*/
+   dpage = alloc_page(GFP_HIGHUSER | __GFP_NOFAIL);
+   dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
+   nouveau_dmem_copy_one(chunk->drm,
+   migrate_pfn_to_page(src_pfns[i]), dpage,
+   &dma_addrs[i]);
+   }
+   }
+
+   nouveau_fence_new(chunk->drm->dmem->migrate.chan, false, &fence);
+   migrate_device_pages(src_pfns, dst_pfns, npages);
+   nouveau_dmem_fence_done(&fence);
+   migrate_device_finalize(src_pfns, dst_pfns, npages);
+   kfree(src_pfns);
+   kfree(dst_pfns);
+   for (i = 0; i < npages; i++)
+   dma_unmap_page(chunk->drm->dev->dev, dma_addrs[i], PAGE_SIZE, DMA_BIDIRECTIONAL);
+   kfree(dma_addrs);
+}
+
 void
 nouveau_dmem_fini(struct nouveau_drm *drm)
 {
@@ -378,8 +424,10 @@ nouveau_dmem_fini(struct nouveau_drm *drm)
mutex_lock(&drm->dmem->mutex);
 
list_for_each_entry_safe(chunk, tmp, &drm->dmem->chunks, list) {
+   nouveau_dmem_evict_chunk(chunk);
nouveau_bo_unpin(chunk->bo);
nouveau_bo_ref(NULL, >bo);
+   WARN_ON(chunk->callocated);
list_del(>list);
memunmap_pages(&chunk->pagemap);
release_mem_region(chunk->pagemap.range.start,
-- 
git-series 0.9.1


[PATCH v2 0/8] Fix several device private page reference counting issues

2022-09-28 Thread Alistair Popple
This series aims to fix a number of page reference counting issues in
drivers dealing with device private ZONE_DEVICE pages. These result in
use-after-free type bugs, either from accessing a struct page which no
longer exists because it has been removed or accessing fields within the
struct page which are no longer valid because the page has been freed.

During normal usage it is unlikely these will cause any problems. However
without these fixes it is possible to crash the kernel from userspace.
These crashes can be triggered either by unloading the kernel module or
unbinding the device from the driver prior to a userspace task exiting. In
modules such as Nouveau it is also possible to trigger some of these issues
by explicitly closing the device file-descriptor prior to the task exiting
and then accessing device private memory.

This involves some minor changes to both PowerPC and AMD GPU code.
Unfortunately I lack hardware to test either of those so any help there
would be appreciated. The changes mimic what is done for both Nouveau
and hmm-tests though so I doubt they will cause problems.

To: Andrew Morton 
To: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: nouv...@lists.freedesktop.org
Cc: dri-de...@lists.freedesktop.org

Alistair Popple (8):
  mm/memory.c: Fix race when faulting a device private page
  mm: Free device private pages have zero refcount
  mm/memremap.c: Take a pgmap reference on page allocation
  mm/migrate_device.c: Refactor migrate_vma and migrate_deivce_coherent_page()
  mm/migrate_device.c: Add migrate_device_range()
  nouveau/dmem: Refactor nouveau_dmem_fault_copy_one()
  nouveau/dmem: Evict device private memory during release
  hmm-tests: Add test for migrate_device_range()

 arch/powerpc/kvm/book3s_hv_uvmem.c   |  17 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  19 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |   2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c |  11 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c   | 108 +++
 include/linux/memremap.h |   1 +-
 include/linux/migrate.h  |  15 ++-
 lib/test_hmm.c   | 129 ++---
 lib/test_hmm_uapi.h  |   1 +-
 mm/memory.c  |  16 +-
 mm/memremap.c|  30 ++-
 mm/migrate.c |  34 +--
 mm/migrate_device.c  | 239 +---
 mm/page_alloc.c  |   8 +-
 tools/testing/selftests/vm/hmm-tests.c   |  49 +-
 15 files changed, 516 insertions(+), 163 deletions(-)

base-commit: 088b8aa537c2c767765f1c19b555f21ffe555786
-- 
git-series 0.9.1


[PATCH v2 2/8] mm: Free device private pages have zero refcount

2022-09-28 Thread Alistair Popple
Since 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page
refcount") device private pages have no longer had an extra reference
count when the page is in use. However before handing them back to the
owning device driver we add an extra reference count such that free
pages have a reference count of one.

This makes it difficult to tell if a page is free or not because both
free and in use pages will have a non-zero refcount. Instead we should
return pages to the drivers page allocator with a zero reference count.
Kernel code can then safely use kernel functions such as
get_page_unless_zero().
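
As a small illustration (not part of this patch), a caller that only
wants to pin pages which are actually in use can then simply do:

static bool try_pin_device_page(struct page *page)
{
	/*
	 * Free device private/coherent pages now sit at refcount zero, so
	 * this fails for pages on the driver's free list and only succeeds
	 * for pages that are genuinely in use.
	 */
	return get_page_unless_zero(page);
}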

Signed-off-by: Alistair Popple 
Cc: Jason Gunthorpe 
Cc: Michael Ellerman 
Cc: Felix Kuehling 
Cc: Alex Deucher 
Cc: Christian König 
Cc: Ben Skeggs 
Cc: Lyude Paul 
Cc: Ralph Campbell 
Cc: Alex Sierra 
Cc: John Hubbard 
Cc: Dan Williams 

---

This will conflict with Dan's series to fix reference counts for DAX[1].
At the moment this only makes changes for device private and coherent
pages, however if DAX is fixed to remove the extra refcount then we
should just be able to drop the checks for private/coherent pages and
treat them the same.

[1] - 
https://lore.kernel.org/linux-mm/166329930818.2786261.6086109734008025807.st...@dwillia2-xfh.jf.intel.com/
---
 arch/powerpc/kvm/book3s_hv_uvmem.c   |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |  2 +-
 include/linux/memremap.h |  1 +
 lib/test_hmm.c   |  2 +-
 mm/memremap.c|  9 +
 mm/page_alloc.c  |  8 
 7 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index d4eacf4..9d8de68 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -718,7 +718,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
 
dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
-   lock_page(dpage);
+   zone_device_page_init(dpage);
return dpage;
 out_clear:
spin_lock(&kvmppc_uvmem_bitmap_lock);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 776448b..97a6845 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -223,7 +223,7 @@ svm_migrate_get_vram_page(struct svm_range *prange, 
unsigned long pfn)
page = pfn_to_page(pfn);
svm_range_bo_ref(prange->svm_bo);
page->zone_device_data = prange->svm_bo;
-   lock_page(page);
+   zone_device_page_init(page);
 }
 
 static void
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 1635661..b092988 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -326,7 +326,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}
 
-   lock_page(page);
+   zone_device_page_init(page);
return page;
 }
 
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 1901049..f68bf6d 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -182,6 +182,7 @@ static inline bool folio_is_device_coherent(const struct 
folio *folio)
 }
 
 #ifdef CONFIG_ZONE_DEVICE
+void zone_device_page_init(struct page *page);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
 void memunmap_pages(struct dev_pagemap *pgmap);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 89463ff..688c15d 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -627,8 +627,8 @@ static struct page *dmirror_devmem_alloc_page(struct 
dmirror_device *mdevice)
goto error;
}
 
+   zone_device_page_init(dpage);
dpage->zone_device_data = rpage;
-   lock_page(dpage);
return dpage;
 
 error:
diff --git a/mm/memremap.c b/mm/memremap.c
index 25029a4..1c2c038 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -505,8 +505,17 @@ void free_zone_device_page(struct page *page)
/*
 * Reset the page count to 1 to prepare for handing out the page again.
 */
+   if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
+   page->pgmap->type != MEMORY_DEVICE_COHERENT)
+   set_page_count(page, 1);
+}
+
+void zone_device_page_init(struct page *page)
+{
set_page_count(page, 1);
+   lock_page(page);
 }
+EXPORT_SYMBOL_GPL(zone_device_page_init);
 
 #ifdef CONFIG_FS_DAX
 bool __put_devmap_managed_page_refs(struct page *page, int refs)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9d49803..4df1e43 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6744,6 +6744,14 @@ static void __ref __init_zone_devi

Re: [PATCH 5/7] nouveau/dmem: Refactor nouveau_dmem_fault_copy_one()

2022-09-28 Thread Alistair Popple


Lyude Paul  writes:

> On Mon, 2022-09-26 at 16:03 +1000, Alistair Popple wrote:
>> nouveau_dmem_fault_copy_one() is used during handling of CPU faults via
>> the migrate_to_ram() callback and is used to copy data from GPU to CPU
>> memory. It is currently specific to fault handling, however a future
>> patch implementing eviction of data during teardown needs similar
>> functionality.
>>
>> Refactor out the core functionality so that it is not specific to fault
>> handling.
>>
>> Signed-off-by: Alistair Popple 
>> ---
>>  drivers/gpu/drm/nouveau/nouveau_dmem.c | 59 +--
>>  1 file changed, 29 insertions(+), 30 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
>> b/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> index f9234ed..66ebbd4 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> @@ -139,44 +139,25 @@ static void nouveau_dmem_fence_done(struct 
>> nouveau_fence **fence)
>>  }
>>  }
>>
>> -static vm_fault_t nouveau_dmem_fault_copy_one(struct nouveau_drm *drm,
>> -struct vm_fault *vmf, struct migrate_vma *args,
>> -dma_addr_t *dma_addr)
>> +static int nouveau_dmem_copy_one(struct nouveau_drm *drm, struct page 
>> *spage,
>> +struct page *dpage, dma_addr_t *dma_addr)
>>  {
>>  struct device *dev = drm->dev->dev;
>> -struct page *dpage, *spage;
>> -struct nouveau_svmm *svmm;
>> -
>> -spage = migrate_pfn_to_page(args->src[0]);
>> -if (!spage || !(args->src[0] & MIGRATE_PFN_MIGRATE))
>> -return 0;
>>
>> -dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
>> -if (!dpage)
>> -return VM_FAULT_SIGBUS;
>>  lock_page(dpage);
>>
>>  *dma_addr = dma_map_page(dev, dpage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
>>  if (dma_mapping_error(dev, *dma_addr))
>> -goto error_free_page;
>> +return -EIO;
>>
>> -svmm = spage->zone_device_data;
>> -mutex_lock(&svmm->mutex);
>> -nouveau_svmm_invalidate(svmm, args->start, args->end);
>>  if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_HOST, *dma_addr,
>> -NOUVEAU_APER_VRAM, nouveau_dmem_page_addr(spage)))
>> -goto error_dma_unmap;
>> -mutex_unlock(&svmm->mutex);
>> + NOUVEAU_APER_VRAM,
>> + nouveau_dmem_page_addr(spage))) {
>> +dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
>> +return -EIO;
>> +}
>
> Feel free to just align this with the starting (, as long as it doesn't go
> above 100 characters it doesn't really matter imho and would look nicer that
> way.
>
> Otherwise:
>
> Reviewed-by: Lyude Paul 

Thanks! I'm not sure I precisely understood your alignment comment above
but feel free to let me know if I got it wrong in v2.

> Will look at the other patch in a moment
>
>>
>> -args->dst[0] = migrate_pfn(page_to_pfn(dpage));
>>  return 0;
>> -
>> -error_dma_unmap:
>> -mutex_unlock(&svmm->mutex);
>> -dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
>> -error_free_page:
>> -__free_page(dpage);
>> -return VM_FAULT_SIGBUS;
>>  }
>>
>>  static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
>> @@ -184,9 +165,11 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct 
>> vm_fault *vmf)
>>  struct nouveau_drm *drm = page_to_drm(vmf->page);
>>  struct nouveau_dmem *dmem = drm->dmem;
>>  struct nouveau_fence *fence;
>> +struct nouveau_svmm *svmm;
>> +struct page *spage, *dpage;
>>  unsigned long src = 0, dst = 0;
>>  dma_addr_t dma_addr = 0;
>> -vm_fault_t ret;
>> +vm_fault_t ret = 0;
>>  struct migrate_vma args = {
>>  .vma= vmf->vma,
>>  .start  = vmf->address,
>> @@ -207,9 +190,25 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct 
>> vm_fault *vmf)
>>  if (!args.cpages)
>>  return 0;
>>
>> -ret = nouveau_dmem_fault_copy_one(drm, vmf, &args, &dma_addr);
>> -if (ret || dst == 0)
>> +spage = migrate_pfn_to_page(src);
>> +if (!spage || !(src & MIGRATE_PFN_MIGRATE))
>> +goto done;
>> +
>> +dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
>> +if (!dpage)
>> +goto done;
>> +
>> +dst = migrate_pfn(page_to_pfn(dpage));
>> +
>> +svmm = spage->zone_device_data;
>> +mutex_lock(&svmm->mutex);
>> +nouveau_svmm_invalidate(svmm, args.start, args.end);
>> +ret = nouveau_dmem_copy_one(drm, spage, dpage, &dma_addr);
>> +mutex_unlock(&svmm->mutex);
>> +if (ret) {
>> +ret = VM_FAULT_SIGBUS;
>>  goto done;
>> +}
>>
>>  nouveau_fence_new(dmem->migrate.chan, false, &fence);
>>  migrate_vma_pages(&args);


Re: [PATCH 6/7] nouveau/dmem: Evict device private memory during release

2022-09-27 Thread Alistair Popple


Felix Kuehling  writes:

> On 2022-09-26 17:35, Lyude Paul wrote:
>> On Mon, 2022-09-26 at 16:03 +1000, Alistair Popple wrote:
>>> When the module is unloaded or a GPU is unbound from the module it is
>>> possible for device private pages to be left mapped in currently running
>>> processes. This leads to a kernel crash when the pages are either freed
>>> or accessed from the CPU because the GPU and associated data structures
>>> and callbacks have all been freed.
>>>
>>> Fix this by migrating any mappings back to normal CPU memory prior to
>>> freeing the GPU memory chunks and associated device private pages.
>>>
>>> Signed-off-by: Alistair Popple 
>>>
>>> ---
>>>
>>> I assume the AMD driver might have a similar issue. However I can't see
>>> where device private (or coherent) pages actually get unmapped/freed
>>> during teardown as I couldn't find any relevant calls to
>>> devm_memunmap(), memunmap(), devm_release_mem_region() or
>>> release_mem_region(). So it appears that ZONE_DEVICE pages are not being
>>> properly freed during module unload, unless I'm missing something?
>> I've got no idea, will poke Ben to see if they know the answer to this
>
> I guess we're relying on devm to release the region. Isn't the whole point of
> using devm_request_free_mem_region that we don't have to remember to 
> explicitly
> release it when the device gets destroyed? I believe we had an explicit free
> call at some point by mistake, and that caused a double-free during module
> unload. See this commit for reference:

Argh, thanks for that pointer. I was not so familiar with
devm_request_free_mem_region()/devm_memremap_pages() as currently
Nouveau explicitly manages that itself.

> commit 22f4f4faf337d5fb2d2750aff13215726814273e
> Author: Philip Yang 
> Date:   Mon Sep 20 17:25:52 2021 -0400
>
> drm/amdkfd: fix svm_migrate_fini warning
>  Device manager releases device-specific resources when a driver
> disconnects from a device, devm_memunmap_pages and
> devm_release_mem_region calls in svm_migrate_fini are redundant.
>  It causes below warning trace after patch "drm/amdgpu: Split
> amdgpu_device_fini into early and late", so remove function
> svm_migrate_fini.
>  BUG: https://gitlab.freedesktop.org/drm/amd/-/issues/1718
>  WARNING: CPU: 1 PID: 3646 at drivers/base/devres.c:795
> devm_release_action+0x51/0x60
> Call Trace:
> ? memunmap_pages+0x360/0x360
> svm_migrate_fini+0x2d/0x60 [amdgpu]
> kgd2kfd_device_exit+0x23/0xa0 [amdgpu]
> amdgpu_amdkfd_device_fini_sw+0x1d/0x30 [amdgpu]
> amdgpu_device_fini_sw+0x45/0x290 [amdgpu]
> amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
> drm_dev_release+0x20/0x40 [drm]
> release_nodes+0x196/0x1e0
> device_release_driver_internal+0x104/0x1d0
> driver_detach+0x47/0x90
> bus_remove_driver+0x7a/0xd0
> pci_unregister_driver+0x3d/0x90
> amdgpu_exit+0x11/0x20 [amdgpu]
>  Signed-off-by: Philip Yang 
> Reviewed-by: Felix Kuehling 
> Signed-off-by: Alex Deucher 
>
> Furthermore, I guess we are assuming that nobody is using the GPU when the
> module is unloaded. As long as any processes have /dev/kfd open, you won't be
> able to unload the module (except by force-unload). I suppose with ZONE_DEVICE
> memory, we can have references to device memory pages even when user mode has
> closed /dev/kfd. We do have a cleanup handler that runs in an 
> MMU-free-notifier.
> In theory that should run after all the pages in the mm_struct have been 
> freed.
> It releases all sorts of other device resources and needs the driver to still 
> be
> there. I'm not sure if there is anything preventing a module unload before the
> free-notifier runs. I'll look into that.

Right - module unload (or device unbind) is one of the other ways we can
hit this issue in Nouveau at least. You can end up with ZONE_DEVICE
pages mapped in a running process after the module has unloaded.
Although now you mention it that seems a bit wrong - the pgmap refcount
should provide some protection against that. Will have to look into
that too.

> Regards,
>   Felix
>
>
>>
>>> ---
>>>   drivers/gpu/drm/nouveau/nouveau_dmem.c | 48 +++-
>>>   1 file changed, 48 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
>>> b/drivers/gpu/drm/nouveau/nouveau_dmem.c
>>> index 66ebbd4..3b247b8 100644
>>> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
>>> +++ b/drivers/gpu/drm/

Re: [PATCH 6/7] nouveau/dmem: Evict device private memory during release

2022-09-27 Thread Alistair Popple


John Hubbard  writes:

> On 9/26/22 14:35, Lyude Paul wrote:
>>> +   for (i = 0; i < npages; i++) {
>>> +   if (src_pfns[i] & MIGRATE_PFN_MIGRATE) {
>>> +   struct page *dpage;
>>> +
>>> +   /*
>>> +* __GFP_NOFAIL because the GPU is going away and there
>>> +* is nothing sensible we can do if we can't copy the
>>> +* data back.
>>> +*/
>>
>> You'll have to excuse me for a moment since this area of nouveau isn't one of
>> my strongpoints, but are we sure about this? IIRC __GFP_NOFAIL means infinite
>> retry, in the case of a GPU hotplug event I would assume we would rather just
>> stop trying to migrate things to the GPU and just drop the data instead of
>> hanging on infinite retries.
>>

No problem, thanks for taking a look!

> Hi Lyude!
>
> Actually, I really think it's better in this case to keep trying
> (presumably not necessarily infinitely, but only until memory becomes
> available), rather than failing out and corrupting data.
>
> That's because I'm not sure it's completely clear that this memory is
> discardable. And at some point, we're going to make this all work with
> file-backed memory, which will *definitely* not be discardable--I
> realize that we're not there yet, of course.
>
> But here, it's reasonable to commit to just retrying indefinitely,
> really. Memory should eventually show up. And if it doesn't, then
> restarting the machine is better than corrupting data, generally.

The memory is definitely not discardable here if the migration failed
because that implies it is still mapped into some userspace process.

We could avoid restarting the machine by doing something similar to what
happens during memory failure and killing every process that maps the
page(s). But overall I think it's better to retry until memory is
available, because that allows things like reclaim to work and in the
worst case allows the OOM killer to select an appropriate task to kill.
It also won't cause data corruption if/when we have file-backed memory.

> thanks,


Re: [PATCH 2/7] mm: Free device private pages have zero refcount

2022-09-27 Thread Alistair Popple


Jason Gunthorpe  writes:

> On Mon, Sep 26, 2022 at 04:03:06PM +1000, Alistair Popple wrote:
>> Since 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page
>> refcount") device private pages have no longer had an extra reference
>> count when the page is in use. However before handing them back to the
>> owning device driver we add an extra reference count such that free
>> pages have a reference count of one.
>>
>> This makes it difficult to tell if a page is free or not because both
>> free and in use pages will have a non-zero refcount. Instead we should
>> return pages to the drivers page allocator with a zero reference count.
>> Kernel code can then safely use kernel functions such as
>> get_page_unless_zero().
>>
>> Signed-off-by: Alistair Popple 
>> ---
>>  arch/powerpc/kvm/book3s_hv_uvmem.c   | 1 +
>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 +
>>  drivers/gpu/drm/nouveau/nouveau_dmem.c   | 1 +
>>  lib/test_hmm.c   | 1 +
>>  mm/memremap.c| 5 -
>>  mm/page_alloc.c  | 6 ++
>>  6 files changed, 10 insertions(+), 5 deletions(-)
>
> I think this is a great idea, but I'm surprised no dax stuff is
> touched here?

free_zone_device_page() shouldn't be called for pgmap->type ==
MEMORY_DEVICE_FS_DAX so I don't think we should have to worry about DAX
there. Except that the folio code looks like it might have introduced a
bug. AFAICT put_page() always calls
put_devmap_managed_page(&folio->page) but folio_put() does not (although
folios_put() does!). So it seems folio_put() won't end up calling
__put_devmap_managed_page_refs() as I think it should.

I think you're right about the change to __init_zone_device_page() - I
should limit it to DEVICE_PRIVATE/COHERENT pages only. But I need to
look at Dan's patch series more closely as I suspect it might be better
to rebase this patch on top of that.
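
Roughly what I have in mind (untested sketch, not the final hunk):

	/* At the end of __init_zone_device_page(): only device private and
	 * coherent pages are handed back to a driver page allocator, so
	 * only they should start out with a zero refcount. */
	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
	    pgmap->type == MEMORY_DEVICE_COHERENT)
		set_page_count(page, 0);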

> Jason


[PATCH 4/7] mm/migrate_device.c: Add migrate_device_range()

2022-09-26 Thread Alistair Popple
Device drivers can use the migrate_vma family of functions to migrate
existing private anonymous mappings to device private pages. These pages
are backed by memory on the device with drivers being responsible for
copying data to and from device memory.

Device private pages are freed via the pgmap->page_free() callback when
they are unmapped and their refcount drops to zero. Alternatively they
may be freed indirectly via migration back to CPU memory in response to
a pgmap->migrate_to_ram() callback called whenever the CPU accesses
an address mapped to a device private page.

In other words drivers cannot control the lifetime of data allocated on
the devices and must wait until these pages are freed from userspace.
This causes issues when memory needs to be reclaimed on the device, either
because the device is going away due to a ->release() callback or
because another user needs to use the memory.

Drivers could use the existing migrate_vma functions to migrate data off
the device. However this would require them to track the mappings of
each page which is both complicated and not always possible. Instead
drivers need to be able to migrate device pages directly so they can
free up device memory.

To allow that this patch introduces the migrate_device family of
functions which are functionally similar to migrate_vma but which skips
the initial lookup based on mapping.
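
As a sketch of the intended usage (illustrative only; error handling and
the allocation/copy steps are driver specific):

static void driver_evict_device_range(unsigned long start_pfn,
				      unsigned long npages)
{
	unsigned long *src_pfns, *dst_pfns;
	unsigned long i;

	src_pfns = kcalloc(npages, sizeof(*src_pfns), GFP_KERNEL);
	dst_pfns = kcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL);

	/* Collect and unmap the device pages backing the pfn range. */
	migrate_device_range(src_pfns, start_pfn, npages);

	for (i = 0; i < npages; i++) {
		struct page *dpage;

		if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
			continue;

		/* Allocate system memory and copy the data back. */
		dpage = alloc_page(GFP_HIGHUSER | __GFP_NOFAIL);
		lock_page(dpage);
		/* ... driver specific copy from the device page to dpage ... */
		dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
	}

	/* Switch over the struct page metadata and remove old mappings. */
	migrate_device_pages(src_pfns, dst_pfns, npages);
	migrate_device_finalize(src_pfns, dst_pfns, npages);

	kfree(src_pfns);
	kfree(dst_pfns);
}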

Signed-off-by: Alistair Popple 
---
 include/linux/migrate.h |  7 +++-
 mm/migrate_device.c | 89 ++
 2 files changed, 89 insertions(+), 7 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 82ffa47..582cdc7 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -225,6 +225,13 @@ struct migrate_vma {
 int migrate_vma_setup(struct migrate_vma *args);
 void migrate_vma_pages(struct migrate_vma *migrate);
 void migrate_vma_finalize(struct migrate_vma *migrate);
+int migrate_device_range(unsigned long *src_pfns, unsigned long start,
+   unsigned long npages);
+void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
+   unsigned long npages);
+void migrate_device_finalize(unsigned long *src_pfns,
+   unsigned long *dst_pfns, unsigned long npages);
+
 #endif /* CONFIG_MIGRATION */
 
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index ba479b5..824860a 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -681,7 +681,7 @@ static void migrate_vma_insert_page(struct migrate_vma 
*migrate,
*src &= ~MIGRATE_PFN_MIGRATE;
 }
 
-static void migrate_device_pages(unsigned long *src_pfns,
+static void __migrate_device_pages(unsigned long *src_pfns,
unsigned long *dst_pfns, unsigned long npages,
struct migrate_vma *migrate)
 {
@@ -703,6 +703,9 @@ static void migrate_device_pages(unsigned long *src_pfns,
if (!page) {
unsigned long addr;
 
+   if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
+   continue;
+
/*
 * The only time there is no vma is when called from
 * migrate_device_coherent_page(). However this isn't
@@ -710,8 +713,6 @@ static void migrate_device_pages(unsigned long *src_pfns,
 */
VM_BUG_ON(!migrate);
addr = migrate->start + i*PAGE_SIZE;
-   if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
-   continue;
if (!notified) {
notified = true;
 
@@ -767,6 +768,22 @@ static void migrate_device_pages(unsigned long *src_pfns,
 }
 
 /**
+ * migrate_device_pages() - migrate meta-data from src page to dst page
+ * @src_pfns: src_pfns returned from migrate_device_range()
+ * @dst_pfns: array of pfns allocated by the driver to migrate memory to
+ * @npages: number of pages in the range
+ *
+ * Equivalent to migrate_vma_pages(). This is called to migrate struct page
+ * meta-data from source struct page to destination.
+ */
+void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
+   unsigned long npages)
+{
+   __migrate_device_pages(src_pfns, dst_pfns, npages, NULL);
+}
+EXPORT_SYMBOL(migrate_device_pages);
+
+/**
  * migrate_vma_pages() - migrate meta-data from src page to dst page
  * @migrate: migrate struct containing all migration information
  *
@@ -776,12 +793,22 @@ static void migrate_device_pages(unsigned long *src_pfns,
  */
 void migrate_vma_pages(struct migrate_vma *migrate)
 {
-   migrate_device_pages(migrate->src, migrate->dst, migrate->npages, migrate);
+   __migrate_device_pages(migrate->src, migrate->dst, migrate->npages, migrate);
 }
 EXPORT

[PATCH 0/7] Fix several device private page reference counting issues

2022-09-26 Thread Alistair Popple
This series aims to fix a number of page reference counting issues in drivers
dealing with device private ZONE_DEVICE pages. These result in use-after-free
type bugs, either from accessing a struct page which no longer exists because it
has been removed or accessing fields within the struct page which are no longer
valid because the page has been freed.

During normal usage it is unlikely these will cause any problems. However
without these fixes it is possible to crash the kernel from userspace. These
crashes can be triggered either by unloading the kernel module or unbinding the
device from the driver prior to a userspace task exiting. In modules such as
Nouveau it is also possible to trigger some of these issues by explicitly
closing the device file-descriptor prior to the task exiting and then accessing
device private memory.

This involves changes to both PowerPC and AMD GPU code. Unfortunately I lack the
hardware to test on either of these so would appreciate it if someone with
access could test those.

Alistair Popple (7):
  mm/memory.c: Fix race when faulting a device private page
  mm: Free device private pages have zero refcount
  mm/migrate_device.c: Refactor migrate_vma and migrate_deivce_coherent_page()
  mm/migrate_device.c: Add migrate_device_range()
  nouveau/dmem: Refactor nouveau_dmem_fault_copy_one()
  nouveau/dmem: Evict device private memory during release
  hmm-tests: Add test for migrate_device_range()

 arch/powerpc/kvm/book3s_hv_uvmem.c   |  16 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  18 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |   2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c |  11 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c   | 108 +++
 include/linux/migrate.h  |  15 ++-
 lib/test_hmm.c   | 127 ++---
 lib/test_hmm_uapi.h  |   1 +-
 mm/memory.c  |  16 +-
 mm/memremap.c|   5 +-
 mm/migrate.c |  34 +--
 mm/migrate_device.c  | 239 +---
 mm/page_alloc.c  |   6 +-
 tools/testing/selftests/vm/hmm-tests.c   |  49 +-
 14 files changed, 487 insertions(+), 160 deletions(-)

base-commit: 088b8aa537c2c767765f1c19b555f21ffe555786
-- 
git-series 0.9.1


[PATCH 7/7] hmm-tests: Add test for migrate_device_range()

2022-09-26 Thread Alistair Popple
Signed-off-by: Alistair Popple 
---
 lib/test_hmm.c | 119 +-
 lib/test_hmm_uapi.h|   1 +-
 tools/testing/selftests/vm/hmm-tests.c |  49 +++-
 3 files changed, 148 insertions(+), 21 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 2bd3a67..d2821dd 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -100,6 +100,7 @@ struct dmirror {
 struct dmirror_chunk {
struct dev_pagemap  pagemap;
struct dmirror_device   *mdevice;
+   bool remove;
 };
 
 /*
@@ -192,11 +193,15 @@ static int dmirror_fops_release(struct inode *inode, 
struct file *filp)
return 0;
 }
 
+static struct dmirror_chunk *dmirror_page_to_chunk(struct page *page)
+{
+   return container_of(page->pgmap, struct dmirror_chunk, pagemap);
+}
+
 static struct dmirror_device *dmirror_page_to_device(struct page *page)
 
 {
-   return container_of(page->pgmap, struct dmirror_chunk,
-   pagemap)->mdevice;
+   return dmirror_page_to_chunk(page)->mdevice;
 }
 
 static int dmirror_do_fault(struct dmirror *dmirror, struct hmm_range *range)
@@ -1219,6 +1224,84 @@ static int dmirror_snapshot(struct dmirror *dmirror,
return ret;
 }
 
+static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
+{
+   unsigned long start_pfn = chunk->pagemap.range.start >> PAGE_SHIFT;
+   unsigned long end_pfn = chunk->pagemap.range.end >> PAGE_SHIFT;
+   unsigned long npages = end_pfn - start_pfn + 1;
+   unsigned long i;
+   unsigned long *src_pfns;
+   unsigned long *dst_pfns;
+
+   src_pfns = kcalloc(npages, sizeof(*src_pfns), GFP_KERNEL);
+   dst_pfns = kcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL);
+
+   migrate_device_range(src_pfns, start_pfn, npages);
+   for (i = 0; i < npages; i++) {
+   struct page *dpage, *spage;
+
+   spage = migrate_pfn_to_page(src_pfns[i]);
+   if (!spage || !(src_pfns[i] & MIGRATE_PFN_MIGRATE))
+   continue;
+
+   if (WARN_ON(!is_device_private_page(spage) &&
+   !is_device_coherent_page(spage)))
+   continue;
+   spage = BACKING_PAGE(spage);
+   dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
+   lock_page(dpage);
+   copy_highpage(dpage, spage);
+   dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
+   if (src_pfns[i] & MIGRATE_PFN_WRITE)
+   dst_pfns[i] |= MIGRATE_PFN_WRITE;
+   }
+   migrate_device_pages(src_pfns, dst_pfns, npages);
+   migrate_device_finalize(src_pfns, dst_pfns, npages);
+   kfree(src_pfns);
+   kfree(dst_pfns);
+}
+
+/* Removes free pages from the free list so they can't be re-allocated */
+static void dmirror_remove_free_pages(struct dmirror_chunk *devmem)
+{
+   struct dmirror_device *mdevice = devmem->mdevice;
+   struct page *page;
+
+   for (page = mdevice->free_pages; page; page = page->zone_device_data)
+   if (dmirror_page_to_chunk(page) == devmem)
+   mdevice->free_pages = page->zone_device_data;
+}
+
+static void dmirror_device_remove_chunks(struct dmirror_device *mdevice)
+{
+   unsigned int i;
+
+   mutex_lock(&mdevice->devmem_lock);
+   if (mdevice->devmem_chunks) {
+   for (i = 0; i < mdevice->devmem_count; i++) {
+   struct dmirror_chunk *devmem =
+   mdevice->devmem_chunks[i];
+
+   spin_lock(&mdevice->lock);
+   devmem->remove = true;
+   dmirror_remove_free_pages(devmem);
+   spin_unlock(&mdevice->lock);
+
+   dmirror_device_evict_chunk(devmem);
+   memunmap_pages(&devmem->pagemap);
+   if (devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
+   release_mem_region(devmem->pagemap.range.start,
+  range_len(&devmem->pagemap.range));
+   kfree(devmem);
+   }
+   mdevice->devmem_count = 0;
+   mdevice->devmem_capacity = 0;
+   mdevice->free_pages = NULL;
+   kfree(mdevice->devmem_chunks);
+   }
+   mutex_unlock(&mdevice->devmem_lock);
+}
+
 static long dmirror_fops_unlocked_ioctl(struct file *filp,
unsigned int command,
unsigned long arg)
@@ -1273,6 +1356,11 @@ static long dmirror_fops_unlocked_ioctl(struct file 
*filp,
ret = dmirror_snapshot(dmirror, &cmd);
break;
 
+   case HMM_DMIRROR_RELEASE:
+   dmirror_device_remove_chunks(dmirror->mdevice);
+   ret = 0;

[PATCH 3/7] mm/migrate_device.c: Refactor migrate_vma and migrate_deivce_coherent_page()

2022-09-26 Thread Alistair Popple
migrate_device_coherent_page() reuses the existing migrate_vma family of
functions to migrate a specific page without providing a valid mapping
or vma. This looks a bit odd because it means we are calling
migrate_vma_*() without setting a valid vma, however it was considered
acceptable at the time because the details were internal to
migrate_device.c and there was only a single user.

One of the reasons the details could be kept internal was that this was
strictly for migrating device coherent memory. Such memory can be copied
directly by the CPU without intervention from a driver. However this
isn't true for device private memory, and a future change requires
similar functionality for device private memory. So refactor the code
into something more sensible for migrating device memory without a vma.

Signed-off-by: Alistair Popple 
---
 mm/migrate_device.c | 150 +
 1 file changed, 85 insertions(+), 65 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index f756c00..ba479b5 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -345,26 +345,20 @@ static bool migrate_vma_check_page(struct page *page, 
struct page *fault_page)
 }
 
 /*
- * migrate_vma_unmap() - replace page mapping with special migration pte entry
- * @migrate: migrate struct containing all migration information
- *
- * Isolate pages from the LRU and replace mappings (CPU page table pte) with a
- * special migration pte entry and check if it has been pinned. Pinned pages 
are
- * restored because we cannot migrate them.
- *
- * This is the last step before we call the device driver callback to allocate
- * destination memory and copy contents of original page over to new page.
+ * Unmaps pages for migration. Returns number of unmapped pages.
  */
-static void migrate_vma_unmap(struct migrate_vma *migrate)
+static unsigned long migrate_device_unmap(unsigned long *src_pfns,
+ unsigned long npages,
+ struct page *fault_page)
 {
-   const unsigned long npages = migrate->npages;
unsigned long i, restore = 0;
bool allow_drain = true;
+   unsigned long unmapped = 0;
 
lru_add_drain();
 
for (i = 0; i < npages; i++) {
-   struct page *page = migrate_pfn_to_page(migrate->src[i]);
+   struct page *page = migrate_pfn_to_page(src_pfns[i]);
struct folio *folio;
 
if (!page)
@@ -379,8 +373,7 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
}
 
if (isolate_lru_page(page)) {
-   migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-   migrate->cpages--;
+   src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
restore++;
continue;
}
@@ -394,34 +387,54 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
try_to_migrate(folio, 0);
 
if (page_mapped(page) ||
-   !migrate_vma_check_page(page, migrate->fault_page)) {
+   !migrate_vma_check_page(page, fault_page)) {
if (!is_zone_device_page(page)) {
get_page(page);
putback_lru_page(page);
}
 
-   migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-   migrate->cpages--;
+   src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
restore++;
continue;
}
+
+   unmapped++;
}
 
for (i = 0; i < npages && restore; i++) {
-   struct page *page = migrate_pfn_to_page(migrate->src[i]);
+   struct page *page = migrate_pfn_to_page(src_pfns[i]);
struct folio *folio;
 
-   if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
+   if (!page || (src_pfns[i] & MIGRATE_PFN_MIGRATE))
continue;
 
folio = page_folio(page);
remove_migration_ptes(folio, folio, false);
 
-   migrate->src[i] = 0;
+   src_pfns[i] = 0;
folio_unlock(folio);
folio_put(folio);
restore--;
}
+
+   return unmapped;
+}
+
+/*
+ * migrate_vma_unmap() - replace page mapping with special migration pte entry
+ * @migrate: migrate struct containing all migration information
+ *
+ * Isolate pages from the LRU and replace mappings (CPU page table pte) with a
+ * special migration pte entry and check if it has been pinned. Pinned pages 
are
+ * restored because we cannot migrate them.
+ *
+ * This is the last step before we call the device driver callb

[PATCH 1/7] mm/memory.c: Fix race when faulting a device private page

2022-09-26 Thread Alistair Popple
When the CPU tries to access a device private page the migrate_to_ram()
callback associated with the pgmap for the page is called. However no
reference is taken on the faulting page. Therefore a concurrent
migration of the device private page can free the page and possibly the
underlying pgmap. This results in a race which can crash the kernel due
to the migrate_to_ram() function pointer becoming invalid. It also means
drivers can't reliably read the zone_device_data field because the page
may have been freed with memunmap_pages().

Close the race by getting a reference on the page while holding the ptl
to ensure it has not been freed. Unfortunately the elevated reference
count will cause the migration required to handle the fault to fail. To
avoid this failure pass the faulting page into the migrate_vma functions
so that if an elevated reference count is found it can be checked to see
if it's expected or not.
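
Concretely the fault path in do_swap_page() ends up looking roughly like
this (sketch only; the actual change is the mm/memory.c hunk in this
patch):

	} else if (is_device_private_entry(entry)) {
		vmf->page = pfn_swap_entry_to_page(entry);
		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
					       vmf->address, &vmf->ptl);
		if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
			/* Raced with another fault/migration, just retry. */
			pte_unmap_unlock(vmf->pte, vmf->ptl);
			goto out;
		}
		/*
		 * Take the reference while the ptl guarantees the page has
		 * not been freed, so the pgmap and its migrate_to_ram()
		 * callback stay valid for the duration of the fault.
		 */
		get_page(vmf->page);
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
		put_page(vmf->page);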

Signed-off-by: Alistair Popple 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c   | 15 ++-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17 +++--
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 11 +---
 include/linux/migrate.h  |  8 ++-
 lib/test_hmm.c   |  7 ++---
 mm/memory.c  | 16 +++-
 mm/migrate.c | 34 ++---
 mm/migrate_device.c  | 18 +
 9 files changed, 87 insertions(+), 41 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 5980063..d4eacf4 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -508,10 +508,10 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
 static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
unsigned long start,
unsigned long end, unsigned long page_shift,
-   struct kvm *kvm, unsigned long gpa)
+   struct kvm *kvm, unsigned long gpa, struct page *fault_page)
 {
unsigned long src_pfn, dst_pfn = 0;
-   struct migrate_vma mig;
+   struct migrate_vma mig = { 0 };
struct page *dpage, *spage;
struct kvmppc_uvmem_page_pvt *pvt;
unsigned long pfn;
@@ -525,6 +525,7 @@ static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
mig.dst = &dst_pfn;
mig.pgmap_owner = _uvmem_pgmap;
mig.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
+   mig.fault_page = fault_page;
 
/* The requested page is already paged-out, nothing to do */
if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
@@ -580,12 +581,14 @@ static int __kvmppc_svm_page_out(struct vm_area_struct 
*vma,
 static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
  unsigned long start, unsigned long end,
  unsigned long page_shift,
- struct kvm *kvm, unsigned long gpa)
+ struct kvm *kvm, unsigned long gpa,
+ struct page *fault_page)
 {
int ret;
 
mutex_lock(&kvm->arch.uvmem_lock);
-   ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa);
+   ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa,
+   fault_page);
mutex_unlock(&kvm->arch.uvmem_lock);
 
return ret;
@@ -736,7 +739,7 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma,
bool pagein)
 {
unsigned long src_pfn, dst_pfn = 0;
-   struct migrate_vma mig;
+   struct migrate_vma mig = { 0 };
struct page *spage;
unsigned long pfn;
struct page *dpage;
@@ -994,7 +997,7 @@ static vm_fault_t kvmppc_uvmem_migrate_to_ram(struct 
vm_fault *vmf)
 
if (kvmppc_svm_page_out(vmf->vma, vmf->address,
vmf->address + PAGE_SIZE, PAGE_SHIFT,
-   pvt->kvm, pvt->gpa))
+   pvt->kvm, pvt->gpa, vmf->page))
return VM_FAULT_SIGBUS;
else
return 0;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index b059a77..776448b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -409,7 +409,7 @@ svm_migrate_vma_to_vram(struct amdgpu_device *adev, struct 
svm_range *prange,
uint64_t npages = (end - start) >> PAGE_SHIFT;
struct kfd_process_device *pdd;
struct dma_fence *mfence = NULL;
-   struct migrate_vma migrate;
+   struct migrate_vma migrate = { 0 };
unsigned long cpages = 0;
dma_addr_t *scratch;
void *buf;
@@ -668,7 +668,7 @@ svm_migrate_copy_to_ram(struct amdgpu_device *ade

[PATCH 6/7] nouveau/dmem: Evict device private memory during release

2022-09-26 Thread Alistair Popple
When the module is unloaded or a GPU is unbound from the module it is
possible for device private pages to be left mapped in currently running
processes. This leads to a kernel crash when the pages are either freed
or accessed from the CPU because the GPU and associated data structures
and callbacks have all been freed.

Fix this by migrating any mappings back to normal CPU memory prior to
freeing the GPU memory chunks and associated device private pages.

Signed-off-by: Alistair Popple 

---

I assume the AMD driver might have a similar issue. However I can't see
where device private (or coherent) pages actually get unmapped/freed
during teardown as I couldn't find any relevant calls to
devm_memunmap(), memunmap(), devm_release_mem_region() or
release_mem_region(). So it appears that ZONE_DEVICE pages are not being
properly freed during module unload, unless I'm missing something?
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 48 +++-
 1 file changed, 48 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 66ebbd4..3b247b8 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -369,6 +369,52 @@ nouveau_dmem_suspend(struct nouveau_drm *drm)
mutex_unlock(&drm->dmem->mutex);
 }
 
+/*
+ * Evict all pages mapping a chunk.
+ */
+void
+nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
+{
+   unsigned long i, npages = range_len(&chunk->pagemap.range) >> PAGE_SHIFT;
+   unsigned long *src_pfns, *dst_pfns;
+   dma_addr_t *dma_addrs;
+   struct nouveau_fence *fence;
+
+   src_pfns = kcalloc(npages, sizeof(*src_pfns), GFP_KERNEL);
+   dst_pfns = kcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL);
+   dma_addrs = kcalloc(npages, sizeof(*dma_addrs), GFP_KERNEL);
+
+   migrate_device_range(src_pfns, chunk->pagemap.range.start >> PAGE_SHIFT,
+   npages);
+
+   for (i = 0; i < npages; i++) {
+   if (src_pfns[i] & MIGRATE_PFN_MIGRATE) {
+   struct page *dpage;
+
+   /*
+* __GFP_NOFAIL because the GPU is going away and there
+* is nothing sensible we can do if we can't copy the
+* data back.
+*/
+   dpage = alloc_page(GFP_HIGHUSER | __GFP_NOFAIL);
+   dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
+   nouveau_dmem_copy_one(chunk->drm,
+   migrate_pfn_to_page(src_pfns[i]), dpage,
+   &dma_addrs[i]);
+   }
+   }
+
+   nouveau_fence_new(chunk->drm->dmem->migrate.chan, false, &fence);
+   migrate_device_pages(src_pfns, dst_pfns, npages);
+   nouveau_dmem_fence_done(&fence);
+   migrate_device_finalize(src_pfns, dst_pfns, npages);
+   kfree(src_pfns);
+   kfree(dst_pfns);
+   for (i = 0; i < npages; i++)
+   dma_unmap_page(chunk->drm->dev->dev, dma_addrs[i], PAGE_SIZE, DMA_BIDIRECTIONAL);
+   kfree(dma_addrs);
+}
+
 void
 nouveau_dmem_fini(struct nouveau_drm *drm)
 {
@@ -380,8 +426,10 @@ nouveau_dmem_fini(struct nouveau_drm *drm)
mutex_lock(&drm->dmem->mutex);
 
list_for_each_entry_safe(chunk, tmp, &drm->dmem->chunks, list) {
+   nouveau_dmem_evict_chunk(chunk);
nouveau_bo_unpin(chunk->bo);
nouveau_bo_ref(NULL, >bo);
+   WARN_ON(chunk->callocated);
list_del(>list);
memunmap_pages(&chunk->pagemap);
release_mem_region(chunk->pagemap.range.start,
-- 
git-series 0.9.1


[PATCH 5/7] nouveau/dmem: Refactor nouveau_dmem_fault_copy_one()

2022-09-26 Thread Alistair Popple
nouveau_dmem_fault_copy_one() is used during handling of CPU faults via
the migrate_to_ram() callback and is used to copy data from GPU to CPU
memory. It is currently specific to fault handling, however a future
patch implementing eviction of data during teardown needs similar
functionality.

Refactor out the core functionality so that it is not specific to fault
handling.

Signed-off-by: Alistair Popple 
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 59 +--
 1 file changed, 29 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index f9234ed..66ebbd4 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -139,44 +139,25 @@ static void nouveau_dmem_fence_done(struct nouveau_fence 
**fence)
}
 }
 
-static vm_fault_t nouveau_dmem_fault_copy_one(struct nouveau_drm *drm,
-   struct vm_fault *vmf, struct migrate_vma *args,
-   dma_addr_t *dma_addr)
+static int nouveau_dmem_copy_one(struct nouveau_drm *drm, struct page *spage,
+   struct page *dpage, dma_addr_t *dma_addr)
 {
struct device *dev = drm->dev->dev;
-   struct page *dpage, *spage;
-   struct nouveau_svmm *svmm;
-
-   spage = migrate_pfn_to_page(args->src[0]);
-   if (!spage || !(args->src[0] & MIGRATE_PFN_MIGRATE))
-   return 0;
 
-   dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
-   if (!dpage)
-   return VM_FAULT_SIGBUS;
lock_page(dpage);
 
*dma_addr = dma_map_page(dev, dpage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
if (dma_mapping_error(dev, *dma_addr))
-   goto error_free_page;
+   return -EIO;
 
-   svmm = spage->zone_device_data;
-   mutex_lock(&svmm->mutex);
-   nouveau_svmm_invalidate(svmm, args->start, args->end);
if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_HOST, *dma_addr,
-   NOUVEAU_APER_VRAM, nouveau_dmem_page_addr(spage)))
-   goto error_dma_unmap;
-   mutex_unlock(&svmm->mutex);
+NOUVEAU_APER_VRAM,
+nouveau_dmem_page_addr(spage))) {
+   dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+   return -EIO;
+   }
 
-   args->dst[0] = migrate_pfn(page_to_pfn(dpage));
return 0;
-
-error_dma_unmap:
-   mutex_unlock(&svmm->mutex);
-   dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
-error_free_page:
-   __free_page(dpage);
-   return VM_FAULT_SIGBUS;
 }
 
 static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
@@ -184,9 +165,11 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct 
vm_fault *vmf)
struct nouveau_drm *drm = page_to_drm(vmf->page);
struct nouveau_dmem *dmem = drm->dmem;
struct nouveau_fence *fence;
+   struct nouveau_svmm *svmm;
+   struct page *spage, *dpage;
unsigned long src = 0, dst = 0;
dma_addr_t dma_addr = 0;
-   vm_fault_t ret;
+   vm_fault_t ret = 0;
struct migrate_vma args = {
.vma= vmf->vma,
.start  = vmf->address,
@@ -207,9 +190,25 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct 
vm_fault *vmf)
if (!args.cpages)
return 0;
 
-   ret = nouveau_dmem_fault_copy_one(drm, vmf, &args, &dma_addr);
-   if (ret || dst == 0)
+   spage = migrate_pfn_to_page(src);
+   if (!spage || !(src & MIGRATE_PFN_MIGRATE))
+   goto done;
+
+   dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
+   if (!dpage)
+   goto done;
+
+   dst = migrate_pfn(page_to_pfn(dpage));
+
+   svmm = spage->zone_device_data;
+   mutex_lock(&svmm->mutex);
+   nouveau_svmm_invalidate(svmm, args.start, args.end);
+   ret = nouveau_dmem_copy_one(drm, spage, dpage, &dma_addr);
+   mutex_unlock(&svmm->mutex);
+   if (ret) {
+   ret = VM_FAULT_SIGBUS;
goto done;
+   }
 
nouveau_fence_new(dmem->migrate.chan, false, &fence);
migrate_vma_pages(&args);
-- 
git-series 0.9.1


[PATCH 2/7] mm: Free device private pages have zero refcount

2022-09-26 Thread Alistair Popple
Since 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page
refcount") device private pages have no longer had an extra reference
count when the page is in use. However before handing them back to the
owning device driver we add an extra reference count such that free
pages have a reference count of one.

This makes it difficult to tell if a page is free or not because both
free and in use pages will have a non-zero refcount. Instead we should
return pages to the driver's page allocator with a zero reference count.
Kernel code can then safely use kernel functions such as
get_page_unless_zero().

Signed-off-by: Alistair Popple 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c   | 1 +
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 +
 drivers/gpu/drm/nouveau/nouveau_dmem.c   | 1 +
 lib/test_hmm.c   | 1 +
 mm/memremap.c| 5 -
 mm/page_alloc.c  | 6 ++
 6 files changed, 10 insertions(+), 5 deletions(-)
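
As an illustration of what this buys the owning drivers, a minimal,
hypothetical sketch of an allocator that relies on free pages sitting at
refcount zero might look like this (the my_dev_* names and the free list
are assumptions, not code from this series):

#include <linux/mm.h>
#include <linux/pagemap.h>

struct my_dev;                                          /* hypothetical driver state */
struct page *my_dev_pick_free_page(struct my_dev *dev); /* hypothetical free list pop */

static struct page *my_dev_alloc_page(struct my_dev *dev)
{
    struct page *page = my_dev_pick_free_page(dev);

    if (!page)
        return NULL;

    /* Free pages now sit at refcount zero; hand the page out with one. */
    set_page_count(page, 1);
    lock_page(page);
    return page;
}

static bool my_dev_page_in_use(struct page *page)
{
    /*
     * get_page_unless_zero() can only succeed for in-use pages because
     * free pages have a zero refcount; drop the temporary reference if
     * one was taken.
     */
    if (get_page_unless_zero(page)) {
        put_page(page);
        return true;
    }
    return false;
}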

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index d4eacf4..08d2f7d 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -718,6 +718,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
 
dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
+   set_page_count(dpage, 1);
lock_page(dpage);
return dpage;
 out_clear:
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 776448b..05c2f4d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -223,6 +223,7 @@ svm_migrate_get_vram_page(struct svm_range *prange, 
unsigned long pfn)
page = pfn_to_page(pfn);
svm_range_bo_ref(prange->svm_bo);
page->zone_device_data = prange->svm_bo;
+   set_page_count(page, 1);
lock_page(page);
 }
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 1635661..f9234ed 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -326,6 +326,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}
 
+   set_page_count(page, 1);
lock_page(page);
return page;
 }
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 89463ff..2bd3a67 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -627,6 +627,7 @@ static struct page *dmirror_devmem_alloc_page(struct 
dmirror_device *mdevice)
goto error;
}
 
+   set_page_count(dpage, 1);
dpage->zone_device_data = rpage;
lock_page(dpage);
return dpage;
diff --git a/mm/memremap.c b/mm/memremap.c
index 25029a4..e065171 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -501,11 +501,6 @@ void free_zone_device_page(struct page *page)
 */
page->mapping = NULL;
page->pgmap->ops->page_free(page);
-
-   /*
-* Reset the page count to 1 to prepare for handing out the page again.
-*/
-   set_page_count(page, 1);
 }
 
 #ifdef CONFIG_FS_DAX
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9d49803..67eaab5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6744,6 +6744,12 @@ static void __ref __init_zone_device_page(struct page 
*page, unsigned long pfn,
set_pageblock_migratetype(page, MIGRATE_MOVABLE);
cond_resched();
}
+
+   /*
+* ZONE_DEVICE pages are released directly to the driver page allocator
+* which will set the page count to 1 when allocating the page.
+*/
+   set_page_count(page, 0);
 }
 
 /*
-- 
git-series 0.9.1


[PATCH] mm/gup.c: Fix formatting in check_and_migrate_movable_page()

2022-07-20 Thread Alistair Popple
Commit b05a79d4377f ("mm/gup: migrate device coherent pages when pinning
instead of failing") added a badly formatted if statement. Fix it.

Signed-off-by: Alistair Popple 
Reported-by: David Hildenbrand 
---

Apologies Andrew for missing this. Hopefully this fixes things.

 mm/gup.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 364b274a10c2..c6d060dee9e0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1980,8 +1980,8 @@ static long check_and_migrate_movable_pages(unsigned long 
nr_pages,
folio_nr_pages(folio));
}
 
-   if (!list_empty(&movable_page_list) || isolation_error_count
-   || coherent_pages)
+   if (!list_empty(&movable_page_list) || isolation_error_count ||
+   coherent_pages)
goto unpin_pages;
 
/*
-- 
2.35.1




[PATCH] mm/gup: migrate device coherent pages when pinning instead of failing

2022-07-16 Thread Alistair Popple
Currently any attempts to pin a device coherent page will fail. This is
because device coherent pages need to be managed by a device driver, and
pinning them would prevent a driver from migrating them off the device.

However this is no reason to fail pinning of these pages. They are
coherent and accessible from the CPU, so they can be migrated, just as
is already done when pinning ZONE_MOVABLE pages. So instead of failing
all attempts to pin them, first try migrating them out of ZONE_DEVICE.

[hch: rebased to the split device memory checks,
  moved migrate_device_page to migrate_device.c]

Signed-off-by: Alistair Popple 
Acked-by: Felix Kuehling 
Signed-off-by: Christoph Hellwig 
---

This patch hopefully addresses all of David's comments. It replaces both my "mm:
remove the vma check in migrate_vma_setup()" and "mm/gup: migrate device
coherent pages when pinning instead of failing" patches. I'm not sure what the
best way of including this is, perhaps Alex can respin the series with this
patch instead?
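
From a caller's point of view the change looks roughly like this (minimal
sketch with a hypothetical caller, not part of the patch itself):

#include <linux/mm.h>
#include <linux/memremap.h>

/* addr is a user address currently backed by DEVICE_COHERENT memory. */
static int my_longterm_pin(unsigned long addr)
{
    struct page *pages[1];
    long pinned;

    pinned = pin_user_pages_fast(addr, 1, FOLL_WRITE | FOLL_LONGTERM, pages);
    if (pinned != 1)
        return pinned < 0 ? pinned : -EFAULT;

    /* GUP migrated the page out of ZONE_DEVICE before pinning it. */
    WARN_ON_ONCE(is_device_coherent_page(pages[0]));

    unpin_user_page(pages[0]);
    return 0;
}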

 - Alistair

 mm/gup.c| 50 +--
 mm/internal.h   |  1 +
 mm/migrate_device.c | 52 +
 3 files changed, 96 insertions(+), 7 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index b65fe8bf5af4..22b97ab61cd9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1881,7 +1881,7 @@ static long check_and_migrate_movable_pages(unsigned long 
nr_pages,
unsigned long isolation_error_count = 0, i;
struct folio *prev_folio = NULL;
LIST_HEAD(movable_page_list);
-   bool drain_allow = true;
+   bool drain_allow = true, coherent_pages = false;
int ret = 0;
 
for (i = 0; i < nr_pages; i++) {
@@ -1891,9 +1891,38 @@ static long check_and_migrate_movable_pages(unsigned 
long nr_pages,
continue;
prev_folio = folio;
 
-   if (folio_is_longterm_pinnable(folio))
+   /*
+* Device coherent pages are managed by a driver and should not
+* be pinned indefinitely as it prevents the driver moving the
+* page. So when trying to pin with FOLL_LONGTERM instead try
+* to migrate the page out of device memory.
+*/
+   if (folio_is_device_coherent(folio)) {
+   /*
+* We always want a new GUP lookup with device coherent
+* pages.
+*/
+   pages[i] = 0;
+   coherent_pages = true;
+
+   /*
+* Migration will fail if the page is pinned, so convert
+* the pin on the source page to a normal reference.
+*/
+   if (gup_flags & FOLL_PIN) {
+   get_page(&folio->page);
+   unpin_user_page(&folio->page);
+   }
+
+   ret = migrate_device_coherent_page(&folio->page);
+   if (ret)
+   goto unpin_pages;
+
continue;
+   }
 
+   if (folio_is_longterm_pinnable(folio))
+   continue;
/*
 * Try to move out any movable page before pinning the range.
 */
@@ -1919,7 +1948,8 @@ static long check_and_migrate_movable_pages(unsigned long 
nr_pages,
folio_nr_pages(folio));
}
 
-   if (!list_empty(&movable_page_list) || isolation_error_count)
+   if (!list_empty(&movable_page_list) || isolation_error_count
+   || coherent_pages)
goto unpin_pages;
 
/*
@@ -1929,10 +1959,16 @@ static long check_and_migrate_movable_pages(unsigned 
long nr_pages,
return nr_pages;
 
 unpin_pages:
-   if (gup_flags & FOLL_PIN) {
-   unpin_user_pages(pages, nr_pages);
-   } else {
-   for (i = 0; i < nr_pages; i++)
+   /*
+* pages[i] might be NULL if any device coherent pages were found.
+*/
+   for (i = 0; i < nr_pages; i++) {
+   if (!pages[i])
+   continue;
+
+   if (gup_flags & FOLL_PIN)
+   unpin_user_page(pages[i]);
+   else
put_page(pages[i]);
}
 
diff --git a/mm/internal.h b/mm/internal.h
index c0f8fbe0445b..899dab512c5a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -853,6 +853,7 @@ int numa_migrate_prep(struct page *page, struct 
vm_area_struct *vma,
  unsigned long addr, int page_nid, int *flags);
 
 void free_zone_device_page(struct page *page);
+int migrate_device_coherent_page(struct page *page);
 
 /*
  * mm/gup.c
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 18bc6483f63a..7feeb447e3b9 100644
--- a/mm/migrate_device.c
+
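
For reference, the helper added to mm/migrate_device.c has roughly the
following shape (a sketch combining the earlier revision quoted further
down in this archive with the changes suggested in review; the exact hunk
may differ):

int migrate_device_coherent_page(struct page *page)
{
    unsigned long src_pfn, dst_pfn = 0;
    struct migrate_vma args;
    struct page *dpage;

    WARN_ON_ONCE(PageCompound(page));

    lock_page(page);
    src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
    args.src = &src_pfn;
    args.dst = &dst_pfn;
    args.cpages = 1;
    args.npages = 1;
    args.vma = NULL;

    /*
     * The caller already holds a reference on the page and there is no
     * VMA to walk, so unmap the page directly rather than going through
     * migrate_vma_setup(), which requires a valid vma.
     */
    migrate_vma_unmap(&args);
    if (!(src_pfn & MIGRATE_PFN_MIGRATE))
        return -EBUSY;

    dpage = alloc_page(GFP_USER | __GFP_NOWARN);
    if (dpage) {
        lock_page(dpage);
        dst_pfn = migrate_pfn(page_to_pfn(dpage));
    }

    migrate_vma_pages(&args);
    if (src_pfn & MIGRATE_PFN_MIGRATE)
        copy_highpage(dpage, page);
    migrate_vma_finalize(&args);

    if (src_pfn & MIGRATE_PFN_MIGRATE)
        return 0;
    return -EBUSY;
}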

Re: [PATCH v8 06/15] mm: remove the vma check in migrate_vma_setup()

2022-07-14 Thread Alistair Popple


David Hildenbrand  writes:

> On 07.07.22 21:03, Alex Sierra wrote:
>> From: Alistair Popple 
>>
>> migrate_vma_setup() checks that a valid vma is passed so that the page
>> tables can be walked to find the pfns associated with a given address
>> range. However in some cases the pfns are already known, such as when
>> migrating device coherent pages during pin_user_pages() meaning a valid
>> vma isn't required.
>
> As raised in my other reply, without a VMA ... it feels odd to use a
> "migrate_vma" API. For an internal (mm/migrate_device.c) use case it is
> ok I guess, but it certainly adds a bit of confusion. For example,
> because migrate_vma_setup() will undo ref+lock not obtained by it.
>
> I guess the interesting point is that
>
> a) Besides migrate_vma_pages() and migrate_vma_setup(), the ->vma is unused.
>
> b) migrate_vma_setup() does collect+unmap+cleanup if unmap failed.
>
> c) With our source page in our hands, we cannot be processing a hole in
> a VMA.
>
>
>
> Not sure if it's better. but I would
>
> a) Enforce in migrate_vma_setup() that there is a VMA. Code outside of
> mm/migrate_device.c shouldn't be doing some hacks like this.
>
> b) Don't call migrate_vma_setup() from migrate_device_page(), but
> directly migrate_vma_unmap() and add a comment.
>
>
> That will leave a single change to this patch (migrate_vma_pages()). But
> is that even required? Because 
>
>> @@ -685,7 +685,7 @@ void migrate_vma_pages(struct migrate_vma *migrate)
>>  continue;
>>  }
>>
>> -if (!page) {
>> +if (!page && migrate->vma) {
>
> How could we ever have !page in case of migrate_device_page()?

Oh good point. This patch was originally part of a larger series I was
working on at the time but you're right - for migrate_device_page() we
should never hit this case. I will respin the next patch (number 7 in
this series) to include this.

> Instead, I think a VM_BUG_ON(migrate->vma); should hold and you can just
> simplify.
>
>>  if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
>>  continue;
>>  if (!notified) {


Re: [PATCH v8 07/15] mm/gup: migrate device coherent pages when pinning instead of failing

2022-07-14 Thread Alistair Popple


David Hildenbrand  writes:

> On 07.07.22 21:03, Alex Sierra wrote:
>> From: Alistair Popple 
>>
>> Currently any attempts to pin a device coherent page will fail. This is
>> because device coherent pages need to be managed by a device driver, and
>> pinning them would prevent a driver from migrating them off the device.
>>
>> However this is no reason to fail pinning of these pages. These are
>> coherent and accessible from the CPU so can be migrated just like
>> pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin
>> them first try migrating them out of ZONE_DEVICE.
>>
>> Signed-off-by: Alistair Popple 
>> Acked-by: Felix Kuehling 
>> [hch: rebased to the split device memory checks,
>>   moved migrate_device_page to migrate_device.c]
>> Signed-off-by: Christoph Hellwig 
>> ---
>>  mm/gup.c| 47 +++-
>>  mm/internal.h   |  1 +
>>  mm/migrate_device.c | 53 +
>>  3 files changed, 96 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index b65fe8bf5af4..9b6b9923d22d 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -1891,9 +1891,43 @@ static long check_and_migrate_movable_pages(unsigned 
>> long nr_pages,
>>  continue;
>>  prev_folio = folio;
>>
>> -if (folio_is_longterm_pinnable(folio))
>> +/*
>> + * Device private pages will get faulted in during gup so it
>> + * shouldn't be possible to see one here.
>> + */
>> +if (WARN_ON_ONCE(folio_is_device_private(folio))) {
>> +ret = -EFAULT;
>> +goto unpin_pages;
>> +}
>
> I'd just drop that. Device private pages are never part of a present PTE. So 
> if we
> could actually get a grab of one via GUP we would be in bigger trouble ...
> already before this patch.

Fair.

>> +
>> +/*
>> + * Device coherent pages are managed by a driver and should not
>> + * be pinned indefinitely as it prevents the driver moving the
>> + * page. So when trying to pin with FOLL_LONGTERM instead try
>> + * to migrate the page out of device memory.
>> + */
>> +if (folio_is_device_coherent(folio)) {
>> +WARN_ON_ONCE(PageCompound(&folio->page));
>
> Maybe that belongs into migrate_device_page()?

Ok (noting Matthew's comment there as well).

>> +
>> +/*
>> + * Migration will fail if the page is pinned, so convert
>
> [...]
>
>>  /*
>>   * mm/gup.c
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index cf9668376c5a..5decd26dd551 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -794,3 +794,56 @@ void migrate_vma_finalize(struct migrate_vma *migrate)
>>  }
>>  }
>>  EXPORT_SYMBOL(migrate_vma_finalize);
>> +
>> +/*
>> + * Migrate a device coherent page back to normal memory.  The caller should 
>> have
>> + * a reference on page which will be copied to the new page if migration is
>> + * successful or dropped on failure.
>> + */
>> +struct page *migrate_device_page(struct page *page, unsigned int gup_flags)
>
> Function name should most probably indicate that we're dealing with coherent 
> pages here?

Ok.

>> +{
>> +unsigned long src_pfn, dst_pfn = 0;
>> +struct migrate_vma args;
>> +struct page *dpage;
>> +
>> +lock_page(page);
>> +src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
>> +args.src = &src_pfn;
>> +args.dst = &dst_pfn;
>> +args.cpages = 1;
>> +args.npages = 1;
>> +args.vma = NULL;
>> +migrate_vma_setup(&args);
>> +if (!(src_pfn & MIGRATE_PFN_MIGRATE))
>> +return NULL;
>
> Wow, these refcount and page locking/unlocking rules with this migrate_* api 
> are
> confusing now. And the usage here of sometimes returning and sometimes falling
> trough don't make it particularly easier to understand here.
>
> I'm not 100% happy about reusing migrate_vma_setup() usage if there *is no 
> VMA*.
> That's just absolutely confusing, because usually migrate_vma_setup() itself
> would do the collection step and ref+lock pages. :/
>
> In general, I can see why/how we're reusing the migrate_vma_* API here, but 
> there
> is absolutely no VMA ... not sure what to improve besides providing

Re: [PATCH v7 04/14] mm: add device coherent vma selection for memory migration

2022-06-30 Thread Alistair Popple


David Hildenbrand  writes:

> On 29.06.22 05:54, Alex Sierra wrote:
>> This case is used to migrate pages from device memory, back to system
>> memory. Device coherent type memory is cache coherent from device and CPU
>> point of view.
>>
>> Signed-off-by: Alex Sierra 
>> Acked-by: Felix Kuehling 
>> Reviewed-by: Alistair Popple 
>> Signed-off-by: Christoph Hellwig 
>
>
> I'm not too familiar with this code, please excuse my naive questions:
>
>> @@ -148,15 +148,21 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>  if (is_writable_device_private_entry(entry))
>>  mpfn |= MIGRATE_PFN_WRITE;
>>  } else {
>> -if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
>> -goto next;
>
> Why not exclude MIGRATE_VMA_SELECT_DEVICE_PRIVATE here? IIRC that would
> have happened before this change.

I might be missing something as I don't quite follow - this path is for
normal system pages so we only want to skip selecting them if
MIGRATE_VMA_SELECT_SYSTEM or MIGRATE_VMA_SELECT_DEVICE_COHERENT aren't
set.

Note that MIGRATE_VMA_SELECT_DEVICE_PRIVATE doesn't apply here because
we already know it's not a device private page by virtue of
pte_present(pte) == True.

>>  pfn = pte_pfn(pte);
>> -if (is_zero_pfn(pfn)) {
>> +if (is_zero_pfn(pfn) &&
>> +(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
>>  mpfn = MIGRATE_PFN_MIGRATE;
>>  migrate->cpages++;
>>  goto next;
>>  }
>>  page = vm_normal_page(migrate->vma, addr, pte);
>> +if (page && !is_zone_device_page(page) &&
>
> I'm wondering if that check logically belongs into patch #2.

I don't think so as it would break functionality until the below
conditionals are added - we explicitly don't want to skip
is_zone_device_page(page) == False here because that is the pages we are
trying to select.

You could add in this:

>> +!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))

But then in patch 2 we know this can never be true because we've already
checked for !MIGRATE_VMA_SELECT_SYSTEM there.

>> +goto next;
>> +else if (page && is_device_coherent_page(page) &&
>> +(!(migrate->flags & 
>> MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
>> + page->pgmap->owner != migrate->pgmap_owner))
>
>
> In general LGTM


Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

2022-06-22 Thread Alistair Popple


David Hildenbrand  writes:

> On 21.06.22 18:08, Sierra Guiza, Alejandro (Alex) wrote:
>>
>> On 6/21/2022 7:25 AM, David Hildenbrand wrote:
>>> On 21.06.22 13:55, Alistair Popple wrote:
>>>> David Hildenbrand  writes:
>>>>
>>>>> On 21.06.22 13:25, Felix Kuehling wrote:
>>>>>> Am 6/17/22 um 23:19 schrieb David Hildenbrand:
>>>>>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>>>>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>>>>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>>>>>>>> Device memory that is cache coherent from device and CPU point of 
>>>>>>>>>>>> view.
>>>>>>>>>>>> This is used on platforms that have an advanced system bus (like 
>>>>>>>>>>>> CAPI
>>>>>>>>>>>> or CXL). Any page of a process can be migrated to such memory. 
>>>>>>>>>>>> However,
>>>>>>>>>>>> no one should be allowed to pin such memory so that it can always 
>>>>>>>>>>>> be
>>>>>>>>>>>> evicted.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Alex Sierra
>>>>>>>>>>>> Acked-by: Felix Kuehling
>>>>>>>>>>>> Reviewed-by: Alistair Popple
>>>>>>>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>>>>>>>   removed is_dev_private_or_coherent_page]
>>>>>>>>>>>> Signed-off-by: Christoph Hellwig
>>>>>>>>>>>> ---
>>>>>>>>>>>>  include/linux/memremap.h | 19 +++
>>>>>>>>>>>>  mm/memcontrol.c  |  7 ---
>>>>>>>>>>>>  mm/memory-failure.c  |  8 ++--
>>>>>>>>>>>>  mm/memremap.c| 10 ++
>>>>>>>>>>>>  mm/migrate_device.c  | 16 +++-
>>>>>>>>>>>>  mm/rmap.c|  5 +++--
>>>>>>>>>>>>  6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>>>>>>>>   * A more complete discussion of unaddressable memory may be 
>>>>>>>>>>>> found in
>>>>>>>>>>>>   * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>>>>>>>>   *
>>>>>>>>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>>>>>>>>> + * Device memory that is cache coherent from device and CPU point 
>>>>>>>>>>>> of view. This
>>>>>>>>>>>> + * is used on platforms that have an advanced system bus (like 
>>>>>>>>>>>> CAPI or CXL). A
>>>>>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and 
>>>>>>>>>>>> with that memory
>>>>>>>>>>>> + * type. Any page of a process can be migrated to such memory. 
>>>>>>>>>>>> However no one
>>>>>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking 
>>>>>>>>>>> about special pages
>>>>>>>>>>> like vdso, shared zeropage, ... pinned pages ...
>>>>>>>>> Well, you cannot migrate long term pages, that's what I meant :)
>>>>>>>>>
>>>>>>>>>>>> + * should be allowed to pin such memory s

Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

2022-06-21 Thread Alistair Popple


David Hildenbrand  writes:

> On 21.06.22 13:25, Felix Kuehling wrote:
>>
>> Am 6/17/22 um 23:19 schrieb David Hildenbrand:
>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>>>> Device memory that is cache coherent from device and CPU point of view.
>>>>>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>>>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>>>>>> no one should be allowed to pin such memory so that it can always be
>>>>>>>> evicted.
>>>>>>>>
>>>>>>>> Signed-off-by: Alex Sierra 
>>>>>>>> Acked-by: Felix Kuehling 
>>>>>>>> Reviewed-by: Alistair Popple 
>>>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>>>  removed is_dev_private_or_coherent_page]
>>>>>>>> Signed-off-by: Christoph Hellwig 
>>>>>>>> ---
>>>>>>>> include/linux/memremap.h | 19 +++
>>>>>>>> mm/memcontrol.c  |  7 ---
>>>>>>>> mm/memory-failure.c  |  8 ++--
>>>>>>>> mm/memremap.c| 10 ++
>>>>>>>> mm/migrate_device.c  | 16 +++-
>>>>>>>> mm/rmap.c|  5 +++--
>>>>>>>> 6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>>>>  * A more complete discussion of unaddressable memory may be found 
>>>>>>>> in
>>>>>>>>  * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>>>>  *
>>>>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>>>>> + * Device memory that is cache coherent from device and CPU point of 
>>>>>>>> view. This
>>>>>>>> + * is used on platforms that have an advanced system bus (like CAPI 
>>>>>>>> or CXL). A
>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with 
>>>>>>>> that memory
>>>>>>>> + * type. Any page of a process can be migrated to such memory. 
>>>>>>>> However no one
>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking about 
>>>>>>> special pages
>>>>>>> like vdso, shared zeropage, ... pinned pages ...
>>>>> Well, you cannot migrate long term pages, that's what I meant :)
>>>>>
>>>>>>>> + * should be allowed to pin such memory so that it can always be 
>>>>>>>> evicted.
>>>>>>>> + *
>>>>>>>>  * MEMORY_DEVICE_FS_DAX:
>>>>>>>>  * Host memory that has similar access semantics as System RAM 
>>>>>>>> i.e. DMA
>>>>>>>>  * coherent and supports page pinning. In support of coordinating 
>>>>>>>> page
>>>>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>>>>>> enum memory_type {
>>>>>>>>/* 0 is reserved to catch uninitialized type fields */
>>>>>>>>MEMORY_DEVICE_PRIVATE = 1,
>>>>>>>> +  MEMORY_DEVICE_COHERENT,
>>>>>>>>MEMORY_DEVICE_FS_DAX,
>>>>>>>>MEMORY_DEVICE_GENERIC,
>>>>>>>>MEMORY_DEVICE_PCI_P2PDMA,
>>>>>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const 
>>>>>>>> struct folio *folio)
>>>>>>> In general, this LGTM, and it should be correct with PageAnonExclusive 
>>>>

Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

2022-06-20 Thread Alistair Popple


Oded Gabbay  writes:

> On Mon, Jun 20, 2022 at 3:33 AM Alistair Popple  wrote:
>>
>>
>> Oded Gabbay  writes:
>>
>> > On Fri, Jun 17, 2022 at 8:20 PM Sierra Guiza, Alejandro (Alex)
>> >  wrote:
>> >>
>> >>
>> >> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>> >> > On 31.05.22 22:00, Alex Sierra wrote:
>> >> >> Device memory that is cache coherent from device and CPU point of view.
>> >> >> This is used on platforms that have an advanced system bus (like CAPI
>> >> >> or CXL). Any page of a process can be migrated to such memory. However,
>> >> >> no one should be allowed to pin such memory so that it can always be
>> >> >> evicted.
>> >> >>
>> >> >> Signed-off-by: Alex Sierra 
>> >> >> Acked-by: Felix Kuehling 
>> >> >> Reviewed-by: Alistair Popple 
>> >> >> [hch: rebased ontop of the refcount changes,
>> >> >>removed is_dev_private_or_coherent_page]
>> >> >> Signed-off-by: Christoph Hellwig 
>> >> >> ---
>> >> >>   include/linux/memremap.h | 19 +++
>> >> >>   mm/memcontrol.c  |  7 ---
>> >> >>   mm/memory-failure.c  |  8 ++--
>> >> >>   mm/memremap.c| 10 ++
>> >> >>   mm/migrate_device.c  | 16 +++-
>> >> >>   mm/rmap.c|  5 +++--
>> >> >>   6 files changed, 49 insertions(+), 16 deletions(-)
>> >> >>
>> >> >> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> >> >> index 8af304f6b504..9f752ebed613 100644
>> >> >> --- a/include/linux/memremap.h
>> >> >> +++ b/include/linux/memremap.h
>> >> >> @@ -41,6 +41,13 @@ struct vmem_altmap {
>> >> >>* A more complete discussion of unaddressable memory may be found in
>> >> >>* include/linux/hmm.h and Documentation/vm/hmm.rst.
>> >> >>*
>> >> >> + * MEMORY_DEVICE_COHERENT:
>> >> >> + * Device memory that is cache coherent from device and CPU point of 
>> >> >> view. This
>> >> >> + * is used on platforms that have an advanced system bus (like CAPI 
>> >> >> or CXL). A
>> >> >> + * driver can hotplug the device memory using ZONE_DEVICE and with 
>> >> >> that memory
>> >> >> + * type. Any page of a process can be migrated to such memory. 
>> >> >> However no one
>> >> > Any page might not be right, I'm pretty sure. ... just thinking about 
>> >> > special pages
>> >> > like vdso, shared zeropage, ... pinned pages ...
>> >>
>> >> Hi David,
>> >>
>> >> Yes, I think you're right. This type does not cover all special pages.
>> >> I need to correct that on the cover letter.
>> >> Pinned pages are allowed as long as they're not long term pinned.
>> >>
>> >> Regards,
>> >> Alex Sierra
>> >
>> > What if I want to hotplug this device's coherent memory, but I do
>> > *not* want the OS
>> > to migrate any page to it ?
>> > I want to fully-control what resides on this memory, as I can consider
>> > this memory
>> > "expensive". i.e. I don't have a lot of it, I want to use it for
>> > specific purposes and
>> > I don't want the OS to start using it when there is some memory pressure in
>> > the system.
>>
>> This is exactly what MEMORY_DEVICE_COHERENT is for. Device coherent
>> pages are only allocated by a device driver and exposed to user-space by
>> a driver migrating pages to them with migrate_vma. The OS can't just
>> start using them due to memory pressure for example.
>>
>>  - Alistair
> Thanks for the explanation.
>
> I guess the commit message confused me a bit, especially these two sentences:
>
> "Any page of a process can be migrated to such memory. However no one should 
> be
> allowed to pin such memory so that it can always be evicted."
>
> I read them as if the OS is free to choose which pages are migrated to
> this memory,
> and anything is eligible for migration to that memory (and that's why
> we also don't
> allow it to pin memory there).
>
> If we are not allowed to pin anything there, can the device driver
> decide to d

Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

2022-06-20 Thread Alistair Popple


Oded Gabbay  writes:

> On Fri, Jun 17, 2022 at 8:20 PM Sierra Guiza, Alejandro (Alex)
>  wrote:
>>
>>
>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>> > On 31.05.22 22:00, Alex Sierra wrote:
>> >> Device memory that is cache coherent from device and CPU point of view.
>> >> This is used on platforms that have an advanced system bus (like CAPI
>> >> or CXL). Any page of a process can be migrated to such memory. However,
>> >> no one should be allowed to pin such memory so that it can always be
>> >> evicted.
>> >>
>> >> Signed-off-by: Alex Sierra 
>> >> Acked-by: Felix Kuehling 
>> >> Reviewed-by: Alistair Popple 
>> >> [hch: rebased ontop of the refcount changes,
>> >>removed is_dev_private_or_coherent_page]
>> >> Signed-off-by: Christoph Hellwig 
>> >> ---
>> >>   include/linux/memremap.h | 19 +++
>> >>   mm/memcontrol.c  |  7 ---
>> >>   mm/memory-failure.c  |  8 ++--
>> >>   mm/memremap.c| 10 ++
>> >>   mm/migrate_device.c  | 16 +++-
>> >>   mm/rmap.c|  5 +++--
>> >>   6 files changed, 49 insertions(+), 16 deletions(-)
>> >>
>> >> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> >> index 8af304f6b504..9f752ebed613 100644
>> >> --- a/include/linux/memremap.h
>> >> +++ b/include/linux/memremap.h
>> >> @@ -41,6 +41,13 @@ struct vmem_altmap {
>> >>* A more complete discussion of unaddressable memory may be found in
>> >>* include/linux/hmm.h and Documentation/vm/hmm.rst.
>> >>*
>> >> + * MEMORY_DEVICE_COHERENT:
>> >> + * Device memory that is cache coherent from device and CPU point of 
>> >> view. This
>> >> + * is used on platforms that have an advanced system bus (like CAPI or 
>> >> CXL). A
>> >> + * driver can hotplug the device memory using ZONE_DEVICE and with that 
>> >> memory
>> >> + * type. Any page of a process can be migrated to such memory. However 
>> >> no one
>> > Any page might not be right, I'm pretty sure. ... just thinking about 
>> > special pages
>> > like vdso, shared zeropage, ... pinned pages ...
>>
>> Hi David,
>>
>> Yes, I think you're right. This type does not cover all special pages.
>> I need to correct that on the cover letter.
>> Pinned pages are allowed as long as they're not long term pinned.
>>
>> Regards,
>> Alex Sierra
>
> What if I want to hotplug this device's coherent memory, but I do
> *not* want the OS
> to migrate any page to it ?
> I want to fully-control what resides on this memory, as I can consider
> this memory
> "expensive". i.e. I don't have a lot of it, I want to use it for
> specific purposes and
> I don't want the OS to start using it when there is some memory pressure in
> the system.

This is exactly what MEMORY_DEVICE_COHERENT is for. Device coherent
pages are only allocated by a device driver and exposed to user-space by
a driver migrating pages to them with migrate_vma. The OS can't just
start using them due to memory pressure for example.
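
To illustrate, the only way such pages ever end up mapped into a process
is a driver-initiated flow along these lines (hypothetical driver,
minimal sketch with error handling trimmed; the my_dev_* names are made
up):

#include <linux/migrate.h>
#include <linux/highmem.h>
#include <linux/slab.h>

struct my_dev;                                                /* hypothetical */
struct page *my_dev_alloc_coherent_page(struct my_dev *dev);  /* hypothetical */

static int my_dev_migrate_to_coherent(struct my_dev *dev,
                                      struct vm_area_struct *vma,
                                      unsigned long start, unsigned long end)
{
    unsigned long npages = (end - start) >> PAGE_SHIFT;
    unsigned long *src, *dst;
    struct migrate_vma args = {
        .vma         = vma,
        .start       = start,
        .end         = end,
        .pgmap_owner = dev,
        .flags       = MIGRATE_VMA_SELECT_SYSTEM,
    };
    unsigned long i;
    int ret = -ENOMEM;

    src = kcalloc(npages, sizeof(*src), GFP_KERNEL);
    dst = kcalloc(npages, sizeof(*dst), GFP_KERNEL);
    if (!src || !dst)
        goto out;
    args.src = src;
    args.dst = dst;

    ret = migrate_vma_setup(&args);
    if (ret)
        goto out;

    for (i = 0; i < npages; i++) {
        struct page *spage = migrate_pfn_to_page(src[i]);
        struct page *dpage;

        if (!spage || !(src[i] & MIGRATE_PFN_MIGRATE))
            continue;

        dpage = my_dev_alloc_coherent_page(dev);
        if (!dpage)
            continue;

        lock_page(dpage);
        /* Coherent memory, so a CPU copy works; real drivers may use DMA. */
        copy_highpage(dpage, spage);
        dst[i] = migrate_pfn(page_to_pfn(dpage));
    }

    migrate_vma_pages(&args);
    migrate_vma_finalize(&args);
out:
    kfree(src);
    kfree(dst);
    return ret;
}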

 - Alistair

> Oded
>
>>
>> >
>> >> + * should be allowed to pin such memory so that it can always be evicted.
>> >> + *
>> >>* MEMORY_DEVICE_FS_DAX:
>> >>* Host memory that has similar access semantics as System RAM i.e. DMA
>> >>* coherent and supports page pinning. In support of coordinating page
>> >> @@ -61,6 +68,7 @@ struct vmem_altmap {
>> >>   enum memory_type {
>> >>  /* 0 is reserved to catch uninitialized type fields */
>> >>  MEMORY_DEVICE_PRIVATE = 1,
>> >> +MEMORY_DEVICE_COHERENT,
>> >>  MEMORY_DEVICE_FS_DAX,
>> >>  MEMORY_DEVICE_GENERIC,
>> >>  MEMORY_DEVICE_PCI_P2PDMA,
>> >> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const 
>> >> struct folio *folio)
>> > In general, this LGTM, and it should be correct with PageAnonExclusive I 
>> > think.
>> >
>> >
>> > However, where exactly is pinning forbidden?
>>
>> Long-term pinning is forbidden since it would interfere with the device
>> memory manager owning the
>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>> is allowed on this device type.
>>
>> Regards,
>> Alex Sierra
>>
>> >


Re: [PATCH v5 02/13] mm: handling Non-LRU pages returned by vm_normal_pages

2022-06-08 Thread Alistair Popple


I can't see any issues with this now so:

Reviewed-by: Alistair Popple 

Alex Sierra  writes:

> With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
> device-managed anonymous pages that are not LRU pages. Although they
> behave like normal pages for purposes of mapping in CPU page, and for
> COW. They do not support LRU lists, NUMA migration or THP.
>
> We also introduced a FOLL_LRU flag that adds the same behaviour to
> follow_page and related APIs, to allow callers to specify that they
> expect to put pages on an LRU list.
>
> Signed-off-by: Alex Sierra 
> Acked-by: Felix Kuehling 
> ---
>  fs/proc/task_mmu.c | 2 +-
>  include/linux/mm.h | 3 ++-
>  mm/gup.c   | 6 +-
>  mm/huge_memory.c   | 2 +-
>  mm/khugepaged.c| 9 ++---
>  mm/ksm.c   | 6 +++---
>  mm/madvise.c   | 4 ++--
>  mm/memory.c| 9 -
>  mm/mempolicy.c | 2 +-
>  mm/migrate.c   | 4 ++--
>  mm/mlock.c | 2 +-
>  mm/mprotect.c  | 2 +-
>  12 files changed, 33 insertions(+), 18 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 2d04e3470d4c..2dd8c8a66924 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1792,7 +1792,7 @@ static struct page *can_gather_numa_stats(pte_t pte, 
> struct vm_area_struct *vma,
>   return NULL;
>
>   page = vm_normal_page(vma, addr, pte);
> - if (!page)
> + if (!page || is_zone_device_page(page))
>   return NULL;
>
>   if (PageReserved(page))
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bc8f326be0ce..d3f43908ff8d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -601,7 +601,7 @@ struct vm_operations_struct {
>  #endif
>   /*
>* Called by vm_normal_page() for special PTEs to find the
> -  * page for @addr.  This is useful if the default behavior
> +  * page for @addr. This is useful if the default behavior
>* (using pte_page()) would not find the correct page.
>*/
>   struct page *(*find_special_page)(struct vm_area_struct *vma,
> @@ -2934,6 +2934,7 @@ struct page *follow_page(struct vm_area_struct *vma, 
> unsigned long address,
>  #define FOLL_NUMA0x200   /* force NUMA hinting page fault */
>  #define FOLL_MIGRATION   0x400   /* wait for page to replace migration 
> entry */
>  #define FOLL_TRIED   0x800   /* a retry, previous pass started an IO */
> +#define FOLL_LRU0x1000  /* return only LRU (anon or page cache) */
>  #define FOLL_REMOTE  0x2000  /* we are working on non-current tsk/mm */
>  #define FOLL_COW 0x4000  /* internal GUP flag */
>  #define FOLL_ANON0x8000  /* don't do file mappings */
> diff --git a/mm/gup.c b/mm/gup.c
> index 551264407624..48b45bcc8501 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -532,7 +532,11 @@ static struct page *follow_page_pte(struct 
> vm_area_struct *vma,
>   }
>
>   page = vm_normal_page(vma, address, pte);
> - if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
> + if ((flags & FOLL_LRU) && ((page && is_zone_device_page(page)) ||
> + (!page && pte_devmap(pte)))) {
> + page = ERR_PTR(-EEXIST);
> + goto out;
> + } else if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) 
> {
>   /*
>* Only return device mapping pages in the FOLL_GET or FOLL_PIN
>* case since they are only valid while holding the pgmap
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a77c78a2b6b5..48182c8fe151 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2906,7 +2906,7 @@ static int split_huge_pages_pid(int pid, unsigned long 
> vaddr_start,
>   }
>
>   /* FOLL_DUMP to ignore special (like zero) pages */
> - page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
> + page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP | FOLL_LRU);
>
>   if (IS_ERR(page))
>   continue;
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 16be62d493cd..671ac7800e53 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -618,7 +618,7 @@ static int __collapse_huge_page_isolate(struct 
> vm_area_struct *vma,
>   goto out;
>   }
>   page = vm_normal_page(vma, address, pteval);
> - if (unlikely(!page)) {
> + if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
>   result = SCAN_PAGE_NULL;
>   goto out;
>   }
> @@ -1267,7 +126

Re: [PATCH v3 02/13] mm: handling Non-LRU pages returned by vm_normal_pages

2022-05-27 Thread Alistair Popple


Felix Kuehling  writes:

> Am 2022-05-25 um 00:11 schrieb Alistair Popple:
>> Alex Sierra  writes:
>>
>>> With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
>>> device-managed anonymous pages that are not LRU pages. Although they
>>> behave like normal pages for purposes of mapping in CPU page, and for
>>> COW. They do not support LRU lists, NUMA migration or THP.
>>>
>>> We also introduced a FOLL_LRU flag that adds the same behaviour to
>>> follow_page and related APIs, to allow callers to specify that they
>>> expect to put pages on an LRU list.
>> Continuing the follow up from the thread for v2:
>>
>>>> This means by default GUP can return non-LRU pages. I didn't see
>>>> anywhere that would be a problem but I didn't check everything. Did you
>>>> check this or is there some other reason I've missed that makes this not
>>>> a problem?
>>> I have double checked all gup and pin_user_pages callers and none of them 
>>> seem
>>> to have interaction with LRU APIs.
>> And actually if I'm understanding things correctly callers of
>> GUP/PUP/follow_page_pte() should already expect to get non-LRU pages
>> returned:
>>
>>  page = vm_normal_page(vma, address, pte);
>>  if ((flags & FOLL_LRU) && page && is_zone_device_page(page))
>>  page = NULL;
>>  if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
>>  /*
>>   * Only return device mapping pages in the FOLL_GET or FOLL_PIN
>>   * case since they are only valid while holding the pgmap
>>   * reference.
>>   */
>>  *pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
>>  if (*pgmap)
>>  page = pte_page(pte);
>>
>> Which I think makes FOLL_LRU confusing, because if understand correctly
>> even with FOLL_LRU it is still possible for follow_page_pte() to return
>> a non-LRU page. Could we do something like this to make it consistent:
>>
>>  if ((flags & FOLL_LRU) && (page && is_zone_device_page(page) ||
>>  !page && pte_devmap(pte)))
>
> This alone won't help if it still goes into the if (!page && pte_devmap(pte)
> ...) afterwards. I think what you're suggesting is:
>
> + if ((flags & FOLL_LRU) && (page && is_zone_device_page(page) ||
> +!page && pte_devmap(pte)))
> + page = NULL;
> - if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
> + else if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
>
> Is that what you meant?

Oh my bad. Yes, that is what I meant. Although as Alex pointed out we
should goto no_page as well. However we also need to fix up the return
code, because returning NULL will cause GUP to try and fault the page in
when it already possibly exists. So I think something like this should
work:

  page = vm_normal_page(vma, address, pte);
  if ((flags & FOLL_LRU) && (page && is_zone_device_page(page) ||
  !page && pte_devmap(pte))) {
  page = ERR_PTR(-EEXIST);
  goto out;
  } else if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
  /*
   * Only return device mapping pages in the FOLL_GET or FOLL_PIN
   * case since they are only valid while holding the pgmap
   * reference.
   */
  *pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
  if (*pgmap)
  page = pte_page(pte);

> Regards,
>   Felix
>
>
>>
>> Looking at callers that currently use FOLL_LRU I don't think this would
>> change any behaviour as they already filter out devmap through various
>> other means.
>>
>>> Signed-off-by: Alex Sierra 
>>> Acked-by: Felix Kuehling 
>>> ---
>>>   fs/proc/task_mmu.c | 2 +-
>>>   include/linux/mm.h | 3 ++-
>>>   mm/gup.c   | 2 ++
>>>   mm/huge_memory.c   | 2 +-
>>>   mm/khugepaged.c| 9 ++---
>>>   mm/ksm.c   | 6 +++---
>>>   mm/madvise.c   | 4 ++--
>>>   mm/memory.c| 9 -
>>>   mm/mempolicy.c | 2 +-
>>>   mm/migrate.c   | 4 ++--
>>>   mm/mlock.c | 2 +-
>>>   mm/mprotect.c  | 2 +-
>>>   12 files changed, 30 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>> index f46060eb91b5..5d620733f1

Re: [PATCH v3 02/13] mm: handling Non-LRU pages returned by vm_normal_pages

2022-05-27 Thread Alistair Popple


"Sierra Guiza, Alejandro (Alex)"  writes:

> On 5/24/2022 11:11 PM, Alistair Popple wrote:
>> Alex Sierra  writes:
>>
>>> With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
>>> device-managed anonymous pages that are not LRU pages. Although they
>>> behave like normal pages for purposes of mapping in CPU page, and for
>>> COW. They do not support LRU lists, NUMA migration or THP.
>>>
>>> We also introduced a FOLL_LRU flag that adds the same behaviour to
>>> follow_page and related APIs, to allow callers to specify that they
>>> expect to put pages on an LRU list.
>> Continuing the follow up from the thread for v2:
>>
>>>> This means by default GUP can return non-LRU pages. I didn't see
>>>> anywhere that would be a problem but I didn't check everything. Did you
>>>> check this or is there some other reason I've missed that makes this not
>>>> a problem?
>>> I have double checked all gup and pin_user_pages callers and none of them 
>>> seem
>>> to have interaction with LRU APIs.
>> And actually if I'm understanding things correctly callers of
>> GUP/PUP/follow_page_pte() should already expect to get non-LRU pages
>> returned:
>>
>>  page = vm_normal_page(vma, address, pte);
>>  if ((flags & FOLL_LRU) && page && is_zone_device_page(page))
>>  page = NULL;
>>  if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
>>  /*
>>   * Only return device mapping pages in the FOLL_GET or FOLL_PIN
>>   * case since they are only valid while holding the pgmap
>>   * reference.
>>   */
>>  *pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
>>  if (*pgmap)
>>  page = pte_page(pte);
>>
>> Which I think makes FOLL_LRU confusing, because if understand correctly
>> even with FOLL_LRU it is still possible for follow_page_pte() to return
>> a non-LRU page. Could we do something like this to make it consistent:
>>
>>  if ((flags & FOLL_LRU) && (page && is_zone_device_page(page) ||
>>  !page && pte_devmap(pte)))
>
> Hi Alistair,
> Not sure if this suggestion is a replacement for the first or the second
> condition in the snip code above. We know device coherent type will not
> be set with devmap. So we could do the following:

Sorry, I must not have been clear enough. My understanding is if the
following condition is true:

>>  if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {

Then follow_page_pte() could return a non-LRU page even when FOLL_LRU is
specified (because I think a devmap page is a non-LRU page). That seems
confusing, so for consistency I was suggesting we should not return
devmap pages for FOLL_LRU.

To be clear I don't think there is an actual problem here atm, but the
inconsistency could easily lead to one in future.

>  if ((flags & FOLL_LRU) && page && is_zone_device_page(page))
> - page = NULL;
> + goto no_page;
>
> Regards,
> Alex Sierra
>
>>
>> Looking at callers that currently use FOLL_LRU I don't think this would
>> change any behaviour as they already filter out devmap through various
>> other means.
>>
>>> Signed-off-by: Alex Sierra 
>>> Acked-by: Felix Kuehling 
>>> ---
>>>   fs/proc/task_mmu.c | 2 +-
>>>   include/linux/mm.h | 3 ++-
>>>   mm/gup.c   | 2 ++
>>>   mm/huge_memory.c   | 2 +-
>>>   mm/khugepaged.c| 9 ++---
>>>   mm/ksm.c   | 6 +++---
>>>   mm/madvise.c   | 4 ++--
>>>   mm/memory.c| 9 -
>>>   mm/mempolicy.c | 2 +-
>>>   mm/migrate.c   | 4 ++--
>>>   mm/mlock.c | 2 +-
>>>   mm/mprotect.c  | 2 +-
>>>   12 files changed, 30 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>> index f46060eb91b5..5d620733f173 100644
>>> --- a/fs/proc/task_mmu.c
>>> +++ b/fs/proc/task_mmu.c
>>> @@ -1785,7 +1785,7 @@ static struct page *can_gather_numa_stats(pte_t pte, 
>>> struct vm_area_struct *vma,
>>> return NULL;
>>>
>>> page = vm_normal_page(vma, addr, pte);
>>> -   if (!page)
>>> +   if (!page || is_zone_device_page(page))
>>> return NULL;
>>>
>>> if (PageReserved(page))
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> in

Re: [PATCH v3 02/13] mm: handling Non-LRU pages returned by vm_normal_pages

2022-05-25 Thread Alistair Popple


Alex Sierra  writes:

> With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
> device-managed anonymous pages that are not LRU pages. Although they
> behave like normal pages for purposes of mapping in CPU page, and for
> COW. They do not support LRU lists, NUMA migration or THP.
>
> We also introduced a FOLL_LRU flag that adds the same behaviour to
> follow_page and related APIs, to allow callers to specify that they
> expect to put pages on an LRU list.

Continuing the follow up from the thread for v2:

>> This means by default GUP can return non-LRU pages. I didn't see
>> anywhere that would be a problem but I didn't check everything. Did you
>> check this or is there some other reason I've missed that makes this not
>> a problem?

> I have double checked all gup and pin_user_pages callers and none of them seem
> to have interaction with LRU APIs.

And actually if I'm understanding things correctly callers of
GUP/PUP/follow_page_pte() should already expect to get non-LRU pages
returned:

page = vm_normal_page(vma, address, pte);
if ((flags & FOLL_LRU) && page && is_zone_device_page(page))
page = NULL;
if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
/*
 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
 * case since they are only valid while holding the pgmap
 * reference.
 */
*pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
if (*pgmap)
page = pte_page(pte);

Which I think makes FOLL_LRU confusing, because if understand correctly
even with FOLL_LRU it is still possible for follow_page_pte() to return
a non-LRU page. Could we do something like this to make it consistent:

if ((flags & FOLL_LRU) && (page && is_zone_device_page(page) ||
!page && pte_devmap(pte)))

Looking at callers that currently use FOLL_LRU I don't think this would
change any behaviour as they already filter out devmap through various
other means.

>
> Signed-off-by: Alex Sierra 
> Acked-by: Felix Kuehling 
> ---
>  fs/proc/task_mmu.c | 2 +-
>  include/linux/mm.h | 3 ++-
>  mm/gup.c   | 2 ++
>  mm/huge_memory.c   | 2 +-
>  mm/khugepaged.c| 9 ++---
>  mm/ksm.c   | 6 +++---
>  mm/madvise.c   | 4 ++--
>  mm/memory.c| 9 -
>  mm/mempolicy.c | 2 +-
>  mm/migrate.c   | 4 ++--
>  mm/mlock.c | 2 +-
>  mm/mprotect.c  | 2 +-
>  12 files changed, 30 insertions(+), 17 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index f46060eb91b5..5d620733f173 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1785,7 +1785,7 @@ static struct page *can_gather_numa_stats(pte_t pte, 
> struct vm_area_struct *vma,
>   return NULL;
>
>   page = vm_normal_page(vma, addr, pte);
> - if (!page)
> + if (!page || is_zone_device_page(page))
>   return NULL;
>
>   if (PageReserved(page))
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9f44254af8ce..d7f253a0c41e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -601,7 +601,7 @@ struct vm_operations_struct {
>  #endif
>   /*
>* Called by vm_normal_page() for special PTEs to find the
> -  * page for @addr.  This is useful if the default behavior
> +  * page for @addr. This is useful if the default behavior
>* (using pte_page()) would not find the correct page.
>*/
>   struct page *(*find_special_page)(struct vm_area_struct *vma,
> @@ -2929,6 +2929,7 @@ struct page *follow_page(struct vm_area_struct *vma, 
> unsigned long address,
>  #define FOLL_NUMA0x200   /* force NUMA hinting page fault */
>  #define FOLL_MIGRATION   0x400   /* wait for page to replace migration 
> entry */
>  #define FOLL_TRIED   0x800   /* a retry, previous pass started an IO */
> +#define FOLL_LRU0x1000  /* return only LRU (anon or page cache) */
>  #define FOLL_REMOTE  0x2000  /* we are working on non-current tsk/mm */
>  #define FOLL_COW 0x4000  /* internal GUP flag */
>  #define FOLL_ANON0x8000  /* don't do file mappings */
> diff --git a/mm/gup.c b/mm/gup.c
> index 501bc150792c..c9cbac06bcc5 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -479,6 +479,8 @@ static struct page *follow_page_pte(struct vm_area_struct 
> *vma,
>   }
>
>   page = vm_normal_page(vma, address, pte);
> + if ((flags & FOLL_LRU) && page && is_zone_device_page(page))
> + page = NULL;
>   if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
>   /*
>* Only return device mapping pages in the FOLL_GET or FOLL_PIN
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 910a138e9859..eed80696c5fd 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2856,7 +2856,7 @@ static int split_huge_pages_pid(int pid, unsigned long 
> vaddr_start,
>   }
>
>   /* FOLL_DUMP to ignore special (like zero) pages */
> -

Re: [PATCH v2 11/13] mm: handling Non-LRU pages returned by vm_normal_pages

2022-05-23 Thread Alistair Popple


Technically I think this patch should be earlier in the series. As I
understand it patch 1 allows DEVICE_COHERENT pages to be inserted in the
page tables and therefore makes it possible for page table walkers to
see non-LRU pages.

Some more comments below:

Alex Sierra  writes:

> With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
> device-managed anonymous pages that are not LRU pages. Although they
> behave like normal pages for purposes of mapping in CPU page, and for
> COW. They do not support LRU lists, NUMA migration or THP.
>
> We also introduced a FOLL_LRU flag that adds the same behaviour to
> follow_page and related APIs, to allow callers to specify that they
> expect to put pages on an LRU list.

This means by default GUP can return non-LRU pages. I didn't see
anywhere that would be a problem but I didn't check everything. Did you
check this or is there some other reason I've missed that makes this not
a problem?

[...]

> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index a4e5eaf3eb01..eb3cfd679800 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -627,7 +627,7 @@ static int __collapse_huge_page_isolate(struct 
> vm_area_struct *vma,
>   goto out;
>   }
>   page = vm_normal_page(vma, address, pteval);
> - if (unlikely(!page)) {
> + if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
>   result = SCAN_PAGE_NULL;
>   goto out;
>   }
> @@ -1276,7 +1276,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   writable = true;
>
>   page = vm_normal_page(vma, _address, pteval);
> - if (unlikely(!page)) {
> + if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
>   result = SCAN_PAGE_NULL;
>   goto out_unmap;
>   }
> @@ -1484,7 +1484,8 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, 
> unsigned long addr)
>   goto abort;
>
>   page = vm_normal_page(vma, addr, *pte);
> -
> + if (page && is_zone_device_page(page))
> + page = NULL;
>   /*
>* Note that uprobe, debugger, or MAP_PRIVATE may change the
>* page table, but the new page will not be a subpage of hpage.
> @@ -1502,6 +1503,8 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, 
> unsigned long addr)
>   if (pte_none(*pte))
>   continue;
>   page = vm_normal_page(vma, addr, *pte);
> + if (page && is_zone_device_page(page))
> + goto abort;

Are either of these two cases actually possible? DEVICE_COHERENT doesn't
currently support THP, so if I'm understanding correctly we couldn't
have a pte mapped DEVICE_COHERENT THP right? Assuming that's the case I
think WARN_ON_ONCE() would be better.
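
Something like the following is what I have in mind for the second hunk
(sketch only, the first hunk would be analogous):

    page = vm_normal_page(vma, addr, *pte);
    if (WARN_ON_ONCE(page && is_zone_device_page(page)))
        goto abort;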

Otherwise I think everything else looks reasonable.

>   page_remove_rmap(page, vma, false);
>   }
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 063a48eeb5ee..f16056efca21 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -474,7 +474,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned 
> long addr)
>   do {
>   cond_resched();
>   page = follow_page(vma, addr,
> - FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
> + FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE | 
> FOLL_LRU);
>   if (IS_ERR_OR_NULL(page))
>   break;
>   if (PageKsm(page))
> @@ -559,7 +559,7 @@ static struct page *get_mergeable_page(struct rmap_item 
> *rmap_item)
>   if (!vma)
>   goto out;
>
> - page = follow_page(vma, addr, FOLL_GET);
> + page = follow_page(vma, addr, FOLL_GET | FOLL_LRU);
>   if (IS_ERR_OR_NULL(page))
>   goto out;
>   if (PageAnon(page)) {
> @@ -2288,7 +2288,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct 
> page **page)
>   while (ksm_scan.address < vma->vm_end) {
>   if (ksm_test_exit(mm))
>   break;
> - *page = follow_page(vma, ksm_scan.address, FOLL_GET);
> + *page = follow_page(vma, ksm_scan.address, FOLL_GET | 
> FOLL_LRU);
>   if (IS_ERR_OR_NULL(*page)) {
>   ksm_scan.address += PAGE_SIZE;
>   cond_resched();
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 1873616a37d2..e9c24c834e98 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -413,7 +413,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>   continue;
>
>   page = vm_normal_page(vma, addr, ptent);
> - if (!page)
> + if (!page || is_zone_device_page(page))
>   continue;
>
>   /*
> @@ 

Re: [PATCH v1 14/15] tools: add hmm gup tests for device coherent type

2022-05-16 Thread Alistair Popple


Alex Sierra  writes:

> The intention is to test hmm device coherent type under different get
> user pages paths. Also, test gup with FOLL_LONGTERM flag set in
> device coherent pages. These pages should get migrated back to system
> memory.
>
> Signed-off-by: Alex Sierra 
> ---
>  tools/testing/selftests/vm/hmm-tests.c | 104 +
>  1 file changed, 104 insertions(+)
>
> diff --git a/tools/testing/selftests/vm/hmm-tests.c 
> b/tools/testing/selftests/vm/hmm-tests.c
> index 84ec8c4a1dc7..65e30ab6494c 100644
> --- a/tools/testing/selftests/vm/hmm-tests.c
> +++ b/tools/testing/selftests/vm/hmm-tests.c
> @@ -36,6 +36,7 @@
>   * in the usual include/uapi/... directory.
>   */
>  #include "../../../../lib/test_hmm_uapi.h"
> +#include "../../../../mm/gup_test.h"
>
>  struct hmm_buffer {
>   void*ptr;
> @@ -60,6 +61,8 @@ enum {
>  #define NTIMES   10
>
>  #define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1)))
> +/* Just the flags we need, copied from mm.h: */
> +#define FOLL_WRITE   0x01/* check pte is writable */
>
>  FIXTURE(hmm)
>  {
> @@ -1766,4 +1769,105 @@ TEST_F(hmm, exclusive_cow)
>   hmm_buffer_free(buffer);
>  }
>
> +static int gup_test_exec(int gup_fd, unsigned long addr,
> +  int cmd, int npages, int size)
> +{
> + struct gup_test gup = {
> + .nr_pages_per_call  = npages,
> + .addr   = addr,
> + .gup_flags  = FOLL_WRITE,
> + .size   = size,
> + };
> +
> + if (ioctl(gup_fd, cmd, &gup)) {
> + perror("ioctl on error\n");
> + return errno;
> + }
> +
> + return 0;
> +}
> +
> +/*
> + * Test get user device pages through gup_test. Setting PIN_LONGTERM flag.
> + * This should trigger a migration back to system memory for both, private
> + * and coherent type pages.
> + * This test makes use of gup_test module. Make sure GUP_TEST_CONFIG is added
> + * to your configuration before you run it.
> + */
> +TEST_F(hmm, hmm_gup_test)
> +{
> + struct hmm_buffer *buffer;
> + int gup_fd;
> + unsigned long npages;
> + unsigned long size;
> + unsigned long i;
> + int *ptr;
> + int ret;
> + unsigned char *m;
> +
> + gup_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
> + if (gup_fd == -1)
> + SKIP(return, "Skipping test, could not find gup_test driver");
> +
> + npages = 3;
> + size = npages << self->page_shift;
> +
> + buffer = malloc(sizeof(*buffer));
> + ASSERT_NE(buffer, NULL);
> +
> + buffer->fd = -1;
> + buffer->size = size;
> + buffer->mirror = malloc(size);
> + ASSERT_NE(buffer->mirror, NULL);
> +
> + buffer->ptr = mmap(NULL, size,
> +PROT_READ | PROT_WRITE,
> +MAP_PRIVATE | MAP_ANONYMOUS,
> +buffer->fd, 0);
> + ASSERT_NE(buffer->ptr, MAP_FAILED);
> +
> + /* Initialize buffer in system memory. */
> + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> + ptr[i] = i;
> +
> + /* Migrate memory to device. */
> + ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
> + ASSERT_EQ(ret, 0);
> + ASSERT_EQ(buffer->cpages, npages);
> + /* Check what the device read. */
> + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
> + ASSERT_EQ(ptr[i], i);
> +
> + ASSERT_EQ(gup_test_exec(gup_fd,
> + (unsigned long)buffer->ptr,
> + GUP_BASIC_TEST, 1, self->page_size), 0);
> + ASSERT_EQ(gup_test_exec(gup_fd,
> + (unsigned long)buffer->ptr + 1 * 
> self->page_size,
> + GUP_FAST_BENCHMARK, 1, self->page_size), 0);
> + ASSERT_EQ(gup_test_exec(gup_fd,
> + (unsigned long)buffer->ptr + 2 * 
> self->page_size,
> + PIN_LONGTERM_BENCHMARK, 1, self->page_size), 0);
> +
> + /* Take snapshot to CPU pagetables */
> + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_SNAPSHOT, buffer, npages);
> + ASSERT_EQ(ret, 0);
> + ASSERT_EQ(buffer->cpages, npages);
> + m = buffer->mirror;
> + if (hmm_is_coherent_type(variant->device_number)) {
> + ASSERT_EQ(HMM_DMIRROR_PROT_DEV_COHERENT_LOCAL | 
> HMM_DMIRROR_PROT_WRITE, m[0]);
> + ASSERT_EQ(HMM_DMIRROR_PROT_DEV_COHERENT_LOCA

Re: [PATCH v1 01/15] mm: add zone device coherent type memory support

2022-05-11 Thread Alistair Popple


Alex Sierra  writes:

[...]

> diff --git a/mm/rmap.c b/mm/rmap.c
> index fedb82371efe..d57102cd4b43 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1995,7 +1995,8 @@ void try_to_migrate(struct folio *folio, enum ttu_flags 
> flags)
>   TTU_SYNC)))
>   return;
>
> - if (folio_is_zone_device(folio) && !folio_is_device_private(folio))
> + if (folio_is_zone_device(folio) &&
> + (!folio_is_device_private(folio) && 
> !folio_is_device_coherent(folio)))
>   return;
>
>   /*

I vaguely recall commenting on this previously, or at least intending
to. In try_to_migrate_one() we have this:

if (folio_is_zone_device(folio)) {
unsigned long pfn = folio_pfn(folio);
swp_entry_t entry;
pte_t swp_pte;

/*
 * Store the pfn of the page in a special migration
 * pte. do_swap_page() will wait until the migration
 * pte is removed and then restart fault handling.
 */
entry = pte_to_swp_entry(pteval);
if (is_writable_device_private_entry(entry))
entry = make_writable_migration_entry(pfn);
else
entry = make_readable_migration_entry(pfn);
swp_pte = swp_entry_to_pte(entry);

The check in try_to_migrate() guarantees that if folio_is_zone_device()
is true this must be a DEVICE_PRIVATE page and it treats it as such by
assuming there is a special device private swap entry there.

Relying on that assumption seems bad, and I have no idea why I didn't
just use is_device_private_page() originally but I think the fix is just
to change this to:

if (folio_is_device_private(folio))

And let DEVICE_COHERENT pages fall through to normal page processing.
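
In other words something like the below (untested sketch only) on top of this
patch, so that only DEVICE_PRIVATE folios take the special swap entry path:

-		if (folio_is_zone_device(folio)) {
+		if (folio_is_device_private(folio)) {
 			unsigned long pfn = folio_pfn(folio);
 			swp_entry_t entry;
 			pte_t swp_pte;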

 - Alistair


Re: [PATCH v1 04/15] mm: add device coherent checker to remove migration pte

2022-05-11 Thread Alistair Popple


"Sierra Guiza, Alejandro (Alex)"  writes:

> @apop...@nvidia.com Could you please check this patch? It's somehow related 
> to migrate_device_page() for long term device coherent pages.
>
> Regards,
> Alex Sierra
>> -Original Message-
>> From: amd-gfx  On Behalf Of Alex
>> Sierra
>> Sent: Thursday, May 5, 2022 4:34 PM
>> To: j...@nvidia.com
>> Cc: rcampb...@nvidia.com; wi...@infradead.org; da...@redhat.com;
>> Kuehling, Felix ; apop...@nvidia.com; amd-
>> g...@lists.freedesktop.org; linux-...@vger.kernel.org; linux...@kvack.org;
>> jgli...@redhat.com; dri-de...@lists.freedesktop.org; akpm@linux-
>> foundation.org; linux-e...@vger.kernel.org; h...@lst.de
>> Subject: [PATCH v1 04/15] mm: add device coherent checker to remove
>> migration pte
>>
>> During remove_migration_pte(), entries for device coherent type pages that
>> were not created through special migration ptes, ignore _PAGE_RW flag. This
>> path can be found at migrate_device_page(), where valid vma is not
>> required. In this case, migrate_vma_collect_pmd() is not called and special
>> migration ptes are not set.

It's true that we don't call migrate_vma_collect_pmd() for
migrate_device_page(), but this doesn't imply migration entries are not
created. We still call migrate_vma_unmap() which calls try_to_migrate()
to install migration entries.

When we have a vma, migrate_vma_collect_pmd() is a fast path for the
common case where a page is only mapped once. So migrate_vma_collect_pmd()
should fairly closely match try_to_migrate_one(). I did experiment
locally with removing the fast path to simplify the code, but it does
provide a meaningful performance improvement so I abandoned it.

I think you're running into the problem addressed by
https://lkml.kernel.org/r/20211018045247.3128058-1-apop...@nvidia.com
but for DEVICE_COHERENT pages.

Based on that I think the approach below is wrong. You should update
try_to_migrate_one() to deal with DEVICE_COHERENT pages. It would make
sense to do that as part of patch 1 in this series.

The problem is that try_to_migrate_one() assumes folio_is_zone_device()
implies it is a DEVICE_PRIVATE page due to the check in
try_to_migrate().

>> Signed-off-by: Alex Sierra 
>> ---
>>  mm/migrate.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c index
>> 6c31ee1e1c9b..e18ddee56f37 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -206,7 +206,8 @@ static bool remove_migration_pte(struct folio *folio,
>>   * Recheck VMA as permissions can change since migration
>> started
>>   */
>>  entry = pte_to_swp_entry(*pvmw.pte);
>> -if (is_writable_migration_entry(entry))
>> +if (is_writable_migration_entry(entry) ||
>> +is_device_coherent_page(pfn_to_page(pvmw.pfn)))
>>  pte = maybe_mkwrite(pte, vma);
>>  else if (pte_swp_uffd_wp(*pvmw.pte))
>>  pte = pte_mkuffd_wp(pte);
>> --
>> 2.32.0


Re: [PATCH v1 04/15] mm: add device coherent checker to remove migration pte

2022-05-06 Thread Alistair Popple
"Sierra Guiza, Alejandro (Alex)"  writes:

> @apop...@nvidia.com Could you please check this patch? It's somehow related to
> migrate_device_page() for long term device coherent pages.

Sure thing. This whole series is in my queue of things to review once I make it 
home from LSF/MM.

- Alistair

> Regards,
> Alex Sierra
>> -Original Message-
>> From: amd-gfx  On Behalf Of Alex
>> Sierra
>> Sent: Thursday, May 5, 2022 4:34 PM
>> To: j...@nvidia.com
>> Cc: rcampb...@nvidia.com; wi...@infradead.org; da...@redhat.com;
>> Kuehling, Felix ; apop...@nvidia.com; amd-
>> g...@lists.freedesktop.org; linux-...@vger.kernel.org; linux...@kvack.org;
>> jgli...@redhat.com; dri-de...@lists.freedesktop.org; akpm@linux-
>> foundation.org; linux-e...@vger.kernel.org; h...@lst.de
>> Subject: [PATCH v1 04/15] mm: add device coherent checker to remove
>> migration pte
>>
>> During remove_migration_pte(), entries for device coherent type pages that
>> were not created through special migration ptes, ignore _PAGE_RW flag. This
>> path can be found at migrate_device_page(), where valid vma is not
>> required. In this case, migrate_vma_collect_pmd() is not called and special
>> migration ptes are not set.
>>
>> Signed-off-by: Alex Sierra 
>> ---
>>  mm/migrate.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c index
>> 6c31ee1e1c9b..e18ddee56f37 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -206,7 +206,8 @@ static bool remove_migration_pte(struct folio *folio,
>>   * Recheck VMA as permissions can change since migration
>> started
>>   */
>>  entry = pte_to_swp_entry(*pvmw.pte);
>> -if (is_writable_migration_entry(entry))
>> +if (is_writable_migration_entry(entry) ||
>> +is_device_coherent_page(pfn_to_page(pvmw.pfn)))
>>  pte = maybe_mkwrite(pte, vma);
>>  else if (pte_swp_uffd_wp(*pvmw.pte))
>>  pte = pte_mkuffd_wp(pte);
>> --
>> 2.32.0


Re: [PATCH v1 1/3] mm: split vm_normal_pages for LRU and non-LRU handling

2022-03-16 Thread Alistair Popple
Felix Kuehling  writes:

> On 2022-03-11 04:16, David Hildenbrand wrote:
>> On 10.03.22 18:26, Alex Sierra wrote:
>>> DEVICE_COHERENT pages introduce a subtle distinction in the way
>>> "normal" pages can be used by various callers throughout the kernel.
>>> They behave like normal pages for purposes of mapping in CPU page
>>> tables, and for COW. But they do not support LRU lists, NUMA
>>> migration or THP. Therefore we split vm_normal_page into two
>>> functions vm_normal_any_page and vm_normal_lru_page. The latter will
>>> only return pages that can be put on an LRU list and that support
>>> NUMA migration, KSM and THP.
>>>
>>> We also introduced a FOLL_LRU flag that adds the same behaviour to
>>> follow_page and related APIs, to allow callers to specify that they
>>> expect to put pages on an LRU list.
>>>
>> I still don't see the need for s/vm_normal_page/vm_normal_any_page/. And
>> as this patch is dominated by that change, I'd suggest (again) to just
>> drop it as I don't see any value of that renaming. No specifier implies any.
>
> OK. If nobody objects, we can adopts that naming convention.

I'd prefer we avoid the churn too, but I don't think we should make
vm_normal_page() the equivalent of vm_normal_any_page(). It would mean
vm_normal_page() would return non-LRU device coherent pages, but to me at least
device coherent pages seem special and not what I'd expect from a function with
"normal" in the name.

So I think it would be better to s/vm_normal_lru_page/vm_normal_page/ and keep
vm_normal_any_page() (or perhaps call it vm_any_page?). This is basically what
the previous incarnation of this feature did:

struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte, bool with_public_device);
#define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)

Except we should add:

#define vm_normal_any_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, true)

>> The general idea of this change LGTM.
>>
>>
>> I wonder how this interacts with the actual DEVICE_COHERENT coherent
>> series. Is this a preparation? Should it be part of the DEVICE_COHERENT
>> series?
>
> Yes, it should be part of that series. Alex developed it on top of the series
> for now. But I think eventually it would need to be spliced into it.

Agreed, this needs to go at the start of the DEVICE_COHERENT series.

Thanks.

Alistair

> Patch1 would need to go somewhere before the other DEVICE_COHERENT patches 
> (with
> minor modifications). Patch 2 could be squashed into "tools: add hmm gup test
> for long term pinned device pages" or go next to it. Patch 3 doesn't have a
> direct dependency on device-coherent pages. It only mentions them in comments.
>
>
>>
>> IOW, should this patch start with
>>
>> "With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
>> device-managed anonymous pages that are not LRU pages. Although they
>> behave like normal pages for purposes of mapping in CPU page, and for
>> COW, they do not support LRU lists, NUMA migration or THP. [...]"
>
> Yes, that makes sense.
>
> Regards,
>   Felix
>
>
>>
>> But then, I'm confused by patch 2 and 3, because it feels more like we'd
>> already have DEVICE_COHERENT then ("hmm_is_coherent_type").
>>
>>


Re: [PATCH v1 1/3] mm: split vm_normal_pages for LRU and non-LRU handling

2022-03-16 Thread Alistair Popple
Felix Kuehling  writes:

> Am 2022-03-10 um 14:25 schrieb Matthew Wilcox:
>> On Thu, Mar 10, 2022 at 11:26:31AM -0600, Alex Sierra wrote:
>>> @@ -606,7 +606,7 @@ static void print_bad_pte(struct vm_area_struct *vma, 
>>> unsigned long addr,
>>>* PFNMAP mappings in order to support COWable mappings.
>>>*
>>>*/
>>> -struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>>> +struct page *vm_normal_any_page(struct vm_area_struct *vma, unsigned long 
>>> addr,
>>> pte_t pte)
>>>   {
>>> unsigned long pfn = pte_pfn(pte);
>>> @@ -620,8 +620,6 @@ struct page *vm_normal_page(struct vm_area_struct *vma, 
>>> unsigned long addr,
>>> return NULL;
>>> if (is_zero_pfn(pfn))
>>> return NULL;
>>> -   if (pte_devmap(pte))
>>> -   return NULL;
>>> print_bad_pte(vma, addr, pte, NULL);
>>> return NULL;
>> ... what?
>>
>> Haven't you just made it so that a devmap page always prints a bad PTE
>> message, and then returns NULL anyway?
>
> Yeah, that was stupid. :/  I think the long-term goal was to get rid of
> pte_devmap. But for now, as long as we have pte_special with pte_devmap,
> we'll need a special case to handle that like a normal page.
>
> I only see the PFN_DEV|PFN_MAP flags set in a few places: 
> drivers/dax/device.c,
> drivers/nvdimm/pmem.c, fs/fuse/virtio_fs.c. I guess we need to test at least 
> one
> of them for this patch series to make sure we're not breaking them.
>
>
>>
>> Surely this should be:
>>
>>  if (pte_devmap(pte))
>> -return NULL;
>> +return pfn_to_page(pfn);
>>
>> or maybe
>>
>> +goto check_pfn;
>>
>> But I don't know about that highest_memmap_pfn check.
>
> Looks to me like it should work. highest_memmap_pfn gets updated in
> memremap_pages -> pagemap_range -> move_pfn_range_to_zone ->
> memmap_init_range.

FWIW the previous version of this feature which was removed in 25b2995a35b6
("mm: remove MEMORY_DEVICE_PUBLIC support") had a similar comparison with
highest_memmap_pfn:

if (likely(pfn <= highest_memmap_pfn)) {
struct page *page = pfn_to_page(pfn);

if (is_device_public_page(page)) {
if (with_public_device)
return page;
return NULL;
}
}
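
Presumably the equivalent in the pte_special() path now would be something
like this (rough sketch, untested) instead of unconditionally returning NULL
for devmap ptes:

	if (pte_devmap(pte)) {
		if (likely(pfn <= highest_memmap_pfn))
			return pfn_to_page(pfn);
		return NULL;
	}

Or just Matthew's "goto check_pfn", which amounts to the same thing plus the
print_bad_pte() warning for out of range pfns.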

> Regards,
>   Felix
>
>
>>
>>> @@ -661,6 +659,22 @@ struct page *vm_normal_page(struct vm_area_struct 
>>> *vma, unsigned long addr,
>>> return pfn_to_page(pfn);
>>>   }
>>>   +/*
>>> + * vm_normal_lru_page -- This function gets the "struct page" associated
>>> + * with a pte only for page cache and anon page. These pages are LRU 
>>> handled.
>>> + */
>>> +struct page *vm_normal_lru_page(struct vm_area_struct *vma, unsigned long 
>>> addr,
>>> +   pte_t pte)
>> It seems a shame to add a new function without proper kernel-doc.
>>


Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

2022-02-18 Thread Alistair Popple
Felix Kuehling  writes:

> Am 2022-02-16 um 07:26 schrieb Jason Gunthorpe:
>> The other place that needs careful audit is all the callers using
>> vm_normal_page() - they must all be able to accept a ZONE_DEVICE page
>> if we don't set pte_devmap.
>
> How much code are we talking about here? A quick search finds 26 call-sites in
> 12 files in current master:
>
>fs/proc/task_mmu.c
>mm/hmm.c
>mm/gup.c
>mm/huge_memory.c (vm_normal_page_pmd)
>mm/khugepaged.c
>mm/madvise.c
>mm/mempolicy.c
>mm/memory.c
>mm/mlock.c
>mm/migrate.c
>mm/mprotect.c
>mm/memcontrol.c
>
> I'm thinking of a more theoretical approach: Instead of auditing all users, 
> I'd
> ask, what are the invariants that a vm_normal_page should have. Then check,
> whether our DEVICE_COHERENT pages satisfy them. But maybe the concept of a
> vm_normal_page isn't defined clearly enough for that.
>
> That said, I think we (Alex and myself) made an implicit assumption from the
> start, that a DEVICE_COHERENT page should behave a lot like a normal page in
> terms of VMA mappings, even if we didn't know what that means in detail.

Yes I'm afraid I made a similar mistake when reviewing this, forgetting that
DEVICE_COHERENT pages are not LRU pages and therefore need special treatment in
some places. So for now I will have to withdraw my reviewed-by until this has
been looked at more closely, because as you note below accidentally treating
them as LRU pages leads to a bad time.

> I can now at least name some differences between DEVICE_COHERENT and normal
> pages: how the memory is allocated, how data is migrated into DEVICE_COHERENT
> pages and that it can't be on any LRU list (because the lru list_head in 
> struct
> page is aliased by pgmap and zone_device_data). Maybe I'll find more 
> differences
> if I keep digging.
>
> Regards,
>   Felix
>
>
>>
>> Jason


Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

2022-02-16 Thread Alistair Popple
Jason Gunthorpe  writes:

> On Wed, Feb 16, 2022 at 09:31:03AM +0100, David Hildenbrand wrote:
>> On 16.02.22 03:36, Alistair Popple wrote:
>> > On Wednesday, 16 February 2022 1:03:57 PM AEDT Jason Gunthorpe wrote:
>> >> On Wed, Feb 16, 2022 at 12:23:44PM +1100, Alistair Popple wrote:
>> >>
>> >>> Device private and device coherent pages are not marked with pte_devmap 
>> >>> and they
>> >>> are backed by a struct page. The only way of inserting them is via 
>> >>> migrate_vma.
>> >>> The refcount is decremented in zap_pte_range() on munmap() with special 
>> >>> handling
>> >>> for device private pages. Looking at it again though I wonder if there 
>> >>> is any
>> >>> special treatment required in zap_pte_range() for device coherent pages 
>> >>> given
>> >>> they count as present pages.
>> >>
>> >> This is what I guessed, but we shouldn't be able to just drop
>> >> pte_devmap on these pages without any other work?? Granted it does
>> >> very little already..
>> >
>> > Yes, I agree we need to check this more closely. For device private pages
>> > not having pte_devmap is fine, because they are non-present swap entries so
>> > they always get special handling in the swap entry paths but the same isn't
>> > true for coherent device pages.
>>
>> I'm curious, how does the refcount of a PageAnon() DEVICE_COHERENT page
>> look like when mapped? I'd assume it's also (currently) still offset by
>> one, meaning, if it's mapped into a single page table it's always at
>> least 2.
>
> Christoph fixed this offset by one and updated the DEVICE_COHERENT
> patchset, I hope we will see that version merged.
>
>> >> I thought at least gup_fast needed to be touched or did this get
>> >> handled by scanning the page list after the fact?
>> >
>> > Right, for gup I think the only special handling required is to prevent
>> > pinning. I had assumed that check_and_migrate_movable_pages() would still 
>> > get
>> > called for gup_fast but unless I've missed something I don't think it does.
>> > That means gup_fast could still pin movable and coherent pages. Technically
>> > that is ok for coherent pages, but it's undesirable.
>>
>> We really should have the same pinning rules for GUP vs. GUP-fast.
>> is_pinnable_page() should be the right place for such checks (similarly
>> as indicated in my reply to the migration series).
>
> Yes, I think this is a bug too.

Agreed, I will add a fix for it to my series as I was surprised the rules for
PUP-fast were different. I can see how this happened though -
check_and_migrate_cma_pages() (the precursor to
check_and_migrate_movable_pages()) was added before PUP-fast and FOLL_LONGTERM
so I guess we just never added this check there.
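
Most likely that just means teaching is_pinnable_page() about device coherent
pages so GUP and GUP-fast end up with the same rules. Something along these
lines (sketch only, existing checks elided):

static inline bool is_pinnable_page(struct page *page)
{
	/*
	 * Device coherent pages can only be moved by their driver, so never
	 * treat them as pinnable. The FOLL_LONGTERM fast path then bails out
	 * and the slow path can migrate them before taking the pin.
	 */
	if (is_device_coherent_page(page))
		return false;

	/* ... existing ZONE_MOVABLE / CMA / zero-pfn checks ... */
}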

- Alistair

> The other place that needs careful audit is all the callers using
> vm_normal_page() - they must all be able to accept a ZONE_DEVICE page
> if we don't set pte_devmap.
>
> Jason


Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

2022-02-16 Thread Alistair Popple
Jason Gunthorpe  writes:

> On Tue, Feb 15, 2022 at 04:35:56PM -0500, Felix Kuehling wrote:
>>
>> On 2022-02-15 14:41, Jason Gunthorpe wrote:
>> > On Tue, Feb 15, 2022 at 07:32:09PM +0100, Christoph Hellwig wrote:
>> > > On Tue, Feb 15, 2022 at 10:45:24AM -0400, Jason Gunthorpe wrote:
>> > > > > Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My
>> > > > > assumption was that they would be part of a special mapping.
>> > > > We need to stop using the special PTEs and VMAs for things that have a
>> > > > struct page. This is a mistake DAX created that must be undone.
>> > > Yes, we'll get to it.  Maybe we can do it for the non-DAX devmap
>> > > ptes first given that DAX is more complicated.
>> > Probably, I think we can check the page->pgmap type to tell the
>> > difference.
>> >
>> > I'm not sure how the DEVICE_GENERIC can work without this, as DAX was
>> > made safe by using the unmap_mapping_range(), which won't work
>> > here. Is there some other trick being used to keep track of references
>> > inside the AMD driver?
>>
>> Not sure I'm following all the discussion about VMAs and DAX. So I may be
>> answering the wrong question: We treat each ZONE_DEVICE page as a reference
>> to the BO (buffer object) that backs the page. We increment the BO refcount
>> for each page we migrate into it. In the dev_pagemap_ops.page_free callback
>> we drop that reference. Once all pages backed by a BO are freed, the BO
>> refcount reaches 0 [*] and we can free the BO allocation.
>
> Userspace does
>  1) mmap(MAP_PRIVATE) to allocate anon memory
>  2) something to trigger migration to install a ZONE_DEVICE page
>  3) munmap()
>
> Who decrements the refcout on the munmap?
>
> When a ZONE_DEVICE page is installed in the PTE is supposed to be
> marked as pte_devmap and that disables all the normal page refcounting
> during munmap().

Device private and device coherent pages are not marked with pte_devmap and they
are backed by a struct page. The only way of inserting them is via migrate_vma.
The refcount is decremented in zap_pte_range() on munmap() with special handling
for device private pages. Looking at it again though I wonder if there is any
special treatment required in zap_pte_range() for device coherent pages given
they count as present pages.

> fsdax makes this work by working the refcounts backwards, the page is
> refcounted while it exists in the driver, when the driver decides to
> remove it then unmap_mapping_range() is called to purge it from all
> PTEs and then refcount is decrd. munmap/fork/etc don't change the
> refcount.

The equivalent here is for drivers to use migrate_vma to migrate the pages back
from device memory to CPU memory. In this case the refcounting is (mostly)
handled by migration code which decrements the refcount on the original source
device page during the migration.

- Alistair

> Jason


Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

2022-02-16 Thread Alistair Popple
On Wednesday, 16 February 2022 1:03:57 PM AEDT Jason Gunthorpe wrote:
> On Wed, Feb 16, 2022 at 12:23:44PM +1100, Alistair Popple wrote:
> 
> > Device private and device coherent pages are not marked with pte_devmap and 
> > they
> > are backed by a struct page. The only way of inserting them is via 
> > migrate_vma.
> > The refcount is decremented in zap_pte_range() on munmap() with special 
> > handling
> > for device private pages. Looking at it again though I wonder if there is 
> > any
> > special treatment required in zap_pte_range() for device coherent pages 
> > given
> > they count as present pages.
> 
> This is what I guessed, but we shouldn't be able to just drop
> pte_devmap on these pages without any other work?? Granted it does
> very little already..

Yes, I agree we need to check this more closely. For device private pages
not having pte_devmap is fine, because they are non-present swap entries so
they always get special handling in the swap entry paths but the same isn't
true for coherent device pages.

> I thought at least gup_fast needed to be touched or did this get
> handled by scanning the page list after the fact?

Right, for gup I think the only special handling required is to prevent
pinning. I had assumed that check_and_migrate_movable_pages() would still get
called for gup_fast but unless I've missed something I don't think it does.
That means gup_fast could still pin movable and coherent pages. Technically
that is ok for coherent pages, but it's undesirable.

 - Alistair

> Jason
> 






Re: [PATCH v2 2/3] mm/gup.c: Migrate device coherent pages when pinning instead of failing

2022-02-14 Thread Alistair Popple
John Hubbard  writes:

> On 2/11/22 18:51, Alistair Popple wrote:

[…]

>>> See below…
>>>
>>>> +  }
>>>> +
>>>> +  pages[i] = migrate_device_page(head, gup_flags);
>> migrate_device_page() will return a new page that has been correctly pinned
>> with gup_flags by try_grab_page(). Therefore this page can still be released
>> with unpin_user_page() or put_page() as appropriate for the given gup_flags.
>> The reference we had on the source page (head) always gets dropped in
>> migrate_vma_finalize().
>
> OK. Good.
>
> The above would be good to have in a comment, right around here, imho.
> Because we have this marvelous mix of references for migration (get_page())
> and other, and it’s a bit hard to see that it’s all correct without a
> hint or two.

Ok, will do.
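
Probably something like this above the call site in
check_and_migrate_movable_pages():

			/*
			 * migrate_device_page() returns a new page that
			 * already holds the reference or pin requested by
			 * gup_flags (taken via try_grab_page()), so the
			 * caller can release it with unpin_user_page() or
			 * put_page() as usual. The reference we held on the
			 * source page is dropped by migrate_vma_finalize().
			 */
			pages[i] = migrate_device_page(head, gup_flags);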

>
> …
>> Which unless I’ve missed something is still the correct thing to do.
>>
>>> This reminds me: out of the many things to monitor, the FOLL_PIN counts
>>> in /proc/vmstat are especially helpful, whenever making changes to code
>>> that deals with this:
>>>
>>> nr_foll_pin_acquired
>>> nr_foll_pin_released
>>>
>>> …and those should normally be equal to each other when “at rest”.
>>>
>
> I hope this is/was run, just to be sure?

Thanks for the suggestion, these remain equal to each other after running
hmm-tests which confirms everything is working as expected.

>
> thanks,


Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

2022-02-14 Thread Alistair Popple
Felix Kuehling  writes:

> Am 2022-02-11 um 11:15 schrieb David Hildenbrand:
>> On 01.02.22 16:48, Alex Sierra wrote:
>>> Device memory that is cache coherent from device and CPU point of view.
>>> This is used on platforms that have an advanced system bus (like CAPI
>>> or CXL). Any page of a process can be migrated to such memory. However,
>>> no one should be allowed to pin such memory so that it can always be
>>> evicted.
>>>
>>> Signed-off-by: Alex Sierra 
>>> Acked-by: Felix Kuehling 
>>> Reviewed-by: Alistair Popple 
>> So, I’m currently messing with PageAnon() pages and CoW semantics …
>> all these PageAnon() ZONE_DEVICE variants don’t necessarily make my life
>> easier but I’m not sure yet if they make my life harder. I hope you can
>> help me understand some of that stuff.
>>
>> 1) What are expected CoW semantics for DEVICE_COHERENT?
>>
>> I assume we’ll share them just like other PageAnon() pages during fork()
>> readable, and the first sharer writing to them receives an “ordinary”
>> !ZONE_DEVICE copy.
>
> Yes.
>
>
>>
>> So this would be just like DEVICE_EXCLUSIVE CoW handling I assume, just
>> that we don’t have to go through the loop of restoring a device
>> exclusive entry?
>
> I’m not sure how DEVICE_EXCLUSIVE pages are handled under CoW. As I understand
> it, they’re not really in a special memory zone like DEVICE_COHERENT. Just a
> special way of mapping an ordinary page in order to allow device-exclusive
> access for some time. I suspect there may even be a possibility that a page 
> can
> be both DEVICE_EXCLUSIVE and DEVICE_COHERENT.

Right - there aren’t really device exclusive pages, they are just special
non-present ptes conceptually pretty similar to migration entries. The
difference is that on CPU fault (or fork) the original entry is restored
immediately after notifying the device that it no longer has exclusive access.

As device exclusive entries can be turned into normal entries whenever required
we handle CoW by restoring the original ptes if a device exclusive entry is
encountered. This reduces the chances of introducing any subtle CoW bugs as it
just gets handled the same as any normal page table entry (because the exclusive
entries will have been removed).

> That said, your statement sounds correct. There is no requirement to do 
> anything
> with the new “ordinary” page after copying. What actually happens to
> DEVICE_COHERENT pages on CoW is a bit convoluted:
>
> When the page is marked as CoW, it is marked R/O in the CPU page table. This
> causes an MMU notifier that invalidates the device PTE. The next device access
> in the parent process causes a page fault. If that’s a write fault (usually is
> in our current driver), it will trigger CoW, which means the parent process 
> now
> gets a new system memory copy of the page, while the child process keeps the
> DEVICE_COHERENT page. The driver could decide to migrate the page back to a 
> new
> DEVICE_COHERENT allocation.
>
> In practice that means, “fork” basically causes all DEVICE_COHERENT memory in
> the parent process to be migrated to ordinary system memory, which is quite
> disruptive. What we have today results in correct behaviour, but the 
> performance
> is far from ideal.
>
> We could probably mitigate it by making the driver better at mapping pages R/O
> in the device on read faults, at the potential cost of having to handle a 
> second
> (write) fault later.
>
>
>>
>> 2) How are these pages freed to clear/invalidate PageAnon() ?
>>
>> I assume for PageAnon() ZONE_DEVICE pages we’ll always for via
>> free_devmap_managed_page(), correct?
>
> Yes. The driver depends on the the page->pgmap->ops->page_free callback to 
> free
> the device memory allocation backing the page.
>
>
>>
>>
>> 3) FOLL_PIN
>>
>> While you write “no one should be allowed to pin such memory”, patch #2
>> only blocks FOLL_LONGTERM. So I assume we allow ordinary FOLL_PIN and
>> you might want to be a bit more precise?
>
> I agree. I think the paragraph was written before we fully fleshed out the
> interaction with GUP, and the forgotten.
>
>
>>
>>
>> … I’m pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages,
>
> Right. Trying to GUP a DEVICE_PRIVATE page causes a page fault that migrates 
> the
> page back to normal system memory (using the page->pgmap->ops->migrate_to_ram
> callback). Then you pin the system memory page.
>
>
>>   but can we
>> FILL_PIN DEVICE_EXCLUSIVE pages? I strongly assume so?

In the case of device exclusive entries GUP/PUP will fault and restore the
original entry. It will then pin the original normal page pointed to by the
device exclusive entry.

- Alistair

>
> I assume you mean DEVICE_COHERENT, not DEVICE_EXCLUSIVE? In that case the 
> answer
> is “Yes”.
>
> Regards,
>   Felix
>
>
>>
>>
>> Thanks for any information.
>>


Re: [PATCH v2 2/3] mm/gup.c: Migrate device coherent pages when pinning instead of failing

2022-02-14 Thread Alistair Popple
On Saturday, 12 February 2022 1:10:29 PM AEDT John Hubbard wrote:
> On 2/6/22 20:26, Alistair Popple wrote:
> > Currently any attempts to pin a device coherent page will fail. This is
> > because device coherent pages need to be managed by a device driver, and
> > pinning them would prevent a driver from migrating them off the device.
> > 
> > However this is no reason to fail pinning of these pages. These are
> > coherent and accessible from the CPU so can be migrated just like
> > pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin
> > them first try migrating them out of ZONE_DEVICE.
> > 
> 
> Hi Alistair and all,
> 
> Here's a possible issue (below) that I really should have spotted the
> first time around, sorry for this late-breaking review. And maybe it's
> actually just my misunderstanding, anyway.

I think it might be a misunderstanding, see below.

> > Signed-off-by: Alistair Popple 
> > Acked-by: Felix Kuehling 
> > ---
> > 
> > Changes for v2:
> > 
> >   - Added Felix's Acked-by
> >   - Fixed missing check for dpage == NULL
> > 
> >   mm/gup.c | 105 ++--
> >   1 file changed, 95 insertions(+), 10 deletions(-)
> > 
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 56d9577..5e826db 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1861,6 +1861,60 @@ struct page *get_dump_page(unsigned long addr)
> >   
> >   #ifdef CONFIG_MIGRATION
> >   /*
> > + * Migrates a device coherent page back to normal memory. Caller should 
> > have a
> > + * reference on page which will be copied to the new page if migration is
> > + * successful or dropped on failure.
> > + */
> > +static struct page *migrate_device_page(struct page *page,
> > +   unsigned int gup_flags)
> > +{
> > +   struct page *dpage;
> > +   struct migrate_vma args;
> > +   unsigned long src_pfn, dst_pfn = 0;
> > +
> > +   lock_page(page);
> > +   src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
> > +   args.src = &src_pfn;
> > +   args.dst = &dst_pfn;
> > +   args.cpages = 1;
> > +   args.npages = 1;
> > +   args.vma = NULL;
> > +   migrate_vma_setup(&args);
> > +   if (!(src_pfn & MIGRATE_PFN_MIGRATE))
> > +   return NULL;
> > +
> > +   dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0);
> > +
> > +   /*
> > +* get/pin the new page now so we don't have to retry gup after
> > +* migrating. We already have a reference so this should never fail.
> > +*/
> > +   if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) {
> > +   __free_pages(dpage, 0);
> > +   dpage = NULL;
> > +   }
> > +
> > +   if (dpage) {
> > +   lock_page(dpage);
> > +   dst_pfn = migrate_pfn(page_to_pfn(dpage));
> > +   }
> > +
> > +   migrate_vma_pages(&args);
> > +   if (src_pfn & MIGRATE_PFN_MIGRATE)
> > +   copy_highpage(dpage, page);
> > +   migrate_vma_finalize(&args);
> > +   if (dpage && !(src_pfn & MIGRATE_PFN_MIGRATE)) {
> > +   if (gup_flags & FOLL_PIN)
> > +   unpin_user_page(dpage);
> > +   else
> > +   put_page(dpage);
> > +   dpage = NULL;
> > +   }
> > +
> > +   return dpage;
> > +}
> > +
> > +/*
> >* Check whether all pages are pinnable, if so return number of pages.  
> > If some
> >* pages are not pinnable, migrate them, and unpin all pages. Return zero 
> > if
> >* pages were migrated, or if some pages were not successfully isolated.
> > @@ -1888,15 +1942,40 @@ static long 
> > check_and_migrate_movable_pages(unsigned long nr_pages,
> > continue;
> > prev_head = head;
> > /*
> > -* If we get a movable page, since we are going to be pinning
> > -* these entries, try to move them out if possible.
> > +* Device coherent pages are managed by a driver and should not
> > +* be pinned indefinitely as it prevents the driver moving the
> > +* page. So when trying to pin with FOLL_LONGTERM instead try
> > +* migrating page out of device memory.
> >  */
> > if (is_dev_private_or_coherent_page(head)) {
> > +   /*
> > +* device private pages will get faulted in during gup
> > +* so

Re: [PATCH v2 2/3] mm/gup.c: Migrate device coherent pages when pinning instead of failing

2022-02-11 Thread Alistair Popple
On Thursday, 10 February 2022 10:47:35 PM AEDT David Hildenbrand wrote:
> On 10.02.22 12:39, Alistair Popple wrote:
> > On Thursday, 10 February 2022 9:53:38 PM AEDT David Hildenbrand wrote:
> >> On 07.02.22 05:26, Alistair Popple wrote:
> >>> Currently any attempts to pin a device coherent page will fail. This is
> >>> because device coherent pages need to be managed by a device driver, and
> >>> pinning them would prevent a driver from migrating them off the device.
> >>>
> >>> However this is no reason to fail pinning of these pages. These are
> >>> coherent and accessible from the CPU so can be migrated just like
> >>> pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin
> >>> them first try migrating them out of ZONE_DEVICE.
> >>>
> >>> Signed-off-by: Alistair Popple 
> >>> Acked-by: Felix Kuehling 
> >>> ---
> >>>
> >>> Changes for v2:
> >>>
> >>>  - Added Felix's Acked-by
> >>>  - Fixed missing check for dpage == NULL
> >>>
> >>>  mm/gup.c | 105 ++--
> >>>  1 file changed, 95 insertions(+), 10 deletions(-)
> >>>
> >>> diff --git a/mm/gup.c b/mm/gup.c
> >>> index 56d9577..5e826db 100644
> >>> --- a/mm/gup.c
> >>> +++ b/mm/gup.c
> >>> @@ -1861,6 +1861,60 @@ struct page *get_dump_page(unsigned long addr)
> >>>  
> >>>  #ifdef CONFIG_MIGRATION
> >>>  /*
> >>> + * Migrates a device coherent page back to normal memory. Caller should 
> >>> have a
> >>> + * reference on page which will be copied to the new page if migration is
> >>> + * successful or dropped on failure.
> >>> + */
> >>> +static struct page *migrate_device_page(struct page *page,
> >>> + unsigned int gup_flags)
> >>> +{
> >>> + struct page *dpage;
> >>> + struct migrate_vma args;
> >>> + unsigned long src_pfn, dst_pfn = 0;
> >>> +
> >>> + lock_page(page);
> >>> + src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
> >>> + args.src = &src_pfn;
> >>> + args.dst = &dst_pfn;
> >>> + args.cpages = 1;
> >>> + args.npages = 1;
> >>> + args.vma = NULL;
> >>> + migrate_vma_setup(&args);
> >>> + if (!(src_pfn & MIGRATE_PFN_MIGRATE))
> >>> + return NULL;
> >>> +
> >>> + dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0);
> >>> +
> >>> + /*
> >>> +  * get/pin the new page now so we don't have to retry gup after
> >>> +  * migrating. We already have a reference so this should never fail.
> >>> +  */
> >>> + if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) {
> >>> + __free_pages(dpage, 0);
> >>> + dpage = NULL;
> >>> + }
> >>> +
> >>> + if (dpage) {
> >>> + lock_page(dpage);
> >>> + dst_pfn = migrate_pfn(page_to_pfn(dpage));
> >>> + }
> >>> +
> >>> + migrate_vma_pages(&args);
> >>> + if (src_pfn & MIGRATE_PFN_MIGRATE)
> >>> + copy_highpage(dpage, page);
> >>> + migrate_vma_finalize(&args);
> >>> + if (dpage && !(src_pfn & MIGRATE_PFN_MIGRATE)) {
> >>> + if (gup_flags & FOLL_PIN)
> >>> + unpin_user_page(dpage);
> >>> + else
> >>> + put_page(dpage);
> >>> + dpage = NULL;
> >>> + }
> >>> +
> >>> + return dpage;
> >>> +}
> >>> +
> >>> +/*
> >>>   * Check whether all pages are pinnable, if so return number of pages.  
> >>> If some
> >>>   * pages are not pinnable, migrate them, and unpin all pages. Return 
> >>> zero if
> >>>   * pages were migrated, or if some pages were not successfully isolated.
> >>> @@ -1888,15 +1942,40 @@ static long 
> >>> check_and_migrate_movable_pages(unsigned long nr_pages,
> >>>   continue;
> >>>   prev_head = head;
> >>>   /*
> >>> -  * If we get a movable page, since we are going to be pinning
> >>> -  * these entries, try to move them out if possible.
> >>> +  * Device coherent pa

Re: [PATCH v2 2/3] mm/gup.c: Migrate device coherent pages when pinning instead of failing

2022-02-10 Thread Alistair Popple
On Thursday, 10 February 2022 9:53:38 PM AEDT David Hildenbrand wrote:
> On 07.02.22 05:26, Alistair Popple wrote:
> > Currently any attempts to pin a device coherent page will fail. This is
> > because device coherent pages need to be managed by a device driver, and
> > pinning them would prevent a driver from migrating them off the device.
> > 
> > However this is no reason to fail pinning of these pages. These are
> > coherent and accessible from the CPU so can be migrated just like
> > pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin
> > them first try migrating them out of ZONE_DEVICE.
> > 
> > Signed-off-by: Alistair Popple 
> > Acked-by: Felix Kuehling 
> > ---
> > 
> > Changes for v2:
> > 
> >  - Added Felix's Acked-by
> >  - Fixed missing check for dpage == NULL
> > 
> >  mm/gup.c | 105 ++--
> >  1 file changed, 95 insertions(+), 10 deletions(-)
> > 
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 56d9577..5e826db 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1861,6 +1861,60 @@ struct page *get_dump_page(unsigned long addr)
> >  
> >  #ifdef CONFIG_MIGRATION
> >  /*
> > + * Migrates a device coherent page back to normal memory. Caller should 
> > have a
> > + * reference on page which will be copied to the new page if migration is
> > + * successful or dropped on failure.
> > + */
> > +static struct page *migrate_device_page(struct page *page,
> > +   unsigned int gup_flags)
> > +{
> > +   struct page *dpage;
> > +   struct migrate_vma args;
> > +   unsigned long src_pfn, dst_pfn = 0;
> > +
> > +   lock_page(page);
> > +   src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
> > +   args.src = &src_pfn;
> > +   args.dst = &dst_pfn;
> > +   args.cpages = 1;
> > +   args.npages = 1;
> > +   args.vma = NULL;
> > +   migrate_vma_setup(&args);
> > +   if (!(src_pfn & MIGRATE_PFN_MIGRATE))
> > +   return NULL;
> > +
> > +   dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0);
> > +
> > +   /*
> > +* get/pin the new page now so we don't have to retry gup after
> > +* migrating. We already have a reference so this should never fail.
> > +*/
> > +   if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) {
> > +   __free_pages(dpage, 0);
> > +   dpage = NULL;
> > +   }
> > +
> > +   if (dpage) {
> > +   lock_page(dpage);
> > +   dst_pfn = migrate_pfn(page_to_pfn(dpage));
> > +   }
> > +
> > +   migrate_vma_pages(&args);
> > +   if (src_pfn & MIGRATE_PFN_MIGRATE)
> > +   copy_highpage(dpage, page);
> > +   migrate_vma_finalize(&args);
> > +   if (dpage && !(src_pfn & MIGRATE_PFN_MIGRATE)) {
> > +   if (gup_flags & FOLL_PIN)
> > +   unpin_user_page(dpage);
> > +   else
> > +   put_page(dpage);
> > +   dpage = NULL;
> > +   }
> > +
> > +   return dpage;
> > +}
> > +
> > +/*
> >   * Check whether all pages are pinnable, if so return number of pages.  If 
> > some
> >   * pages are not pinnable, migrate them, and unpin all pages. Return zero 
> > if
> >   * pages were migrated, or if some pages were not successfully isolated.
> > @@ -1888,15 +1942,40 @@ static long 
> > check_and_migrate_movable_pages(unsigned long nr_pages,
> > continue;
> > prev_head = head;
> > /*
> > -* If we get a movable page, since we are going to be pinning
> > -* these entries, try to move them out if possible.
> > +* Device coherent pages are managed by a driver and should not
> > +* be pinned indefinitely as it prevents the driver moving the
> > +* page. So when trying to pin with FOLL_LONGTERM instead try
> > +* migrating page out of device memory.
> >  */
> > if (is_dev_private_or_coherent_page(head)) {
> > +   /*
> > +* device private pages will get faulted in during gup
> > +* so it shouldn't be possible to see one here.
> > +*/
> > WARN_ON_ONCE(is_device_private_page(head));
> > -   ret = -EFAULT;
> > -   goto unpin_pages;
> > +   WARN_ON_ONCE(PageCompou

Re: start sorting out the ZONE_DEVICE refcount mess v2

2022-02-10 Thread Alistair Popple
On Thursday, 10 February 2022 6:28:01 PM AEDT Christoph Hellwig wrote:

[...]

> Changes since v1:
>  - add a missing memremap.h include in memcontrol.c
>  - include rebased versions of the device coherent support and
>device coherent migration support series as well as additional
>cleanup patches

Thanks for the rebase. I will take a closer look at it tomorrow but I just
ran the hmm-tests and they are all still passing for me with this series.

>  Diffstat:
>  arch/arm64/mm/mmu.c  |1 
>  arch/powerpc/kvm/book3s_hv_uvmem.c   |1 
>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |   35 -
>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h|1 
>  drivers/gpu/drm/drm_cache.c  |2 
>  drivers/gpu/drm/nouveau/nouveau_dmem.c   |3 
>  drivers/gpu/drm/nouveau/nouveau_svm.c|1 
>  drivers/infiniband/core/rw.c |1 
>  drivers/nvdimm/pmem.h|1 
>  drivers/nvme/host/pci.c  |1 
>  drivers/nvme/target/io-cmd-bdev.c|1 
>  fs/Kconfig   |2 
>  fs/fuse/virtio_fs.c  |1 
>  include/linux/hmm.h  |9 
>  include/linux/memremap.h |   36 +
>  include/linux/migrate.h  |1 
>  include/linux/mm.h   |   59 --
>  lib/test_hmm.c   |  353 ++---
>  lib/test_hmm_uapi.h  |   22 
>  mm/Kconfig   |7 
>  mm/Makefile  |1 
>  mm/gup.c |  127 +++-
>  mm/internal.h|3 
>  mm/memcontrol.c  |   19 
>  mm/memory-failure.c  |8 
>  mm/memremap.c|   75 +-
>  mm/migrate.c |  763 
>  mm/migrate_device.c  |  822 
> +++
>  mm/rmap.c|5 
>  mm/swap.c|   49 -
>  tools/testing/selftests/vm/Makefile  |2 
>  tools/testing/selftests/vm/hmm-tests.c   |  204 ++-
>  tools/testing/selftests/vm/test_hmm.sh   |   24 
>  33 files changed, 1552 insertions(+), 1088 deletions(-)
> 







Re: [PATCH 11/27] mm: refactor the ZONE_DEVICE handling in migrate_vma_insert_page

2022-02-10 Thread Alistair Popple
Reviewed-by: Alistair Popple 

On Thursday, 10 February 2022 6:28:12 PM AEDT Christoph Hellwig wrote:
> Make the flow a little more clear and prepare for adding a new
> ZONE_DEVICE memory type.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  mm/migrate.c | 31 +++
>  1 file changed, 15 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8e0370a73f8a43..30ecd7223656c1 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2670,26 +2670,25 @@ static void migrate_vma_insert_page(struct 
> migrate_vma *migrate,
>*/
>   __SetPageUptodate(page);
>  
> - if (is_zone_device_page(page)) {
> - if (is_device_private_page(page)) {
> - swp_entry_t swp_entry;
> + if (is_device_private_page(page)) {
> + swp_entry_t swp_entry;
>  
> - if (vma->vm_flags & VM_WRITE)
> - swp_entry = make_writable_device_private_entry(
> - page_to_pfn(page));
> - else
> - swp_entry = make_readable_device_private_entry(
> - page_to_pfn(page));
> - entry = swp_entry_to_pte(swp_entry);
> - } else {
> - /*
> -  * For now we only support migrating to un-addressable
> -  * device memory.
> -  */
> + if (vma->vm_flags & VM_WRITE)
> + swp_entry = make_writable_device_private_entry(
> + page_to_pfn(page));
> + else
> + swp_entry = make_readable_device_private_entry(
> + page_to_pfn(page));
> + entry = swp_entry_to_pte(swp_entry);
> + } else {
> + /*
> +  * For now we only support migrating to un-addressable device
> +  * memory.
> +  */
> + if (is_zone_device_page(page)) {
>   pr_warn_once("Unsupported ZONE_DEVICE page type.\n");
>   goto abort;
>   }
> - } else {
>   entry = mk_pte(page, vma->vm_page_prot);
>   if (vma->vm_flags & VM_WRITE)
>   entry = pte_mkwrite(pte_mkdirty(entry));
> 







Re: [PATCH 12/27] mm: refactor the ZONE_DEVICE handling in migrate_vma_pages

2022-02-10 Thread Alistair Popple
Reviewed-by: Alistair Popple 

On Thursday, 10 February 2022 6:28:13 PM AEDT Christoph Hellwig wrote:
> Make the flow a little more clear and prepare for adding a new
> ZONE_DEVICE memory type.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  mm/migrate.c | 27 ---
>  1 file changed, 12 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 30ecd7223656c1..746e1230886ddb 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2788,24 +2788,21 @@ void migrate_vma_pages(struct migrate_vma *migrate)
>  
>   mapping = page_mapping(page);
>  
> - if (is_zone_device_page(newpage)) {
> - if (is_device_private_page(newpage)) {
> - /*
> -  * For now only support private anonymous when
> -  * migrating to un-addressable device memory.
> -  */
> - if (mapping) {
> - migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
> - continue;
> - }
> - } else {
> - /*
> -  * Other types of ZONE_DEVICE page are not
> -  * supported.
> -  */
> + if (is_device_private_page(newpage)) {
> + /*
> +  * For now only support private anonymous when migrating
> +  * to un-addressable device memory.
> +  */
> + if (mapping) {
>   migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
>   continue;
>   }
> + } else if (is_zone_device_page(newpage)) {
> + /*
> +  * Other types of ZONE_DEVICE page are not supported.
> +  */
> + migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
> + continue;
>   }
>  
>   r = migrate_page(mapping, newpage, page, MIGRATE_SYNC_NO_COPY);
> 







Re: [PATCH 14/27] mm: build migrate_vma_* for all configs with ZONE_DEVICE support

2022-02-10 Thread Alistair Popple
Thanks, it's also better than more stubbed functions.

Reviewed-by: Alistair Popple 

On Thursday, 10 February 2022 6:28:15 PM AEDT Christoph Hellwig wrote:
> This code will be used for device coherent memory as well in a bit,
> so relax the ifdef a bit.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  mm/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 6391d8d3a616f3..95d4aa3acaefe0 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -250,7 +250,7 @@ config MIGRATION
> allocation instead of reclaiming.
>  
>  config DEVICE_MIGRATION
> - def_bool MIGRATION && DEVICE_PRIVATE
> + def_bool MIGRATION && ZONE_DEVICE
>  
>  config ARCH_ENABLE_HUGEPAGE_MIGRATION
>   bool
> 







Re: [PATCH 13/27] mm: move the migrate_vma_* device migration code into it's own file

2022-02-10 Thread Alistair Popple
I got the following build error:

/data/source/linux/mm/migrate_device.c: In function ‘migrate_vma_collect_pmd’:
/data/source/linux/mm/migrate_device.c:242:3: error: implicit declaration of 
function ‘flush_tlb_range’; did you mean ‘flush_pmd_tlb_range’? 
[-Werror=implicit-function-declaration]
  242 |   flush_tlb_range(walk->vma, start, end);
  |   ^~~
  |   flush_pmd_tlb_range

Including asm/tlbflush.h in migrate_device.c fixed it for me.
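
ie. just this on top of the patch (exact placement among the other includes
doesn't matter):

+#include <asm/tlbflush.h>	/* for flush_tlb_range() */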

On Thursday, 10 February 2022 6:28:14 PM AEDT Christoph Hellwig wrote:
> Split the code used to migrate to and from ZONE_DEVICE memory from
> migrate.c into a new file.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  mm/Kconfig  |   3 +
>  mm/Makefile |   1 +
>  mm/migrate.c| 753 ---
>  mm/migrate_device.c | 765 
>  4 files changed, 769 insertions(+), 753 deletions(-)
>  create mode 100644 mm/migrate_device.c
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index a1901ae6d06293..6391d8d3a616f3 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -249,6 +249,9 @@ config MIGRATION
> pages as migration can relocate pages to satisfy a huge page
> allocation instead of reclaiming.
>  
> +config DEVICE_MIGRATION
> + def_bool MIGRATION && DEVICE_PRIVATE
> +
>  config ARCH_ENABLE_HUGEPAGE_MIGRATION
>   bool
>  
> diff --git a/mm/Makefile b/mm/Makefile
> index 70d4309c9ce338..4cc13f3179a518 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
>  obj-$(CONFIG_FAILSLAB) += failslab.o
>  obj-$(CONFIG_MEMTEST)+= memtest.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
> +obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 746e1230886ddb..c31d04b46a5e17 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -38,12 +38,10 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -2125,757 +2123,6 @@ int migrate_misplaced_page(struct page *page, struct 
> vm_area_struct *vma,
>  #endif /* CONFIG_NUMA_BALANCING */
>  #endif /* CONFIG_NUMA */
>  
> -#ifdef CONFIG_DEVICE_PRIVATE
> -static int migrate_vma_collect_skip(unsigned long start,
> - unsigned long end,
> - struct mm_walk *walk)
> -{
> - struct migrate_vma *migrate = walk->private;
> - unsigned long addr;
> -
> - for (addr = start; addr < end; addr += PAGE_SIZE) {
> - migrate->dst[migrate->npages] = 0;
> - migrate->src[migrate->npages++] = 0;
> - }
> -
> - return 0;
> -}
> -
> -static int migrate_vma_collect_hole(unsigned long start,
> - unsigned long end,
> - __always_unused int depth,
> - struct mm_walk *walk)
> -{
> - struct migrate_vma *migrate = walk->private;
> - unsigned long addr;
> -
> - /* Only allow populating anonymous memory. */
> - if (!vma_is_anonymous(walk->vma))
> - return migrate_vma_collect_skip(start, end, walk);
> -
> - for (addr = start; addr < end; addr += PAGE_SIZE) {
> - migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
> - migrate->dst[migrate->npages] = 0;
> - migrate->npages++;
> - migrate->cpages++;
> - }
> -
> - return 0;
> -}
> -
> -static int migrate_vma_collect_pmd(pmd_t *pmdp,
> -unsigned long start,
> -unsigned long end,
> -struct mm_walk *walk)
> -{
> - struct migrate_vma *migrate = walk->private;
> - struct vm_area_struct *vma = walk->vma;
> - struct mm_struct *mm = vma->vm_mm;
> - unsigned long addr = start, unmapped = 0;
> - spinlock_t *ptl;
> - pte_t *ptep;
> -
> -again:
> - if (pmd_none(*pmdp))
> - return migrate_vma_collect_hole(start, end, -1, walk);
> -
> - if (pmd_trans_huge(*pmdp)) {
> - struct page *page;
> -
> - ptl = pmd_lock(mm, pmdp);
> - if (unlikely(!pmd_trans_huge(*pmdp))) {
> - spin_unlock(ptl);
> - goto again;
> - }
> -
> - page = pmd_page(*pmdp);
> - if (is_huge_zero_page(page)) {
> - spin_unlock(ptl);
> - split_huge_pmd(vma, pmdp, addr);
> - if (pmd_trans_unstable(pmdp))
> - return migrate_vma_collect_skip(start, end,
> - walk);
> - } else {
> - 

Re: [PATCH 6/8] mm: don't include in

2022-02-09 Thread Alistair Popple
On Thursday, 10 February 2022 4:48:36 AM AEDT Christoph Hellwig wrote:
> On Mon, Feb 07, 2022 at 04:19:29PM -0500, Felix Kuehling wrote:
> >
> > Am 2022-02-07 um 01:32 schrieb Christoph Hellwig:
> >> Move the check for the actual pgmap types that need the free at refcount
> >> one behavior into the out of line helper, and thus avoid the need to
> >> pull memremap.h into mm.h.
> >>
> >> Signed-off-by: Christoph Hellwig 
> >
> > The amdkfd part looks good to me.
> >
> > It looks like this patch is not based on Alex Sierra's coherent memory 
> > series. He added two new helpers is_device_coherent_page and 
> > is_dev_private_or_coherent_page that would need to be moved along with 
> > is_device_private_page and is_pci_p2pdma_page.
> 
> FYI, here is a branch that contains a rebase of the coherent memory
> related patches on top of this series:
> 
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/pgmap-refcount
> 
> I don't have a good way to test this, but I'll at least let the build bot
> finish before sending it out (probably tomorrow).

Thanks, I ran up hmm-test which revealed a few minor problems with the rebase.
Fixes below.

---

diff --git a/mm/gup.c b/mm/gup.c
index cbb49abb7992..8e85c9fb8df4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2007,7 +2007,6 @@ static long check_and_migrate_movable_pages(unsigned long 
nr_pages,
if (!ret && list_empty(&movable_page_list) && !isolation_error_count)
return nr_pages;
 
-   ret = 0;
 unpin_pages:
for (i = 0; i < nr_pages; i++)
if (!pages[i])
diff --git a/mm/migrate.c b/mm/migrate.c
index f909f5a92757..1ae3e99baa50 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2686,12 +2686,11 @@ static void migrate_vma_insert_page(struct migrate_vma 
*migrate,
swp_entry = make_readable_device_private_entry(
page_to_pfn(page));
entry = swp_entry_to_pte(swp_entry);
-   } else {
-   if (is_zone_device_page(page) &&
-   is_device_coherent_page(page)) {
+   } else if (is_zone_device_page(page) &&
+   !is_device_coherent_page(page)) {
pr_warn_once("Unsupported ZONE_DEVICE page type.\n");
goto abort;
-   }
+   } else {
entry = mk_pte(page, vma->vm_page_prot);
if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry));






[PATCH v2 1/3] migrate.c: Remove vma check in migrate_vma_setup()

2022-02-07 Thread Alistair Popple
migrate_vma_setup() checks that a valid vma is passed so that the page
tables can be walked to find the pfns associated with a given address
range. However in some cases the pfns are already known, such as when
migrating device coherent pages during pin_user_pages(), meaning a valid
vma isn't required.

Signed-off-by: Alistair Popple 
Acked-by: Felix Kuehling 
---

Changes for v2:

 - Added Felix's Acked-by

 mm/migrate.c | 34 +-
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index a9aed12..0d6570d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2602,24 +2602,24 @@ int migrate_vma_setup(struct migrate_vma *args)
 
args->start &= PAGE_MASK;
args->end &= PAGE_MASK;
-   if (!args->vma || is_vm_hugetlb_page(args->vma) ||
-   (args->vma->vm_flags & VM_SPECIAL) || vma_is_dax(args->vma))
-   return -EINVAL;
-   if (nr_pages <= 0)
-   return -EINVAL;
-   if (args->start < args->vma->vm_start ||
-   args->start >= args->vma->vm_end)
-   return -EINVAL;
-   if (args->end <= args->vma->vm_start || args->end > args->vma->vm_end)
-   return -EINVAL;
if (!args->src || !args->dst)
return -EINVAL;
-
-   memset(args->src, 0, sizeof(*args->src) * nr_pages);
-   args->cpages = 0;
-   args->npages = 0;
-
-   migrate_vma_collect(args);
+   if (args->vma) {
+   if (is_vm_hugetlb_page(args->vma) ||
+   (args->vma->vm_flags & VM_SPECIAL) || 
vma_is_dax(args->vma))
+   return -EINVAL;
+   if (args->start < args->vma->vm_start ||
+   args->start >= args->vma->vm_end)
+   return -EINVAL;
+   if (args->end <= args->vma->vm_start || args->end > 
args->vma->vm_end)
+   return -EINVAL;
+
+   memset(args->src, 0, sizeof(*args->src) * nr_pages);
+   args->cpages = 0;
+   args->npages = 0;
+
+   migrate_vma_collect(args);
+   }
 
if (args->cpages)
migrate_vma_unmap(args);
@@ -2804,7 +2804,7 @@ void migrate_vma_pages(struct migrate_vma *migrate)
continue;
}
 
-   if (!page) {
+   if (!page && migrate->vma) {
if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
continue;
if (!notified) {
-- 
git-series 0.9.1


[PATCH v2 2/3] mm/gup.c: Migrate device coherent pages when pinning instead of failing

2022-02-07 Thread Alistair Popple
Currently any attempts to pin a device coherent page will fail. This is
because device coherent pages need to be managed by a device driver, and
pinning them would prevent a driver from migrating them off the device.

However this is no reason to fail pinning of these pages. These are
coherent and accessible from the CPU so can be migrated just like
pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin
them first try migrating them out of ZONE_DEVICE.

Signed-off-by: Alistair Popple 
Acked-by: Felix Kuehling 
---

Changes for v2:

 - Added Felix's Acked-by
 - Fixed missing check for dpage == NULL

 mm/gup.c | 105 ++--
 1 file changed, 95 insertions(+), 10 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 56d9577..5e826db 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1861,6 +1861,60 @@ struct page *get_dump_page(unsigned long addr)
 
 #ifdef CONFIG_MIGRATION
 /*
+ * Migrates a device coherent page back to normal memory. Caller should have a
+ * reference on page which will be copied to the new page if migration is
+ * successful or dropped on failure.
+ */
+static struct page *migrate_device_page(struct page *page,
+   unsigned int gup_flags)
+{
+   struct page *dpage;
+   struct migrate_vma args;
+   unsigned long src_pfn, dst_pfn = 0;
+
+   lock_page(page);
+   src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
+   args.src = &src_pfn;
+   args.dst = &dst_pfn;
+   args.cpages = 1;
+   args.npages = 1;
+   args.vma = NULL;
+   migrate_vma_setup(&args);
+   if (!(src_pfn & MIGRATE_PFN_MIGRATE))
+   return NULL;
+
+   dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0);
+
+   /*
+* get/pin the new page now so we don't have to retry gup after
+* migrating. We already have a reference so this should never fail.
+*/
+   if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) {
+   __free_pages(dpage, 0);
+   dpage = NULL;
+   }
+
+   if (dpage) {
+   lock_page(dpage);
+   dst_pfn = migrate_pfn(page_to_pfn(dpage));
+   }
+
+   migrate_vma_pages(&args);
+   if (src_pfn & MIGRATE_PFN_MIGRATE)
+   copy_highpage(dpage, page);
+   migrate_vma_finalize(&args);
+   if (dpage && !(src_pfn & MIGRATE_PFN_MIGRATE)) {
+   if (gup_flags & FOLL_PIN)
+   unpin_user_page(dpage);
+   else
+   put_page(dpage);
+   dpage = NULL;
+   }
+
+   return dpage;
+}
+
+/*
  * Check whether all pages are pinnable, if so return number of pages.  If some
  * pages are not pinnable, migrate them, and unpin all pages. Return zero if
  * pages were migrated, or if some pages were not successfully isolated.
@@ -1888,15 +1942,40 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
continue;
prev_head = head;
/*
-* If we get a movable page, since we are going to be pinning
-* these entries, try to move them out if possible.
+* Device coherent pages are managed by a driver and should not
+* be pinned indefinitely as it prevents the driver moving the
+* page. So when trying to pin with FOLL_LONGTERM instead try
+* migrating page out of device memory.
 */
if (is_dev_private_or_coherent_page(head)) {
+   /*
+* device private pages will get faulted in during gup
+* so it shouldn't be possible to see one here.
+*/
WARN_ON_ONCE(is_device_private_page(head));
-   ret = -EFAULT;
-   goto unpin_pages;
+   WARN_ON_ONCE(PageCompound(head));
+
+   /*
+* migration will fail if the page is pinned, so convert
+* the pin on the source page to a normal reference.
+*/
+   if (gup_flags & FOLL_PIN) {
+   get_page(head);
+   unpin_user_page(head);
+   }
+
+   pages[i] = migrate_device_page(head, gup_flags);
+   if (!pages[i]) {
+   ret = -EBUSY;
+   break;
+   }
+   continue;
}
 
+   /*
+* If we get a movable page, since we are going to be pinning
+* these entries, try to move them out if possible.
+*/
if (!is_pinnable_page(head)) {
if (PageHuge(head)) {
  

Re: [PATCH 2/3] mm/gup.c: Migrate device coherent pages when pinning instead of failing

2022-02-07 Thread Alistair Popple
On Wednesday, 2 February 2022 2:03:01 AM AEDT Felix Kuehling wrote:
> 
> Am 2022-02-01 um 02:05 schrieb Alistair Popple:
> > Currently any attempts to pin a device coherent page will fail. This is
> > because device coherent pages need to be managed by a device driver, and
> > pinning them would prevent a driver from migrating them off the device.
> >
> > However this is no reason to fail pinning of these pages. These are
> > coherent and accessible from the CPU so can be migrated just like
> > pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin
> > them first try migrating them out of ZONE_DEVICE.
> >
> > Signed-off-by: Alistair Popple 
> 
> Thank you for working on this. I have two questions inline.
> 
> Other than that, patches 1 and 2 are
> 
> Acked-by: Felix Kuehling 
> 
> 
> > ---
> >   mm/gup.c | 105 ++--
> >   1 file changed, 95 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index f596b93..2cbef54 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1834,6 +1834,60 @@ struct page *get_dump_page(unsigned long addr)
> >   
> >   #ifdef CONFIG_MIGRATION
> >   /*
> > + * Migrates a device coherent page back to normal memory. Caller should have a
> > + * reference on page which will be copied to the new page if migration is
> > + * successful or dropped on failure.
> > + */
> > +static struct page *migrate_device_page(struct page *page,
> > +   unsigned int gup_flags)
> > +{
> > +   struct page *dpage;
> > +   struct migrate_vma args;
> > +   unsigned long src_pfn, dst_pfn = 0;
> > +
> > +   lock_page(page);
> > +   src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
> > +   args.src = &src_pfn;
> > +   args.dst = &dst_pfn;
> > +   args.cpages = 1;
> > +   args.npages = 1;
> > +   args.vma = NULL;
> > +   migrate_vma_setup(&args);
> > +   if (!(src_pfn & MIGRATE_PFN_MIGRATE))
> > +   return NULL;
> > +
> > +   dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0);
> 
> Don't you need to check dpage for NULL before the try_grab_page call below?

Yes, thanks for pointing that out. Will fix for v2.

> > +
> > +   /*
> > +* get/pin the new page now so we don't have to retry gup after
> > +* migrating. We already have a reference so this should never fail.
> > +*/
> > +   if (WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) {
> > +   __free_pages(dpage, 0);
> > +   dpage = NULL;
> > +   }
> > +
> > +   if (dpage) {
> > +   lock_page(dpage);
> > +   dst_pfn = migrate_pfn(page_to_pfn(dpage));
> > +   }
> > +
> > +   migrate_vma_pages(&args);
> > +   if (src_pfn & MIGRATE_PFN_MIGRATE)
> > +   copy_highpage(dpage, page);
> 
> Can't dpage can be NULL here as well?

No - migrate_vma_pages() will clear src_pfn & MIGRATE_PFN_MIGRATE if no
destination page is provided in dst_pfn.

> Regards,
>Felix
> 
> 
> > +   migrate_vma_finalize(&args);
> > +   if (dpage && !(src_pfn & MIGRATE_PFN_MIGRATE)) {
> > +   if (gup_flags & FOLL_PIN)
> > +   unpin_user_page(dpage);
> > +   else
> > +   put_page(dpage);
> > +   dpage = NULL;
> > +   }
> > +
> > +   return dpage;
> > +}
> > +
> > +/*
> >* Check whether all pages are pinnable, if so return number of pages.  If some
> >* pages are not pinnable, migrate them, and unpin all pages. Return zero if
> >* pages were migrated, or if some pages were not successfully isolated.
> > @@ -1861,15 +1915,40 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
> > continue;
> > prev_head = head;
> > /*
> > -* If we get a movable page, since we are going to be pinning
> > -* these entries, try to move them out if possible.
> > +* Device coherent pages are managed by a driver and should not
> > +* be pinned indefinitely as it prevents the driver moving the
> > +* page. So when trying to pin with FOLL_LONGTERM instead try
> > +* migrating page out of device memory.
> >  */
> > if (is_dev_private_or_coherent_page(head)) {
> > +   /*
> > +* device priv

[PATCH v2 0/3] Migrate device coherent pages on get_user_pages()

2022-02-07 Thread Alistair Popple
Device coherent pages represent memory on a coherently attached device such
as a GPU which is usually under the control of a driver. These pages should
not be pinned as the driver needs to be able to move pages as required.
Currently this is enforced by failing any attempt to pin a device coherent
page.

A similar problem exists for ZONE_MOVABLE pages. In that case though the
pages are migrated instead of causing failure. There is no reason the
kernel can't migrate device coherent pages so this series implements
migration for device coherent pages so the same strategy of migrate and pin
can be used.

This series depends on the series "Add MEMORY_DEVICE_COHERENT for coherent
device memory mapping"[1] which is in linux-next-20220204 and should apply
cleanly to that.

[1] - https://lore.kernel.org/linux-mm/20220128200825.8623-1-alex.sie...@amd.com/

Changes for v2:

 - Rebased on to linux-next-20220204

Alex Sierra (1):
  tools: add hmm gup test for long term pinned device pages

Alistair Popple (2):
  migrate.c: Remove vma check in migrate_vma_setup()
  mm/gup.c: Migrate device coherent pages when pinning instead of failing

 mm/gup.c   | 105 +++---
 mm/migrate.c   |  34 
 tools/testing/selftests/vm/Makefile|   2 +-
 tools/testing/selftests/vm/hmm-tests.c |  81 -
 4 files changed, 194 insertions(+), 28 deletions(-)

base-commit: ef6b35306dd8f15a7e5e5a2532e665917a43c5d9
-- 
git-series 0.9.1


[PATCH v2 3/3] tools: add hmm gup test for long term pinned device pages

2022-02-07 Thread Alistair Popple
From: Alex Sierra 

The intention is to test device coherent type pages that have been
called through get user pages with PIN_LONGTERM flag set. These pages
should get migrated back to normal system memory.

Signed-off-by: Alex Sierra 
Signed-off-by: Alistair Popple 
Reviewed-by: Felix Kuehling  
---

Changes for v2:
 - Added Felix's Reviewed-by (thanks!)

 tools/testing/selftests/vm/Makefile|  2 +-
 tools/testing/selftests/vm/hmm-tests.c | 81 +++-
 2 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 96714d2..32032c7 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -143,7 +143,7 @@ $(OUTPUT)/mlock-random-test $(OUTPUT)/memfd_secret: LDLIBS += -lcap
 
 $(OUTPUT)/gup_test: ../../../../mm/gup_test.h
 
-$(OUTPUT)/hmm-tests: local_config.h
+$(OUTPUT)/hmm-tests: local_config.h ../../../../mm/gup_test.h
 
 # HMM_EXTRA_LIBS may get set in local_config.mk, or it may be left empty.
 $(OUTPUT)/hmm-tests: LDLIBS += $(HMM_EXTRA_LIBS)
diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 84ec8c4..11b83a8 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -36,6 +36,7 @@
  * in the usual include/uapi/... directory.
  */
 #include "../../../../lib/test_hmm_uapi.h"
+#include "../../../../mm/gup_test.h"
 
 struct hmm_buffer {
void*ptr;
@@ -60,6 +61,8 @@ enum {
 #define NTIMES 10
 
 #define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1)))
+/* Just the flags we need, copied from mm.h: */
#define FOLL_WRITE 0x01 /* check pte is writable */
 
 FIXTURE(hmm)
 {
@@ -1766,4 +1769,82 @@ TEST_F(hmm, exclusive_cow)
hmm_buffer_free(buffer);
 }
 
+/*
+ * Test get user device pages through gup_test. Setting PIN_LONGTERM flag.
+ * This should trigger a migration back to system memory for both, private
+ * and coherent type pages.
+ * This test makes use of gup_test module. Make sure GUP_TEST_CONFIG is added
+ * to your configuration before you run it.
+ */
+TEST_F(hmm, hmm_gup_test)
+{
+   struct hmm_buffer *buffer;
+   struct gup_test gup;
+   int gup_fd;
+   unsigned long npages;
+   unsigned long size;
+   unsigned long i;
+   int *ptr;
+   int ret;
+   unsigned char *m;
+
+   gup_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
+   if (gup_fd == -1)
+   SKIP(return, "Skipping test, could not find gup_test driver");
+
+   npages = 4;
+   ASSERT_NE(npages, 0);
+   size = npages << self->page_shift;
+
+   buffer = malloc(sizeof(*buffer));
+   ASSERT_NE(buffer, NULL);
+
+   buffer->fd = -1;
+   buffer->size = size;
+   buffer->mirror = malloc(size);
+   ASSERT_NE(buffer->mirror, NULL);
+
+   buffer->ptr = mmap(NULL, size,
+  PROT_READ | PROT_WRITE,
+  MAP_PRIVATE | MAP_ANONYMOUS,
+  buffer->fd, 0);
+   ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+   /* Initialize buffer in system memory. */
+   for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+   ptr[i] = i;
+
+   /* Migrate memory to device. */
+   ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+   ASSERT_EQ(ret, 0);
+   ASSERT_EQ(buffer->cpages, npages);
+   /* Check what the device read. */
+   for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+   ASSERT_EQ(ptr[i], i);
+
+   gup.nr_pages_per_call = npages;
+   gup.addr = (unsigned long)buffer->ptr;
+   gup.gup_flags = FOLL_WRITE;
+   gup.size = size;
+   /*
+* Calling gup_test ioctl. It will try to PIN_LONGTERM these device pages
+* causing a migration back to system memory for both, private and coherent
+* type pages.
+*/
+   if (ioctl(gup_fd, PIN_LONGTERM_BENCHMARK, &gup)) {
+   perror("ioctl on PIN_LONGTERM_BENCHMARK\n");
+   goto out_test;
+   }
+
+   /* Take snapshot to make sure pages have been migrated to sys memory */
+   ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_SNAPSHOT, buffer, npages);
+   ASSERT_EQ(ret, 0);
+   ASSERT_EQ(buffer->cpages, npages);
+   m = buffer->mirror;
+   for (i = 0; i < npages; i++)
+   ASSERT_EQ(m[i], HMM_DMIRROR_PROT_WRITE);
+out_test:
+   close(gup_fd);
+   hmm_buffer_free(buffer);
+}
 TEST_HARNESS_MAIN
-- 
git-series 0.9.1


Re: [PATCH v5 09/10] tools: update hmm-test to support device coherent type

2022-02-01 Thread Alistair Popple
Oh sorry, I had looked at this but forgotten to add my reviewed by:

Reviewed-by: Alistair Popple 

On Tuesday, 1 February 2022 10:27:25 AM AEDT Sierra Guiza, Alejandro (Alex) wrote:
> Hi Alistair,
> This is the last patch to be reviewed from this series. It already has the changes from
> your last feedback (V3). Would you mind to take a look?
> Thanks a lot for reviewing the rest!
> 
> Regards,
> Alex Sierra
> 
> On 1/28/2022 2:08 PM, Alex Sierra wrote:
> > Test cases such as migrate_fault and migrate_multiple, were modified to
> > explicit migrate from device to sys memory without the need of page
> > faults, when using device coherent type.
> >
> > Snapshot test case updated to read memory device type first and based
> > on that, get the proper returned results. migrate_ping_pong test case
> > added to test explicit migration from device to sys memory for both
> > private and coherent zone types.
> >
> > Helpers to migrate from device to sys memory and vicerversa
> > were also added.
> >
> > Signed-off-by: Alex Sierra 
> > Acked-by: Felix Kuehling 
> > ---
> > v2:
> > Set FIXTURE_VARIANT to add multiple device types to the FIXTURE. This
> > will run all the tests for each device type (private and coherent) in
> > case both existed during hmm-test driver probed.
> > v4:
> > Check for the number of pages successfully migrated from coherent
> > device to system at migrate_multiple test.
> > ---
> >   tools/testing/selftests/vm/hmm-tests.c | 123 -
> >   1 file changed, 102 insertions(+), 21 deletions(-)
> >
> > diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
> > index 203323967b50..84ec8c4a1dc7 100644
> > --- a/tools/testing/selftests/vm/hmm-tests.c
> > +++ b/tools/testing/selftests/vm/hmm-tests.c
> > @@ -44,6 +44,14 @@ struct hmm_buffer {
> > int fd;
> > uint64_tcpages;
> > uint64_tfaults;
> > +   int zone_device_type;
> > +};
> > +
> > +enum {
> > +   HMM_PRIVATE_DEVICE_ONE,
> > +   HMM_PRIVATE_DEVICE_TWO,
> > +   HMM_COHERENCE_DEVICE_ONE,
> > +   HMM_COHERENCE_DEVICE_TWO,
> >   };
> >   
> >   #define TWOMEG(1 << 21)
> > @@ -60,6 +68,21 @@ FIXTURE(hmm)
> > unsigned intpage_shift;
> >   };
> >   
> > +FIXTURE_VARIANT(hmm)
> > +{
> > +   int device_number;
> > +};
> > +
> > +FIXTURE_VARIANT_ADD(hmm, hmm_device_private)
> > +{
> > +   .device_number = HMM_PRIVATE_DEVICE_ONE,
> > +};
> > +
> > +FIXTURE_VARIANT_ADD(hmm, hmm_device_coherent)
> > +{
> > +   .device_number = HMM_COHERENCE_DEVICE_ONE,
> > +};
> > +
> >   FIXTURE(hmm2)
> >   {
> > int fd0;
> > @@ -68,6 +91,24 @@ FIXTURE(hmm2)
> > unsigned intpage_shift;
> >   };
> >   
> > +FIXTURE_VARIANT(hmm2)
> > +{
> > +   int device_number0;
> > +   int device_number1;
> > +};
> > +
> > +FIXTURE_VARIANT_ADD(hmm2, hmm2_device_private)
> > +{
> > +   .device_number0 = HMM_PRIVATE_DEVICE_ONE,
> > +   .device_number1 = HMM_PRIVATE_DEVICE_TWO,
> > +};
> > +
> > +FIXTURE_VARIANT_ADD(hmm2, hmm2_device_coherent)
> > +{
> > +   .device_number0 = HMM_COHERENCE_DEVICE_ONE,
> > +   .device_number1 = HMM_COHERENCE_DEVICE_TWO,
> > +};
> > +
> >   static int hmm_open(int unit)
> >   {
> > char pathname[HMM_PATH_MAX];
> > @@ -81,12 +122,19 @@ static int hmm_open(int unit)
> > return fd;
> >   }
> >   
> > +static bool hmm_is_coherent_type(int dev_num)
> > +{
> > +   return (dev_num >= HMM_COHERENCE_DEVICE_ONE);
> > +}
> > +
> >   FIXTURE_SETUP(hmm)
> >   {
> > self->page_size = sysconf(_SC_PAGE_SIZE);
> > self->page_shift = ffs(self->page_size) - 1;
> >   
> > -   self->fd = hmm_open(0);
> > +   self->fd = hmm_open(variant->device_number);
> > +   if (self->fd < 0 && hmm_is_coherent_type(variant->device_number))
> > +   SKIP(exit(0), "DEVICE_COHERENT not available");
> > ASSERT_GE(self->fd, 0);
> >   }
> >   
> > @@ -95,9 +143,11 @@ FIXTURE_SETUP(hmm2)
> > self->page_size = sysconf(_SC_PAGE_SIZE);
> > self->page_shift = ffs(self->page_size) - 1;
> >   
> > -   self->fd0 = hmm_open(0);
> > +   self->fd0 = hmm_open(variant->device_

[PATCH 2/3] mm/gup.c: Migrate device coherent pages when pinning instead of failing

2022-02-01 Thread Alistair Popple
Currently any attempts to pin a device coherent page will fail. This is
because device coherent pages need to be managed by a device driver, and
pinning them would prevent a driver from migrating them off the device.

However this is no reason to fail pinning of these pages. These are
coherent and accessible from the CPU so can be migrated just like
pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin
them first try migrating them out of ZONE_DEVICE.

Signed-off-by: Alistair Popple 
---
 mm/gup.c | 105 ++--
 1 file changed, 95 insertions(+), 10 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index f596b93..2cbef54 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1834,6 +1834,60 @@ struct page *get_dump_page(unsigned long addr)
 
 #ifdef CONFIG_MIGRATION
 /*
+ * Migrates a device coherent page back to normal memory. Caller should have a
+ * reference on page which will be copied to the new page if migration is
+ * successful or dropped on failure.
+ */
+static struct page *migrate_device_page(struct page *page,
+   unsigned int gup_flags)
+{
+   struct page *dpage;
+   struct migrate_vma args;
+   unsigned long src_pfn, dst_pfn = 0;
+
+   lock_page(page);
+   src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
+   args.src = &src_pfn;
+   args.dst = &dst_pfn;
+   args.cpages = 1;
+   args.npages = 1;
+   args.vma = NULL;
+   migrate_vma_setup(&args);
+   if (!(src_pfn & MIGRATE_PFN_MIGRATE))
+   return NULL;
+
+   dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0);
+
+   /*
+* get/pin the new page now so we don't have to retry gup after
+* migrating. We already have a reference so this should never fail.
+*/
+   if (WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) {
+   __free_pages(dpage, 0);
+   dpage = NULL;
+   }
+
+   if (dpage) {
+   lock_page(dpage);
+   dst_pfn = migrate_pfn(page_to_pfn(dpage));
+   }
+
+   migrate_vma_pages(&args);
+   if (src_pfn & MIGRATE_PFN_MIGRATE)
+   copy_highpage(dpage, page);
+   migrate_vma_finalize(&args);
+   if (dpage && !(src_pfn & MIGRATE_PFN_MIGRATE)) {
+   if (gup_flags & FOLL_PIN)
+   unpin_user_page(dpage);
+   else
+   put_page(dpage);
+   dpage = NULL;
+   }
+
+   return dpage;
+}
+
+/*
  * Check whether all pages are pinnable, if so return number of pages.  If some
  * pages are not pinnable, migrate them, and unpin all pages. Return zero if
  * pages were migrated, or if some pages were not successfully isolated.
@@ -1861,15 +1915,40 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
continue;
prev_head = head;
/*
-* If we get a movable page, since we are going to be pinning
-* these entries, try to move them out if possible.
+* Device coherent pages are managed by a driver and should not
+* be pinned indefinitely as it prevents the driver moving the
+* page. So when trying to pin with FOLL_LONGTERM instead try
+* migrating page out of device memory.
 */
if (is_dev_private_or_coherent_page(head)) {
+   /*
+* device private pages will get faulted in during gup
+* so it shouldn't be possible to see one here.
+*/
WARN_ON_ONCE(is_device_private_page(head));
-   ret = -EFAULT;
-   goto unpin_pages;
+   WARN_ON_ONCE(PageCompound(head));
+
+   /*
+* migration will fail if the page is pinned, so convert
+* the pin on the source page to a normal reference.
+*/
+   if (gup_flags & FOLL_PIN) {
+   get_page(head);
+   unpin_user_page(head);
+   }
+
+   pages[i] = migrate_device_page(head, gup_flags);
+   if (!pages[i]) {
+   ret = -EBUSY;
+   break;
+   }
+   continue;
}
 
+   /*
+* If we get a movable page, since we are going to be pinning
+* these entries, try to move them out if possible.
+*/
if (!is_pinnable_page(head)) {
if (PageHuge(head)) {
if (!isolate_huge_page(head, &movable_page_list))
@@ -1897,16 +1976,22 @@ static long check_and_migrate_movable_p

[PATCH 0/3] Migrate device coherent pages on get_user_pages()

2022-02-01 Thread Alistair Popple
Device coherent pages represent memory on a coherently attached device such
as a GPU which is usually under the control of a driver. These pages should
not be pinned as the driver needs to be able to move pages as required.
Currently this is enforced by failing any attempt to pin a device coherent
page.

A similar problem exists for ZONE_MOVABLE pages. In that case though the
pages are migrated instead of causing failure. There is no reason the
kernel can't migrate device coherent pages so this series implements
migration for device coherent pages so the same strategy of migrate and pin
can be used.

This series depends on the series "Add MEMORY_DEVICE_COHERENT for coherent
device memory mapping"[1] and should apply cleanly on top of that.

[1] - https://lore.kernel.org/linux-mm/20220128200825.8623-1-alex.sie...@amd.com/

Alex Sierra (1):
  tools: add hmm gup test for long term pinned device pages

Alistair Popple (2):
  migrate.c: Remove vma check in migrate_vma_setup()
  mm/gup.c: Migrate device coherent pages when pinning instead of failing

 mm/gup.c   | 105 +++---
 mm/migrate.c   |  34 
 tools/testing/selftests/vm/Makefile|   2 +-
 tools/testing/selftests/vm/hmm-tests.c |  81 -
 4 files changed, 194 insertions(+), 28 deletions(-)

-- 
git-series 0.9.1


[PATCH 1/3] migrate.c: Remove vma check in migrate_vma_setup()

2022-02-01 Thread Alistair Popple
migrate_vma_setup() checks that a valid vma is passed so that the page
tables can be walked to find the pfns associated with a given address
range. However in some cases the pfns are already known, such as when
migrating device coherent pages during pin_user_pages() meaning a valid
vma isn't required.
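
For illustration, a condensed sketch of the vma-less usage this enables
(essentially what the migrate_device_page() helper in the next patch does);
treat it as a sketch rather than a drop-in function:

/*
 * Sketch only: migrate a single page whose pfn is already known,
 * without walking any page tables (args.vma == NULL).
 */
static int example_migrate_known_page(struct page *page)
{
	struct migrate_vma args = {};
	unsigned long src_pfn, dst_pfn = 0;

	lock_page(page);
	src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
	args.src = &src_pfn;
	args.dst = &dst_pfn;
	args.npages = 1;
	args.cpages = 1;
	args.vma = NULL;	/* pfns already known, so no vma is needed */

	migrate_vma_setup(&args);
	if (!(src_pfn & MIGRATE_PFN_MIGRATE))
		return -EBUSY;

	/* ... allocate and lock a destination page, fill in dst_pfn ... */

	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);
	return 0;
}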

Signed-off-by: Alistair Popple 
---
 mm/migrate.c | 34 +-
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index d3cc358..31ba8ca 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2581,24 +2581,24 @@ int migrate_vma_setup(struct migrate_vma *args)
 
args->start &= PAGE_MASK;
args->end &= PAGE_MASK;
-   if (!args->vma || is_vm_hugetlb_page(args->vma) ||
-   (args->vma->vm_flags & VM_SPECIAL) || vma_is_dax(args->vma))
-   return -EINVAL;
-   if (nr_pages <= 0)
-   return -EINVAL;
-   if (args->start < args->vma->vm_start ||
-   args->start >= args->vma->vm_end)
-   return -EINVAL;
-   if (args->end <= args->vma->vm_start || args->end > args->vma->vm_end)
-   return -EINVAL;
if (!args->src || !args->dst)
return -EINVAL;
-
-   memset(args->src, 0, sizeof(*args->src) * nr_pages);
-   args->cpages = 0;
-   args->npages = 0;
-
-   migrate_vma_collect(args);
+   if (args->vma) {
+   if (is_vm_hugetlb_page(args->vma) ||
+   (args->vma->vm_flags & VM_SPECIAL) || vma_is_dax(args->vma))
+   return -EINVAL;
+   if (args->start < args->vma->vm_start ||
+   args->start >= args->vma->vm_end)
+   return -EINVAL;
+   if (args->end <= args->vma->vm_start || args->end > args->vma->vm_end)
+   return -EINVAL;
+
+   memset(args->src, 0, sizeof(*args->src) * nr_pages);
+   args->cpages = 0;
+   args->npages = 0;
+
+   migrate_vma_collect(args);
+   }
 
if (args->cpages)
migrate_vma_unmap(args);
@@ -2783,7 +2783,7 @@ void migrate_vma_pages(struct migrate_vma *migrate)
continue;
}
 
-   if (!page) {
+   if (!page && migrate->vma) {
if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
continue;
if (!notified) {
-- 
git-series 0.9.1


[PATCH 3/3] tools: add hmm gup test for long term pinned device pages

2022-02-01 Thread Alistair Popple
From: Alex Sierra 

The intention is to test device coherent type pages that have been
called through get user pages with PIN_LONGTERM flag set. These pages
should get migrated back to normal system memory.

Signed-off-by: Alex Sierra 
Signed-off-by: Alistair Popple 
---
 tools/testing/selftests/vm/Makefile|  2 +-
 tools/testing/selftests/vm/hmm-tests.c | 81 +++-
 2 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 1607322..58c8427 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -142,7 +142,7 @@ $(OUTPUT)/mlock-random-test $(OUTPUT)/memfd_secret: LDLIBS += -lcap
 
 $(OUTPUT)/gup_test: ../../../../mm/gup_test.h
 
-$(OUTPUT)/hmm-tests: local_config.h
+$(OUTPUT)/hmm-tests: local_config.h ../../../../mm/gup_test.h
 
 # HMM_EXTRA_LIBS may get set in local_config.mk, or it may be left empty.
 $(OUTPUT)/hmm-tests: LDLIBS += $(HMM_EXTRA_LIBS)
diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 84ec8c4..11b83a8 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -36,6 +36,7 @@
  * in the usual include/uapi/... directory.
  */
 #include "../../../../lib/test_hmm_uapi.h"
+#include "../../../../mm/gup_test.h"
 
 struct hmm_buffer {
void*ptr;
@@ -60,6 +61,8 @@ enum {
 #define NTIMES 10
 
 #define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1)))
+/* Just the flags we need, copied from mm.h: */
#define FOLL_WRITE 0x01 /* check pte is writable */
 
 FIXTURE(hmm)
 {
@@ -1766,4 +1769,82 @@ TEST_F(hmm, exclusive_cow)
hmm_buffer_free(buffer);
 }
 
+/*
+ * Test get user device pages through gup_test. Setting PIN_LONGTERM flag.
+ * This should trigger a migration back to system memory for both, private
+ * and coherent type pages.
+ * This test makes use of gup_test module. Make sure GUP_TEST_CONFIG is added
+ * to your configuration before you run it.
+ */
+TEST_F(hmm, hmm_gup_test)
+{
+   struct hmm_buffer *buffer;
+   struct gup_test gup;
+   int gup_fd;
+   unsigned long npages;
+   unsigned long size;
+   unsigned long i;
+   int *ptr;
+   int ret;
+   unsigned char *m;
+
+   gup_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
+   if (gup_fd == -1)
+   SKIP(return, "Skipping test, could not find gup_test driver");
+
+   npages = 4;
+   ASSERT_NE(npages, 0);
+   size = npages << self->page_shift;
+
+   buffer = malloc(sizeof(*buffer));
+   ASSERT_NE(buffer, NULL);
+
+   buffer->fd = -1;
+   buffer->size = size;
+   buffer->mirror = malloc(size);
+   ASSERT_NE(buffer->mirror, NULL);
+
+   buffer->ptr = mmap(NULL, size,
+  PROT_READ | PROT_WRITE,
+  MAP_PRIVATE | MAP_ANONYMOUS,
+  buffer->fd, 0);
+   ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+   /* Initialize buffer in system memory. */
+   for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+   ptr[i] = i;
+
+   /* Migrate memory to device. */
+   ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+   ASSERT_EQ(ret, 0);
+   ASSERT_EQ(buffer->cpages, npages);
+   /* Check what the device read. */
+   for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+   ASSERT_EQ(ptr[i], i);
+
+   gup.nr_pages_per_call = npages;
+   gup.addr = (unsigned long)buffer->ptr;
+   gup.gup_flags = FOLL_WRITE;
+   gup.size = size;
+   /*
+* Calling gup_test ioctl. It will try to PIN_LONGTERM these device pages
+* causing a migration back to system memory for both, private and coherent
+* type pages.
+*/
+   if (ioctl(gup_fd, PIN_LONGTERM_BENCHMARK, &gup)) {
+   perror("ioctl on PIN_LONGTERM_BENCHMARK\n");
+   goto out_test;
+   }
+
+   /* Take snapshot to make sure pages have been migrated to sys memory */
+   ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_SNAPSHOT, buffer, npages);
+   ASSERT_EQ(ret, 0);
+   ASSERT_EQ(buffer->cpages, npages);
+   m = buffer->mirror;
+   for (i = 0; i < npages; i++)
+   ASSERT_EQ(m[i], HMM_DMIRROR_PROT_WRITE);
+out_test:
+   close(gup_fd);
+   hmm_buffer_free(buffer);
+}
 TEST_HARNESS_MAIN
-- 
git-series 0.9.1


Re: [PATCH] mm: add device coherent vma selection for memory migration

2022-01-31 Thread Alistair Popple
Thanks for fixing. I'm guessing Andrew will want you to resend this as part of
a new v6 series, but please add:

Reviewed-by: Alistair Popple 

On Tuesday, 1 February 2022 6:48:13 AM AEDT Alex Sierra wrote:
> This case is used to migrate pages from device memory, back to system
> memory. Device coherent type memory is cache coherent from device and CPU
> point of view.
> 
> Signed-off-by: Alex Sierra 
> Acked-by: Felix Kuehling 
> ---
> v2:
> condition added when migrations from device coherent pages.
> ---
>  include/linux/migrate.h |  1 +
>  mm/migrate.c| 12 +---
>  2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index db96e10eb8da..66a34eae8cb6 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -130,6 +130,7 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
>  enum migrate_vma_direction {
>   MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
>   MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
> + MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
>  };
>  
>  struct migrate_vma {
> diff --git a/mm/migrate.c b/mm/migrate.c
> index cd137aedcfe5..69c6830c47c6 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2264,15 +2264,21 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   if (is_writable_device_private_entry(entry))
>   mpfn |= MIGRATE_PFN_WRITE;
>   } else {
> - if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> - goto next;
>   pfn = pte_pfn(pte);
> - if (is_zero_pfn(pfn)) {
> + if (is_zero_pfn(pfn) &&
> + (migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
>   mpfn = MIGRATE_PFN_MIGRATE;
>   migrate->cpages++;
>   goto next;
>   }
>   page = vm_normal_page(migrate->vma, addr, pte);
> + if (page && !is_zone_device_page(page) &&
> + !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> + goto next;
> + else if (page && is_device_coherent_page(page) &&
> + (!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
> +  page->pgmap->owner != migrate->pgmap_owner))
> + goto next;
>   mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
>   mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
>   }
> 






Re: [PATCH v5 01/10] mm: add zone device coherent type memory support

2022-01-31 Thread Alistair Popple
Looks good, feel free to add:

Reviewed-by: Alistair Popple 

On Saturday, 29 January 2022 7:08:16 AM AEDT Alex Sierra wrote:
> Device memory that is cache coherent from device and CPU point of view.
> This is used on platforms that have an advanced system bus (like CAPI
> or CXL). Any page of a process can be migrated to such memory. However,
> no one should be allowed to pin such memory so that it can always be
> evicted.
> 
> Signed-off-by: Alex Sierra 
> Acked-by: Felix Kuehling 
> ---
> v4:
> - use the same system entry path for coherent device pages at
> migrate_vma_insert_page.
> 
> - Add coherent device type support for try_to_migrate /
> try_to_migrate_one.
> ---
>  include/linux/memremap.h |  8 +++
>  include/linux/mm.h   | 16 ++
>  mm/memcontrol.c  |  6 +++---
>  mm/memory-failure.c  |  8 +--
>  mm/memremap.c| 14 -
>  mm/migrate.c | 45 
>  mm/rmap.c|  5 +++--
>  7 files changed, 71 insertions(+), 31 deletions(-)
> 
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 1fafcc38acba..727b8c789193 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -39,6 +39,13 @@ struct vmem_altmap {
>   * A more complete discussion of unaddressable memory may be found in
>   * include/linux/hmm.h and Documentation/vm/hmm.rst.
>   *
> + * MEMORY_DEVICE_COHERENT:
> + * Device memory that is cache coherent from device and CPU point of view. This
> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
> + * type. Any page of a process can be migrated to such memory. However no one
> + * should be allowed to pin such memory so that it can always be evicted.
> + *
>   * MEMORY_DEVICE_FS_DAX:
>   * Host memory that has similar access semantics as System RAM i.e. DMA
>   * coherent and supports page pinning. In support of coordinating page
> @@ -59,6 +66,7 @@ struct vmem_altmap {
>  enum memory_type {
>   /* 0 is reserved to catch uninitialized type fields */
>   MEMORY_DEVICE_PRIVATE = 1,
> + MEMORY_DEVICE_COHERENT,
>   MEMORY_DEVICE_FS_DAX,
>   MEMORY_DEVICE_GENERIC,
>   MEMORY_DEVICE_PCI_P2PDMA,
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e1a84b1e6787..0c61bf40edef 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1106,6 +1106,7 @@ static inline bool page_is_devmap_managed(struct page *page)
>   return false;
>   switch (page->pgmap->type) {
>   case MEMORY_DEVICE_PRIVATE:
> + case MEMORY_DEVICE_COHERENT:
>   case MEMORY_DEVICE_FS_DAX:
>   return true;
>   default:
> @@ -1135,6 +1136,21 @@ static inline bool is_device_private_page(const struct page *page)
>   page->pgmap->type == MEMORY_DEVICE_PRIVATE;
>  }
>  
> +static inline bool is_device_coherent_page(const struct page *page)
> +{
> + return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
> + is_zone_device_page(page) &&
> + page->pgmap->type == MEMORY_DEVICE_COHERENT;
> +}
> +
> +static inline bool is_dev_private_or_coherent_page(const struct page *page)
> +{
> + return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
> + is_zone_device_page(page) &&
> + (page->pgmap->type == MEMORY_DEVICE_PRIVATE ||
> + page->pgmap->type == MEMORY_DEVICE_COHERENT);
> +}
> +
>  static inline bool is_pci_p2pdma_page(const struct page *page)
>  {
>   return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 09d342c7cbd0..0882b5b2a857 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5691,8 +5691,8 @@ static int mem_cgroup_move_account(struct page *page,
>   *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
>   * target for charge migration. if @target is not NULL, the entry is stored
>   * in target->ent.
> - *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is MEMORY_DEVICE_PRIVATE
> - * (so ZONE_DEVICE page and thus not on the lru).
> + *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is device memory and
> + *   thus not on the lru.
>   * For now we such page is charge like a regular page would be as for all
>   * intent and purposes it is just special memory taking the place of a
>   * regular page.
> @@ -5726,7 +5726,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
>   

Re: [PATCH v5 02/10] mm: add device coherent vma selection for memory migration

2022-01-31 Thread Alistair Popple
On Saturday, 29 January 2022 7:08:17 AM AEDT Alex Sierra wrote:

[...]

>  struct migrate_vma {
> diff --git a/mm/migrate.c b/mm/migrate.c
> index cd137aedcfe5..d3cc3589e1e8 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2264,7 +2264,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   if (is_writable_device_private_entry(entry))
>   mpfn |= MIGRATE_PFN_WRITE;
>   } else {
> - if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> + if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM) &&
> + !(migrate->flags & 
> MIGRATE_VMA_SELECT_DEVICE_COHERENT))
>   goto next;
>   pfn = pte_pfn(pte);
>   if (is_zero_pfn(pfn)) {

Sorry, but I still don't think this is quite right.

When specifying MIGRATE_VMA_SELECT_DEVICE_COHERENT we are looking for pages to
migrate from the device back to system memory. But as currently written I think
this can also select the zero pfn when MIGRATE_VMA_SELECT_DEVICE_COHERENT is
specified. As far as I know that can never point to device memory so migration
of a zero pfn should also be skipped in that case.

We should only migrate the zero pfn if MIGRATE_VMA_SELECT_SYSTEM is specified.
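
A minimal sketch of the check being asked for here (this matches how the respin
posted separately in this thread expresses it; variable names are those already
used in migrate_vma_collect_pmd()):

		if (is_zero_pfn(pfn) &&
		    (migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
			mpfn = MIGRATE_PFN_MIGRATE;
			migrate->cpages++;
			goto next;
		}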

> @@ -2273,6 +2274,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   goto next;
>   }
>   page = vm_normal_page(migrate->vma, addr, pte);
> + if (page && !is_zone_device_page(page) &&
> + !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> + goto next;
> + if (page && is_device_coherent_page(page) &&
> + (!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
> +  page->pgmap->owner != migrate->pgmap_owner))
> + goto next;
>   mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
>   mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
>   }
> 






Re: [PATCH v4 02/10] mm: add device coherent vma selection for memory migration

2022-01-28 Thread Alistair Popple
On Thursday, 27 January 2022 2:09:41 PM AEDT Alex Sierra wrote:

[...]

> diff --git a/mm/migrate.c b/mm/migrate.c
> index 277562cd4cf5..2b3375e165b1 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2340,8 +2340,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   if (is_writable_device_private_entry(entry))
>   mpfn |= MIGRATE_PFN_WRITE;
>   } else {
> - if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> - goto next;

This isn't correct as it allows zero pfn pages to be selected for migration
when they shouldn't be (ie. because MIGRATE_VMA_SELECT_SYSTEM isn't specified).

>   pfn = pte_pfn(pte);
>   if (is_zero_pfn(pfn)) {
>   mpfn = MIGRATE_PFN_MIGRATE;
> @@ -2349,6 +2347,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   goto next;
>   }
>   page = vm_normal_page(migrate->vma, addr, pte);
> + if (page && !is_zone_device_page(page) &&
> + !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> + goto next;
> + if (page && is_device_coherent_page(page) &&
> + (!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
> +  page->pgmap->owner != migrate->pgmap_owner))
> + goto next;
>   mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
>   mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
>   }
> 






Re: [PATCH v4 06/10] lib: test_hmm add ioctl to get zone device type

2022-01-28 Thread Alistair Popple
Reviewed-by: Alistair Popple 

On Thursday, 27 January 2022 2:09:45 PM AEDT Alex Sierra wrote:
> new ioctl cmd added to query zone device type. This will be
> used once the test_hmm adds zone device coherent type.
> 
> Signed-off-by: Alex Sierra 
> ---
>  lib/test_hmm.c  | 23 +--
>  lib/test_hmm_uapi.h |  8 
>  2 files changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index c259842f6d44..fb1fa7c6fa98 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -84,6 +84,7 @@ struct dmirror_chunk {
>  struct dmirror_device {
>   struct cdev cdevice;
>   struct hmm_devmem   *devmem;
> + unsigned intzone_device_type;
>  
>   unsigned intdevmem_capacity;
>   unsigned intdevmem_count;
> @@ -1025,6 +1026,15 @@ static int dmirror_snapshot(struct dmirror *dmirror,
>   return ret;
>  }
>  
> +static int dmirror_get_device_type(struct dmirror *dmirror,
> + struct hmm_dmirror_cmd *cmd)
> +{
> + mutex_lock(&dmirror->mutex);
> + cmd->zone_device_type = dmirror->mdevice->zone_device_type;
> + mutex_unlock(&dmirror->mutex);
> +
> + return 0;
> +}
>  static long dmirror_fops_unlocked_ioctl(struct file *filp,
>   unsigned int command,
>   unsigned long arg)
> @@ -1075,6 +1085,9 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
> ret = dmirror_snapshot(dmirror, &cmd);
>   break;
>  
> + case HMM_DMIRROR_GET_MEM_DEV_TYPE:
> + ret = dmirror_get_device_type(dmirror, &cmd);
> + break;
>   default:
>   return -EINVAL;
>   }
> @@ -1235,14 +1248,20 @@ static void dmirror_device_remove(struct dmirror_device *mdevice)
>  static int __init hmm_dmirror_init(void)
>  {
>   int ret;
> - int id;
> + int id = 0;
> + int ndevices = 0;
>  
> ret = alloc_chrdev_region(&dmirror_dev, 0, DMIRROR_NDEVICES,
> "HMM_DMIRROR");
>   if (ret)
>   goto err_unreg;
>  
> - for (id = 0; id < DMIRROR_NDEVICES; id++) {
> + memset(dmirror_devices, 0, DMIRROR_NDEVICES * sizeof(dmirror_devices[0]));
> + dmirror_devices[ndevices++].zone_device_type =
> + HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
> + dmirror_devices[ndevices++].zone_device_type =
> + HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
> + for (id = 0; id < ndevices; id++) {
>   ret = dmirror_device_init(dmirror_devices + id, id);
>   if (ret)
>   goto err_chrdev;
> diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
> index f14dea5dcd06..17f842f1aa02 100644
> --- a/lib/test_hmm_uapi.h
> +++ b/lib/test_hmm_uapi.h
> @@ -19,6 +19,7 @@
>   * @npages: (in) number of pages to read/write
>   * @cpages: (out) number of pages copied
>   * @faults: (out) number of device page faults seen
> + * @zone_device_type: (out) zone device memory type
>   */
>  struct hmm_dmirror_cmd {
>   __u64   addr;
> @@ -26,6 +27,7 @@ struct hmm_dmirror_cmd {
>   __u64   npages;
>   __u64   cpages;
>   __u64   faults;
> + __u64   zone_device_type;
>  };
>  
>  /* Expose the address space of the calling process through hmm device file */
> @@ -35,6 +37,7 @@ struct hmm_dmirror_cmd {
>  #define HMM_DMIRROR_SNAPSHOT _IOWR('H', 0x03, struct hmm_dmirror_cmd)
>  #define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x04, struct hmm_dmirror_cmd)
>  #define HMM_DMIRROR_CHECK_EXCLUSIVE  _IOWR('H', 0x05, struct hmm_dmirror_cmd)
> +#define HMM_DMIRROR_GET_MEM_DEV_TYPE _IOWR('H', 0x06, struct hmm_dmirror_cmd)
>  
>  /*
>   * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
> @@ -62,4 +65,9 @@ enum {
>   HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE = 0x30,
>  };
>  
> +enum {
> + /* 0 is reserved to catch uninitialized type fields */
> + HMM_DMIRROR_MEMORY_DEVICE_PRIVATE = 1,
> +};
> +
>  #endif /* _LIB_TEST_HMM_UAPI_H */
> 






Re: [PATCH v4 07/10] lib: test_hmm add module param for zone device type

2022-01-28 Thread Alistair Popple
Thanks for the updates, looks good now.

Reviewed-by: Alistair Popple 

On Thursday, 27 January 2022 2:09:46 PM AEDT Alex Sierra wrote:
> In order to configure device coherent in test_hmm, two module parameters
> should be passed, which correspond to the SP start address of each
> device (2) spm_addr_dev0 & spm_addr_dev1. If no parameters are passed,
> private device type is configured.
> 
> Signed-off-by: Alex Sierra 
> ---
>  lib/test_hmm.c  | 73 -
>  lib/test_hmm_uapi.h |  1 +
>  2 files changed, 53 insertions(+), 21 deletions(-)
> 
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index fb1fa7c6fa98..6f068f7c4ee3 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -34,6 +34,16 @@
>  #define DEVMEM_CHUNK_SIZE(256 * 1024 * 1024U)
>  #define DEVMEM_CHUNKS_RESERVE16
>  
> +static unsigned long spm_addr_dev0;
> +module_param(spm_addr_dev0, long, 0644);
> +MODULE_PARM_DESC(spm_addr_dev0,
> + "Specify start address for SPM (special purpose memory) used 
> for device 0. By setting this Coherent device type will be used. Make sure 
> spm_addr_dev1 is set too. Minimum SPM size should be DEVMEM_CHUNK_SIZE.");
> +
> +static unsigned long spm_addr_dev1;
> +module_param(spm_addr_dev1, long, 0644);
> +MODULE_PARM_DESC(spm_addr_dev1,
> + "Specify start address for SPM (special purpose memory) used 
> for device 1. By setting this Coherent device type will be used. Make sure 
> spm_addr_dev0 is set too. Minimum SPM size should be DEVMEM_CHUNK_SIZE.");
> +
>  static const struct dev_pagemap_ops dmirror_devmem_ops;
>  static const struct mmu_interval_notifier_ops dmirror_min_ops;
>  static dev_t dmirror_dev;
> @@ -452,28 +462,44 @@ static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
>   return ret;
>  }
>  
> -static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
> +static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
>  struct page **ppage)
>  {
>   struct dmirror_chunk *devmem;
> - struct resource *res;
> + struct resource *res = NULL;
>   unsigned long pfn;
>   unsigned long pfn_first;
>   unsigned long pfn_last;
>   void *ptr;
> + int ret = -ENOMEM;
>  
>   devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
>   if (!devmem)
> - return false;
> + return ret;
>  
> - res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
> -   "hmm_dmirror");
> - if (IS_ERR(res))
> + switch (mdevice->zone_device_type) {
> + case HMM_DMIRROR_MEMORY_DEVICE_PRIVATE:
> + res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
> +   "hmm_dmirror");
> + if (IS_ERR_OR_NULL(res))
> + goto err_devmem;
> + devmem->pagemap.range.start = res->start;
> + devmem->pagemap.range.end = res->end;
> + devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
> + break;
> + case HMM_DMIRROR_MEMORY_DEVICE_COHERENT:
> + devmem->pagemap.range.start = (MINOR(mdevice->cdevice.dev) - 2) ?
> + spm_addr_dev0 :
> + spm_addr_dev1;
> + devmem->pagemap.range.end = devmem->pagemap.range.start +
> + DEVMEM_CHUNK_SIZE - 1;
> + devmem->pagemap.type = MEMORY_DEVICE_COHERENT;
> + break;
> + default:
> + ret = -EINVAL;
>   goto err_devmem;
> + }
>  
> - devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
> - devmem->pagemap.range.start = res->start;
> - devmem->pagemap.range.end = res->end;
>   devmem->pagemap.nr_range = 1;
>   devmem->pagemap.ops = _devmem_ops;
>   devmem->pagemap.owner = mdevice;
> @@ -494,10 +520,14 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
>   mdevice->devmem_capacity = new_capacity;
>   mdevice->devmem_chunks = new_chunks;
>   }
> -
> ptr = memremap_pages(&devmem->pagemap, numa_node_id());
> - if (IS_ERR(ptr))
> + if (IS_ERR_OR_NULL(ptr)) {
> + if (ptr)
> + ret = PTR_ERR(ptr);
> + else
> + ret = -EFAULT;
>   goto err_release;
> + }
>  
>   devmem->mdevice = mdevice;
&g

Re: [PATCH v4 01/10] mm: add zone device coherent type memory support

2022-01-28 Thread Alistair Popple
On Thursday, 27 January 2022 2:09:40 PM AEDT Alex Sierra wrote:

[...]

> diff --git a/mm/migrate.c b/mm/migrate.c
> index 1852d787e6ab..277562cd4cf5 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -362,7 +362,7 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
>  * Device private pages have an extra refcount as they are
>  * ZONE_DEVICE pages.
>  */
> 
> -   expected_count += is_device_private_page(page);
> +   expected_count += is_dev_private_or_coherent_page(page);
> 
> if (mapping)
> 
> expected_count += thp_nr_pages(page) + page_has_private(page);
> 
> @@ -2503,7 +2503,7 @@ static bool migrate_vma_check_page(struct page *page)
> 
>  * FIXME proper solution is to rework migration_entry_wait() so
>  * it does not need to take a reference on page.
>  */
> 
> -   return is_device_private_page(page);
> +   return is_dev_private_or_coherent_page(page);

As Andrew points out this no longer applies due to changes here. I think you
can just drop this hunk though.

[...]

> diff --git a/mm/rmap.c b/mm/rmap.c
> index 6aebd1747251..32dae6839403 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1823,10 +1823,17 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
>* pteval maps a zone device page and is therefore
>* a swap pte.
>*/
> - if (pte_swp_soft_dirty(pteval))
> - swp_pte = pte_swp_mksoft_dirty(swp_pte);
> - if (pte_swp_uffd_wp(pteval))
> - swp_pte = pte_swp_mkuffd_wp(swp_pte);
> + if (is_device_coherent_page(page)) {
> + if (pte_soft_dirty(pteval))
> + swp_pte = pte_swp_mksoft_dirty(swp_pte);
> + if (pte_uffd_wp(pteval))
> + swp_pte = pte_swp_mkuffd_wp(swp_pte);
> + } else {
> + if (pte_swp_soft_dirty(pteval))
> + swp_pte = pte_swp_mksoft_dirty(swp_pte);
> + if (pte_swp_uffd_wp(pteval))
> + swp_pte = pte_swp_mkuffd_wp(swp_pte);
> + }

As I understand things ptes for device coherent pages don't need special
treatment, therefore rather than special casing here it should just fall
through to the same path as normal pages. For that I think all you need is
something like:

-if (is_zone_device_page(page)) {
+if (is_device_private_page(page)) {

Noting that device private pages are the only zone device pages that could
have been encountered here anyway.

>   set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
>   /*
>* No need to invalidate here it will synchronize on
> @@ -1837,7 +1844,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
>* Since only PAGE_SIZE pages can currently be
>* migrated, just set it to page. This will need to be
>* changed when hugepage migrations to device private
> -  * memory are supported.
> +  * or coherent memory are supported.
>*/
>   subpage = page;
>   } else if (PageHWPoison(page)) {
> @@ -1943,7 +1950,8 @@ void try_to_migrate(struct page *page, enum ttu_flags flags)
>   TTU_SYNC)))
>   return;
>  
> - if (is_zone_device_page(page) && !is_device_private_page(page))
> + if (is_zone_device_page(page) &&
> + !is_dev_private_or_coherent_page(page))
>   return;
>  
>   /*
> 






Re: [PATCH v4 04/10] drm/amdkfd: add SPM support for SVM

2022-01-28 Thread Alistair Popple
On Thursday, 27 January 2022 2:09:43 PM AEDT Alex Sierra wrote:

[...]

> @@ -984,3 +990,4 @@ int svm_migrate_init(struct amdgpu_device *adev)
>  
>   return 0;
>  }
> +
> 

git-am complained about this when I applied the series. Given you have to
rebase anyway it would be worth fixing this.





Re: [PATCH v4 03/10] mm/gup: fail get_user_pages for LONGTERM dev coherent type

2022-01-28 Thread Alistair Popple
On Thursday, 27 January 2022 2:09:42 PM AEDT Alex Sierra wrote:
> Avoid long term pinning for Coherent device type pages. This could
> interfere with their own device memory manager. For now, we are just
> returning error for PIN_LONGTERM Coherent device type pages. Eventually,
> these type of pages will get migrated to system memory, once the device
> migration pages support is added.
> 
> Signed-off-by: Alex Sierra 
> ---
>  mm/gup.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 886d6148d3d0..5291d7221826 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1720,6 +1720,12 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>* If we get a movable page, since we are going to be pinning
>* these entries, try to move them out if possible.
>*/
> + if (is_dev_private_or_coherent_page(head)) {
> + WARN_ON_ONCE(is_device_private_page(head));
> + ret = -EFAULT;
> + goto unpin_pages;
> + }
> +
>   if (!is_pinnable_page(head)) {
>   if (PageHuge(head)) {
> if (!isolate_huge_page(head, &movable_page_list))
> @@ -1750,6 +1756,7 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>   if (list_empty(_page_list) && !isolation_error_count)
>   return nr_pages;
>  
> +unpin_pages:
>   if (gup_flags & FOLL_PIN) {
>   unpin_user_pages(pages, nr_pages);
>   } else {
 
If there is a mix of ZONE_MOVABLE and device pages the return value (ret) will
be subsequently lost here:

	if (!list_empty(&movable_page_list)) {
		ret = migrate_pages(&movable_page_list, alloc_migration_target,
				    NULL, (unsigned long)&mtc, MIGRATE_SYNC,
				    MR_LONGTERM_PIN, NULL);
		if (ret && !list_empty(&movable_page_list))
			putback_movable_pages(&movable_page_list);
}

Which won't actually cause any problems, but it will lead to the GUP getting
retried unnecessarily. I do still intend to address this with a series to
migrate pages instead though, so I think this is ok for now as it's an unlikely
corner case anyway. Therefore feel free to add the below when you repost:

Reviewed-by: Alistair Popple 





Re: [PATCH v4 08/10] lib: add support for device coherent type in test_hmm

2022-01-28 Thread Alistair Popple
I haven't tested the change which checks that pages migrated back to sysmem,
but it looks ok so:

Reviewed-by: Alistair Popple 

On Thursday, 27 January 2022 2:09:47 PM AEDT Alex Sierra wrote:
> Device Coherent type uses device memory that is coherently accessible by
> the CPU. This could be shown as SP (special purpose) memory range
> at the BIOS-e820 memory enumeration. If no SP memory is supported in
> system, this could be faked by setting CONFIG_EFI_FAKE_MEMMAP.
> 
> Currently, test_hmm only supports two different SP ranges of at least
> 256MB size. This could be specified in the kernel parameter variable
> efi_fake_mem. Ex. Two SP ranges of 1GB starting at 0x1 &
> 0x14000 physical address. Ex.
> efi_fake_mem=1G@0x1:0x4,1G@0x14000:0x4
> 
> Private and coherent device mirror instances can be created in the same
> probed. This is done by passing the module parameters spm_addr_dev0 &
> spm_addr_dev1. In this case, it will create four instances of
> device_mirror. The first two correspond to private device type, the
> last two to coherent type. Then, they can be easily accessed from user
> space through /dev/hmm_mirror. Usually num_device 0 and 1
> are for private, and 2 and 3 for coherent types. If no module
> parameters are passed, two instances of private type device_mirror will
> be created only.
> 
> Signed-off-by: Alex Sierra 
> ---
> v4:
> Return number of coherent device pages successfully migrated to system.
> This is returned at cmd->cpages.
> ---
>  lib/test_hmm.c  | 260 +---
>  lib/test_hmm_uapi.h |  15 ++-
>  2 files changed, 205 insertions(+), 70 deletions(-)
> 
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 6f068f7c4ee3..850d5331e370 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -29,11 +29,22 @@
>  
>  #include "test_hmm_uapi.h"
>  
> -#define DMIRROR_NDEVICES 2
> +#define DMIRROR_NDEVICES 4
>  #define DMIRROR_RANGE_FAULT_TIMEOUT  1000
>  #define DEVMEM_CHUNK_SIZE(256 * 1024 * 1024U)
>  #define DEVMEM_CHUNKS_RESERVE16
>  
> +/*
> + * For device_private pages, dpage is just a dummy struct page
> + * representing a piece of device memory. dmirror_devmem_alloc_page
> + * allocates a real system memory page as backing storage to fake a
> + * real device. zone_device_data points to that backing page. But
> + * for device_coherent memory, the struct page represents real
> + * physical CPU-accessible memory that we can use directly.
> + */
> +#define BACKING_PAGE(page) (is_device_private_page((page)) ? \
> +(page)->zone_device_data : (page))
> +
>  static unsigned long spm_addr_dev0;
>  module_param(spm_addr_dev0, long, 0644);
>  MODULE_PARM_DESC(spm_addr_dev0,
> @@ -122,6 +133,21 @@ static int dmirror_bounce_init(struct dmirror_bounce *bounce,
>   return 0;
>  }
>  
> +static bool dmirror_is_private_zone(struct dmirror_device *mdevice)
> +{
> + return (mdevice->zone_device_type ==
> + HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ? true : false;
> +}
> +
> +static enum migrate_vma_direction
> + dmirror_select_device(struct dmirror *dmirror)
> +{
> + return (dmirror->mdevice->zone_device_type ==
> + HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ?
> + MIGRATE_VMA_SELECT_DEVICE_PRIVATE :
> + MIGRATE_VMA_SELECT_DEVICE_COHERENT;
> +}
> +
>  static void dmirror_bounce_fini(struct dmirror_bounce *bounce)
>  {
>   vfree(bounce->ptr);
> @@ -572,16 +598,19 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
>  static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
>  {
>   struct page *dpage = NULL;
> - struct page *rpage;
> + struct page *rpage = NULL;
>  
>   /*
> -  * This is a fake device so we alloc real system memory to store
> -  * our device memory.
> +  * For ZONE_DEVICE private type, this is a fake device so we alloc real
> +  * system memory to store our device memory.
> +  * For ZONE_DEVICE coherent type we use the actual dpage to store the data
> +  * and ignore rpage.
>*/
> - rpage = alloc_page(GFP_HIGHUSER);
> - if (!rpage)
> - return NULL;
> -
> + if (dmirror_is_private_zone(mdevice)) {
> + rpage = alloc_page(GFP_HIGHUSER);
> + if (!rpage)
> + return NULL;
> + }
> spin_lock(&mdevice->lock);
>  
>   if (mdevice->free_pages) {
> @@ -601,7 +630,8 @@ static struct page *dmirror_devmem_alloc_page(struct 
> dmirror_device *mdevice

Re: [PATCH v3 03/10] mm/gup: fail get_user_pages for LONGTERM dev coherent type

2022-01-20 Thread Alistair Popple
On Thursday, 20 January 2022 11:36:21 PM AEDT Joao Martins wrote:
> On 1/10/22 22:31, Alex Sierra wrote:
> > Avoid long term pinning for Coherent device type pages. This could
> > interfere with their own device memory manager. For now, we are just
> > returning error for PIN_LONGTERM Coherent device type pages. Eventually,
> > these type of pages will get migrated to system memory, once the device
> > migration pages support is added.
> > 
> > Signed-off-by: Alex Sierra 
> > ---
> >  mm/gup.c | 7 +++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 886d6148d3d0..9c8a075d862d 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1720,6 +1720,12 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
> >  * If we get a movable page, since we are going to be pinning
> >  * these entries, try to move them out if possible.
> >  */
> > +   if (is_device_page(head)) {
> > +   WARN_ON_ONCE(is_device_private_page(head));
> > +   ret = -EFAULT;
> > +   goto unpin_pages;
> > +   }
> > +
> 
> Wouldn't be more efficient for you failing earlier instead of after all the pages are pinned?

Rather than failing I think the plan is to migrate the device coherent pages
like we do for ZONE_MOVABLE, so leaving this here is a good place holder until
that is done. Currently we are missing some functionality required to do that
but I am hoping to post a series fixing that soon.

> Filesystem DAX suffers from a somewhat similar issue[0] -- albeit it's more related to
> blocking FOLL_LONGTERM in gup-fast while gup-slow can still do it. Coherent devmap appears
> to want to block it in all gup.
> 
> On another thread Jason was suggesting about having different pgmap::flags to capture
> these special cases[1] instead of selecting what different pgmap types can do in various
> different places.
> 
> [0] https://lore.kernel.org/linux-mm/6a18179e-65f7-367d-89a9-d5162f10f...@oracle.com/
> [1] https://lore.kernel.org/linux-mm/20211019160136.gh3686...@ziepe.ca/
> 






Re: [PATCH v3 00/10] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping

2022-01-20 Thread Alistair Popple
On Wednesday, 12 January 2022 10:06:03 PM AEDT Alistair Popple wrote:
> I have been looking at this in relation to the migration code and noticed we
> have the following in try_to_migrate():
> 
> if (is_zone_device_page(page) && !is_device_private_page(page))
> return;
> 
> Which if I'm understanding correctly means that migration of device coherent
> pages will always fail. Given that I do wonder how hmm-tests are passing, but
> I assume you must always be hitting this fast path in
> migrate_vma_collect_pmd():
> 
> /*
>  * Optimize for the common case where page is only mapped once
>  * in one process. If we can lock the page, then we can safely
>  * set up a special migration page table entry now.
>  */
> 
> Meaning that try_to_migrate() never gets called from migrate_vma_unmap(). So
> you will also need some changes to try_to_migrate() and possibly
> try_to_migrate_one() to make this reliable.

I have been running the hmm tests with the changes below. I'm pretty sure these
are correct because the only zone device pages try_to_migrate_one() should be
called on are device coherent/private, and coherent pages can be treated just
the same as normal pages for migration. However it would be worth checking I
haven't missed anything.

 - Alistair

---

diff --git a/mm/rmap.c b/mm/rmap.c
index 163ac4e6bcee..15f56c27daab 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1806,7 +1806,7 @@ static bool try_to_migrate_one(struct page *page, struct 
vm_area_struct *vma,
/* Update high watermark before we lower rss */
update_hiwater_rss(mm);
 
-   if (is_zone_device_page(page)) {
+   if (is_device_private_page(page)) {
unsigned long pfn = page_to_pfn(page);
swp_entry_t entry;
pte_t swp_pte;
@@ -1947,7 +1947,7 @@ void try_to_migrate(struct page *page, enum ttu_flags 
flags)
TTU_SYNC)))
return;
 
-   if (is_zone_device_page(page) && !is_device_private_page(page))
+   if (is_zone_device_page(page) && !is_device_page(page))
return;
 
/*





Re: [PATCH v3 08/10] lib: add support for device coherent type in test_hmm

2022-01-20 Thread Alistair Popple
On Tuesday, 11 January 2022 9:31:59 AM AEDT Alex Sierra wrote:
> Device Coherent type uses device memory that is coherently accessible by
> the CPU. This could be shown as SP (special purpose) memory range
> at the BIOS-e820 memory enumeration. If no SP memory is supported in
> system, this could be faked by setting CONFIG_EFI_FAKE_MEMMAP.
> 
> Currently, test_hmm only supports two different SP ranges of at least
> 256MB size. This could be specified in the kernel parameter variable
> efi_fake_mem. Ex. Two SP ranges of 1GB starting at 0x100000000 &
> 0x140000000 physical address. Ex.
> efi_fake_mem=1G@0x100000000:0x40000,1G@0x140000000:0x40000
> 
> Private and coherent device mirror instances can be created in the same
> driver probe. This is done by passing the module parameters spm_addr_dev0 &
> spm_addr_dev1. In this case, it will create four instances of
> device_mirror. The first two correspond to private device type, the
> last two to coherent type. Then, they can be easily accessed from user
> space through /dev/hmm_mirror. Usually num_device 0 and 1
> are for private, and 2 and 3 for coherent types. If no module
> parameters are passed, two instances of private type device_mirror will
> be created only.
> 
> Signed-off-by: Alex Sierra 
> ---
>  lib/test_hmm.c  | 247 
>  lib/test_hmm_uapi.h |  15 ++-
>  2 files changed, 193 insertions(+), 69 deletions(-)
> 
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 9edeff52302e..7c641c5a9cfa 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -29,11 +29,22 @@
>  
>  #include "test_hmm_uapi.h"
>  
> -#define DMIRROR_NDEVICES 2
> +#define DMIRROR_NDEVICES 4
>  #define DMIRROR_RANGE_FAULT_TIMEOUT  1000
>  #define DEVMEM_CHUNK_SIZE(256 * 1024 * 1024U)
>  #define DEVMEM_CHUNKS_RESERVE16
>  
> +/*
> + * For device_private pages, dpage is just a dummy struct page
> + * representing a piece of device memory. dmirror_devmem_alloc_page
> + * allocates a real system memory page as backing storage to fake a
> + * real device. zone_device_data points to that backing page. But
> + * for device_coherent memory, the struct page represents real
> + * physical CPU-accessible memory that we can use directly.
> + */
> +#define BACKING_PAGE(page) (is_device_private_page((page)) ? \
> +(page)->zone_device_data : (page))
> +
>  static unsigned long spm_addr_dev0;
>  module_param(spm_addr_dev0, long, 0644);
>  MODULE_PARM_DESC(spm_addr_dev0,
> @@ -122,6 +133,21 @@ static int dmirror_bounce_init(struct dmirror_bounce 
> *bounce,
>   return 0;
>  }
>  
> +static bool dmirror_is_private_zone(struct dmirror_device *mdevice)
> +{
> + return (mdevice->zone_device_type ==
> + HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ? true : false;
> +}
> +
> +static enum migrate_vma_direction
> + dmirror_select_device(struct dmirror *dmirror)
> +{
> + return (dmirror->mdevice->zone_device_type ==
> + HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ?
> + MIGRATE_VMA_SELECT_DEVICE_PRIVATE :
> + MIGRATE_VMA_SELECT_DEVICE_COHERENT;
> +}
> +
>  static void dmirror_bounce_fini(struct dmirror_bounce *bounce)
>  {
>   vfree(bounce->ptr);
> @@ -572,16 +598,19 @@ static int dmirror_allocate_chunk(struct dmirror_device 
> *mdevice,
>  static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
>  {
>   struct page *dpage = NULL;
> - struct page *rpage;
> + struct page *rpage = NULL;
>  
>   /*
> -  * This is a fake device so we alloc real system memory to store
> -  * our device memory.
> +  * For ZONE_DEVICE private type, this is a fake device so we alloc real
> +  * system memory to store our device memory.
> +  * For ZONE_DEVICE coherent type we use the actual dpage to store the data
> +  * and ignore rpage.
>*/
> - rpage = alloc_page(GFP_HIGHUSER);
> - if (!rpage)
> - return NULL;
> -
> + if (dmirror_is_private_zone(mdevice)) {
> + rpage = alloc_page(GFP_HIGHUSER);
> + if (!rpage)
> + return NULL;
> + }
>   spin_lock(&mdevice->lock);
>  
>   if (mdevice->free_pages) {
> @@ -601,7 +630,8 @@ static struct page *dmirror_devmem_alloc_page(struct 
> dmirror_device *mdevice)
>   return dpage;
>  
>  error:
> - __free_page(rpage);
> + if (rpage)
> + __free_page(rpage);
>   return NULL;
>  }
>  
> @@ -627,12 +657,15 @@ static void dmirror_migrate_alloc_and_copy(struct 
> migrate_vma *args,
>* unallocated pte_none() or read-only zero page.
>*/
>   spage = migrate_pfn_to_page(*src);
> + WARN(spage && is_zone_device_page(spage),
> +  "page already in device spage pfn: 0x%lx\n",
> +  page_to_pfn(spage));

This should also lead to test failure because we are only supposed to be
selecting system 

Re: [PATCH v3 10/10] tools: update test_hmm script to support SP config

2022-01-20 Thread Alistair Popple
Looks good,

Reviewed-by: Alistair Popple 

On Tuesday, 11 January 2022 9:32:01 AM AEDT Alex Sierra wrote:
> Add two more parameters to set spm_addr_dev0 & spm_addr_dev1
> addresses. These two parameters configure the start SP
> addresses for each device in test_hmm driver.
> Consequently, this configures zone device type as coherent.
> 
> Signed-off-by: Alex Sierra 
> ---
> v2:
> Add more mknods for device coherent type. These are represented under
> /dev/hmm_mirror2 and /dev/hmm_mirror3, only in case they have been created
> when probing the hmm-test driver.
> ---
>  tools/testing/selftests/vm/test_hmm.sh | 24 +---
>  1 file changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/testing/selftests/vm/test_hmm.sh 
> b/tools/testing/selftests/vm/test_hmm.sh
> index 0647b525a625..539c9371e592 100755
> --- a/tools/testing/selftests/vm/test_hmm.sh
> +++ b/tools/testing/selftests/vm/test_hmm.sh
> @@ -40,11 +40,26 @@ check_test_requirements()
>  
>  load_driver()
>  {
> - modprobe $DRIVER > /dev/null 2>&1
> + if [ $# -eq 0 ]; then
> + modprobe $DRIVER > /dev/null 2>&1
> + else
> + if [ $# -eq 2 ]; then
> + modprobe $DRIVER spm_addr_dev0=$1 spm_addr_dev1=$2
> + > /dev/null 2>&1
> + else
> + echo "Missing module parameters. Make sure pass"\
> + "spm_addr_dev0 and spm_addr_dev1"
> + usage
> + fi
> + fi
>   if [ $? == 0 ]; then
>   major=$(awk "\$2==\"HMM_DMIRROR\" {print \$1}" /proc/devices)
>   mknod /dev/hmm_dmirror0 c $major 0
>   mknod /dev/hmm_dmirror1 c $major 1
> + if [ $# -eq 2 ]; then
> + mknod /dev/hmm_dmirror2 c $major 2
> + mknod /dev/hmm_dmirror3 c $major 3
> + fi
>   fi
>  }
>  
> @@ -58,7 +73,7 @@ run_smoke()
>  {
>   echo "Running smoke test. Note, this test provides basic coverage."
>  
> - load_driver
> + load_driver $1 $2
>   $(dirname "${BASH_SOURCE[0]}")/hmm-tests
>   unload_driver
>  }
> @@ -75,6 +90,9 @@ usage()
>   echo "# Smoke testing"
>   echo "./${TEST_NAME}.sh smoke"
>   echo
> + echo "# Smoke testing with SPM enabled"
> + echo "./${TEST_NAME}.sh smoke <spm_addr_dev0> <spm_addr_dev1>"
> + echo
>   exit 0
>  }
>  
> @@ -84,7 +102,7 @@ function run_test()
>   usage
>   else
>   if [ "$1" = "smoke" ]; then
> - run_smoke
> + run_smoke $2 $3
>   else
>   usage
>   fi
> 






Re: [PATCH v3 07/10] lib: test_hmm add module param for zone device type

2022-01-20 Thread Alistair Popple
Thanks for splitting the coherent devices into separate device nodes. Couple of
comments below.

On Tuesday, 11 January 2022 9:31:58 AM AEDT Alex Sierra wrote:
> In order to configure device coherent in test_hmm, two module parameters
> should be passed, which correspond to the SP start address of each
> device (2) spm_addr_dev0 & spm_addr_dev1. If no parameters are passed,
> private device type is configured.
> 
> Signed-off-by: Alex Sierra 
> ---
>  lib/test_hmm.c  | 74 +++--
>  lib/test_hmm_uapi.h |  1 +
>  2 files changed, 53 insertions(+), 22 deletions(-)
> 
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 97e48164d56a..9edeff52302e 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -34,6 +34,16 @@
>  #define DEVMEM_CHUNK_SIZE(256 * 1024 * 1024U)
>  #define DEVMEM_CHUNKS_RESERVE16
>  
> +static unsigned long spm_addr_dev0;
> +module_param(spm_addr_dev0, long, 0644);
> +MODULE_PARM_DESC(spm_addr_dev0,
> + "Specify start address for SPM (special purpose memory) used 
> for device 0. By setting this Coherent device type will be used. Make sure 
> spm_addr_dev1 is set too");

It would be useful if you could mention the required size for this region
(ie. DEVMEM_CHUNK_SIZE).
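
For example, something along these lines (only a sketch of the suggested
wording, not the actual patch; DEVMEM_CHUNK_SIZE is 256MB in this driver):

MODULE_PARM_DESC(spm_addr_dev0,
	"Specify start address for SPM (special purpose memory) used for device 0. A DEVMEM_CHUNK_SIZE (256MB) region must be available there. By setting this Coherent device type will be used. Make sure spm_addr_dev1 is set too");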

> +
> +static unsigned long spm_addr_dev1;
> +module_param(spm_addr_dev1, long, 0644);
> +MODULE_PARM_DESC(spm_addr_dev1,
> + "Specify start address for SPM (special purpose memory) used 
> for device 1. By setting this Coherent device type will be used. Make sure 
> spm_addr_dev0 is set too");
> +
>  static const struct dev_pagemap_ops dmirror_devmem_ops;
>  static const struct mmu_interval_notifier_ops dmirror_min_ops;
>  static dev_t dmirror_dev;
> @@ -452,29 +462,44 @@ static int dmirror_write(struct dmirror *dmirror, 
> struct hmm_dmirror_cmd *cmd)
>   return ret;
>  }
>  
> -static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
> +static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
>  struct page **ppage)
>  {
>   struct dmirror_chunk *devmem;
> - struct resource *res;
> + struct resource *res = NULL;
>   unsigned long pfn;
>   unsigned long pfn_first;
>   unsigned long pfn_last;
>   void *ptr;
> + int ret = -ENOMEM;
>  
>   devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
>   if (!devmem)
> - return false;
> + return ret;
>  
> - res = request_free_mem_region(_resource, DEVMEM_CHUNK_SIZE,
> -   "hmm_dmirror");
> - if (IS_ERR(res))
> + switch (mdevice->zone_device_type) {
> + case HMM_DMIRROR_MEMORY_DEVICE_PRIVATE:
> + res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
> +   "hmm_dmirror");
> + if (IS_ERR_OR_NULL(res))
> + goto err_devmem;
> + devmem->pagemap.range.start = res->start;
> + devmem->pagemap.range.end = res->end;
> + devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
> + break;
> + case HMM_DMIRROR_MEMORY_DEVICE_COHERENT:
> + devmem->pagemap.range.start = (MINOR(mdevice->cdevice.dev) - 2) 
> ?
> + spm_addr_dev0 :
> + spm_addr_dev1;
> + devmem->pagemap.range.end = devmem->pagemap.range.start +
> + DEVMEM_CHUNK_SIZE - 1;
> + devmem->pagemap.type = MEMORY_DEVICE_COHERENT;
> + break;
> + default:
> + ret = -EINVAL;
>   goto err_devmem;
> + }
>  
> - mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;

What initialises mdevice->zone_device_type now? It looks like it needs to get
initialised in hmm_dmirror_init(), which would be easier to do in the previous
patch rather than adding it here in the first place.
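
For example, hmm_dmirror_init() could do something like this (a sketch only,
assuming the existing dmirror_devices[] array and the new module parameters):

	/* Sketch: pick the zone device type once, before any chunk allocation. */
	dmirror_devices[0].zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
	dmirror_devices[1].zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
	if (spm_addr_dev0 && spm_addr_dev1) {
		dmirror_devices[2].zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_COHERENT;
		dmirror_devices[3].zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_COHERENT;
	}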

> - devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
> - devmem->pagemap.range.start = res->start;
> - devmem->pagemap.range.end = res->end;
>   devmem->pagemap.nr_range = 1;
>   devmem->pagemap.ops = &dmirror_devmem_ops;
>   devmem->pagemap.owner = mdevice;
> @@ -495,10 +520,14 @@ static bool dmirror_allocate_chunk(struct 
> dmirror_device *mdevice,
>   mdevice->devmem_capacity = new_capacity;
>   mdevice->devmem_chunks = new_chunks;
>   }
> -
>   ptr = memremap_pages(&devmem->pagemap, numa_node_id());
> - if (IS_ERR(ptr))
> + if (IS_ERR_OR_NULL(ptr)) {
> + if (ptr)
> + ret = PTR_ERR(ptr);
> + else
> + ret = -EFAULT;
>   goto err_release;
> + }
>  
>   devmem->mdevice = mdevice;
>   pfn_first = devmem->pagemap.range.start >> PAGE_SHIFT;
> @@ -527,15 +556,17 @@ static bool dmirror_allocate_chunk(struct 
> dmirror_device *mdevice,
>

Re: [PATCH v3 06/10] lib: test_hmm add ioctl to get zone device type

2022-01-20 Thread Alistair Popple
On Tuesday, 11 January 2022 9:31:57 AM AEDT Alex Sierra wrote:

[...]

> +enum {
> + /* 0 is reserved to catch uninitialized type fields */

This seems unnecessary and can be dropped to start at zero.
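
i.e. the comment and the explicit "= 1" could simply go:

enum {
	HMM_DMIRROR_MEMORY_DEVICE_PRIVATE,
};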

Reviewed-by: Alistair Popple 

> + HMM_DMIRROR_MEMORY_DEVICE_PRIVATE = 1,
> +};
> +
>  #endif /* _LIB_TEST_HMM_UAPI_H */
> 






Re: [PATCH v3 09/10] tools: update hmm-test to support device coherent type

2022-01-20 Thread Alistair Popple
On Tuesday, 11 January 2022 9:32:00 AM AEDT Alex Sierra wrote:
> Test cases such as migrate_fault and migrate_multiple were modified to
> explicitly migrate from device to sys memory without the need for page
> faults, when using device coherent type.
> 
> Snapshot test case updated to read memory device type first and based
> on that, get the proper returned results. migrate_ping_pong test case

Where is the migrate_ping_pong test? Did you perhaps forget to add it? :-)

> added to test explicit migration from device to sys memory for both
> private and coherent zone types.
> 
> Helpers to migrate from device to sys memory and vice versa
> were also added.
> 
> Signed-off-by: Alex Sierra 
> ---
> v2:
> Set FIXTURE_VARIANT to add multiple device types to the FIXTURE. This
> will run all the tests for each device type (private and coherent) in
> case both existed during hmm-test driver probed.
> ---
>  tools/testing/selftests/vm/hmm-tests.c | 122 -
>  1 file changed, 101 insertions(+), 21 deletions(-)
> 
> diff --git a/tools/testing/selftests/vm/hmm-tests.c 
> b/tools/testing/selftests/vm/hmm-tests.c
> index 864f126ffd78..8eb81dfba4b3 100644
> --- a/tools/testing/selftests/vm/hmm-tests.c
> +++ b/tools/testing/selftests/vm/hmm-tests.c
> @@ -44,6 +44,14 @@ struct hmm_buffer {
>   int fd;
>   uint64_tcpages;
>   uint64_tfaults;
> + int zone_device_type;
> +};
> +
> +enum {
> + HMM_PRIVATE_DEVICE_ONE,
> + HMM_PRIVATE_DEVICE_TWO,
> + HMM_COHERENCE_DEVICE_ONE,
> + HMM_COHERENCE_DEVICE_TWO,
>  };
>  
>  #define TWOMEG   (1 << 21)
> @@ -60,6 +68,21 @@ FIXTURE(hmm)
>   unsigned intpage_shift;
>  };
>  
> +FIXTURE_VARIANT(hmm)
> +{
> + int device_number;
> +};
> +
> +FIXTURE_VARIANT_ADD(hmm, hmm_device_private)
> +{
> + .device_number = HMM_PRIVATE_DEVICE_ONE,
> +};
> +
> +FIXTURE_VARIANT_ADD(hmm, hmm_device_coherent)
> +{
> + .device_number = HMM_COHERENCE_DEVICE_ONE,
> +};
> +
>  FIXTURE(hmm2)
>  {
>   int fd0;
> @@ -68,6 +91,24 @@ FIXTURE(hmm2)
>   unsigned intpage_shift;
>  };
>  
> +FIXTURE_VARIANT(hmm2)
> +{
> + int device_number0;
> + int device_number1;
> +};
> +
> +FIXTURE_VARIANT_ADD(hmm2, hmm2_device_private)
> +{
> + .device_number0 = HMM_PRIVATE_DEVICE_ONE,
> + .device_number1 = HMM_PRIVATE_DEVICE_TWO,
> +};
> +
> +FIXTURE_VARIANT_ADD(hmm2, hmm2_device_coherent)
> +{
> + .device_number0 = HMM_COHERENCE_DEVICE_ONE,
> + .device_number1 = HMM_COHERENCE_DEVICE_TWO,
> +};
> +
>  static int hmm_open(int unit)
>  {
>   char pathname[HMM_PATH_MAX];
> @@ -81,12 +122,19 @@ static int hmm_open(int unit)
>   return fd;
>  }
>  
> +static bool hmm_is_coherent_type(int dev_num)
> +{
> + return (dev_num >= HMM_COHERENCE_DEVICE_ONE);
> +}
> +
>  FIXTURE_SETUP(hmm)
>  {
>   self->page_size = sysconf(_SC_PAGE_SIZE);
>   self->page_shift = ffs(self->page_size) - 1;
>  
> - self->fd = hmm_open(0);
> + self->fd = hmm_open(variant->device_number);
> + if (self->fd < 0 && hmm_is_coherent_type(variant->device_number))
> + SKIP(exit(0), "DEVICE_COHERENT not available");
>   ASSERT_GE(self->fd, 0);
>  }
>  
> @@ -95,9 +143,11 @@ FIXTURE_SETUP(hmm2)
>   self->page_size = sysconf(_SC_PAGE_SIZE);
>   self->page_shift = ffs(self->page_size) - 1;
>  
> - self->fd0 = hmm_open(0);
> + self->fd0 = hmm_open(variant->device_number0);
> + if (self->fd0 < 0 && hmm_is_coherent_type(variant->device_number0))
> + SKIP(exit(0), "DEVICE_COHERENT not available");
>   ASSERT_GE(self->fd0, 0);
> - self->fd1 = hmm_open(1);
> + self->fd1 = hmm_open(variant->device_number1);
>   ASSERT_GE(self->fd1, 0);
>  }
>  
> @@ -144,6 +194,7 @@ static int hmm_dmirror_cmd(int fd,
>   }
>   buffer->cpages = cmd.cpages;
>   buffer->faults = cmd.faults;
> + buffer->zone_device_type = cmd.zone_device_type;
>  
>   return 0;
>  }
> @@ -211,6 +262,20 @@ static void hmm_nanosleep(unsigned int n)
>   nanosleep(&t, NULL);
>  }
>  
> +static int hmm_migrate_sys_to_dev(int fd,
> +struct hmm_buffer *buffer,
> +unsigned long npages)
> +{
> + return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_DEV, buffer, npages);
> +}
> +
> +static int hmm_migrate_dev_to_sys(int fd,
> +struct hmm_buffer *buffer,
> +unsigned long npages)
> +{
> + return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_SYS, buffer, npages);
> +}
> +
>  /*
>   * Simple NULL test of device open/close.
>   */
> @@ -875,7 +940,7 @@ TEST_F(hmm, migrate)
>   ptr[i] = i;
>  
>   /* Migrate memory to device. */
> - ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
> + ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
>   ASSERT_EQ(ret, 0);
>   

Re: [PATCH v3 01/10] mm: add zone device coherent type memory support

2022-01-19 Thread Alistair Popple
On Tuesday, 11 January 2022 9:31:52 AM AEDT Alex Sierra wrote:
> Device memory that is cache coherent from device and CPU point of view.
> This is used on platforms that have an advanced system bus (like CAPI
> or CXL). Any page of a process can be migrated to such memory. However,
> no one should be allowed to pin such memory so that it can always be
> evicted.
> 
> Signed-off-by: Alex Sierra 
> ---
>  include/linux/memremap.h |  8 
>  include/linux/mm.h   | 16 
>  mm/memcontrol.c  |  6 +++---
>  mm/memory-failure.c  |  8 ++--
>  mm/memremap.c|  5 -
>  mm/migrate.c | 21 +
>  6 files changed, 50 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index c0e9d35889e8..ff4d398edf35 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -39,6 +39,13 @@ struct vmem_altmap {
>   * A more complete discussion of unaddressable memory may be found in
>   * include/linux/hmm.h and Documentation/vm/hmm.rst.
>   *
> + * MEMORY_DEVICE_COHERENT:
> + * Device memory that is cache coherent from device and CPU point of view. 
> This
> + * is used on platforms that have an advanced system bus (like CAPI or CXL). 
> A
> + * driver can hotplug the device memory using ZONE_DEVICE and with that 
> memory
> + * type. Any page of a process can be migrated to such memory. However no one
> + * should be allowed to pin such memory so that it can always be evicted.
> + *
>   * MEMORY_DEVICE_FS_DAX:
>   * Host memory that has similar access semantics as System RAM i.e. DMA
>   * coherent and supports page pinning. In support of coordinating page
> @@ -59,6 +66,7 @@ struct vmem_altmap {
>  enum memory_type {
>   /* 0 is reserved to catch uninitialized type fields */
>   MEMORY_DEVICE_PRIVATE = 1,
> + MEMORY_DEVICE_COHERENT,
>   MEMORY_DEVICE_FS_DAX,
>   MEMORY_DEVICE_GENERIC,
>   MEMORY_DEVICE_PCI_P2PDMA,
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 73a52aba448f..fcf96c0fc918 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1162,6 +1162,7 @@ static inline bool page_is_devmap_managed(struct page 
> *page)
>   return false;
>   switch (page->pgmap->type) {
>   case MEMORY_DEVICE_PRIVATE:
> + case MEMORY_DEVICE_COHERENT:
>   case MEMORY_DEVICE_FS_DAX:
>   return true;
>   default:
> @@ -1191,6 +1192,21 @@ static inline bool is_device_private_page(const struct 
> page *page)
>   page->pgmap->type == MEMORY_DEVICE_PRIVATE;
>  }
>  
> +static inline bool is_device_coherent_page(const struct page *page)
> +{
> + return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
> + is_zone_device_page(page) &&
> + page->pgmap->type == MEMORY_DEVICE_COHERENT;
> +}
> +
> +static inline bool is_device_page(const struct page *page)

I wish we could think of a better name for this - it's too similar to
is_zone_device_page() so I can never remember if it includes FS_DAX pages or
not. Unfortunately I don't have any better suggestions though.

> +{
> + return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
> + is_zone_device_page(page) &&
> + (page->pgmap->type == MEMORY_DEVICE_PRIVATE ||
> + page->pgmap->type == MEMORY_DEVICE_COHERENT);
> +}
> +
>  static inline bool is_pci_p2pdma_page(const struct page *page)
>  {
>   return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6da5020a8656..d0bab0747c73 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5695,8 +5695,8 @@ static int mem_cgroup_move_account(struct page *page,
>   *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
>   * target for charge migration. if @target is not NULL, the entry is 
> stored
>   * in target->ent.
> - *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is 
> MEMORY_DEVICE_PRIVATE
> - * (so ZONE_DEVICE page and thus not on the lru).
> + *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is device memory and
> + *   thus not on the lru.
>   * For now we such page is charge like a regular page would be as for all
>   * intent and purposes it is just special memory taking the place of a
>   * regular page.
> @@ -5730,7 +5730,7 @@ static enum mc_target_type get_mctgt_type(struct 
> vm_area_struct *vma,
>*/
>   if (page_memcg(page) == mc.from) {
>   ret = MC_TARGET_PAGE;
> - if (is_device_private_page(page))
> + if (is_device_page(page))
>   ret = MC_TARGET_DEVICE;
>   if (target)
>   target->page = page;
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3e6449f2102a..4cf212e5f432 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1554,12 +1554,16 @@ static int 

Re: [PATCH v3 00/10] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping

2022-01-12 Thread Alistair Popple
I have been looking at this in relation to the migration code and noticed we
have the following in try_to_migrate():

if (is_zone_device_page(page) && !is_device_private_page(page))
return;

Which if I'm understanding correctly means that migration of device coherent
pages will always fail. Given that I do wonder how hmm-tests are passing, but
I assume you must always be hitting this fast path in
migrate_vma_collect_pmd():

/*
 * Optimize for the common case where page is only mapped once
 * in one process. If we can lock the page, then we can safely
 * set up a special migration page table entry now.
 */

Meaning that try_to_migrate() never gets called from migrate_vma_unmap(). So
you will also need some changes to try_to_migrate() and possibly
try_to_migrate_one() to make this reliable.

 - Alistair

On Tuesday, 11 January 2022 9:31:51 AM AEDT Alex Sierra wrote:
> This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
> owned by a device that can be mapped into CPU page tables like
> MEMORY_DEVICE_GENERIC and can also be migrated like
> MEMORY_DEVICE_PRIVATE.
> 
> Christoph, the suggestion to incorporate Ralph Campbell’s refcount
> cleanup patch into our hardware page migration patchset originally came
> from you, but it proved impractical to do things in that order because
> the refcount cleanup introduced a bug with wide ranging structural
> implications. Instead, we amended Ralph’s patch so that it could be
> applied after merging the migration work. As we saw from the recent
> discussion, merging the refcount work is going to take some time and
> cooperation between multiple development groups, while the migration
> work is ready now and is needed now. So we propose to merge this
> patchset first and continue to work with Ralph and others to merge the
> refcount cleanup separately, when it is ready.
> 
> This patch series is mostly self-contained except for a few places where
> it needs to update other subsystems to handle the new memory type.
> System stability and performance are not affected according to our
> ongoing testing, including xfstests.
> 
> How it works: The system BIOS advertises the GPU device memory
> (aka VRAM) as SPM (special purpose memory) in the UEFI system address
> map.
> 
> The amdgpu driver registers the memory with devmap as
> MEMORY_DEVICE_COHERENT using devm_memremap_pages. The initial user for
> this hardware page migration capability is the Frontier supercomputer
> project. This functionality is not AMD-specific. We expect other GPU
> vendors to find this functionality useful, and possibly other hardware
> types in the future.
> 
> Our test nodes in the lab are similar to the Frontier configuration,
> with .5 TB of system memory plus 256 GB of device memory split across
> 4 GPUs, all in a single coherent address space. Page migration is
> expected to improve application efficiency significantly. We will
> report empirical results as they become available.
> 
> We extended hmm_test to cover migration of MEMORY_DEVICE_COHERENT. This
> patch set builds on HMM and our SVM memory manager already merged in
> 5.15.
> 
> v2:
> - test_hmm is now able to create private and coherent device mirror
> instances in the same driver probe. This adds more usability to the hmm
> test by not having to remove the kernel module for each device type
> test (private/coherent type). This is done by passing the module
> parameters spm_addr_dev0 & spm_addr_dev1. In this case, it will create
> four instances of device_mirror. The first two correspond to private
> device type, the last two to coherent type. Then, they can be easily
> accessed from user space through /dev/hmm_mirror. Usually
> num_device 0 and 1 are for private, and 2 and 3 for coherent types.
> 
> - Coherent device type pages at gup are now migrated back to system
> memory if they have been long term pinned (FOLL_LONGTERM). The reason
> is these pages could eventually interfere with their own device memory
> manager. A new hmm_gup_test has been added to the hmm-test to test this
> functionality. It makes use of the gup_test module to long term pin
> user pages that have been migrate to device memory first.
> 
> - Other patch corrections made by Felix, Alistair and Christoph.
> 
> v3:
> - Based on last v2 feedback we got from Alistair, we've decided to
> remove migration logic for FOLL_LONGTERM coherent device type pages at
> gup for now. Ideally, this should be done through the kernel mm,
> instead of calling the device driver to do it. Currently, there's no
> support for migrating device pages based on pfn, mainly because
> migrate_pages() relies on pages being LRU pages. Alistair mentioned, he
> has started to work on adding this migrate device pages logic. For now,
> we fail on get_user_pages call with FOLL_LONGTERM for DEVICE_COHERENT
> pages.
> 
> - Also, hmm_gup_test has been removed from hmm-test. We plan 

Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system

2021-12-09 Thread Alistair Popple
On Friday, 10 December 2021 3:54:31 AM AEDT Sierra Guiza, Alejandro (Alex) 
wrote:
> 
> On 12/9/2021 10:29 AM, Felix Kuehling wrote:
> > Am 2021-12-09 um 5:53 a.m. schrieb Alistair Popple:
> >> On Thursday, 9 December 2021 5:55:26 AM AEDT Sierra Guiza, Alejandro 
> >> (Alex) wrote:
> >>> On 12/8/2021 11:30 AM, Felix Kuehling wrote:
> >>>> Am 2021-12-08 um 11:58 a.m. schrieb Felix Kuehling:
> >>>>> Am 2021-12-08 um 6:31 a.m. schrieb Alistair Popple:
> >>>>>> On Tuesday, 7 December 2021 5:52:43 AM AEDT Alex Sierra wrote:
> >>>>>>> Avoid long term pinning for Coherent device type pages. This could
> >>>>>>> interfere with their own device memory manager.
> >>>>>>> If caller tries to get user device coherent pages with PIN_LONGTERM 
> >>>>>>> flag
> >>>>>>> set, those pages will be migrated back to system memory.
> >>>>>>>
> >>>>>>> Signed-off-by: Alex Sierra
> >>>>>>> ---
> >>>>>>>mm/gup.c | 32 ++--
> >>>>>>>1 file changed, 30 insertions(+), 2 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/mm/gup.c b/mm/gup.c
> >>>>>>> index 886d6148d3d0..1572eacf07f4 100644
> >>>>>>> --- a/mm/gup.c
> >>>>>>> +++ b/mm/gup.c
> >>>>>>> @@ -1689,17 +1689,37 @@ struct page *get_dump_page(unsigned long addr)
> >>>>>>>#endif /* CONFIG_ELF_CORE */
> >>>>>>>
> >>>>>>>#ifdef CONFIG_MIGRATION
> >>>>>>> +static int migrate_device_page(unsigned long address,
> >>>>>>> + struct page *page)
> >>>>>>> +{
> >>>>>>> + struct vm_area_struct *vma = find_vma(current->mm, address);
> >>>>>>> + struct vm_fault vmf = {
> >>>>>>> + .vma = vma,
> >>>>>>> + .address = address & PAGE_MASK,
> >>>>>>> + .flags = FAULT_FLAG_USER,
> >>>>>>> + .pgoff = linear_page_index(vma, address),
> >>>>>>> + .gfp_mask = GFP_KERNEL,
> >>>>>>> + .page = page,
> >>>>>>> + };
> >>>>>>> + if (page->pgmap && page->pgmap->ops->migrate_to_ram)
> >>>>>>> + return page->pgmap->ops->migrate_to_ram(&vmf);
> >>>>>> How does this synchronise against pgmap being released? As I 
> >>>>>> understand things
> >>>>>> at this point we're not holding a reference on either the page or 
> >>>>>> pgmap, so
> >>>>>> the page and therefore the pgmap may have been freed.
> >>>>>>
> >>>>>> I think a similar problem exists for device private fault handling as 
> >>>>>> well and
> >>>>>> it has been on my list of things to fix for a while. I think the 
> >>>>>> solution is to
> >>>>>> call try_get_page(), except it doesn't work with device pages due to 
> >>>>>> the whole
> >>>>>> refcount thing. That issue is blocking a fair bit of work now so I've 
> >>>>>> started
> >>>>>> looking into it.
> >>>>> At least the page should have been pinned by the __get_user_pages_locked
> >>>>> call in __gup_longterm_locked. That refcount is dropped in
> >>>>> check_and_migrate_movable_pages when it returns 0 or an error.
> >>>> Never mind. We unpin the pages first. Alex, would the migration work if
> >>>> we unpinned them afterwards? Also, the normal CPU page fault code path
> >>>> seems to make sure the page is locked (check in pfn_swap_entry_to_page)
> >>>> before calling migrate_to_ram.
> >> I don't think that's true. The check in pfn_swap_entry_to_page() is only 
> >> for
> >> migration entries:
> >>
> >>BUG_ON(is_migration_entry(entry) && !PageLocked(p));
> >>
> >> As this is coherent memory though why do we have to call into a device 
> >> driver
> >> to do the migration? Couldn't this all be done in the kernel?
> > I think you're right. I hadn't thought of that mainly because I'm even
> > less familiar with the non-device migration code. Alex, can you give
> > that a try? As long as the driver still gets a page-free callback when
> > the device page is freed, it should work.

Yes, you should still get the page-free callback when the migration code drops
the last page reference.

> ACK.Will do

There is currently not really any support for migrating device pages based on
pfn. What I think is needed is something like migrate_pages(), but that API
won't work for a couple of reasons - main one being that it relies on pages
being LRU pages.

I've been working on a series to implement an equivalent of migrate_pages() for
device-private (and by extension device-coherent) pages. It might also be useful
here so I will try and get it posted as an RFC next week.

 - Alistair

> Alex Sierra
> 
> > Regards,
> >Felix
> >
> >
> >>> No, you cannot unpin after migration. Due to the expected_count VS
> >>> page_count condition at migrate_page_move_mapping, during migrate_page 
> >>> call.
> >>>
> >>> Regards,
> >>> Alex Sierra
> >>>
> >>>> Regards,
> >>>> Felix
> >>>>
> >>>>
> >>






Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system

2021-12-09 Thread Alistair Popple
On Thursday, 9 December 2021 12:53:45 AM AEDT Jason Gunthorpe wrote:
> > I think a similar problem exists for device private fault handling as well 
> > and
> > it has been on my list of things to fix for a while. I think the solution 
> > is to
> > call try_get_page(), except it doesn't work with device pages due to the 
> > whole
> > refcount thing. That issue is blocking a fair bit of work now so I've 
> > started
> > looking into it.
> 
> Where is this?
 
Nothing posted yet. I've been going through the mailing list and the old
thread[1] to get an understanding of what is left to do. If you have any
suggestions they would be welcome.

[1] https://lore.kernel.org/all/20211014153928.16805-3-alex.sie...@amd.com/





Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system

2021-12-09 Thread Alistair Popple
On Thursday, 9 December 2021 5:55:26 AM AEDT Sierra Guiza, Alejandro (Alex) 
wrote:
> 
> On 12/8/2021 11:30 AM, Felix Kuehling wrote:
> > Am 2021-12-08 um 11:58 a.m. schrieb Felix Kuehling:
> >> Am 2021-12-08 um 6:31 a.m. schrieb Alistair Popple:
> >>> On Tuesday, 7 December 2021 5:52:43 AM AEDT Alex Sierra wrote:
> >>>> Avoid long term pinning for Coherent device type pages. This could
> >>>> interfere with their own device memory manager.
> >>>> If caller tries to get user device coherent pages with PIN_LONGTERM flag
> >>>> set, those pages will be migrated back to system memory.
> >>>>
> >>>> Signed-off-by: Alex Sierra 
> >>>> ---
> >>>>   mm/gup.c | 32 ++--
> >>>>   1 file changed, 30 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/mm/gup.c b/mm/gup.c
> >>>> index 886d6148d3d0..1572eacf07f4 100644
> >>>> --- a/mm/gup.c
> >>>> +++ b/mm/gup.c
> >>>> @@ -1689,17 +1689,37 @@ struct page *get_dump_page(unsigned long addr)
> >>>>   #endif /* CONFIG_ELF_CORE */
> >>>>   
> >>>>   #ifdef CONFIG_MIGRATION
> >>>> +static int migrate_device_page(unsigned long address,
> >>>> +struct page *page)
> >>>> +{
> >>>> +struct vm_area_struct *vma = find_vma(current->mm, address);
> >>>> +struct vm_fault vmf = {
> >>>> +.vma = vma,
> >>>> +.address = address & PAGE_MASK,
> >>>> +.flags = FAULT_FLAG_USER,
> >>>> +.pgoff = linear_page_index(vma, address),
> >>>> +.gfp_mask = GFP_KERNEL,
> >>>> +.page = page,
> >>>> +};
> >>>> +if (page->pgmap && page->pgmap->ops->migrate_to_ram)
> >>>> +return page->pgmap->ops->migrate_to_ram(&vmf);
> >>> How does this synchronise against pgmap being released? As I understand 
> >>> things
> >>> at this point we're not holding a reference on either the page or pgmap, 
> >>> so
> >>> the page and therefore the pgmap may have been freed.
> >>>
> >>> I think a similar problem exists for device private fault handling as 
> >>> well and
> >>> it has been on my list of things to fix for a while. I think the solution 
> >>> is to
> >>> call try_get_page(), except it doesn't work with device pages due to the 
> >>> whole
> >>> refcount thing. That issue is blocking a fair bit of work now so I've 
> >>> started
> >>> looking into it.
> >> At least the page should have been pinned by the __get_user_pages_locked
> >> call in __gup_longterm_locked. That refcount is dropped in
> >> check_and_migrate_movable_pages when it returns 0 or an error.
> > Never mind. We unpin the pages first. Alex, would the migration work if
> > we unpinned them afterwards? Also, the normal CPU page fault code path
> > seems to make sure the page is locked (check in pfn_swap_entry_to_page)
> > before calling migrate_to_ram.

I don't think that's true. The check in pfn_swap_entry_to_page() is only for
migration entries:

BUG_ON(is_migration_entry(entry) && !PageLocked(p));

As this is coherent memory though why do we have to call into a device driver
to do the migration? Couldn't this all be done in the kernel?

> No, you cannot unpin after migration. Due to the expected_count VS 
> page_count condition at migrate_page_move_mapping, during migrate_page call.
> 
> Regards,
> Alex Sierra
> 
> > Regards,
> >Felix
> >
> >
> 






Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system

2021-12-09 Thread Alistair Popple
On Tuesday, 7 December 2021 5:52:43 AM AEDT Alex Sierra wrote:
> Avoid long term pinning for Coherent device type pages. This could
> interfere with their own device memory manager.
> If caller tries to get user device coherent pages with PIN_LONGTERM flag
> set, those pages will be migrated back to system memory.
> 
> Signed-off-by: Alex Sierra 
> ---
>  mm/gup.c | 32 ++--
>  1 file changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 886d6148d3d0..1572eacf07f4 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1689,17 +1689,37 @@ struct page *get_dump_page(unsigned long addr)
>  #endif /* CONFIG_ELF_CORE */
>  
>  #ifdef CONFIG_MIGRATION
> +static int migrate_device_page(unsigned long address,
> + struct page *page)
> +{
> + struct vm_area_struct *vma = find_vma(current->mm, address);
> + struct vm_fault vmf = {
> + .vma = vma,
> + .address = address & PAGE_MASK,
> + .flags = FAULT_FLAG_USER,
> + .pgoff = linear_page_index(vma, address),
> + .gfp_mask = GFP_KERNEL,
> + .page = page,
> + };
> + if (page->pgmap && page->pgmap->ops->migrate_to_ram)
> + return page->pgmap->ops->migrate_to_ram(&vmf);

How does this synchronise against pgmap being released? As I understand things
at this point we're not holding a reference on either the page or pgmap, so
the page and therefore the pgmap may have been freed.

I think a similar problem exists for device private fault handling as well and
it has been on my list of things to fix for a while. I think the solution is to
call try_get_page(), except it doesn't work with device pages due to the whole
refcount thing. That issue is blocking a fair bit of work now so I've started
looking into it.
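
To make the lifetime explicit you would want to hold a reference across the
callback, roughly like the sketch below. This is illustrative only and, as
noted above, try_get_page() does not currently do the right thing for device
pages because of their refcounting scheme:

	int ret = -EBUSY;

	/* Sketch: pin the page so page->pgmap cannot be freed underneath us. */
	if (!try_get_page(page))
		return -EBUSY;
	if (page->pgmap && page->pgmap->ops->migrate_to_ram)
		ret = page->pgmap->ops->migrate_to_ram(&vmf);
	put_page(page);
	return ret;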

> +
> + return -EBUSY;
> +}
> +
>  /*
>   * Check whether all pages are pinnable, if so return number of pages.  If 
> some
>   * pages are not pinnable, migrate them, and unpin all pages. Return zero if
>   * pages were migrated, or if some pages were not successfully isolated.
>   * Return negative error if migration fails.
>   */
> -static long check_and_migrate_movable_pages(unsigned long nr_pages,
> +static long check_and_migrate_movable_pages(unsigned long start,
> + unsigned long nr_pages,
>   struct page **pages,
>   unsigned int gup_flags)
>  {
>   unsigned long i;
> + unsigned long page_index;
>   unsigned long isolation_error_count = 0;
>   bool drain_allow = true;
>   LIST_HEAD(movable_page_list);
> @@ -1720,6 +1740,10 @@ static long check_and_migrate_movable_pages(unsigned 
> long nr_pages,
>* If we get a movable page, since we are going to be pinning
>* these entries, try to move them out if possible.
>*/
> + if (is_device_page(head)) {
> + page_index = i;
> + goto unpin_pages;
> + }
>   if (!is_pinnable_page(head)) {
>   if (PageHuge(head)) {
>   if (!isolate_huge_page(head, &movable_page_list))
> @@ -1750,12 +1774,16 @@ static long check_and_migrate_movable_pages(unsigned 
> long nr_pages,
>   if (list_empty(&movable_page_list) && !isolation_error_count)
>   return nr_pages;
>  
> +unpin_pages:
>   if (gup_flags & FOLL_PIN) {
>   unpin_user_pages(pages, nr_pages);
>   } else {
>   for (i = 0; i < nr_pages; i++)
>   put_page(pages[i]);
>   }
> + if (is_device_page(head))
> + return migrate_device_page(start + page_index * PAGE_SIZE, 
> head);

This isn't very optimal - if a range contains more than one device page (which
seems likely) we will have to go around the whole gup/check_and_migrate loop
once for each device page which seems unnecessary. You should be able to either
build a list or migrate them as you go through the loop. I'm also currently
looking into how to extend migrate_pages() to support device pages which might
be useful here too.
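
In other words something roughly like this inside the loop, where
device_pages[], n_device_pages and migrate_device_pages_to_ram() are all
made-up names just to illustrate the batching:

		/* Sketch: remember device pages as we walk the range ... */
		if (is_device_page(head)) {
			device_pages[n_device_pages++] = head;
			continue;
		}

	/* ... and migrate them in one batch after unpinning, instead of
	 * restarting the whole gup/check_and_migrate cycle per page. */
	if (n_device_pages)
		return migrate_device_pages_to_ram(device_pages, n_device_pages);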

> +
>   if (!list_empty(&movable_page_list)) {
>   ret = migrate_pages(&movable_page_list, alloc_migration_target,
>   NULL, (unsigned long)&mtc, MIGRATE_SYNC,
> @@ -1798,7 +1826,7 @@ static long __gup_longterm_locked(struct mm_struct *mm,
>NULL, gup_flags);
>   if (rc <= 0)
>   break;
> - rc = check_and_migrate_movable_pages(rc, pages, gup_flags);
> + rc = check_and_migrate_movable_pages(start, rc, pages, 
> gup_flags);
>   } while (!rc);
>   memalloc_pin_restore(flags);
>  
> 






Re: [PATCH v1 1/9] mm: add zone device coherent type memory support

2021-11-23 Thread Alistair Popple
On Tuesday, 23 November 2021 4:16:55 AM AEDT Felix Kuehling wrote:

[...]

> > Right, so long as my fix goes in I don't think there is anything wrong with
> > pinning device public pages. Agree that we should avoid FOLL_LONGTERM pins 
> > for
> > device memory though. I think the way to do that is update 
> > is_pinnable_page()
> > so we treat device pages the same as other unpinnable pages ie. long-term 
> > pins
> > will migrate the page.
> 
> I'm trying to understand check_and_migrate_movable_pages in gup.c. It
> doesn't look like the right way to migrate device pages. We may have to
> do something different there as well. So instead of changing
> > is_pinnable_page, it may be better to explicitly check for is_device_page
> or is_device_coherent_page in check_and_migrate_movable_pages to migrate
> it correctly, or just fail outright.

Yes, I think you're right. I was thinking check_and_migrate_movable_pages()
would work for coherent device pages. Now I see it won't because it assumes
they are lru pages and it tries to isolate them which will never succeed
because device pages aren't on a lru.

I think migrating them is the right thing to do for FOLL_LONGTERM though.

 - Alistair

> Thanks,
>   Felix
> 
> >
> >>> In the case of device-private pages this is enforced by the fact they 
> >>> never
> >>> have present pte's, so any attempt to GUP them results in a fault. But if 
> >>> I'm
> >>> understanding this series correctly that won't be the case for coherent 
> >>> device
> >>> pages right?
> >> Right.
> >>
> >> Regards,
> >>   Felix
> >>
> >>
>  -return is_device_private_page(page);
>  +return is_device_page(page);
>   }
>   
>   /* For file back page */
>  @@ -2791,7 +2791,7 @@ EXPORT_SYMBOL(migrate_vma_setup);
>    * handle_pte_fault()
>    *   do_anonymous_page()
>    * to map in an anonymous zero page but the struct page will be a 
>  ZONE_DEVICE
>  - * private page.
>  + * private or coherent page.
>    */
>   static void migrate_vma_insert_page(struct migrate_vma *migrate,
>   unsigned long addr,
>  @@ -2867,10 +2867,15 @@ static void migrate_vma_insert_page(struct 
>  migrate_vma *migrate,
> swp_entry = make_readable_device_private_entry(page_to_pfn(page));
>   entry = swp_entry_to_pte(swp_entry);
>  +} else if (is_device_page(page)) {
> >>> How about adding an explicit `is_device_coherent_page()` helper? It would 
> >>> make
> >>> the test more explicit that this is expected to handle just coherent 
> >>> pages and
> >>> I bet there will be future changes that need to differentiate between 
> >>> private
> >>> and coherent pages anyway.
> >>>
>  +entry = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
>  +if (vma->vm_flags & VM_WRITE)
>  +entry = pte_mkwrite(pte_mkdirty(entry));
>   } else {
>   /*
>  - * For now we only support migrating to un-addressable
>  - * device memory.
>  + * We support migrating to private and coherent types
>  + * for device zone memory.
>    */
>   pr_warn_once("Unsupported ZONE_DEVICE page 
>  type.\n");
>   goto abort;
>  @@ -2976,10 +2981,10 @@ void migrate_vma_pages(struct migrate_vma 
>  *migrate)
>   mapping = page_mapping(page);
>   
>   if (is_zone_device_page(newpage)) {
>  -if (is_device_private_page(newpage)) {
>  +if (is_device_page(newpage)) {
>   /*
>  - * For now only support private anonymous when
>  - * migrating to un-addressable device memory.
>  + * For now only support private and coherent
>  + * anonymous when migrating to device memory.
>    */
>   if (mapping) {
>   migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
> 
> >
> >
> 





