Re: [PATCH v9 06/14] mm/gup: migrate device coherent pages when pinning instead of failing
On 7/18/2022 3:32 PM, Andrew Morton wrote:
> On Mon, 18 Jul 2022 12:56:29 +0200 David Hildenbrand wrote:
>>> 	/*
>>> 	 * Try to move out any movable page before pinning the range.
>>> 	 */
>>> @@ -1919,7 +1948,8 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>>> 				    folio_nr_pages(folio));
>>> 	}
>>>
>>> -	if (!list_empty(&movable_page_list) || isolation_error_count)
>>> +	if (!list_empty(&movable_page_list) || isolation_error_count
>>> +	    || coherent_pages)
>>
>> The common style is to
>> a) add the || to the end of the previous line
>> b) indent such that we have a nice-to-read alignment
>>
>> 	if (!list_empty(&movable_page_list) || isolation_error_count ||
>> 	    coherent_pages)
>
> I missed that.
>
> This series is now in mm-stable so any fix will need to be a standalone
> followup patch, please.

Hi Andrew,

Just wanted to make sure nothing is missing from our side to merge this
patch series.

Regards,
Alex Sierra

>> Apart from that lgtm.
>>
>> Reviewed-by: David Hildenbrand
>
> And your reviewed-by's will be lost. Stupid git.
Re: [PATCH] mm/gup: migrate device coherent pages when pinning instead of failing
On 7/14/2022 9:11 PM, Alistair Popple wrote:
> Currently any attempts to pin a device coherent page will fail. This is
> because device coherent pages need to be managed by a device driver, and
> pinning them would prevent a driver from migrating them off the device.
>
> However this is no reason to fail pinning of these pages. These are
> coherent and accessible from the CPU so can be migrated just like
> pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin
> them first try migrating them out of ZONE_DEVICE.
>
> [hch: rebased to the split device memory checks,
>  moved migrate_device_page to migrate_device.c]
>
> Signed-off-by: Alistair Popple
> Acked-by: Felix Kuehling
> Signed-off-by: Christoph Hellwig
> ---
>
> This patch hopefully addresses all of David's comments. It replaces both
> my "mm: remove the vma check in migrate_vma_setup()" and "mm/gup:
> migrate device coherent pages when pinning instead of failing" patches.
>
> I'm not sure what the best way of including this is, perhaps Alex can
> respin the series with this patch instead?

For sure Alistair. I'll include this in my next patch series version.

Thanks,
Alex Sierra

> - Alistair
>
>  mm/gup.c            | 50 ++++++++++++++++++++++++++++++++++++++------
>  mm/internal.h       |  1 +
>  mm/migrate_device.c | 52 +++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 96 insertions(+), 7 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index b65fe8bf5af4..22b97ab61cd9 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1881,7 +1881,7 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>  	unsigned long isolation_error_count = 0, i;
>  	struct folio *prev_folio = NULL;
>  	LIST_HEAD(movable_page_list);
> -	bool drain_allow = true;
> +	bool drain_allow = true, coherent_pages = false;
>  	int ret = 0;
>
>  	for (i = 0; i < nr_pages; i++) {
> @@ -1891,9 +1891,38 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>  			continue;
>  		prev_folio = folio;
>
> -		if (folio_is_longterm_pinnable(folio))
> +		/*
> +		 * Device coherent pages are managed by a driver and should not
> +		 * be pinned indefinitely as it prevents the driver moving the
> +		 * page. So when trying to pin with FOLL_LONGTERM instead try
> +		 * to migrate the page out of device memory.
> +		 */
> +		if (folio_is_device_coherent(folio)) {
> +			/*
> +			 * We always want a new GUP lookup with device coherent
> +			 * pages.
> +			 */
> +			pages[i] = 0;
> +			coherent_pages = true;
> +
> +			/*
> +			 * Migration will fail if the page is pinned, so convert
> +			 * the pin on the source page to a normal reference.
> +			 */
> +			if (gup_flags & FOLL_PIN) {
> +				get_page(&folio->page);
> +				unpin_user_page(&folio->page);
> +			}
> +
> +			ret = migrate_device_coherent_page(&folio->page);
> +			if (ret)
> +				goto unpin_pages;
> +
> +			continue;
> +		}
> +
> +		if (folio_is_longterm_pinnable(folio))
>  			continue;
>  		/*
>  		 * Try to move out any movable page before pinning the range.
>  		 */
> @@ -1919,7 +1948,8 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>  				    folio_nr_pages(folio));
>  	}
>
> -	if (!list_empty(&movable_page_list) || isolation_error_count)
> +	if (!list_empty(&movable_page_list) || isolation_error_count
> +	    || coherent_pages)
>  		goto unpin_pages;
>
>  	/*
> @@ -1929,10 +1959,16 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>  		return nr_pages;
>
>  unpin_pages:
> -	if (gup_flags & FOLL_PIN) {
> -		unpin_user_pages(pages, nr_pages);
> -	} else {
> -		for (i = 0; i < nr_pages; i++)
> +	/*
> +	 * pages[i] might be NULL if any device coherent pages were found.
> +	 */
> +	for (i = 0; i < nr_pages; i++) {
> +		if (!pages[i])
> +			continue;
> +
> +		if (gup_flags & FOLL_PIN)
> +			unpin_user_page(pages[i]);
> +		else
>  			put_page(pages[i]);
>  	}
>
> diff --git a/mm/internal.h b/mm/internal.h
> index c0f8fbe0445b..899dab512c5a 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -853,6 +853,7 @@ int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
>  			unsigned long addr, int page_nid, int *flags);
>
>  void free_zone_device_page(struct page *page);
> +int migrate_device_coherent_page(struct page *page);
>
>  /*
>   * mm/gup.c
> diff --git a/mm/migrate_device.c
Re: [PATCH v6 02/14] mm: handling Non-LRU pages returned by vm_normal_pages
On 6/28/2022 5:42 AM, David Hildenbrand wrote:
> On 28.06.22 02:14, Alex Sierra wrote:
>> With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
>> device-managed anonymous pages that are not LRU pages. Although they
>> behave like normal pages for purposes of mapping in CPU page tables
>> and for COW, they do not support LRU lists, NUMA migration or THP.
>>
>> We also introduced a FOLL_LRU flag that adds the same behaviour to
>> follow_page and related APIs, to allow callers to specify that they
>> expect to put pages on an LRU list.
>>
>> Signed-off-by: Alex Sierra
>> Acked-by: Felix Kuehling
>> Reviewed-by: Alistair Popple
>> ---
>
> I think my review feedback regarding FOLL_LRU has been ignored.

Sorry David, this has been addressed in v7.

Regards,
Alex Sierra
Re: [PATCH v5 01/13] mm: add zone device coherent type memory support
On 6/23/2022 1:21 PM, David Hildenbrand wrote:
On 23.06.22 20:20, Sierra Guiza, Alejandro (Alex) wrote:
On 6/23/2022 2:57 AM, David Hildenbrand wrote:
On 23.06.22 01:16, Sierra Guiza, Alejandro (Alex) wrote:
On 6/21/2022 11:16 AM, David Hildenbrand wrote:
On 21.06.22 18:08, Sierra Guiza, Alejandro (Alex) wrote:
On 6/21/2022 7:25 AM, David Hildenbrand wrote:
On 21.06.22 13:55, Alistair Popple wrote:
David Hildenbrand writes:
On 21.06.22 13:25, Felix Kuehling wrote:
Am 6/17/22 um 23:19 schrieb David Hildenbrand:
On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
On 6/17/2022 12:33 PM, David Hildenbrand wrote:
On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
On 6/17/2022 4:40 AM, David Hildenbrand wrote:
On 31.05.22 22:00, Alex Sierra wrote:

Device memory that is cache coherent from device and CPU point of view.
This is used on platforms that have an advanced system bus (like CAPI
or CXL). Any page of a process can be migrated to such memory. However,
no one should be allowed to pin such memory so that it can always be
evicted.

Signed-off-by: Alex Sierra
Acked-by: Felix Kuehling
Reviewed-by: Alistair Popple
[hch: rebased ontop of the refcount changes,
 removed is_dev_private_or_coherent_page]
Signed-off-by: Christoph Hellwig
---
 include/linux/memremap.h | 19 +++++++++++++++
 mm/memcontrol.c          |  7 ++++---
 mm/memory-failure.c      |  8 ++++++--
 mm/memremap.c            | 10 ++++++++++
 mm/migrate_device.c      | 16 ++++++-------
 mm/rmap.c                |  5 +++--
 6 files changed, 49 insertions(+), 16 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 8af304f6b504..9f752ebed613 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -41,6 +41,13 @@ struct vmem_altmap {
  * A more complete discussion of unaddressable memory may be found in
  * include/linux/hmm.h and Documentation/vm/hmm.rst.
  *
+ * MEMORY_DEVICE_COHERENT:
+ * Device memory that is cache coherent from device and CPU point of view. This
+ * is used on platforms that have an advanced system bus (like CAPI or CXL). A
+ * driver can hotplug the device memory using ZONE_DEVICE and with that memory
+ * type. Any page of a process can be migrated to such memory. However no one

"Any page" might not be right, I'm pretty sure. ... just thinking about
special pages like vdso, shared zeropage, ... pinned pages ...

Well, you cannot migrate long term pages, that's what I meant :)

+ * should be allowed to pin such memory so that it can always be evicted.
+ *
  * MEMORY_DEVICE_FS_DAX:
  * Host memory that has similar access semantics as System RAM i.e. DMA
  * coherent and supports page pinning. In support of coordinating page
@@ -61,6 +68,7 @@ struct vmem_altmap {
 enum memory_type {
 	/* 0 is reserved to catch uninitialized type fields */
 	MEMORY_DEVICE_PRIVATE = 1,
+	MEMORY_DEVICE_COHERENT,
 	MEMORY_DEVICE_FS_DAX,
 	MEMORY_DEVICE_GENERIC,
 	MEMORY_DEVICE_PCI_P2PDMA,
@@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)

In general, this LGTM, and it should be correct with PageAnonExclusive
I think. However, where exactly is pinning forbidden?

Long-term pinning is forbidden since it would interfere with the device
memory manager owning the device-coherent pages (e.g. evictions in
TTM). However, normal pinning is allowed on this device type.

I don't see updates to folio_is_pinnable() in this patch. Device
coherent type pages should return true here, as they are pinnable
pages. That function is only called for long-term pinnings in
try_grab_folio().

So wouldn't try_grab_folio() simply pin these pages? What am I missing?

As far as I understand this return NULL for long term pin pages.
Otherwise they get refcount incremented.

I don't follow. You're saying

a) folio_is_pinnable() returns true for device coherent pages

and that

b) device coherent pages don't get long-term pinned

Yet, the code says

struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
{
	if (flags & FOLL_GET)
		return try_get_folio(page, refs);
	else if (flags & FOLL_PIN) {
		struct folio *folio;

		/*
		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
		 * right zone, so fail and let the caller fall back to the slow
		 * path.
		 */
		if (unlikely((flags & FOLL_LONGTERM) &&
			     !is_pinnable_page(page)))
			return NULL;
		...
		return folio;
	}
}

What prevents these pages from getting long-term pinned as stated in
this patch?

Long-term pinning is handled by __gup_longterm_locked, which mig
Re: [PATCH v5 01/13] mm: add zone device coherent type memory support
On 6/23/2022 2:57 AM, David Hildenbrand wrote:
> [...]
>
> What prevents these pages from getting long-term pinned as stated in
> this patch?

Long-term pinning is handled by __gup_longterm_locked, which migrates
pages returned by __get_user_pages_locked that cannot be long-term
pinned. try_grab_folio is OK to grab the pages. Anything that can't be
long-term pin
Re: [PATCH v5 01/13] mm: add zone device coherent type memory support
On 6/21/2022 11:16 AM, David Hildenbrand wrote:
> [...]
>
> What prevents these pages from getting long-term pinned as stated in
> this patch?

Long-term pinning is handled by __gup_longterm_locked, which migrates
pages returned by __get_user_pages_locked that cannot be long-term
pinned. try_grab_folio is OK to grab the pages. Anything that can't be
long-term pinned will be migrated afterwards, and
__get_user_pages_locked will be retried. The migration of
DEVICE_COHERENT pages was imp
Re: [PATCH v5 01/13] mm: add zone device coherent type memory support
On 6/21/2022 7:16 PM, Alistair Popple wrote:
> David Hildenbrand writes:
>
> [...]
>
>> What prevents these pages from getting long-term pinned as stated in
>> this patch?
>
> Long-term pinning is handled by __gup_longterm_locked, which migrates
> pages returned by __get_user_pages_locked that cannot be long-term
> pinned. try_grab_folio is OK to grab the pages. Anything that can't be
> long-term pinned will be migrated afterwards, and
> __get_user_pages_locked will be retried. The migration of D
Re: [PATCH v5 01/13] mm: add zone device coherent type memory support
On 6/21/2022 7:25 AM, David Hildenbrand wrote:
> On 21.06.22 13:55, Alistair Popple wrote:
> [...]
>
> What prevents these pages from getting long-term pinned as stated in
> this patch?

Long-term pinning is handled by __gup_longterm_locked, which migrates
pages returned by __get_user_pages_locked that cannot be long-term
pinned. try_grab_folio is OK to grab the pages. Anything that can't be
long-term pinned will be migrated afterwards, and
__get_user_pages_locked will be retried. The migration of
DEVICE_COHERENT pages was implemented by Alistair in patch 5/13
("mm/gup: migrate device coherent pages when pinning instead of
failing").
Re: [PATCH v5 01/13] mm: add zone device coherent type memory support
On 6/17/2022 12:33 PM, David Hildenbrand wrote:
> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
> [...]
>
> However, where exactly is pinning forbidden?

Long-term pinning is forbidden since it would interfere with the device
memory manager owning the device-coherent pages (e.g. evictions in
TTM). However, normal pinning is allowed on this device type. I don't
see updates to folio_is_pinnable() in this patch. Device coherent type
pages should return true here, as they are pinnable pages.

> So wouldn't try_grab_folio() simply pin these pages? What am I missing?

As far as I understand this return NULL for long term pin pages.
Otherwise they get refcount incremented.

Regards,
Alex Sierra
Re: [PATCH v5 01/13] mm: add zone device coherent type memory support
On 6/17/2022 4:40 AM, David Hildenbrand wrote:
> On 31.05.22 22:00, Alex Sierra wrote:
> [...]
>
>> Any page of a process can be migrated to such memory. However no one
>
> "Any page" might not be right, I'm pretty sure. ... just thinking about
> special pages like vdso, shared zeropage, ... pinned pages ...

Hi David,

Yes, I think you're right. This type does not cover all special pages.
I need to correct that on the cover letter. Pinned pages are allowed as
long as they're not long term pinned.

Regards,
Alex Sierra

> In general, this LGTM, and it should be correct with PageAnonExclusive
> I think.
>
> However, where exactly is pinning forbidden?

Long-term pinning is forbidden since it would interfere with the device
memory manager owning the device-coherent pages (e.g. evictions in
TTM). However, normal pinning is allowed on this device type.

Regards,
Alex Sierra
Re: [PATCH v4 07/13] lib: test_hmm add ioctl to get zone device type
On 5/31/2022 12:31 PM, Andrew Morton wrote: On Tue, 31 May 2022 10:56:23 -0500 Alex Sierra wrote: new ioctl cmd added to query zone device type. This will be used once the test_hmm adds zone device coherent type. @@ -1026,6 +1027,15 @@ static int dmirror_snapshot(struct dmirror *dmirror, return ret; } +static int dmirror_get_device_type(struct dmirror *dmirror, + struct hmm_dmirror_cmd *cmd) +{ + mutex_lock(&dmirror->mutex); + cmd->zone_device_type = dmirror->mdevice->zone_device_type; + mutex_unlock(&dmirror->mutex); What does the locking here do? Presumably cmd->zone_device_type can become out of date the instant the mutex is released, so what was the point in taking the mutex? Actually this is not used at all. Thanks for finding it. Honestly, I don't remember what we used this type request for. I will remove all related code and send a new patch series version. Regards, Alex Sierra And does it make sense to return potentially out-of-date info to userspace? Perhaps this interface simply shouldn't exist?
Re: [PATCH v3 02/13] mm: handling Non-LRU pages returned by vm_normal_pages
On 5/24/2022 11:11 PM, Alistair Popple wrote: Alex Sierra writes: With DEVICE_COHERENT, we'll soon have vm_normal_pages() return device-managed anonymous pages that are not LRU pages. Although they behave like normal pages for purposes of mapping in CPU page, and for COW. They do not support LRU lists, NUMA migration or THP. We also introduced a FOLL_LRU flag that adds the same behaviour to follow_page and related APIs, to allow callers to specify that they expect to put pages on an LRU list. Continuing the follow up from the thread for v2: This means by default GUP can return non-LRU pages. I didn't see anywhere that would be a problem but I didn't check everything. Did you check this or is there some other reason I've missed that makes this not a problem? I have double checked all gup and pin_user_pages callers and none of them seem to have interaction with LRU APIs. And actually if I'm understanding things correctly callers of GUP/PUP/follow_page_pte() should already expect to get non-LRU pages returned: page = vm_normal_page(vma, address, pte); if ((flags & FOLL_LRU) && page && is_zone_device_page(page)) page = NULL; if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) { /* * Only return device mapping pages in the FOLL_GET or FOLL_PIN * case since they are only valid while holding the pgmap * reference. */ *pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap); if (*pgmap) page = pte_page(pte); Which I think makes FOLL_LRU confusing, because if understand correctly even with FOLL_LRU it is still possible for follow_page_pte() to return a non-LRU page. Could we do something like this to make it consistent: if ((flags & FOLL_LRU) && (page && is_zone_device_page(page) || !page && pte_devmap(pte))) Hi Alistair, Not sure if this suggestion is a replacement for the first or the second condition in the snip code above. We know device coherent type will not be set with devmap. 
So we could do the following: if ((flags & FOLL_LRU) && page && is_zone_device_page(page)) - page = NULL; + goto no_page; Regards, Alex Sierra Looking at callers that currently use FOLL_LRU I don't think this would change any behaviour as they already filter out devmap through various other means. Signed-off-by: Alex Sierra Acked-by: Felix Kuehling --- fs/proc/task_mmu.c | 2 +- include/linux/mm.h | 3 ++- mm/gup.c | 2 ++ mm/huge_memory.c | 2 +- mm/khugepaged.c| 9 ++--- mm/ksm.c | 6 +++--- mm/madvise.c | 4 ++-- mm/memory.c| 9 - mm/mempolicy.c | 2 +- mm/migrate.c | 4 ++-- mm/mlock.c | 2 +- mm/mprotect.c | 2 +- 12 files changed, 30 insertions(+), 17 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index f46060eb91b5..5d620733f173 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1785,7 +1785,7 @@ static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma, return NULL; page = vm_normal_page(vma, addr, pte); - if (!page) + if (!page || is_zone_device_page(page)) return NULL; if (PageReserved(page)) diff --git a/include/linux/mm.h b/include/linux/mm.h index 9f44254af8ce..d7f253a0c41e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -601,7 +601,7 @@ struct vm_operations_struct { #endif /* * Called by vm_normal_page() for special PTEs to find the -* page for @addr. This is useful if the default behavior +* page for @addr. This is useful if the default behavior * (using pte_page()) would not find the correct page. 
*/ struct page *(*find_special_page)(struct vm_area_struct *vma, @@ -2929,6 +2929,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, #define FOLL_NUMA 0x200 /* force NUMA hinting page fault */ #define FOLL_MIGRATION0x400 /* wait for page to replace migration entry */ #define FOLL_TRIED0x800 /* a retry, previous pass started an IO */ +#define FOLL_LRU0x1000 /* return only LRU (anon or page cache) */ #define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */ #define FOLL_COW 0x4000 /* internal GUP flag */ #define FOLL_ANON 0x8000 /* don't do file mappings */ diff --git a/mm/gup.c b/mm/gup.c index 501bc150792c..c9cbac06bcc5 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -479,6 +479,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, } page = vm_normal_page(vma, address, pte); + if ((flags & FOLL_LRU) && page && is_zone_device_page(page)) + page = NULL; if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) { /* * Only return device mapping pages in the FOLL_GET or FOLL_PIN diff --git
Re: [PATCH v2 11/13] mm: handling Non-LRU pages returned by vm_normal_pages
On 5/23/2022 7:02 AM, Alistair Popple wrote: Technically I think this patch should be earlier in the series. As I understand it patch 1 allows DEVICE_COHERENT pages to be inserted in the page tables and therefore makes it possible for page table walkers to see non-LRU pages. The patch will be reordered in V3. Regards, Alex Sierra Some more comments below: Alex Sierra writes: With DEVICE_COHERENT, we'll soon have vm_normal_pages() return device-managed anonymous pages that are not LRU pages. Although they behave like normal pages for purposes of mapping in CPU page, and for COW. They do not support LRU lists, NUMA migration or THP. We also introduced a FOLL_LRU flag that adds the same behaviour to follow_page and related APIs, to allow callers to specify that they expect to put pages on an LRU list. This means by default GUP can return non-LRU pages. I didn't see anywhere that would be a problem but I didn't check everything. Did you check this or is there some other reason I've missed that makes this not a problem? I have double-checked all gup and pin_user_pages callers and none of them seem to have any interaction with LRU APIs. Regards, Alex Sierra [...]
diff --git a/mm/khugepaged.c b/mm/khugepaged.c index a4e5eaf3eb01..eb3cfd679800 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -627,7 +627,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, goto out; } page = vm_normal_page(vma, address, pteval); - if (unlikely(!page)) { + if (unlikely(!page) || unlikely(is_zone_device_page(page))) { result = SCAN_PAGE_NULL; goto out; } @@ -1276,7 +1276,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, writable = true; page = vm_normal_page(vma, _address, pteval); - if (unlikely(!page)) { + if (unlikely(!page) || unlikely(is_zone_device_page(page))) { result = SCAN_PAGE_NULL; goto out_unmap; } @@ -1484,7 +1484,8 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) goto abort; page = vm_normal_page(vma, addr, *pte); - + if (page && is_zone_device_page(page)) + page = NULL; /* * Note that uprobe, debugger, or MAP_PRIVATE may change the * page table, but the new page will not be a subpage of hpage. @@ -1502,6 +1503,8 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) if (pte_none(*pte)) continue; page = vm_normal_page(vma, addr, *pte); + if (page && is_zone_device_page(page)) + goto abort; Are either of these two cases actually possible? DEVICE_COHERENT doesn't currently support THP, so if I'm understanding correctly we couldn't have a pte mapped DEVICE_COHERENT THP right? Assuming that's the case I think WARN_ON_ONCE() would be better. Correct, change included in V3 patch series. Regards, Alex Otherwise I think everything else looks reasonable. 
page_remove_rmap(page, vma, false); } diff --git a/mm/ksm.c b/mm/ksm.c index 063a48eeb5ee..f16056efca21 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -474,7 +474,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr) do { cond_resched(); page = follow_page(vma, addr, - FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE); + FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE | FOLL_LRU); if (IS_ERR_OR_NULL(page)) break; if (PageKsm(page)) @@ -559,7 +559,7 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item) if (!vma) goto out; - page = follow_page(vma, addr, FOLL_GET); + page = follow_page(vma, addr, FOLL_GET | FOLL_LRU); if (IS_ERR_OR_NULL(page)) goto out; if (PageAnon(page)) { @@ -2288,7 +2288,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page) while (ksm_scan.address < vma->vm_end) { if (ksm_test_exit(mm)) break; - *page = follow_page(vma, ksm_scan.address, FOLL_GET); + *page = follow_page(vma, ksm_scan.address, FOLL_GET | FOLL_LRU); if (IS_ERR_OR_NULL(*page)) { ksm_scan.address += PAGE_SIZE; cond_resched(); diff --git a/mm/madvise.c b/mm/madvise.c index 1873616a37d2..e9c24c834e98 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -413,7 +413,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
Re: [PATCH v1 13/15] mm: handling Non-LRU pages returned by vm_normal_pages
On 5/11/2022 1:50 PM, Jason Gunthorpe wrote: On Thu, May 05, 2022 at 04:34:36PM -0500, Alex Sierra wrote: diff --git a/mm/memory.c b/mm/memory.c index 76e3af9639d9..892c4cc54dc2 100644 +++ b/mm/memory.c @@ -621,6 +621,13 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, if (is_zero_pfn(pfn)) return NULL; if (pte_devmap(pte)) +/* + * NOTE: Technically this should goto check_pfn label. However, page->_mapcount + * is never incremented for device pages that are mmap through DAX mechanism + * using pmem driver mounted into ext4 filesystem. When these pages are unmap, + * zap_pte_range is called and vm_normal_page return a valid page with + * page_mapcount() = 0, before page_remove_rmap is called. + */ return NULL; ? Where does this series cause device coherent to be returned? In our case, device coherent pages could be obtained as a result of migration (patches 6/7 of 15), ending up mapped in CPU page tables. Later on, these pages might need to be returned by get_user_pages or other callers through vm_normal_pages. Our approach in this series is to handle device-coherent-managed pages returned by vm_normal_pages inside each caller. E.g., device coherent pages don't support LRU lists, NUMA migration or THP. Wasn't the plan to not set pte_devmap() ? amdgpu does not set pte_devmap for our DEVICE_COHERENT pages. DEVMAP flags are set by drivers like virtio_fs or pmem, where the MEMORY_DEVICE_FS_DAX type is used. This patch series deals with DEVICE_COHERENT pages. My understanding was that the DAX code and DEVICE_GENERIC would be fixed up later by someone more familiar with it. Were you expecting that we'd fix the DAX usage of pte_devmap flags in this patch series as well? Regards, Alex Sierra Jason
Re: [PATCH v1 01/15] mm: add zone device coherent type memory support
On 5/11/2022 9:58 PM, Alistair Popple wrote: Alex Sierra writes: [...] diff --git a/mm/rmap.c b/mm/rmap.c index fedb82371efe..d57102cd4b43 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1995,7 +1995,8 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags) TTU_SYNC))) return; - if (folio_is_zone_device(folio) && !folio_is_device_private(folio)) + if (folio_is_zone_device(folio) && + (!folio_is_device_private(folio) && !folio_is_device_coherent(folio))) return; /* I vaguely recall commenting on this previously, or at least intending to. In try_to_migrate_one() we have this: if (folio_is_zone_device(folio)) { unsigned long pfn = folio_pfn(folio); swp_entry_t entry; pte_t swp_pte; /* * Store the pfn of the page in a special migration * pte. do_swap_page() will wait until the migration * pte is removed and then restart fault handling. */ entry = pte_to_swp_entry(pteval); if (is_writable_device_private_entry(entry)) entry = make_writable_migration_entry(pfn); else entry = make_readable_migration_entry(pfn); swp_pte = swp_entry_to_pte(entry); The check in try_to_migrate() guarantees that if folio_is_zone_device() is true this must be a DEVICE_PRIVATE page and it treats it as such by assuming there is a special device private swap entry there. Relying on that assumption seems bad, and I have no idea why I didn't just use is_device_private_page() originally but I think the fix is just to change this to: if (folio_is_device_private(folio)) Thanks Alistair. This worked fine. I'll drop patch 4 and update this patch in the next series version. Regards, Alex Sierra And let DEVICE_COHERENT pages fall through to normal page processing. - Alistair
RE: [PATCH v1 04/15] mm: add device coherent checker to remove migration pte
@apop...@nvidia.com Could you please check this patch? It's somehow related to migrate_device_page() for long term device coherent pages. Regards, Alex Sierra > -Original Message- > From: amd-gfx On Behalf Of Alex > Sierra > Sent: Thursday, May 5, 2022 4:34 PM > To: j...@nvidia.com > Cc: rcampb...@nvidia.com; wi...@infradead.org; da...@redhat.com; > Kuehling, Felix ; apop...@nvidia.com; amd- > g...@lists.freedesktop.org; linux-...@vger.kernel.org; linux...@kvack.org; > jgli...@redhat.com; dri-devel@lists.freedesktop.org; akpm@linux- > foundation.org; linux-e...@vger.kernel.org; h...@lst.de > Subject: [PATCH v1 04/15] mm: add device coherent checker to remove > migration pte > > During remove_migration_pte(), entries for device coherent type pages that > were not created through special migration ptes, ignore _PAGE_RW flag. This > path can be found at migrate_device_page(), where valid vma is not > required. In this case, migrate_vma_collect_pmd() is not called and special > migration ptes are not set. > > Signed-off-by: Alex Sierra > --- > mm/migrate.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/mm/migrate.c b/mm/migrate.c index > 6c31ee1e1c9b..e18ddee56f37 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -206,7 +206,8 @@ static bool remove_migration_pte(struct folio *folio, >* Recheck VMA as permissions can change since migration > started >*/ > entry = pte_to_swp_entry(*pvmw.pte); > - if (is_writable_migration_entry(entry)) > + if (is_writable_migration_entry(entry) || > + is_device_coherent_page(pfn_to_page(pvmw.pfn))) > pte = maybe_mkwrite(pte, vma); > else if (pte_swp_uffd_wp(*pvmw.pte)) > pte = pte_mkuffd_wp(pte); > -- > 2.32.0
Re: [PATCH v2 1/3] mm: add vm_normal_lru_pages for LRU handled pages only
On 4/4/2022 12:38 PM, Jason Gunthorpe wrote: On Fri, Apr 01, 2022 at 04:08:35PM -0400, Felix Kuehling wrote: In general I find the vm_normal_lru_page vs vm_normal_page API highly confusing. An explicit check for zone device pages in the dozen or so spots that care has a much better documentation value, especially if accompanied by comments where it isn't entirely obvious. OK. We can do that. It would solve the function naming problem, and we'd have more visibility of device page handling in more places in the kernel, which has educational value. Personally I find the 'is page XYZ' pretty confusing, like I don't know half of what the PageKsm annotations are for.. Testing against a specific property the code goes on to use right away seems more descriptive to me. Hi Jason, Are you referring to testing for properties such as is_lru_page, is_numa_page, is_lockable_page, etc.? If not, could you provide an example? Regards, Alex Sierra Jason
RE: [PATCH] drm/amdkfd: Add SVM API support capability bits
Hi Matthew, I sent this patch by accident. Please ignore it. Regards, Alex Sierra > -Original Message- > From: Matthew Wilcox > Sent: Wednesday, March 30, 2022 4:29 PM > To: Sierra Guiza, Alejandro (Alex) > Cc: j...@nvidia.com; da...@redhat.com; Kuehling, Felix > ; linux...@kvack.org; rcampb...@nvidia.com; > linux-e...@vger.kernel.org; linux-...@vger.kernel.org; amd- > g...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; h...@lst.de; > jgli...@redhat.com; apop...@nvidia.com; a...@linux-foundation.org > Subject: Re: [PATCH] drm/amdkfd: Add SVM API support capability bits > > On Wed, Mar 30, 2022 at 04:24:20PM -0500, Alex Sierra wrote: > > From: Philip Yang > > > > SVMAPISupported property added to HSA_CAPABILITY, the value match > > HSA_CAPABILITY defined in Thunk spec: > > > > SVMAPISupported: it will not be supported on older kernels that don't > > have HMM or on systems with GFXv8 or older GPUs without support for > > 48-bit virtual addresses. > > > > CoherentHostAccess property added to HSA_MEMORYPROPERTY, the > value > > match HSA_MEMORYPROPERTY defined in Thunk spec: > > > > CoherentHostAccess: whether or not device memory can be coherently > > accessed by the host CPU. > > Could you translate this commit message into English? Reviewing > Documentation/process/5.Posting.rst might be helpful.
RE: [PATCH] drm/amdkfd: Add SVM API support capability bits
Please ignore this patch. > -Original Message- > From: amd-gfx On Behalf Of Alex > Sierra > Sent: Wednesday, March 30, 2022 4:24 PM > To: j...@nvidia.com > Cc: rcampb...@nvidia.com; wi...@infradead.org; da...@redhat.com; > Kuehling, Felix ; apop...@nvidia.com; amd- > g...@lists.freedesktop.org; linux-...@vger.kernel.org; linux...@kvack.org; > jgli...@redhat.com; dri-devel@lists.freedesktop.org; akpm@linux- > foundation.org; linux-e...@vger.kernel.org; h...@lst.de > Subject: [PATCH] drm/amdkfd: Add SVM API support capability bits > > From: Philip Yang > > SVMAPISupported property added to HSA_CAPABILITY, the value match > HSA_CAPABILITY defined in Thunk spec: > > SVMAPISupported: it will not be supported on older kernels that don't have > HMM or on systems with GFXv8 or older GPUs without support for 48-bit > virtual addresses. > > CoherentHostAccess property added to HSA_MEMORYPROPERTY, the value > match HSA_MEMORYPROPERTY defined in Thunk spec: > > CoherentHostAccess: whether or not device memory can be coherently > accessed by the host CPU. > > Signed-off-by: Philip Yang > Reviewed-by: Felix Kuehling > Signed-off-by: Felix Kuehling > Signed-off-by: Alex Deucher > --- > drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 6 ++ > drivers/gpu/drm/amd/amdkfd/kfd_topology.h | 10 ++ > 2 files changed, 12 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c > b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c > index cdef608db4f4..083ac9babfa8 100644 > --- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c > @@ -1419,6 +1419,12 @@ int kfd_topology_add_device(struct kfd_dev > *gpu) > dev->node_props.capability |= (adev->ras_features != 0) ? > HSA_CAP_RASEVENTNOTIFY : 0; > > + /* SVM API and HMM page migration work together, device memory > type > + * is initialized to not 0 when page migration register device memory. 
> + */ > + if (adev->kfd.dev->pgmap.type != 0) > + dev->node_props.capability |= > HSA_CAP_SVMAPI_SUPPORTED; > + > kfd_debug_print_topology(); > > if (!res) > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.h > b/drivers/gpu/drm/amd/amdkfd/kfd_topology.h > index b8b68087bd7a..6bd6380b0ee0 100644 > --- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.h > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.h > @@ -53,8 +53,9 @@ > #define HSA_CAP_ASIC_REVISION_MASK 0x03c0 > #define HSA_CAP_ASIC_REVISION_SHIFT 22 > #define HSA_CAP_SRAM_EDCSUPPORTED0x0400 > +#define HSA_CAP_SVMAPI_SUPPORTED 0x0800 > > -#define HSA_CAP_RESERVED 0xf80f8000 > +#define HSA_CAP_RESERVED 0xf00f8000 > > struct kfd_node_properties { > uint64_t hive_id; > @@ -98,9 +99,10 @@ struct kfd_node_properties { > #define HSA_MEM_HEAP_TYPE_GPU_LDS4 > #define HSA_MEM_HEAP_TYPE_GPU_SCRATCH5 > > -#define HSA_MEM_FLAGS_HOT_PLUGGABLE 0x0001 > -#define HSA_MEM_FLAGS_NON_VOLATILE 0x0002 > -#define HSA_MEM_FLAGS_RESERVED 0xfffc > +#define HSA_MEM_FLAGS_HOT_PLUGGABLE 0x0001 > +#define HSA_MEM_FLAGS_NON_VOLATILE 0x0002 > +#define HSA_MEM_FLAGS_COHERENTHOSTACCESS 0x0004 > +#define HSA_MEM_FLAGS_RESERVED 0xfff8 > > struct kfd_mem_properties { > struct list_headlist; > -- > 2.32.0
Re: [PATCH v6 01/10] mm: add zone device coherent type memory support
On 2/11/2022 10:39 AM, David Hildenbrand wrote: On 11.02.22 17:15, David Hildenbrand wrote: On 01.02.22 16:48, Alex Sierra wrote: Device memory that is cache coherent from device and CPU point of view. This is used on platforms that have an advanced system bus (like CAPI or CXL). Any page of a process can be migrated to such memory. However, no one should be allowed to pin such memory so that it can always be evicted. Signed-off-by: Alex Sierra Acked-by: Felix Kuehling Reviewed-by: Alistair Popple So, I'm currently messing with PageAnon() pages and CoW semantics ... all these PageAnon() ZONE_DEVICE variants don't necessarily make my life easier but I'm not sure yet if they make my life harder. I hope you can help me understand some of that stuff. 1) What are expected CoW semantics for DEVICE_COHERENT? I assume we'll share them just like other PageAnon() pages during fork() readable, and the first sharer writing to them receives an "ordinary" !ZONE_DEVICE copy. So this would be just like DEVICE_EXCLUSIVE CoW handling I assume, just that we don't have to go through the loop of restoring a device exclusive entry? 2) How are these pages freed to clear/invalidate PageAnon() ? I assume for PageAnon() ZONE_DEVICE pages we'll always go via free_devmap_managed_page(), correct? 3) FOLL_PIN While you write "no one should be allowed to pin such memory", patch #2 only blocks FOLL_LONGTERM. So I assume we allow ordinary FOLL_PIN and you might want to be a bit more precise? Device coherent pages can be FOLL_PIN. However, we want to avoid FOLL_LONGTERM with these pages because that would affect our memory manager's ability to evict device memory. Regards, Alex Sierra ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages, but can we FOLL_PIN DEVICE_EXCLUSIVE pages? I strongly assume so? Thanks for any information.
(digging a bit more, I realized that device exclusive pages are not actually/necessarily ZONE_DEVICE pages -- so I assume DEVICE_COHERENT will be the actual first PageAnon() ZONE_DEVICE pages that can be present in a page table.) Yes, that's correct. Device coherent pages are pte present. Regards, Alex Sierra
Re: start sorting out the ZONE_DEVICE refcount mess v2
Christoph, Thanks a lot for rebasing our patches. I just ran our amdgpu hmm-tests with this series and all passed. Regards, Alex Sierra On 2/10/2022 1:28 AM, Christoph Hellwig wrote: Hi all, this series removes the offset by one refcount for ZONE_DEVICE pages that are freed back to the driver owning them, which is just device private ones for now, but also the planned device coherent pages and the enhanced p2p ones pending. It does not address the fsdax pages yet, which will be attacked in a follow on series. Note that if we want to get the p2p series rebased on top of this we'll need a git branch for this series. I could offer to host one. A git tree is available here: git://git.infradead.org/users/hch/misc.git pgmap-refcount Gitweb: http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/pgmap-refcount Changes since v1: - add a missing memremap.h include in memcontrol.c - include rebased versions of the device coherent support and device coherent migration support series as well as additional cleanup patches Diffstat: arch/arm64/mm/mmu.c |1 arch/powerpc/kvm/book3s_hv_uvmem.c |1 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 35 - drivers/gpu/drm/amd/amdkfd/kfd_priv.h|1 drivers/gpu/drm/drm_cache.c |2 drivers/gpu/drm/nouveau/nouveau_dmem.c |3 drivers/gpu/drm/nouveau/nouveau_svm.c|1 drivers/infiniband/core/rw.c |1 drivers/nvdimm/pmem.h|1 drivers/nvme/host/pci.c |1 drivers/nvme/target/io-cmd-bdev.c|1 fs/Kconfig |2 fs/fuse/virtio_fs.c |1 include/linux/hmm.h |9 include/linux/memremap.h | 36 + include/linux/migrate.h |1 include/linux/mm.h | 59 -- lib/test_hmm.c | 353 ++--- lib/test_hmm_uapi.h | 22 mm/Kconfig |7 mm/Makefile |1 mm/gup.c | 127 +++- mm/internal.h|3 mm/memcontrol.c | 19 mm/memory-failure.c |8 mm/memremap.c| 75 +- mm/migrate.c | 763 mm/migrate_device.c | 822 +++ mm/rmap.c|5 mm/swap.c| 49 - tools/testing/selftests/vm/Makefile |2 tools/testing/selftests/vm/hmm-tests.c | 204 ++- tools/testing/selftests/vm/test_hmm.sh | 24 33 files changed, 1552 insertions(+), 1088 deletions(-)
Re: [PATCH v5 09/10] tools: update hmm-test to support device coherent type
Hi Alistair, This is the last patch to be reviewed from this series. It already has the changes from your last feedback (V3). Would you mind taking a look? Thanks a lot for reviewing the rest! Regards, Alex Sierra On 1/28/2022 2:08 PM, Alex Sierra wrote: Test cases such as migrate_fault and migrate_multiple were modified to explicitly migrate from device to sys memory without the need of page faults, when using device coherent type. Snapshot test case updated to read memory device type first and, based on that, get the proper returned results. migrate_ping_pong test case added to test explicit migration from device to sys memory for both private and coherent zone types. Helpers to migrate from device to sys memory and vice versa were also added. Signed-off-by: Alex Sierra Acked-by: Felix Kuehling --- v2: Set FIXTURE_VARIANT to add multiple device types to the FIXTURE. This will run all the tests for each device type (private and coherent) in case both exist when the hmm-test driver is probed. v4: Check for the number of pages successfully migrated from coherent device to system at the migrate_multiple test.
--- tools/testing/selftests/vm/hmm-tests.c | 123 - 1 file changed, 102 insertions(+), 21 deletions(-) diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c index 203323967b50..84ec8c4a1dc7 100644 --- a/tools/testing/selftests/vm/hmm-tests.c +++ b/tools/testing/selftests/vm/hmm-tests.c @@ -44,6 +44,14 @@ struct hmm_buffer { int fd; uint64_tcpages; uint64_tfaults; + int zone_device_type; +}; + +enum { + HMM_PRIVATE_DEVICE_ONE, + HMM_PRIVATE_DEVICE_TWO, + HMM_COHERENCE_DEVICE_ONE, + HMM_COHERENCE_DEVICE_TWO, }; #define TWOMEG (1 << 21) @@ -60,6 +68,21 @@ FIXTURE(hmm) unsigned intpage_shift; }; +FIXTURE_VARIANT(hmm) +{ + int device_number; +}; + +FIXTURE_VARIANT_ADD(hmm, hmm_device_private) +{ + .device_number = HMM_PRIVATE_DEVICE_ONE, +}; + +FIXTURE_VARIANT_ADD(hmm, hmm_device_coherent) +{ + .device_number = HMM_COHERENCE_DEVICE_ONE, +}; + FIXTURE(hmm2) { int fd0; @@ -68,6 +91,24 @@ FIXTURE(hmm2) unsigned intpage_shift; }; +FIXTURE_VARIANT(hmm2) +{ + int device_number0; + int device_number1; +}; + +FIXTURE_VARIANT_ADD(hmm2, hmm2_device_private) +{ + .device_number0 = HMM_PRIVATE_DEVICE_ONE, + .device_number1 = HMM_PRIVATE_DEVICE_TWO, +}; + +FIXTURE_VARIANT_ADD(hmm2, hmm2_device_coherent) +{ + .device_number0 = HMM_COHERENCE_DEVICE_ONE, + .device_number1 = HMM_COHERENCE_DEVICE_TWO, +}; + static int hmm_open(int unit) { char pathname[HMM_PATH_MAX]; @@ -81,12 +122,19 @@ static int hmm_open(int unit) return fd; } +static bool hmm_is_coherent_type(int dev_num) +{ + return (dev_num >= HMM_COHERENCE_DEVICE_ONE); +} + FIXTURE_SETUP(hmm) { self->page_size = sysconf(_SC_PAGE_SIZE); self->page_shift = ffs(self->page_size) - 1; - self->fd = hmm_open(0); + self->fd = hmm_open(variant->device_number); + if (self->fd < 0 && hmm_is_coherent_type(variant->device_number)) + SKIP(exit(0), "DEVICE_COHERENT not available"); ASSERT_GE(self->fd, 0); } @@ -95,9 +143,11 @@ FIXTURE_SETUP(hmm2) self->page_size = sysconf(_SC_PAGE_SIZE); self->page_shift = 
ffs(self->page_size) - 1; - self->fd0 = hmm_open(0); + self->fd0 = hmm_open(variant->device_number0); + if (self->fd0 < 0 && hmm_is_coherent_type(variant->device_number0)) + SKIP(exit(0), "DEVICE_COHERENT not available"); ASSERT_GE(self->fd0, 0); - self->fd1 = hmm_open(1); + self->fd1 = hmm_open(variant->device_number1); ASSERT_GE(self->fd1, 0); } @@ -144,6 +194,7 @@ static int hmm_dmirror_cmd(int fd, } buffer->cpages = cmd.cpages; buffer->faults = cmd.faults; + buffer->zone_device_type = cmd.zone_device_type; return 0; } @@ -211,6 +262,20 @@ static void hmm_nanosleep(unsigned int n) nanosleep(, NULL); } +static int hmm_migrate_sys_to_dev(int fd, + struct hmm_buffer *buffer, + unsigned long npages) +{ + return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_DEV, buffer, npages); +} + +static int hmm_migrate_dev_to_sys(int fd, + struct hmm_buffer *buffer, + unsigned long npages) +{ + return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_SYS, buffer, npages); +} + /* * Simple NULL test of device open/close. */ @@ -875,7 +940,7 @@ TEST_F(hmm, migrate) ptr[i] = i; /* Migrate memory to device. */ - ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages); + ret =
Re: [PATCH v4 00/10] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping
Andrew, We're somewhat new to this procedure. Are you referring to rebasing this patch series onto git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git <5.17-rc1 tag>? Regards, Alex Sierra Alex Deucher, Just a quick heads up. This patch series contains changes to the amdgpu driver which we're planning to merge through Andrew's tree, if that's OK with you. Regards, Alex Sierra On 1/27/2022 4:32 PM, Andrew Morton wrote: On Wed, 26 Jan 2022 21:09:39 -0600 Alex Sierra wrote: This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory owned by a device that can be mapped into CPU page tables like MEMORY_DEVICE_GENERIC and can also be migrated like MEMORY_DEVICE_PRIVATE. Some more reviewer input appears to be desirable here. I was going to tentatively add it to -mm and -next, but problems. 5.17-rc1's mm/migrate.c:migrate_vma_check_page() is rather different from the tree you patched. Please redo, refresh and resend?
Re: [PATCH v3 09/10] tools: update hmm-test to support device coherent type
On 1/20/2022 12:14 AM, Alistair Popple wrote: On Tuesday, 11 January 2022 9:32:00 AM AEDT Alex Sierra wrote: Test cases such as migrate_fault and migrate_multiple were modified to explicitly migrate from device to sys memory without the need of page faults, when using device coherent type. Snapshot test case updated to read memory device type first and, based on that, get the proper returned results. migrate_ping_pong test case Where is the migrate_ping_pong test? Did you perhaps forget to add it? :-) Migration from device coherent to system is tested with migrate_multiple too. Therefore, I've removed the migrate_ping_pong test. BTW, I just added the "number of pages migrated" checker after migrating from coherent to system in the v4 series. Regards, Alejandro Sierra added to test explicit migration from device to sys memory for both private and coherent zone types. Helpers to migrate from device to sys memory and vice versa were also added. Signed-off-by: Alex Sierra --- v2: Set FIXTURE_VARIANT to add multiple device types to the FIXTURE. This will run all the tests for each device type (private and coherent) in case both exist when the hmm-test driver is probed.
---
 tools/testing/selftests/vm/hmm-tests.c | 122 -
 1 file changed, 101 insertions(+), 21 deletions(-)

diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 864f126ffd78..8eb81dfba4b3 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -44,6 +44,14 @@ struct hmm_buffer {
 	int		fd;
 	uint64_t	cpages;
 	uint64_t	faults;
+	int		zone_device_type;
+};
+
+enum {
+	HMM_PRIVATE_DEVICE_ONE,
+	HMM_PRIVATE_DEVICE_TWO,
+	HMM_COHERENCE_DEVICE_ONE,
+	HMM_COHERENCE_DEVICE_TWO,
 };
 
 #define TWOMEG		(1 << 21)
@@ -60,6 +68,21 @@ FIXTURE(hmm)
 	unsigned int	page_shift;
 };
 
+FIXTURE_VARIANT(hmm)
+{
+	int	device_number;
+};
+
+FIXTURE_VARIANT_ADD(hmm, hmm_device_private)
+{
+	.device_number = HMM_PRIVATE_DEVICE_ONE,
+};
+
+FIXTURE_VARIANT_ADD(hmm, hmm_device_coherent)
+{
+	.device_number = HMM_COHERENCE_DEVICE_ONE,
+};
+
 FIXTURE(hmm2)
 {
 	int	fd0;
@@ -68,6 +91,24 @@ FIXTURE(hmm2)
 	unsigned int	page_shift;
 };
 
+FIXTURE_VARIANT(hmm2)
+{
+	int	device_number0;
+	int	device_number1;
+};
+
+FIXTURE_VARIANT_ADD(hmm2, hmm2_device_private)
+{
+	.device_number0 = HMM_PRIVATE_DEVICE_ONE,
+	.device_number1 = HMM_PRIVATE_DEVICE_TWO,
+};
+
+FIXTURE_VARIANT_ADD(hmm2, hmm2_device_coherent)
+{
+	.device_number0 = HMM_COHERENCE_DEVICE_ONE,
+	.device_number1 = HMM_COHERENCE_DEVICE_TWO,
+};
+
 static int hmm_open(int unit)
 {
 	char pathname[HMM_PATH_MAX];
@@ -81,12 +122,19 @@ static int hmm_open(int unit)
 	return fd;
 }
 
+static bool hmm_is_coherent_type(int dev_num)
+{
+	return (dev_num >= HMM_COHERENCE_DEVICE_ONE);
+}
+
 FIXTURE_SETUP(hmm)
 {
 	self->page_size = sysconf(_SC_PAGE_SIZE);
 	self->page_shift = ffs(self->page_size) - 1;
 
-	self->fd = hmm_open(0);
+	self->fd = hmm_open(variant->device_number);
+	if (self->fd < 0 && hmm_is_coherent_type(variant->device_number))
+		SKIP(exit(0), "DEVICE_COHERENT not available");
 	ASSERT_GE(self->fd, 0);
 }
 
@@ -95,9 +143,11 @@ FIXTURE_SETUP(hmm2)
 	self->page_size = sysconf(_SC_PAGE_SIZE);
 	self->page_shift = ffs(self->page_size) - 1;
 
-	self->fd0 = hmm_open(0);
+	self->fd0 = hmm_open(variant->device_number0);
+	if (self->fd0 < 0 && hmm_is_coherent_type(variant->device_number0))
+		SKIP(exit(0), "DEVICE_COHERENT not available");
 	ASSERT_GE(self->fd0, 0);
-	self->fd1 = hmm_open(1);
+	self->fd1 = hmm_open(variant->device_number1);
 	ASSERT_GE(self->fd1, 0);
 }
 
@@ -144,6 +194,7 @@ static int hmm_dmirror_cmd(int fd,
 	}
 	buffer->cpages = cmd.cpages;
 	buffer->faults = cmd.faults;
+	buffer->zone_device_type = cmd.zone_device_type;
 
 	return 0;
 }
@@ -211,6 +262,20 @@ static void hmm_nanosleep(unsigned int n)
 	nanosleep(&t, NULL);
 }
 
+static int hmm_migrate_sys_to_dev(int fd,
+				  struct hmm_buffer *buffer,
+				  unsigned long npages)
+{
+	return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_DEV, buffer, npages);
+}
+
+static int hmm_migrate_dev_to_sys(int fd,
+				  struct hmm_buffer *buffer,
+				  unsigned long npages)
+{
+	return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_SYS, buffer, npages);
+}
+
 /*
  * Simple NULL test of device open/close.
  */
@@ -875,7 +940,7 @@ TEST_F(hmm, migrate)
 		ptr[i] = i;
 
 	/* Migrate memory to device. */
-	ret = hmm_dmirror_cmd(self->fd,
Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system
On 12/9/2021 10:29 AM, Felix Kuehling wrote: Am 2021-12-09 um 5:53 a.m. schrieb Alistair Popple: On Thursday, 9 December 2021 5:55:26 AM AEDT Sierra Guiza, Alejandro (Alex) wrote: On 12/8/2021 11:30 AM, Felix Kuehling wrote: Am 2021-12-08 um 11:58 a.m. schrieb Felix Kuehling: Am 2021-12-08 um 6:31 a.m. schrieb Alistair Popple: On Tuesday, 7 December 2021 5:52:43 AM AEDT Alex Sierra wrote: Avoid long term pinning for Coherent device type pages. This could interfere with their own device memory manager. If a caller tries to get user device coherent pages with the PIN_LONGTERM flag set, those pages will be migrated back to system memory. Signed-off-by: Alex Sierra
---
 mm/gup.c | 32 ++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 886d6148d3d0..1572eacf07f4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1689,17 +1689,37 @@ struct page *get_dump_page(unsigned long addr)
 #endif /* CONFIG_ELF_CORE */
 
 #ifdef CONFIG_MIGRATION
+static int migrate_device_page(unsigned long address,
+			struct page *page)
+{
+	struct vm_area_struct *vma = find_vma(current->mm, address);
+	struct vm_fault vmf = {
+		.vma = vma,
+		.address = address & PAGE_MASK,
+		.flags = FAULT_FLAG_USER,
+		.pgoff = linear_page_index(vma, address),
+		.gfp_mask = GFP_KERNEL,
+		.page = page,
+	};
+
+	if (page->pgmap && page->pgmap->ops->migrate_to_ram)
+		return page->pgmap->ops->migrate_to_ram(&vmf);

How does this synchronise against pgmap being released? As I understand things, at this point we're not holding a reference on either the page or the pgmap, so the page and therefore the pgmap may have been freed. I think a similar problem exists for device private fault handling as well, and it has been on my list of things to fix for a while. I think the solution is to call try_get_page(), except it doesn't work with device pages due to the whole refcount thing. That issue is blocking a fair bit of work now, so I've started looking into it.
At least the page should have been pinned by the __get_user_pages_locked call in __gup_longterm_locked. That refcount is dropped in check_and_migrate_movable_pages when it returns 0 or an error. Never mind. We unpin the pages first. Alex, would the migration work if we unpinned them afterwards? Also, the normal CPU page fault code path seems to make sure the page is locked (check in pfn_swap_entry_to_page) before calling migrate_to_ram. I don't think that's true. The check in pfn_swap_entry_to_page() is only for migration entries: BUG_ON(is_migration_entry(entry) && !PageLocked(p)); As this is coherent memory though, why do we have to call into a device driver to do the migration? Couldn't this all be done in the kernel? I think you're right. I hadn't thought of that, mainly because I'm even less familiar with the non-device migration code. Alex, can you give that a try? As long as the driver still gets a page-free callback when the device page is freed, it should work. ACK. Will do. Alex Sierra Regards, Felix No, you cannot unpin after migration, due to the expected_count vs page_count check in migrate_page_move_mapping during the migrate_page call. Regards, Alex Sierra Regards, Felix
Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system
On 12/8/2021 11:30 AM, Felix Kuehling wrote: Am 2021-12-08 um 11:58 a.m. schrieb Felix Kuehling: Am 2021-12-08 um 6:31 a.m. schrieb Alistair Popple: On Tuesday, 7 December 2021 5:52:43 AM AEDT Alex Sierra wrote: Avoid long term pinning for Coherent device type pages. This could interfere with their own device memory manager. If caller tries to get user device coherent pages with PIN_LONGTERM flag set, those pages will be migrated back to system memory. Signed-off-by: Alex Sierra --- mm/gup.c | 32 ++-- 1 file changed, 30 insertions(+), 2 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 886d6148d3d0..1572eacf07f4 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1689,17 +1689,37 @@ struct page *get_dump_page(unsigned long addr) #endif /* CONFIG_ELF_CORE */ #ifdef CONFIG_MIGRATION +static int migrate_device_page(unsigned long address, + struct page *page) +{ + struct vm_area_struct *vma = find_vma(current->mm, address); + struct vm_fault vmf = { + .vma = vma, + .address = address & PAGE_MASK, + .flags = FAULT_FLAG_USER, + .pgoff = linear_page_index(vma, address), + .gfp_mask = GFP_KERNEL, + .page = page, + }; + if (page->pgmap && page->pgmap->ops->migrate_to_ram) + return page->pgmap->ops->migrate_to_ram(); How does this synchronise against pgmap being released? As I understand things at this point we're not holding a reference on either the page or pgmap, so the page and therefore the pgmap may have been freed. I think a similar problem exists for device private fault handling as well and it has been on my list of things to fix for a while. I think the solution is to call try_get_page(), except it doesn't work with device pages due to the whole refcount thing. That issue is blocking a fair bit of work now so I've started looking into it. At least the page should have been pinned by the __get_user_pages_locked call in __gup_longterm_locked. That refcount is dropped in check_and_migrate_movable_pages when it returns 0 or an error. Never mind. We unpin the pages first. 
Alex, would the migration work if we unpinned them afterwards? Also, the normal CPU page fault code path seems to make sure the page is locked (check in pfn_swap_entry_to_page) before calling migrate_to_ram. No, you cannot unpin after migration, due to the expected_count vs page_count check in migrate_page_move_mapping during the migrate_page call. Regards, Alex Sierra Regards, Felix
Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount
On 10/14/2021 3:57 PM, Ralph Campbell wrote: On 10/14/21 11:01 AM, Jason Gunthorpe wrote: On Thu, Oct 14, 2021 at 10:35:27AM -0700, Ralph Campbell wrote: I ran xfstests-dev using the kernel boot option to "fake" a pmem device when I first posted this patch. The tests ran OK (or at least the same tests passed with and without my patch). Hmm. I know nothing of xfstests but tests/generic/413 looks kind of like it might cover this situation? Did it run for you? Jason I don't remember. I'll have to rerun the test, which might take a day or two to set up again. I just ran generic/413 on my side using a fake pmem device. It does fail. I remember we proposed a fix in this patch before try_get_page was removed:

@@ -1186,7 +1153,7 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags);
 static inline __must_check bool try_get_page(struct page *page)
 {
 	page = compound_head(page);
-	if (WARN_ON_ONCE(page_ref_count(page) <= 0))
+	if (WARN_ON_ONCE(page_ref_count(page) < (int)!is_zone_device_page(page)))
 		return false;
 	page_ref_inc(page);
 	return true;

Alex
Re: [PATCH v2 09/12] lib: test_hmm add module param for zone device type
On 9/21/2021 12:14 AM, Alistair Popple wrote: On Tuesday, 21 September 2021 6:05:30 AM AEST Sierra Guiza, Alejandro (Alex) wrote: On 9/20/2021 3:53 AM, Alistair Popple wrote: On Tuesday, 14 September 2021 2:16:01 AM AEST Alex Sierra wrote: In order to configure device public in test_hmm, two module parameters should be passed, which correspond to the SP start address of each device (2): spm_addr_dev0 & spm_addr_dev1. If no parameters are passed, the private device type is configured. It's a pity that testing this seems to require some amount of special setup. Is there a way this could be made to work on a more standard setup, similar to how DEVICE_PRIVATE is tested? Hi Alistair We tried to make it as simple as possible. Unfortunately, there are two main requirements to register device memory as the DEVICE_PUBLIC type. This memory must NOT be accessed by any memory allocator (SLAB, SLOB, SLUB), and it has to be coherently accessible from the CPU. We also want to avoid aliasing the same PFNs for different page types (regular system memory and DEVICE_PUBLIC). So we don't want the reserved memory to be part of the kernel's memory map before we call memremap_pages. A transparent way of doing this, without any special HW, was to set a portion of system memory aside as SPM (special purpose memory) and use it as our fake device memory. Ok, I think it's great that we can test this without special HW, but the boot time configuration is still a bit annoying. Would it be possible to allocate memory fitting the above requirements by hot unplugging it with something like offline_and_remove_memory()? I also don't see why the DEVICE_PRIVATE and DEVICE_PUBLIC testing should be mutually exclusive - why can't we test both without reloading the module? You could do both DEVICE_PRIVATE and DEVICE_PUBLIC tests by running the test_hmm.sh script twice, just passing the proper parameters, even when booted with fake EFI SP regions.
If spm_address_dev0/1 parameters are passed, the module gets configured with DEVICE_PUBLIC type. Otherwise DEVICE_PRIVATE type is set. Technically the only complication in testing DEVICE_PUBLIC is the fake SPM boot parameter. Alex Sierra - Alistair Regards, Alex Sierra Signed-off-by: Alex Sierra --- v5: Remove devmem->pagemap.type = MEMORY_DEVICE_PRIVATE at dmirror_allocate_chunk that was forcing to configure pagemap.type to MEMORY_DEVICE_PRIVATE v6: Check for null pointers for resource and memremap references at dmirror_allocate_chunk v7: Due to patch dropped from these patch series "kernel: resource: lookup_resource as exported symbol", lookup_resource was not longer a callable function. This was used in public device configuration, to get start and end addresses, to create pgmap->range struct. This information is now taken directly from the spm_addr_devX parameters and the fixed size DEVMEM_CHUNK_SIZE. --- lib/test_hmm.c | 66 +++-- lib/test_hmm_uapi.h | 1 + 2 files changed, 47 insertions(+), 20 deletions(-) diff --git a/lib/test_hmm.c b/lib/test_hmm.c index 3cd91ca31dd7..ef27e355738a 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -33,6 +33,16 @@ #define DEVMEM_CHUNK_SIZE(256 * 1024 * 1024U) #define DEVMEM_CHUNKS_RESERVE16 +static unsigned long spm_addr_dev0; +module_param(spm_addr_dev0, long, 0644); +MODULE_PARM_DESC(spm_addr_dev0, + "Specify start address for SPM (special purpose memory) used for device 0. By setting this Generic device type will be used. Make sure spm_addr_dev1 is set too"); + +static unsigned long spm_addr_dev1; +module_param(spm_addr_dev1, long, 0644); +MODULE_PARM_DESC(spm_addr_dev1, + "Specify start address for SPM (special purpose memory) used for device 1. By setting this Generic device type will be used. 
Make sure spm_addr_dev0 is set too"); + static const struct dev_pagemap_ops dmirror_devmem_ops; static const struct mmu_interval_notifier_ops dmirror_min_ops; static dev_t dmirror_dev; @@ -450,11 +460,11 @@ static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd) return ret; } -static bool dmirror_allocate_chunk(struct dmirror_device *mdevice, +static int dmirror_allocate_chunk(struct dmirror_device *mdevice, struct page **ppage) { struct dmirror_chunk *devmem; - struct resource *res; + struct resource *res = NULL; unsigned long pfn; unsigned long pfn_first; unsigned long pfn_last; @@ -462,17 +472,29 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice, devmem = kzalloc(sizeof(*devmem), GFP_KERNEL); if (!devmem) - return false; + return -ENOMEM; - res = request_free_mem_region(_resource, DEVMEM_CHUNK_SIZE, - "hmm_dmirror"); -
Re: [PATCH v2 09/12] lib: test_hmm add module param for zone device type
On 9/20/2021 3:53 AM, Alistair Popple wrote: On Tuesday, 14 September 2021 2:16:01 AM AEST Alex Sierra wrote: In order to configure device public in test_hmm, two module parameters should be passed, which correspond to the SP start address of each device (2) spm_addr_dev0 & spm_addr_dev1. If no parameters are passed, private device type is configured. It's a pity that testing this seems to require some amount of special setup to test. Is there a way this could be made to work on a more standard setup similar to how DEVICE_PRIVATE is tested? Hi Alistair We tried to do it as simpler as possible. Unfortunately, there are two main requirements to register dev memory as DEVICE_PUBLIC type. This memory must NOT be accessed by any memory allocator (SLAB, SLOB, SLUB) plus, it has to be CPU coherently accessed. We also want to avoid aliasing the same PFNs for different page types (regular system memory and DEVICE_PUBLIC). So we don't want the reserved memory to be part of the kernel's memory map before we call memremap_pages. A transparent way of doing it, without any special HW, was setting a portion of system memory as SPM (Special purpose memory). And use this as our “device fake” memory. Regards, Alex Sierra Signed-off-by: Alex Sierra --- v5: Remove devmem->pagemap.type = MEMORY_DEVICE_PRIVATE at dmirror_allocate_chunk that was forcing to configure pagemap.type to MEMORY_DEVICE_PRIVATE v6: Check for null pointers for resource and memremap references at dmirror_allocate_chunk v7: Due to patch dropped from these patch series "kernel: resource: lookup_resource as exported symbol", lookup_resource was not longer a callable function. This was used in public device configuration, to get start and end addresses, to create pgmap->range struct. This information is now taken directly from the spm_addr_devX parameters and the fixed size DEVMEM_CHUNK_SIZE. 
---
 lib/test_hmm.c      | 66 +++--
 lib/test_hmm_uapi.h |  1 +
 2 files changed, 47 insertions(+), 20 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 3cd91ca31dd7..ef27e355738a 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -33,6 +33,16 @@
 #define DEVMEM_CHUNK_SIZE		(256 * 1024 * 1024U)
 #define DEVMEM_CHUNKS_RESERVE		16
 
+static unsigned long spm_addr_dev0;
+module_param(spm_addr_dev0, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev0,
+		"Specify start address for SPM (special purpose memory) used for device 0. By setting this Generic device type will be used. Make sure spm_addr_dev1 is set too");
+
+static unsigned long spm_addr_dev1;
+module_param(spm_addr_dev1, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev1,
+		"Specify start address for SPM (special purpose memory) used for device 1. By setting this Generic device type will be used. Make sure spm_addr_dev0 is set too");
+
 static const struct dev_pagemap_ops dmirror_devmem_ops;
 static const struct mmu_interval_notifier_ops dmirror_min_ops;
 static dev_t dmirror_dev;
@@ -450,11 +460,11 @@ static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
 	return ret;
 }
 
-static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
+static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 				   struct page **ppage)
 {
 	struct dmirror_chunk *devmem;
-	struct resource *res;
+	struct resource *res = NULL;
 	unsigned long pfn;
 	unsigned long pfn_first;
 	unsigned long pfn_last;
@@ -462,17 +472,29 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
 
 	devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
 	if (!devmem)
-		return false;
+		return -ENOMEM;
 
-	res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
-				      "hmm_dmirror");
-	if (IS_ERR(res))
-		goto err_devmem;
+	if (!spm_addr_dev0 && !spm_addr_dev1) {
+		res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
+					      "hmm_dmirror");
+		if (IS_ERR_OR_NULL(res))
+			goto err_devmem;
+		devmem->pagemap.range.start = res->start;
+		devmem->pagemap.range.end = res->end;
+		devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+		mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
+	} else if (spm_addr_dev0 && spm_addr_dev1) {
+		devmem->pagemap.range.start = MINOR(mdevice->cdevice.dev) ?
+					      spm_addr_dev0 :
+					      spm_addr_dev1;
+		devmem->pagemap.range.end = devmem->pagemap.range.start +
+					    DEVMEM_CHUNK_SIZE - 1;
+		devmem->pagemap.type = MEMORY_DEVICE_PUBLIC;
+
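For reference, a configuration sketch of how the two module parameters are meant to be used together with fake SPM carved out at boot. The addresses and the EFI attribute value below are illustrative assumptions, not values taken from this thread:

```shell
# Kernel command line: reserve two 1G ranges as EFI "special purpose
# memory" so they never enter the kernel's normal memory map
# (0x40000 is assumed here to be the EFI_MEMORY_SP attribute bit;
# the addresses are placeholders):
#   efi_fake_mem=1G@0x100000000:0x40000,1G@0x140000000:0x40000

# With both parameters set, test_hmm registers the ranges as the
# public/generic device type; with neither set it falls back to
# MEMORY_DEVICE_PRIVATE backed by request_free_mem_region():
modprobe test_hmm spm_addr_dev0=0x100000000 spm_addr_dev1=0x140000000
```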
Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration
On 8/25/2021 2:46 AM, Christoph Hellwig wrote: On Tue, Aug 24, 2021 at 10:48:17PM -0500, Alex Sierra wrote:

 	} else {
-		if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
+		if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM) &&
+		    !(migrate->flags & MIGRATE_VMA_SELECT_IOMEM))
 			goto next;
 		pfn = pte_pfn(pte);
 		if (is_zero_pfn(pfn)) {

... also, how is this going to work for the device public memory? That should be pte_special() and thus fail vm_normal_page. Perhaps we're missing something, as we're not doing any special marking for the device public PFNs/entries. pfn_valid() returns true, pte_special() returns false and pfn_t_devmap() returns false on these pages, the same as for system pages. That's why vm_normal_page() correctly returns the page through the pfn_to_page() helper. Regards, Alex S.
Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount
On 8/18/2021 2:28 PM, Ralph Campbell wrote: On 8/17/21 5:35 PM, Felix Kuehling wrote: Am 2021-08-17 um 8:01 p.m. schrieb Ralph Campbell: On 8/12/21 11:31 PM, Alex Sierra wrote: From: Ralph Campbell ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE. v2: AS: merged this patch in linux 5.11 version v5: AS: add condition at try_grab_page to check for the zone device type, while page ref counter is checked less/equal to zero. In case of device zone, pages ref counter are initialized to zero. Signed-off-by: Ralph Campbell Signed-off-by: Alex Sierra
---
 arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 fs/dax.c                               |  4 +-
 include/linux/dax.h                    |  2 +-
 include/linux/memremap.h               |  7 +--
 include/linux/mm.h                     | 13 +----
 lib/test_hmm.c                         |  2 +-
 mm/internal.h                          |  8 +++
 mm/memremap.c                          | 68 +++---
 mm/migrate.c                           |  5 --
 mm/page_alloc.c                        |  3 ++
 mm/swap.c                              | 45 ++---
 12 files changed, 46 insertions(+), 115 deletions(-)

I haven't seen a response to the issues I raised back at v3 of this series: https://lore.kernel.org/linux-mm/4f6dd918-d79b-1aa7-3a4c-caa67ddc29bc@nvidia.com/ Did I miss something? I think part of the response was that we did more testing. Alex added support for DEVICE_GENERIC pages to test_hmm and he ran DAX tests recommended by Theodore Ts'o.
In that testing he ran into a WARN_ON_ONCE about a zero page refcount in try_get_page. The fix is in the latest version of patch 2. But it's already obsolete because John Hubbard is about to remove that function altogether. I think the issues you raised were more uncertainty than known bugs. It seems the fact that you can have DAX pages with 0 refcount is a feature more than a bug. Regards, Felix Did you test on a system without CONFIG_ARCH_HAS_PTE_SPECIAL defined? In that case, mmap() of a DAX device will call insert_page() which calls get_page() which would trigger VM_BUG_ON_PAGE(). I can believe it is OK for PTE_SPECIAL page table entries to have no struct page, or that MEMORY_DEVICE_GENERIC struct pages be mapped with a zero reference count using insert_pfn(). Hi Ralph, We have tried the DAX tests with and without CONFIG_ARCH_HAS_PTE_SPECIAL defined. Apparently none of the tests touches that condition for a DAX device. Of course, that doesn't mean it couldn't happen. Regards, Alex S. I find it hard to believe that other MM developers don't see an issue with a struct page with refcount == 0 and mapcount == 1. I don't see where init_page_count() is being called for the MEMORY_DEVICE_GENERIC or MEMORY_DEVICE_PRIVATE struct pages the AMD driver allocates and passes to migrate_vma_setup(). Looks like svm_migrate_get_vram_page() needs to call init_page_count() instead of get_page(). (I'm looking at branch origin/alexsierrag/device_generic https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver.git) Yes, you're right. My bad. Thanks for catching this.
I didn't realize I had not defined CONFIG_DEBUG_VM in my build, so this BUG was never caught. It worked after I replaced get_page with init_page_count in svm_migrate_get_vram_page. However, I don't think this is the best way to fix it. Ideally, the get_page call should work for device pages with a refcount equal to 0 too. Otherwise, we could overwrite the refcount if someone else grabs the page concurrently. I was thinking of adding a special condition in get_page for device pages. This could also fix the insert_page -> get_page call from a DAX device. Regards, Alex S. Also, what about the other places where is_device_private_page() is called? Don't they need to be updated to call is_device_page() instead? One of my goals for this patch was to
Re: [PATCH v6 05/13] drm/amdkfd: generic type as sys mem on migration to ram
On 8/15/2021 10:38 AM, Christoph Hellwig wrote: On Fri, Aug 13, 2021 at 01:31:42AM -0500, Alex Sierra wrote:

 	migrate.vma = vma;
 	migrate.start = start;
 	migrate.end = end;
-	migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
 	migrate.pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev);
+	if (adev->gmc.xgmi.connected_to_cpu)
+		migrate.flags = MIGRATE_VMA_SELECT_SYSTEM;
+	else
+		migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;

It's been a while since I touched this migrate code, but doesn't this mean that if the range already contains system memory the migration now won't do anything, for the connected_to_cpu case? In the connected_to_cpu case, we're explicitly migrating from device memory to system memory with the device generic type. With this type, device PTEs are present in the CPU page table. During the migrate_vma_collect_pmd walk in the migrate_vma_setup call, a present PTE requires migrate->flags to have MIGRATE_VMA_SELECT_SYSTEM set; otherwise, the migration for that entry is ignored. Regards, Alex S.
Re: [PATCH v5 00/13] Support DEVICE_GENERIC memory in migrate_vma_*
On 8/12/2021 1:34 AM, Christoph Hellwig wrote: Do you have a pointer to a git branch with this series and all dependencies to ease testing? Hi Christoph, Yes, actually tomorrow we're planning to send a new version with detailed instructions on how to setup and run each of the tests. This will also contain a link to download our kernel repo with all these patches. Regards, Alejandro S. On Thu, Aug 12, 2021 at 01:30:47AM -0500, Alex Sierra wrote: v1: AMD is building a system architecture for the Frontier supercomputer with a coherent interconnect between CPUs and GPUs. This hardware architecture allows the CPUs to coherently access GPU device memory. We have hardware in our labs and we are working with our partner HPE on the BIOS, firmware and software for delivery to the DOE. The system BIOS advertises the GPU device memory (aka VRAM) as SPM (special purpose memory) in the UEFI system address map. The amdgpu driver looks it up with lookup_resource and registers it with devmap as MEMORY_DEVICE_GENERIC using devm_memremap_pages. Now we're trying to migrate data to and from that memory using the migrate_vma_* helpers so we can support page-based migration in our unified memory allocations, while also supporting CPU access to those pages. This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave correctly in the migrate_vma_* helpers. We are looking for feedback about this approach. If we're close, what's needed to make our patches acceptable upstream? If we're not close, any suggestions how else to achieve what we are trying to do (i.e. page migration and coherent CPU access to VRAM)? 
This work is based on HMM and our SVM memory manager that was recently upstreamed to Dave Airlie's drm-next branch: https://cgit.freedesktop.org/drm/drm/log/?h=drm-next On top of that we did some rework of our VRAM management for migrations to remove some incorrect assumptions, allow partially successful migrations and GPU memory mappings that mix pages in VRAM and system memory: https://lore.kernel.org/dri-devel/20210527205606.2660-6-Felix.Kuehling@amd.com/T/#r996356015e295780eb50453e7dbd5d0d68b47cbc v2: This patch series version has merged the "[RFC PATCH v3 0/2] mm: remove extra ZONE_DEVICE struct page refcount" patch series made by Ralph Campbell. Our changes to support the device generic type in the migrate_vma helpers also apply on top of that series. This has been tested on systems with device memory that has coherent access by the CPU. It also addresses the following feedback made in v1: - Isolate the kernel/resource.c modification in one patch, based on Christoph's feedback. - Add helper checks for the generic and private types to avoid duplicated long lines. v3: - Include cover letter from v1. - Rename dax_layout_is_idle_page func to dax_page_unused in patch ext4/xfs: add page refcount helper.
v4: - Add support for zone device generic type in lib/test_hmm and tools/testing/selftests/vm/hmm-tests. - Add missing page refcount helper to fuse/dax.c. This was included in one of Ralph Campbell's patches. v5: - Cosmetic changes on patches 3, 5 and 13. - Bug found at test_hmm: remove devmem->pagemap.type = MEMORY_DEVICE_PRIVATE at dmirror_allocate_chunk, which was forcing pagemap.type to MEMORY_DEVICE_PRIVATE. - A bug was found while running one of the xfstests (generic/413) used to validate the fs_dax device type. This was first introduced by the patch "mm: remove extra ZONE_DEVICE struct page refcount", which is part of this patch series. The bug showed up as a WARNING at the try_grab_page function call, due to a page refcount equal to zero. Part of the "mm: remove extra ZONE_DEVICE struct page refcount" changes was to initialize the page refcount to zero. Therefore, a special condition was added to try_grab_page in this v5, where it checks for device zone pages too. It is included in the same patch. This is how the mm changes from this patch series have been validated: - hmm-tests were run using the device private and device generic types. The latter was just added in this patch series. efi_fake_mem was used to mimic SPM memory for device generic. - The xfstests tool was used to validate the fs-dax device type and the page refcount changes. DAX configuration was used along with emulated
Re: [PATCH v4 10/13] lib: test_hmm add module param for zone device type
On 7/22/2021 12:26 PM, Jason Gunthorpe wrote: On Thu, Jul 22, 2021 at 11:59:17AM -0500, Sierra Guiza, Alejandro (Alex) wrote: On 7/22/2021 7:23 AM, Jason Gunthorpe wrote: On Sat, Jul 17, 2021 at 02:21:32PM -0500, Alex Sierra wrote: In order to configure device generic in test_hmm, two module parameters should be passed, which correspond to the SP start address of each device (2): spm_addr_dev0 & spm_addr_dev1. If no parameters are passed, the private device type is configured. I don't think tests should need configuration like this, is it really necessary? How can people with normal HW run this test? Hi Jason, The idea was to add an easy way to validate the codepaths touched by this patch series, which make modifications to the migration helpers for device generic type pages. We're using CONFIG_EFI_FAKE_MEMMAP to create fake SPM devices inside system memory. No special HW needed. And passing the kernel parameter efi_fake_mem. Ex. efi_fake_mem=1G@0x1:0x4. I should probably include a small example of how to set this in the test_hmm.sh usage(). I don't think anything about hmm is sensitive to how the pages are acquired - you can't create device generic pages without relying on FAKE_MEMMAP? The reason we used the fake SPM approach was to have a "special memory" not managed by Linux (NOT registered as normal system memory), but still accessible by the CPU. For device_generic we cannot allocate new physical addresses. We need the physical address to match the actual system memory physical address, so that CPU mappings work as expected. Would you recommend a different approach? Regards, Alex Sierra Jason
Re: [PATCH v3 0/8] Support DEVICE_GENERIC memory in migrate_vma_*
On 7/17/2021 2:54 PM, Sierra Guiza, Alejandro (Alex) wrote: On 7/16/2021 5:14 PM, Felix Kuehling wrote: On 2021-07-16 at 11:07 a.m., Theodore Y. Ts'o wrote: On Wed, Jun 23, 2021 at 05:49:55PM -0400, Felix Kuehling wrote: I can think of two ways to test the changes for MEMORY_DEVICE_GENERIC in this patch series in a way that is reproducible without special hardware and firmware: For the reference counting changes we could use the dax driver with hmem and use efi_fake_mem on the kernel command line to create some DEVICE_GENERIC pages. I'm open to suggestions for good user mode tests to exercise dax functionality on this type of memory. Sorry for the thread necromancy, but now that the merge window is past No worries. Alejandro should have a new version of this series in a few days, with updates to hmm_test and some fixes. The v4 patch series has been sent for review. https://marc.info/?l=dri-devel=162654971618911=2 Regards, Alex Sierra Today I test ext4's dax support, without having any $$$ DAX hardware, by using the kernel command line "memmap=4G!9G:memmap=9G!14G" which reserves memory to create two pmem devices, and then I run xfstests with DAX enabled using qemu or a Google Compute Engine VM, with TEST_DEV=/dev/pmem0 and SCRATCH_DEV=/dev/pmem1. If you can give me a recipe for what kernel configs I should enable, and what magic kernel command line arguments to use, then I'd be able to test your patch set with ext4. That would be great! Regarding kernel config options, it should be the same as what you're using for DAX testing today. We're not changing or adding any Kconfig options. But let me take a stab: ZONE_DEVICE HMM_MIRROR MMU_NOTIFIER DEVICE_PRIVATE (maybe not needed for your test) FS_DAX Hi Theodore, I wonder if you had a chance to set the kernel configs from Felix to enable DAX in xfstests. I've been trying to test FS DAX on my side using virtio-fs + QEMU. Unfortunately, I'm having some problems setting up the environment with the guest kernel. 
Could you share your VM (QEMU or GCE) setup to run it with xfstests? Regards, Alex S. I'm not sure what you're looking for in terms of kernel command line, other than the memmap options you already found. There are some more options to run hmm_test with fake SPM (DEVICE_GENERIC) memory, but we're already running that ourselves. That will also be in the next revision of this patch series. To run the hmm test with the generic device type enabled, set the following: kernel config: EFI_FAKE_MEMMAP RUNTIME_TESTING_MENU TEST_HMM=m Kernel parameters to fake SP memory. The addresses set here are based on my system's usable memory enumerated by BIOS-e820 at boot. The test requires two SP devices of at least 256MB. efi_fake_mem=1G@0x1:0x4,1G@0x14000:0x4 To run hmm_test with the generic device type, pass the SP addresses to the shell script. sudo ./test_hmm.sh smoke 0x1 0x14000 If you can run your xfstests with DAX on top of this patch series, that would be very helpful. That's to make sure the ZONE_DEVICE page refcount changes don't break DAX. Regards, Felix Cheers, - Ted
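For convenience, the fake-SPM hmm_test recipe from this thread collected in one place. The config symbols are real kernel options; the efi_fake_mem addresses are copied verbatim from the thread and must be adjusted to match your own system's BIOS-e820 map:

```text
# Kernel config
CONFIG_EFI_FAKE_MEMMAP=y
CONFIG_RUNTIME_TESTING_MENU=y
CONFIG_TEST_HMM=m

# Kernel command line: two SP regions of at least 256MB each
efi_fake_mem=1G@0x1:0x4,1G@0x14000:0x4

# Run the smoke test, passing the two SP start addresses
sudo ./test_hmm.sh smoke 0x1 0x14000
```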
Re: [PATCH v3 1/8] ext4/xfs: add page refcount helper
On 6/17/2021 6:52 PM, Dave Chinner wrote: On Thu, Jun 17, 2021 at 10:16:58AM -0500, Alex Sierra wrote: From: Ralph Campbell There are several places where ZONE_DEVICE struct pages assume a reference count == 1 means the page is idle and free. Instead of open coding this, add a helper function to hide this detail. v2: [AS]: rename dax_layout_is_idle_page func to dax_page_unused Did you even compile test this? Thanks for catching this, Dave. My mistake, I was building without CONFIG_FS_DAX, as I was using DEVICE_GENERIC only. I'll send the corrected patch in reply. Regards, Alex Sierra Signed-off-by: Ralph Campbell Signed-off-by: Alex Sierra --- fs/dax.c | 4 ++-- fs/ext4/inode.c | 5 + fs/xfs/xfs_file.c | 4 +--- include/linux/dax.h | 10 ++ 4 files changed, 14 insertions(+), 9 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 26d5dcd2d69e..321f4ddc6643 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping, for_each_mapped_pfn(entry, pfn) { struct page *page = pfn_to_page(pfn); - WARN_ON_ONCE(trunc && page_ref_count(page) > 1); + WARN_ON_ONCE(trunc && !dax_layout_is_idle_page(page)); Because you still use dax_layout_is_idle_page() here, not dax_page_unused()... WARN_ON_ONCE(page->mapping && page->mapping != mapping); page->mapping = NULL; page->index = 0; @@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry) for_each_mapped_pfn(entry, pfn) { struct page *page = pfn_to_page(pfn); - if (page_ref_count(page) > 1) + if (!dax_layout_is_idle_page(page)) Here too. Cheers, Dave.
Re: [PATCH v3 0/8] Support DEVICE_GENERIC memory in migrate_vma_*
On 6/17/2021 10:16 AM, Alex Sierra wrote: v1: AMD is building a system architecture for the Frontier supercomputer with a coherent interconnect between CPUs and GPUs. This hardware architecture allows the CPUs to coherently access GPU device memory. We have hardware in our labs and we are working with our partner HPE on the BIOS, firmware and software for delivery to the DOE. The system BIOS advertises the GPU device memory (aka VRAM) as SPM (special purpose memory) in the UEFI system address map. The amdgpu driver looks it up with lookup_resource and registers it with devmap as MEMORY_DEVICE_GENERIC using devm_memremap_pages. Now we're trying to migrate data to and from that memory using the migrate_vma_* helpers so we can support page-based migration in our unified memory allocations, while also supporting CPU access to those pages. This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave correctly in the migrate_vma_* helpers. We are looking for feedback about this approach. If we're close, what's needed to make our patches acceptable upstream? If we're not close, any suggestions how else to achieve what we are trying to do (i.e. page migration and coherent CPU access to VRAM)? This work is based on HMM and our SVM memory manager that was recently upstreamed to Dave Airlie's drm-next branch https://lore.kernel.org/dri-devel/20210527205606.2660-6-felix.kuehl...@amd.com/T/#r996356015e295780eb50453e7dbd5d0d68b47cbc Corrected link: https://cgit.freedesktop.org/drm/drm/log/?h=drm-next Regards, Alex Sierra On top of that we did some rework of our VRAM management for migrations to remove some incorrect assumptions, allow partially successful migrations and GPU memory mappings that mix pages in VRAM and system memory. 
https://patchwork.kernel.org/project/dri-devel/list/?series=489811 Corrected link: https://lore.kernel.org/dri-devel/20210527205606.2660-6-felix.kuehl...@amd.com/T/#r996356015e295780eb50453e7dbd5d0d68b47cbc Regards, Alex Sierra
v2: This version of the patch series has merged the "[RFC PATCH v3 0/2] mm: remove extra ZONE_DEVICE struct page refcount" patch series by Ralph Campbell. On top of that series, it applies our changes to support the device generic type in the migrate_vma helpers. This has been tested on systems with device memory that is coherently accessible by the CPU. It also addresses the following feedback from v1:
- Isolate the kernel/resource.c modification in one patch, based on Christoph's feedback.
- Add helper checks for the generic and private types to avoid duplicated long lines.
v3:
- Include cover letter from v1.
- Rename the dax_layout_is_idle_page func to dax_page_unused in patch "ext4/xfs: add page refcount helper".
Patches 1-2 rebase Ralph Campbell's ZONE_DEVICE page refcounting patches. Patches 4-5 are for context, to show how we are looking up the SPM memory and registering it with devmap. Patches 3, 6-8 are the changes we are trying to upstream or rework to make them acceptable upstream. 
Alex Sierra (6): kernel: resource: lookup_resource as exported symbol drm/amdkfd: add SPM support for SVM drm/amdkfd: generic type as sys mem on migration to ram include/linux/mm.h: helpers to check zone device generic type mm: add generic type support to migrate_vma helpers mm: call pgmap->ops->page_free for DEVICE_GENERIC pages Ralph Campbell (2): ext4/xfs: add page refcount helper mm: remove extra ZONE_DEVICE struct page refcount arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 15 -- drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +- fs/dax.c | 8 +-- fs/ext4/inode.c | 5 +- fs/xfs/xfs_file.c| 4 +- include/linux/dax.h | 10 include/linux/memremap.h | 7 +-- include/linux/mm.h | 52 +++--- kernel/resource.c| 2 +- lib/test_hmm.c | 2 +- mm/internal.h| 8 +++ mm/memremap.c| 69 +++- mm/migrate.c | 13 ++--- mm/page_alloc.c | 3 ++ mm/swap.c| 45 ++-- 16 files changed, 83 insertions(+), 164 deletions(-)
Re: [PATCH] drm/ttm: fix pipelined gutting for evictions v2
On 7/24/2020 10:01 AM, Felix Kuehling wrote: On 2020-07-24 at 7:56 a.m., Christian König wrote: We can't pipeline that during eviction because the memory needs to be available immediately. v2: fix how we cleanup the BOs resources Signed-off-by: Christian König Reviewed-by: Felix Kuehling It would be good to get a Tested-by from Alex as well. Thanks, Felix Tested-by: Alex Sierra Regards, Alex Sierra --- drivers/gpu/drm/ttm/ttm_bo.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index 0768a054a916..469aa93ea317 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -652,8 +652,12 @@ static int ttm_bo_evict(struct ttm_buffer_object *bo, placement.num_busy_placement = 0; bdev->driver->evict_flags(bo, &placement); - if (!placement.num_placement && !placement.num_busy_placement) - return ttm_bo_pipeline_gutting(bo); + if (!placement.num_placement && !placement.num_busy_placement) { + ttm_bo_wait(bo, false, false); + + ttm_bo_cleanup_memtype_use(bo); + return 0; + } evict_mem = bo->mem; evict_mem.mm_node = NULL; ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel