Re: [PATCH v9 06/14] mm/gup: migrate device coherent pages when pinning instead of failing

2022-07-25 Thread Sierra Guiza, Alejandro (Alex)



On 7/18/2022 3:32 PM, Andrew Morton wrote:

On Mon, 18 Jul 2022 12:56:29 +0200 David Hildenbrand  wrote:


/*
 * Try to move out any movable page before pinning the range.
 */
@@ -1919,7 +1948,8 @@ static long check_and_migrate_movable_pages(unsigned long 
nr_pages,
folio_nr_pages(folio));
}
  
-	if (!list_empty(&movable_page_list) || isolation_error_count)

+   if (!list_empty(&movable_page_list) || isolation_error_count
+   || coherent_pages)

The common style is to

a) add the || to the end of the previous line
b) indent such that we have a nice-to-read alignment

if (!list_empty(&movable_page_list) || isolation_error_count ||
    coherent_pages)


I missed that.  This series is now in mm-stable so any fix will need to
be a standalone followup patch, please.
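For reference, such a follow-up would be a one-hunk style cleanup along
these lines (illustrative only, not an actual posted patch):

-	if (!list_empty(&movable_page_list) || isolation_error_count
-			|| coherent_pages)
+	if (!list_empty(&movable_page_list) || isolation_error_count ||
+	    coherent_pages)
 		goto unpin_pages;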

Hi Andrew,
Just wanted to make sure nothing is missing from our side to merge this 
patch series.


Regards,
Alex Sierra



Apart from that lgtm.

Reviewed-by: David Hildenbrand 

And your reviewed-by's will be lost.  Stupid git.


Re: [PATCH] mm/gup: migrate device coherent pages when pinning instead of failing

2022-07-16 Thread Sierra Guiza, Alejandro (Alex)



On 7/14/2022 9:11 PM, Alistair Popple wrote:

Currently any attempts to pin a device coherent page will fail. This is
because device coherent pages need to be managed by a device driver, and
pinning them would prevent a driver from migrating them off the device.

However this is no reason to fail pinning of these pages. These are
coherent and accessible from the CPU so can be migrated just like
pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin
them first try migrating them out of ZONE_DEVICE.

[hch: rebased to the split device memory checks,
   moved migrate_device_page to migrate_device.c]

Signed-off-by: Alistair Popple 
Acked-by: Felix Kuehling 
Signed-off-by: Christoph Hellwig 
---

This patch hopefully addresses all of David's comments. It replaces both my "mm:
remove the vma check in migrate_vma_setup()" and "mm/gup: migrate device
coherent pages when pinning instead of failing" patches. I'm not sure what the
best way of including this is, perhaps Alex can respin the series with this
patch instead?


For sure Alistair. I'll include this in my next patch series version.

Thanks,
Alex Sierra



  - Alistair

  mm/gup.c| 50 +--
  mm/internal.h   |  1 +
  mm/migrate_device.c | 52 +
  3 files changed, 96 insertions(+), 7 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index b65fe8bf5af4..22b97ab61cd9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1881,7 +1881,7 @@ static long check_and_migrate_movable_pages(unsigned long 
nr_pages,
unsigned long isolation_error_count = 0, i;
struct folio *prev_folio = NULL;
LIST_HEAD(movable_page_list);
-   bool drain_allow = true;
+   bool drain_allow = true, coherent_pages = false;
int ret = 0;
  
  	for (i = 0; i < nr_pages; i++) {

@@ -1891,9 +1891,38 @@ static long check_and_migrate_movable_pages(unsigned 
long nr_pages,
continue;
prev_folio = folio;
  
-		if (folio_is_longterm_pinnable(folio))

+   /*
+* Device coherent pages are managed by a driver and should not
+* be pinned indefinitely as it prevents the driver moving the
+* page. So when trying to pin with FOLL_LONGTERM instead try
+* to migrate the page out of device memory.
+*/
+   if (folio_is_device_coherent(folio)) {
+   /*
+* We always want a new GUP lookup with device coherent
+* pages.
+*/
+   pages[i] = 0;
+   coherent_pages = true;
+
+   /*
+* Migration will fail if the page is pinned, so convert
+* the pin on the source page to a normal reference.
+*/
+   if (gup_flags & FOLL_PIN) {
+   get_page(&folio->page);
+   unpin_user_page(&folio->page);
+   }
+
+   ret = migrate_device_coherent_page(&folio->page);
+   if (ret)
+   goto unpin_pages;
+
continue;
+   }
  
+		if (folio_is_longterm_pinnable(folio))

+   continue;
/*
 * Try to move out any movable page before pinning the range.
 */
@@ -1919,7 +1948,8 @@ static long check_and_migrate_movable_pages(unsigned long 
nr_pages,
folio_nr_pages(folio));
}
  
-	if (!list_empty(&movable_page_list) || isolation_error_count)

+   if (!list_empty(&movable_page_list) || isolation_error_count
+   || coherent_pages)
goto unpin_pages;
  
  	/*

@@ -1929,10 +1959,16 @@ static long check_and_migrate_movable_pages(unsigned 
long nr_pages,
return nr_pages;
  
  unpin_pages:

-   if (gup_flags & FOLL_PIN) {
-   unpin_user_pages(pages, nr_pages);
-   } else {
-   for (i = 0; i < nr_pages; i++)
+   /*
+* pages[i] might be NULL if any device coherent pages were found.
+*/
+   for (i = 0; i < nr_pages; i++) {
+   if (!pages[i])
+   continue;
+
+   if (gup_flags & FOLL_PIN)
+   unpin_user_page(pages[i]);
+   else
put_page(pages[i]);
}
  
diff --git a/mm/internal.h b/mm/internal.h

index c0f8fbe0445b..899dab512c5a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -853,6 +853,7 @@ int numa_migrate_prep(struct page *page, struct 
vm_area_struct *vma,
  unsigned long addr, int page_nid, int *flags);
  
  void free_zone_device_page(struct page *page);

+int migrate_device_coherent_page(struct page *page);
  
  /*

   * mm/gup.c
diff --git a/mm/migrate_device.c 
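The archive truncates the mm/migrate_device.c hunk at this point. As a
rough sketch of the approach the commit message describes (migrating a
single device coherent page to system memory without requiring a VMA),
the new helper looks something like the following; this is illustrative,
not the verbatim hunk:

int migrate_device_coherent_page(struct page *page)
{
	unsigned long src_pfn, dst_pfn = 0;
	struct migrate_vma args;
	struct page *dpage;

	WARN_ON_ONCE(PageCompound(page));

	lock_page(page);
	src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
	args.src = &src_pfn;
	args.dst = &dst_pfn;
	args.cpages = 1;
	args.npages = 1;
	args.vma = NULL;

	/* No VMA: migrate_vma_setup() still does the unmap/isolate work. */
	migrate_vma_setup(&args);

	/* Grab a system memory page to copy the data to. */
	dpage = alloc_page(GFP_HIGHUSER_MOVABLE);
	if (dpage) {
		lock_page(dpage);
		dst_pfn = migrate_pfn(page_to_pfn(dpage));
	}

	/* If no destination was allocated, the MIGRATE flag gets cleared. */
	migrate_vma_pages(&args);
	if (src_pfn & MIGRATE_PFN_MIGRATE)
		copy_highpage(dpage, page);
	migrate_vma_finalize(&args);

	if (src_pfn & MIGRATE_PFN_MIGRATE)
		return 0;
	return -EBUSY;
}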

Re: [PATCH v6 02/14] mm: handling Non-LRU pages returned by vm_normal_pages

2022-06-28 Thread Sierra Guiza, Alejandro (Alex)



On 6/28/2022 5:42 AM, David Hildenbrand wrote:

On 28.06.22 02:14, Alex Sierra wrote:

With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
device-managed anonymous pages that are not LRU pages. Although they
behave like normal pages for purposes of mapping in CPU page tables and
for COW, they do not support LRU lists, NUMA migration or THP.

We also introduced a FOLL_LRU flag that adds the same behaviour to
follow_page and related APIs, to allow callers to specify that they
expect to put pages on an LRU list.

Signed-off-by: Alex Sierra 
Acked-by: Felix Kuehling 
Reviewed-by: Alistair Popple 
---

I think my review feedback regarding FOLL_LRU has been ignored.


Sorry David, this has been addressed in v7.

Regards,
Alex Sierra






Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

2022-06-21 Thread Sierra Guiza, Alejandro (Alex)



On 6/21/2022 7:25 AM, David Hildenbrand wrote:

On 21.06.22 13:55, Alistair Popple wrote:

David Hildenbrand  writes:


On 21.06.22 13:25, Felix Kuehling wrote:

Am 6/17/22 um 23:19 schrieb David Hildenbrand:

On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:

On 6/17/2022 12:33 PM, David Hildenbrand wrote:

On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:

On 6/17/2022 4:40 AM, David Hildenbrand wrote:

On 31.05.22 22:00, Alex Sierra wrote:

Device memory that is cache coherent from device and CPU point of view.
This is used on platforms that have an advanced system bus (like CAPI
or CXL). Any page of a process can be migrated to such memory. However,
no one should be allowed to pin such memory so that it can always be
evicted.

Signed-off-by: Alex Sierra
Acked-by: Felix Kuehling
Reviewed-by: Alistair Popple
[hch: rebased ontop of the refcount changes,
  removed is_dev_private_or_coherent_page]
Signed-off-by: Christoph Hellwig
---
 include/linux/memremap.h | 19 +++
 mm/memcontrol.c  |  7 ---
 mm/memory-failure.c  |  8 ++--
 mm/memremap.c| 10 ++
 mm/migrate_device.c  | 16 +++-
 mm/rmap.c|  5 +++--
 6 files changed, 49 insertions(+), 16 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 8af304f6b504..9f752ebed613 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -41,6 +41,13 @@ struct vmem_altmap {
  * A more complete discussion of unaddressable memory may be found in
  * include/linux/hmm.h and Documentation/vm/hmm.rst.
  *
+ * MEMORY_DEVICE_COHERENT:
+ * Device memory that is cache coherent from device and CPU point of view. This
+ * is used on platforms that have an advanced system bus (like CAPI or CXL). A
+ * driver can hotplug the device memory using ZONE_DEVICE and with that memory
+ * type. Any page of a process can be migrated to such memory. However no one

Any page might not be right, I'm pretty sure. ... just thinking about special 
pages
like vdso, shared zeropage, ... pinned pages ...

Well, you cannot migrate long term pages, that's what I meant :)


+ * should be allowed to pin such memory so that it can always be evicted.
+ *
  * MEMORY_DEVICE_FS_DAX:
  * Host memory that has similar access semantics as System RAM i.e. DMA
  * coherent and supports page pinning. In support of coordinating page
@@ -61,6 +68,7 @@ struct vmem_altmap {
 enum memory_type {
/* 0 is reserved to catch uninitialized type fields */
MEMORY_DEVICE_PRIVATE = 1,
+   MEMORY_DEVICE_COHERENT,
MEMORY_DEVICE_FS_DAX,
MEMORY_DEVICE_GENERIC,
MEMORY_DEVICE_PCI_P2PDMA,
@@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct 
folio *folio)

In general, this LGTM, and it should be correct with PageAnonExclusive I think.


However, where exactly is pinning forbidden?

Long-term pinning is forbidden since it would interfere with the device
memory manager owning the
device-coherent pages (e.g. evictions in TTM). However, normal pinning
is allowed on this device type.

I don't see updates to folio_is_pinnable() in this patch.

Device coherent type pages should return true here, as they are pinnable
pages.

That function is only called for long-term pinnings in try_grab_folio().


So wouldn't try_grab_folio() simply pin these pages? What am I missing?

As far as I understand this returns NULL for long-term pinned pages.
Otherwise they get their refcount incremented.

I don't follow.

You're saying

a) folio_is_pinnable() returns true for device coherent pages

and that

b) device coherent pages don't get long-term pinned


Yet, the code says

struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
{
if (flags & FOLL_GET)
return try_get_folio(page, refs);
else if (flags & FOLL_PIN) {
struct folio *folio;

/*
 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
 * right zone, so fail and let the caller fall back to the slow
 * path.
 */
if (unlikely((flags & FOLL_LONGTERM) &&
 !is_pinnable_page(page)))
return NULL;
...
return folio;
}
}


What prevents these pages from getting long-term pinned as stated in this patch?

Long-term pinning is handled by __gup_longterm_locked, which migrates
pages returned by __get_user_pages_locked that cannot be long-term
pinned. try_grab_folio is OK to grab the pages. Anything that can't be
long-term pinned will be migrated afterwards, and
__get_user_pages_locked will be retried. The migration of
DEVICE_COHERENT pages was implemented by Alistair in patch 5/13
("mm/gup: migrate device coherent pages when pinning instead of failing").


Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

2022-06-17 Thread Sierra Guiza, Alejandro (Alex)



On 6/17/2022 4:40 AM, David Hildenbrand wrote:

On 31.05.22 22:00, Alex Sierra wrote:

Device memory that is cache coherent from device and CPU point of view.
This is used on platforms that have an advanced system bus (like CAPI
or CXL). Any page of a process can be migrated to such memory. However,
no one should be allowed to pin such memory so that it can always be
evicted.

Signed-off-by: Alex Sierra 
Acked-by: Felix Kuehling 
Reviewed-by: Alistair Popple 
[hch: rebased ontop of the refcount changes,
   removed is_dev_private_or_coherent_page]
Signed-off-by: Christoph Hellwig 
---
  include/linux/memremap.h | 19 +++
  mm/memcontrol.c  |  7 ---
  mm/memory-failure.c  |  8 ++--
  mm/memremap.c| 10 ++
  mm/migrate_device.c  | 16 +++-
  mm/rmap.c|  5 +++--
  6 files changed, 49 insertions(+), 16 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 8af304f6b504..9f752ebed613 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -41,6 +41,13 @@ struct vmem_altmap {
   * A more complete discussion of unaddressable memory may be found in
   * include/linux/hmm.h and Documentation/vm/hmm.rst.
   *
+ * MEMORY_DEVICE_COHERENT:
+ * Device memory that is cache coherent from device and CPU point of view. This
+ * is used on platforms that have an advanced system bus (like CAPI or CXL). A
+ * driver can hotplug the device memory using ZONE_DEVICE and with that memory
+ * type. Any page of a process can be migrated to such memory. However no one

Any page might not be right, I'm pretty sure. ... just thinking about special 
pages
like vdso, shared zeropage, ... pinned pages ...


Hi David,

Yes, I think you're right. This type does not cover all special pages.
I need to correct that in the cover letter.

Pinned pages are allowed as long as they're not long term pinned.

Regards,
Alex Sierra




+ * should be allowed to pin such memory so that it can always be evicted.
+ *
   * MEMORY_DEVICE_FS_DAX:
   * Host memory that has similar access semantics as System RAM i.e. DMA
   * coherent and supports page pinning. In support of coordinating page
@@ -61,6 +68,7 @@ struct vmem_altmap {
  enum memory_type {
/* 0 is reserved to catch uninitialized type fields */
MEMORY_DEVICE_PRIVATE = 1,
+   MEMORY_DEVICE_COHERENT,
MEMORY_DEVICE_FS_DAX,
MEMORY_DEVICE_GENERIC,
MEMORY_DEVICE_PCI_P2PDMA,
@@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct 
folio *folio)

In general, this LGTM, and it should be correct with PageAnonExclusive I think.


However, where exactly is pinning forbidden?


Long-term pinning is forbidden since it would interfere with the device 
memory manager owning the
device-coherent pages (e.g. evictions in TTM). However, normal pinning 
is allowed on this device type.


Regards,
Alex Sierra





Re: [PATCH v4 07/13] lib: test_hmm add ioctl to get zone device type

2022-05-31 Thread Sierra Guiza, Alejandro (Alex)



On 5/31/2022 12:31 PM, Andrew Morton wrote:

On Tue, 31 May 2022 10:56:23 -0500 Alex Sierra  wrote:


New ioctl cmd added to query zone device type. This will be
used once the test_hmm adds zone device coherent type.

@@ -1026,6 +1027,15 @@ static int dmirror_snapshot(struct dmirror *dmirror,
return ret;
  }
  
+static int dmirror_get_device_type(struct dmirror *dmirror,

+   struct hmm_dmirror_cmd *cmd)
+{
+   mutex_lock(&dmirror->mutex);
+   cmd->zone_device_type = dmirror->mdevice->zone_device_type;
+   mutex_unlock(&dmirror->mutex);

What does the locking here do?

Presumably cmd->zone_device_type can become out of date the instant the
mutex is released, so what was the point in taking the mutex?


Actually this is not used at all. Thanks for finding it. Honestly, I 
don't remember what we used this type request for.

I will remove all related code and send a new patch series version.

Regards,
Alex Sierra


And does it make sense to return potentially out-of-date info to
userspace?  Perhaps this interface simply shouldn't exist?


Re: [PATCH v3 02/13] mm: handling Non-LRU pages returned by vm_normal_pages

2022-05-26 Thread Sierra Guiza, Alejandro (Alex)



On 5/24/2022 11:11 PM, Alistair Popple wrote:

Alex Sierra  writes:


With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
device-managed anonymous pages that are not LRU pages. Although they
behave like normal pages for purposes of mapping in CPU page tables and
for COW, they do not support LRU lists, NUMA migration or THP.

We also introduced a FOLL_LRU flag that adds the same behaviour to
follow_page and related APIs, to allow callers to specify that they
expect to put pages on an LRU list.

Continuing the follow up from the thread for v2:


This means by default GUP can return non-LRU pages. I didn't see
anywhere that would be a problem but I didn't check everything. Did you
check this or is there some other reason I've missed that makes this not
a problem?

I have double checked all gup and pin_user_pages callers and none of them seem
to have interaction with LRU APIs.

And actually if I'm understanding things correctly callers of
GUP/PUP/follow_page_pte() should already expect to get non-LRU pages
returned:

 page = vm_normal_page(vma, address, pte);
 if ((flags & FOLL_LRU) && page && is_zone_device_page(page))
 page = NULL;
 if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
 /*
  * Only return device mapping pages in the FOLL_GET or FOLL_PIN
  * case since they are only valid while holding the pgmap
  * reference.
  */
 *pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
 if (*pgmap)
 page = pte_page(pte);

Which I think makes FOLL_LRU confusing, because if I understand correctly
even with FOLL_LRU it is still possible for follow_page_pte() to return
a non-LRU page. Could we do something like this to make it consistent:

 if ((flags & FOLL_LRU) && (page && is_zone_device_page(page) ||
 !page && pte_devmap(pte)))


Hi Alistair,
Not sure if this suggestion is a replacement for the first or the second 
condition in the snip code above. We know device coherent type will not 
be set with devmap. So we could do the following:


 if ((flags & FOLL_LRU) && page && is_zone_device_page(page))
-   page = NULL;
+   goto no_page;

Regards,
Alex Sierra



Looking at callers that currently use FOLL_LRU I don't think this would
change any behaviour as they already filter out devmap through various
other means.


Signed-off-by: Alex Sierra 
Acked-by: Felix Kuehling 
---
  fs/proc/task_mmu.c | 2 +-
  include/linux/mm.h | 3 ++-
  mm/gup.c   | 2 ++
  mm/huge_memory.c   | 2 +-
  mm/khugepaged.c| 9 ++---
  mm/ksm.c   | 6 +++---
  mm/madvise.c   | 4 ++--
  mm/memory.c| 9 -
  mm/mempolicy.c | 2 +-
  mm/migrate.c   | 4 ++--
  mm/mlock.c | 2 +-
  mm/mprotect.c  | 2 +-
  12 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f46060eb91b5..5d620733f173 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1785,7 +1785,7 @@ static struct page *can_gather_numa_stats(pte_t pte, 
struct vm_area_struct *vma,
return NULL;

page = vm_normal_page(vma, addr, pte);
-   if (!page)
+   if (!page || is_zone_device_page(page))
return NULL;

if (PageReserved(page))
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9f44254af8ce..d7f253a0c41e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -601,7 +601,7 @@ struct vm_operations_struct {
  #endif
/*
 * Called by vm_normal_page() for special PTEs to find the
-* page for @addr.  This is useful if the default behavior
+* page for @addr. This is useful if the default behavior
 * (using pte_page()) would not find the correct page.
 */
struct page *(*find_special_page)(struct vm_area_struct *vma,
@@ -2929,6 +2929,7 @@ struct page *follow_page(struct vm_area_struct *vma, 
unsigned long address,
  #define FOLL_NUMA 0x200   /* force NUMA hinting page fault */
  #define FOLL_MIGRATION0x400   /* wait for page to replace migration 
entry */
  #define FOLL_TRIED0x800   /* a retry, previous pass started an IO */
+#define FOLL_LRU0x1000  /* return only LRU (anon or page cache) */
  #define FOLL_REMOTE   0x2000  /* we are working on non-current tsk/mm */
  #define FOLL_COW  0x4000  /* internal GUP flag */
  #define FOLL_ANON 0x8000  /* don't do file mappings */
diff --git a/mm/gup.c b/mm/gup.c
index 501bc150792c..c9cbac06bcc5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -479,6 +479,8 @@ static struct page *follow_page_pte(struct vm_area_struct 
*vma,
}

page = vm_normal_page(vma, address, pte);
+   if ((flags & FOLL_LRU) && page && is_zone_device_page(page))
+   page = NULL;
if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
/*
 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
diff --git 

Re: [PATCH v2 11/13] mm: handling Non-LRU pages returned by vm_normal_pages

2022-05-24 Thread Sierra Guiza, Alejandro (Alex)



On 5/23/2022 7:02 AM, Alistair Popple wrote:

Technically I think this patch should be earlier in the series. As I
understand it patch 1 allows DEVICE_COHERENT pages to be inserted in the
page tables and therefore makes it possible for page table walkers to
see non-LRU pages.


The patch will be reordered in v3.

Regards,
Alex Sierra



Some more comments below:

Alex Sierra  writes:


With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
device-managed anonymous pages that are not LRU pages. Although they
behave like normal pages for purposes of mapping in CPU page tables and
for COW, they do not support LRU lists, NUMA migration or THP.

We also introduced a FOLL_LRU flag that adds the same behaviour to
follow_page and related APIs, to allow callers to specify that they
expect to put pages on an LRU list.

This means by default GUP can return non-LRU pages. I didn't see
anywhere that would be a problem but I didn't check everything. Did you
check this or is there some other reason I've missed that makes this not
a problem?


I have double checked all gup and pin_user_pages callers and none of 
them seem to have interaction with LRU APIs.


Regards,
Alex Sierra



[...]


diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a4e5eaf3eb01..eb3cfd679800 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -627,7 +627,7 @@ static int __collapse_huge_page_isolate(struct 
vm_area_struct *vma,
goto out;
}
page = vm_normal_page(vma, address, pteval);
-   if (unlikely(!page)) {
+   if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
result = SCAN_PAGE_NULL;
goto out;
}
@@ -1276,7 +1276,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
writable = true;

page = vm_normal_page(vma, _address, pteval);
-   if (unlikely(!page)) {
+   if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
result = SCAN_PAGE_NULL;
goto out_unmap;
}
@@ -1484,7 +1484,8 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, 
unsigned long addr)
goto abort;

page = vm_normal_page(vma, addr, *pte);
-
+   if (page && is_zone_device_page(page))
+   page = NULL;
/*
 * Note that uprobe, debugger, or MAP_PRIVATE may change the
 * page table, but the new page will not be a subpage of hpage.
@@ -1502,6 +1503,8 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, 
unsigned long addr)
if (pte_none(*pte))
continue;
page = vm_normal_page(vma, addr, *pte);
+   if (page && is_zone_device_page(page))
+   goto abort;

Are either of these two cases actually possible? DEVICE_COHERENT doesn't
currently support THP, so if I'm understanding correctly we couldn't
have a pte mapped DEVICE_COHERENT THP right? Assuming that's the case I
think WARN_ON_ONCE() would be better.
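A sketch of that suggestion (illustrative, not the actual v3 hunk):

	page = vm_normal_page(vma, addr, *pte);
	/*
	 * DEVICE_COHERENT does not support THP, so a pte-mapped THP
	 * subpage should never be a device page here.
	 */
	if (WARN_ON_ONCE(page && is_zone_device_page(page)))
		goto abort;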


Correct, the change is included in the v3 patch series.

Regards,
Alex



Otherwise I think everything else looks reasonable.


page_remove_rmap(page, vma, false);
}

diff --git a/mm/ksm.c b/mm/ksm.c
index 063a48eeb5ee..f16056efca21 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -474,7 +474,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned 
long addr)
do {
cond_resched();
page = follow_page(vma, addr,
-   FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
+   FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE | 
FOLL_LRU);
if (IS_ERR_OR_NULL(page))
break;
if (PageKsm(page))
@@ -559,7 +559,7 @@ static struct page *get_mergeable_page(struct rmap_item 
*rmap_item)
if (!vma)
goto out;

-   page = follow_page(vma, addr, FOLL_GET);
+   page = follow_page(vma, addr, FOLL_GET | FOLL_LRU);
if (IS_ERR_OR_NULL(page))
goto out;
if (PageAnon(page)) {
@@ -2288,7 +2288,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct 
page **page)
while (ksm_scan.address < vma->vm_end) {
if (ksm_test_exit(mm))
break;
-   *page = follow_page(vma, ksm_scan.address, FOLL_GET);
+   *page = follow_page(vma, ksm_scan.address, FOLL_GET | 
FOLL_LRU);
if (IS_ERR_OR_NULL(*page)) {
ksm_scan.address += PAGE_SIZE;
cond_resched();
diff --git a/mm/madvise.c b/mm/madvise.c
index 1873616a37d2..e9c24c834e98 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -413,7 +413,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
   

Re: [PATCH v1 13/15] mm: handling Non-LRU pages returned by vm_normal_pages

2022-05-12 Thread Sierra Guiza, Alejandro (Alex)



On 5/11/2022 1:50 PM, Jason Gunthorpe wrote:

On Thu, May 05, 2022 at 04:34:36PM -0500, Alex Sierra wrote:


diff --git a/mm/memory.c b/mm/memory.c
index 76e3af9639d9..892c4cc54dc2 100644
+++ b/mm/memory.c
@@ -621,6 +621,13 @@ struct page *vm_normal_page(struct vm_area_struct *vma, 
unsigned long addr,
if (is_zero_pfn(pfn))
return NULL;
if (pte_devmap(pte))
+/*
+ * NOTE: Technically this should goto check_pfn label. However, page->_mapcount
+ * is never incremented for device pages that are mmap through DAX mechanism
+ * using pmem driver mounted into ext4 filesystem. When these pages are unmapped,
+ * zap_pte_range is called and vm_normal_page returns a valid page with
+ * page_mapcount() = 0, before page_remove_rmap is called.
+ */
return NULL;

? Where does this series cause device coherent to be returned?

In our case, device coherent pages could be obtained as a result of
migration (patches 6/7 of 15), ending up mapped in CPU page tables. Later
on, these pages might need to be returned by get_user_pages or other
callers through vm_normal_pages. Our approach in this series is to
handle device-coherent-managed pages returned by vm_normal_pages inside
each caller; for example, device coherent pages don't support LRU lists,
NUMA migration or THP.
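The per-caller pattern this describes looks roughly like the following
(an illustrative sketch, not a specific hunk from the series):

	page = vm_normal_page(vma, address, pte);
	/* device coherent pages must skip LRU/NUMA/THP-only handling */
	if (page && is_zone_device_page(page))
		return NULL;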


Wasn't the plan to not set pte_devmap() ?


amdgpu does not set pte_devmap for our DEVICE_COHERENT pages. DEVMAP 
flags are set by drivers like virtio_fs or pmem, where 
MEMORY_DEVICE_FS_DAX type is used.
This patch series deals with DEVICE_COHERENT pages. My understanding
was that the DAX code and DEVICE_GENERIC would be fixed up later by
someone more familiar with it. Were you expecting that we'd fix the DAX 
usage of pte_devmap flags in this patch series as well?


Regards,
Alex Sierra



Jason


Re: [PATCH v1 01/15] mm: add zone device coherent type memory support

2022-05-12 Thread Sierra Guiza, Alejandro (Alex)



On 5/11/2022 9:58 PM, Alistair Popple wrote:

Alex Sierra  writes:

[...]


diff --git a/mm/rmap.c b/mm/rmap.c
index fedb82371efe..d57102cd4b43 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1995,7 +1995,8 @@ void try_to_migrate(struct folio *folio, enum ttu_flags 
flags)
TTU_SYNC)))
return;

-   if (folio_is_zone_device(folio) && !folio_is_device_private(folio))
+   if (folio_is_zone_device(folio) &&
+   (!folio_is_device_private(folio) && 
!folio_is_device_coherent(folio)))
return;

/*

I vaguely recall commenting on this previously, or at least intending
to. In try_to_migrate_one() we have this:

if (folio_is_zone_device(folio)) {
unsigned long pfn = folio_pfn(folio);
swp_entry_t entry;
pte_t swp_pte;

/*
 * Store the pfn of the page in a special migration
 * pte. do_swap_page() will wait until the migration
 * pte is removed and then restart fault handling.
 */
entry = pte_to_swp_entry(pteval);
if (is_writable_device_private_entry(entry))
entry = make_writable_migration_entry(pfn);
else
entry = make_readable_migration_entry(pfn);
swp_pte = swp_entry_to_pte(entry);

The check in try_to_migrate() guarantees that if folio_is_zone_device()
is true this must be a DEVICE_PRIVATE page and it treats it as such by
assuming there is a special device private swap entry there.

Relying on that assumption seems bad, and I have no idea why I didn't
just use is_device_private_page() originally but I think the fix is just
to change this to:

if (folio_is_device_private(folio))


Thanks Alistair. This worked fine. I'll drop patch 4 and update this 
patch in the next series version.


Regards,
Alex Sierra



And let DEVICE_COHERENT pages fall through to normal page processing.

  - Alistair


RE: [PATCH v1 04/15] mm: add device coherent checker to remove migration pte

2022-05-05 Thread Sierra Guiza, Alejandro (Alex)
@apop...@nvidia.com Could you please check this patch? It's somehow related to 
migrate_device_page() for long term device coherent pages.

Regards,
Alex Sierra
> -Original Message-
> From: amd-gfx  On Behalf Of Alex
> Sierra
> Sent: Thursday, May 5, 2022 4:34 PM
> To: j...@nvidia.com
> Cc: rcampb...@nvidia.com; wi...@infradead.org; da...@redhat.com;
> Kuehling, Felix ; apop...@nvidia.com; amd-
> g...@lists.freedesktop.org; linux-...@vger.kernel.org; linux...@kvack.org;
> jgli...@redhat.com; dri-devel@lists.freedesktop.org; akpm@linux-
> foundation.org; linux-e...@vger.kernel.org; h...@lst.de
> Subject: [PATCH v1 04/15] mm: add device coherent checker to remove
> migration pte
> 
> During remove_migration_pte(), ignore the _PAGE_RW flag for entries of
> device coherent type pages that were not created through special
> migration ptes. This path can be found at migrate_device_page(), where a
> valid vma is not required. In this case, migrate_vma_collect_pmd() is not
> called and special migration ptes are not set.
> 
> Signed-off-by: Alex Sierra 
> ---
>  mm/migrate.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c index
> 6c31ee1e1c9b..e18ddee56f37 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -206,7 +206,8 @@ static bool remove_migration_pte(struct folio *folio,
>* Recheck VMA as permissions can change since migration
> started
>*/
>   entry = pte_to_swp_entry(*pvmw.pte);
> - if (is_writable_migration_entry(entry))
> + if (is_writable_migration_entry(entry) ||
> + is_device_coherent_page(pfn_to_page(pvmw.pfn)))
>   pte = maybe_mkwrite(pte, vma);
>   else if (pte_swp_uffd_wp(*pvmw.pte))
>   pte = pte_mkuffd_wp(pte);
> --
> 2.32.0



Re: [PATCH v2 1/3] mm: add vm_normal_lru_pages for LRU handled pages only

2022-04-04 Thread Sierra Guiza, Alejandro (Alex)



On 4/4/2022 12:38 PM, Jason Gunthorpe wrote:

On Fri, Apr 01, 2022 at 04:08:35PM -0400, Felix Kuehling wrote:


In general I find the vm_normal_lru_page vs vm_normal_page
API highly confusing.  An explicit check for zone device pages
in the dozen or so spots that care has a much better documentation
value, especially if accompanied by comments where it isn't entirely
obvious.

OK. We can do that. It would solve the function naming problem, and we'd
have more visibility of device page handling in more places in the kernel,
which has educational value.

Personally I find the 'is page XYZ' pretty confusing, like I don't
know half of what the PageKsm annotations are for..

Testing against a specific property the code goes on to use right away
seems more descriptive to me.
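One hedged reading of that suggestion, testing the property the code
uses right away rather than a catch-all helper (illustrative):

	page = vm_normal_page(vma, addr, pte);
	/* the property used next is that the page participates in LRU aging */
	if (page && !is_zone_device_page(page))
		mark_page_accessed(page);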


Hi Jason,

Are you referring to testing for properties such as is_lru_page,
is_numa_page, is_lockable_page, etc.?

Otherwise, could you provide an example?

Regards,
Alex Sierra



Jason


RE: [PATCH] drm/amdkfd: Add SVM API support capability bits

2022-03-30 Thread Sierra Guiza, Alejandro (Alex)
Hi Matthew,
I sent this patch by accident. Please ignore it.

Regards,
Alex Sierra

> -Original Message-
> From: Matthew Wilcox 
> Sent: Wednesday, March 30, 2022 4:29 PM
> To: Sierra Guiza, Alejandro (Alex) 
> Cc: j...@nvidia.com; da...@redhat.com; Kuehling, Felix
> ; linux...@kvack.org; rcampb...@nvidia.com;
> linux-e...@vger.kernel.org; linux-...@vger.kernel.org; amd-
> g...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; h...@lst.de;
> jgli...@redhat.com; apop...@nvidia.com; a...@linux-foundation.org
> Subject: Re: [PATCH] drm/amdkfd: Add SVM API support capability bits
> 
> On Wed, Mar 30, 2022 at 04:24:20PM -0500, Alex Sierra wrote:
> > From: Philip Yang 
> >
> > SVMAPISupported property added to HSA_CAPABILITY, the value match
> > HSA_CAPABILITY defined in Thunk spec:
> >
> > SVMAPISupported: it will not be supported on older kernels that don't
> > have HMM or on systems with GFXv8 or older GPUs without support for
> > 48-bit virtual addresses.
> >
> > CoherentHostAccess property added to HSA_MEMORYPROPERTY, the
> value
> > match HSA_MEMORYPROPERTY defined in Thunk spec:
> >
> > CoherentHostAccess: whether or not device memory can be coherently
> > accessed by the host CPU.
> 
> Could you translate this commit message into English?  Reviewing
> Documentation/process/5.Posting.rst might be helpful.


RE: [PATCH] drm/amdkfd: Add SVM API support capability bits

2022-03-30 Thread Sierra Guiza, Alejandro (Alex)
Please ignore this patch.

> -Original Message-
> From: amd-gfx  On Behalf Of Alex
> Sierra
> Sent: Wednesday, March 30, 2022 4:24 PM
> To: j...@nvidia.com
> Cc: rcampb...@nvidia.com; wi...@infradead.org; da...@redhat.com;
> Kuehling, Felix ; apop...@nvidia.com; amd-
> g...@lists.freedesktop.org; linux-...@vger.kernel.org; linux...@kvack.org;
> jgli...@redhat.com; dri-devel@lists.freedesktop.org; akpm@linux-
> foundation.org; linux-e...@vger.kernel.org; h...@lst.de
> Subject: [PATCH] drm/amdkfd: Add SVM API support capability bits
> 
> From: Philip Yang 
> 
> SVMAPISupported property added to HSA_CAPABILITY, the value match
> HSA_CAPABILITY defined in Thunk spec:
> 
> SVMAPISupported: it will not be supported on older kernels that don't have
> HMM or on systems with GFXv8 or older GPUs without support for 48-bit
> virtual addresses.
> 
> CoherentHostAccess property added to HSA_MEMORYPROPERTY, the value
> match HSA_MEMORYPROPERTY defined in Thunk spec:
> 
> CoherentHostAccess: whether or not device memory can be coherently
> accessed by the host CPU.
> 
> Signed-off-by: Philip Yang 
> Reviewed-by: Felix Kuehling 
> Signed-off-by: Felix Kuehling 
> Signed-off-by: Alex Deucher 
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_topology.c |  6 ++
> drivers/gpu/drm/amd/amdkfd/kfd_topology.h | 10 ++
>  2 files changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> index cdef608db4f4..083ac9babfa8 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> @@ -1419,6 +1419,12 @@ int kfd_topology_add_device(struct kfd_dev
> *gpu)
>   dev->node_props.capability |= (adev->ras_features != 0) ?
>   HSA_CAP_RASEVENTNOTIFY : 0;
> 
> + /* SVM API and HMM page migration work together, device memory
> type
> +  * is initialized to not 0 when page migration register device memory.
> +  */
> + if (adev->kfd.dev->pgmap.type != 0)
> + dev->node_props.capability |=
> HSA_CAP_SVMAPI_SUPPORTED;
> +
>   kfd_debug_print_topology();
> 
>   if (!res)
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.h
> b/drivers/gpu/drm/amd/amdkfd/kfd_topology.h
> index b8b68087bd7a..6bd6380b0ee0 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.h
> @@ -53,8 +53,9 @@
>  #define HSA_CAP_ASIC_REVISION_MASK   0x03c0
>  #define HSA_CAP_ASIC_REVISION_SHIFT  22
>  #define HSA_CAP_SRAM_EDCSUPPORTED0x0400
> +#define HSA_CAP_SVMAPI_SUPPORTED 0x0800
> 
> -#define HSA_CAP_RESERVED 0xf80f8000
> +#define HSA_CAP_RESERVED 0xf00f8000
> 
>  struct kfd_node_properties {
>   uint64_t hive_id;
> @@ -98,9 +99,10 @@ struct kfd_node_properties {
>  #define HSA_MEM_HEAP_TYPE_GPU_LDS4
>  #define HSA_MEM_HEAP_TYPE_GPU_SCRATCH5
> 
> -#define HSA_MEM_FLAGS_HOT_PLUGGABLE  0x0001
> -#define HSA_MEM_FLAGS_NON_VOLATILE   0x0002
> -#define HSA_MEM_FLAGS_RESERVED   0xfffc
> +#define HSA_MEM_FLAGS_HOT_PLUGGABLE  0x0001
> +#define HSA_MEM_FLAGS_NON_VOLATILE   0x0002
> +#define HSA_MEM_FLAGS_COHERENTHOSTACCESS 0x0004
> +#define HSA_MEM_FLAGS_RESERVED   0xfff8
> 
>  struct kfd_mem_properties {
>   struct list_headlist;
> --
> 2.32.0



Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

2022-02-11 Thread Sierra Guiza, Alejandro (Alex)


On 2/11/2022 10:39 AM, David Hildenbrand wrote:

On 11.02.22 17:15, David Hildenbrand wrote:

On 01.02.22 16:48, Alex Sierra wrote:

Device memory that is cache coherent from device and CPU point of view.
This is used on platforms that have an advanced system bus (like CAPI
or CXL). Any page of a process can be migrated to such memory. However,
no one should be allowed to pin such memory so that it can always be
evicted.

Signed-off-by: Alex Sierra
Acked-by: Felix Kuehling
Reviewed-by: Alistair Popple

So, I'm currently messing with PageAnon() pages and CoW semantics ...
all these PageAnon() ZONE_DEVICE variants don't necessarily make my life
easier but I'm not sure yet if they make my life harder. I hope you can
help me understand some of that stuff.

1) What are expected CoW semantics for DEVICE_COHERENT?

I assume we'll share them just like other PageAnon() pages during fork()
readable, and the first sharer writing to them receives an "ordinary"
!ZONE_DEVICE copy.

So this would be just like DEVICE_EXCLUSIVE CoW handling I assume, just
that we don't have to go through the loop of restoring a device
exclusive entry?

2) How are these pages freed to clear/invalidate PageAnon() ?

I assume for PageAnon() ZONE_DEVICE pages we'll always for via
free_devmap_managed_page(), correct?


3) FOLL_PIN

While you write "no one should be allowed to pin such memory", patch #2
only blocks FOLL_LONGTERM. So I assume we allow ordinary FOLL_PIN and
you might want to be a bit more precise?


Device coherent pages can be pinned with FOLL_PIN. However, we want to
avoid FOLL_LONGTERM with these pages because that would affect our memory
manager's ability to evict device memory.
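
For illustration, a minimal sketch of that policy (not the upstream code;
is_device_coherent_page() and the exact flag handling here are assumptions
based on this series):

	/*
	 * Sketch: a short-term FOLL_PIN on a device coherent page is fine,
	 * but FOLL_LONGTERM must not keep the page resident in device
	 * memory, since that would block the driver from evicting it.
	 */
	static bool coherent_pin_allowed(struct page *page, unsigned int gup_flags)
	{
		if (!is_device_coherent_page(page))
			return true;
		/* long-term pins trigger migration to system memory instead */
		return !(gup_flags & FOLL_LONGTERM);
	}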


Regards,
Alex Sierra




... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages, but can we
FOLL_PIN DEVICE_EXCLUSIVE pages? I strongly assume so?


Thanks for any information.


(digging a bit more, I realized that device exclusive pages are not
actually/necessarily ZONE_DEVICE pages -- so I assume DEVICE_COHERENT
will be the actual first PageAnon() ZONE_DEVICE pages that can be
present in a page table.)


Yes, that's correct. Device coherent pages are pte present.

Regards,
Alex Sierra



Re: start sorting out the ZONE_DEVICE refcount mess v2

2022-02-10 Thread Sierra Guiza, Alejandro (Alex)

Christoph,
Thanks a lot for rebasing our patches. I just ran our amdgpu hmm-tests
with this series and all passed.


Regards,
Alex Sierra

On 2/10/2022 1:28 AM, Christoph Hellwig wrote:

Hi all,

this series removes the offset by one refcount for ZONE_DEVICE pages
that are freed back to the driver owning them, which is just device
private ones for now, but also the planned device coherent pages
and the enhanced p2p ones pending.

It does not address the fsdax pages yet, which will be attacked in a
follow-on series.

Note that if we want to get the p2p series rebased on top of this
we'll need a git branch for this series.  I could offer to host one.

A git tree is available here:

 git://git.infradead.org/users/hch/misc.git pgmap-refcount

Gitweb:

 
http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/pgmap-refcount

Changes since v1:
  - add a missing memremap.h include in memcontrol.c
  - include rebased versions of the device coherent support and
device coherent migration support series as well as additional
cleanup patches

Diffstat:
  arch/arm64/mm/mmu.c  |1
  arch/powerpc/kvm/book3s_hv_uvmem.c   |1
  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |   35 -
  drivers/gpu/drm/amd/amdkfd/kfd_priv.h|1
  drivers/gpu/drm/drm_cache.c  |2
  drivers/gpu/drm/nouveau/nouveau_dmem.c   |3
  drivers/gpu/drm/nouveau/nouveau_svm.c|1
  drivers/infiniband/core/rw.c |1
  drivers/nvdimm/pmem.h|1
  drivers/nvme/host/pci.c  |1
  drivers/nvme/target/io-cmd-bdev.c|1
  fs/Kconfig   |2
  fs/fuse/virtio_fs.c  |1
  include/linux/hmm.h  |9
  include/linux/memremap.h |   36 +
  include/linux/migrate.h  |1
  include/linux/mm.h   |   59 --
  lib/test_hmm.c   |  353 ++---
  lib/test_hmm_uapi.h  |   22
  mm/Kconfig   |7
  mm/Makefile  |1
  mm/gup.c |  127 +++-
  mm/internal.h|3
  mm/memcontrol.c  |   19
  mm/memory-failure.c  |8
  mm/memremap.c|   75 +-
  mm/migrate.c |  763 
  mm/migrate_device.c  |  822 
+++
  mm/rmap.c|5
  mm/swap.c|   49 -
  tools/testing/selftests/vm/Makefile  |2
  tools/testing/selftests/vm/hmm-tests.c   |  204 ++-
  tools/testing/selftests/vm/test_hmm.sh   |   24
  33 files changed, 1552 insertions(+), 1088 deletions(-)


Re: [PATCH v5 09/10] tools: update hmm-test to support device coherent type

2022-01-31 Thread Sierra Guiza, Alejandro (Alex)

Hi Alistair,
This is the last patch to be reviewed in this series. It already has
the changes from your last feedback (V3). Would you mind taking a look?
Thanks a lot for reviewing the rest!

Regards,
Alex Sierra

On 1/28/2022 2:08 PM, Alex Sierra wrote:

Test cases such as migrate_fault and migrate_multiple, were modified to
explicit migrate from device to sys memory without the need of page
faults, when using device coherent type.

Snapshot test case updated to read memory device type first and based
on that, get the proper returned results. migrate_ping_pong test case
added to test explicit migration from device to sys memory for both
private and coherent zone types.

Helpers to migrate from device to sys memory and vice versa
were also added.

Signed-off-by: Alex Sierra 
Acked-by: Felix Kuehling 
---
v2:
Set FIXTURE_VARIANT to add multiple device types to the FIXTURE. This
will run all the tests for each device type (private and coherent) in
case both existed during hmm-test driver probed.
v4:
Check for the number of pages successfully migrated from coherent
device to system at migrate_multiple test.
---
  tools/testing/selftests/vm/hmm-tests.c | 123 -
  1 file changed, 102 insertions(+), 21 deletions(-)

diff --git a/tools/testing/selftests/vm/hmm-tests.c 
b/tools/testing/selftests/vm/hmm-tests.c
index 203323967b50..84ec8c4a1dc7 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -44,6 +44,14 @@ struct hmm_buffer {
int fd;
uint64_tcpages;
uint64_tfaults;
+   int zone_device_type;
+};
+
+enum {
+   HMM_PRIVATE_DEVICE_ONE,
+   HMM_PRIVATE_DEVICE_TWO,
+   HMM_COHERENCE_DEVICE_ONE,
+   HMM_COHERENCE_DEVICE_TWO,
  };
  
  #define TWOMEG		(1 << 21)

@@ -60,6 +68,21 @@ FIXTURE(hmm)
unsigned intpage_shift;
  };
  
+FIXTURE_VARIANT(hmm)

+{
+   int device_number;
+};
+
+FIXTURE_VARIANT_ADD(hmm, hmm_device_private)
+{
+   .device_number = HMM_PRIVATE_DEVICE_ONE,
+};
+
+FIXTURE_VARIANT_ADD(hmm, hmm_device_coherent)
+{
+   .device_number = HMM_COHERENCE_DEVICE_ONE,
+};
+
  FIXTURE(hmm2)
  {
int fd0;
@@ -68,6 +91,24 @@ FIXTURE(hmm2)
unsigned intpage_shift;
  };
  
+FIXTURE_VARIANT(hmm2)

+{
+   int device_number0;
+   int device_number1;
+};
+
+FIXTURE_VARIANT_ADD(hmm2, hmm2_device_private)
+{
+   .device_number0 = HMM_PRIVATE_DEVICE_ONE,
+   .device_number1 = HMM_PRIVATE_DEVICE_TWO,
+};
+
+FIXTURE_VARIANT_ADD(hmm2, hmm2_device_coherent)
+{
+   .device_number0 = HMM_COHERENCE_DEVICE_ONE,
+   .device_number1 = HMM_COHERENCE_DEVICE_TWO,
+};
+
  static int hmm_open(int unit)
  {
char pathname[HMM_PATH_MAX];
@@ -81,12 +122,19 @@ static int hmm_open(int unit)
return fd;
  }
  
+static bool hmm_is_coherent_type(int dev_num)

+{
+   return (dev_num >= HMM_COHERENCE_DEVICE_ONE);
+}
+
  FIXTURE_SETUP(hmm)
  {
self->page_size = sysconf(_SC_PAGE_SIZE);
self->page_shift = ffs(self->page_size) - 1;
  
-	self->fd = hmm_open(0);

+   self->fd = hmm_open(variant->device_number);
+   if (self->fd < 0 && hmm_is_coherent_type(variant->device_number))
+   SKIP(exit(0), "DEVICE_COHERENT not available");
ASSERT_GE(self->fd, 0);
  }
  
@@ -95,9 +143,11 @@ FIXTURE_SETUP(hmm2)

self->page_size = sysconf(_SC_PAGE_SIZE);
self->page_shift = ffs(self->page_size) - 1;
  
-	self->fd0 = hmm_open(0);

+   self->fd0 = hmm_open(variant->device_number0);
+   if (self->fd0 < 0 && hmm_is_coherent_type(variant->device_number0))
+   SKIP(exit(0), "DEVICE_COHERENT not available");
ASSERT_GE(self->fd0, 0);
-   self->fd1 = hmm_open(1);
+   self->fd1 = hmm_open(variant->device_number1);
ASSERT_GE(self->fd1, 0);
  }
  
@@ -144,6 +194,7 @@ static int hmm_dmirror_cmd(int fd,

}
buffer->cpages = cmd.cpages;
buffer->faults = cmd.faults;
+   buffer->zone_device_type = cmd.zone_device_type;
  
  	return 0;

  }
@@ -211,6 +262,20 @@ static void hmm_nanosleep(unsigned int n)
nanosleep(, NULL);
  }
  
+static int hmm_migrate_sys_to_dev(int fd,

+  struct hmm_buffer *buffer,
+  unsigned long npages)
+{
+   return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_DEV, buffer, npages);
+}
+
+static int hmm_migrate_dev_to_sys(int fd,
+  struct hmm_buffer *buffer,
+  unsigned long npages)
+{
+   return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_SYS, buffer, npages);
+}
+
  /*
   * Simple NULL test of device open/close.
   */
@@ -875,7 +940,7 @@ TEST_F(hmm, migrate)
ptr[i] = i;
  
  	/* Migrate memory to device. */

-   ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+   ret = 

Re: [PATCH v4 00/10] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping

2022-01-27 Thread Sierra Guiza, Alejandro (Alex)

Andrew,
We're somewhat new to this procedure. Are you referring to rebasing this
patch series onto
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
<5.17-rc1 tag>?


Regards,
Alex Sierra

Alex Deucher,
Just a quick heads up: this patch series contains changes to the amdgpu
driver which we're planning to merge through Andrew's tree, if that's OK
with you.

Regards,
Alex Sierra

On 1/27/2022 4:32 PM, Andrew Morton wrote:

On Wed, 26 Jan 2022 21:09:39 -0600 Alex Sierra  wrote:


This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
owned by a device that can be mapped into CPU page tables like
MEMORY_DEVICE_GENERIC and can also be migrated like
MEMORY_DEVICE_PRIVATE.

Some more reviewer input appears to be desirable here.

I was going to tentatively add it to -mm and -next, but problems.
5.17-rc1's mm/migrate.c:migrate_vma_check_page() is rather different
from the tree you patched.  Please redo, refresh and resend?



Re: [PATCH v3 09/10] tools: update hmm-test to support device coherent type

2022-01-26 Thread Sierra Guiza, Alejandro (Alex)



On 1/20/2022 12:14 AM, Alistair Popple wrote:

On Tuesday, 11 January 2022 9:32:00 AM AEDT Alex Sierra wrote:

Test cases such as migrate_fault and migrate_multiple, were modified to
explicit migrate from device to sys memory without the need of page
faults, when using device coherent type.

Snapshot test case updated to read memory device type first and based
on that, get the proper returned results. migrate_ping_pong test case

Where is the migrate_ping_pong test? Did you perhaps forget to add it? :-)


Migration from device coherent to system is tested with migrate_multiple
too. Therefore, I've removed the migrate_ping_pong test. BTW, I just added
the "number of pages migrated" check after the migration from coherent to
system in the v4 series.

Regards,
Alejandro Sierra




added to test explicit migration from device to sys memory for both
private and coherent zone types.

Helpers to migrate from device to sys memory and vice versa
were also added.

Signed-off-by: Alex Sierra 
---
v2:
Set FIXTURE_VARIANT to add multiple device types to the FIXTURE. This
will run all the tests for each device type (private and coherent) in
case both existed during hmm-test driver probed.
---
  tools/testing/selftests/vm/hmm-tests.c | 122 -
  1 file changed, 101 insertions(+), 21 deletions(-)

diff --git a/tools/testing/selftests/vm/hmm-tests.c 
b/tools/testing/selftests/vm/hmm-tests.c
index 864f126ffd78..8eb81dfba4b3 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -44,6 +44,14 @@ struct hmm_buffer {
int fd;
uint64_tcpages;
uint64_tfaults;
+   int zone_device_type;
+};
+
+enum {
+   HMM_PRIVATE_DEVICE_ONE,
+   HMM_PRIVATE_DEVICE_TWO,
+   HMM_COHERENCE_DEVICE_ONE,
+   HMM_COHERENCE_DEVICE_TWO,
  };
  
  #define TWOMEG		(1 << 21)

@@ -60,6 +68,21 @@ FIXTURE(hmm)
unsigned intpage_shift;
  };
  
+FIXTURE_VARIANT(hmm)

+{
+   int device_number;
+};
+
+FIXTURE_VARIANT_ADD(hmm, hmm_device_private)
+{
+   .device_number = HMM_PRIVATE_DEVICE_ONE,
+};
+
+FIXTURE_VARIANT_ADD(hmm, hmm_device_coherent)
+{
+   .device_number = HMM_COHERENCE_DEVICE_ONE,
+};
+
  FIXTURE(hmm2)
  {
int fd0;
@@ -68,6 +91,24 @@ FIXTURE(hmm2)
unsigned intpage_shift;
  };
  
+FIXTURE_VARIANT(hmm2)

+{
+   int device_number0;
+   int device_number1;
+};
+
+FIXTURE_VARIANT_ADD(hmm2, hmm2_device_private)
+{
+   .device_number0 = HMM_PRIVATE_DEVICE_ONE,
+   .device_number1 = HMM_PRIVATE_DEVICE_TWO,
+};
+
+FIXTURE_VARIANT_ADD(hmm2, hmm2_device_coherent)
+{
+   .device_number0 = HMM_COHERENCE_DEVICE_ONE,
+   .device_number1 = HMM_COHERENCE_DEVICE_TWO,
+};
+
  static int hmm_open(int unit)
  {
char pathname[HMM_PATH_MAX];
@@ -81,12 +122,19 @@ static int hmm_open(int unit)
return fd;
  }
  
+static bool hmm_is_coherent_type(int dev_num)

+{
+   return (dev_num >= HMM_COHERENCE_DEVICE_ONE);
+}
+
  FIXTURE_SETUP(hmm)
  {
self->page_size = sysconf(_SC_PAGE_SIZE);
self->page_shift = ffs(self->page_size) - 1;
  
-	self->fd = hmm_open(0);

+   self->fd = hmm_open(variant->device_number);
+   if (self->fd < 0 && hmm_is_coherent_type(variant->device_number))
+   SKIP(exit(0), "DEVICE_COHERENT not available");
ASSERT_GE(self->fd, 0);
  }
  
@@ -95,9 +143,11 @@ FIXTURE_SETUP(hmm2)

self->page_size = sysconf(_SC_PAGE_SIZE);
self->page_shift = ffs(self->page_size) - 1;
  
-	self->fd0 = hmm_open(0);

+   self->fd0 = hmm_open(variant->device_number0);
+   if (self->fd0 < 0 && hmm_is_coherent_type(variant->device_number0))
+   SKIP(exit(0), "DEVICE_COHERENT not available");
ASSERT_GE(self->fd0, 0);
-   self->fd1 = hmm_open(1);
+   self->fd1 = hmm_open(variant->device_number1);
ASSERT_GE(self->fd1, 0);
  }
  
@@ -144,6 +194,7 @@ static int hmm_dmirror_cmd(int fd,

}
buffer->cpages = cmd.cpages;
buffer->faults = cmd.faults;
+   buffer->zone_device_type = cmd.zone_device_type;
  
  	return 0;

  }
@@ -211,6 +262,20 @@ static void hmm_nanosleep(unsigned int n)
nanosleep(, NULL);
  }
  
+static int hmm_migrate_sys_to_dev(int fd,

+  struct hmm_buffer *buffer,
+  unsigned long npages)
+{
+   return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_DEV, buffer, npages);
+}
+
+static int hmm_migrate_dev_to_sys(int fd,
+  struct hmm_buffer *buffer,
+  unsigned long npages)
+{
+   return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_SYS, buffer, npages);
+}
+
  /*
   * Simple NULL test of device open/close.
   */
@@ -875,7 +940,7 @@ TEST_F(hmm, migrate)
ptr[i] = i;
  
  	/* Migrate memory to device. */

-   ret = hmm_dmirror_cmd(self->fd, 

Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system

2021-12-09 Thread Sierra Guiza, Alejandro (Alex)


On 12/9/2021 10:29 AM, Felix Kuehling wrote:

Am 2021-12-09 um 5:53 a.m. schrieb Alistair Popple:

On Thursday, 9 December 2021 5:55:26 AM AEDT Sierra Guiza, Alejandro (Alex) 
wrote:

On 12/8/2021 11:30 AM, Felix Kuehling wrote:

Am 2021-12-08 um 11:58 a.m. schrieb Felix Kuehling:

Am 2021-12-08 um 6:31 a.m. schrieb Alistair Popple:

On Tuesday, 7 December 2021 5:52:43 AM AEDT Alex Sierra wrote:

Avoid long term pinning for Coherent device type pages. This could
interfere with their own device memory manager.
If caller tries to get user device coherent pages with PIN_LONGTERM flag
set, those pages will be migrated back to system memory.

Signed-off-by: Alex Sierra
---
   mm/gup.c | 32 ++--
   1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 886d6148d3d0..1572eacf07f4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1689,17 +1689,37 @@ struct page *get_dump_page(unsigned long addr)
   #endif /* CONFIG_ELF_CORE */
   
   #ifdef CONFIG_MIGRATION

+static int migrate_device_page(unsigned long address,
+   struct page *page)
+{
+   struct vm_area_struct *vma = find_vma(current->mm, address);
+   struct vm_fault vmf = {
+   .vma = vma,
+   .address = address & PAGE_MASK,
+   .flags = FAULT_FLAG_USER,
+   .pgoff = linear_page_index(vma, address),
+   .gfp_mask = GFP_KERNEL,
+   .page = page,
+   };
+   if (page->pgmap && page->pgmap->ops->migrate_to_ram)
+   return page->pgmap->ops->migrate_to_ram(&vmf);

How does this synchronise against pgmap being released? As I understand things
at this point we're not holding a reference on either the page or pgmap, so
the page and therefore the pgmap may have been freed.

I think a similar problem exists for device private fault handling as well and
it has been on my list of things to fix for a while. I think the solution is to
call try_get_page(), except it doesn't work with device pages due to the whole
refcount thing. That issue is blocking a fair bit of work now so I've started
looking into it.

At least the page should have been pinned by the __get_user_pages_locked
call in __gup_longterm_locked. That refcount is dropped in
check_and_migrate_movable_pages when it returns 0 or an error.

Never mind. We unpin the pages first. Alex, would the migration work if
we unpinned them afterwards? Also, the normal CPU page fault code path
seems to make sure the page is locked (check in pfn_swap_entry_to_page)
before calling migrate_to_ram.

I don't think that's true. The check in pfn_swap_entry_to_page() is only for
migration entries:

BUG_ON(is_migration_entry(entry) && !PageLocked(p));

As this is coherent memory though why do we have to call into a device driver
to do the migration? Couldn't this all be done in the kernel?

I think you're right. I hadn't thought of that mainly because I'm even
less familiar with the non-device migration code. Alex, can you give
that a try? As long as the driver still gets a page-free callback when
the device page is freed, it should work.


ACK. Will do.

Alex Sierra


Regards,
   Felix



No, you cannot unpin after the migration, due to the expected_count vs.
page_count check in migrate_page_move_mapping during the migrate_page call.
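
(Paraphrasing the check in question -- a sketch, not a verbatim excerpt
from migrate_page_move_mapping:)

	/* A still-held pin makes page_count exceed the expected value,
	 * so the migration core bails out instead of moving the page. */
	expected_count = expected_page_refs(mapping, page); /* mapping refs + 1 */
	if (page_count(page) != expected_count)
		return -EAGAIN;	/* retry only after the extra pin is dropped */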

Regards,
Alex Sierra


Regards,
Felix




Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system

2021-12-08 Thread Sierra Guiza, Alejandro (Alex)



On 12/8/2021 11:30 AM, Felix Kuehling wrote:

Am 2021-12-08 um 11:58 a.m. schrieb Felix Kuehling:

Am 2021-12-08 um 6:31 a.m. schrieb Alistair Popple:

On Tuesday, 7 December 2021 5:52:43 AM AEDT Alex Sierra wrote:

Avoid long term pinning for Coherent device type pages. This could
interfere with their own device memory manager.
If caller tries to get user device coherent pages with PIN_LONGTERM flag
set, those pages will be migrated back to system memory.

Signed-off-by: Alex Sierra 
---
  mm/gup.c | 32 ++--
  1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 886d6148d3d0..1572eacf07f4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1689,17 +1689,37 @@ struct page *get_dump_page(unsigned long addr)
  #endif /* CONFIG_ELF_CORE */
  
  #ifdef CONFIG_MIGRATION

+static int migrate_device_page(unsigned long address,
+   struct page *page)
+{
+   struct vm_area_struct *vma = find_vma(current->mm, address);
+   struct vm_fault vmf = {
+   .vma = vma,
+   .address = address & PAGE_MASK,
+   .flags = FAULT_FLAG_USER,
+   .pgoff = linear_page_index(vma, address),
+   .gfp_mask = GFP_KERNEL,
+   .page = page,
+   };
+   if (page->pgmap && page->pgmap->ops->migrate_to_ram)
+   return page->pgmap->ops->migrate_to_ram(&vmf);

How does this synchronise against pgmap being released? As I understand things
at this point we're not holding a reference on either the page or pgmap, so
the page and therefore the pgmap may have been freed.

I think a similar problem exists for device private fault handling as well and
it has been on my list of things to fix for a while. I think the solution is to
call try_get_page(), except it doesn't work with device pages due to the whole
refcount thing. That issue is blocking a fair bit of work now so I've started
looking into it.

At least the page should have been pinned by the __get_user_pages_locked
call in __gup_longterm_locked. That refcount is dropped in
check_and_migrate_movable_pages when it returns 0 or an error.

Never mind. We unpin the pages first. Alex, would the migration work if
we unpinned them afterwards? Also, the normal CPU page fault code path
seems to make sure the page is locked (check in pfn_swap_entry_to_page)
before calling migrate_to_ram.


No, you cannot unpin after the migration, due to the expected_count vs.
page_count check in migrate_page_move_mapping during the migrate_page call.


Regards,
Alex Sierra


Regards,
   Felix




Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2021-10-14 Thread Sierra Guiza, Alejandro (Alex)



On 10/14/2021 3:57 PM, Ralph Campbell wrote:


On 10/14/21 11:01 AM, Jason Gunthorpe wrote:

On Thu, Oct 14, 2021 at 10:35:27AM -0700, Ralph Campbell wrote:


I ran xfstests-dev using the kernel boot option to "fake" a pmem device
when I first posted this patch. The tests ran OK (or at least the same
tests passed with and without my patch).

Hmm. I know nothing of xfstests but

tests/generic/413

Looks kind of like it might cover this situation?

Did it run for you?

Jason


I don't remember. I'll have to rerun the test, which might take a day or
two to set up again.


I just ran generic/413 on my side using a fake pmem device. It does fail.
I remember we proposed a fix in this patch before try_get_page was removed:

@@ -1186,7 +1153,7 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags);

 static inline __must_check bool try_get_page(struct page *page)
 {
	page = compound_head(page);
-	if (WARN_ON_ONCE(page_ref_count(page) <= 0))
+	if (WARN_ON_ONCE(page_ref_count(page) < (int)!is_zone_device_page(page)))
		return false;
	page_ref_inc(page);
	return true;
 }

Alex



Re: [PATCH v2 09/12] lib: test_hmm add module param for zone device type

2021-09-23 Thread Sierra Guiza, Alejandro (Alex)



On 9/21/2021 12:14 AM, Alistair Popple wrote:

On Tuesday, 21 September 2021 6:05:30 AM AEST Sierra Guiza, Alejandro (Alex) 
wrote:

On 9/20/2021 3:53 AM, Alistair Popple wrote:

On Tuesday, 14 September 2021 2:16:01 AM AEST Alex Sierra wrote:

In order to configure device public in test_hmm, two module parameters
should be passed, which correspond to the SP start address of each
device (2) spm_addr_dev0 & spm_addr_dev1. If no parameters are passed,
private device type is configured.

It's a pity that testing this seems to require some amount of special setup to
test. Is there a way this could be made to work on a more standard setup
similar to how DEVICE_PRIVATE is tested?

Hi Alistair
We tried to do it as simply as possible. Unfortunately, there are two main
requirements to register dev memory as DEVICE_PUBLIC type: this memory must
NOT be accessed by any memory allocator (SLAB, SLOB, SLUB), and it has to
be coherently accessible by the CPU. We also want to avoid aliasing the
same PFNs for different page types (regular system memory and
DEVICE_PUBLIC), so we don't want the reserved memory to be part of the
kernel's memory map before we call memremap_pages. A transparent way of
doing it, without any special HW, was setting a portion of system memory
as SPM (special purpose memory) and using this as our "fake device" memory.

Ok, I think it's great that we can test this without special HW but the boot
time configuration is still a bit annoying. Would it be possible to allocate
memory fitting the above requirements by hot unplugging it with something like
offline_and_remove_memory()?

I also don't see why the DEVICE_PRIVATE and DEVICE_PUBLIC testing should be
mutually exclusive - why can't we test both without reloading the module?
You could do both DEVICE_PRIVATE and DEVICE_PUBLIC tests by running the
test_hmm.sh script twice, just passing the proper parameters, even when you
booted with fake EFI SP regions. If the spm_addr_dev0/1 parameters are
passed, the module gets configured with the DEVICE_PUBLIC type; otherwise
the DEVICE_PRIVATE type is set. Technically the only complication in
testing DEVICE_PUBLIC is the fake SPM boot parameter.
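
For example (a sketch; the placeholders stand for the fake-SPM start
addresses you passed via efi_fake_mem):

	# DEVICE_PRIVATE run, no SPM parameters
	sudo ./test_hmm.sh smoke
	# DEVICE_PUBLIC run, SPM start addresses forwarded to the module
	sudo ./test_hmm.sh smoke <spm_addr_dev0> <spm_addr_dev1>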

Alex Sierra


  - Alistair


Regards,
Alex Sierra


Signed-off-by: Alex Sierra 
---
v5:
Remove devmem->pagemap.type = MEMORY_DEVICE_PRIVATE at
dmirror_allocate_chunk that was forcing to configure pagemap.type
to MEMORY_DEVICE_PRIVATE

v6:
Check for null pointers for resource and memremap references
at dmirror_allocate_chunk

v7:
Due to patch dropped from these patch series "kernel: resource:
lookup_resource as exported symbol", lookup_resource was not longer a
callable function. This was used in public device configuration, to
get start and end addresses, to create pgmap->range struct. This
information is now taken directly from the spm_addr_devX parameters and
the fixed size DEVMEM_CHUNK_SIZE.
---
   lib/test_hmm.c  | 66 +++--
   lib/test_hmm_uapi.h |  1 +
   2 files changed, 47 insertions(+), 20 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 3cd91ca31dd7..ef27e355738a 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -33,6 +33,16 @@
   #define DEVMEM_CHUNK_SIZE(256 * 1024 * 1024U)
   #define DEVMEM_CHUNKS_RESERVE16
   
+static unsigned long spm_addr_dev0;

+module_param(spm_addr_dev0, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev0,
+   "Specify start address for SPM (special purpose memory) used for 
device 0. By setting this Generic device type will be used. Make sure spm_addr_dev1 is 
set too");
+
+static unsigned long spm_addr_dev1;
+module_param(spm_addr_dev1, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev1,
+   "Specify start address for SPM (special purpose memory) used for 
device 1. By setting this Generic device type will be used. Make sure spm_addr_dev0 is 
set too");
+
   static const struct dev_pagemap_ops dmirror_devmem_ops;
   static const struct mmu_interval_notifier_ops dmirror_min_ops;
   static dev_t dmirror_dev;
@@ -450,11 +460,11 @@ static int dmirror_write(struct dmirror *dmirror, struct 
hmm_dmirror_cmd *cmd)
return ret;
   }
   
-static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,

+static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
   struct page **ppage)
   {
struct dmirror_chunk *devmem;
-   struct resource *res;
+   struct resource *res = NULL;
unsigned long pfn;
unsigned long pfn_first;
unsigned long pfn_last;
@@ -462,17 +472,29 @@ static bool dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
   
   	devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);

if (!devmem)
-   return false;
+   return -ENOMEM;
   
-	res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
-				      "hmm_dmirror");

Re: [PATCH v2 09/12] lib: test_hmm add module param for zone device type

2021-09-20 Thread Sierra Guiza, Alejandro (Alex)



On 9/20/2021 3:53 AM, Alistair Popple wrote:

On Tuesday, 14 September 2021 2:16:01 AM AEST Alex Sierra wrote:

In order to configure device public in test_hmm, two module parameters
should be passed, which correspond to the SP start address of each
device (2) spm_addr_dev0 & spm_addr_dev1. If no parameters are passed,
private device type is configured.

It's a pity that testing this seems to require some amount of special setup to
test. Is there a way this could be made to work on a more standard setup
similar to how DEVICE_PRIVATE is tested?

Hi Alistair
We tried to do it as simply as possible. Unfortunately, there are two main
requirements to register dev memory as DEVICE_PUBLIC type: this memory must
NOT be accessed by any memory allocator (SLAB, SLOB, SLUB), and it has to
be coherently accessible by the CPU. We also want to avoid aliasing the
same PFNs for different page types (regular system memory and
DEVICE_PUBLIC), so we don't want the reserved memory to be part of the
kernel's memory map before we call memremap_pages. A transparent way of
doing it, without any special HW, was setting a portion of system memory
as SPM (special purpose memory) and using this as our "fake device" memory.

Regards,
Alex Sierra




Signed-off-by: Alex Sierra 
---
v5:
Remove devmem->pagemap.type = MEMORY_DEVICE_PRIVATE at
dmirror_allocate_chunk that was forcing to configure pagemap.type
to MEMORY_DEVICE_PRIVATE

v6:
Check for null pointers for resource and memremap references
at dmirror_allocate_chunk

v7:
Due to patch dropped from these patch series "kernel: resource:
lookup_resource as exported symbol", lookup_resource was not longer a
callable function. This was used in public device configuration, to
get start and end addresses, to create pgmap->range struct. This
information is now taken directly from the spm_addr_devX parameters and
the fixed size DEVMEM_CHUNK_SIZE.
---
  lib/test_hmm.c  | 66 +++--
  lib/test_hmm_uapi.h |  1 +
  2 files changed, 47 insertions(+), 20 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 3cd91ca31dd7..ef27e355738a 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -33,6 +33,16 @@
  #define DEVMEM_CHUNK_SIZE (256 * 1024 * 1024U)
  #define DEVMEM_CHUNKS_RESERVE 16
  
+static unsigned long spm_addr_dev0;

+module_param(spm_addr_dev0, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev0,
+   "Specify start address for SPM (special purpose memory) used for 
device 0. By setting this Generic device type will be used. Make sure spm_addr_dev1 is 
set too");
+
+static unsigned long spm_addr_dev1;
+module_param(spm_addr_dev1, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev1,
+   "Specify start address for SPM (special purpose memory) used for 
device 1. By setting this Generic device type will be used. Make sure spm_addr_dev0 is 
set too");
+
  static const struct dev_pagemap_ops dmirror_devmem_ops;
  static const struct mmu_interval_notifier_ops dmirror_min_ops;
  static dev_t dmirror_dev;
@@ -450,11 +460,11 @@ static int dmirror_write(struct dmirror *dmirror, struct 
hmm_dmirror_cmd *cmd)
return ret;
  }
  
-static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,

+static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
   struct page **ppage)
  {
struct dmirror_chunk *devmem;
-   struct resource *res;
+   struct resource *res = NULL;
unsigned long pfn;
unsigned long pfn_first;
unsigned long pfn_last;
@@ -462,17 +472,29 @@ static bool dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
  
  	devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);

if (!devmem)
-   return false;
+   return -ENOMEM;
  
-	res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
-				      "hmm_dmirror");
-	if (IS_ERR(res))
-		goto err_devmem;
+	if (!spm_addr_dev0 && !spm_addr_dev1) {
+		res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
+					      "hmm_dmirror");
+		if (IS_ERR_OR_NULL(res))
+			goto err_devmem;
+   devmem->pagemap.range.start = res->start;
+   devmem->pagemap.range.end = res->end;
+   devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+   mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
+   } else if (spm_addr_dev0 && spm_addr_dev1) {
+   devmem->pagemap.range.start = MINOR(mdevice->cdevice.dev) ?
+   spm_addr_dev0 :
+   spm_addr_dev1;
+   devmem->pagemap.range.end = devmem->pagemap.range.start +
+   DEVMEM_CHUNK_SIZE - 1;
+   devmem->pagemap.type = MEMORY_DEVICE_PUBLIC;
+   

Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

2021-08-25 Thread Sierra Guiza, Alejandro (Alex)



On 8/25/2021 2:46 AM, Christoph Hellwig wrote:

On Tue, Aug 24, 2021 at 10:48:17PM -0500, Alex Sierra wrote:

} else {
-   if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
+   if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM) &&
+   !(migrate->flags & MIGRATE_VMA_SELECT_IOMEM))
goto next;
pfn = pte_pfn(pte);
if (is_zero_pfn(pfn)) {

.. also how is this going to work for the device public memory?  That
should be pte_special() an thus fail vm_normal_page.
Perhaps we're missing something, as we're not doing any special marking
for the device public pfns/entries. pfn_valid returns true, pte_special
returns false and pfn_t_devmap returns false for these pages, the same as
for system pages. That's the reason vm_normal_page returns the page
correctly through the pfn_to_page helper.
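
In other words, a simplified sketch of the relevant path (not the exact
upstream vm_normal_page code):

	/* With CONFIG_ARCH_HAS_PTE_SPECIAL: device public PTEs are ordinary
	 * present entries, so the special-page check never fires. */
	struct page *normal_page_sketch(pte_t pte)
	{
		unsigned long pfn = pte_pfn(pte);

		if (pte_special(pte))		/* false for device public pages */
			return NULL;
		return pfn_to_page(pfn);	/* valid since pfn_valid(pfn) is true */
	}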


Regards,
Alex S.


Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount

2021-08-19 Thread Sierra Guiza, Alejandro (Alex)



On 8/18/2021 2:28 PM, Ralph Campbell wrote:

On 8/17/21 5:35 PM, Felix Kuehling wrote:

Am 2021-08-17 um 8:01 p.m. schrieb Ralph Campbell:

On 8/12/21 11:31 PM, Alex Sierra wrote:

From: Ralph Campbell 

ZONE_DEVICE struct pages have an extra reference count that
complicates the
code for put_page() and several places in the kernel that need to
check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't
need to
be treated specially for ZONE_DEVICE.

v2:
AS: merged this patch in linux 5.11 version

v5:
AS: add condition at try_grab_page to check for the zone device type,
while
page ref counter is checked less/equal to zero. In case of device
zone, pages
ref counter are initialized to zero.

Signed-off-by: Ralph Campbell 
Signed-off-by: Alex Sierra 
---
   arch/powerpc/kvm/book3s_hv_uvmem.c |  2 +-
   drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
   fs/dax.c   |  4 +-
   include/linux/dax.h    |  2 +-
   include/linux/memremap.h   |  7 +--
   include/linux/mm.h | 13 +
   lib/test_hmm.c |  2 +-
   mm/internal.h  |  8 +++
   mm/memremap.c  | 68 
+++---

   mm/migrate.c   |  5 --
   mm/page_alloc.c    |  3 ++
   mm/swap.c  | 45 ++---
   12 files changed, 46 insertions(+), 115 deletions(-)


I haven't seen a response to the issues I raised back at v3 of this
series.
https://lore.kernel.org/linux-mm/4f6dd918-d79b-1aa7-3a4c-caa67ddc29bc@nvidia.com/




Did I miss something?

I think part of the response was that we did more testing. Alex added
support for DEVICE_GENERIC pages to test_hmm and he ran DAX tests
recommended by Theodore Tso. In that testing he ran into a WARN_ON_ONCE
about a zero page refcount in try_get_page. The fix is in the latest
version of patch 2. But it's already obsolete because John Hubbard is
about to remove that function altogether.

I think the issues you raised were more uncertainty than known bugs. It
seems the fact that you can have DAX pages with 0 refcount is a feature
more than a bug.

Regards,
   Felix


Did you test on a system without CONFIG_ARCH_HAS_PTE_SPECIAL defined?
In that case, mmap() of a DAX device will call insert_page() which calls
get_page() which would trigger VM_BUG_ON_PAGE().

I can believe it is OK for PTE_SPECIAL page table entries to have no
struct page or that MEMORY_DEVICE_GENERIC struct pages be mapped with
a zero reference count using insert_pfn().

Hi Ralph,
We have tried the DAX tests with and without CONFIG_ARCH_HAS_PTE_SPECIAL
defined. Apparently none of the tests touches that condition for a DAX
device. Of course, that doesn't mean it couldn't happen.

Regards,
Alex S.




I find it hard to believe that other MM developers don't see an issue
with a struct page with refcount == 0 and mapcount == 1.

I don't see where init_page_count() is being called for the
MEMORY_DEVICE_GENERIC or MEMORY_DEVICE_PRIVATE struct pages the AMD
driver allocates and passes to migrate_vma_setup().
Looks like svm_migrate_get_vram_page() needs to call init_page_count()
instead of get_page(). (I'm looking at branch 
origin/alexsierrag/device_generic

https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver.git)
Yes, you're right. My bad. Thanks for catching this. I didn't realize I
had missed defining CONFIG_DEBUG_VM in my build, so this BUG was never
caught. It worked after I replaced get_page with init_page_count in
svm_migrate_get_vram_page. However, I don't think this is the best way to
fix it. Ideally, the get_page call should also work for device pages with
a refcount equal to 0. Otherwise, we could overwrite the refcounter if
someone else grabs the page concurrently. I was thinking of adding a
special condition in get_page for device pages. That could also fix the
insert_page -> get_page call from a DAX device.

Regards,
Alex S.



Also, what about the other places where is_device_private_page() is 
called?

Don't they need to be updated to call is_device_page() instead?
One of my goals for this patch was to 

Re: [PATCH v6 05/13] drm/amdkfd: generic type as sys mem on migration to ram

2021-08-16 Thread Sierra Guiza, Alejandro (Alex)



On 8/15/2021 10:38 AM, Christoph Hellwig wrote:

On Fri, Aug 13, 2021 at 01:31:42AM -0500, Alex Sierra wrote:

migrate.vma = vma;
migrate.start = start;
migrate.end = end;
-   migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
migrate.pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev);
  
+	if (adev->gmc.xgmi.connected_to_cpu)

+   migrate.flags = MIGRATE_VMA_SELECT_SYSTEM;
+   else
+   migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;

It's been a while since I touched this migrate code, but doesn't this
mean that if the range already contains system memory the migration
now won't do anything? for the connected_to_cpu case?


When the condition above (connected_to_cpu) is true, we're explicitly
migrating from device memory to system memory with the device generic
type. With this type, device PTEs are present in the CPU page table.

During the migrate_vma_collect_pmd walk in the migrate_vma_setup call,
there's a condition for present PTEs that requires migrate->flags to have
MIGRATE_VMA_SELECT_SYSTEM set. Otherwise, migration for that entry will be
ignored.
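
(A paraphrase of that collect-walk rule -- a sketch, not a verbatim
excerpt from migrate_vma_collect_pmd:)

	/* Present, non-device-private PTEs are only collected for migration
	 * when the caller selected system memory as a migration source. */
	static bool collect_present_pte(struct migrate_vma *migrate, pte_t pte)
	{
		if (!pte_present(pte))
			return false;	/* handled by the device-private/swap path */
		return !!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM);
	}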

Regards,
Alex S.



Re: [PATCH v5 00/13] Support DEVICE_GENERIC memory in migrate_vma_*

2021-08-12 Thread Sierra Guiza, Alejandro (Alex)



On 8/12/2021 1:34 AM, Christoph Hellwig wrote:

Do you have a pointer to a git branch with this series and all dependencies
to ease testing?


Hi Christoph,

Yes, actually tomorrow we're planning to send a new version with detailed
instructions on how to set up and run each of the tests. This will also
contain a link to download our kernel repo with all these patches.

Regards,

Alejandro S.



On Thu, Aug 12, 2021 at 01:30:47AM -0500, Alex Sierra wrote:

v1:
AMD is building a system architecture for the Frontier supercomputer with a
coherent interconnect between CPUs and GPUs. This hardware architecture allows
the CPUs to coherently access GPU device memory. We have hardware in our labs
and we are working with our partner HPE on the BIOS, firmware and software
for delivery to the DOE.

The system BIOS advertises the GPU device memory (aka VRAM) as SPM
(special purpose memory) in the UEFI system address map. The amdgpu driver looks
it up with lookup_resource and registers it with devmap as MEMORY_DEVICE_GENERIC
using devm_memremap_pages.

Now we're trying to migrate data to and from that memory using the migrate_vma_*
helpers so we can support page-based migration in our unified memory 
allocations,
while also supporting CPU access to those pages.

This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave
correctly in the migrate_vma_* helpers. We are looking for feedback about this
approach. If we're close, what's needed to make our patches acceptable upstream?
If we're not close, any suggestions how else to achieve what we are trying to do
(i.e. page migration and coherent CPU access to VRAM)?

This work is based on HMM and our SVM memory manager that was recently 
upstreamed
to Dave Airlie's drm-next branch
https://cgit.freedesktop.org/drm/drm/log/?h=drm-next
On top of that we did some rework of our VRAM management for migrations to 
remove
some incorrect assumptions, allow partially successful migrations and GPU memory
mappings that mix pages in VRAM and system memory.
https://lore.kernel.org/dri-devel/20210527205606.2660-6-Felix.Kuehling@amd.com/T/#r996356015e295780eb50453e7dbd5d0d68b47cbc

v2:
This patch series version has merged "[RFC PATCH v3 0/2]
mm: remove extra ZONE_DEVICE struct page refcount" patch series made by
Ralph Campbell. It also applies at the top of these series, our changes
to support device generic type in migration_vma helpers.
This has been tested in systems with device memory that has coherent
access by CPU.

Also addresses the following feedback made in v1:
- Isolate in one patch kernel/resource.c modification, based
on Christoph's feedback.
- Add helpers check for generic and private type to avoid
duplicated long lines.

v3:
- Include cover letter from v1.
- Rename dax_layout_is_idle_page func to dax_page_unused in patch
ext4/xfs: add page refcount helper.

v4:
- Add support for zone device generic type in lib/test_hmm and
tool/testing/selftest/vm/hmm-tests.
- Add missing page refcount helper to fuse/dax.c. This was included in
one of Ralph Campbell's patches.

v5:
- Cosmetic changes on patches 3, 5 and 13
- Bug founded at test_hmm, remove devmem->pagemap.type = MEMORY_DEVICE_PRIVATE
at dmirror_allocate_chunk that was forcing to configure pagemap.type to
MEMORY_DEVICE_PRIVATE.
- A bug was found while running one of the xfstest (generic/413) used to
validate fs_dax device type. This was first introduced by patch: "mm: remove
extra ZONE_DEVICE struct page refcount" whic is part of these patch series.
The bug was showed as WARNING message at try_grab_page function call, due to
a page refcounter equal to zero. Part of "mm: remove extra ZONE_DEVICE struct
page refcount" changes, was to initialize page refcounter to zero. Therefore,
a special condition was added to try_grab_page on this v5, were it checks for
device zone pages too. It is included in the same patch.

This is how mm changes from these patch series have been validated:
- hmm-tests were run using device private and device generic types. This last,
just added in these patch series. efi_fake_mem was used to mimic SPM memory
for device generic.
- xfstests tool was used to validate fs-dax device type and page refcounter
changes. DAX configuration was used along with emulated 

Re: [PATCH v4 10/13] lib: test_hmm add module param for zone device type

2021-07-28 Thread Sierra Guiza, Alejandro (Alex)



On 7/22/2021 12:26 PM, Jason Gunthorpe wrote:

On Thu, Jul 22, 2021 at 11:59:17AM -0500, Sierra Guiza, Alejandro (Alex) wrote:

On 7/22/2021 7:23 AM, Jason Gunthorpe wrote:

On Sat, Jul 17, 2021 at 02:21:32PM -0500, Alex Sierra wrote:

In order to configure device generic in test_hmm, two
module parameters should be passed, which correspond to the
SP start address of each device (2) spm_addr_dev0 &
spm_addr_dev1. If no parameters are passed, private device
type is configured.

I don't think tests should need configuration like this, is it really
necessary? How can people with normal HW run this test?

Hi Jason,
The idea was to add an easy way to validate the codepaths touched by this
patch series, which makes modifications to the migration helpers for device
generic type pages. We're using CONFIG_EFI_FAKE_MEMMAP to create fake SPM
devices inside system memory; no special HW is needed. We then pass the
kernel parameter efi_fake_mem. Ex. efi_fake_mem=1G@0x1:0x4. I should
probably include a small example of how to set this in the test_hmm.sh
usage().

I don't think anything about hmm is sensitive to how the pages are
acquired - you can't create device generic pages without relying on
FAKE_MEMMAP?

The reason we used the fake SPM approach was to have a "special memory"
not managed by Linux (NOT registered as normal system memory), but still
accessible by the CPU.

For device_generic we cannot allocate new physical addresses. We need the
physical address to match the actual system memory physical address, so
that CPU mappings work as expected.
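
For reference, a rough sketch of that registration under these constraints
(start/size are assumed to come straight from the firmware-reserved SPM
range; this is not the actual amdgpu code):

	/* The struct pages created here carry the real system physical
	 * addresses, so CPU mappings of DEVICE_GENERIC memory behave like
	 * normal RAM. */
	static void *register_spm_range(struct device *dev, u64 start, u64 size)
	{
		struct dev_pagemap *pgmap;

		pgmap = devm_kzalloc(dev, sizeof(*pgmap), GFP_KERNEL);
		if (!pgmap)
			return ERR_PTR(-ENOMEM);
		pgmap->type = MEMORY_DEVICE_GENERIC;
		pgmap->range.start = start;  /* must match the firmware SPM address */
		pgmap->range.end = start + size - 1;
		pgmap->nr_range = 1;
		return devm_memremap_pages(dev, pgmap);
	}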

Would you recommend a different approach?

Regards,
Alex Sierra



Jason


Re: [PATCH v3 0/8] Support DEVICE_GENERIC memory in migrate_vma_*

2021-07-23 Thread Sierra Guiza, Alejandro (Alex)



On 7/17/2021 2:54 PM, Sierra Guiza, Alejandro (Alex) wrote:


On 7/16/2021 5:14 PM, Felix Kuehling wrote:

Am 2021-07-16 um 11:07 a.m. schrieb Theodore Y. Ts'o:

On Wed, Jun 23, 2021 at 05:49:55PM -0400, Felix Kuehling wrote:
I can think of two ways to test the changes for 
MEMORY_DEVICE_GENERIC in
this patch series in a way that is reproducible without special 
hardware and

firmware:

For the reference counting changes we could use the dax driver with 
hmem and
use efi_fake_mem on the kernel command line to create some 
DEVICE_GENERIC
pages. I'm open to suggestions for good user mode tests to exercise 
dax

functionality on this type of memory.

Sorry for the thread necromancy, but now that the merge window is
past

No worries. Alejandro should have a new version of this series in a few
days, with updates to hmm_test and some fixes.


V4 patch series have been sent for review.
https://marc.info/?l=dri-devel&m=162654971618911&w=2

Regards,
Alex Sierra





Today I test ext4's dax support, without having any $$$ DAX hardware,
by using the kernel command line "memmap=4G!9G:memmap=9G!14G" which
reserves memory so that creates two pmem device and then I run
xfstests with DAX enabled using qemu or using a Google Compute Engine
VM, using TEST_DEV=/dev/pmem0 and SCRATCH_DEV=/dev/pmem1.

If you can give me a recipe for what kernel configs I should enable,
and what magic kernel command line arguments to use, then I'd be able
to test your patch set with ext4.

That would be great!

Regarding kernel config options, it should be the same as what you're
using for DAX testing today. We're not changing or adding any Kconfig
options. But let me take a stab:

ZONE_DEVICE
HMM_MIRROR
MMU_NOTIFIER
DEVICE_PRIVATE (maybe not needed for your test)
FS_DAX

Hi Theodore,
I wonder if you had a chance to set the kernel configs from Felix to
enable DAX in xfstests.

I've been trying to test FS DAX on my side using virtio-fs + QEMU.
Unfortunately, I'm having some problems setting up the environment with
the guest kernel. Could you share your VM (QEMU or GCE) setup to run it
with xfstests?


Regards,
Alex S.



I'm not sure what you're looking for in terms of kernel command line,
other than the memmap options you already found. There are some more
options to run hmm_test with fake SPM (DEVICE_GENERIC) memory, but we're
already running that ourselves. That will also be in the next revision
of this patch series.


In order to run hmm test with generic device type enabled, set the 
following:


kernel config:
EFI_FAKE_MEMMAP
RUNTIME_TESTING_MENU
TEST_HMM=m

Kernel parameters to fake SP memory. The addresses set here are based 
on my system's usable memory enumerated by BIOS-e820 at boot. The test 
requires two SP devices of at least 256MB.

efi_fake_mem=1G@0x1:0x4,1G@0x14000:0x4

To run the hmm_test with the generic device type, pass the SP addresses to
the sh script.

sudo ./test_hmm.sh smoke 0x1 0x14000



If you can run your xfstests with DAX on top of this patch series, that
would be very helpful. That's to make sure the ZONE_DEVICE page refcount
changes don't break DAX.

Regards,
   Felix



Cheers,

    - Ted


Re: [PATCH v4 10/13] lib: test_hmm add module param for zone device type

2021-07-22 Thread Sierra Guiza, Alejandro (Alex)



On 7/22/2021 7:23 AM, Jason Gunthorpe wrote:

On Sat, Jul 17, 2021 at 02:21:32PM -0500, Alex Sierra wrote:

In order to configure device generic in test_hmm, two
module parameters should be passed, which correspon to the
SP start address of each device (2) spm_addr_dev0 &
spm_addr_dev1. If no parameters are passed, private device
type is configured.

I don't think tests should need configuration like this, is it really
necessary? How can people with normal HW run this test?

Hi Jason,
The idea was to add an easy way to validate the codepaths touched by this
patch series, which makes modifications to the migration helpers for device
generic type pages. We're using CONFIG_EFI_FAKE_MEMMAP to create fake SPM
devices inside system memory; no special HW is needed. We then pass the
kernel parameter efi_fake_mem. Ex. efi_fake_mem=1G@0x1:0x4. I should
probably include a small example of how to set this in the test_hmm.sh
usage().

Alex S.


Jason


Re: [PATCH v3 0/8] Support DEVICE_GENERIC memory in migrate_vma_*

2021-07-17 Thread Sierra Guiza, Alejandro (Alex)



On 7/16/2021 5:14 PM, Felix Kuehling wrote:

Am 2021-07-16 um 11:07 a.m. schrieb Theodore Y. Ts'o:

On Wed, Jun 23, 2021 at 05:49:55PM -0400, Felix Kuehling wrote:

I can think of two ways to test the changes for MEMORY_DEVICE_GENERIC in
this patch series in a way that is reproducible without special hardware and
firmware:

For the reference counting changes we could use the dax driver with hmem and
use efi_fake_mem on the kernel command line to create some DEVICE_GENERIC
pages. I'm open to suggestions for good user mode tests to exercise dax
functionality on this type of memory.

Sorry for the thread necromancy, but now that the merge window is
past

No worries. Alejandro should have a new version of this series in a few
days, with updates to hmm_test and some fixes.


V4 patch series have been sent for review.
https://marc.info/?l=dri-devel&m=162654971618911&w=2

Regards,
Alex Sierra





Today I test ext4's dax support, without having any $$$ DAX hardware,
by using the kernel command line "memmap=4G!9G:memmap=9G!14G" which
reserves memory so that creates two pmem device and then I run
xfstests with DAX enabled using qemu or using a Google Compute Engine
VM, using TEST_DEV=/dev/pmem0 and SCRATCH_DEV=/dev/pmem1.

If you can give me a recipe for what kernel configs I should enable,
and what magic kernel command line arguments to use, then I'd be able
to test your patch set with ext4.

That would be great!

Regarding kernel config options, it should be the same as what you're
using for DAX testing today. We're not changing or adding any Kconfig
options. But let me take a stab:

ZONE_DEVICE
HMM_MIRROR
MMU_NOTIFIER
DEVICE_PRIVATE (maybe not needed for your test)
FS_DAX

I'm not sure what you're looking for in terms of kernel command line,
other than the memmap options you already found. There are some more
options to run hmm_test with fake SPM (DEVICE_GENERIC) memory, but we're
already running that ourselves. That will also be in the next revision
of this patch series.


In order to run hmm test with generic device type enabled, set the 
following:


kernel config:
EFI_FAKE_MEMMAP
RUNTIME_TESTING_MENU
TEST_HMM=m

Kernel parameters to fake SP memory. The addresses set here are based on 
my system's usable memory enumerated by BIOS-e820 at boot. The test 
requires two SP devices of at least 256MB.

efi_fake_mem=1G@0x1:0x4,1G@0x14000:0x4

To run the hmm_test with the generic device type, pass the SP addresses to
the sh script.

sudo ./test_hmm.sh smoke 0x1 0x14000



If you can run your xfstests with DAX on top of this patch series, that
would be very helpful. That's to make sure the ZONE_DEVICE page refcount
changes don't break DAX.

Regards,
   Felix



Cheers,

- Ted


Re: [PATCH v3 1/8] ext4/xfs: add page refcount helper

2021-06-19 Thread Sierra Guiza, Alejandro (Alex)



On 6/17/2021 6:52 PM, Dave Chinner wrote:

On Thu, Jun 17, 2021 at 10:16:58AM -0500, Alex Sierra wrote:

From: Ralph Campbell 

There are several places where ZONE_DEVICE struct pages assume a reference
count == 1 means the page is idle and free. Instead of open coding this,
add a helper function to hide this detail.

v2:
[AS]: rename dax_layout_is_idle_page func to dax_page_unused

Did you even compile test this?
Thanks for catching this, Dave. My mistake, I was building without
CONFIG_FS_DAX, as I was using DEVICE_GENERIC only.

I'll send the corrected patch in reply.

Regards,

Alex Sierra




Signed-off-by: Ralph Campbell 
Signed-off-by: Alex Sierra 
---
  fs/dax.c|  4 ++--
  fs/ext4/inode.c |  5 +
  fs/xfs/xfs_file.c   |  4 +---
  include/linux/dax.h | 10 ++
  4 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 26d5dcd2d69e..321f4ddc6643 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct 
address_space *mapping,
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);
  
-		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);

+   WARN_ON_ONCE(trunc && !dax_layout_is_idle_page(page));

Because you still use dax_layout_is_idle_page() here, not
dax_page_unused()...


WARN_ON_ONCE(page->mapping && page->mapping != mapping);
page->mapping = NULL;
page->index = 0;
@@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry)
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);
  
-		if (page_ref_count(page) > 1)

+   if (!dax_layout_is_idle_page(page))

Here too.

Cheers,

Dave.


Re: [PATCH v3 0/8] Support DEVICE_GENERIC memory in migrate_vma_*

2021-06-17 Thread Sierra Guiza, Alejandro (Alex)



On 6/17/2021 10:16 AM, Alex Sierra wrote:

v1:
AMD is building a system architecture for the Frontier supercomputer with a
coherent interconnect between CPUs and GPUs. This hardware architecture allows
the CPUs to coherently access GPU device memory. We have hardware in our labs
and we are working with our partner HPE on the BIOS, firmware and software
for delivery to the DOE.

The system BIOS advertises the GPU device memory (aka VRAM) as SPM
(special purpose memory) in the UEFI system address map. The amdgpu driver looks
it up with lookup_resource and registers it with devmap as MEMORY_DEVICE_GENERIC
using devm_memremap_pages.

Now we're trying to migrate data to and from that memory using the migrate_vma_*
helpers so we can support page-based migration in our unified memory 
allocations,
while also supporting CPU access to those pages.

This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave
correctly in the migrate_vma_* helpers. We are looking for feedback about this
approach. If we're close, what's needed to make our patches acceptable upstream?
If we're not close, any suggestions how else to achieve what we are trying to do
(i.e. page migration and coherent CPU access to VRAM)?

This work is based on HMM and our SVM memory manager that was recently 
upstreamed
to Dave Airlie's drm-next branch
https://lore.kernel.org/dri-devel/20210527205606.2660-6-felix.kuehl...@amd.com/T/#r996356015e295780eb50453e7dbd5d0d68b47cbc

Corrected link:

https://cgit.freedesktop.org/drm/drm/log/?h=drm-next

Regards,
Alex Sierra


On top of that we did some rework of our VRAM management for migrations to 
remove
some incorrect assumptions, allow partially successful migrations and GPU memory
mappings that mix pages in VRAM and system memory.
https://patchwork.kernel.org/project/dri-devel/list/?series=489811


Corrected link:

https://lore.kernel.org/dri-devel/20210527205606.2660-6-felix.kuehl...@amd.com/T/#r996356015e295780eb50453e7dbd5d0d68b47cbc

Regards,
Alex Sierra



v2:
This patch series version has merged "[RFC PATCH v3 0/2]
mm: remove extra ZONE_DEVICE struct page refcount" patch series made by
Ralph Campbell. It also applies at the top of these series, our changes
to support device generic type in migration_vma helpers.
This has been tested in systems with device memory that has coherent
access by CPU.

Also addresses the following feedback made in v1:
- Isolate in one patch kernel/resource.c modification, based
on Christoph's feedback.
- Add helpers check for generic and private type to avoid
duplicated long lines.

v3:
- Include cover letter from v1
- Rename dax_layout_is_idle_page func to dax_page_unused in patch
ext4/xfs: add page refcount helper

Patches 1-2 Rebased Ralph Campbell's ZONE_DEVICE page refcounting patches
Patches 4-5 are for context to show how we are looking up the SPM
memory and registering it with devmap.
Patches 3,6-8 are the changes we are trying to upstream or rework to
make them acceptable upstream.

Alex Sierra (6):
   kernel: resource: lookup_resource as exported symbol
   drm/amdkfd: add SPM support for SVM
   drm/amdkfd: generic type as sys mem on migration to ram
   include/linux/mm.h: helpers to check zone device generic type
   mm: add generic type support to migrate_vma helpers
   mm: call pgmap->ops->page_free for DEVICE_GENERIC pages

Ralph Campbell (2):
   ext4/xfs: add page refcount helper
   mm: remove extra ZONE_DEVICE struct page refcount

  arch/powerpc/kvm/book3s_hv_uvmem.c   |  2 +-
  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 15 --
  drivers/gpu/drm/nouveau/nouveau_dmem.c   |  2 +-
  fs/dax.c |  8 +--
  fs/ext4/inode.c  |  5 +-
  fs/xfs/xfs_file.c|  4 +-
  include/linux/dax.h  | 10 
  include/linux/memremap.h |  7 +--
  include/linux/mm.h   | 52 +++---
  kernel/resource.c|  2 +-
  lib/test_hmm.c   |  2 +-
  mm/internal.h|  8 +++
  mm/memremap.c| 69 +++-
  mm/migrate.c | 13 ++---
  mm/page_alloc.c  |  3 ++
  mm/swap.c| 45 ++--
  16 files changed, 83 insertions(+), 164 deletions(-)



Re: [PATCH] drm/ttm: fix pipelined gutting for evictions v2

2020-07-26 Thread Sierra Guiza, Alejandro (Alex)


On 7/24/2020 10:01 AM, Felix Kuehling wrote:

Am 2020-07-24 um 7:56 a.m. schrieb Christian König:

We can't pipeline that during eviction because the memory needs
to be available immediately.

v2: fix how we cleanup the BOs resources

Signed-off-by: Christian König 

Reviewed-by: Felix Kuehling 

It would be good to get a Tested-by from Alex as well.

Thanks,
   Felix


Tested-by: Alex Sierra 

Regards,

Alex Sierra




---
  drivers/gpu/drm/ttm/ttm_bo.c | 8 ++--
  1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 0768a054a916..469aa93ea317 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -652,8 +652,12 @@ static int ttm_bo_evict(struct ttm_buffer_object *bo,
 	placement.num_busy_placement = 0;
 	bdev->driver->evict_flags(bo, &placement);
 
-	if (!placement.num_placement && !placement.num_busy_placement)
-		return ttm_bo_pipeline_gutting(bo);
+	if (!placement.num_placement && !placement.num_busy_placement) {
+		ttm_bo_wait(bo, false, false);
+
+		ttm_bo_cleanup_memtype_use(bo);
+		return 0;
+	}
 
 	evict_mem = bo->mem;
 	evict_mem.mm_node = NULL;

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel