Re: [PATCH v3 2/8] mm: remove extra ZONE_DEVICE struct page refcount

2021-06-29 Thread Ralph Campbell

On 6/28/21 9:46 AM, Felix Kuehling wrote:

Am 2021-06-17 um 3:16 p.m. schrieb Ralph Campbell:

On 6/17/21 8:16 AM, Alex Sierra wrote:

From: Ralph Campbell 

ZONE_DEVICE struct pages have an extra reference count that
complicates the
code for put_page() and several places in the kernel that need to
check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't
need to
be treated specially for ZONE_DEVICE.

v2:
AS: merged this patch in linux 5.11 version

Signed-off-by: Ralph Campbell 
Signed-off-by: Alex Sierra 
---
   arch/powerpc/kvm/book3s_hv_uvmem.c |  2 +-
   drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
   fs/dax.c   |  4 +-
   include/linux/dax.h    |  2 +-
   include/linux/memremap.h   |  7 +--
   include/linux/mm.h | 44 -
   lib/test_hmm.c |  2 +-
   mm/internal.h  |  8 +++
   mm/memremap.c  | 68 +++---
   mm/migrate.c   |  5 --
   mm/page_alloc.c    |  3 ++
   mm/swap.c  | 45 ++---
   12 files changed, 45 insertions(+), 147 deletions(-)


I think it is great that you are picking this up and trying to revive it.

However, I have a number of concerns about how it affects existing
ZONE_DEVICE
MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_FS_DAX users and I don't see this
addressing them. For example, dev_dax_probe() allocates
MEMORY_DEVICE_GENERIC
struct pages and then:
   dev_dax_fault()
     dev_dax_huge_fault()
   __dev_dax_pte_fault()
     vmf_insert_mixed()
which just inserts the PFN into the CPU page tables without increasing
the page
refcount so it is zero (whereas it was one before). But using
get_page() will
trigger VM_BUG_ON_PAGE() if it is enabled. There isn't any current
notion of
free verses allocated for these struct pages. I suppose init_page_count()
could be called on all the struct pages in dev_dax_probe() to fix that
though.

Hi Ralph,

For DEVICE_GENERIC pages free_zone_device_page doesn't do anything. So
I'm not sure what the reference counting is good for in this case.

Alex is going to add free_zone_device_page support for DEVICE_GENERIC
pages (patch 8 of this series). However, even then, it only does
anything if there is an actual call to put_page. Where would that call
come from in the dev_dax driver case?


Correct, the drivers/dax/device.c driver allocates MEMORY_DEVICE_GENERIC
struct pages and doesn't seem to allocate/free the page nor increment/decrement
the reference count but it does call vmf_insert_mixed() if the /dev/file
is mmap()'ed into a user process' address space. If devm_memremap_pages()
returns the array of ZONE_DEVICE struct pages initialized with a reference
count of zero, then the CPU page tables will have a PTE/PFN that points to
a struct page with a zero reference count. This goes against the normal
expectation in the rest of the mm code that assumes a page mapped by a CPU
has a non-zero reference count.
So yes, nothing "bad" happens because put_page() isn't called but the
reference count will differ from other drivers that call vmf_insert_mixed()
or vm_insert_page() where the page was allocated with alloc_pages() or
similar.


I'm even less clear about how to fix MEMORY_DEVICE_FS_DAX. File
systems have clear
allocate and free states for backing storage but there are the
complications with
the page cache references, etc. to consider. The >1 to 1 reference
count seems to
be used to tell when a page is idle (no I/O, reclaim scanners) rather
than free
(not allocated to any file) but I'm not 100% sure about that since I
don't really
understand all the issues around why a file system needs to have a DAX
mount option
besides knowing that the storage block size has to be a multiple of
the page size.

The only thing that happens in free_zone_device_page for FS_DAX pages is
wake_up_var(>_refcount). I guess, whoever is waiting for this
wake-up will need to be prepared to see a refcount 0 instead of 1 now. I
see these callers that would need to be updated:

./fs/ext4/inode.c:        error = ___wait_var_event(>_refcount,
./fs/ext4/inode.c-                atomic_read(>_refcount) == 1,
./fs/ext4/inode.c-                TASK_INTERRUPTIBLE, 0, 0,
./fs/ext4/inode.c-                ext4_wait_dax_page(ei));
--
./fs/fuse/dax.c:    return ___wait_var_event(>_refcount,
./fs/fuse/dax.c-            atomic_read(>_refcount) == 1,
TASK_INTERRUPTIBLE,
./fs/fuse/dax.c-            0, 0, fuse_wait_dax_page(inode));
--
./fs/xfs/xfs_file.c:    return ___wait_var_event(>_refcount,
./fs/xfs/xfs_file.c-            atomic_read(>_refcount) == 1,
TASK_INTERRUPTIBLE,
./fs/xfs/xfs_file.c-            0, 0, xfs_wait_dax_page(inode));

Regarding your page-cache comment, doesn't DAX mean that those file
pages are not in the page cache?


Re: [PATCH v3 2/8] mm: remove extra ZONE_DEVICE struct page refcount

2021-06-28 Thread Felix Kuehling


Am 2021-06-17 um 3:16 p.m. schrieb Ralph Campbell:
>
> On 6/17/21 8:16 AM, Alex Sierra wrote:
>> From: Ralph Campbell 
>>
>> ZONE_DEVICE struct pages have an extra reference count that
>> complicates the
>> code for put_page() and several places in the kernel that need to
>> check the
>> reference count to see that a page is not being used (gup, compaction,
>> migration, etc.). Clean up the code so the reference count doesn't
>> need to
>> be treated specially for ZONE_DEVICE.
>>
>> v2:
>> AS: merged this patch in linux 5.11 version
>>
>> Signed-off-by: Ralph Campbell 
>> Signed-off-by: Alex Sierra 
>> ---
>>   arch/powerpc/kvm/book3s_hv_uvmem.c |  2 +-
>>   drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
>>   fs/dax.c   |  4 +-
>>   include/linux/dax.h    |  2 +-
>>   include/linux/memremap.h   |  7 +--
>>   include/linux/mm.h | 44 -
>>   lib/test_hmm.c |  2 +-
>>   mm/internal.h  |  8 +++
>>   mm/memremap.c  | 68 +++---
>>   mm/migrate.c   |  5 --
>>   mm/page_alloc.c    |  3 ++
>>   mm/swap.c  | 45 ++---
>>   12 files changed, 45 insertions(+), 147 deletions(-)
>>
> I think it is great that you are picking this up and trying to revive it.
>
> However, I have a number of concerns about how it affects existing
> ZONE_DEVICE
> MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_FS_DAX users and I don't see this
> addressing them. For example, dev_dax_probe() allocates
> MEMORY_DEVICE_GENERIC
> struct pages and then:
>   dev_dax_fault()
>     dev_dax_huge_fault()
>   __dev_dax_pte_fault()
>     vmf_insert_mixed()
> which just inserts the PFN into the CPU page tables without increasing
> the page
> refcount so it is zero (whereas it was one before). But using
> get_page() will
> trigger VM_BUG_ON_PAGE() if it is enabled. There isn't any current
> notion of
> free verses allocated for these struct pages. I suppose init_page_count()
> could be called on all the struct pages in dev_dax_probe() to fix that
> though.

Hi Ralph,

For DEVICE_GENERIC pages free_zone_device_page doesn't do anything. So
I'm not sure what the reference counting is good for in this case.

Alex is going to add free_zone_device_page support for DEVICE_GENERIC
pages (patch 8 of this series). However, even then, it only does
anything if there is an actual call to put_page. Where would that call
come from in the dev_dax driver case?


>
> I'm even less clear about how to fix MEMORY_DEVICE_FS_DAX. File
> systems have clear
> allocate and free states for backing storage but there are the
> complications with
> the page cache references, etc. to consider. The >1 to 1 reference
> count seems to
> be used to tell when a page is idle (no I/O, reclaim scanners) rather
> than free
> (not allocated to any file) but I'm not 100% sure about that since I
> don't really
> understand all the issues around why a file system needs to have a DAX
> mount option
> besides knowing that the storage block size has to be a multiple of
> the page size.

The only thing that happens in free_zone_device_page for FS_DAX pages is
wake_up_var(>_refcount). I guess, whoever is waiting for this
wake-up will need to be prepared to see a refcount 0 instead of 1 now. I
see these callers that would need to be updated:

./fs/ext4/inode.c:        error = ___wait_var_event(>_refcount,
./fs/ext4/inode.c-                atomic_read(>_refcount) == 1,
./fs/ext4/inode.c-                TASK_INTERRUPTIBLE, 0, 0,
./fs/ext4/inode.c-                ext4_wait_dax_page(ei));
--
./fs/fuse/dax.c:    return ___wait_var_event(>_refcount,
./fs/fuse/dax.c-            atomic_read(>_refcount) == 1,
TASK_INTERRUPTIBLE,
./fs/fuse/dax.c-            0, 0, fuse_wait_dax_page(inode));
--
./fs/xfs/xfs_file.c:    return ___wait_var_event(>_refcount,
./fs/xfs/xfs_file.c-            atomic_read(>_refcount) == 1,
TASK_INTERRUPTIBLE,
./fs/xfs/xfs_file.c-            0, 0, xfs_wait_dax_page(inode));

Regarding your page-cache comment, doesn't DAX mean that those file
pages are not in the page cache?

Regards,
  Felix




Re: [PATCH v3 2/8] mm: remove extra ZONE_DEVICE struct page refcount

2021-06-17 Thread Ralph Campbell



On 6/17/21 8:16 AM, Alex Sierra wrote:

From: Ralph Campbell 

ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for ZONE_DEVICE.

v2:
AS: merged this patch in linux 5.11 version

Signed-off-by: Ralph Campbell 
Signed-off-by: Alex Sierra 
---
  arch/powerpc/kvm/book3s_hv_uvmem.c |  2 +-
  drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
  fs/dax.c   |  4 +-
  include/linux/dax.h|  2 +-
  include/linux/memremap.h   |  7 +--
  include/linux/mm.h | 44 -
  lib/test_hmm.c |  2 +-
  mm/internal.h  |  8 +++
  mm/memremap.c  | 68 +++---
  mm/migrate.c   |  5 --
  mm/page_alloc.c|  3 ++
  mm/swap.c  | 45 ++---
  12 files changed, 45 insertions(+), 147 deletions(-)


I think it is great that you are picking this up and trying to revive it.

However, I have a number of concerns about how it affects existing ZONE_DEVICE
MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_FS_DAX users and I don't see this
addressing them. For example, dev_dax_probe() allocates MEMORY_DEVICE_GENERIC
struct pages and then:
  dev_dax_fault()
dev_dax_huge_fault()
  __dev_dax_pte_fault()
vmf_insert_mixed()
which just inserts the PFN into the CPU page tables without increasing the page
refcount so it is zero (whereas it was one before). But using get_page() will
trigger VM_BUG_ON_PAGE() if it is enabled. There isn't any current notion of
free verses allocated for these struct pages. I suppose init_page_count()
could be called on all the struct pages in dev_dax_probe() to fix that though.

I'm even less clear about how to fix MEMORY_DEVICE_FS_DAX. File systems have 
clear
allocate and free states for backing storage but there are the complications 
with
the page cache references, etc. to consider. The >1 to 1 reference count seems 
to
be used to tell when a page is idle (no I/O, reclaim scanners) rather than free
(not allocated to any file) but I'm not 100% sure about that since I don't 
really
understand all the issues around why a file system needs to have a DAX mount 
option
besides knowing that the storage block size has to be a multiple of the page 
size.



[PATCH v3 2/8] mm: remove extra ZONE_DEVICE struct page refcount

2021-06-17 Thread Alex Sierra
From: Ralph Campbell 

ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for ZONE_DEVICE.

v2:
AS: merged this patch in linux 5.11 version

Signed-off-by: Ralph Campbell 
Signed-off-by: Alex Sierra 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 fs/dax.c   |  4 +-
 include/linux/dax.h|  2 +-
 include/linux/memremap.h   |  7 +--
 include/linux/mm.h | 44 -
 lib/test_hmm.c |  2 +-
 mm/internal.h  |  8 +++
 mm/memremap.c  | 68 +++---
 mm/migrate.c   |  5 --
 mm/page_alloc.c|  3 ++
 mm/swap.c  | 45 ++---
 12 files changed, 45 insertions(+), 147 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 84e5a2dc8be5..acee67710620 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -711,7 +711,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
 
dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
-   get_page(dpage);
+   init_page_count(dpage);
lock_page(dpage);
return dpage;
 out_clear:
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 92987daa5e17..8bc7120e1216 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -324,7 +324,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}
 
-   get_page(page);
+   init_page_count(page);
lock_page(page);
return page;
 }
diff --git a/fs/dax.c b/fs/dax.c
index 321f4ddc6643..7b4c6b35b098 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -560,14 +560,14 @@ static void *grab_mapping_entry(struct xa_state *xas,
 
 /**
  * dax_layout_busy_page_range - find first pinned page in @mapping
- * @mapping: address space to scan for a page with ref count > 1
+ * @mapping: address space to scan for a page with ref count > 0
  * @start: Starting offset. Page containing 'start' is included.
  * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
  *   pages from 'start' till the end of file are included.
  *
  * DAX requires ZONE_DEVICE mapped pages. These pages are never
  * 'onlined' to the page allocator so they are considered idle when
- * page->count == 1. A filesystem uses this interface to determine if
+ * page->count == 0. A filesystem uses this interface to determine if
  * any page in the mapping is busy, i.e. for DMA, or other
  * get_user_pages() usages.
  *
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 8b5da1d60dbc..05fc982ce153 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -245,7 +245,7 @@ static inline bool dax_mapping(struct address_space 
*mapping)
 
 static inline bool dax_page_unused(struct page *page)
 {
-   return page_ref_count(page) == 1;
+   return page_ref_count(page) == 0;
 }
 
 #define dax_wait_page(_inode, _page, _wait_cb) \
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 79c49e7f5c30..327f32427d21 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -66,9 +66,10 @@ enum memory_type {
 
 struct dev_pagemap_ops {
/*
-* Called once the page refcount reaches 1.  (ZONE_DEVICE pages never
-* reach 0 refcount unless there is a refcount bug. This allows the
-* device driver to implement its own memory management.)
+* Called once the page refcount reaches 0. The reference count
+* should be reset to one with init_page_count(page) before reusing
+* the page. This allows the device driver to implement its own
+* memory management.
 */
void (*page_free)(struct page *page);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c9900aedc195..d8d79bb94be8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1117,39 +1117,6 @@ static inline bool is_zone_device_page(const struct page 
*page)
 }
 #endif
 
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page);
-DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-
-static inline bool page_is_devmap_managed(struct page *page)
-{
-   if (!static_branch_unlikely(_managed_key))
-   return false;
-   if (!is_zone_device_page(page))
-   return false;
-   switch (page->pgmap->type) {
-   case MEMORY_DEVICE_PRIVATE:
-   case