Re: [PATCH v13 2/8] mm/gup: Introduce check_and_migrate_movable_folios()
On 04.04.24 09:26, Vivek Kasireddy wrote: This helper is the folio equivalent of check_and_migrate_movable_pages(). Therefore, all the rules that apply to check_and_migrate_movable_pages() also apply to this one as well. Currently, this helper is only used by memfd_pin_folios(). This patch also includes changes to rename and convert the internal functions collect_longterm_unpinnable_pages() and migrate_longterm_unpinnable_pages() to work on folios. As a result, check_and_migrate_movable_pages() is now a wrapper around check_and_migrate_movable_folios(). Cc: David Hildenbrand Cc: Matthew Wilcox Cc: Christoph Hellwig Cc: Jason Gunthorpe Cc: Peter Xu Suggested-by: David Hildenbrand Signed-off-by: Vivek Kasireddy --- [...] +/* + * Check whether all folios are *allowed* to be pinned indefinitely (longterm). + * Rather confusingly, all folios in the range are required to be pinned via + * FOLL_PIN, before calling this routine. + * + * If any folios in the range are not allowed to be pinned, then this routine + * will migrate those folios away, unpin all the folios in the range and return + * -EAGAIN. The caller should re-pin the entire range with FOLL_PIN and then + * call this routine again. + * + * If an error other than -EAGAIN occurs, this indicates a migration failure. + * The caller should give up, and propagate the error back up the call stack. + * + * If everything is OK and all folios in the range are allowed to be pinned, + * then this routine leaves all folios pinned and returns zero for success. + */ +static long check_and_migrate_movable_folios(unsigned long nr_folios, +					     struct folio **folios) +{ + unsigned long collected; + LIST_HEAD(movable_folio_list); + + collected = collect_longterm_unpinnable_folios(&movable_folio_list, + nr_folios, folios); + if (!collected) + return 0; + + return migrate_longterm_unpinnable_folios(&movable_folio_list, + nr_folios, folios); +} + /* * Check whether all pages are *allowed* to be pinned. 
Rather confusingly, all * pages in the range are required to be pinned via FOLL_PIN, before calling [...] Likely we should just drop that comment and refer to check_and_migrate_movable_folios() instead. No need to duplicate all that. @@ -2555,16 +2585,20 @@ static int migrate_longterm_unpinnable_pages( static long check_and_migrate_movable_pages(unsigned long nr_pages, struct page **pages) { - unsigned long collected; - LIST_HEAD(movable_page_list); + struct folio **folios; + long i, ret; - collected = collect_longterm_unpinnable_pages(&movable_page_list, - nr_pages, pages); - if (!collected) - return 0; + folios = kmalloc_array(nr_pages, sizeof(*folios), GFP_KERNEL); + if (!folios) + return -ENOMEM; - return migrate_longterm_unpinnable_pages(&movable_page_list, nr_pages, - pages); + for (i = 0; i < nr_pages; i++) + folios[i] = page_folio(pages[i]); I wonder if we have to handle pages[i] being NULL. Hopefully not :) Looks straight forward now: Acked-by: David Hildenbrand -- Cheers, David / dhildenb
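The wrapper shape acked above — build a temporary folio array once in the pages-based entry point, then share the folio-only helpers — can be sketched in userspace C. Everything here is a stand-in: the struct layouts, the stubbed migration check, and plain malloc() in place of kmalloc_array() are illustrative assumptions, not the kernel implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in types: a page records which folio it belongs to. */
struct folio { int id; };
struct page { struct folio *folio; };

static struct folio *page_folio(struct page *page)
{
	return page->folio;
}

/* Stub: pretend every folio is allowed to stay longterm-pinned. */
static long check_and_migrate_movable_folios(unsigned long nr_folios,
					     struct folio **folios)
{
	(void)nr_folios;
	(void)folios;
	return 0;
}

/*
 * Pages-based wrapper: convert pages to folios once, then forward to
 * the folio-only helper, mirroring the v13 hunk quoted above.
 */
static long check_and_migrate_movable_pages(unsigned long nr_pages,
					    struct page **pages)
{
	struct folio **folios;
	unsigned long i;
	long ret;

	folios = malloc(nr_pages * sizeof(*folios));
	if (!folios)
		return -1;	/* the kernel code returns -ENOMEM here */

	for (i = 0; i < nr_pages; i++)
		folios[i] = page_folio(pages[i]);

	ret = check_and_migrate_movable_folios(nr_pages, folios);
	free(folios);
	return ret;
}
```

Because the temporary array is only a conversion buffer, freeing it immediately after the folio helper returns is safe regardless of the result.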
Re: [PATCH v13 1/8] mm/gup: Introduce unpin_folio/unpin_folios helpers
On 04.04.24 09:26, Vivek Kasireddy wrote: These helpers are the folio versions of unpin_user_page/unpin_user_pages. They are currently only useful for unpinning folios pinned by memfd_pin_folios() or other associated routines. However, they could find new uses in the future, when more and more folio-only helpers are added to GUP. We should probably sanity check the folio as part of unpin similar to how it is done in unpin_user_page/unpin_user_pages but we cannot cleanly do that at the moment without also checking the subpage. Therefore, sanity checking needs to be added to these routines once we have a way to determine if any given folio is anon-exclusive (via a per folio AnonExclusive flag). Cc: David Hildenbrand Cc: Matthew Wilcox Cc: Christoph Hellwig Cc: Jason Gunthorpe Cc: Peter Xu Suggested-by: David Hildenbrand Signed-off-by: Vivek Kasireddy --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v12 2/8] mm/gup: Introduce check_and_migrate_movable_folios()
On 25.02.24 08:56, Vivek Kasireddy wrote: This helper is the folio equivalent of check_and_migrate_movable_pages(). Therefore, all the rules that apply to check_and_migrate_movable_pages() also apply to this one as well. Currently, this helper is only used by memfd_pin_folios(). This patch also includes changes to rename and convert the internal functions collect_longterm_unpinnable_pages() and migrate_longterm_unpinnable_pages() to work on folios. Since they are also used by check_and_migrate_movable_pages(), a temporary array is used to collect and share the folios with these functions. Cc: David Hildenbrand Cc: Matthew Wilcox Cc: Christoph Hellwig Cc: Jason Gunthorpe Cc: Peter Xu Suggested-by: David Hildenbrand Signed-off-by: Vivek Kasireddy --- mm/gup.c | 129 +++ 1 file changed, 92 insertions(+), 37 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 0a45eda6aaeb..1410af954a4e 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2099,20 +2099,24 @@ struct page *get_dump_page(unsigned long addr) #ifdef CONFIG_MIGRATION /* - * Returns the number of collected pages. Return value is always >= 0. + * Returns the number of collected folios. Return value is always >= 0. */ -static unsigned long collect_longterm_unpinnable_pages( - struct list_head *movable_page_list, - unsigned long nr_pages, +static unsigned long collect_longterm_unpinnable_folios( + struct list_head *movable_folio_list, + unsigned long nr_folios, + struct folio **folios, struct page **pages) This function really shouldn't consume both folios and pages. Either use "folios" and handle the conversion from pages->folios in the caller, or handle it similar to release_pages() where we can pass either and simply always do page_folio() on the given pointer, using essentially an abstracted pointer type and always calling page_folio() on that thing. The easiest is likely to just do the page->folio conversion in the caller by looping over the arrays once more. See below. 
Temporary memory allocation can be avoided by using an abstracted pointer type. [...] + folio = folios[i]; if (folio == prev_folio) continue; prev_folio = folio; @@ -2126,7 +2130,7 @@ static unsigned long collect_longterm_unpinnable_pages( continue; if (folio_test_hugetlb(folio)) { - isolate_hugetlb(folio, movable_page_list); + isolate_hugetlb(folio, movable_folio_list); continue; } @@ -2138,7 +2142,7 @@ static unsigned long collect_longterm_unpinnable_pages( if (!folio_isolate_lru(folio)) continue; - list_add_tail(&folio->lru, movable_page_list); + list_add_tail(&folio->lru, movable_folio_list); node_stat_mod_folio(folio, NR_ISOLATED_ANON + folio_is_file_lru(folio), folio_nr_pages(folio)); @@ -2148,27 +2152,28 @@ static unsigned long collect_longterm_unpinnable_pages( } /* - * Unpins all pages and migrates device coherent pages and movable_page_list. - * Returns -EAGAIN if all pages were successfully migrated or -errno for failure - * (or partial success). + * Unpins all folios and migrates device coherent folios and movable_folio_list. + * Returns -EAGAIN if all folios were successfully migrated or -errno for + * failure (or partial success). */ -static int migrate_longterm_unpinnable_pages( - struct list_head *movable_page_list, - unsigned long nr_pages, - struct page **pages) +static int migrate_longterm_unpinnable_folios( + struct list_head *movable_folio_list, + unsigned long nr_folios, + struct folio **folios) { int ret; unsigned long i; - for (i = 0; i < nr_pages; i++) { - struct folio *folio = page_folio(pages[i]); + for (i = 0; i < nr_folios; i++) { + struct folio *folio = folios[i]; if (folio_is_device_coherent(folio)) { /* -* Migration will fail if the page is pinned, so convert -* the pin on the source page to a normal reference. +* Migration will fail if the folio is pinned, so +* convert the pin on the source folio to a normal +* reference. */ -
Re: [PATCH v12 1/8] mm/gup: Introduce unpin_folio/unpin_folios helpers
On 02.04.24 15:52, David Hildenbrand wrote: On 25.02.24 08:56, Vivek Kasireddy wrote: These helpers are the folio versions of unpin_user_page/unpin_user_pages. They are currently only useful for unpinning folios pinned by memfd_pin_folios() or other associated routines. However, they could find new uses in the future, when more and more folio-only helpers are added to GUP. Cc: David Hildenbrand Cc: Matthew Wilcox Cc: Christoph Hellwig Cc: Jason Gunthorpe Cc: Peter Xu Suggested-by: David Hildenbrand Signed-off-by: Vivek Kasireddy --- include/linux/mm.h | 2 ++ mm/gup.c | 81 -- 2 files changed, 74 insertions(+), 9 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 6f4825d82965..36e4c2b22600 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1601,11 +1601,13 @@ static inline void put_page(struct page *page) #define GUP_PIN_COUNTING_BIAS (1U << 10) void unpin_user_page(struct page *page); +void unpin_folio(struct folio *folio); void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, bool make_dirty); void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages, bool make_dirty); void unpin_user_pages(struct page **pages, unsigned long npages); +void unpin_folios(struct folio **folios, unsigned long nfolios); static inline bool is_cow_mapping(vm_flags_t flags) { diff --git a/mm/gup.c b/mm/gup.c index df83182ec72d..0a45eda6aaeb 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -30,6 +30,23 @@ struct follow_page_context { unsigned int page_mask; }; +static inline void sanity_check_pinned_folios(struct folio **folios, + unsigned long nfolios) +{ + if (!IS_ENABLED(CONFIG_DEBUG_VM)) + return; + + for (; nfolios; nfolios--, folios++) { + struct folio *folio = *folios; + + if (is_zero_folio(folio) || + !folio_test_anon(folio)) + continue; + + VM_BUG_ON_FOLIO(!PageAnonExclusive(&folio->page), folio); That change is wrong (and the split makes the check confusing). 
It could be that the first subpage is no longer exclusive, but the given subpage (the one sanity_check_pinned_pages() actually checks) is exclusive for large folios. I suggest dropping that change and instead, in unpin_folio()/unpin_folios(), rejecting any anon folios for now. So, replace the sanity_check_pinned_folios() in unpin_folio()/unpin_folios() with a VM_WARN_ON(folio_test_anon(folio)); After reading patch #2: drop both the sanity check and the VM_WARN_ON() from unpin_folio()/unpin_folios(), and add a comment to the patch description that we cannot do the sanity checking without the subpage, and that we can reintroduce it once we have a single per-folio AnonExclusive bit. -- Cheers, David / dhildenb
Re: [PATCH v12 1/8] mm/gup: Introduce unpin_folio/unpin_folios helpers
On 25.02.24 08:56, Vivek Kasireddy wrote: These helpers are the folio versions of unpin_user_page/unpin_user_pages. They are currently only useful for unpinning folios pinned by memfd_pin_folios() or other associated routines. However, they could find new uses in the future, when more and more folio-only helpers are added to GUP. Cc: David Hildenbrand Cc: Matthew Wilcox Cc: Christoph Hellwig Cc: Jason Gunthorpe Cc: Peter Xu Suggested-by: David Hildenbrand Signed-off-by: Vivek Kasireddy --- include/linux/mm.h | 2 ++ mm/gup.c | 81 -- 2 files changed, 74 insertions(+), 9 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 6f4825d82965..36e4c2b22600 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1601,11 +1601,13 @@ static inline void put_page(struct page *page) #define GUP_PIN_COUNTING_BIAS (1U << 10) void unpin_user_page(struct page *page); +void unpin_folio(struct folio *folio); void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, bool make_dirty); void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages, bool make_dirty); void unpin_user_pages(struct page **pages, unsigned long npages); +void unpin_folios(struct folio **folios, unsigned long nfolios); static inline bool is_cow_mapping(vm_flags_t flags) { diff --git a/mm/gup.c b/mm/gup.c index df83182ec72d..0a45eda6aaeb 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -30,6 +30,23 @@ struct follow_page_context { unsigned int page_mask; }; +static inline void sanity_check_pinned_folios(struct folio **folios, + unsigned long nfolios) +{ + if (!IS_ENABLED(CONFIG_DEBUG_VM)) + return; + + for (; nfolios; nfolios--, folios++) { + struct folio *folio = *folios; + + if (is_zero_folio(folio) || + !folio_test_anon(folio)) + continue; + + VM_BUG_ON_FOLIO(!PageAnonExclusive(&folio->page), folio); That change is wrong (and the split makes the check confusing). 
It could be that the first subpage is no longer exclusive, but the given subpage (the one sanity_check_pinned_pages() actually checks) is exclusive for large folios. I suggest dropping that change and instead, in unpin_folio()/unpin_folios(), rejecting any anon folios for now. So, replace the sanity_check_pinned_folios() in unpin_folio()/unpin_folios() with a VM_WARN_ON(folio_test_anon(folio)); It will all be better once we have a single anon-exclusive flag per folio (which I am working on), but in the meantime, we really don't expect code that called pin_user_pages() to call unpin_folios(). [...] +/** + * unpin_folio() - release a dma-pinned folio + * @folio: pointer to folio to be released + * + * Folios that were pinned via memfd_pin_folios() or other similar routines + * must be released either using unpin_folio() or unpin_folios(). This is so + * that such folios can be separately tracked and uniquely handled. I'd drop the last sentence; no need for apologies/explanations, this is simply how pinning works :) + */ +void unpin_folio(struct folio *folio) +{ + sanity_check_pinned_folios(&folio, 1); + gup_put_folio(folio, 1, FOLL_PIN); +} +EXPORT_SYMBOL(unpin_folio); Can we restrict that to EXPORT_SYMBOL_GPL for now? memfd_pin_folios() uses EXPORT_SYMBOL_GPL... + /** * folio_add_pin - Try to get an additional pin on a pinned folio * @folio: The folio to be pinned @@ -488,6 +516,41 @@ void unpin_user_pages(struct page **pages, unsigned long npages) } EXPORT_SYMBOL(unpin_user_pages); +/** + * unpin_folios() - release an array of gup-pinned folios. + * @folios: array of folios to be marked dirty and released. + * @nfolios: number of folios in the @folios array. + * + * For each folio in the @folios array, release the folio using unpin_folio(). + * + * Please see the unpin_folio() documentation for details. 
+ */ +void unpin_folios(struct folio **folios, unsigned long nfolios) +{ + unsigned long i = 0, j; + + /* +* If this WARN_ON() fires, then the system *might* be leaking folios +* (by leaving them pinned), but probably not. More likely, gup/pup +* returned a hard -ERRNO error to the caller, who erroneously passed +* it here. +*/ + if (WARN_ON(IS_ERR_VALUE(nfolios))) + return; + + sanity_check_pinned_folios(folios, nfolios); + while (i < nfolios) { + for (j = i + 1; j < nfolios; j++) + if (folios[i] != folios[j]) + break; + + if (folios[i]) + gup_put_folio(folios[i], j - i, FOLL_PIN); + i = j; + } +} +EXPORT_SYMBOL(unpin_folios); Same thought here. -- Cheers, David / dhildenb
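The duplicate-batching loop in the unpin_folios() hunk quoted above is easy to check in isolation: consecutive repeats of the same folio pointer (which is how a large folio pinned once per subpage shows up in such an array) are released with a single call carrying the repeat count. This userspace model replaces gup_put_folio() with a hypothetical per-folio pin counter; only the control flow matches the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in folio: pincount models the outstanding FOLL_PIN references. */
struct folio { int pincount; };

/* Models gup_put_folio(folio, refs, FOLL_PIN): drop 'refs' pins at once. */
static void gup_put_folio(struct folio *folio, unsigned long refs)
{
	folio->pincount -= (int)refs;
}

/*
 * Same loop structure as the quoted unpin_folios(): scan forward for a
 * run of identical pointers, then release that whole run in one call.
 * NULL entries are skipped, as in the patch.
 */
static void unpin_folios(struct folio **folios, unsigned long nfolios)
{
	unsigned long i = 0, j;

	while (i < nfolios) {
		for (j = i + 1; j < nfolios; j++)
			if (folios[i] != folios[j])
				break;

		if (folios[i])
			gup_put_folio(folios[i], j - i);
		i = j;
	}
}
```

Note the batching only merges *adjacent* duplicates, which is exactly the layout produced when a pinned large folio occupies a contiguous range of array slots.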
Re: [PATCH v12 0/8] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios
On 29.03.24 06:38, Kasireddy, Vivek wrote: Hi David, On 25.02.24 08:56, Vivek Kasireddy wrote: Currently, some drivers (e.g, Udmabuf) that want to longterm-pin the pages/folios associated with a memfd, do so by simply taking a reference on them. This is not desirable because the pages/folios may reside in Movable zone or CMA block. Therefore, having drivers use memfd_pin_folios() API ensures that the folios are appropriately pinned via FOLL_PIN for longterm DMA. This patchset also introduces a few helpers and converts the Udmabuf driver to use folios and memfd_pin_folios() API to longterm-pin the folios for DMA. Two new Udmabuf selftests are also included to test the driver and the new API. --- Sorry Vivek, I got distracted. What's the state of this? I assume it's not in an mm tree yet. No problem. Right, they are not in any tree yet. The first two mm patches that add the unpin_folios() and check_and_migrate_movable_folios() helpers still need to be reviewed. I try to get this reviewed this week. If I fail to do that, please ping me. Ok, sounds good! .. as it's already Friday (and even a public Holiday today+Monday here), let me prioritize this for next week! -- Cheers, David / dhildenb
Re: [PATCH v12 0/8] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios
On 25.02.24 08:56, Vivek Kasireddy wrote: Currently, some drivers (e.g, Udmabuf) that want to longterm-pin the pages/folios associated with a memfd, do so by simply taking a reference on them. This is not desirable because the pages/folios may reside in Movable zone or CMA block. Therefore, having drivers use memfd_pin_folios() API ensures that the folios are appropriately pinned via FOLL_PIN for longterm DMA. This patchset also introduces a few helpers and converts the Udmabuf driver to use folios and memfd_pin_folios() API to longterm-pin the folios for DMA. Two new Udmabuf selftests are also included to test the driver and the new API. --- Sorry Vivek, I got distracted. What's the state of this? I assume it's not in an mm tree yet. I try to get this reviewed this week. If I fail to do that, please ping me. -- Cheers, David / dhildenb
Re: [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)
Hi, Sorry, I'm still not happy about the current state, because (1) the folio vs. pages handling is still mixed (2) we're returning+pinning a large folio multiple times. I can address (1) in a follow-up series and as far as (2) is concerned, my understanding is that we need to increase the folio's refcount as and when the folio's tail pages are used. Is this not the case? It appears this is what unpin_user_pages() expects as well. Do you see any concern with this? If you'd just pin the folio once, you'd also only have to unpin it once. But that raises a question: Is it a requirement for the user of this interface, being able to unmap+unpin each individual page? If you really want to handle each subpage possibly individually, then subpage or folio+offset makes more sense, agreed. See below if there is an easy way to clean this up. @@ -5,6 +5,7 @@ #include #include +#include #include #include #include @@ -17,6 +18,7 @@ #include #include #include +#include #include #include @@ -3410,3 +3412,156 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, &locked, gup_flags); } EXPORT_SYMBOL(pin_user_pages_unlocked); + +/** + * memfd_pin_folios() - pin folios associated with a memfd + * @memfd: the memfd whose folios are to be pinned + * @start: starting memfd offset + * @nr_pages: number of pages from start to pin We're not pinning pages. An inclusive range [start, end] would be clearer. Ok, I'll make this change in the next version. + * @folios: array that receives pointers to the folios pinned. + * Should be at-least nr_pages long. + * @offsets: array that receives offsets of pages in their folios. + * Should be at-least nr_pages long. See below, I'm wondering if this is really required once we return each folio only once. The offsets can be calculated by the caller (udmabuf) as well but doing so in this interface would prevent special handling in the caller for the hugetlb case. 
Please look at patch 5 in this series (udmabuf: Pin the pages using memfd_pin_folios() API (v5)) for more details as to what I mean. I'll have a look later to be reminded about the target use case :) + * + * It must be noted that the folios may be pinned for an indefinite amount + * of time. And, in most cases, the duration of time they may stay pinned + * would be controlled by the userspace. This behavior is effectively the + * same as using FOLL_LONGTERM with other GUP APIs. + * + * Returns number of folios pinned. This would be equal to the number of + * pages requested. If no folios were pinned, it returns -errno. + */ +long memfd_pin_folios(struct file *memfd, unsigned long start, + unsigned long nr_pages, struct folio **folios, + pgoff_t *offsets) +{ + unsigned long end = start + (nr_pages << PAGE_SHIFT) - 1; + unsigned int max_pgs, pgoff, pgshift = PAGE_SHIFT; + pgoff_t start_idx, end_idx, next_idx; + unsigned int flags, nr_folios, i, j; + struct folio *folio = NULL; + struct folio_batch fbatch; + struct page **pages; + struct hstate *h; + long ret; + + if (!nr_pages) + return -EINVAL; + + if (!memfd) + return -EINVAL; + + if (!shmem_file(memfd) && !is_file_hugepages(memfd)) + return -EINVAL; + + pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL); + if (!pages) + return -ENOMEM; + + if (is_file_hugepages(memfd)) { + h = hstate_file(memfd); + pgshift = huge_page_shift(h); + } + + flags = memalloc_pin_save(); + do { + i = 0; + start_idx = start >> pgshift; + end_idx = end >> pgshift; + if (is_file_hugepages(memfd)) { + start_idx <<= huge_page_order(h); + end_idx <<= huge_page_order(h); + } + + folio_batch_init(&fbatch); + while (start_idx <= end_idx) { + /* +* In most cases, we should be able to find the folios +* in the page cache. If we cannot find them for some +* reason, we try to allocate them and add them to the +* page cache. 
+*/ + nr_folios = filemap_get_folios_contig(memfd->f_mapping, + &start_idx, + end_idx, + &fbatch); + if (folio) { + folio_put(folio); + folio = NULL; + } + + next_idx = 0; + for
Re: [PATCH v7 0/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)
On 12.12.23 08:37, Vivek Kasireddy wrote: The first two patches were previously reviewed but not yet merged. These ones need to be merged first as the fourth patch depends on the changes introduced in them and they also fix bugs seen in very specific scenarios (running Qemu with hugetlb=on, blob=true and rebooting guest VM). The third patch introduces memfd_pin_folios() API and the fourth patch converts udmabuf driver to use folios. The fifth patch shows how the udmabuf driver can make use of the new API to longterm-pin the folios. The last patch adds two new udmabuf selftests to verify data coherency after potential page migration. We should probably get started with the first two patches, they are independent of the remaining discussion regarding memfd_pin_folios(). -- Cheers, David / dhildenb
Re: [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)
On 12.12.23 08:38, Vivek Kasireddy wrote: For drivers that would like to longterm-pin the folios associated with a memfd, the memfd_pin_folios() API provides an option to not only pin the folios via FOLL_PIN but also to check and migrate them if they reside in movable zone or CMA block. This API currently works with memfds but it should work with any files that belong to either shmemfs or hugetlbfs. Files belonging to other filesystems are rejected for now. The folios need to be located first before pinning them via FOLL_PIN. If they are found in the page cache, they can be immediately pinned. Otherwise, they need to be allocated using the filesystem specific APIs and then pinned. v2: - Drop gup_flags and improve comments and commit message (David) - Allocate a page if we cannot find in page cache for the hugetlbfs case as well (David) - Don't unpin pages if there is a migration related failure (David) - Drop the unnecessary nr_pages <= 0 check (Jason) - Have the caller of the API pass in file * instead of fd (Jason) v3: (David) - Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE (Build error reported by kernel test robot ) - Don't forget memalloc_pin_restore() on non-migration related errors - Improve the readability of the cleanup code associated with non-migration related errors - Augment the comments by describing FOLL_LONGTERM like behavior - Include the R-b tag from Jason v4: - Remove the local variable "page" and instead use 3 return statements in alloc_file_page() (David) - Add the R-b tag from David v5: (David) - For hugetlb case, ensure that we only obtain head pages from the mapping by using __filemap_get_folio() instead of find_get_page_flags() - Handle -EEXIST when two or more potential users try to simultaneously add a huge page to the mapping by forcing them to retry on failure v6: (Christoph) - Rename this API to memfd_pin_user_pages() to make it clear that it is intended for memfds - Move the memfd page allocation helper from 
gup.c to memfd.c - Fix indentation errors in memfd_pin_user_pages() - For contiguous ranges of folios, use a helper such as filemap_get_folios_contig() to lookup the page cache in batches v7: - Rename this API to memfd_pin_folios() and make it return folios and offsets instead of pages (David) - Don't continue processing the folios in the batch returned by filemap_get_folios_contig() if they do not have correct next_idx - Add the R-b tag from Christoph Sorry, I'm still not happy about the current state, because (1) the folio vs. pages handling is still mixed (2) we're returning+pinning a large folio multiple times. See below if there is an easy way to clean this up. @@ -5,6 +5,7 @@ #include #include +#include #include #include #include @@ -17,6 +18,7 @@ #include #include #include +#include #include #include @@ -3410,3 +3412,156 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, , gup_flags); } EXPORT_SYMBOL(pin_user_pages_unlocked); + +/** + * memfd_pin_folios() - pin folios associated with a memfd + * @memfd: the memfd whose folios are to be pinned + * @start: starting memfd offset + * @nr_pages: number of pages from start to pin We're not pinning pages. An inclusive range [start, end] would be clearer. + * @folios: array that receives pointers to the folios pinned. + * Should be at-least nr_pages long. + * @offsets:array that receives offsets of pages in their folios. + * Should be at-least nr_pages long. See below, I'm wondering if this is really required once we return each folio only once. + * + * Attempt to pin folios associated with a memfd; given that a memfd is + * either backed by shmem or hugetlb, the folios can either be found in + * the page cache or need to be allocated if necessary. Once the folios + * are located, they are all pinned via FOLL_PIN and the @offsets array + * is populated with offsets of the pages in their respective folios. 
+ * Therefore, for each page the caller requested, there will be a + * corresponding entry in both @folios and @offsets. And, eventually, + * these pinned folios need to be released either using unpin_user_pages() + * or unpin_user_page(). Oh, we don't have a folio function yet? Should be easy to add, and we'd really add them. + * + * It must be noted that the folios may be pinned for an indefinite amount + * of time. And, in most cases, the duration of time they may stay pinned + * would be controlled by the userspace. This behavior is effectively the + * same as using FOLL_LONGTERM with other GUP APIs. + * + * Returns number of folios pinned. This would be equal to the number of + * pages requested. If no folios were pinned, it returns -errno. + */ +long memfd_pin_folios(struct file *memfd, unsigned long start, + unsigned long nr_pages, struct folio **folios, +
Re: [PATCH v6 3/5] mm/gup: Introduce memfd_pin_user_pages() for pinning memfd pages (v6)
On 08.12.23 08:57, Kasireddy, Vivek wrote: Hi David, On 05.12.23 06:35, Vivek Kasireddy wrote: For drivers that would like to longterm-pin the pages associated with a memfd, the pin_user_pages_fd() API provides an option to not only pin the pages via FOLL_PIN but also to check and migrate them if they reside in movable zone or CMA block. This API currently works with memfds but it should work with any files that belong to either shmemfs or hugetlbfs. Files belonging to other filesystems are rejected for now. The pages need to be located first before pinning them via FOLL_PIN. If they are found in the page cache, they can be immediately pinned. Otherwise, they need to be allocated using the filesystem specific APIs and then pinned. v2: - Drop gup_flags and improve comments and commit message (David) - Allocate a page if we cannot find in page cache for the hugetlbfs case as well (David) - Don't unpin pages if there is a migration related failure (David) - Drop the unnecessary nr_pages <= 0 check (Jason) - Have the caller of the API pass in file * instead of fd (Jason) v3: (David) - Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE (Build error reported by kernel test robot ) - Don't forget memalloc_pin_restore() on non-migration related errors - Improve the readability of the cleanup code associated with non-migration related errors - Augment the comments by describing FOLL_LONGTERM like behavior - Include the R-b tag from Jason v4: - Remove the local variable "page" and instead use 3 return statements in alloc_file_page() (David) - Add the R-b tag from David v5: (David) - For hugetlb case, ensure that we only obtain head pages from the mapping by using __filemap_get_folio() instead of find_get_page_flags() - Handle -EEXIST when two or more potential users try to simultaneously add a huge page to the mapping by forcing them to retry on failure v6: (Christoph) - Rename this API to memfd_pin_user_pages() to make it clear that it is intended for 
memfds - Move the memfd page allocation helper from gup.c to memfd.c - Fix indentation errors in memfd_pin_user_pages() - For contiguous ranges of folios, use a helper such as filemap_get_folios_contig() to lookup the page cache in batches Cc: David Hildenbrand Cc: Christoph Hellwig Cc: Daniel Vetter Cc: Mike Kravetz Cc: Hugh Dickins Cc: Peter Xu Cc: Gerd Hoffmann Cc: Dongwon Kim Cc: Junxiao Chang Suggested-by: Jason Gunthorpe Reviewed-by: Jason Gunthorpe (v2) Reviewed-by: David Hildenbrand (v3) Signed-off-by: Vivek Kasireddy --- include/linux/memfd.h | 5 +++ include/linux/mm.h| 2 + mm/gup.c | 102 ++ mm/memfd.c| 34 ++ 4 files changed, 143 insertions(+) diff --git a/include/linux/memfd.h b/include/linux/memfd.h index e7abf6fa4c52..6fc0d1282151 100644 --- a/include/linux/memfd.h +++ b/include/linux/memfd.h @@ -6,11 +6,16 @@ #ifdef CONFIG_MEMFD_CREATE extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg); +extern struct page *memfd_alloc_page(struct file *memfd, pgoff_t idx); #else static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned int a) { return -EINVAL; } +static inline struct page *memfd_alloc_page(struct file *memfd, pgoff_t idx) +{ + return ERR_PTR(-EINVAL); +} #endif #endif /* __LINUX_MEMFD_H */ diff --git a/include/linux/mm.h b/include/linux/mm.h index 418d26608ece..ac69db45509f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2472,6 +2472,8 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); +long memfd_pin_user_pages(struct file *file, pgoff_t start, + unsigned long nr_pages, struct page **pages); int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages); diff --git a/mm/gup.c b/mm/gup.c index 231711efa390..eb93d1ec9dc6 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -5,6 +5,7 
@@ #include #include +#include #include #include #include @@ -17,6 +18,7 @@ #include #include #include +#include #include #include @@ -3410,3 +3412,103 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, , gup_flags); } EXPORT_SYMBOL(pin_user_pages_unlocked); + +/** + * memfd_pin_user_pages() - pin user pages associated with a memfd + * @memfd: the memfd whose pages are to be pinned + * @start: starting memfd offset + * @nr_pages: number of pages from start to pin + * @pages: array that receives pointers
Re: [PATCH v6 3/5] mm/gup: Introduce memfd_pin_user_pages() for pinning memfd pages (v6)
On 07.12.23 14:05, Jason Gunthorpe wrote: On Thu, Dec 07, 2023 at 10:44:14AM +0100, David Hildenbrand wrote: If you always want to return folios, then better name it "memfd_pin_user_folios" (or just "memfd_pin_folios") and pass in a range (instead of a nr_pages parameter), and somehow indicate to the caller how many folio were in that range, and if that range was fully covered. I think it makes sense to return folios from this interface; and considering my use-case, I'd like have this API return an error if it cannot pin (or allocate) the exact number of folios the caller requested. Okay, then better use folios. Assuming a caller puts in "start = X" and gets some large folio back. How is the caller supposed to know at which offset to look into that folio (IOW< which subpage)? For "pages" it was obvious (you get the actual subpages), but as soon as we return a large folio, some information is missing for the caller. How can the caller figure that out? This can only work if the memfd is required to only have full folios at aligned locations. Under that restriction computing the first folio offset is easy enough: folio offset = (start % folio size) But is that true for the memfds here? I assume folios are always naturally aligned, like: [ 2m ][ 2m ][1m][1m][ 2m ] ^f0 ^f1 ^f2 ^f3 ^f4 If you query the range "3m -> 7m", you get back f1,f2,f3,f4 and have to start in the middle of the first folio with offset 1m. From there, it is indeed simply continuing with the full folio size -- until the last folio, where you want to only process 1m. folio offset = (1m % 2m) would be correct in that case. -- Cheers, David / dhildenb
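The offset arithmetic in the example above is worth pinning down: with naturally aligned folios, the caller's byte offset into the first folio returned for a range is simply start modulo that folio's size. A small check against the [2m][2m][1m][1m][2m] layout sketched in the mail (folio names f0..f4 are from that sketch):

```c
#include <assert.h>

#define MiB (1UL << 20)

/*
 * With naturally aligned folios, the offset into the folio covering
 * 'start' is start % folio_size, as derived in the mail above.
 */
static unsigned long folio_offset(unsigned long start,
				  unsigned long folio_size)
{
	return start % folio_size;
}
```

For the "3m -> 7m" query: the first folio returned is the 2m folio f1 at 2m, so the starting offset is 1m; processing then continues at offset 0 in each subsequent folio.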
Re: [PATCH v6 3/5] mm/gup: Introduce memfd_pin_user_pages() for pinning memfd pages (v6)
On 07.12.23 06:09, Kasireddy, Vivek wrote: Hi David, On 05.12.23 06:35, Vivek Kasireddy wrote: For drivers that would like to longterm-pin the pages associated with a memfd, the pin_user_pages_fd() API provides an option to not only pin the pages via FOLL_PIN but also to check and migrate them if they reside in movable zone or CMA block. This API currently works with memfds but it should work with any files that belong to either shmemfs or hugetlbfs. Files belonging to other filesystems are rejected for now. The pages need to be located first before pinning them via FOLL_PIN. If they are found in the page cache, they can be immediately pinned. Otherwise, they need to be allocated using the filesystem specific APIs and then pinned. v2: - Drop gup_flags and improve comments and commit message (David) - Allocate a page if we cannot find in page cache for the hugetlbfs case as well (David) - Don't unpin pages if there is a migration related failure (David) - Drop the unnecessary nr_pages <= 0 check (Jason) - Have the caller of the API pass in file * instead of fd (Jason) v3: (David) - Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE (Build error reported by kernel test robot ) - Don't forget memalloc_pin_restore() on non-migration related errors - Improve the readability of the cleanup code associated with non-migration related errors - Augment the comments by describing FOLL_LONGTERM like behavior - Include the R-b tag from Jason v4: - Remove the local variable "page" and instead use 3 return statements in alloc_file_page() (David) - Add the R-b tag from David v5: (David) - For hugetlb case, ensure that we only obtain head pages from the mapping by using __filemap_get_folio() instead of find_get_page_flags() - Handle -EEXIST when two or more potential users try to simultaneously add a huge page to the mapping by forcing them to retry on failure v6: (Christoph) - Rename this API to memfd_pin_user_pages() to make it clear that it is intended for 
memfds - Move the memfd page allocation helper from gup.c to memfd.c - Fix indentation errors in memfd_pin_user_pages() - For contiguous ranges of folios, use a helper such as filemap_get_folios_contig() to lookup the page cache in batches Cc: David Hildenbrand Cc: Christoph Hellwig Cc: Daniel Vetter Cc: Mike Kravetz Cc: Hugh Dickins Cc: Peter Xu Cc: Gerd Hoffmann Cc: Dongwon Kim Cc: Junxiao Chang Suggested-by: Jason Gunthorpe Reviewed-by: Jason Gunthorpe (v2) Reviewed-by: David Hildenbrand (v3) Signed-off-by: Vivek Kasireddy --- include/linux/memfd.h | 5 +++ include/linux/mm.h| 2 + mm/gup.c | 102 ++ mm/memfd.c| 34 ++ 4 files changed, 143 insertions(+) diff --git a/include/linux/memfd.h b/include/linux/memfd.h index e7abf6fa4c52..6fc0d1282151 100644 --- a/include/linux/memfd.h +++ b/include/linux/memfd.h @@ -6,11 +6,16 @@ #ifdef CONFIG_MEMFD_CREATE extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg); +extern struct page *memfd_alloc_page(struct file *memfd, pgoff_t idx); #else static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned int a) { return -EINVAL; } +static inline struct page *memfd_alloc_page(struct file *memfd, pgoff_t idx) +{ + return ERR_PTR(-EINVAL); +} #endif #endif /* __LINUX_MEMFD_H */ diff --git a/include/linux/mm.h b/include/linux/mm.h index 418d26608ece..ac69db45509f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2472,6 +2472,8 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); +long memfd_pin_user_pages(struct file *file, pgoff_t start, + unsigned long nr_pages, struct page **pages); int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages); diff --git a/mm/gup.c b/mm/gup.c index 231711efa390..eb93d1ec9dc6 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -5,6 +5,7 
@@ #include #include +#include #include #include #include @@ -17,6 +18,7 @@ #include #include #include +#include #include #include @@ -3410,3 +3412,103 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, , gup_flags); } EXPORT_SYMBOL(pin_user_pages_unlocked); + +/** + * memfd_pin_user_pages() - pin user pages associated with a memfd + * @memfd: the memfd whose pages are to be pinned + * @start: starting memfd offset + * @nr_pages: number of pages from start to pin + * @pages: array that receives pointers to the pages pinned. + * Should be at-least nr_pages long.
Re: [PATCH v6 3/5] mm/gup: Introduce memfd_pin_user_pages() for pinning memfd pages (v6)
On 05.12.23 06:35, Vivek Kasireddy wrote: For drivers that would like to longterm-pin the pages associated with a memfd, the pin_user_pages_fd() API provides an option to not only pin the pages via FOLL_PIN but also to check and migrate them if they reside in movable zone or CMA block. This API currently works with memfds but it should work with any files that belong to either shmemfs or hugetlbfs. Files belonging to other filesystems are rejected for now. The pages need to be located first before pinning them via FOLL_PIN. If they are found in the page cache, they can be immediately pinned. Otherwise, they need to be allocated using the filesystem specific APIs and then pinned. v2: - Drop gup_flags and improve comments and commit message (David) - Allocate a page if we cannot find in page cache for the hugetlbfs case as well (David) - Don't unpin pages if there is a migration related failure (David) - Drop the unnecessary nr_pages <= 0 check (Jason) - Have the caller of the API pass in file * instead of fd (Jason) v3: (David) - Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE (Build error reported by kernel test robot ) - Don't forget memalloc_pin_restore() on non-migration related errors - Improve the readability of the cleanup code associated with non-migration related errors - Augment the comments by describing FOLL_LONGTERM like behavior - Include the R-b tag from Jason v4: - Remove the local variable "page" and instead use 3 return statements in alloc_file_page() (David) - Add the R-b tag from David v5: (David) - For hugetlb case, ensure that we only obtain head pages from the mapping by using __filemap_get_folio() instead of find_get_page_flags() - Handle -EEXIST when two or more potential users try to simultaneously add a huge page to the mapping by forcing them to retry on failure v6: (Christoph) - Rename this API to memfd_pin_user_pages() to make it clear that it is intended for memfds - Move the memfd page allocation helper from 
gup.c to memfd.c - Fix indentation errors in memfd_pin_user_pages() - For contiguous ranges of folios, use a helper such as filemap_get_folios_contig() to lookup the page cache in batches Cc: David Hildenbrand Cc: Christoph Hellwig Cc: Daniel Vetter Cc: Mike Kravetz Cc: Hugh Dickins Cc: Peter Xu Cc: Gerd Hoffmann Cc: Dongwon Kim Cc: Junxiao Chang Suggested-by: Jason Gunthorpe Reviewed-by: Jason Gunthorpe (v2) Reviewed-by: David Hildenbrand (v3) Signed-off-by: Vivek Kasireddy --- include/linux/memfd.h | 5 +++ include/linux/mm.h| 2 + mm/gup.c | 102 ++ mm/memfd.c| 34 ++ 4 files changed, 143 insertions(+) diff --git a/include/linux/memfd.h b/include/linux/memfd.h index e7abf6fa4c52..6fc0d1282151 100644 --- a/include/linux/memfd.h +++ b/include/linux/memfd.h @@ -6,11 +6,16 @@ #ifdef CONFIG_MEMFD_CREATE extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg); +extern struct page *memfd_alloc_page(struct file *memfd, pgoff_t idx); #else static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned int a) { return -EINVAL; } +static inline struct page *memfd_alloc_page(struct file *memfd, pgoff_t idx) +{ + return ERR_PTR(-EINVAL); +} #endif #endif /* __LINUX_MEMFD_H */ diff --git a/include/linux/mm.h b/include/linux/mm.h index 418d26608ece..ac69db45509f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2472,6 +2472,8 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); +long memfd_pin_user_pages(struct file *file, pgoff_t start, + unsigned long nr_pages, struct page **pages); int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages); diff --git a/mm/gup.c b/mm/gup.c index 231711efa390..eb93d1ec9dc6 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -5,6 +5,7 @@ #include #include +#include #include #include 
#include @@ -17,6 +18,7 @@ #include #include #include +#include #include #include @@ -3410,3 +3412,103 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, , gup_flags); } EXPORT_SYMBOL(pin_user_pages_unlocked); + +/** + * memfd_pin_user_pages() - pin user pages associated with a memfd + * @memfd: the memfd whose pages are to be pinned + * @start: starting memfd offset + * @nr_pages: number of pages from start to pin + * @pages: array that receives pointers to the pages pinned. + * Should be at-least nr_pages long. + * + * Attempt to pin pages associated with a memfd.
Re: [RFC PATCH 2/6] mm/gmem: add arch-independent abstraction to track address mapping status
On 02.12.23 15:50, Pedro Falcato wrote: On Fri, Dec 1, 2023 at 9:23 AM David Hildenbrand wrote: On 28.11.23 13:50, Weixi Zhu wrote: This patch adds an abstraction layer, struct vm_object, that maintains per-process virtual-to-physical mapping status stored in struct gm_mapping. For example, a virtual page may be mapped to a CPU physical page or to a device physical page. Struct vm_object effectively maintains an arch-independent page table, which is defined as a "logical page table". While arch-dependent page table used by a real MMU is named a "physical page table". The logical page table is useful if Linux core MM is extended to handle a unified virtual address space with external accelerators using customized MMUs. Which raises the question why we are dealing with anonymous memory at all? Why not go for shmem if you are already only special-casing VMAs with a MMAP flag right now? That would maybe avoid having to introduce controversial BSD design concepts into Linux, that feel like going a step backwards in time to me and adding *more* MM complexity. In this patch, struct vm_object utilizes a radix tree (xarray) to track where a virtual page is mapped to. This adds extra memory consumption from xarray, but provides a nice abstraction to isolate mapping status from the machine-dependent layer (PTEs). Besides supporting accelerators with external MMUs, struct vm_object is planned to further union with i_pages in struct address_mapping for file-backed memory. A file already has a tree structure (pagecache) to manage the pages that are theoretically mapped. It's easy to translate from a VMA to a page inside that tree structure that is currently not present in page tables. Why the need for that tree structure if you can just remove anon memory from the picture? The idea of struct vm_object is originated from FreeBSD VM design, which provides a unified abstraction for anonymous memory, file-backed memory, page cache and etc[1]. 
:/ Currently, Linux utilizes a set of hierarchical page walk functions to abstract page table manipulations of different CPU architectures. The problem happens when a device wants to reuse Linux MM code to manage its page table -- the device page table may not be accessible to the CPU. Existing solutions like Linux HMM utilize the MMU notifier mechanisms to invoke device-specific MMU functions, but rely on encoding the mapping status on the CPU page table entries. This entangles machine-independent code with machine-dependent code, and also brings unnecessary restrictions. Why? We have primitives to walk arch page tables in a non-arch specific fashion and are using them all over the place. We even have various mechanisms to map something into the page tables and get the CPU to fault on it, as if it is inaccessible (PROT_NONE as used for NUMA balancing, fake swap entries). The PTE size and format vary arch by arch, which harms the extensibility. Not really. We might have some features limited to some architectures because of the lack of PTE bits. And usually the problem is that people don't care enough about enabling these features on older architectures. If we ever *really* need more space for sw-defined data, it would be possible to allocate auxiliary data for page tables only where required (where the features apply), instead of crafting a completely new, auxiliary datastructure with its own locking. So far it was not required to enable the feature we need on the architectures we care about. [1] https://docs.freebsd.org/en/articles/vm-design/ In the cover letter you have: "The future plan of logical page table is to provide a generic abstraction layer that support common anonymous memory (I am looking at you, transparent huge pages) and file-backed memory." Which I doubt will happen; there is little interest in making anonymous memory management slower, more serialized, and wasting more memory on metadata.
Also worth noting that: 1) Mach VM (which FreeBSD inherited, from the old BSD) vm_objects aren't quite what's being stated here, rather they are somewhat replacements for both anon_vma and address_space[1]. Very similarly to Linux, they take pages from vm_objects and map them in page tables using pmap (the big difference is anon memory, which has its bookkeeping in page tables, on Linux) 2) These vm_objects were a horrendous mistake (see CoW chaining) and FreeBSD has to go to horrendous lengths to make them tolerable. The UVM paper/dissertation (by Charles Cranor) talks about these issues at length, and 20 years later it's still true. 3) Despite Linux MM having its warts, it's probably correct to consider it a solid improvement over FreeBSD MM or NetBSD UVM And, finally, randomly tacking on core MM concepts from other systems is at best a *really weird* idea. Particularly when they aren't even what was stated! Can you read my mind? :) thanks for noting all that, with which I 100% agree. -- Cheers, David / dhildenb
Re: [PATCH v5 3/5] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v5)
On 04.12.23 09:16, Christoph Hellwig wrote: On Thu, Nov 30, 2023 at 06:41:11AM +, Kasireddy, Vivek wrote: I see your concern. The word "file" does make it look like this API works with all kinds of files although it is meant to specifically work with files that belong to shmemfs or hugetlbfs. Since it is intended to work with memfds in particular, I'll rename this helper alloc_memfd_page(). I think it also makes sense to do s/file/memfd in this whole patch. Does this sound ok? That sounds much better, yes. And please also rename the new api to memfd_pin_user_pages. asserts that this is true). gup.c also seems like a very odd place for such a helper. I only created this helper to cleanly separate lookup and creation and to reduce the level of indentation in pin_user_pages_fd(). Anyway, would mm/memfd.c be a more appropriate location? I think so, but I'll defer to the MM maintainers. As mentioned above, this API is mainly intended for memfds and FWICS, memfds are backed by files from either shmemfs or hugetlbfs. Ok. With better naming this should be more obvious. All sounds reasonable to me! -- Cheers, David / dhildenb
Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices
On 01.12.23 03:44, zhuweixi wrote: Thanks! I hope you understood that that was a joke :) I am planning to present GMEM in Linux MM Alignment Sessions so I can collect more input from the mm developers. Sounds good. But please try inviting key HMM/driver developers as well. Most of the core-mm folks attending that meeting are not that familiar with these concepts and they are usually not happy about: (1) More core-MM complexity for things that can be neatly handled in separate subsystems with the existing infrastructure already. (2) One new way of doing things while the other things remain in place. (3) New MMAP flags. Usually you have a hard time getting this in. Sometimes, there are other ways (e.g., special-purpose filesystems). (4) Changing controversial core-mm design decisions to handle corner cases. -- Cheers, David / dhildenb
Re: [RFC PATCH 2/6] mm/gmem: add arch-independent abstraction to track address mapping status
On 28.11.23 13:50, Weixi Zhu wrote: This patch adds an abstraction layer, struct vm_object, that maintains per-process virtual-to-physical mapping status stored in struct gm_mapping. For example, a virtual page may be mapped to a CPU physical page or to a device physical page. Struct vm_object effectively maintains an arch-independent page table, which is defined as a "logical page table". While arch-dependent page table used by a real MMU is named a "physical page table". The logical page table is useful if Linux core MM is extended to handle a unified virtual address space with external accelerators using customized MMUs. Which raises the question why we are dealing with anonymous memory at all? Why not go for shmem if you are already only special-casing VMAs with a MMAP flag right now? That would maybe avoid having to introduce controversial BSD design concepts into Linux, that feel like going a step backwards in time to me and adding *more* MM complexity. In this patch, struct vm_object utilizes a radix tree (xarray) to track where a virtual page is mapped to. This adds extra memory consumption from xarray, but provides a nice abstraction to isolate mapping status from the machine-dependent layer (PTEs). Besides supporting accelerators with external MMUs, struct vm_object is planned to further union with i_pages in struct address_mapping for file-backed memory. A file already has a tree structure (pagecache) to manage the pages that are theoretically mapped. It's easy to translate from a VMA to a page inside that tree structure that is currently not present in page tables. Why the need for that tree structure if you can just remove anon memory from the picture? The idea of struct vm_object is originated from FreeBSD VM design, which provides a unified abstraction for anonymous memory, file-backed memory, page cache and etc[1]. 
:/ Currently, Linux utilizes a set of hierarchical page walk functions to abstract page table manipulations of different CPU architectures. The problem happens when a device wants to reuse Linux MM code to manage its page table -- the device page table may not be accessible to the CPU. Existing solutions like Linux HMM utilize the MMU notifier mechanisms to invoke device-specific MMU functions, but rely on encoding the mapping status on the CPU page table entries. This entangles machine-independent code with machine-dependent code, and also brings unnecessary restrictions. Why? We have primitives to walk arch page tables in a non-arch specific fashion and are using them all over the place. We even have various mechanisms to map something into the page tables and get the CPU to fault on it, as if it is inaccessible (PROT_NONE as used for NUMA balancing, fake swap entries). The PTE size and format vary arch by arch, which harms the extensibility. Not really. We might have some features limited to some architectures because of the lack of PTE bits. And usually the problem is that people don't care enough about enabling these features on older architectures. If we ever *really* need more space for sw-defined data, it would be possible to allocate auxiliary data for page tables only where required (where the features apply), instead of crafting a completely new, auxiliary datastructure with its own locking. So far it was not required to enable the feature we need on the architectures we care about. [1] https://docs.freebsd.org/en/articles/vm-design/ In the cover letter you have: "The future plan of logical page table is to provide a generic abstraction layer that support common anonymous memory (I am looking at you, transparent huge pages) and file-backed memory." Which I doubt will happen; there is little interest in making anonymous memory management slower, more serialized, and wasting more memory on metadata.
Note that you won't make many friends around here with statements like "To be honest, not using a logical page table for anonymous memory is why Linux THP fails compared with FreeBSD's superpage". I read one paper that makes such claims (I'm curious how you define "winning"), and am aware of some shortcomings. But I am not convinced that a second datastructure "is why Linux THP fails". It just requires some more work to get it sorted under Linux (e.g., allocate THP, PTE-map it and map inaccessible parts PROT_NONE, later collapse it in-place into a PMD), and so far, there was not a lot of interest that I am aware of to even start working on that. So if there is not enough pain for companies to even work on that in Linux, maybe FreeBSD superpages are "winning" "on paper" only? Remember that the target audience here are Linux developers. But yeah, this here is all designed around the idea "core MM is extended to handle a unified virtual address space with external accelerators using customized MMUs." and then trying to find other arguments why it's a good idea, without going too much into
Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices
On 29.11.23 09:27, zhuweixi wrote: Glad to hear that more sharable code is desirable. IMHO, for a common MM subsystem, it is more beneficial for GMEM to extend core MM instead of building a separate one. More core-mm complexity, awesome, we all love that! ;) -- Cheers, David / dhildenb
Re: [PATCH v4 3/5] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v4)
On 18.11.23 07:32, Vivek Kasireddy wrote: For drivers that would like to longterm-pin the pages associated with a file, the pin_user_pages_fd() API provides an option to not only pin the pages via FOLL_PIN but also to check and migrate them if they reside in movable zone or CMA block. This API currently works with files that belong to either shmem or hugetlbfs. Files belonging to other filesystems are rejected for now. The pages need to be located first before pinning them via FOLL_PIN. If they are found in the page cache, they can be immediately pinned. Otherwise, they need to be allocated using the filesystem specific APIs and then pinned. v2: - Drop gup_flags and improve comments and commit message (David) - Allocate a page if we cannot find in page cache for the hugetlbfs case as well (David) - Don't unpin pages if there is a migration related failure (David) - Drop the unnecessary nr_pages <= 0 check (Jason) - Have the caller of the API pass in file * instead of fd (Jason) v3: (David) - Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE (Build error reported by kernel test robot ) - Don't forget memalloc_pin_restore() on non-migration related errors - Improve the readability of the cleanup code associated with non-migration related errors - Augment the comments by describing FOLL_LONGTERM like behavior - Include the R-b tag from Jason v4: - Remove the local variable "page" and instead use 3 return statements in alloc_file_page() (David) - Add the R-b tag from David Cc: David Hildenbrand Cc: Daniel Vetter Cc: Mike Kravetz Cc: Hugh Dickins Cc: Peter Xu Cc: Gerd Hoffmann Cc: Dongwon Kim Cc: Junxiao Chang Suggested-by: Jason Gunthorpe Reviewed-by: Jason Gunthorpe (v2) Reviewed-by: David Hildenbrand (v3) Signed-off-by: Vivek Kasireddy --- [...] 
+static struct page *alloc_file_page(struct file *file, pgoff_t idx) +{ +#ifdef CONFIG_HUGETLB_PAGE + struct folio *folio; + int err; + + if (is_file_hugepages(file)) { + folio = alloc_hugetlb_folio_nodemask(hstate_file(file), +NUMA_NO_NODE, +NULL, +GFP_USER); + if (folio && folio_try_get(folio)) { + err = hugetlb_add_to_page_cache(folio, + file->f_mapping, + idx); + if (err) { + folio_put(folio); + free_huge_folio(folio); + return ERR_PTR(err); + } + return &folio->page; While looking at the user of pin_user_pages_fd(), I realized something: Assume idx is not aligned to the hugetlb page size. find_get_page_flags() would always return a tail page in that case, but you'd be returning the head page here. See pagecache_get_page()->folio_file_page(folio, index); + } + return ERR_PTR(-ENOMEM); + } +#endif + return shmem_read_mapping_page(file->f_mapping, idx); +} + +/** + * pin_user_pages_fd() - pin user pages associated with a file + * @file: the file whose pages are to be pinned + * @start: starting file offset + * @nr_pages: number of pages from start to pin + * @pages: array that receives pointers to the pages pinned. + * Should be at-least nr_pages long. + * + * Attempt to pin pages associated with a file that belongs to either shmem + * or hugetlb. The pages are either found in the page cache or allocated if + * necessary. Once the pages are located, they are all pinned via FOLL_PIN. + * And, these pinned pages need to be released either using unpin_user_pages() + * or unpin_user_page(). + * + * It must be noted that the pages may be pinned for an indefinite amount + * of time. And, in most cases, the duration of time they may stay pinned + * would be controlled by the userspace. This behavior is effectively the + * same as using FOLL_LONGTERM with other GUP APIs. + * + * Returns number of pages pinned. This would be equal to the number of + * pages requested. If no pages were pinned, it returns -errno.
+ */ +long pin_user_pages_fd(struct file *file, pgoff_t start, + unsigned long nr_pages, struct page **pages) +{ + struct page *page; + unsigned int flags, i; + long ret; + + if (start < 0) + return -EINVAL; + + if (!file) + return -EINVAL; + + if (!shmem_file(file) && !is_file_hugepages(file)) + return -EINVAL; + + flags = memalloc_pin_save(); + do { + for (i = 0; i < nr_pages; i++) { + /* +* In most cases, we should be able to find the page +
Re: [PATCH] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v3)
On 14.11.23 08:00, Vivek Kasireddy wrote: For drivers that would like to longterm-pin the pages associated with a file, the pin_user_pages_fd() API provides an option to not only pin the pages via FOLL_PIN but also to check and migrate them if they reside in movable zone or CMA block. This API currently works with files that belong to either shmem or hugetlbfs. Files belonging to other filesystems are rejected for now. The pages need to be located first before pinning them via FOLL_PIN. If they are found in the page cache, they can be immediately pinned. Otherwise, they need to be allocated using the filesystem specific APIs and then pinned. v2: - Drop gup_flags and improve comments and commit message (David) - Allocate a page if we cannot find in page cache for the hugetlbfs case as well (David) - Don't unpin pages if there is a migration related failure (David) - Drop the unnecessary nr_pages <= 0 check (Jason) - Have the caller of the API pass in file * instead of fd (Jason) v3: (David) - Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE (Build error reported by kernel test robot ) - Don't forget memalloc_pin_restore() on non-migration related errors - Improve the readability of the cleanup code associated with non-migration related errors - Augment the comments by describing FOLL_LONGTERM like behavior - Include the R-b tag from Jason Cc: David Hildenbrand Cc: Daniel Vetter Cc: Mike Kravetz Cc: Hugh Dickins Cc: Peter Xu Cc: Gerd Hoffmann Cc: Dongwon Kim Cc: Junxiao Chang Suggested-by: Jason Gunthorpe Reviewed-by: Jason Gunthorpe (v2) Signed-off-by: Vivek Kasireddy --- include/linux/mm.h | 2 + mm/gup.c | 109 + 2 files changed, 111 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 418d26608ece..1b675fa35059 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2472,6 +2472,8 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); long 
pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); +long pin_user_pages_fd(struct file *file, pgoff_t start, + unsigned long nr_pages, struct page **pages); int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages); diff --git a/mm/gup.c b/mm/gup.c index 231711efa390..b3af967cdff1 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -3410,3 +3410,112 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, , gup_flags); } EXPORT_SYMBOL(pin_user_pages_unlocked); + +static struct page *alloc_file_page(struct file *file, pgoff_t idx) +{ +#ifdef CONFIG_HUGETLB_PAGE + struct page *page = ERR_PTR(-ENOMEM); + struct folio *folio; + int err; + + if (is_file_hugepages(file)) { + folio = alloc_hugetlb_folio_nodemask(hstate_file(file), +NUMA_NO_NODE, +NULL, +GFP_USER); + if (folio && folio_try_get(folio)) { + page = &folio->page; + err = hugetlb_add_to_page_cache(folio, + file->f_mapping, + idx); + if (err) { + folio_put(folio); + free_huge_folio(folio); + page = ERR_PTR(err); + } + } + return page; You could avoid the "page" variable completely simply by using 3 return statements. LGTM, thanks Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 1/3] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v2)
On 06.11.23 07:15, Vivek Kasireddy wrote: For drivers that would like to longterm-pin the pages associated with a file, the pin_user_pages_fd() API provides an option to not only pin the pages via FOLL_PIN but also to check and migrate them if they reside in movable zone or CMA block. This API currently works with files that belong to either shmem or hugetlbfs. Files belonging to other filesystems are rejected for now. The pages need to be located first before pinning them via FOLL_PIN. If they are found in the page cache, they can be immediately pinned. Otherwise, they need to be allocated using the filesystem specific APIs and then pinned. v2: - Drop gup_flags and improve comments and commit message (David) - Allocate a page if we cannot find in page cache for the hugetlbfs case as well (David) - Don't unpin pages if there is a migration related failure (David) - Drop the unnecessary nr_pages <= 0 check (Jason) - Have the caller of the API pass in file * instead of fd (Jason) Cc: David Hildenbrand Cc: Daniel Vetter Cc: Mike Kravetz Cc: Hugh Dickins Cc: Peter Xu Cc: Gerd Hoffmann Cc: Dongwon Kim Cc: Junxiao Chang Suggested-by: Jason Gunthorpe Signed-off-by: Vivek Kasireddy --- include/linux/mm.h | 2 + mm/gup.c | 99 ++ 2 files changed, 101 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index bf5d0b1b16f4..f6cc17b14653 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2457,6 +2457,8 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); +long pin_user_pages_fd(struct file *file, pgoff_t start, + unsigned long nr_pages, struct page **pages); int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages); diff --git a/mm/gup.c b/mm/gup.c index 2f8a2d89fde1..d30b9dfebbb6 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ 
-3400,3 +3400,102 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, , gup_flags); } EXPORT_SYMBOL(pin_user_pages_unlocked); + +static struct page *alloc_file_page(struct file *file, pgoff_t idx) +{ + struct page *page = ERR_PTR(-ENOMEM); + struct folio *folio; + int err; + + if (shmem_file(file)) + return shmem_read_mapping_page(file->f_mapping, idx); + As the build reports indicate, this might have to be fenced with #ifdef CONFIG_HUGETLB_PAGE + folio = alloc_hugetlb_folio_nodemask(hstate_file(file), +NUMA_NO_NODE, +NULL, +GFP_USER); + if (folio && folio_try_get(folio)) { + page = &folio->page; + err = hugetlb_add_to_page_cache(folio, file->f_mapping, idx); + if (err) { + folio_put(folio); + free_huge_folio(folio); + page = ERR_PTR(err); + } + } + + return page; +} + +/** + * pin_user_pages_fd() - pin user pages associated with a file + * @file: the file whose pages are to be pinned + * @start: starting file offset + * @nr_pages: number of pages from start to pin + * @pages: array that receives pointers to the pages pinned. + * Should be at-least nr_pages long. + * + * Attempt to pin pages associated with a file that belongs to either shmem + * or hugetlbfs. The pages are either found in the page cache or allocated nit: s/hugetlbfs/hugetlb/ + * if necessary. Once the pages are located, they are all pinned via FOLL_PIN. + * And, these pinned pages need to be released using unpin_user_pages() or + * unpin_user_page(). + * It might be reasonable to add that the behavior mimics FOLL_LONGTERM semantics: the page may be held for an indefinite time period _often_ under userspace control. + * Returns number of pages pinned. This would be equal to the number of + * pages requested. If no pages were pinned, it returns -errno.
+ */ +long pin_user_pages_fd(struct file *file, pgoff_t start, + unsigned long nr_pages, struct page **pages) +{ + struct page *page; + unsigned int flags, i; + long ret; + + if (start < 0) + return -EINVAL; + + if (!file) + return -EINVAL; + + if (!shmem_file(file) && !is_file_hugepages(file)) + return -EINVAL; + + flags = memalloc_pin_save(); + do { + for (i = 0; i < nr_pages; i++) { + /* +* In most cases, we should be able to find the page +
Re: [PATCH v1 1/3] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages
On 03.10.23 09:44, Vivek Kasireddy wrote: For drivers that would like to longterm-pin the pages associated with a file, the pin_user_pages_fd() API provides an option to not only FOLL_PIN the pages but also to check and migrate them if they reside in movable zone or CMA block. For now, this API can only work with files belonging to shmem or hugetlbfs given that the udmabuf driver is the only user. Maybe add "Other files are rejected.". Wasn't clear to me before I looked into the code. It must be noted that the pages associated with hugetlbfs files are expected to be found in the page cache. An error is returned if they are not found. However, shmem pages can be swapped in or allocated if they are not present in the page cache. Cc: David Hildenbrand Cc: Daniel Vetter Cc: Mike Kravetz Cc: Hugh Dickins Cc: Peter Xu Cc: Gerd Hoffmann Cc: Dongwon Kim Cc: Junxiao Chang Suggested-by: Jason Gunthorpe Signed-off-by: Vivek Kasireddy --- include/linux/mm.h | 2 ++ mm/gup.c | 87 ++ 2 files changed, 89 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index bf5d0b1b16f4..af2121fb8101 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2457,6 +2457,8 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); +long pin_user_pages_fd(int fd, pgoff_t start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages); int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages); diff --git a/mm/gup.c b/mm/gup.c index 2f8a2d89fde1..e34b77a15fa8 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -3400,3 +3400,90 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, &locked, gup_flags); } EXPORT_SYMBOL(pin_user_pages_unlocked); + This does look quite neat, nice! Let's take a closer look ... 
+/** + * pin_user_pages_fd() - pin user pages associated with a file + * @fd: the fd whose pages are to be pinned + * @start: starting file offset + * @nr_pages: number of pages from start to pin + * @gup_flags: flags modifying pin behaviour ^ I assume we should drop that. At least for now the flags are completely unused. And most likely we would want a different set of flags later (GUPFD_ ...). + * @pages: array that receives pointers to the pages pinned. + * Should be at least nr_pages long. + * + * Attempt to pin (and migrate) pages associated with a file belonging to I'd drop the "and migrate" part, it's more of an implementation detail. + * either shmem or hugetlbfs. An error is returned if pages associated with + * hugetlbfs files are not present in the page cache. However, shmem pages + * are swapped in or allocated if they are not present in the page cache. Why don't we do the same for hugetlbfs? Would make the interface more streamlined. Certainly add that pinned pages have to be released using unpin_user_pages(). + * + * Returns number of pages pinned. This would be equal to the number of + * pages requested. + * If nr_pages is 0 or negative, returns 0. If no pages were pinned, returns + * -errno. + */ +long pin_user_pages_fd(int fd, pgoff_t start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages) +{ + struct page *page; + struct file *filep; + unsigned int flags, i; + long ret; + + if (nr_pages <= 0) + return 0; I think we should just forbid that and use a WARN_ON_ONCE() here / return -EINVAL. So we'll never end up returning 0. 
+ if (!is_valid_gup_args(pages, NULL, &gup_flags, FOLL_PIN)) + return 0; + + if (start < 0) + return -EINVAL; + + filep = fget(fd); + if (!filep) + return -EINVAL; + + if (!shmem_file(filep) && !is_file_hugepages(filep)) + return -EINVAL; + + flags = memalloc_pin_save(); + do { + for (i = 0; i < nr_pages; i++) { + if (shmem_mapping(filep->f_mapping)) { + page = shmem_read_mapping_page(filep->f_mapping, + start + i); + if (IS_ERR(page)) { + ret = PTR_ERR(page); + goto err; + } + } else { + page = find_get_page_flags(filep->f_mapping, + start + i, +
Re: [PATCH v1 0/3] udmabuf: Add support for page migration out of movable zone or CMA
I think it makes sense to have a generic (non-GUP) version of check_and_migrate_movable_pages() available in migration.h that drivers can use to ensure that they don't break memory hotunplug accidentally. Definitely not. Either use the VMA and pin_user_pages(), or implement pin_user_pages_fd() in core code. Do not open code something wonky in drivers. Agreed. pin_user_pages_fd() might become relevant in the context of vfio/mdev + KVM gmem -- don't mmap guest memory but instead provide it via a special memfd to the kernel. So there might be value in having such a core infrastructure. -- Cheers, David / dhildenb
Re: [PATCH v1 0/3] udmabuf: Add support for page migration out of movable zone or CMA
On 24.08.23 20:30, Jason Gunthorpe wrote: On Thu, Aug 24, 2023 at 08:30:17PM +0200, David Hildenbrand wrote: On 24.08.23 08:31, Kasireddy, Vivek wrote: Hi David, - Add a new API to the backing store/allocator to longterm-pin the page. For example, something along the lines of shmem_pin_mapping_page_longterm() for shmem as suggested by Daniel. A similar one needs to be added for hugetlbfs as well. This may also be reasonable. Sounds reasonable to keep the old API (that we unfortunately have) working. I agree; I'd like to avoid adding new APIs unless absolutely necessary. Given this, and considering the options I have mentioned earlier, what would be your recommendation for how page migration needs to be done in udmabuf driver? I guess using proper APIs for shmem and hugetlb. So, turning roughly what you have in patch#1 for now into common code, and only calling into that from udmabuf. This is a lot of work for an obscure uapi :\ Well, what can we otherwise do to *not* break existing users? I'm not happy about this either. Of course, we can come up with a new uapi, but we have to handle the old uapi somehow. Sure, we can simply always fail when we detect ZONE_MOVABLE or MIGRATE_CMA. Maybe that keeps at least some use cases working. -- Cheers, David / dhildenb
Re: [PATCH v1 0/3] udmabuf: Add support for page migration out of movable zone or CMA
On 24.08.23 08:31, Kasireddy, Vivek wrote: Hi David, - Add a new API to the backing store/allocator to longterm-pin the page. For example, something along the lines of shmem_pin_mapping_page_longterm() for shmem as suggested by Daniel. A similar one needs to be added for hugetlbfs as well. This may also be reasonable. Sounds reasonable to keep the old API (that we unfortunately have) working. I agree; I'd like to avoid adding new APIs unless absolutely necessary. Given this, and considering the options I have mentioned earlier, what would be your recommendation for how page migration needs to be done in udmabuf driver? I guess using proper APIs for shmem and hugetlb. So, turning roughly what you have in patch#1 for now into common code, and only calling into that from udmabuf. -- Cheers, David / dhildenb
Re: [PATCH v1 0/3] udmabuf: Add support for page migration out of movable zone or CMA
- Add a new API to the backing store/allocator to longterm-pin the page. For example, something along the lines of shmem_pin_mapping_page_longterm() for shmem as suggested by Daniel. A similar one needs to be added for hugetlbfs as well. This may also be reasonable. Sounds reasonable to keep the old API (that we unfortunately have) working. -- Cheers, David / dhildenb
Re: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)
On 04.08.23 08:39, Kasireddy, Vivek wrote: Hi Alistair, David, Jason, Right, the "the zero pages are changed into writable pages" in your above comment just might not apply, because there won't be any page replacement (hopefully :) ). If the page replacement does not happen when there are new writes to the area where the hole previously existed, then would we still get an invalidate when this happens? Is there any other way to get notified when the zeroed page is written to if the invalidate does not get triggered? What David is saying is that memfd does not use the zero page optimization for hole punches. Any access to the memory, including read-only access through hmm_range_fault() will allocate unique pages. Since there is no zero page and no zero-page replacement there is no issue with invalidations. It looks like even with hmm_range_fault(), the invalidate does not get triggered when the hole is refilled with new pages because of writes. This is probably because hmm_range_fault() does not fault in any pages that get invalidated later when writes occur. hmm_range_fault() returns the current content of the VMAs, or it faults. If it returns pages then it came from one of these two places. If your VMA is incoherent with what you are doing then you have bigger problems, or maybe you found a bug. Note it will only fault in pages if HMM_PFN_REQ_FAULT is specified. You are setting that however you aren't setting HMM_PFN_REQ_WRITE which is what would trigger a fault to bring in the new pages. Does setting that fix the issue you are seeing? No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the issue. 
Although, I do not have THP enabled (or built-in), shmem does not evict the pages after hole punch as noted in the comment in shmem_fallocate(): if ((u64)unmap_end > (u64)unmap_start) unmap_mapping_range(mapping, unmap_start, 1 + unmap_end - unmap_start, 0); shmem_truncate_range(inode, offset, offset + len - 1); /* No need to unmap again: hole-punching leaves COWed pages */ As a result, the pfn is still valid and the pte is pte_present() and pte_write(). This is the reason why adding in HMM_PFN_REQ_WRITE does not help; Just to understand your setup: you are definitely using a MAP_SHARED shmem mapping, and not accidentally a MAP_PRIVATE mapping? -- Cheers, David / dhildenb
Re: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)
On 03.08.23 14:14, Jason Gunthorpe wrote: On Thu, Aug 03, 2023 at 07:35:51AM +, Kasireddy, Vivek wrote: Hi Jason, Right, the "the zero pages are changed into writable pages" in your above comment just might not apply, because there won't be any page replacement (hopefully :) ). If the page replacement does not happen when there are new writes to the area where the hole previously existed, then would we still get an invalidate when this happens? Is there any other way to get notified when the zeroed page is written to if the invalidate does not get triggered? What David is saying is that memfd does not use the zero page optimization for hole punches. Any access to the memory, including read-only access through hmm_range_fault() will allocate unique pages. Since there is no zero page and no zero-page replacement there is no issue with invalidations. It looks like even with hmm_range_fault(), the invalidate does not get triggered when the hole is refilled with new pages because of writes. This is probably because hmm_range_fault() does not fault in any pages that get invalidated later when writes occur. hmm_range_fault() returns the current content of the VMAs, or it faults. If it returns pages then it came from one of these two places. If your VMA is incoherent with what you are doing then you have bigger problems, or maybe you found a bug. The above log messages are seen immediately after the hole is punched. As you can see, hmm_range_fault() returns the pfns of old pages and not zero pages. And, I see the below messages (with patch #2 in this series applied) as the hole is refilled after writes: I don't know what you are doing, but it is something wrong or you've found a bug in the memfds. Maybe THP is involved? 
I recently had to dig that out for an internal discussion: "Currently when truncating shmem file, if the range is partial of THP (start or end is in the middle of THP), the pages actually will just get cleared rather than being freed unless the range cover the whole THP. Even though all the subpages are truncated (randomly or sequentially), the THP may still be kept in page cache. This might be fine for some usecases which prefer preserving THP." My recollection is that this behavior was never changed. https://lore.kernel.org/all/1575420174-19171-1-git-send-email-yang@linux.alibaba.com/ -- Cheers, David / dhildenb
Re: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)
On 01.08.23 14:26, Jason Gunthorpe wrote: On Tue, Aug 01, 2023 at 02:26:03PM +0200, David Hildenbrand wrote: On 01.08.23 14:23, Jason Gunthorpe wrote: On Tue, Aug 01, 2023 at 02:22:12PM +0200, David Hildenbrand wrote: On 01.08.23 14:19, Jason Gunthorpe wrote: On Tue, Aug 01, 2023 at 05:32:38AM +0000, Kasireddy, Vivek wrote: You get another invalidate because the memfd removes the zero pages that hmm_range_fault installed in the PTEs before replacing them with actual writable pages. Then you do the move, and another hmm_range_fault, and basically the whole thing over again. Except this time instead of returning zero pages it returns actual writable page. Ok, when I tested earlier (by registering an invalidate callback) but without hmm_range_fault(), I did not find this additional invalidate getting triggered. Let me try with hmm_range_fault() and see if everything works as expected. Thank you for your help. If you do not get an invalidate then there is a pretty serious bug in the mm that needs fixing. Anything hmm_range_fault() returns must be invalidated if the underlying CPU mapping changes for any reason. Since hmm_range_fault() will populate zero pages when reading from a hole in a memfd, it must also get an invalidation when the zero pages are changed into writable pages. Can you point me at the code that returns that (shared) zero page? It calls handle_mm_fault() - shouldn't that do it? Same as if the CPU read faulted the page? To the best of my knowledge, the shared zeropage is only used in MAP_PRIVATE|MAP_ANON mappings and in weird DAX mappings. If that changed, we have to fix FOLL_PIN|FOLL_LONGTERM for MAP_SHARED VMAs. If you read-fault on a memfd hole, you should get a proper "zeroed" pagecache page that effectively "filled that hole" -- so there is no file hole anymore. Sounds fine then :) Right, the "the zero pages are changed into writable pages" in your above comment just might not apply, because there won't be any page replacement (hopefully :) ). 
-- Cheers, David / dhildenb
Re: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)
On 01.08.23 14:23, Jason Gunthorpe wrote: On Tue, Aug 01, 2023 at 02:22:12PM +0200, David Hildenbrand wrote: On 01.08.23 14:19, Jason Gunthorpe wrote: On Tue, Aug 01, 2023 at 05:32:38AM +0000, Kasireddy, Vivek wrote: You get another invalidate because the memfd removes the zero pages that hmm_range_fault installed in the PTEs before replacing them with actual writable pages. Then you do the move, and another hmm_range_fault, and basically the whole thing over again. Except this time instead of returning zero pages it returns actual writable page. Ok, when I tested earlier (by registering an invalidate callback) but without hmm_range_fault(), I did not find this additional invalidate getting triggered. Let me try with hmm_range_fault() and see if everything works as expected. Thank you for your help. If you do not get an invalidate then there is a pretty serious bug in the mm that needs fixing. Anything hmm_range_fault() returns must be invalidated if the underlying CPU mapping changes for any reason. Since hmm_range_fault() will populate zero pages when reading from a hole in a memfd, it must also get an invalidation when the zero pages are changed into writable pages. Can you point me at the code that returns that (shared) zero page? It calls handle_mm_fault() - shouldn't that do it? Same as if the CPU read faulted the page? To the best of my knowledge, the shared zeropage is only used in MAP_PRIVATE|MAP_ANON mappings and in weird DAX mappings. If that changed, we have to fix FOLL_PIN|FOLL_LONGTERM for MAP_SHARED VMAs. If you read-fault on a memfd hole, you should get a proper "zeroed" pagecache page that effectively "filled that hole" -- so there is no file hole anymore. -- Cheers, David / dhildenb
Re: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)
On 01.08.23 14:19, Jason Gunthorpe wrote: On Tue, Aug 01, 2023 at 05:32:38AM +0000, Kasireddy, Vivek wrote: You get another invalidate because the memfd removes the zero pages that hmm_range_fault installed in the PTEs before replacing them with actual writable pages. Then you do the move, and another hmm_range_fault, and basically the whole thing over again. Except this time instead of returning zero pages it returns actual writable page. Ok, when I tested earlier (by registering an invalidate callback) but without hmm_range_fault(), I did not find this additional invalidate getting triggered. Let me try with hmm_range_fault() and see if everything works as expected. Thank you for your help. If you do not get an invalidate then there is a pretty serious bug in the mm that needs fixing. Anything hmm_range_fault() returns must be invalidated if the underlying CPU mapping changes for any reason. Since hmm_range_fault() will populate zero pages when reading from a hole in a memfd, it must also get an invalidation when the zero pages are changed into writable pages. Can you point me at the code that returns that (shared) zero page? -- Cheers, David / dhildenb
Re: [PATCH v2 4/4] perf/core: use vma_is_initial_stack() and vma_is_initial_heap()
On 19.07.23 09:51, Kefeng Wang wrote: Use the helpers to simplify code, also kill unneeded goto cpy_name. Cc: Peter Zijlstra Cc: Arnaldo Carvalho de Melo Signed-off-by: Kefeng Wang --- kernel/events/core.c | 22 +++--- 1 file changed, 7 insertions(+), 15 deletions(-) diff --git a/kernel/events/core.c b/kernel/events/core.c index 78ae7b6f90fd..d59f6327472f 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -8685,22 +8685,14 @@ static void perf_event_mmap_event(struct perf_mmap_event *mmap_event) } name = (char *)arch_vma_name(vma); - if (name) - goto cpy_name; - - if (vma->vm_start <= vma->vm_mm->start_brk && - vma->vm_end >= vma->vm_mm->brk) { - name = "[heap]"; - goto cpy_name; + if (!name) { + if (vma_is_initial_heap(vma)) + name = "[heap]"; + else if (vma_is_initial_stack(vma)) + name = "[stack]"; + else + name = "//anon"; } - if (vma->vm_start <= vma->vm_mm->start_stack && - vma->vm_end >= vma->vm_mm->start_stack) { - name = "[stack]"; - goto cpy_name; - } - - name = "//anon"; - goto cpy_name; If you're removing that goto, maybe also worth removing the goto at the end of the previous if branch. Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 3/4] selinux: use vma_is_initial_stack() and vma_is_initial_heap()
On 19.07.23 09:51, Kefeng Wang wrote: Use the helpers to simplify code. Cc: Paul Moore Cc: Stephen Smalley Cc: Eric Paris Acked-by: Paul Moore Signed-off-by: Kefeng Wang --- security/selinux/hooks.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index d06e350fedee..ee8575540a8e 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -3762,13 +3762,10 @@ static int selinux_file_mprotect(struct vm_area_struct *vma, if (default_noexec && (prot & PROT_EXEC) && !(vma->vm_flags & VM_EXEC)) { int rc = 0; - if (vma->vm_start >= vma->vm_mm->start_brk && - vma->vm_end <= vma->vm_mm->brk) { + if (vma_is_initial_heap(vma)) { rc = avc_has_perm(sid, sid, SECCLASS_PROCESS, PROCESS__EXECHEAP, NULL); - } else if (!vma->vm_file && - ((vma->vm_start <= vma->vm_mm->start_stack && -vma->vm_end >= vma->vm_mm->start_stack) || + } else if (!vma->vm_file && (vma_is_initial_stack(vma) || vma_is_stack_for_current(vma))) { rc = avc_has_perm(sid, sid, SECCLASS_PROCESS, PROCESS__EXECSTACK, NULL); Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 2/4] drm/amdkfd: use vma_is_initial_stack() and vma_is_initial_heap()
On 19.07.23 09:51, Kefeng Wang wrote: Use the helpers to simplify code. Cc: Felix Kuehling Cc: Alex Deucher Cc: "Christian König" Cc: "Pan, Xinhui" Cc: David Airlie Cc: Daniel Vetter Signed-off-by: Kefeng Wang --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 5ff1a5a89d96..0b7bfbd0cb66 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -2621,10 +2621,7 @@ svm_range_get_range_boundaries(struct kfd_process *p, int64_t addr, return -EFAULT; } - *is_heap_stack = (vma->vm_start <= vma->vm_mm->brk && - vma->vm_end >= vma->vm_mm->start_brk) || -(vma->vm_start <= vma->vm_mm->start_stack && - vma->vm_end >= vma->vm_mm->start_stack); + *is_heap_stack = vma_is_initial_heap(vma) || vma_is_initial_stack(vma); start_limit = max(vma->vm_start >> PAGE_SHIFT, (unsigned long)ALIGN_DOWN(addr, 2UL << 8)); Certainly a valid refactoring, although questionable if such code should care about that. Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 1/4] mm: factor out VMA stack and heap checks
On 19.07.23 09:51, Kefeng Wang wrote: Factor out VMA stack and heap checks and name them vma_is_initial_stack() and vma_is_initial_heap() for general use. Cc: Christian Göttsche Cc: David Hildenbrand Signed-off-by: Kefeng Wang --- [...] diff --git a/include/linux/mm.h b/include/linux/mm.h index 2dd73e4f3d8e..51f8c573db74 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -822,6 +822,27 @@ static inline bool vma_is_anonymous(struct vm_area_struct *vma) return !vma->vm_ops; } Worth adding a similar comment like for vma_is_initial_stack() ? +static inline bool vma_is_initial_heap(const struct vm_area_struct *vma) +{ + return vma->vm_start <= vma->vm_mm->brk && + vma->vm_end >= vma->vm_mm->start_brk; +} + +/* + * Indicate if the VMA is a stack for the given task; for + * /proc/PID/maps that is the stack of the main task. + */ +static inline bool vma_is_initial_stack(const struct vm_area_struct *vma) +{ + /* +* We make no effort to guess what a given thread considers to be +* its "stack". It's not even well-defined for programs written +* in languages like Go. +*/ + return vma->vm_start <= vma->vm_mm->start_stack && + vma->vm_end >= vma->vm_mm->start_stack; +} + static inline bool vma_is_temporary_stack(struct vm_area_struct *vma) { int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP); Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 2/2] udmabuf: Add back support for mapping hugetlb pages (v2)
On 18.07.23 10:26, Vivek Kasireddy wrote: A user or admin can configure a VMM (Qemu) Guest's memory to be backed by hugetlb pages for various reasons. However, a Guest OS would still allocate (and pin) buffers that are backed by regular 4k sized pages. In order to map these buffers and create dma-bufs for them on the Host, we first need to find the hugetlb pages where the buffer allocations are located and then determine the offsets of individual chunks (within those pages) and use this information to eventually populate a scatterlist. Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500 options were passed to the Host kernel and Qemu was launched with these relevant options: qemu-system-x86_64 -m 4096m -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080 -display gtk,gl=on -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M -machine memory-backend=mem1 Replacing -display gtk,gl=on with -display gtk,gl=off above would exercise the mmap handler. v2: Updated get_sg_table() to manually populate the scatterlist for both huge page and non-huge-page cases. Cc: David Hildenbrand Cc: Mike Kravetz Cc: Hugh Dickins Cc: Peter Xu Cc: Jason Gunthorpe Cc: Gerd Hoffmann Cc: Dongwon Kim Cc: Junxiao Chang Signed-off-by: Vivek Kasireddy --- drivers/dma-buf/udmabuf.c | 84 +-- 1 file changed, 71 insertions(+), 13 deletions(-) LGTM, in general. But I really hope Mike can comment. -- Cheers, David / dhildenb
Re: [PATCH v2 1/2] udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap
On 18.07.23 10:26, Vivek Kasireddy wrote: Add VM_PFNMAP to vm_flags in the mmap handler to ensure that the mappings would be managed without using struct page. And, in the vm_fault handler, use vmf_insert_pfn to share the page's pfn to userspace instead of directly sharing the page (via struct page *). Cc: David Hildenbrand Cc: Mike Kravetz Cc: Hugh Dickins Cc: Peter Xu Cc: Jason Gunthorpe Cc: Gerd Hoffmann Cc: Dongwon Kim Cc: Junxiao Chang Suggested-by: David Hildenbrand Signed-off-by: Vivek Kasireddy --- drivers/dma-buf/udmabuf.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c index c40645999648..820c993c8659 100644 --- a/drivers/dma-buf/udmabuf.c +++ b/drivers/dma-buf/udmabuf.c @@ -35,12 +35,13 @@ static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf) struct vm_area_struct *vma = vmf->vma; struct udmabuf *ubuf = vma->vm_private_data; pgoff_t pgoff = vmf->pgoff; + unsigned long pfn; if (pgoff >= ubuf->pagecount) return VM_FAULT_SIGBUS; - vmf->page = ubuf->pages[pgoff]; - get_page(vmf->page); - return 0; + + pfn = page_to_pfn(ubuf->pages[pgoff]); + return vmf_insert_pfn(vma, vmf->address, pfn); } static const struct vm_operations_struct udmabuf_vm_ops = { @@ -56,6 +57,7 @@ static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct *vma) vma->vm_ops = &udmabuf_vm_ops; vma->vm_private_data = ubuf; + vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP); return 0; } Looks reasonable to me. Acked-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH 1/5] mm: introduce vma_is_stack() and vma_is_heap()
On 12.07.23 16:38, Kefeng Wang wrote: Introduce the two helpers for general use. Signed-off-by: Kefeng Wang --- include/linux/mm.h | 12 1 file changed, 12 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 1462cf15badf..0bbeb31ac750 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -926,6 +926,18 @@ static inline bool vma_is_anonymous(struct vm_area_struct *vma) return !vma->vm_ops; } +static inline bool vma_is_heap(struct vm_area_struct *vma) +{ + return vma->vm_start <= vma->vm_mm->brk && + vma->vm_end >= vma->vm_mm->start_brk; +} + +static inline bool vma_is_stack(struct vm_area_struct *vma) +{ + return vma->vm_start <= vma->vm_mm->start_stack && + vma->vm_end >= vma->vm_mm->start_stack; +} + static inline bool vma_is_temporary_stack(struct vm_area_struct *vma) { int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP); Looking at the comments in patch #3, should these functions be called vma_is_initial_heap / vma_is_initial_stack ? -- Cheers, David / dhildenb
Re: [PATCH 2/5] mm: use vma_is_stack() and vma_is_heap()
On 12.07.23 16:38, Kefeng Wang wrote: Use the helpers to simplify code. Signed-off-by: Kefeng Wang --- fs/proc/task_mmu.c | 24 fs/proc/task_nommu.c | 15 +-- 2 files changed, 5 insertions(+), 34 deletions(-) Please squash patch #1 and this patch and call it something like "mm: factor out VMA stack and heap checks" And then, maybe also keep the comments in these functions, they sound reasonable to have. -- Cheers, David / dhildenb
Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
On 27.06.23 08:37, Kasireddy, Vivek wrote: Hi David, Hi! sorry for taking a bit longer to reply lately. [...] Sounds right, maybe it needs to go back to the old GUP solution, though, as mmu notifiers are also mm-based not fd-based. Or to be explicit, I think it'll be pin_user_pages(FOLL_LONGTERM) with the new API. It'll also solve the movable pages issue on pinning. It better should be pin_user_pages(FOLL_LONGTERM). But I'm afraid we cannot achieve that without breaking the existing kernel interface ... Yeah, as you suggest, we unfortunately cannot go back to using GUP without breaking udmabuf_create UAPI that expects memfds and file offsets. So we might have to implement the same page migration as gup does on FOLL_LONGTERM here ... maybe there are more such cases/drivers that actually require that handling when simply taking pages out of the memfd, believing they can hold on to them forever. IIUC, I don't think just handling the page migration in udmabuf is going to cut it. It might require active cooperation of the Guest GPU driver as well if this is even feasible. The idea is, that once you extract the page from the memfd and it resides somewhere bad (MIGRATE_CMA, ZONE_MOVABLE), you trigger page migration. Essentially what migrate_longterm_unpinnable_pages() does: Why would the guest driver have to be involved? It shouldn't care about page migration in the hypervisor. [...] balloon, and then using that memory for communicating with the device] Maybe it's all fine with udmabuf because of the way it is setup/torn down by the guest driver. Unfortunately I can't tell. Here are the functions used by virtio-gpu (Guest GPU driver) to allocate pages for its resources: __drm_gem_shmem_create: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_gem_shmem_helper.c#L97 Interestingly, the comment in the above function says that the pages should not be allocated from the MOVABLE zone. 
It doesn't add GFP_MOVABLE, so pages don't end up in ZONE_MOVABLE/MIGRATE_CMA *in the guest*. But we care about the ZONE_MOVABLE /MIGRATE_CMA *in the host*. (what the guest does is right, though) IOW, what udmabuf does with guest memory on the hypervisor side, not the guest driver on the guest side. The pages along with their dma addresses are then extracted and shared with Qemu using these two functions: drm_gem_get_pages: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_gem.c#L534 virtio_gpu_object_shmem_init: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/virtio/virtgpu_object.c#L135 ^ so these two target the guest driver as well, right? IOW, there is a memfd (shmem) in the guest that the guest driver uses to allocate pages from and there is the memfd in the hypervisor to back guest RAM. The latter gets registered with udmabuf. Qemu then translates the dma addresses into file offsets and creates udmabufs -- as an optimization to avoid data copies only if blob is set to true. If the guest OS doesn't end up freeing/reallocating that memory while it's registered with udmabuf in the hypervisor, then we should be fine. Because that way, the guest won't end up trigger MADV_DONTNEED by "accident". -- Cheers, David / dhildenb
Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
On 26.06.23 19:52, Peter Xu wrote: On Mon, Jun 26, 2023 at 07:45:37AM +0000, Kasireddy, Vivek wrote: Hi Peter, On Fri, Jun 23, 2023 at 06:13:02AM +0000, Kasireddy, Vivek wrote: Hi David, The first patch ensures that the mappings needed for handling mmap operation would be managed by using the pfn instead of struct page. The second patch restores support for mapping hugetlb pages where subpages of a hugepage are not directly used anymore (main reason for revert) and instead the hugetlb pages and the relevant offsets are used to populate the scatterlist for dma-buf export and for mmap operation. Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500 options were passed to the Host kernel and Qemu was launched with these relevant options: qemu-system-x86_64 -m 4096m -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080 -display gtk,gl=on -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M -machine memory-backend=mem1 Replacing -display gtk,gl=on with -display gtk,gl=off above would exercise the mmap handler. While I think the VM_PFNMAP approach is much better and should fix that issue at hand, I thought more about missing memlock support and realized that we might have to fix something else. So I'm going to raise the issue here. I think udmabuf chose the wrong interface to do what it's doing, that makes it harder to fix it eventually. Instead of accepting a range in a memfd, it should just have accepted a user space address range and then used pin_user_pages(FOLL_WRITE|FOLL_LONGTERM) to longterm-pin the pages "officially". Udmabuf indeed started off by using user space address range and GUP but the dma-buf subsystem maintainer had concerns with that approach in v2. It also had support for mlock in that version. Here is v2 and the relevant conversation: https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2 So what's the issue? Udmabuf effectively pins pages longterm ("possibly forever") simply by grabbing a reference on them. 
These pages might easily reside in ZONE_MOVABLE or in MIGRATE_CMA pageblocks. So what udmabuf does is break memory hotunplug and CMA, because it turns pages that have to remain movable unmovable. In the pin_user_pages(FOLL_LONGTERM) case we make sure to migrate these pages. See mm/gup.c:check_and_migrate_movable_pages() and especially folio_is_longterm_pinnable(). We'd probably have to implement something similar for udmabuf, where we detect such unpinnable pages and migrate them. The pages udmabuf pins are only those associated with Guest (GPU driver/virtio-gpu) resources (or buffers allocated and pinned from shmem via drm GEM). Some resources are short-lived, and some are long-lived and whenever a resource gets destroyed, the pages are unpinned. And, not all resources have their pages pinned. The resource that is pinned for the longest duration is the FB and that's because it is updated every ~16ms (assuming 1920x1080@60) by the Guest GPU driver. We can certainly pin/unpin the FB after it is accessed on the Host as a workaround, but I guess that may not be very efficient given the amount of churn it would create. Also, as far as migration or S3/S4 is concerned, my understanding is that all the Guest resources are destroyed and recreated again. So, wouldn't something similar happen during memory hotunplug? For example, pairing udmabuf with vfio (which pins pages using pin_user_pages(FOLL_LONGTERM)) in QEMU will most probably not work in all cases: if udmabuf longterm pinned the pages "the wrong way", vfio will fail to migrate them during FOLL_LONGTERM and consequently fail pin_user_pages(). As long as udmabuf holds a reference on these pages, that will never succeed. Dma-buf rules (for exporters) indicate that the pages only need to be pinned during the map_attachment phase (and until unmap attachment happens). In other words, only when the sg_table is created by udmabuf. 
I guess one option would be to not hold any references during UDMABUF_CREATE and only grab references to the pages (as and when they get used) during this step. Would this help? IIUC the refcount is needed, otherwise I don't see what would protect the page from being freed and even reused elsewhere before map_attachment(). It seems the previous concern on using gup was mainly fork(), if this is it: https://patchwork.freedesktop.org/patch/210992/?series=39879=2#comment_414213 Could it also be guarded by just making sure the pages are MAP_SHARED when creating the udmabuf, if fork() is a requirement of the feature? I had a feeling that the userspace still needs to always do the right thing to make it work, even using pure PFN mappings. For instance, what if the user app just punches a hole in the shmem/hugetlbfs file after creating the udmabuf (I see that F_SEAL_SHRINK is required, but at least not F_SEAL_WRITE with current impl), and faults a new page into the page cache? IIUC, Qemu creates and owns the memfd that is
Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
On 22.06.23 09:27, Vivek Kasireddy wrote: The first patch ensures that the mappings needed for handling mmap operation would be managed by using the pfn instead of struct page. The second patch restores support for mapping hugetlb pages where subpages of a hugepage are not directly used anymore (main reason for revert) and instead the hugetlb pages and the relevant offsets are used to populate the scatterlist for dma-buf export and for mmap operation. Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500 options were passed to the Host kernel and Qemu was launched with these relevant options: qemu-system-x86_64 -m 4096m -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080 -display gtk,gl=on -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M -machine memory-backend=mem1 Replacing -display gtk,gl=on with -display gtk,gl=off above would exercise the mmap handler. While I think the VM_PFNMAP approach is much better and should fix that issue at hand, I thought more about missing memlock support and realized that we might have to fix something else. So I'm going to raise the issue here. I think udmabuf chose the wrong interface to do what it's doing, which makes it harder to fix it eventually. Instead of accepting a range in a memfd, it should just have accepted a user space address range and then used pin_user_pages(FOLL_WRITE|FOLL_LONGTERM) to longterm-pin the pages "officially". So what's the issue? udmabuf effectively pins pages longterm ("possibly forever") simply by grabbing a reference on them. These pages might easily reside in ZONE_MOVABLE or in MIGRATE_CMA pageblocks. So what udmabuf does is break memory hotunplug and CMA, because it turns pages that have to remain movable unmovable. In the pin_user_pages(FOLL_LONGTERM) case we make sure to migrate these pages. See mm/gup.c:check_and_migrate_movable_pages() and especially folio_is_longterm_pinnable(). 
We'd probably have to implement something similar for udmabuf, where we detect such unpinnable pages and migrate them. For example, pairing udmabuf with vfio (which pins pages using pin_user_pages(FOLL_LONGTERM)) in QEMU will most probably not work in all cases: if udmabuf longterm pinned the pages "the wrong way", vfio will fail to migrate them during FOLL_LONGTERM and consequently fail pin_user_pages(). As long as udmabuf holds a reference on these pages, that will never succeed. There are *probably* more issues on the QEMU side when udmabuf is paired with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for virtio-balloon, virtio-mem, postcopy live migration, ... for example, in the vfio/vdpa case we make sure that we disallow most of these, because otherwise there can be an accidental "disconnect" between the pages mapped into the VM (guest view) and the pages mapped into the IOMMU (device view), for example, after a reboot. -- Cheers, David / dhildenb
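For reference, the FOLL_LONGTERM handling David points at can be sketched roughly as follows. This is simplified pseudocode of the mm/gup.c flow, not the literal kernel code; pin_pages() here stands in for the real pinning internals:

```c
/* Simplified pseudocode of the FOLL_LONGTERM slow path in mm/gup.c:
 * pin, then check each page with folio_is_longterm_pinnable(); if any
 * page lives in ZONE_MOVABLE or a MIGRATE_CMA pageblock, everything is
 * unpinned, those pages are migrated to pinnable memory, and the pin
 * is retried. */
long pin_longterm(unsigned long start, unsigned long nr, struct page **pages)
{
	long rc;

	do {
		rc = pin_pages(start, nr, FOLL_PIN | FOLL_LONGTERM, pages);
		if (rc <= 0)
			return rc;
		/* Returns 0 if all pages are pinnable, -EAGAIN after it
		 * unpinned everything and kicked off migration, or
		 * another error on migration failure. */
		rc = check_and_migrate_movable_pages(nr, pages);
	} while (rc == -EAGAIN);

	return rc ? rc : nr;
}
```

udmabuf bypasses this entirely by taking plain references, which is why its pins can strand movable pages.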
Re: [PATCH] udmabuf: revert 'Add support for mapping hugepages (v4)'
Skimming over the shmem_read_mapping_page() users, I assume most of them use a VM_PFNMAP mapping (or don't mmap them at all), where we won't be messing with the struct page at all. (That might even allow you to mmap hugetlb sub-pages, because the struct page -- and mapcount -- will be ignored completely and not touched.) Oh, are you suggesting that if we do vma->vm_flags |= VM_PFNMAP in the mmap handler (mmap_udmabuf) and also do vmf_insert_pfn(vma, vmf->address, page_to_pfn(page)) instead of vmf->page = ubuf->pages[pgoff]; get_page(vmf->page); in the vma fault handler (udmabuf_vm_fault), we can avoid most of the pitfalls you have identified -- including with the usage of hugetlb subpages? Yes, that's my thinking, but I have to do my homework first to see if that would really work for hugetlb. The thing is, I kind-of consider what udmabuf does a layer violation: we have a filesystem (shmem/hugetlb) that should handle mappings to user space. Yet, a driver decides to bypass that and simply map the pages ordinarily to user space. (revealed by the fact that hugetlb never maps sub-pages but udmabuf decides to do so) In an ideal world everybody would simply mmap() the original memfd, but thinking about offset+size configuration within the memfd that might not always be desirable. As a workaround, we could mmap() only the PFNs, leaving the struct page unaffected. I'll have to look closer into that. -- Cheers, David / dhildenb
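Put together, the change being discussed could look roughly like the following pseudocode sketch (hypothetical, not an actual patch; udmabuf_vm_ops is assumed to point at the fault handler):

```c
/* Pseudocode sketch of the VM_PFNMAP idea: map raw PFNs so the core mm
 * never touches the struct page, its refcount or its mapcount -- which
 * is what would make even hugetlb sub-pages mappable. */
static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
{
	struct udmabuf *ubuf = vmf->vma->vm_private_data;

	if (vmf->pgoff >= ubuf->pagecount)
		return VM_FAULT_SIGBUS;

	/* Insert a bare PFN instead of vmf->page + get_page(). */
	return vmf_insert_pfn(vmf->vma, vmf->address,
			      page_to_pfn(ubuf->pages[vmf->pgoff]));
}

static int mmap_udmabuf(struct udmabuf *ubuf, struct vm_area_struct *vma)
{
	vma->vm_ops = &udmabuf_vm_ops;	/* ops containing the handler above */
	vma->vm_private_data = ubuf;
	vma->vm_flags |= VM_PFNMAP;	/* special mapping: no struct page */
	return 0;
}
```

With VM_PFNMAP set, fork(), rmap and mapcount accounting all leave the backing pages alone, sidestepping the sub-page pitfalls discussed above.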
Re: [PATCH] udmabuf: revert 'Add support for mapping hugepages (v4)'
On 13.06.23 10:26, Kasireddy, Vivek wrote: Hi David, On 12.06.23 09:10, Kasireddy, Vivek wrote: Hi Mike, Hi Vivek, Sorry for the late reply; I just got back from vacation. If it is unsafe to directly use the subpages of a hugetlb page, then reverting this patch seems like the only option for addressing this issue immediately. So, this patch is Acked-by: Vivek Kasireddy As far as the use-case is concerned, there are two main users of the udmabuf driver: Qemu and CrosVM VMMs. However, it appears Qemu is the only one that uses hugetlb pages (when hugetlb=on is set) as the backing store for Guest (Linux, Android and Windows) system memory. The main goal is to share the pages associated with the Guest allocated framebuffer (FB) with the Host GPU driver and other components in a zero-copy way. To that end, the guest GPU driver (virtio-gpu) allocates 4k size pages (associated with the FB) and pins them before sharing the (guest) physical (or dma) addresses (and lengths) with Qemu. Qemu then translates the addresses into file offsets and shares these offsets with udmabuf. Is my understanding correct, that we can effectively long-term pin (worse than mlock) 64 MiB per UDMABUF_CREATE, allowing eventually !root The 64 MiB limit is the theoretical upper bound that we have not seen hit in practice. Typically, for a 1920x1080 resolution (commonly used in Guests), the size of the FB is ~8 MB (1920x1080x4). And, most modern Graphics compositors flip between two FBs. Okay, but users with privileges to open that file can just create as many as they want? I think I'll have to play with it. users ll /dev/udmabuf crw-rw 1 root kvm 10, 125 12. Jun 08:12 /dev/udmabuf to bypass their effective MEMLOCK limit, fragmenting physical memory and breaking swap? Right, it does not look like the mlock limits are honored. That should be added. 
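Honoring the MEMLOCK limit could plausibly look something like the following pseudocode sketch (an assumption on my part, not a proposed patch), using the existing account_locked_vm() helper from mm/util.c the way vfio-style drivers charge pinned pages:

```c
/* Pseudocode sketch: charge the pages a udmabuf pins against the
 * creating process's RLIMIT_MEMLOCK, so unprivileged users cannot pin
 * unbounded amounts of memory. Field names here are hypothetical. */
static int udmabuf_account_pages(struct udmabuf *ubuf, unsigned long npages)
{
	int ret;

	/* Charges npages against current->mm->locked_vm and fails with
	 * -ENOMEM once RLIMIT_MEMLOCK would be exceeded (callers with
	 * CAP_IPC_LOCK are exempt). */
	ret = account_locked_vm(current->mm, npages, true);
	if (ret)
		return ret;
	ubuf->acct_pages = npages;	/* undo via account_locked_vm(mm,
					 * npages, false) on destroy */
	return 0;
}
```

The bookkeeping would need to survive until buffer teardown, which is why the page count is remembered on the udmabuf itself.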
Regarding the udmabuf_vm_fault(), I assume we're mapping pages we obtained from the memfd ourselves into a special VMA (mmap() of the mmap operation is really needed only if any component on the Host needs CPU access to the buffer. But in most scenarios, we try to ensure direct GPU access (h/w acceleration via gl) to these pages. udmabuf). I'm not sure how well shmem pages are prepared for getting mapped by someone else into an arbitrary VMA (page->index?). Most drm/gpu drivers use shmem pages as the backing store for FBs and other buffers and also provide mmap capability. What concerns do you see with this approach? Are these mmapping the pages the way udmabuf maps these pages (IOW, on-demand fault where the core-mm will adjust the mapcount etc)? Skimming over the shmem_read_mapping_page() users, I assume most of them use a VM_PFNMAP mapping (or don't mmap them at all), where we won't be messing with the struct page at all. (That might even allow you to mmap hugetlb sub-pages, because the struct page -- and mapcount -- will be ignored completely and not touched.) ... also, just imagine someone doing FALLOC_FL_PUNCH_HOLE / ftruncate() on the memfd. What's mapped into the memfd no longer corresponds to what's pinned / mapped into the VMA. IIUC, making use of the DMA_BUF_IOCTL_SYNC ioctl would help with any coherency issues: https://www.kernel.org/doc/html/v6.2/driver-api/dma-buf.html#c.dma_buf_sync Would it as of now? udmabuf_create() pulls the shmem pages out of the memfd, not sure how DMA_BUF_IOCTL_SYNC would help to update that whenever the pages inside the memfd would change (e.g., FALLOC_FL_PUNCH_HOLE + realloc). But that's most probably simply "not supported". Was linux-mm (and especially shmem maintainers, ccing Hugh) involved in the upstreaming of udmabuf? It does not appear so from the link below although other key lists were cc'd: https://patchwork.freedesktop.org/patch/246100/?series=39879=7 That's unfortunate :( -- Cheers, David / dhildenb
Re: [PATCH] udmabuf: revert 'Add support for mapping hugepages (v4)'
On 12.06.23 09:10, Kasireddy, Vivek wrote: Hi Mike, Hi Vivek, Sorry for the late reply; I just got back from vacation. If it is unsafe to directly use the subpages of a hugetlb page, then reverting this patch seems like the only option for addressing this issue immediately. So, this patch is Acked-by: Vivek Kasireddy As far as the use-case is concerned, there are two main users of the udmabuf driver: Qemu and CrosVM VMMs. However, it appears Qemu is the only one that uses hugetlb pages (when hugetlb=on is set) as the backing store for Guest (Linux, Android and Windows) system memory. The main goal is to share the pages associated with the Guest allocated framebuffer (FB) with the Host GPU driver and other components in a zero-copy way. To that end, the guest GPU driver (virtio-gpu) allocates 4k size pages (associated with the FB) and pins them before sharing the (guest) physical (or dma) addresses (and lengths) with Qemu. Qemu then translates the addresses into file offsets and shares these offsets with udmabuf. Is my understanding correct, that we can effectively long-term pin (worse than mlock) 64 MiB per UDMABUF_CREATE, allowing eventually !root users ll /dev/udmabuf crw-rw 1 root kvm 10, 125 12. Jun 08:12 /dev/udmabuf to bypass their effective MEMLOCK limit, fragmenting physical memory and breaking swap? Regarding the udmabuf_vm_fault(), I assume we're mapping pages we obtained from the memfd ourselves into a special VMA (mmap() of the udmabuf). I'm not sure how well shmem pages are prepared for getting mapped by someone else into an arbitrary VMA (page->index?). ... also, just imagine someone doing FALLOC_FL_PUNCH_HOLE / ftruncate() on the memfd. What's mapped into the memfd no longer corresponds to what's pinned / mapped into the VMA. Was linux-mm (and especially shmem maintainers, ccing Hugh) involved in the upstreaming of udmabuf? -- Cheers, David / dhildenb
Re: [PATCH] mm: fix hugetlb page unmap count balance issue
On 17.05.23 00:34, Mike Kravetz wrote: On 05/15/23 10:04, Mike Kravetz wrote: On 05/12/23 16:29, Mike Kravetz wrote: On 05/12/23 14:26, James Houghton wrote: On Fri, May 12, 2023 at 12:20 AM Junxiao Chang wrote: This alone doesn't fix mapcounting for PTE-mapped HugeTLB pages. You need something like [1]. I can resend it if that's what we should be doing, but this mapcounting scheme doesn't work when the page structs have been freed. It seems like it was a mistake to include support for hugetlb memfds in udmabuf. IIUC, it was added with commit 16c243e99d33 ("udmabuf: Add support for mapping hugepages (v4)"). Looks like it was never sent to linux-mm? That is unfortunate as hugetlb vmemmap freeing went in at about the same time. And, as you have noted, udmabuf will not work if hugetlb vmemmap freeing is enabled. Sigh! Trying to think of a way forward. -- Mike Kravetz [1]: https://lore.kernel.org/linux-mm/20230306230004.1387007-2-jthough...@google.com/ - James Adding people and list on Cc: involved with commit 16c243e99d33. There are several issues with trying to map tail pages of hugetlb pages not taken into account with udmabuf. James spent quite a bit of time trying to understand and address all the issues with the HGM code. While using the scheme proposed by James may be an approach to the mapcount issue, there are also other issues that need attention. For example, I do not see how the fault code checks the state of the hugetlb page (such as poison) as none of that state is carried in tail pages. The more I think about it, the more I think udmabuf should treat hugetlb pages as hugetlb pages. They should be mapped at the appropriate level in the page table. Of course, this would impose new restrictions on the API (mmap and ioctl) that may break existing users. I have no idea how extensively udmabuf is being used with hugetlb mappings. 
Verified that using udmabuf on a hugetlb mapping with vmemmap optimization will BUG as: [14106.812312] BUG: unable to handle page fault for address: ea000a7c4030 [14106.813704] #PF: supervisor write access in kernel mode [14106.814791] #PF: error_code(0x0003) - permissions violation [14106.815921] PGD 27fff9067 P4D 27fff9067 PUD 27fff8067 PMD 17ec34067 PTE 800285dab021 [14106.818489] Oops: 0003 [#1] PREEMPT SMP PTI [14106.819345] CPU: 2 PID: 2313 Comm: udmabuf Not tainted 6.4.0-rc1-next-20230508+ #44 [14106.820906] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014 [14106.822679] RIP: 0010:page_add_file_rmap+0x2e/0x270 I started looking more closely at the driver and I do not fully understand the usage model. I took clues from the selftest and driver. It seems the first step is to create a buffer via the UDMABUF_CREATE ioctl. This will copy 4K pages from the page cache to an array associated with a file. I did note that hugetlb and shm behavior is different here as the driver can not add missing hugetlb pages to the cache as it does with shm. However, what seems more concerning is that there is nothing to prevent the pages from being replaced in the cache before being added to a udmabuf mapping. This means the udmabuf mapping and the original memfd could be operating on a different set of pages. Is this acceptable, or somehow prevented? In my role, I am more interested in udmabuf handling of hugetlb pages. Trying to use individual 4K pages of hugetlb pages is something that should be avoided here. Would it be acceptable to change code so that only whole hugetlb pages are used by udmabuf? If not, then perhaps the existing hugetlb support can be removed? I'm wondering if that VMA shouldn't be some kind of special mapping (VM_PFNMAP), such that the struct page is entirely ignored? I'm quite confused and concerned when I read that code (what the hell is it doing with shmem/hugetlb pages? why does the mapcount even get adjusted?) 
This all has a bad smell to it, I hope I'm missing something important ... -- Cheers, David / dhildenb
Re: [PATCH v5 1/6] mm/gup: remove unused vmas parameter from get_user_pages()
On 16.05.23 16:30, Sean Christopherson wrote: On Tue, May 16, 2023, David Hildenbrand wrote: On 15.05.23 21:07, Sean Christopherson wrote: On Sun, May 14, 2023, Lorenzo Stoakes wrote: diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index cb5c13eee193..eaa5bb8dbadc 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2477,7 +2477,7 @@ static inline int check_user_page_hwpoison(unsigned long addr) { int rc, flags = FOLL_HWPOISON | FOLL_WRITE; - rc = get_user_pages(addr, 1, flags, NULL, NULL); + rc = get_user_pages(addr, 1, flags, NULL); return rc == -EHWPOISON; Unrelated to this patch, I think there's a pre-existing bug here. If gup() returns a valid page, KVM will leak the refcount and unintentionally pin the page. That's When passing NULL as "pages" to get_user_pages(), __get_user_pages_locked() won't set FOLL_GET. As FOLL_PIN is also not set, we won't be messing with the mapcount of the page. For completeness: s/mapcount/refcount/ :) Ah, that's what I'm missing. -- Thanks, David / dhildenb
Re: [PATCH v5 1/6] mm/gup: remove unused vmas parameter from get_user_pages()
On 15.05.23 21:07, Sean Christopherson wrote: On Sun, May 14, 2023, Lorenzo Stoakes wrote: No invocation of get_user_pages() uses the vmas parameter, so remove it. The GUP API is confusing and caveated. Recent changes have done much to improve that, however there is more we can do. Exporting vmas is a prime target as the caller has to be extremely careful to preclude their use after the mmap_lock has expired or otherwise be left with dangling pointers. Removing the vmas parameter focuses the GUP functions upon their primary purpose - pinning (and outputting) pages as well as performing the actions implied by the input flags. This is part of a patch series aiming to remove the vmas parameter altogether. Suggested-by: Matthew Wilcox (Oracle) Acked-by: Greg Kroah-Hartman Acked-by: David Hildenbrand Reviewed-by: Jason Gunthorpe Acked-by: Christian König (for radeon parts) Acked-by: Jarkko Sakkinen Signed-off-by: Lorenzo Stoakes --- arch/x86/kernel/cpu/sgx/ioctl.c | 2 +- drivers/gpu/drm/radeon/radeon_ttm.c | 2 +- drivers/misc/sgi-gru/grufault.c | 2 +- include/linux/mm.h | 3 +-- mm/gup.c | 9 +++------ mm/gup_test.c | 5 ++--- virt/kvm/kvm_main.c | 2 +- 7 files changed, 10 insertions(+), 15 deletions(-) Acked-by: Sean Christopherson (KVM) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index cb5c13eee193..eaa5bb8dbadc 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2477,7 +2477,7 @@ static inline int check_user_page_hwpoison(unsigned long addr) { int rc, flags = FOLL_HWPOISON | FOLL_WRITE; - rc = get_user_pages(addr, 1, flags, NULL, NULL); + rc = get_user_pages(addr, 1, flags, NULL); return rc == -EHWPOISON; Unrelated to this patch, I think there's a pre-existing bug here. If gup() returns a valid page, KVM will leak the refcount and unintentionally pin the page. That's When passing NULL as "pages" to get_user_pages(), __get_user_pages_locked() won't set FOLL_GET. As FOLL_PIN is also not set, we won't be messing with the mapcount of the page. 
So even if get_user_pages() returns "1", we should be fine. Or am I misunderstanding your concern? At least hva_to_pfn_slow() most certainly didn't return "1" if we end up calling check_user_page_hwpoison(), so nothing would have been pinned there as well. -- Thanks, David / dhildenb
Re: [PATCH v3 1/7] mm/gup: remove unused vmas parameter from get_user_pages()
On 15.04.23 14:09, Lorenzo Stoakes wrote: No invocation of get_user_pages() uses the vmas parameter, so remove it. The GUP API is confusing and caveated. Recent changes have done much to improve that, however there is more we can do. Exporting vmas is a prime target as the caller has to be extremely careful to preclude their use after the mmap_lock has expired or otherwise be left with dangling pointers. Removing the vmas parameter focuses the GUP functions upon their primary purpose - pinning (and outputting) pages as well as performing the actions implied by the input flags. This is part of a patch series aiming to remove the vmas parameter altogether. Signed-off-by: Lorenzo Stoakes Suggested-by: Matthew Wilcox (Oracle) Acked-by: Greg Kroah-Hartman --- Acked-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH mm-unstable v1 16/20] mm/frame-vector: remove FOLL_FORCE usage
On 29.11.22 10:08, Hans Verkuil wrote: On 29/11/2022 09:48, David Hildenbrand wrote: On 28.11.22 23:59, Andrew Morton wrote: On Mon, 28 Nov 2022 09:18:47 +0100 David Hildenbrand wrote: Less chances of things going wrong that way. Just mention in the v2 cover letter that the first patch was added to make it easy to backport that fix without being hampered by merge conflicts if it was added after your frame_vector.c patch. Yes, that's the way I would naturally do it; however, Andrew prefers delta updates for minor changes. @Andrew, whatever you prefer! I'm inclined to let things sit as they are. Cross-tree conflicts happen, and Linus handles them. I'll flag this (very simple) conflict in the pull request, if MM merges second. If v4l merges second then hopefully they will do the same. But this one is so simple that Linus hardly needs our help. It's not about cross-tree conflicts, it's about the fact that my patch is a fix that needs to be backported to older kernels. It should apply cleanly to those older kernels if my patch goes in first, but if it is the other way around I would have to make a new patch for the stable kernels. IIUC, the conflict will be resolved at merge time and the merge resolution will be part of the merge commit. It doesn't matter in which order the patches go upstream, the merge commit resolves the problematic overlap. So your patch will be upstream as intended, where it can be cleanly backported. Hope I am not twisting reality ;) -- Thanks, David / dhildenb
Re: [PATCH mm-unstable v1 16/20] mm/frame-vector: remove FOLL_FORCE usage
On 28.11.22 23:59, Andrew Morton wrote: On Mon, 28 Nov 2022 09:18:47 +0100 David Hildenbrand wrote: Less chances of things going wrong that way. Just mention in the v2 cover letter that the first patch was added to make it easy to backport that fix without being hampered by merge conflicts if it was added after your frame_vector.c patch. Yes, that's the way I would naturally do it; however, Andrew prefers delta updates for minor changes. @Andrew, whatever you prefer! I'm inclined to let things sit as they are. Cross-tree conflicts happen, and Linus handles them. I'll flag this (very simple) conflict in the pull request, if MM merges second. If v4l merges second then hopefully they will do the same. But this one is so simple that Linus hardly needs our help. But Linus won't be editing changelogs so that the changelog makes more sense after both trees are joined. I'm inclined to let the changelog sit as it is as well. Works for me. Thanks Andrew! -- Thanks, David / dhildenb
Re: [PATCH mm-unstable v1 16/20] mm/frame-vector: remove FOLL_FORCE usage
On 28.11.22 09:17, Hans Verkuil wrote: Hi David, On 27/11/2022 11:35, David Hildenbrand wrote: On 16.11.22 11:26, David Hildenbrand wrote: FOLL_FORCE is really only for ptrace access. According to commit 707947247e95 ("media: videobuf2-vmalloc: get_userptr: buffers are always writable"), get_vaddr_frames() currently pins all pages writable as a workaround for issues with read-only buffers. FOLL_FORCE, however, seems to be a legacy leftover as it predates commit 707947247e95 ("media: videobuf2-vmalloc: get_userptr: buffers are always writable"). Let's just remove it. Once the read-only buffer issue has been resolved, FOLL_WRITE could again be set depending on the DMA direction. Cc: Hans Verkuil Cc: Marek Szyprowski Cc: Tomasz Figa Cc: Marek Szyprowski Cc: Mauro Carvalho Chehab Signed-off-by: David Hildenbrand --- drivers/media/common/videobuf2/frame_vector.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/media/common/videobuf2/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c index 542dde9d2609..062e98148c53 100644 --- a/drivers/media/common/videobuf2/frame_vector.c +++ b/drivers/media/common/videobuf2/frame_vector.c @@ -50,7 +50,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames, start = untagged_addr(start); ret = pin_user_pages_fast(start, nr_frames, - FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM, + FOLL_WRITE | FOLL_LONGTERM, (struct page **)(vec->ptrs)); if (ret > 0) { vec->got_ref = true; Hi Andrew, see the discussion at [1] regarding a conflict and how to proceed with upstreaming. The conflict would be easy to resolve, however, also the patch description doesn't make sense anymore with [1]. Might it be easier and less confusing if you post a v2 of this series with my patch first? That way it is clear that 1) my patch has to come first, and 2) that it is part of a single series and should be merged by the mm subsystem. Less chances of things going wrong that way. 
Just mention in the v2 cover letter that the first patch was added to make it easy to backport that fix without being hampered by merge conflicts if it was added after your frame_vector.c patch. Yes, that's the way I would naturally do it; however, Andrew prefers delta updates for minor changes. @Andrew, whatever you prefer! Thanks! -- Thanks, David / dhildenb
Re: [PATCH mm-unstable v1 16/20] mm/frame-vector: remove FOLL_FORCE usage
On 16.11.22 11:26, David Hildenbrand wrote: FOLL_FORCE is really only for ptrace access. According to commit 707947247e95 ("media: videobuf2-vmalloc: get_userptr: buffers are always writable"), get_vaddr_frames() currently pins all pages writable as a workaround for issues with read-only buffers. FOLL_FORCE, however, seems to be a legacy leftover as it predates commit 707947247e95 ("media: videobuf2-vmalloc: get_userptr: buffers are always writable"). Let's just remove it. Once the read-only buffer issue has been resolved, FOLL_WRITE could again be set depending on the DMA direction. Cc: Hans Verkuil Cc: Marek Szyprowski Cc: Tomasz Figa Cc: Marek Szyprowski Cc: Mauro Carvalho Chehab Signed-off-by: David Hildenbrand --- drivers/media/common/videobuf2/frame_vector.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/media/common/videobuf2/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c index 542dde9d2609..062e98148c53 100644 --- a/drivers/media/common/videobuf2/frame_vector.c +++ b/drivers/media/common/videobuf2/frame_vector.c @@ -50,7 +50,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames, start = untagged_addr(start); ret = pin_user_pages_fast(start, nr_frames, - FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM, + FOLL_WRITE | FOLL_LONGTERM, (struct page **)(vec->ptrs)); if (ret > 0) { vec->got_ref = true; Hi Andrew, see the discussion at [1] regarding a conflict and how to proceed with upstreaming. The conflict would be easy to resolve, however, also the patch description doesn't make sense anymore with [1]. On top of mm-unstable, reverting this patch and applying [1] gives me an updated patch: From 1e66c25f1467c1f1e5f275312f2c6df29308d4df Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 16 Nov 2022 11:26:55 +0100 Subject: [PATCH] mm/frame-vector: remove FOLL_FORCE usage GUP now supports reliable R/O long-term pinning in COW mappings, such that we break COW early. 
MAP_SHARED VMAs only use the shared zeropage so far in one corner case (DAXFS file with holes), which can be ignored because GUP does not support long-term pinning in fsdax (see check_vma_flags()). Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop using FOLL_FORCE, which is really only for ptrace access. Reviewed-by: Daniel Vetter Acked-by: Hans Verkuil Cc: Hans Verkuil Cc: Marek Szyprowski Cc: Tomasz Figa Cc: Marek Szyprowski Cc: Mauro Carvalho Chehab Signed-off-by: David Hildenbrand --- drivers/media/common/videobuf2/frame_vector.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/media/common/videobuf2/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c index aad72640f055..8606fdacf5b8 100644 --- a/drivers/media/common/videobuf2/frame_vector.c +++ b/drivers/media/common/videobuf2/frame_vector.c @@ -41,7 +41,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames, bool write, int ret_pin_user_pages_fast = 0; int ret = 0; int err; - unsigned int gup_flags = FOLL_FORCE | FOLL_LONGTERM; + unsigned int gup_flags = FOLL_LONGTERM; if (nr_frames == 0) return 0; -- 2.38.1 Please let me know how you want to proceed. Ideally, you'd pick up [1] and apply this updated patch. Also, please tell me if I should send this updated patch in a separate mail (e.g., as reply to this mail). [1] https://lkml.kernel.org/r/71bdd3cf-b044-3f12-df58-7c16d5749...@xs4all.nl -- Thanks, David / dhildenb
Re: [PATCH RFC 16/19] mm/frame-vector: remove FOLL_FORCE usage
On 22.11.22 15:07, Hans Verkuil wrote: On 11/22/22 13:38, David Hildenbrand wrote: On 22.11.22 13:25, Hans Verkuil wrote: Hi Tomasz, David, On 11/8/22 05:45, Tomasz Figa wrote: Hi David, On Tue, Nov 8, 2022 at 1:19 AM David Hildenbrand wrote: FOLL_FORCE is really only for debugger access. According to commit 707947247e95 ("media: videobuf2-vmalloc: get_userptr: buffers are always writable"), the pinned pages are always writable. Actually that patch is only a workaround to temporarily disable support for read-only pages as they seemed to suffer from some corruption issues in the retrieved user pages. We expect to support read-only pages as hardware input after. That said, FOLL_FORCE doesn't sound like the right thing even in that case, but I don't know the background behind it being added here in the first place. +Hans Verkuil +Marek Szyprowski do you happen to remember anything about it? I tracked the use of 'force' all the way back to the first git commit (2.6.12-rc1) in the very old video-buf.c. So it is very, very old and the reason is lost in the mists of time. I'm not sure if the 'force' argument of get_user_pages() at that time even meant the same as FOLL_FORCE today. From what I can tell it has just been faithfully used ever since, but I have my doubt that anyone understands the reason behind it since it was never explained. Looking at this old LWN article https://lwn.net/Articles/28548/ suggests that it might be related to calling get_user_pages for write buffers (non-zero write argument) where you also want to be able to read from the buffer. That is certainly something that some drivers need to do post-capture fixups. But 'force' was also always set for read buffers, and I don't know if that was something that was actually needed, or just laziness. I assume that removing FOLL_FORCE from 'FOLL_FORCE|FOLL_WRITE' will still allow drivers to read from the buffer? Yes. 
The only problematic corner case I can imagine is if someone has a VMA without write permissions (no PROT_WRITE/VM_WRITE) and wants to pin user space pages as a read buffer. We'd now specify FOLL_WRITE without FOLL_FORCE and GUP would reject that: write access without write permissions is invalid. I do not believe this will be an issue. There would be no way around "fixing" this implementation to not specify FOLL_WRITE when only reading from user-space pages. Not sure what the implications are regarding that corruption that was mentioned in 707947247e95. Before 707947247e95 the FOLL_WRITE flag was only set for write buffers (i.e. video capture, DMA_FROM_DEVICE), not for read buffers (video output, DMA_TO_DEVICE). In the video output case there should never be any need for drivers to write to the buffer to the best of my knowledge. But I have had some complaints about that commit that it causes problems in some scenarios, and it has been on my todo list for quite some time now to dig deeper into this. I probably should prioritize this for this or next week. Having said that, I assume such a scenario is unlikely -- but you might know better how user space usually uses this interface. There would be three options: 1) Leave the FOLL_FORCE hack in for now, which I *really* want to avoid. 2) Remove FOLL_FORCE and see if anybody even notices (this patch) and leave the implementation as is for now. 3) Remove FOLL_FORCE and fix up the implementation to only specify FOLL_WRITE if the pages will actually get written to. 3) would most probably be ideal; however, I am no expert on that code and can't do it (707947247e95 confuses me). So naive me would go with 2) first. Option 3 would be best. And 707947247e95 confuses me as well, and I actually wrote it :-) I am wondering whether it was addressed at the right level, but as I said, I need to dig a bit deeper into this. Cool, let me know if I can help! -- Thanks, David / dhildenb
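Option 3) above boils down to deriving the GUP flags from the DMA direction instead of hard-coding FOLL_FORCE | FOLL_WRITE. A minimal userspace sketch of that flag selection follows; the FOLL_* values are illustrative stand-ins (the real ones live in include/linux/mm.h), and the helper name is hypothetical — frame_vector would pass the result to pin_user_pages_fast():

```c
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's FOLL_* flag values. */
#define FOLL_WRITE    0x01u
#define FOLL_LONGTERM 0x10000u

/*
 * Option 3: request write access only when the device will actually
 * write to the buffer (capture, DMA_FROM_DEVICE). Read buffers
 * (output, DMA_TO_DEVICE) get pinned R/O, so neither FOLL_WRITE nor
 * FOLL_FORCE is needed for them.
 */
static unsigned int vb2_gup_flags(bool device_writes_to_buffer)
{
	unsigned int gup_flags = FOLL_LONGTERM;

	if (device_writes_to_buffer)
		gup_flags |= FOLL_WRITE;
	return gup_flags;
}
```

With this shape, a read-only userspace mapping can back an output buffer again, because GUP is no longer asked for write access it cannot grant.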
Re: [PATCH RFC 16/19] mm/frame-vector: remove FOLL_FORCE usage
On 22.11.22 13:25, Hans Verkuil wrote: Hi Tomasz, David, On 11/8/22 05:45, Tomasz Figa wrote: Hi David, On Tue, Nov 8, 2022 at 1:19 AM David Hildenbrand wrote: FOLL_FORCE is really only for debugger access. According to commit 707947247e95 ("media: videobuf2-vmalloc: get_userptr: buffers are always writable"), the pinned pages are always writable. Actually that patch is only a workaround to temporarily disable support for read-only pages as they seemed to suffer from some corruption issues in the retrieved user pages. We expect to support read-only pages as hardware input afterwards. That said, FOLL_FORCE doesn't sound like the right thing even in that case, but I don't know the background behind it being added here in the first place. +Hans Verkuil +Marek Szyprowski do you happen to remember anything about it? I tracked the use of 'force' all the way back to the first git commit (2.6.12-rc1) in the very old video-buf.c. So it is very, very old and the reason is lost in the mists of time. I'm not sure if the 'force' argument of get_user_pages() at that time even meant the same as FOLL_FORCE today. From what I can tell it has just been faithfully used ever since, but I have my doubts that anyone understands the reason behind it since it was never explained. Looking at this old LWN article https://lwn.net/Articles/28548/ suggests that it might be related to calling get_user_pages for write buffers (non-zero write argument) where you also want to be able to read from the buffer. That is certainly something that some drivers need to do for post-capture fixups. But 'force' was also always set for read buffers, and I don't know if that was something that was actually needed, or just laziness. I assume that removing FOLL_FORCE from 'FOLL_FORCE|FOLL_WRITE' will still allow drivers to read from the buffer? Yes. The only problematic corner case I can imagine is if someone has a VMA without write permissions (no PROT_WRITE/VM_WRITE) and wants to pin user space pages as a read buffer.
We'd now specify FOLL_WRITE without FOLL_FORCE and GUP would reject that: write access without write permissions is invalid. There would be no way around "fixing" this implementation to not specify FOLL_WRITE when only reading from user-space pages. Not sure what the implications are regarding that corruption that was mentioned in 707947247e95. Having said that, I assume such a scenario is unlikely -- but you might know better how user space usually uses this interface. There would be three options: 1) Leave the FOLL_FORCE hack in for now, which I *really* want to avoid. 2) Remove FOLL_FORCE and see if anybody even notices (this patch) and leave the implementation as is for now. 3) Remove FOLL_FORCE and fix up the implementation to only specify FOLL_WRITE if the pages will actually get written to. 3) would most probably be ideal; however, I am no expert on that code and can't do it (707947247e95 confuses me). So naive me would go with 2) first. -- Thanks, David / dhildenb
Re: [PATCH mm-unstable v1 20/20] mm: rename FOLL_FORCE to FOLL_PTRACE
On 16.11.22 19:16, Linus Torvalds wrote: On Wed, Nov 16, 2022 at 2:30 AM David Hildenbrand wrote: Let's make it clearer that functionality provided by FOLL_FORCE is really only for ptrace access. I'm not super-happy about this one. I do understand the "let's rename the bit so that no new user shows up". And it's true that the main traditional use is ptrace. But from the patch itself it becomes obvious that no, it's not *just* ptrace. At least not yet. It's used for get_arg_page(), which uses it to basically look up (and install) pages in the newly created VM. Now, I'm not entirely sure why it even uses FOLL_FORCE - I think it might be historical, because the target should always be the new stack vma. Following the history of it is a bit of a mess, because there's a number of renamings and re-organizations, but it seems to go back to 2007 and commit b6a2fea39318 ("mm: variable length argument support"). Right. Before that commit, we kept our own array of "this is the set of pages that I will install in the new VM". That commit basically just inserts the pages directly into the VM instead, getting rid of the array size limitation. So at a minimum, I think that FOLL_FORCE would need to be removed before any renaming to FOLL_PTRACE, because that's not some kind of small random case. It *might* be as simple as just removing it, but maybe there's some reason for having it that I don't immediately see. Right, I have the same feeling. It might just be a copy-and-paste legacy leftover. There _are_ also small random cases too, like get_cmdline(). Maybe that counts as ptrace, but the execve() case most definitely does not. I agree. I'd suggest moving forward without this (last) patch for now and figuring out how to further clean up FOLL_FORCE usage on top. @Andrew, if you intend to put this into mm-unstable, please drop the last patch for now. -- Thanks, David / dhildenb
[PATCH mm-unstable v1 20/20] mm: rename FOLL_FORCE to FOLL_PTRACE
Let's make it clearer that functionality provided by FOLL_FORCE is really only for ptrace access. Prevent accidental re-use in drivers by renaming FOLL_FORCE to FOLL_PTRACE: git grep -l 'FOLL_FORCE' | xargs sed -i 's/FOLL_FORCE/FOLL_PTRACE/g' In the future, we might want to use a separate set of flags for the access_vm interface: most FOLL_* flags don't apply and we mostly only want to pass FOLL_PTRACE and FOLL_WRITE. Suggested-by: Christoph Hellwig Cc: Oleg Nesterov Cc: Richard Henderson Cc: Ivan Kokshaysky Cc: Matt Turner Cc: Catalin Marinas Cc: Will Deacon Cc: Thomas Bogendoerfer Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Christophe Leroy Cc: "David S. Miller" Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: "H. Peter Anvin" Cc: Richard Weinberger Cc: Anton Ivanov Cc: Johannes Berg Cc: Eric Biederman Cc: Kees Cook Cc: Alexander Viro Cc: Peter Zijlstra Cc: Arnaldo Carvalho de Melo Cc: Mark Rutland Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Namhyung Kim Cc: Mike Kravetz Cc: Muchun Song Cc: Kentaro Takeda Cc: Tetsuo Handa Cc: Paul Moore Cc: James Morris Cc: "Serge E. 
Hallyn" Signed-off-by: David Hildenbrand --- arch/alpha/kernel/ptrace.c| 6 +++--- arch/arm64/kernel/mte.c | 2 +- arch/ia64/kernel/ptrace.c | 10 +- arch/mips/kernel/ptrace32.c | 4 ++-- arch/mips/math-emu/dsemul.c | 2 +- arch/powerpc/kernel/ptrace/ptrace32.c | 4 ++-- arch/sparc/kernel/ptrace_32.c | 4 ++-- arch/sparc/kernel/ptrace_64.c | 8 arch/x86/kernel/step.c| 2 +- arch/x86/um/ptrace_32.c | 2 +- arch/x86/um/ptrace_64.c | 2 +- fs/exec.c | 2 +- fs/proc/base.c| 2 +- include/linux/mm.h| 8 kernel/events/uprobes.c | 4 ++-- kernel/ptrace.c | 12 ++-- mm/gup.c | 28 +-- mm/huge_memory.c | 8 mm/hugetlb.c | 2 +- mm/memory.c | 4 ++-- mm/util.c | 4 ++-- security/tomoyo/domain.c | 2 +- 22 files changed, 61 insertions(+), 61 deletions(-) diff --git a/arch/alpha/kernel/ptrace.c b/arch/alpha/kernel/ptrace.c index a1a239ea002d..55def6479ff2 100644 --- a/arch/alpha/kernel/ptrace.c +++ b/arch/alpha/kernel/ptrace.c @@ -158,7 +158,7 @@ static inline int read_int(struct task_struct *task, unsigned long addr, int * data) { int copied = access_process_vm(task, addr, data, sizeof(int), - FOLL_FORCE); + FOLL_PTRACE); return (copied == sizeof(int)) ? 0 : -EIO; } @@ -166,7 +166,7 @@ static inline int write_int(struct task_struct *task, unsigned long addr, int data) { int copied = access_process_vm(task, addr, &data, sizeof(int), - FOLL_FORCE | FOLL_WRITE); + FOLL_PTRACE | FOLL_WRITE); return (copied == sizeof(int)) ? 0 : -EIO; } @@ -284,7 +284,7 @@ long arch_ptrace(struct task_struct *child, long request, case PTRACE_PEEKTEXT: /* read word at location addr.
*/ case PTRACE_PEEKDATA: copied = ptrace_access_vm(child, addr, &tmp, sizeof(tmp), - FOLL_FORCE); + FOLL_PTRACE); ret = -EIO; if (copied != sizeof(tmp)) break; diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c index 7467217c1eaf..fa29fecaedbc 100644 --- a/arch/arm64/kernel/mte.c +++ b/arch/arm64/kernel/mte.c @@ -525,7 +525,7 @@ int mte_ptrace_copy_tags(struct task_struct *child, long request, int ret; struct iovec kiov; struct iovec __user *uiov = (void __user *)data; - unsigned int gup_flags = FOLL_FORCE; + unsigned int gup_flags = FOLL_PTRACE; if (!system_supports_mte()) return -EIO; diff --git a/arch/ia64/kernel/ptrace.c b/arch/ia64/kernel/ptrace.c index ab8aeb34d1d9..3781db1f506c 100644 --- a/arch/ia64/kernel/ptrace.c +++ b/arch/ia64/kernel/ptrace.c @@ -452,7 +452,7 @@ ia64_peek (struct task_struct *child, struct switch_stack *child_stack, return 0; } } - copied = access_process_vm(child, addr, &ret, sizeof(ret), FOLL_FORCE); + copied = access_process_vm(child, addr, &ret, sizeof(ret), FOLL_PTRACE); if (copied != sizeof(ret)) return -EIO; *val = ret; @@ -489,7 +489,7 @@ ia64_poke (struct task_struct *child, struct switch_stack *child_stack, } } } else if (access_process_vm(child, addr, &val, sizeof(val), - FOLL_FORCE | FOLL_WRITE) +
[PATCH mm-unstable v1 19/20] habanalabs: remove FOLL_FORCE usage
FOLL_FORCE is really only for ptrace access. As we unpin the pinned pages using unpin_user_pages_dirty_lock(true), the assumption is that all these pages are writable. FOLL_FORCE in this case seems to be due to copy-and-past from other drivers. Let's just remove it. Acked-by: Oded Gabbay Cc: Oded Gabbay Cc: Arnd Bergmann Cc: Greg Kroah-Hartman Signed-off-by: David Hildenbrand --- drivers/misc/habanalabs/common/memory.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/misc/habanalabs/common/memory.c b/drivers/misc/habanalabs/common/memory.c index ef28f3b37b93..e35cca96bbef 100644 --- a/drivers/misc/habanalabs/common/memory.c +++ b/drivers/misc/habanalabs/common/memory.c @@ -2312,8 +2312,7 @@ static int get_user_memory(struct hl_device *hdev, u64 addr, u64 size, if (!userptr->pages) return -ENOMEM; - rc = pin_user_pages_fast(start, npages, -FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM, + rc = pin_user_pages_fast(start, npages, FOLL_WRITE | FOLL_LONGTERM, userptr->pages); if (rc != npages) { -- 2.38.1
[PATCH mm-unstable v1 17/20] drm/exynos: remove FOLL_FORCE usage
FOLL_FORCE is really only for ptrace access. As we unpin the pinned pages using unpin_user_pages_dirty_lock(true), the assumption is that all these pages are writable. FOLL_FORCE in this case seems to be a legacy leftover. Let's just remove it. Cc: Inki Dae Cc: Seung-Woo Kim Cc: Kyungmin Park Cc: David Airlie Cc: Daniel Vetter Cc: Krzysztof Kozlowski Signed-off-by: David Hildenbrand --- drivers/gpu/drm/exynos/exynos_drm_g2d.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/exynos/exynos_drm_g2d.c b/drivers/gpu/drm/exynos/exynos_drm_g2d.c index 471fd6c8135f..e19c2ceb3759 100644 --- a/drivers/gpu/drm/exynos/exynos_drm_g2d.c +++ b/drivers/gpu/drm/exynos/exynos_drm_g2d.c @@ -477,7 +477,7 @@ static dma_addr_t *g2d_userptr_get_dma_addr(struct g2d_data *g2d, } ret = pin_user_pages_fast(start, npages, - FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM, + FOLL_WRITE | FOLL_LONGTERM, g2d_userptr->pages); if (ret != npages) { DRM_DEV_ERROR(g2d->dev, -- 2.38.1
[PATCH mm-unstable v1 18/20] RDMA/hw/qib/qib_user_pages: remove FOLL_FORCE usage
FOLL_FORCE is really only for ptrace access. As we unpin the pinned pages using unpin_user_pages_dirty_lock(true), the assumption is that all these pages are writable. FOLL_FORCE in this case seems to be a legacy leftover. Let's just remove it. Cc: Dennis Dalessandro Cc: Jason Gunthorpe Cc: Leon Romanovsky Signed-off-by: David Hildenbrand --- drivers/infiniband/hw/qib/qib_user_pages.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c index f4b5f05058e4..f693bc753b6b 100644 --- a/drivers/infiniband/hw/qib/qib_user_pages.c +++ b/drivers/infiniband/hw/qib/qib_user_pages.c @@ -110,7 +110,7 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages, for (got = 0; got < num_pages; got += ret) { ret = pin_user_pages(start_page + got * PAGE_SIZE, num_pages - got, -FOLL_LONGTERM | FOLL_WRITE | FOLL_FORCE, +FOLL_LONGTERM | FOLL_WRITE, p + got, NULL); if (ret < 0) { mmap_read_unlock(current->mm); -- 2.38.1
[PATCH mm-unstable v1 09/20] mm/gup: reliable R/O long-term pinning in COW mappings
We already support reliable R/O pinning of anonymous memory. However, assume we end up pinning (R/O long-term) a pagecache page or the shared zeropage inside a writable private ("COW") mapping. The next write access will trigger a write-fault and replace the pinned page by an exclusive anonymous page in the process page tables to break COW: the pinned page no longer corresponds to the page mapped into the process' page table. Now that FAULT_FLAG_UNSHARE can break COW on anything mapped into a COW mapping, let's properly break COW first before R/O long-term pinning something that's not an exclusive anon page inside a COW mapping. FAULT_FLAG_UNSHARE will break COW and map an exclusive anon page instead that can get pinned safely. With this change, we can stop using FOLL_FORCE|FOLL_WRITE for reliable R/O long-term pinning in COW mappings. With this change, the new R/O long-term pinning tests for non-anonymous memory succeed: # [RUN] R/O longterm GUP pin ... with shared zeropage ok 151 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with memfd ok 152 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with tmpfile ok 153 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with huge zeropage ok 154 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with memfd hugetlb (2048 kB) ok 155 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with memfd hugetlb (1048576 kB) ok 156 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with shared zeropage ok 157 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with memfd ok 158 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with tmpfile ok 159 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with huge zeropage ok 160 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (2048 kB) ok 161 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... 
with memfd hugetlb (1048576 kB) ok 162 Longterm R/O pin is reliable Note 1: We don't care about short-term R/O pinning, because such pins have snapshot semantics: they are not supposed to observe modifications that happen after pinning. As one example, assume we start direct I/O to read from a page and store page content into a file: modifications to page content after starting direct I/O are not guaranteed to end up in the file. So even if we'd pin the shared zeropage, the end result would be as expected -- getting zeroes stored to the file. Note 2: For shared mappings we'll now always fall back to the slow path to look up the VMA when R/O long-term pinning. While that's the necessary price we have to pay right now, it's actually not that bad in practice: most FOLL_LONGTERM users already specify FOLL_WRITE, for example, along with FOLL_FORCE because they tried dealing with COW mappings correctly ... Note 3: For users that use FOLL_LONGTERM right now without FOLL_WRITE, such as VFIO, we'd now no longer pin the shared zeropage. Instead, we'd populate exclusive anon pages that we can pin. There was a concern that this could affect the memlock limit of existing setups. For example, a VM running with VFIO could run into the memlock limit and fail to run. However, we essentially had the same behavior already in commit 17839856fd58 ("gup: document and work around "COW can break either way" issue") which got merged into some enterprise distros, and there were not any such complaints. So most probably, we're fine.
Signed-off-by: David Hildenbrand --- include/linux/mm.h | 27 --- mm/gup.c | 10 +- mm/huge_memory.c | 2 +- mm/hugetlb.c | 7 --- 4 files changed, 34 insertions(+), 12 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 6bd2ee5872dd..e8cc838f42f9 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3095,8 +3095,12 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags) * Must be called with the (sub)page that's actually referenced via the * page table entry, which might not necessarily be the head page for a * PTE-mapped THP. + * + * If the vma is NULL, we're coming from the GUP-fast path and might have + * to fallback to the slow path just to lookup the vma. */ -static inline bool gup_must_unshare(unsigned int flags, struct page *page) +static inline bool gup_must_unshare(struct vm_area_struct *vma, + unsigned int flags, struct page *page) { /* * FOLL_WRITE is implicitly handled correctly as the page table entry @@ -3109,8 +3113,25 @@ static inline bool gup_must_unshare(unsigned int flags, struct page *page) * Note: PageAnon(page) is stable until the page is actually getting * freed. */ - if (!PageAnon(page)) - return false; + if (!PageAnon(page)) { +
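The gup_must_unshare() hunk above gates the new behavior on PageAnon(); the overall decision can be condensed into a small predicate. This is a simplified userspace model under stated assumptions: the FOLL_* values and the boolean parameters are stand-ins for the real flags and page/VMA state, and the real function additionally has to fall back to the slow path on GUP-fast (where the VMA is unknown):

```c
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's FOLL_* flag values. */
#define FOLL_WRITE    0x01u
#define FOLL_LONGTERM 0x10000u
#define FOLL_PIN      0x40000u

/*
 * Simplified model of the R/O-pin unshare decision: FOLL_WRITE pins
 * already break COW via the ordinary write-fault path, so only R/O
 * pins are interesting. Anonymous pages must be unshared unless they
 * are already marked anon-exclusive; non-anonymous pages (pagecache,
 * shared zeropage) only need COW broken early for long-term pins
 * inside a writable private ("COW") mapping.
 */
static bool must_unshare_for_pin(unsigned int flags, bool page_is_anon,
				 bool page_anon_exclusive,
				 bool vma_is_cow_mapping)
{
	if (flags & FOLL_WRITE)
		return false;	/* write fault already broke COW */
	if (!page_is_anon) {
		/* Only R/O *long-term* pins need COW broken early ... */
		if (!(flags & FOLL_PIN) || !(flags & FOLL_LONGTERM))
			return false;
		/* ... and only inside writable private mappings. */
		return vma_is_cow_mapping;
	}
	/* Anonymous pages may only be pinned R/O if exclusive. */
	return !page_anon_exclusive;
}
```

When the predicate fires, the pin triggers FAULT_FLAG_UNSHARE, which maps an exclusive anon page that can then be pinned safely.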
[PATCH mm-unstable v1 13/20] media: videobuf-dma-sg: remove FOLL_FORCE usage
GUP now supports reliable R/O long-term pinning in COW mappings, such that we break COW early. MAP_SHARED VMAs only use the shared zeropage so far in one corner case (DAXFS file with holes), which can be ignored because GUP does not support long-term pinning in fsdax (see check_vma_flags()). Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop using FOLL_FORCE, which is really only for ptrace access. Cc: Mauro Carvalho Chehab Signed-off-by: David Hildenbrand --- drivers/media/v4l2-core/videobuf-dma-sg.c | 14 +- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c index f75e5eedeee0..234e9f647c96 100644 --- a/drivers/media/v4l2-core/videobuf-dma-sg.c +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c @@ -151,17 +151,16 @@ static void videobuf_dma_init(struct videobuf_dmabuf *dma) static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma, int direction, unsigned long data, unsigned long size) { + unsigned int gup_flags = FOLL_LONGTERM; unsigned long first, last; - int err, rw = 0; - unsigned int flags = FOLL_FORCE; + int err; dma->direction = direction; switch (dma->direction) { case DMA_FROM_DEVICE: - rw = READ; + gup_flags |= FOLL_WRITE; break; case DMA_TO_DEVICE: - rw = WRITE; break; default: BUG(); @@ -177,14 +176,11 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma, if (NULL == dma->pages) return -ENOMEM; - if (rw == READ) - flags |= FOLL_WRITE; - dprintk(1, "init user [0x%lx+0x%lx => %lu pages]\n", data, size, dma->nr_pages); - err = pin_user_pages(data & PAGE_MASK, dma->nr_pages, -flags | FOLL_LONGTERM, dma->pages, NULL); + err = pin_user_pages(data & PAGE_MASK, dma->nr_pages, gup_flags, +dma->pages, NULL); if (err != dma->nr_pages) { dma->nr_pages = (err >= 0) ? err : 0; -- 2.38.1
[PATCH mm-unstable v1 15/20] media: pci/ivtv: remove FOLL_FORCE usage
FOLL_FORCE is really only for ptrace access. R/O pinning a page is supposed to fail if the VMA misses proper access permissions (no VM_READ). Let's just remove FOLL_FORCE usage here; there would have to be a pretty good reason to allow arbitrary drivers to R/O pin pages in a PROT_NONE VMA. Most probably, FOLL_FORCE usage is just some legacy leftover. Cc: Andy Walls Cc: Mauro Carvalho Chehab Signed-off-by: David Hildenbrand --- drivers/media/pci/ivtv/ivtv-udma.c | 2 +- drivers/media/pci/ivtv/ivtv-yuv.c | 5 ++--- 2 files changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/media/pci/ivtv/ivtv-udma.c b/drivers/media/pci/ivtv/ivtv-udma.c index 210be8290f24..99b9f55ca829 100644 --- a/drivers/media/pci/ivtv/ivtv-udma.c +++ b/drivers/media/pci/ivtv/ivtv-udma.c @@ -115,7 +115,7 @@ int ivtv_udma_setup(struct ivtv *itv, unsigned long ivtv_dest_addr, /* Pin user pages for DMA Xfer */ err = pin_user_pages_unlocked(user_dma.uaddr, user_dma.page_count, - dma->map, FOLL_FORCE); + dma->map, 0); if (user_dma.page_count != err) { IVTV_DEBUG_WARN("failed to map user pages, returned %d instead of %d\n", diff --git a/drivers/media/pci/ivtv/ivtv-yuv.c b/drivers/media/pci/ivtv/ivtv-yuv.c index 4ba10c34a16a..582146f8d70d 100644 --- a/drivers/media/pci/ivtv/ivtv-yuv.c +++ b/drivers/media/pci/ivtv/ivtv-yuv.c @@ -63,12 +63,11 @@ static int ivtv_yuv_prep_user_dma(struct ivtv *itv, struct ivtv_user_dma *dma, /* Pin user pages for DMA Xfer */ y_pages = pin_user_pages_unlocked(y_dma.uaddr, - y_dma.page_count, &dma->map[0], FOLL_FORCE); + y_dma.page_count, &dma->map[0], 0); uv_pages = 0; /* silence gcc. value is set and consumed only if: */ if (y_pages == y_dma.page_count) { uv_pages = pin_user_pages_unlocked(uv_dma.uaddr, - uv_dma.page_count, &dma->map[y_pages], - FOLL_FORCE); + uv_dma.page_count, &dma->map[y_pages], 0); } if (y_pages != y_dma.page_count || uv_pages != uv_dma.page_count) { -- 2.38.1
[PATCH mm-unstable v1 16/20] mm/frame-vector: remove FOLL_FORCE usage
FOLL_FORCE is really only for ptrace access. According to commit 707947247e95 ("media: videobuf2-vmalloc: get_userptr: buffers are always writable"), get_vaddr_frames() currently pins all pages writable as a workaround for issues with read-only buffers. FOLL_FORCE, however, seems to be a legacy leftover as it predates commit 707947247e95 ("media: videobuf2-vmalloc: get_userptr: buffers are always writable"). Let's just remove it. Once the read-only buffer issue has been resolved, FOLL_WRITE could again be set depending on the DMA direction. Cc: Hans Verkuil Cc: Marek Szyprowski Cc: Tomasz Figa Cc: Marek Szyprowski Cc: Mauro Carvalho Chehab Signed-off-by: David Hildenbrand --- drivers/media/common/videobuf2/frame_vector.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/media/common/videobuf2/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c index 542dde9d2609..062e98148c53 100644 --- a/drivers/media/common/videobuf2/frame_vector.c +++ b/drivers/media/common/videobuf2/frame_vector.c @@ -50,7 +50,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames, start = untagged_addr(start); ret = pin_user_pages_fast(start, nr_frames, - FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM, + FOLL_WRITE | FOLL_LONGTERM, (struct page **)(vec->ptrs)); if (ret > 0) { vec->got_ref = true; -- 2.38.1
[PATCH mm-unstable v1 14/20] drm/etnaviv: remove FOLL_FORCE usage
GUP now supports reliable R/O long-term pinning in COW mappings, such that we break COW early. MAP_SHARED VMAs only use the shared zeropage so far in one corner case (DAXFS file with holes), which can be ignored because GUP does not support long-term pinning in fsdax (see check_vma_flags()). commit cd5297b0855f ("drm/etnaviv: Use FOLL_FORCE for userptr") documents that FOLL_FORCE | FOLL_WRITE was really only used for reliable R/O pinning. Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop using FOLL_FORCE, which is really only for ptrace access. Cc: Daniel Vetter Cc: Lucas Stach Cc: Russell King Cc: Christian Gmeiner Cc: David Airlie Signed-off-by: David Hildenbrand --- drivers/gpu/drm/etnaviv/etnaviv_gem.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem.c b/drivers/gpu/drm/etnaviv/etnaviv_gem.c index cc386f8a7116..efe2240945d0 100644 --- a/drivers/gpu/drm/etnaviv/etnaviv_gem.c +++ b/drivers/gpu/drm/etnaviv/etnaviv_gem.c @@ -638,6 +638,7 @@ static int etnaviv_gem_userptr_get_pages(struct etnaviv_gem_object *etnaviv_obj) struct page **pvec = NULL; struct etnaviv_gem_userptr *userptr = &etnaviv_obj->userptr; int ret, pinned = 0, npages = etnaviv_obj->base.size >> PAGE_SHIFT; + unsigned int gup_flags = FOLL_LONGTERM; might_lock_read(&current->mm->mmap_lock); @@ -648,14 +649,15 @@ static int etnaviv_gem_userptr_get_pages(struct etnaviv_gem_object *etnaviv_obj) if (!pvec) return -ENOMEM; + if (!userptr->ro) + gup_flags |= FOLL_WRITE; + do { unsigned num_pages = npages - pinned; uint64_t ptr = userptr->ptr + pinned * PAGE_SIZE; struct page **pages = pvec + pinned; - ret = pin_user_pages_fast(ptr, num_pages, - FOLL_WRITE | FOLL_FORCE | FOLL_LONGTERM, - pages); + ret = pin_user_pages_fast(ptr, num_pages, gup_flags, pages); if (ret < 0) { unpin_user_pages(pvec, pinned); kvfree(pvec); -- 2.38.1
[PATCH mm-unstable v1 12/20] RDMA/siw: remove FOLL_FORCE usage
GUP now supports reliable R/O long-term pinning in COW mappings, such that we break COW early. MAP_SHARED VMAs only use the shared zeropage so far in one corner case (DAXFS file with holes), which can be ignored because GUP does not support long-term pinning in fsdax (see check_vma_flags()). Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop using FOLL_FORCE, which is really only for ptrace access. Cc: Bernard Metzler Cc: Jason Gunthorpe Cc: Leon Romanovsky Signed-off-by: David Hildenbrand --- drivers/infiniband/sw/siw/siw_mem.c | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c index 61c17db70d65..b2b33dd3b4fa 100644 --- a/drivers/infiniband/sw/siw/siw_mem.c +++ b/drivers/infiniband/sw/siw/siw_mem.c @@ -368,7 +368,7 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable) struct mm_struct *mm_s; u64 first_page_va; unsigned long mlock_limit; - unsigned int foll_flags = FOLL_WRITE; + unsigned int foll_flags = FOLL_LONGTERM; int num_pages, num_chunks, i, rv = 0; if (!can_do_mlock()) @@ -391,8 +391,8 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable) mmgrab(mm_s); - if (!writable) - foll_flags |= FOLL_FORCE; + if (writable) + foll_flags |= FOLL_WRITE; mmap_read_lock(mm_s); @@ -423,8 +423,7 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable) while (nents) { struct page **plist = &umem->page_chunk[i].plist[got]; - rv = pin_user_pages(first_page_va, nents, - foll_flags | FOLL_LONGTERM, + rv = pin_user_pages(first_page_va, nents, foll_flags, plist, NULL); if (rv < 0) goto out_sem_up; -- 2.38.1
[PATCH mm-unstable v1 11/20] RDMA/usnic: remove FOLL_FORCE usage
GUP now supports reliable R/O long-term pinning in COW mappings, such that we break COW early. MAP_SHARED VMAs only use the shared zeropage so far in one corner case (DAXFS file with holes), which can be ignored because GUP does not support long-term pinning in fsdax (see check_vma_flags()). Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop using FOLL_FORCE, which is really only for ptrace access. Cc: Christian Benvenuti Cc: Nelson Escobar Cc: Jason Gunthorpe Cc: Leon Romanovsky Signed-off-by: David Hildenbrand --- drivers/infiniband/hw/usnic/usnic_uiom.c | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c index 67923ced6e2d..c301b3be9f30 100644 --- a/drivers/infiniband/hw/usnic/usnic_uiom.c +++ b/drivers/infiniband/hw/usnic/usnic_uiom.c @@ -85,6 +85,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable, int dmasync, struct usnic_uiom_reg *uiomr) { struct list_head *chunk_list = &uiomr->chunk_list; + unsigned int gup_flags = FOLL_LONGTERM; struct page **page_list; struct scatterlist *sg; struct usnic_uiom_chunk *chunk; @@ -96,7 +97,6 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable, int off; int i; dma_addr_t pa; - unsigned int gup_flags; struct mm_struct *mm; /* @@ -131,8 +131,8 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable, goto out; } - gup_flags = FOLL_WRITE; - gup_flags |= (writable) ? 0 : FOLL_FORCE; + if (writable) + gup_flags |= FOLL_WRITE; cur_base = addr & PAGE_MASK; ret = 0; @@ -140,8 +140,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable, ret = pin_user_pages(cur_base, min_t(unsigned long, npages, PAGE_SIZE / sizeof(struct page *)), - gup_flags | FOLL_LONGTERM, - page_list, NULL); + gup_flags, page_list, NULL); if (ret < 0) goto out; -- 2.38.1
[PATCH mm-unstable v1 10/20] RDMA/umem: remove FOLL_FORCE usage
GUP now supports reliable R/O long-term pinning in COW mappings, such that we break COW early. MAP_SHARED VMAs only use the shared zeropage so far in one corner case (DAXFS file with holes), which can be ignored because GUP does not support long-term pinning in fsdax (see check_vma_flags()). Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop using FOLL_FORCE, which is really only for ptrace access. Tested-by: Leon Romanovsky # Over mlx4 and mlx5. Cc: Jason Gunthorpe Cc: Leon Romanovsky Signed-off-by: David Hildenbrand --- drivers/infiniband/core/umem.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 86d479772fbc..755a9c57db6f 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -156,7 +156,7 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr, struct mm_struct *mm; unsigned long npages; int pinned, ret; - unsigned int gup_flags = FOLL_WRITE; + unsigned int gup_flags = FOLL_LONGTERM; /* * If the combination of the addr and size requested for this memory @@ -210,8 +210,8 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr, cur_base = addr & PAGE_MASK; - if (!umem->writable) - gup_flags |= FOLL_FORCE; + if (umem->writable) + gup_flags |= FOLL_WRITE; while (npages) { cond_resched(); @@ -219,7 +219,7 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr, min_t(unsigned long, npages, PAGE_SIZE / sizeof(struct page *)), - gup_flags | FOLL_LONGTERM, page_list); + gup_flags, page_list); if (pinned < 0) { ret = pinned; goto umem_release; -- 2.38.1
[PATCH mm-unstable v1 07/20] mm: don't call vm_ops->huge_fault() in wp_huge_pmd()/wp_huge_pud() for private mappings
If we already have a PMD/PUD mapped write-protected in a private mapping and we want to break COW either due to FAULT_FLAG_WRITE or FAULT_FLAG_UNSHARE, there is no need to inform the file system just like on the PTE path. Let's just split (->zap) + fallback in that case. This is a preparation for more generic FAULT_FLAG_UNSHARE support in COW mappings. Signed-off-by: David Hildenbrand --- mm/memory.c | 24 +++- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index c35e6cd32b6a..d47ad33c6487 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4802,6 +4802,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf) static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) { const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE; + vm_fault_t ret; if (vma_is_anonymous(vmf->vma)) { if (likely(!unshare) && @@ -4809,11 +4810,13 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) return handle_userfault(vmf, VM_UFFD_WP); return do_huge_pmd_wp_page(vmf); } - if (vmf->vma->vm_ops->huge_fault) { - vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD); - if (!(ret & VM_FAULT_FALLBACK)) - return ret; + if (vmf->vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) { + if (vmf->vma->vm_ops->huge_fault) { + ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD); + if (!(ret & VM_FAULT_FALLBACK)) + return ret; + } } /* COW or write-notify handled on pte level: split pmd. 
*/ @@ -4839,14 +4842,17 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud) { #if defined(CONFIG_TRANSPARENT_HUGEPAGE) &&\ defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) + vm_fault_t ret; + /* No support for anonymous transparent PUD pages yet */ if (vma_is_anonymous(vmf->vma)) goto split; - if (vmf->vma->vm_ops->huge_fault) { - vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD); - - if (!(ret & VM_FAULT_FALLBACK)) - return ret; + if (vmf->vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) { + if (vmf->vma->vm_ops->huge_fault) { + ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD); + if (!(ret & VM_FAULT_FALLBACK)) + return ret; + } } split: /* COW or write-notify not handled on PUD level: split pud.*/ -- 2.38.1
[PATCH mm-unstable v1 08/20] mm: extend FAULT_FLAG_UNSHARE support to anything in a COW mapping
Extend FAULT_FLAG_UNSHARE to break COW on anything mapped into a COW (i.e., private writable) mapping and adjust the documentation accordingly. FAULT_FLAG_UNSHARE will now also break COW when encountering the shared zeropage, a pagecache page, a PFNMAP, ... inside a COW mapping, by properly replacing the mapped page/pfn by a private copy (an exclusive anonymous page). Note that only do_wp_page() needs care: hugetlb_wp() already handles FAULT_FLAG_UNSHARE correctly. wp_huge_pmd()/wp_huge_pud() also handles it correctly, for example, splitting the huge zeropage on FAULT_FLAG_UNSHARE such that we can handle FAULT_FLAG_UNSHARE on the PTE level. This change is a requirement for reliable long-term R/O pinning in COW mappings. Signed-off-by: David Hildenbrand --- include/linux/mm_types.h | 8 mm/memory.c | 4 2 files changed, 4 insertions(+), 8 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 5e7f4fac1e78..5e9aaad8c7b2 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1037,9 +1037,9 @@ typedef struct { * @FAULT_FLAG_REMOTE: The fault is not for current task/mm. * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch. * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals. - * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to unshare (and mark - * exclusive) a possibly shared anonymous page that is - * mapped R/O. + * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to break COW in a + * COW mapping, making sure that an exclusive anon page is + * mapped after the fault. * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached. *We should only access orig_pte if this flag set. * @@ -1064,7 +1064,7 @@ typedef struct { * * The combination FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE is illegal. * FAULT_FLAG_UNSHARE is ignored and treated like an ordinary read fault when - * no existing R/O-mapped anonymous page is encountered. 
+ * applied to mappings that are not COW mappings. */ enum fault_flag { FAULT_FLAG_WRITE = 1 << 0, diff --git a/mm/memory.c b/mm/memory.c index d47ad33c6487..56b21ab1e4d2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3432,10 +3432,6 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) } wp_page_reuse(vmf); return 0; - } else if (unshare) { - /* No anonymous page -> nothing to do. */ - pte_unmap_unlock(vmf->pte, vmf->ptl); - return 0; } copy: /* -- 2.38.1
[PATCH mm-unstable v1 06/20] mm: rework handling in do_wp_page() based on private vs. shared mappings
We want to extend FAULT_FLAG_UNSHARE support to anything mapped into a COW mapping (pagecache page, zeropage, PFN, ...), not just anonymous pages. Let's prepare for that by handling shared mappings first such that we can handle private mappings last. While at it, use folio-based functions instead of page-based functions where we touch the code either way. Signed-off-by: David Hildenbrand --- mm/memory.c | 38 +- 1 file changed, 17 insertions(+), 21 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index c4fa378ec2a0..c35e6cd32b6a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3342,7 +3342,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) { const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE; struct vm_area_struct *vma = vmf->vma; - struct folio *folio; + struct folio *folio = NULL; if (likely(!unshare)) { if (userfaultfd_pte_wp(vma, *vmf->pte)) { @@ -3360,13 +3360,12 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) } vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte); - if (!vmf->page) { - if (unlikely(unshare)) { - /* No anonymous page -> nothing to do. */ - pte_unmap_unlock(vmf->pte, vmf->ptl); - return 0; - } + /* +* Shared mapping: we are guaranteed to have VM_WRITE and +* FAULT_FLAG_WRITE set at this point. +*/ + if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) { /* * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a * VM_PFNMAP VMA. @@ -3374,20 +3373,19 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) * We should not cow pages in a shared writeable mapping. * Just mark the pages writable and/or call ops->pfn_mkwrite. */ - if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == -(VM_WRITE|VM_SHARED)) + if (!vmf->page) return wp_pfn_shared(vmf); - - pte_unmap_unlock(vmf->pte, vmf->ptl); - return wp_page_copy(vmf); + return wp_page_shared(vmf); } + if (vmf->page) + folio = page_folio(vmf->page); + /* -* Take out anonymous pages first, anonymous shared vmas are -* not dirty accountable. 
+* Private mapping: create an exclusive anonymous page copy if reuse +* is impossible. We might miss VM_WRITE for FOLL_FORCE handling. */ - folio = page_folio(vmf->page); - if (folio_test_anon(folio)) { + if (folio && folio_test_anon(folio)) { /* * If the page is exclusive to this process we must reuse the * page without further checks. @@ -3438,19 +3436,17 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) /* No anonymous page -> nothing to do. */ pte_unmap_unlock(vmf->pte, vmf->ptl); return 0; - } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == - (VM_WRITE|VM_SHARED))) { - return wp_page_shared(vmf); } copy: /* * Ok, we need to copy. Oh, well.. */ - get_page(vmf->page); + if (folio) + folio_get(folio); pte_unmap_unlock(vmf->pte, vmf->ptl); #ifdef CONFIG_KSM - if (PageKsm(vmf->page)) + if (folio && folio_test_ksm(folio)) count_vm_event(COW_KSM); #endif return wp_page_copy(vmf); -- 2.38.1
[PATCH mm-unstable v1 05/20] mm: add early FAULT_FLAG_WRITE consistency checks
Let's catch abuse of FAULT_FLAG_WRITE early, such that we don't have to care in all other handlers and might get "surprises" if we forget to do so. Write faults without VM_MAYWRITE don't make any sense, and our maybe_mkwrite() logic could have hidden such abuse for now. Write faults without VM_WRITE on something that is not a COW mapping is similarly broken, and e.g., do_wp_page() could end up placing an anonymous page into a shared mapping, which would be bad. This is a preparation for reliable R/O long-term pinning of pages in private mappings, whereby we want to make sure that we will never break COW in a read-only private mapping. Signed-off-by: David Hildenbrand --- mm/memory.c | 8 1 file changed, 8 insertions(+) diff --git a/mm/memory.c b/mm/memory.c index e014435a87db..c4fa378ec2a0 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5170,6 +5170,14 @@ static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma, */ if (!is_cow_mapping(vma->vm_flags)) *flags &= ~FAULT_FLAG_UNSHARE; + } else if (*flags & FAULT_FLAG_WRITE) { + /* Write faults on read-only mappings are impossible ... */ + if (WARN_ON_ONCE(!(vma->vm_flags & VM_MAYWRITE))) + return VM_FAULT_SIGSEGV; + /* ... and FOLL_FORCE only applies to COW mappings. */ + if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE) && +!is_cow_mapping(vma->vm_flags))) + return VM_FAULT_SIGSEGV; } return 0; } -- 2.38.1
[PATCH mm-unstable v1 02/20] selftests/vm: cow: basic COW tests for non-anonymous pages
Let's add basic tests for COW with non-anonymous pages in private mappings: write access should properly trigger COW and result in the private changes not being visible through other page mappings. Especially, add tests for: * Zeropage * Huge zeropage * Ordinary pagecache pages via memfd and tmpfile() * Hugetlb pages via memfd Fortunately, all tests pass. Signed-off-by: David Hildenbrand --- tools/testing/selftests/vm/cow.c | 338 ++- 1 file changed, 337 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/vm/cow.c b/tools/testing/selftests/vm/cow.c index d202bfd63585..fb07bd44529c 100644 --- a/tools/testing/selftests/vm/cow.c +++ b/tools/testing/selftests/vm/cow.c @@ -19,6 +19,7 @@ #include #include #include +#include #include "local_config.h" #ifdef LOCAL_CONFIG_HAVE_LIBURING @@ -35,6 +36,7 @@ static size_t thpsize; static int nr_hugetlbsizes; static size_t hugetlbsizes[10]; static int gup_fd; +static bool has_huge_zeropage; static void detect_thpsize(void) { @@ -64,6 +66,31 @@ static void detect_thpsize(void) close(fd); } +static void detect_huge_zeropage(void) +{ + int fd = open("/sys/kernel/mm/transparent_hugepage/use_zero_page", + O_RDONLY); + size_t enabled = 0; + char buf[15]; + int ret; + + if (fd < 0) + return; + + ret = pread(fd, buf, sizeof(buf), 0); + if (ret > 0 && ret < sizeof(buf)) { + buf[ret] = 0; + + enabled = strtoul(buf, NULL, 10); + if (enabled == 1) { + has_huge_zeropage = true; + ksft_print_msg("[INFO] huge zeropage is enabled\n"); + } + } + + close(fd); +} + static void detect_hugetlbsizes(void) { DIR *dir = opendir("/sys/kernel/mm/hugepages/"); @@ -1148,6 +1175,312 @@ static int tests_per_anon_test_case(void) return tests; } +typedef void (*non_anon_test_fn)(char *mem, const char *smem, size_t size); + +static void test_cow(char *mem, const char *smem, size_t size) +{ + char *old = malloc(size); + + /* Backup the original content. */ + memcpy(old, smem, size); + + /* Modify the page. 
*/ + memset(mem, 0xff, size); + + /* See if we still read the old values via the other mapping. */ + ksft_test_result(!memcmp(smem, old, size), +"Other mapping not modified\n"); + free(old); +} + +static void run_with_zeropage(non_anon_test_fn fn, const char *desc) +{ + char *mem, *smem, tmp; + + ksft_print_msg("[RUN] %s ... with shared zeropage\n", desc); + + mem = mmap(NULL, pagesize, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANON, -1, 0); + if (mem == MAP_FAILED) { + ksft_test_result_fail("mmap() failed\n"); + return; + } + + smem = mmap(NULL, pagesize, PROT_READ, MAP_PRIVATE | MAP_ANON, -1, 0); + if (smem == MAP_FAILED) { + ksft_test_result_fail("mmap() failed\n"); + goto munmap; + } + + /* Read from the page to populate the shared zeropage. */ + tmp = *mem + *smem; + asm volatile("" : "+r" (tmp)); + + fn(mem, smem, pagesize); +munmap: + munmap(mem, pagesize); + if (smem != MAP_FAILED) + munmap(smem, pagesize); +} + +static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc) +{ + char *mem, *smem, *mmap_mem, *mmap_smem, tmp; + size_t mmap_size; + int ret; + + ksft_print_msg("[RUN] %s ... with huge zeropage\n", desc); + + if (!has_huge_zeropage) { + ksft_test_result_skip("Huge zeropage not enabled\n"); + return; + } + + /* For alignment purposes, we need twice the thp size. */ + mmap_size = 2 * thpsize; + mmap_mem = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + if (mmap_mem == MAP_FAILED) { + ksft_test_result_fail("mmap() failed\n"); + return; + } + mmap_smem = mmap(NULL, mmap_size, PROT_READ, +MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + if (mmap_smem == MAP_FAILED) { + ksft_test_result_fail("mmap() failed\n"); + goto munmap; + } + + /* We need a THP-aligned memory area. 
*/ + mem = (char *)(((uintptr_t)mmap_mem + thpsize) & ~(thpsize - 1)); + smem = (char *)(((uintptr_t)mmap_smem + thpsize) & ~(thpsize - 1)); + + ret = madvise(mem, thpsize, MADV_HUGEPAGE); + ret |= madvise(smem, thpsize, MADV_HUGEPAGE); + if (ret) { + ksft_test_result_fail("MADV_HUGEPAGE failed\n"); + goto munmap; +
[PATCH mm-unstable v1 04/20] mm: add early FAULT_FLAG_UNSHARE consistency checks
For now, FAULT_FLAG_UNSHARE only applies to anonymous pages, which implies a COW mapping. Let's hide FAULT_FLAG_UNSHARE early if we're not dealing with a COW mapping, such that we treat it like a read fault as documented and don't have to worry about the flag throughout all fault handlers. While at it, centralize the check for mutual exclusion of FAULT_FLAG_UNSHARE and FAULT_FLAG_WRITE and just drop the check that either flag is set in the WP handler. Signed-off-by: David Hildenbrand --- mm/huge_memory.c | 3 --- mm/hugetlb.c | 5 - mm/memory.c | 23 --- 3 files changed, 20 insertions(+), 11 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index ed12cd3acbfd..68d00196b519 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1267,9 +1267,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd); VM_BUG_ON_VMA(!vma->anon_vma, vma); - VM_BUG_ON(unshare && (vmf->flags & FAULT_FLAG_WRITE)); - VM_BUG_ON(!unshare && !(vmf->flags & FAULT_FLAG_WRITE)); - if (is_huge_zero_pmd(orig_pmd)) goto fallback; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 1de986c62976..383b26069b33 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5314,9 +5314,6 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long haddr = address & huge_page_mask(h); struct mmu_notifier_range range; - VM_BUG_ON(unshare && (flags & FOLL_WRITE)); - VM_BUG_ON(!unshare && !(flags & FOLL_WRITE)); - /* * hugetlb does not support FOLL_FORCE-style write faults that keep the * PTE mapped R/O such as maybe_mkwrite() would do. @@ -5326,8 +5323,6 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma, /* Let's take out MAP_SHARED mappings first. 
*/ if (vma->vm_flags & VM_MAYSHARE) { - if (unlikely(unshare)) - return 0; set_huge_ptep_writable(vma, haddr, ptep); return 0; } diff --git a/mm/memory.c b/mm/memory.c index 2d453736f87c..e014435a87db 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3344,9 +3344,6 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) struct vm_area_struct *vma = vmf->vma; struct folio *folio; - VM_BUG_ON(unshare && (vmf->flags & FAULT_FLAG_WRITE)); - VM_BUG_ON(!unshare && !(vmf->flags & FAULT_FLAG_WRITE)); - if (likely(!unshare)) { if (userfaultfd_pte_wp(vma, *vmf->pte)) { pte_unmap_unlock(vmf->pte, vmf->ptl); @@ -5161,6 +5158,22 @@ static void lru_gen_exit_fault(void) } #endif /* CONFIG_LRU_GEN */ +static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma, + unsigned int *flags) +{ + if (unlikely(*flags & FAULT_FLAG_UNSHARE)) { + if (WARN_ON_ONCE(*flags & FAULT_FLAG_WRITE)) + return VM_FAULT_SIGSEGV; + /* +* FAULT_FLAG_UNSHARE only applies to COW mappings. Let's +* just treat it like an ordinary read-fault otherwise. +*/ + if (!is_cow_mapping(vma->vm_flags)) + *flags &= ~FAULT_FLAG_UNSHARE; + } + return 0; +} + /* * By the time we get here, we already hold the mm semaphore * @@ -5177,6 +5190,10 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, count_vm_event(PGFAULT); count_memcg_event_mm(vma->vm_mm, PGFAULT); + ret = sanitize_fault_flags(vma, ); + if (ret) + return ret; + if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE, flags & FAULT_FLAG_INSTRUCTION, flags & FAULT_FLAG_REMOTE)) -- 2.38.1
[PATCH mm-unstable v1 03/20] selftests/vm: cow: R/O long-term pinning reliability tests for non-anon pages
Let's test whether R/O long-term pinning is reliable for non-anonymous memory: when R/O long-term pinning a page, the expectation is that we break COW early before pinning, such that actual write access via the page tables won't break COW later and end up replacing the R/O-pinned page in the page table. Consequently, R/O long-term pinning in private mappings would only target exclusive anonymous pages. For now, all tests fail: # [RUN] R/O longterm GUP pin ... with shared zeropage not ok 151 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with memfd not ok 152 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with tmpfile not ok 153 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with huge zeropage not ok 154 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with memfd hugetlb (2048 kB) not ok 155 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with memfd hugetlb (1048576 kB) not ok 156 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with shared zeropage not ok 157 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with memfd not ok 158 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with tmpfile not ok 159 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with huge zeropage not ok 160 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (2048 kB) not ok 161 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... 
with memfd hugetlb (1048576 kB) not ok 162 Longterm R/O pin is reliable Signed-off-by: David Hildenbrand --- tools/testing/selftests/vm/cow.c | 28 +++- 1 file changed, 27 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/vm/cow.c b/tools/testing/selftests/vm/cow.c index fb07bd44529c..73e05b52c49e 100644 --- a/tools/testing/selftests/vm/cow.c +++ b/tools/testing/selftests/vm/cow.c @@ -561,6 +561,7 @@ static void test_iouring_fork(char *mem, size_t size) #endif /* LOCAL_CONFIG_HAVE_LIBURING */ enum ro_pin_test { + RO_PIN_TEST, RO_PIN_TEST_SHARED, RO_PIN_TEST_PREVIOUSLY_SHARED, RO_PIN_TEST_RO_EXCLUSIVE, @@ -593,6 +594,8 @@ static void do_test_ro_pin(char *mem, size_t size, enum ro_pin_test test, } switch (test) { + case RO_PIN_TEST: + break; case RO_PIN_TEST_SHARED: case RO_PIN_TEST_PREVIOUSLY_SHARED: /* @@ -1193,6 +1196,16 @@ static void test_cow(char *mem, const char *smem, size_t size) free(old); } +static void test_ro_pin(char *mem, const char *smem, size_t size) +{ + do_test_ro_pin(mem, size, RO_PIN_TEST, false); +} + +static void test_ro_fast_pin(char *mem, const char *smem, size_t size) +{ + do_test_ro_pin(mem, size, RO_PIN_TEST, true); +} + static void run_with_zeropage(non_anon_test_fn fn, const char *desc) { char *mem, *smem, tmp; @@ -1433,7 +1446,7 @@ struct non_anon_test_case { }; /* - * Test cases that target any pages in private mappings that are non anonymous: + * Test cases that target any pages in private mappings that are not anonymous: * pages that may get shared via COW ndependent of fork(). This includes * the shared zeropage(s), pagecache pages, ... */ @@ -1446,6 +1459,19 @@ static const struct non_anon_test_case non_anon_test_cases[] = { "Basic COW", test_cow, }, + /* +* Take a R/O longterm pin. When modifying the page via the page table, +* the page content change must be visible via the pin. +*/ + { + "R/O longterm GUP pin", + test_ro_pin, + }, + /* Same as above, but using GUP-fast. 
*/ + { + "R/O longterm GUP-fast pin", + test_ro_fast_pin, + }, }; static void run_non_anon_test_case(struct non_anon_test_case const *test_case) -- 2.38.1
[PATCH mm-unstable v1 01/20] selftests/vm: anon_cow: prepare for non-anonymous COW tests
Originally, the plan was to have separate tests for testing COW of non-anonymous (e.g., shared zeropage) pages. Turns out that we'd need a lot of similar functionality and that there isn't a really good reason to separate it. So let's prepare for non-anon tests by renaming to "cow". Signed-off-by: David Hildenbrand --- tools/testing/selftests/vm/.gitignore | 2 +- tools/testing/selftests/vm/Makefile | 10 tools/testing/selftests/vm/check_config.sh| 4 +-- .../selftests/vm/{anon_cow.c => cow.c}| 25 +++ tools/testing/selftests/vm/run_vmtests.sh | 8 +++--- 5 files changed, 27 insertions(+), 22 deletions(-) rename tools/testing/selftests/vm/{anon_cow.c => cow.c} (97%) diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore index 8a536c731e3c..ee8c41c998e6 100644 --- a/tools/testing/selftests/vm/.gitignore +++ b/tools/testing/selftests/vm/.gitignore @@ -1,5 +1,5 @@ # SPDX-License-Identifier: GPL-2.0-only -anon_cow +cow hugepage-mmap hugepage-mremap hugepage-shm diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile index 0986bd60c19f..89c14e41bd43 100644 --- a/tools/testing/selftests/vm/Makefile +++ b/tools/testing/selftests/vm/Makefile @@ -27,7 +27,7 @@ MAKEFLAGS += --no-builtin-rules CFLAGS = -Wall -I $(top_srcdir) -I $(top_srcdir)/usr/include $(EXTRA_CFLAGS) $(KHDR_INCLUDES) LDLIBS = -lrt -lpthread -TEST_GEN_FILES = anon_cow +TEST_GEN_FILES = cow TEST_GEN_FILES += compaction_test TEST_GEN_FILES += gup_test TEST_GEN_FILES += hmm-tests @@ -99,7 +99,7 @@ TEST_FILES += va_128TBswitch.sh include ../lib.mk -$(OUTPUT)/anon_cow: vm_util.c +$(OUTPUT)/cow: vm_util.c $(OUTPUT)/khugepaged: vm_util.c $(OUTPUT)/ksm_functional_tests: vm_util.c $(OUTPUT)/madv_populate: vm_util.c @@ -156,8 +156,8 @@ warn_32bit_failure: endif endif -# ANON_COW_EXTRA_LIBS may get set in local_config.mk, or it may be left empty. 
-$(OUTPUT)/anon_cow: LDLIBS += $(ANON_COW_EXTRA_LIBS) +# cow_EXTRA_LIBS may get set in local_config.mk, or it may be left empty. +$(OUTPUT)/cow: LDLIBS += $(COW_EXTRA_LIBS) $(OUTPUT)/mlock-random-test $(OUTPUT)/memfd_secret: LDLIBS += -lcap @@ -170,7 +170,7 @@ local_config.mk local_config.h: check_config.sh EXTRA_CLEAN += local_config.mk local_config.h -ifeq ($(ANON_COW_EXTRA_LIBS),) +ifeq ($(COW_EXTRA_LIBS),) all: warn_missing_liburing warn_missing_liburing: diff --git a/tools/testing/selftests/vm/check_config.sh b/tools/testing/selftests/vm/check_config.sh index 9a44c6520925..bcba3af0acea 100644 --- a/tools/testing/selftests/vm/check_config.sh +++ b/tools/testing/selftests/vm/check_config.sh @@ -21,11 +21,11 @@ $CC -c $tmpfile_c -o $tmpfile_o >/dev/null 2>&1 if [ -f $tmpfile_o ]; then echo "#define LOCAL_CONFIG_HAVE_LIBURING 1" > $OUTPUT_H_FILE -echo "ANON_COW_EXTRA_LIBS = -luring" > $OUTPUT_MKFILE +echo "COW_EXTRA_LIBS = -luring" > $OUTPUT_MKFILE else echo "// No liburing support found" > $OUTPUT_H_FILE echo "# No liburing support found, so:" > $OUTPUT_MKFILE -echo "ANON_COW_EXTRA_LIBS = " >> $OUTPUT_MKFILE +echo "COW_EXTRA_LIBS = ">> $OUTPUT_MKFILE fi rm ${tmpname}.* diff --git a/tools/testing/selftests/vm/anon_cow.c b/tools/testing/selftests/vm/cow.c similarity index 97% rename from tools/testing/selftests/vm/anon_cow.c rename to tools/testing/selftests/vm/cow.c index bbb251eb5025..d202bfd63585 100644 --- a/tools/testing/selftests/vm/anon_cow.c +++ b/tools/testing/selftests/vm/cow.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-only /* - * COW (Copy On Write) tests for anonymous memory. + * COW (Copy On Write) tests. * * Copyright 2022, Red Hat, Inc. * @@ -986,7 +986,11 @@ struct test_case { test_fn fn; }; -static const struct test_case test_cases[] = { +/* + * Test cases that are specific to anonymous pages: pages in private mappings + * that may get shared via COW during fork(). 
+ */ +static const struct test_case anon_test_cases[] = { /* * Basic COW tests for fork() without any GUP. If we miss to break COW, * either the child can observe modifications by the parent or the @@ -1104,7 +1108,7 @@ static const struct test_case test_cases[] = { }, }; -static void run_test_case(struct test_case const *test_case) +static void run_anon_test_case(struct test_case const *test_case) { int i; @@ -1125,15 +1129,17 @@ static void run_test_case(struct test_case const *test_case) hugetlbsizes[i]); } -static void run_test_cases(void) +static void run_anon_test_cases(void) { int i; - for (i = 0; i < ARRAY_SIZE(test_cases); i++) - run_test_case(_cases[i]); + k
[PATCH mm-unstable v1 00/20] mm/gup: remove FOLL_FORCE usage from drivers (reliable R/O long-term pinning)
Until now, we did not support reliable R/O long-term pinning in COW mappings. That means, if we trigger R/O long-term pinning in a MAP_PRIVATE mapping, we could end up pinning the (R/O-mapped) shared zeropage or a pagecache page. The next write access would trigger a write fault and replace the pinned page by an exclusive anonymous page in the process page table; whatever the process would write to that private page copy would not be visible to the owner of the previous page pin: for example, RDMA could read stale data. The end result is essentially an unexpected and hard-to-debug memory corruption. Some drivers tried working around that limitation by using "FOLL_FORCE|FOLL_WRITE|FOLL_LONGTERM" for R/O long-term pinning. FOLL_WRITE would trigger a write fault, if required, and break COW before pinning the page. FOLL_FORCE is required because the VMA might lack write permissions, and drivers wanted to make that work as well, just like one would expect (no write access, but still triggering a write access to break COW). However, that is not a practical solution, because (1) Drivers that don't stick to that undocumented and debatable pattern would still run into that issue. For example, VFIO only uses FOLL_LONGTERM for R/O long-term pinning. (2) Using FOLL_WRITE just to work around a COW mapping + page pinning limitation is unintuitive. FOLL_WRITE would, for example, mark the page softdirty or trigger uffd-wp, even though there actually isn't going to be any write access. (3) The purpose of FOLL_FORCE is debug access, not access by arbitrary drivers that lack VMA permissions. So instead, make R/O long-term pinning work as expected, by breaking COW in a COW mapping early, such that we can remove any FOLL_FORCE usage from drivers and make FOLL_FORCE ptrace-specific (renaming it to FOLL_PTRACE). More details in patch #8. Patches #1--#3 add COW tests for non-anonymous pages. 
Patches #4--#7 prepare core MM for extended FAULT_FLAG_UNSHARE support in COW mappings. Patch #8 implements reliable R/O long-term pinning in COW mappings Patches #9--#19 remove any FOLL_FORCE usage from drivers. Patch #20 renames FOLL_FORCE to FOLL_PTRACE. I'm refraining from CCing all driver/arch maintainers on the whole patch set, but only CC them on the cover letter and the applicable patch (I know, I know, someone is always unhappy ... sorry). RFC -> v1: * Use term "ptrace" instead of "debuggers" in patch descriptions * Added ACK/Tested-by * "mm/frame-vector: remove FOLL_FORCE usage" -> Adjust description * "mm: rename FOLL_FORCE to FOLL_PTRACE" -> Added David Hildenbrand (20): selftests/vm: anon_cow: prepare for non-anonymous COW tests selftests/vm: cow: basic COW tests for non-anonymous pages selftests/vm: cow: R/O long-term pinning reliability tests for non-anon pages mm: add early FAULT_FLAG_UNSHARE consistency checks mm: add early FAULT_FLAG_WRITE consistency checks mm: rework handling in do_wp_page() based on private vs. 
shared mappings mm: don't call vm_ops->huge_fault() in wp_huge_pmd()/wp_huge_pud() for private mappings mm: extend FAULT_FLAG_UNSHARE support to anything in a COW mapping mm/gup: reliable R/O long-term pinning in COW mappings RDMA/umem: remove FOLL_FORCE usage RDMA/usnic: remove FOLL_FORCE usage RDMA/siw: remove FOLL_FORCE usage media: videobuf-dma-sg: remove FOLL_FORCE usage drm/etnaviv: remove FOLL_FORCE usage media: pci/ivtv: remove FOLL_FORCE usage mm/frame-vector: remove FOLL_FORCE usage drm/exynos: remove FOLL_FORCE usage RDMA/hw/qib/qib_user_pages: remove FOLL_FORCE usage habanalabs: remove FOLL_FORCE usage mm: rename FOLL_FORCE to FOLL_PTRACE arch/alpha/kernel/ptrace.c| 6 +- arch/arm64/kernel/mte.c | 2 +- arch/ia64/kernel/ptrace.c | 10 +- arch/mips/kernel/ptrace32.c | 4 +- arch/mips/math-emu/dsemul.c | 2 +- arch/powerpc/kernel/ptrace/ptrace32.c | 4 +- arch/sparc/kernel/ptrace_32.c | 4 +- arch/sparc/kernel/ptrace_64.c | 8 +- arch/x86/kernel/step.c| 2 +- arch/x86/um/ptrace_32.c | 2 +- arch/x86/um/ptrace_64.c | 2 +- drivers/gpu/drm/etnaviv/etnaviv_gem.c | 8 +- drivers/gpu/drm/exynos/exynos_drm_g2d.c | 2 +- drivers/infiniband/core/umem.c| 8 +- drivers/infiniband/hw/qib/qib_user_pages.c| 2 +- drivers/infiniband/hw/usnic/usnic_uiom.c | 9 +- drivers/infiniband/sw/siw/siw_mem.c | 9 +- drivers/media/common/videobuf2/frame_vector.c | 2 +- drivers/media/pci/ivtv/ivtv-udma.c| 2 +- drivers/media/pci/ivtv/ivtv-yuv.c | 5 +- drivers/media/v4l2-core/videobuf-dma-sg.c | 14 +- dri
Re: [PATCH] mm/memory: Return vm_fault_t result from migrate_to_ram() callback
On 14.11.22 12:55, Alistair Popple wrote: The migrate_to_ram() callback should always succeed, but in rare cases can fail usually returning VM_FAULT_SIGBUS. Commit 16ce101db85d ("mm/memory.c: fix race when faulting a device private page") incorrectly stopped passing the return code up the stack. Fix this by setting the ret variable, restoring the previous behaviour on migrate_to_ram() failure. Fixes: 16ce101db85d ("mm/memory.c: fix race when faulting a device private page") Signed-off-by: Alistair Popple Cc: Ralph Campbell Cc: John Hubbard Cc: Alex Sierra Cc: Ben Skeggs Cc: Felix Kuehling Cc: Lyude Paul Cc: Jason Gunthorpe Cc: Michael Ellerman --- Hi Andrew, I noticed my series needs another small fix which I'm hoping you can apply for v6.1-rcX. Thanks (and sorry for not catching this sooner). --- mm/memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memory.c b/mm/memory.c index f88c351aecd4..8a6d5c823f91 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3763,7 +3763,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) */ get_page(vmf->page); pte_unmap_unlock(vmf->pte, vmf->ptl); - vmf->page->pgmap->ops->migrate_to_ram(vmf); + ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); put_page(vmf->page); } else if (is_hwpoison_entry(entry)) { ret = VM_FAULT_HWPOISON; Acked-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH RFC 00/19] mm/gup: remove FOLL_FORCE usage from drivers (reliable R/O long-term pinning)
On 14.11.22 07:03, Christoph Hellwig wrote: On Mon, Nov 07, 2022 at 09:27:23AM -0800, Linus Torvalds wrote: And I'd really love to just have this long saga ended, and FOLL_FORCE finally relegated to purely ptrace accesses. At that point we should also rename it to FOLL_PTRACE to make that very clear, and also break anything in-flight accidentally readding it, which I'd otherwise expect to happen. Good idea; I'll include a patch in v1. -- Thanks, David / dhildenb
Re: [PATCH RFC 16/19] mm/frame-vector: remove FOLL_FORCE usage
On 08.11.22 05:45, Tomasz Figa wrote: Hi David, Hi Tomasz, thanks for looking into this! On Tue, Nov 8, 2022 at 1:19 AM David Hildenbrand wrote: FOLL_FORCE is really only for debugger access. According to commit 707947247e95 ("media: videobuf2-vmalloc: get_userptr: buffers are always writable"), the pinned pages are always writable. As reference, the cover letter of the series: https://lkml.kernel.org/r/20221107161740.144456-1-da...@redhat.com Actually that patch is only a workaround to temporarily disable support for read-only pages as they seemed to suffer from some corruption issues in the retrieved user pages. We expect to support read-only pages as hardware input after. That said, FOLL_FORCE doesn't sound like the right thing even in that case, but I don't know the background behind it being added here in the first place. +Hans Verkuil +Marek Szyprowski do you happen to remember anything about it? Maybe I mis-interpreted 707947247e95; re-reading it again, I am not quite sure what the actual problem is and how it relates to GUP FOLL_WRITE handling. FOLL_FORCE already was in place before 707947247e95 and should be removed. What I understood is "Just always call vb2_create_framevec() with FOLL_WRITE to always allow writing to the buffers.". If the pinned page is never written to via the obtained GUP reference: * FOLL_WRITE should not be set * FOLL_FORCE should not be set * We should not dirty the page when unpinning If the pinned page may be written to via the obtained GUP reference: * FOLL_WRITE should be set * FOLL_FORCE should not be set * We should dirty the page when unpinning If the function is called for both, we should think about doing it conditional based on a "write" variable, like pre-707947247e95 did. @Hans, any insight? -- Thanks, David / dhildenb
Re: [PATCH RFC 00/19] mm/gup: remove FOLL_FORCE usage from drivers (reliable R/O long-term pinning)
On 07.11.22 18:27, Linus Torvalds wrote: On Mon, Nov 7, 2022 at 8:18 AM David Hildenbrand wrote: So instead, make R/O long-term pinning work as expected, by breaking COW in a COW mapping early, such that we can remove any FOLL_FORCE usage from drivers. Nothing makes me unhappy from a quick scan through these patches. And I'd really love to just have this long saga ended, and FOLL_FORCE finally relegated to purely ptrace accesses. So an enthusiastic Ack from me. Thanks Linus! My hope is that we can remove it from all drivers and not have to leave it in for some corner cases; so far it looks promising. -- Thanks, David / dhildenb
Re: [PATCH RFC 05/19] mm: add early FAULT_FLAG_WRITE consistency checks
On 07.11.22 20:03, Nadav Amit wrote: On Nov 7, 2022, at 8:17 AM, David Hildenbrand wrote: !! External Email Let's catch abuse of FAULT_FLAG_WRITE early, such that we don't have to care in all other handlers and might get "surprises" if we forget to do so. Write faults without VM_MAYWRITE don't make any sense, and our maybe_mkwrite() logic could have hidden such abuse for now. Write faults without VM_WRITE on something that is not a COW mapping is similarly broken, and e.g., do_wp_page() could end up placing an anonymous page into a shared mapping, which would be bad. This is a preparation for reliable R/O long-term pinning of pages in private mappings, whereby we want to make sure that we will never break COW in a read-only private mapping. Signed-off-by: David Hildenbrand --- mm/memory.c | 8 1 file changed, 8 insertions(+) diff --git a/mm/memory.c b/mm/memory.c index fe131273217a..826353da7b23 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5159,6 +5159,14 @@ static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma, */ if (!is_cow_mapping(vma->vm_flags)) *flags &= ~FAULT_FLAG_UNSHARE; + } else if (*flags & FAULT_FLAG_WRITE) { + /* Write faults on read-only mappings are impossible ... */ + if (WARN_ON_ONCE(!(vma->vm_flags & VM_MAYWRITE))) + return VM_FAULT_SIGSEGV; + /* ... and FOLL_FORCE only applies to COW mappings. */ + if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE) && +!is_cow_mapping(vma->vm_flags))) + return VM_FAULT_SIGSEGV; Not sure about the WARN_*(). Seems as if it might trigger in benign even if rare scenarios, e.g., mprotect() racing with page-fault. We most certainly would want to catch any such broken/racy cases. There are no benign cases I could possibly think of. Page faults need the mmap lock in read. mprotect() / VMA changes need the mmap lock in write. Whoever calls handle_mm_fault() is supposed to properly check VMA permissions. -- Thanks, David / dhildenb
[PATCH RFC 18/19] RDMA/hw/qib/qib_user_pages: remove FOLL_FORCE usage
FOLL_FORCE is really only for debugger access. As we unpin the pinned pages using unpin_user_pages_dirty_lock(true), the assumption is that all these pages are writable. FOLL_FORCE in this case seems to be a legacy leftover. Let's just remove it. Cc: Dennis Dalessandro Cc: Jason Gunthorpe Cc: Leon Romanovsky Signed-off-by: David Hildenbrand --- drivers/infiniband/hw/qib/qib_user_pages.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c index f4b5f05058e4..f693bc753b6b 100644 --- a/drivers/infiniband/hw/qib/qib_user_pages.c +++ b/drivers/infiniband/hw/qib/qib_user_pages.c @@ -110,7 +110,7 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages, for (got = 0; got < num_pages; got += ret) { ret = pin_user_pages(start_page + got * PAGE_SIZE, num_pages - got, -FOLL_LONGTERM | FOLL_WRITE | FOLL_FORCE, +FOLL_LONGTERM | FOLL_WRITE, p + got, NULL); if (ret < 0) { mmap_read_unlock(current->mm); -- 2.38.1
[PATCH RFC 19/19] habanalabs: remove FOLL_FORCE usage
FOLL_FORCE is really only for debugger access. As we unpin the pinned pages using unpin_user_pages_dirty_lock(true), the assumption is that all these pages are writable. FOLL_FORCE in this case seems to be due to copy-and-past from other drivers. Let's just remove it. Cc: Oded Gabbay Cc: Arnd Bergmann Cc: Greg Kroah-Hartman Signed-off-by: David Hildenbrand --- drivers/misc/habanalabs/common/memory.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/misc/habanalabs/common/memory.c b/drivers/misc/habanalabs/common/memory.c index ef28f3b37b93..e35cca96bbef 100644 --- a/drivers/misc/habanalabs/common/memory.c +++ b/drivers/misc/habanalabs/common/memory.c @@ -2312,8 +2312,7 @@ static int get_user_memory(struct hl_device *hdev, u64 addr, u64 size, if (!userptr->pages) return -ENOMEM; - rc = pin_user_pages_fast(start, npages, -FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM, + rc = pin_user_pages_fast(start, npages, FOLL_WRITE | FOLL_LONGTERM, userptr->pages); if (rc != npages) { -- 2.38.1
[PATCH RFC 16/19] mm/frame-vector: remove FOLL_FORCE usage
FOLL_FORCE is really only for debugger access. According to commit 707947247e95 ("media: videobuf2-vmalloc: get_userptr: buffers are always writable"), the pinned pages are always writable. FOLL_FORCE in this case seems to be a legacy leftover. Let's just remove it. Cc: Tomasz Figa Cc: Marek Szyprowski Cc: Mauro Carvalho Chehab Signed-off-by: David Hildenbrand --- drivers/media/common/videobuf2/frame_vector.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/media/common/videobuf2/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c index 542dde9d2609..062e98148c53 100644 --- a/drivers/media/common/videobuf2/frame_vector.c +++ b/drivers/media/common/videobuf2/frame_vector.c @@ -50,7 +50,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames, start = untagged_addr(start); ret = pin_user_pages_fast(start, nr_frames, - FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM, + FOLL_WRITE | FOLL_LONGTERM, (struct page **)(vec->ptrs)); if (ret > 0) { vec->got_ref = true; -- 2.38.1
[PATCH RFC 17/19] drm/exynos: remove FOLL_FORCE usage
FOLL_FORCE is really only for debugger access. As we unpin the pinned pages using unpin_user_pages_dirty_lock(true), the assumption is that all these pages are writable. FOLL_FORCE in this case seems to be a legacy leftover. Let's just remove it. Cc: Inki Dae Cc: Seung-Woo Kim Cc: Kyungmin Park Cc: David Airlie Cc: Daniel Vetter Cc: Krzysztof Kozlowski Signed-off-by: David Hildenbrand --- drivers/gpu/drm/exynos/exynos_drm_g2d.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/exynos/exynos_drm_g2d.c b/drivers/gpu/drm/exynos/exynos_drm_g2d.c index 471fd6c8135f..e19c2ceb3759 100644 --- a/drivers/gpu/drm/exynos/exynos_drm_g2d.c +++ b/drivers/gpu/drm/exynos/exynos_drm_g2d.c @@ -477,7 +477,7 @@ static dma_addr_t *g2d_userptr_get_dma_addr(struct g2d_data *g2d, } ret = pin_user_pages_fast(start, npages, - FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM, + FOLL_WRITE | FOLL_LONGTERM, g2d_userptr->pages); if (ret != npages) { DRM_DEV_ERROR(g2d->dev, -- 2.38.1
[PATCH RFC 15/19] media: pci/ivtv: remove FOLL_FORCE usage
FOLL_FORCE is really only for debugger access. R/O pinning a page is supposed to fail if the VMA misses proper access permissions (no VM_READ). Let's just remove FOLL_FORCE usage here; there would have to be a pretty good reason to allow arbitrary drivers to R/O pin pages in a PROT_NONE VMA. Most probably, FOLL_FORCE usage is just some legacy leftover. Cc: Andy Walls Cc: Mauro Carvalho Chehab Signed-off-by: David Hildenbrand --- drivers/media/pci/ivtv/ivtv-udma.c | 2 +- drivers/media/pci/ivtv/ivtv-yuv.c | 5 ++--- 2 files changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/media/pci/ivtv/ivtv-udma.c b/drivers/media/pci/ivtv/ivtv-udma.c index 210be8290f24..99b9f55ca829 100644 --- a/drivers/media/pci/ivtv/ivtv-udma.c +++ b/drivers/media/pci/ivtv/ivtv-udma.c @@ -115,7 +115,7 @@ int ivtv_udma_setup(struct ivtv *itv, unsigned long ivtv_dest_addr, /* Pin user pages for DMA Xfer */ err = pin_user_pages_unlocked(user_dma.uaddr, user_dma.page_count, - dma->map, FOLL_FORCE); + dma->map, 0); if (user_dma.page_count != err) { IVTV_DEBUG_WARN("failed to map user pages, returned %d instead of %d\n", diff --git a/drivers/media/pci/ivtv/ivtv-yuv.c b/drivers/media/pci/ivtv/ivtv-yuv.c index 4ba10c34a16a..582146f8d70d 100644 --- a/drivers/media/pci/ivtv/ivtv-yuv.c +++ b/drivers/media/pci/ivtv/ivtv-yuv.c @@ -63,12 +63,11 @@ static int ivtv_yuv_prep_user_dma(struct ivtv *itv, struct ivtv_user_dma *dma, /* Pin user pages for DMA Xfer */ y_pages = pin_user_pages_unlocked(y_dma.uaddr, - y_dma.page_count, &dma->map[0], FOLL_FORCE); + y_dma.page_count, &dma->map[0], 0); uv_pages = 0; /* silence gcc. value is set and consumed only if: */ if (y_pages == y_dma.page_count) { uv_pages = pin_user_pages_unlocked(uv_dma.uaddr, - uv_dma.page_count, &dma->map[y_pages], - FOLL_FORCE); + uv_dma.page_count, &dma->map[y_pages], 0); } if (y_pages != y_dma.page_count || uv_pages != uv_dma.page_count) { -- 2.38.1
[PATCH RFC 14/19] drm/etnaviv: remove FOLL_FORCE usage
GUP now supports reliable R/O long-term pinning in COW mappings, such that we break COW early. MAP_SHARED VMAs only use the shared zeropage so far in one corner case (DAXFS file with holes), which can be ignored because GUP does not support long-term pinning in fsdax (see check_vma_flags()). commit cd5297b0855f ("drm/etnaviv: Use FOLL_FORCE for userptr") documents that FOLL_FORCE | FOLL_WRITE was really only used for reliable R/O pinning. Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop using FOLL_FORCE, which is really only for debugger access. Cc: Daniel Vetter Cc: Lucas Stach Cc: Russell King Cc: Christian Gmeiner Cc: David Airlie Signed-off-by: David Hildenbrand --- drivers/gpu/drm/etnaviv/etnaviv_gem.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem.c b/drivers/gpu/drm/etnaviv/etnaviv_gem.c index cc386f8a7116..efe2240945d0 100644 --- a/drivers/gpu/drm/etnaviv/etnaviv_gem.c +++ b/drivers/gpu/drm/etnaviv/etnaviv_gem.c @@ -638,6 +638,7 @@ static int etnaviv_gem_userptr_get_pages(struct etnaviv_gem_object *etnaviv_obj) struct page **pvec = NULL; struct etnaviv_gem_userptr *userptr = &etnaviv_obj->userptr; int ret, pinned = 0, npages = etnaviv_obj->base.size >> PAGE_SHIFT; + unsigned int gup_flags = FOLL_LONGTERM; might_lock_read(&current->mm->mmap_lock); @@ -648,14 +649,15 @@ static int etnaviv_gem_userptr_get_pages(struct etnaviv_gem_object *etnaviv_obj) if (!pvec) return -ENOMEM; + if (!userptr->ro) + gup_flags |= FOLL_WRITE; + do { unsigned num_pages = npages - pinned; uint64_t ptr = userptr->ptr + pinned * PAGE_SIZE; struct page **pages = pvec + pinned; - ret = pin_user_pages_fast(ptr, num_pages, - FOLL_WRITE | FOLL_FORCE | FOLL_LONGTERM, - pages); + ret = pin_user_pages_fast(ptr, num_pages, gup_flags, pages); if (ret < 0) { unpin_user_pages(pvec, pinned); kvfree(pvec); -- 2.38.1
[PATCH RFC 13/19] media: videobuf-dma-sg: remove FOLL_FORCE usage
GUP now supports reliable R/O long-term pinning in COW mappings, such that we break COW early. MAP_SHARED VMAs only use the shared zeropage so far in one corner case (DAXFS file with holes), which can be ignored because GUP does not support long-term pinning in fsdax (see check_vma_flags()). Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop using FOLL_FORCE, which is really only for debugger access. Cc: Mauro Carvalho Chehab Signed-off-by: David Hildenbrand --- drivers/media/v4l2-core/videobuf-dma-sg.c | 14 +- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c index f75e5eedeee0..234e9f647c96 100644 --- a/drivers/media/v4l2-core/videobuf-dma-sg.c +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c @@ -151,17 +151,16 @@ static void videobuf_dma_init(struct videobuf_dmabuf *dma) static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma, int direction, unsigned long data, unsigned long size) { + unsigned int gup_flags = FOLL_LONGTERM; unsigned long first, last; - int err, rw = 0; - unsigned int flags = FOLL_FORCE; + int err; dma->direction = direction; switch (dma->direction) { case DMA_FROM_DEVICE: - rw = READ; + gup_flags |= FOLL_WRITE; break; case DMA_TO_DEVICE: - rw = WRITE; break; default: BUG(); @@ -177,14 +176,11 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma, if (NULL == dma->pages) return -ENOMEM; - if (rw == READ) - flags |= FOLL_WRITE; - dprintk(1, "init user [0x%lx+0x%lx => %lu pages]\n", data, size, dma->nr_pages); - err = pin_user_pages(data & PAGE_MASK, dma->nr_pages, -flags | FOLL_LONGTERM, dma->pages, NULL); + err = pin_user_pages(data & PAGE_MASK, dma->nr_pages, gup_flags, +dma->pages, NULL); if (err != dma->nr_pages) { dma->nr_pages = (err >= 0) ? err : 0; -- 2.38.1
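The direction-to-flags mapping in the videobuf patch above can be summarized standalone (illustrative constants, not the kernel's):

```c
#include <assert.h>

/* Illustrative stand-ins for kernel constants. */
#define FOLL_WRITE    0x1u
#define FOLL_LONGTERM 0x2u

enum dma_data_direction { DMA_TO_DEVICE, DMA_FROM_DEVICE };

/*
 * DMA_FROM_DEVICE means the device writes into the pinned user
 * pages, so the pin must be a writable one; DMA_TO_DEVICE only
 * reads from them and a R/O long-term pin suffices.
 */
static unsigned int videobuf_gup_flags(enum dma_data_direction dir)
{
	unsigned int gup_flags = FOLL_LONGTERM;

	if (dir == DMA_FROM_DEVICE)
		gup_flags |= FOLL_WRITE;
	return gup_flags;
}
```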
[PATCH RFC 12/19] RDMA/siw: remove FOLL_FORCE usage
GUP now supports reliable R/O long-term pinning in COW mappings, such that we break COW early. MAP_SHARED VMAs only use the shared zeropage so far in one corner case (DAXFS file with holes), which can be ignored because GUP does not support long-term pinning in fsdax (see check_vma_flags()). Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop using FOLL_FORCE, which is really only for debugger access. Cc: Bernard Metzler Cc: Jason Gunthorpe Cc: Leon Romanovsky Signed-off-by: David Hildenbrand --- drivers/infiniband/sw/siw/siw_mem.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c index 61c17db70d65..b2b33dd3b4fa 100644 --- a/drivers/infiniband/sw/siw/siw_mem.c +++ b/drivers/infiniband/sw/siw/siw_mem.c @@ -368,7 +368,7 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable) struct mm_struct *mm_s; u64 first_page_va; unsigned long mlock_limit; - unsigned int foll_flags = FOLL_WRITE; + unsigned int foll_flags = FOLL_LONGTERM; int num_pages, num_chunks, i, rv = 0; if (!can_do_mlock()) @@ -391,8 +391,8 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable) mmgrab(mm_s); - if (!writable) - foll_flags |= FOLL_FORCE; + if (writable) + foll_flags |= FOLL_WRITE; mmap_read_lock(mm_s); @@ -423,8 +423,7 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable) while (nents) { struct page **plist = &umem->page_chunk[i].plist[got]; - rv = pin_user_pages(first_page_va, nents, - foll_flags | FOLL_LONGTERM, + rv = pin_user_pages(first_page_va, nents, foll_flags, plist, NULL); if (rv < 0) goto out_sem_up; -- 2.38.1
[PATCH RFC 11/19] RDMA/usnic: remove FOLL_FORCE usage
GUP now supports reliable R/O long-term pinning in COW mappings, such that we break COW early. MAP_SHARED VMAs only use the shared zeropage so far in one corner case (DAXFS file with holes), which can be ignored because GUP does not support long-term pinning in fsdax (see check_vma_flags()). Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop using FOLL_FORCE, which is really only for debugger access. Cc: Christian Benvenuti Cc: Nelson Escobar Cc: Jason Gunthorpe Cc: Leon Romanovsky Signed-off-by: David Hildenbrand --- drivers/infiniband/hw/usnic/usnic_uiom.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c index 67923ced6e2d..c301b3be9f30 100644 --- a/drivers/infiniband/hw/usnic/usnic_uiom.c +++ b/drivers/infiniband/hw/usnic/usnic_uiom.c @@ -85,6 +85,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable, int dmasync, struct usnic_uiom_reg *uiomr) { struct list_head *chunk_list = &uiomr->chunk_list; + unsigned int gup_flags = FOLL_LONGTERM; struct page **page_list; struct scatterlist *sg; struct usnic_uiom_chunk *chunk; @@ -96,7 +97,6 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable, int off; int i; dma_addr_t pa; - unsigned int gup_flags; struct mm_struct *mm; /* @@ -131,8 +131,8 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable, goto out; } - gup_flags = FOLL_WRITE; - gup_flags |= (writable) ? 0 : FOLL_FORCE; + if (writable) + gup_flags |= FOLL_WRITE; cur_base = addr & PAGE_MASK; ret = 0; @@ -140,8 +140,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable, ret = pin_user_pages(cur_base, min_t(unsigned long, npages, PAGE_SIZE / sizeof(struct page *)), -gup_flags | FOLL_LONGTERM, -page_list, NULL); +gup_flags, page_list, NULL); if (ret < 0) goto out; -- 2.38.1
[PATCH RFC 10/19] RDMA/umem: remove FOLL_FORCE usage
GUP now supports reliable R/O long-term pinning in COW mappings, such that we break COW early. MAP_SHARED VMAs only use the shared zeropage so far in one corner case (DAXFS file with holes), which can be ignored because GUP does not support long-term pinning in fsdax (see check_vma_flags()). Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop using FOLL_FORCE, which is really only for debugger access. Cc: Jason Gunthorpe Cc: Leon Romanovsky Signed-off-by: David Hildenbrand --- drivers/infiniband/core/umem.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 86d479772fbc..755a9c57db6f 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -156,7 +156,7 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr, struct mm_struct *mm; unsigned long npages; int pinned, ret; - unsigned int gup_flags = FOLL_WRITE; + unsigned int gup_flags = FOLL_LONGTERM; /* * If the combination of the addr and size requested for this memory @@ -210,8 +210,8 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr, cur_base = addr & PAGE_MASK; - if (!umem->writable) - gup_flags |= FOLL_FORCE; + if (umem->writable) + gup_flags |= FOLL_WRITE; while (npages) { cond_resched(); @@ -219,7 +219,7 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr, min_t(unsigned long, npages, PAGE_SIZE / sizeof(struct page *)), - gup_flags | FOLL_LONGTERM, page_list); + gup_flags, page_list); if (pinned < 0) { ret = pinned; goto umem_release; -- 2.38.1
[PATCH RFC 08/19] mm: extend FAULT_FLAG_UNSHARE support to anything in a COW mapping
Extend FAULT_FLAG_UNSHARE to break COW on anything mapped into a COW (i.e., private writable) mapping and adjust the documentation accordingly. FAULT_FLAG_UNSHARE will now also break COW when encountering the shared zeropage, a pagecache page, a PFNMAP, ... inside a COW mapping, by properly replacing the mapped page/pfn by a private copy (an exclusive anonymous page). Note that only do_wp_page() needs care: hugetlb_wp() already handles FAULT_FLAG_UNSHARE correctly. wp_huge_pmd()/wp_huge_pud() also handles it correctly, for example, splitting the huge zeropage on FAULT_FLAG_UNSHARE such that we can handle FAULT_FLAG_UNSHARE on the PTE level. This change is a requirement for reliable long-term R/O pinning in COW mappings. Signed-off-by: David Hildenbrand --- include/linux/mm_types.h | 8 mm/memory.c | 4 2 files changed, 4 insertions(+), 8 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 834022721bc6..3f9fa01a3e24 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -965,9 +965,9 @@ typedef struct { * @FAULT_FLAG_REMOTE: The fault is not for current task/mm. * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch. * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals. - * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to unshare (and mark - * exclusive) a possibly shared anonymous page that is - * mapped R/O. + * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to break COW in a + * COW mapping, making sure that an exclusive anon page is + * mapped after the fault. * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached. *We should only access orig_pte if this flag set. * @@ -992,7 +992,7 @@ typedef struct { * * The combination FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE is illegal. * FAULT_FLAG_UNSHARE is ignored and treated like an ordinary read fault when - * no existing R/O-mapped anonymous page is encountered. 
+ * applied to mappings that are not COW mappings. */ enum fault_flag { FAULT_FLAG_WRITE = 1 << 0, diff --git a/mm/memory.c b/mm/memory.c index d2f9673755be..73ed83def548 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3431,10 +3431,6 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) } wp_page_reuse(vmf); return 0; - } else if (unshare) { - /* No anonymous page -> nothing to do. */ - pte_unmap_unlock(vmf->pte, vmf->ptl); - return 0; } copy: /* -- 2.38.1
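For reference, the "COW (i.e., private writable) mapping" test this series keeps leaning on boils down to: not MAP_SHARED, but writable or at least re-mappable as writable. A userspace sketch of the predicate, with illustrative VM_* bit values:

```c
#include <assert.h>

/* Illustrative stand-ins for the kernel's VM_* bits. */
#define VM_SHARED   0x1ul
#define VM_MAYWRITE 0x2ul

/*
 * Sketch of is_cow_mapping(): a private mapping where writes must be
 * resolved by copying into an exclusive anonymous page rather than
 * modifying the underlying file/zeropage.
 */
static int is_cow_mapping(unsigned long vm_flags)
{
	return (vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}
```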
[PATCH RFC 09/19] mm/gup: reliable R/O long-term pinning in COW mappings
We already support reliable R/O pinning of anonymous memory. However, assume we end up pinning (R/O long-term) a pagecache page or the shared zeropage inside a writable private ("COW") mapping. The next write access will trigger a write-fault and replace the pinned page by an exclusive anonymous page in the process page tables to break COW: the pinned page no longer corresponds to the page mapped into the process' page table. Now that FAULT_FLAG_UNSHARE can break COW on anything mapped into a COW mapping, let's properly break COW first before R/O long-term pinning something that's not an exclusive anon page inside a COW mapping. FAULT_FLAG_UNSHARE will break COW and map an exclusive anon page instead that can get pinned safely. With this change, we can stop using FOLL_FORCE|FOLL_WRITE for reliable R/O long-term pinning in COW mappings. With this change, the new R/O long-term pinning tests for non-anonymous memory succeed: # [RUN] R/O longterm GUP pin ... with shared zeropage ok 151 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with memfd ok 152 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with tmpfile ok 153 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with huge zeropage ok 154 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with memfd hugetlb (2048 kB) ok 155 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with memfd hugetlb (1048576 kB) ok 156 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with shared zeropage ok 157 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with memfd ok 158 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with tmpfile ok 159 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with huge zeropage ok 160 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (2048 kB) ok 161 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... 
with memfd hugetlb (1048576 kB) ok 162 Longterm R/O pin is reliable Note 1: We don't care about short-term R/O pins, because they have snapshot semantics: they are not supposed to observe modifications that happen after pinning. As one example, assume we start direct I/O to read from a page and store page content into a file: modifications to page content after starting direct I/O are not guaranteed to end up in the file. So even if we'd pin the shared zeropage, the end result would be as expected -- getting zeroes stored to the file. Note 2: For shared mappings we'll now always fall back to the slow path to look up the VMA when R/O long-term pinning. While that's the necessary price we have to pay right now, it's actually not that bad in practice: most FOLL_LONGTERM users already specify FOLL_WRITE, for example, along with FOLL_FORCE because they tried dealing with COW mappings correctly ... Note 3: For users that use FOLL_LONGTERM right now without FOLL_WRITE, such as VFIO, we'd now no longer pin the shared zeropage. Instead, we'd populate exclusive anon pages that we can pin. There was a concern that this could affect the memlock limit of existing setups. For example, a VM running with VFIO could run into the memlock limit and fail to run. However, we essentially had the same behavior already in commit 17839856fd58 ("gup: document and work around "COW can break either way" issue") which got merged into some enterprise distros, and there were not any such complaints. So most probably, we're fine. 
Signed-off-by: David Hildenbrand --- include/linux/mm.h | 27 --- mm/gup.c | 10 +- mm/huge_memory.c | 2 +- mm/hugetlb.c | 7 --- 4 files changed, 34 insertions(+), 12 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 517c8cc8ccb9..3252ed88b472 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3002,8 +3002,12 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags) * Must be called with the (sub)page that's actually referenced via the * page table entry, which might not necessarily be the head page for a * PTE-mapped THP. + * + * If the vma is NULL, we're coming from the GUP-fast path and might have + * to fallback to the slow path just to lookup the vma. */ -static inline bool gup_must_unshare(unsigned int flags, struct page *page) +static inline bool gup_must_unshare(struct vm_area_struct *vma, + unsigned int flags, struct page *page) { /* * FOLL_WRITE is implicitly handled correctly as the page table entry @@ -3016,8 +3020,25 @@ static inline bool gup_must_unshare(unsigned int flags, struct page *page) * Note: PageAnon(page) is stable until the page is actually getting * freed. */ - if (!PageAnon(page)) - return false; + if (!PageAnon(page)) { +
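Condensed, the decision this patch implements for R/O long-term pins can be modeled in userspace. This is a simplification of the real gup_must_unshare()/FAULT_FLAG_UNSHARE interplay, not kernel code:

```c
#include <assert.h>

/*
 * Simplified model: when long-term pinning R/O inside a COW mapping,
 * anything that is not an exclusive anonymous page (shared zeropage,
 * pagecache page, shared anon page) must first be replaced via
 * FAULT_FLAG_UNSHARE, so a later COW break cannot leave the pin
 * pointing at a page no longer mapped into the process.
 */
static int needs_unshare_before_ro_pin(int cow_mapping, int anon_page,
				       int anon_exclusive)
{
	if (!cow_mapping)
		return 0;	/* shared mapping: no COW break can occur */
	if (anon_page && anon_exclusive)
		return 0;	/* already stable for this process */
	return 1;		/* zeropage/pagecache/shared anon: unshare */
}
```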
[PATCH RFC 07/19] mm: don't call vm_ops->huge_fault() in wp_huge_pmd()/wp_huge_pud() for private mappings
If we already have a PMD/PUD mapped write-protected in a private mapping and we want to break COW either due to FAULT_FLAG_WRITE or FAULT_FLAG_UNSHARE, there is no need to inform the file system just like on the PTE path. Let's just split (->zap) + fallback in that case. This is a preparation for more generic FAULT_FLAG_UNSHARE support in COW mappings. Signed-off-by: David Hildenbrand --- mm/memory.c | 24 +++- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 41e4c697033a..d2f9673755be 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4791,6 +4791,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf) static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) { const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE; + vm_fault_t ret; if (vma_is_anonymous(vmf->vma)) { if (likely(!unshare) && @@ -4798,11 +4799,13 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) return handle_userfault(vmf, VM_UFFD_WP); return do_huge_pmd_wp_page(vmf); } - if (vmf->vma->vm_ops->huge_fault) { - vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD); - if (!(ret & VM_FAULT_FALLBACK)) - return ret; + if (vmf->vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) { + if (vmf->vma->vm_ops->huge_fault) { + ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD); + if (!(ret & VM_FAULT_FALLBACK)) + return ret; + } } /* COW or write-notify handled on pte level: split pmd. 
*/ @@ -4828,14 +4831,17 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud) { #if defined(CONFIG_TRANSPARENT_HUGEPAGE) &&\ defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) + vm_fault_t ret; + /* No support for anonymous transparent PUD pages yet */ if (vma_is_anonymous(vmf->vma)) goto split; - if (vmf->vma->vm_ops->huge_fault) { - vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD); - - if (!(ret & VM_FAULT_FALLBACK)) - return ret; + if (vmf->vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) { + if (vmf->vma->vm_ops->huge_fault) { + ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD); + if (!(ret & VM_FAULT_FALLBACK)) + return ret; + } } split: /* COW or write-notify not handled on PUD level: split pud.*/ -- 2.38.1
[PATCH RFC 06/19] mm: rework handling in do_wp_page() based on private vs. shared mappings
We want to extent FAULT_FLAG_UNSHARE support to anything mapped into a COW mapping (pagecache page, zeropage, PFN, ...), not just anonymous pages. Let's prepare for that by handling shared mappings first such that we can handle private mappings last. While at it, use folio-based functions instead of page-based functions where we touch the code either way. Signed-off-by: David Hildenbrand --- mm/memory.c | 38 +- 1 file changed, 17 insertions(+), 21 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 826353da7b23..41e4c697033a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3341,7 +3341,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) { const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE; struct vm_area_struct *vma = vmf->vma; - struct folio *folio; + struct folio *folio = NULL; if (likely(!unshare)) { if (userfaultfd_pte_wp(vma, *vmf->pte)) { @@ -3359,13 +3359,12 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) } vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte); - if (!vmf->page) { - if (unlikely(unshare)) { - /* No anonymous page -> nothing to do. */ - pte_unmap_unlock(vmf->pte, vmf->ptl); - return 0; - } + /* +* Shared mapping: we are guaranteed to have VM_WRITE and +* FAULT_FLAG_WRITE set at this point. +*/ + if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) { /* * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a * VM_PFNMAP VMA. @@ -3373,20 +3372,19 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) * We should not cow pages in a shared writeable mapping. * Just mark the pages writable and/or call ops->pfn_mkwrite. */ - if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == -(VM_WRITE|VM_SHARED)) + if (!vmf->page) return wp_pfn_shared(vmf); - - pte_unmap_unlock(vmf->pte, vmf->ptl); - return wp_page_copy(vmf); + return wp_page_shared(vmf); } + if (vmf->page) + folio = page_folio(vmf->page); + /* -* Take out anonymous pages first, anonymous shared vmas are -* not dirty accountable. 
+* Private mapping: create an exclusive anonymous page copy if reuse +* is impossible. We might miss VM_WRITE for FOLL_FORCE handling. */ - folio = page_folio(vmf->page); - if (folio_test_anon(folio)) { + if (folio && folio_test_anon(folio)) { /* * If the page is exclusive to this process we must reuse the * page without further checks. @@ -3437,19 +3435,17 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) /* No anonymous page -> nothing to do. */ pte_unmap_unlock(vmf->pte, vmf->ptl); return 0; - } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == - (VM_WRITE|VM_SHARED))) { - return wp_page_shared(vmf); } copy: /* * Ok, we need to copy. Oh, well.. */ - get_page(vmf->page); + if (folio) + folio_get(folio); pte_unmap_unlock(vmf->pte, vmf->ptl); #ifdef CONFIG_KSM - if (PageKsm(vmf->page)) + if (folio && folio_test_ksm(folio)) count_vm_event(COW_KSM); #endif return wp_page_copy(vmf); -- 2.38.1
[PATCH RFC 04/19] mm: add early FAULT_FLAG_UNSHARE consistency checks
For now, FAULT_FLAG_UNSHARE only applies to anonymous pages, which
implies a COW mapping. Let's hide FAULT_FLAG_UNSHARE early if we're not
dealing with a COW mapping, such that we treat it like a read fault as
documented and don't have to worry about the flag throughout all fault
handlers.

While at it, centralize the check for mutual exclusion of
FAULT_FLAG_UNSHARE and FAULT_FLAG_WRITE and just drop the check that
either flag is set in the WP handler.

Signed-off-by: David Hildenbrand
---
 mm/huge_memory.c |  3 ---
 mm/hugetlb.c     |  5 -----
 mm/memory.c      | 23 ++++++++++++++++++++---
 3 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1d47b3f7b877..7173756d6868 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1267,9 +1267,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
 	VM_BUG_ON_VMA(!vma->anon_vma, vma);
 
-	VM_BUG_ON(unshare && (vmf->flags & FAULT_FLAG_WRITE));
-	VM_BUG_ON(!unshare && !(vmf->flags & FAULT_FLAG_WRITE));
-
 	if (is_huge_zero_pmd(orig_pmd))
 		goto fallback;
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index be572af75d9c..3672c7e06748 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5316,9 +5316,6 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long haddr = address & huge_page_mask(h);
 	struct mmu_notifier_range range;
 
-	VM_BUG_ON(unshare && (flags & FOLL_WRITE));
-	VM_BUG_ON(!unshare && !(flags & FOLL_WRITE));
-
 	/*
 	 * hugetlb does not support FOLL_FORCE-style write faults that keep the
 	 * PTE mapped R/O such as maybe_mkwrite() would do.
@@ -5328,8 +5325,6 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 	/* Let's take out MAP_SHARED mappings first. */
 	if (vma->vm_flags & VM_MAYSHARE) {
-		if (unlikely(unshare))
-			return 0;
 		set_huge_ptep_writable(vma, haddr, ptep);
 		return 0;
 	}
 
diff --git a/mm/memory.c b/mm/memory.c
index 78e2c58f6f31..fe131273217a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3343,9 +3343,6 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio;
 
-	VM_BUG_ON(unshare && (vmf->flags & FAULT_FLAG_WRITE));
-	VM_BUG_ON(!unshare && !(vmf->flags & FAULT_FLAG_WRITE));
-
 	if (likely(!unshare)) {
 		if (userfaultfd_pte_wp(vma, *vmf->pte)) {
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -5150,6 +5147,22 @@ static void lru_gen_exit_fault(void)
 }
 #endif /* CONFIG_LRU_GEN */
 
+static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma,
+				       unsigned int *flags)
+{
+	if (unlikely(*flags & FAULT_FLAG_UNSHARE)) {
+		if (WARN_ON_ONCE(*flags & FAULT_FLAG_WRITE))
+			return VM_FAULT_SIGSEGV;
+		/*
+		 * FAULT_FLAG_UNSHARE only applies to COW mappings. Let's
+		 * just treat it like an ordinary read-fault otherwise.
+		 */
+		if (!is_cow_mapping(vma->vm_flags))
+			*flags &= ~FAULT_FLAG_UNSHARE;
+	}
+	return 0;
+}
+
 /*
  * By the time we get here, we already hold the mm semaphore
  *
@@ -5166,6 +5179,10 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	count_vm_event(PGFAULT);
 	count_memcg_event_mm(vma->vm_mm, PGFAULT);
 
+	ret = sanitize_fault_flags(vma, &flags);
+	if (ret)
+		return ret;
+
 	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
 					    flags & FAULT_FLAG_INSTRUCTION,
 					    flags & FAULT_FLAG_REMOTE))
-- 
2.38.1
[PATCH RFC 05/19] mm: add early FAULT_FLAG_WRITE consistency checks
Let's catch abuse of FAULT_FLAG_WRITE early, such that we don't have to
care in all other handlers and might get "surprises" if we forget to do
so.

Write faults without VM_MAYWRITE don't make any sense, and our
maybe_mkwrite() logic could have hidden such abuse for now.

Write faults without VM_WRITE on something that is not a COW mapping is
similarly broken, and e.g., do_wp_page() could end up placing an
anonymous page into a shared mapping, which would be bad.

This is a preparation for reliable R/O long-term pinning of pages in
private mappings, whereby we want to make sure that we will never break
COW in a read-only private mapping.

Signed-off-by: David Hildenbrand
---
 mm/memory.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index fe131273217a..826353da7b23 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5159,6 +5159,14 @@ static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma,
 		 */
 		if (!is_cow_mapping(vma->vm_flags))
 			*flags &= ~FAULT_FLAG_UNSHARE;
+	} else if (*flags & FAULT_FLAG_WRITE) {
+		/* Write faults on read-only mappings are impossible ... */
+		if (WARN_ON_ONCE(!(vma->vm_flags & VM_MAYWRITE)))
+			return VM_FAULT_SIGSEGV;
+		/* ... and FOLL_FORCE only applies to COW mappings. */
+		if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE) &&
+				 !is_cow_mapping(vma->vm_flags)))
+			return VM_FAULT_SIGSEGV;
 	}
 	return 0;
 }
-- 
2.38.1
[PATCH RFC 03/19] selftests/vm: cow: R/O long-term pinning reliability tests for non-anon pages
Let's test whether R/O long-term pinning is reliable for non-anonymous
memory: when R/O long-term pinning a page, the expectation is that we
break COW early before pinning, such that actual write access via the
page tables won't break COW later and end up replacing the R/O-pinned
page in the page table.

Consequently, R/O long-term pinning in private mappings would only
target exclusive anonymous pages.

For now, all tests fail:

# [RUN] R/O longterm GUP pin ... with shared zeropage
not ok 151 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with memfd
not ok 152 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with tmpfile
not ok 153 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with huge zeropage
not ok 154 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with memfd hugetlb (2048 kB)
not ok 155 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with memfd hugetlb (1048576 kB)
not ok 156 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with shared zeropage
not ok 157 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with memfd
not ok 158 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with tmpfile
not ok 159 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with huge zeropage
not ok 160 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (2048 kB)
not ok 161 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (1048576 kB)
not ok 162 Longterm R/O pin is reliable

Signed-off-by: David Hildenbrand
---
 tools/testing/selftests/vm/cow.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vm/cow.c b/tools/testing/selftests/vm/cow.c
index 93c643bcdcf5..40ba45d0c6b4 100644
--- a/tools/testing/selftests/vm/cow.c
+++ b/tools/testing/selftests/vm/cow.c
@@ -534,6 +534,7 @@ static void test_iouring_fork(char *mem, size_t size)
 #endif /* LOCAL_CONFIG_HAVE_LIBURING */
 
 enum ro_pin_test {
+	RO_PIN_TEST,
 	RO_PIN_TEST_SHARED,
 	RO_PIN_TEST_PREVIOUSLY_SHARED,
 	RO_PIN_TEST_RO_EXCLUSIVE,
@@ -566,6 +567,8 @@ static void do_test_ro_pin(char *mem, size_t size, enum ro_pin_test test,
 	}
 
 	switch (test) {
+	case RO_PIN_TEST:
+		break;
 	case RO_PIN_TEST_SHARED:
 	case RO_PIN_TEST_PREVIOUSLY_SHARED:
 		/*
@@ -1150,6 +1153,16 @@ static void test_cow(char *mem, const char *smem, size_t size)
 	free(old);
 }
 
+static void test_ro_pin(char *mem, const char *smem, size_t size)
+{
+	do_test_ro_pin(mem, size, RO_PIN_TEST, false);
+}
+
+static void test_ro_fast_pin(char *mem, const char *smem, size_t size)
+{
+	do_test_ro_pin(mem, size, RO_PIN_TEST, true);
+}
+
 static void run_with_zeropage(non_anon_test_fn fn, const char *desc)
 {
 	char *mem, *smem, tmp;
@@ -1390,7 +1403,7 @@ struct non_anon_test_case {
 };
 
 /*
- * Test cases that target any pages in private mappings that are non anonymous:
+ * Test cases that target any pages in private mappings that are not anonymous:
  * pages that may get shared via COW independent of fork(). This includes
  * the shared zeropage(s), pagecache pages, ...
  */
@@ -1403,6 +1416,19 @@ static const struct non_anon_test_case non_anon_test_cases[] = {
 		"Basic COW",
 		test_cow,
 	},
+	/*
+	 * Take a R/O longterm pin. When modifying the page via the page table,
+	 * the page content change must be visible via the pin.
+	 */
+	{
+		"R/O longterm GUP pin",
+		test_ro_pin,
+	},
+	/* Same as above, but using GUP-fast. */
+	{
+		"R/O longterm GUP-fast pin",
+		test_ro_fast_pin,
+	},
 };
 
 static void run_non_anon_test_case(struct non_anon_test_case const *test_case)
-- 
2.38.1