Re: [PATCH v13 2/8] mm/gup: Introduce check_and_migrate_movable_folios()
On 04.04.24 09:26, Vivek Kasireddy wrote: This helper is the folio equivalent of check_and_migrate_movable_pages(). Therefore, all the rules that apply to check_and_migrate_movable_pages() also apply to this one as well. Currently, this helper is only used by memfd_pin_folios(). This patch also includes changes to rename and convert the internal functions collect_longterm_unpinnable_pages() and migrate_longterm_unpinnable_pages() to work on folios. As a result, check_and_migrate_movable_pages() is now a wrapper around check_and_migrate_movable_folios(). Cc: David Hildenbrand Cc: Matthew Wilcox Cc: Christoph Hellwig Cc: Jason Gunthorpe Cc: Peter Xu Suggested-by: David Hildenbrand Signed-off-by: Vivek Kasireddy --- [...] +/* + * Check whether all folios are *allowed* to be pinned indefinitely (longterm). + * Rather confusingly, all folios in the range are required to be pinned via + * FOLL_PIN, before calling this routine. + * + * If any folios in the range are not allowed to be pinned, then this routine + * will migrate those folios away, unpin all the folios in the range and return + * -EAGAIN. The caller should re-pin the entire range with FOLL_PIN and then + * call this routine again. + * + * If an error other than -EAGAIN occurs, this indicates a migration failure. + * The caller should give up, and propagate the error back up the call stack. + * + * If everything is OK and all folios in the range are allowed to be pinned, + * then this routine leaves all folios pinned and returns zero for success. + */ +static long check_and_migrate_movable_folios(unsigned long nr_folios, +					     struct folio **folios) +{ + unsigned long collected; + LIST_HEAD(movable_folio_list); + + collected = collect_longterm_unpinnable_folios(&movable_folio_list, + nr_folios, folios); + if (!collected) + return 0; + + return migrate_longterm_unpinnable_folios(&movable_folio_list, + nr_folios, folios); +} + /* * Check whether all pages are *allowed* to be pinned. 
Rather confusingly, all * pages in the range are required to be pinned via FOLL_PIN, before calling [...] Likely we should just drop that comment and refer to check_and_migrate_movable_folios() instead. No need to duplicate all that. @@ -2555,16 +2585,20 @@ static int migrate_longterm_unpinnable_pages( static long check_and_migrate_movable_pages(unsigned long nr_pages, struct page **pages) { - unsigned long collected; - LIST_HEAD(movable_page_list); + struct folio **folios; + long i, ret; - collected = collect_longterm_unpinnable_pages(&movable_page_list, - nr_pages, pages); - if (!collected) - return 0; + folios = kmalloc_array(nr_pages, sizeof(*folios), GFP_KERNEL); + if (!folios) + return -ENOMEM; - return migrate_longterm_unpinnable_pages(&movable_page_list, nr_pages, - pages); + for (i = 0; i < nr_pages; i++) + folios[i] = page_folio(pages[i]); I wonder if we have to handle pages[i] being NULL. Hopefully not :) Looks straight forward now: Acked-by: David Hildenbrand -- Cheers, David / dhildenb
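The wrapper shape acked above — build a temporary folio array once in the pages-based entry point, then share the folio-only helpers — can be sketched in userspace C. Everything here is a stand-in: the struct layouts, the stubbed migration check, and plain malloc() in place of kmalloc_array() are illustrative assumptions, not the kernel implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in types: a page records which folio it belongs to. */
struct folio { int id; };
struct page { struct folio *folio; };

static struct folio *page_folio(struct page *page)
{
	return page->folio;
}

/* Stub: pretend every folio is allowed to stay longterm-pinned. */
static long check_and_migrate_movable_folios(unsigned long nr_folios,
					     struct folio **folios)
{
	(void)nr_folios;
	(void)folios;
	return 0;
}

/*
 * Pages-based wrapper: convert pages to folios once, then forward to
 * the folio-only helper, mirroring the v13 hunk quoted above.
 */
static long check_and_migrate_movable_pages(unsigned long nr_pages,
					    struct page **pages)
{
	struct folio **folios;
	unsigned long i;
	long ret;

	folios = malloc(nr_pages * sizeof(*folios));
	if (!folios)
		return -1;	/* the kernel code returns -ENOMEM here */

	for (i = 0; i < nr_pages; i++)
		folios[i] = page_folio(pages[i]);

	ret = check_and_migrate_movable_folios(nr_pages, folios);
	free(folios);
	return ret;
}
```

Because the temporary array is only a conversion buffer, freeing it immediately after the folio helper returns is safe regardless of the result.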
Re: [PATCH v13 1/8] mm/gup: Introduce unpin_folio/unpin_folios helpers
On 04.04.24 09:26, Vivek Kasireddy wrote: These helpers are the folio versions of unpin_user_page/unpin_user_pages. They are currently only useful for unpinning folios pinned by memfd_pin_folios() or other associated routines. However, they could find new uses in the future, when more and more folio-only helpers are added to GUP. We should probably sanity check the folio as part of unpin similar to how it is done in unpin_user_page/unpin_user_pages but we cannot cleanly do that at the moment without also checking the subpage. Therefore, sanity checking needs to be added to these routines once we have a way to determine if any given folio is anon-exclusive (via a per folio AnonExclusive flag). Cc: David Hildenbrand Cc: Matthew Wilcox Cc: Christoph Hellwig Cc: Jason Gunthorpe Cc: Peter Xu Suggested-by: David Hildenbrand Signed-off-by: Vivek Kasireddy --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v12 2/8] mm/gup: Introduce check_and_migrate_movable_folios()
On 25.02.24 08:56, Vivek Kasireddy wrote: This helper is the folio equivalent of check_and_migrate_movable_pages(). Therefore, all the rules that apply to check_and_migrate_movable_pages() also apply to this one as well. Currently, this helper is only used by memfd_pin_folios(). This patch also includes changes to rename and convert the internal functions collect_longterm_unpinnable_pages() and migrate_longterm_unpinnable_pages() to work on folios. Since they are also used by check_and_migrate_movable_pages(), a temporary array is used to collect and share the folios with these functions. Cc: David Hildenbrand Cc: Matthew Wilcox Cc: Christoph Hellwig Cc: Jason Gunthorpe Cc: Peter Xu Suggested-by: David Hildenbrand Signed-off-by: Vivek Kasireddy --- mm/gup.c | 129 +++ 1 file changed, 92 insertions(+), 37 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 0a45eda6aaeb..1410af954a4e 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2099,20 +2099,24 @@ struct page *get_dump_page(unsigned long addr) #ifdef CONFIG_MIGRATION /* - * Returns the number of collected pages. Return value is always >= 0. + * Returns the number of collected folios. Return value is always >= 0. */ -static unsigned long collect_longterm_unpinnable_pages( - struct list_head *movable_page_list, - unsigned long nr_pages, +static unsigned long collect_longterm_unpinnable_folios( + struct list_head *movable_folio_list, + unsigned long nr_folios, + struct folio **folios, struct page **pages) This function really shouldn't consume both folios and pages. Either use "folios" and handle the conversion from pages->folios in the caller, or handle it similar to release_pages() where we can pass either and simply always do page_folio() on the given pointer, using essentially an abstracted pointer type and always calling page_folio() on that thing. The easiest is likely to just do the page->folio conversion in the caller by looping over the arrays once more. See below. 
Temporary memory allocation can be avoided by using an abstracted pointer type. [...] + folio = folios[i]; if (folio == prev_folio) continue; prev_folio = folio; @@ -2126,7 +2130,7 @@ static unsigned long collect_longterm_unpinnable_pages( continue; if (folio_test_hugetlb(folio)) { - isolate_hugetlb(folio, movable_page_list); + isolate_hugetlb(folio, movable_folio_list); continue; } @@ -2138,7 +2142,7 @@ static unsigned long collect_longterm_unpinnable_pages( if (!folio_isolate_lru(folio)) continue; - list_add_tail(&folio->lru, movable_page_list); + list_add_tail(&folio->lru, movable_folio_list); node_stat_mod_folio(folio, NR_ISOLATED_ANON + folio_is_file_lru(folio), folio_nr_pages(folio)); @@ -2148,27 +2152,28 @@ static unsigned long collect_longterm_unpinnable_pages( } /* - * Unpins all pages and migrates device coherent pages and movable_page_list. - * Returns -EAGAIN if all pages were successfully migrated or -errno for failure - * (or partial success). + * Unpins all folios and migrates device coherent folios and movable_folio_list. + * Returns -EAGAIN if all folios were successfully migrated or -errno for + * failure (or partial success). */ -static int migrate_longterm_unpinnable_pages( - struct list_head *movable_page_list, - unsigned long nr_pages, - struct page **pages) +static int migrate_longterm_unpinnable_folios( + struct list_head *movable_folio_list, + unsigned long nr_folios, + struct folio **folios) { int ret; unsigned long i; - for (i = 0; i < nr_pages; i++) { - struct folio *folio = page_folio(pages[i]); + for (i = 0; i < nr_folios; i++) { + struct folio *folio = folios[i]; if (folio_is_device_coherent(folio)) { /* -* Migration will fail if the page is pinned, so convert -* the pin on the source page to a normal reference. +* Migration will fail if the folio is pinned, so +* convert the pin on the source folio to a normal +* reference. */ -
Re: [PATCH v12 1/8] mm/gup: Introduce unpin_folio/unpin_folios helpers
On 02.04.24 15:52, David Hildenbrand wrote: On 25.02.24 08:56, Vivek Kasireddy wrote: These helpers are the folio versions of unpin_user_page/unpin_user_pages. They are currently only useful for unpinning folios pinned by memfd_pin_folios() or other associated routines. However, they could find new uses in the future, when more and more folio-only helpers are added to GUP. Cc: David Hildenbrand Cc: Matthew Wilcox Cc: Christoph Hellwig Cc: Jason Gunthorpe Cc: Peter Xu Suggested-by: David Hildenbrand Signed-off-by: Vivek Kasireddy --- include/linux/mm.h | 2 ++ mm/gup.c | 81 -- 2 files changed, 74 insertions(+), 9 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 6f4825d82965..36e4c2b22600 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1601,11 +1601,13 @@ static inline void put_page(struct page *page) #define GUP_PIN_COUNTING_BIAS (1U << 10) void unpin_user_page(struct page *page); +void unpin_folio(struct folio *folio); void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, bool make_dirty); void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages, bool make_dirty); void unpin_user_pages(struct page **pages, unsigned long npages); +void unpin_folios(struct folio **folios, unsigned long nfolios); static inline bool is_cow_mapping(vm_flags_t flags) { diff --git a/mm/gup.c b/mm/gup.c index df83182ec72d..0a45eda6aaeb 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -30,6 +30,23 @@ struct follow_page_context { unsigned int page_mask; }; +static inline void sanity_check_pinned_folios(struct folio **folios, + unsigned long nfolios) +{ + if (!IS_ENABLED(CONFIG_DEBUG_VM)) + return; + + for (; nfolios; nfolios--, folios++) { + struct folio *folio = *folios; + + if (is_zero_folio(folio) || + !folio_test_anon(folio)) + continue; + + VM_BUG_ON_FOLIO(!PageAnonExclusive(&folio->page), folio); That change is wrong (and the split makes the check confusing). 
It could be that the first subpage is no longer exclusive, but the given subpage (the one sanity_check_pinned_pages() actually checks) is exclusive for large folios. I suggest dropping that change and instead, in unpin_folio()/unpin_folios(), rejecting any anon folios for now. So, replace the sanity_check_pinned_folios() in unpin_folio()/unpin_folios() with a VM_WARN_ON(folio_test_anon(folio)); After reading patch #2: drop both the sanity check and the VM_WARN_ON() from unpin_folio()/unpin_folios(), and add a comment to the patch description that we cannot do the sanity checking without the subpage, and that we can reintroduce it once we have a single per-folio AnonExclusive bit. -- Cheers, David / dhildenb
Re: [PATCH v12 1/8] mm/gup: Introduce unpin_folio/unpin_folios helpers
On 25.02.24 08:56, Vivek Kasireddy wrote: These helpers are the folio versions of unpin_user_page/unpin_user_pages. They are currently only useful for unpinning folios pinned by memfd_pin_folios() or other associated routines. However, they could find new uses in the future, when more and more folio-only helpers are added to GUP. Cc: David Hildenbrand Cc: Matthew Wilcox Cc: Christoph Hellwig Cc: Jason Gunthorpe Cc: Peter Xu Suggested-by: David Hildenbrand Signed-off-by: Vivek Kasireddy --- include/linux/mm.h | 2 ++ mm/gup.c | 81 -- 2 files changed, 74 insertions(+), 9 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 6f4825d82965..36e4c2b22600 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1601,11 +1601,13 @@ static inline void put_page(struct page *page) #define GUP_PIN_COUNTING_BIAS (1U << 10) void unpin_user_page(struct page *page); +void unpin_folio(struct folio *folio); void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, bool make_dirty); void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages, bool make_dirty); void unpin_user_pages(struct page **pages, unsigned long npages); +void unpin_folios(struct folio **folios, unsigned long nfolios); static inline bool is_cow_mapping(vm_flags_t flags) { diff --git a/mm/gup.c b/mm/gup.c index df83182ec72d..0a45eda6aaeb 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -30,6 +30,23 @@ struct follow_page_context { unsigned int page_mask; }; +static inline void sanity_check_pinned_folios(struct folio **folios, + unsigned long nfolios) +{ + if (!IS_ENABLED(CONFIG_DEBUG_VM)) + return; + + for (; nfolios; nfolios--, folios++) { + struct folio *folio = *folios; + + if (is_zero_folio(folio) || + !folio_test_anon(folio)) + continue; + + VM_BUG_ON_FOLIO(!PageAnonExclusive(&folio->page), folio); That change is wrong (and the split makes the check confusing). 
It could be that the first subpage is no longer exclusive, but the given subpage (the one sanity_check_pinned_pages() actually checks) is exclusive for large folios. I suggest dropping that change and instead, in unpin_folio()/unpin_folios(), rejecting any anon folios for now. So, replace the sanity_check_pinned_folios() in unpin_folio()/unpin_folios() with a VM_WARN_ON(folio_test_anon(folio)); It will all be better once we have a single anon-exclusive flag per folio (which I am working on), but in the meantime, we really don't expect code that called pin_user_pages() to call unpin_folios(). [...] +/** + * unpin_folio() - release a dma-pinned folio + * @folio: pointer to folio to be released + * + * Folios that were pinned via memfd_pin_folios() or other similar routines + * must be released either using unpin_folio() or unpin_folios(). This is so + * that such folios can be separately tracked and uniquely handled. I'd drop the last sentence; no need for apologies/explanations, this is simply how pinning works :) + */ +void unpin_folio(struct folio *folio) +{ + sanity_check_pinned_folios(&folio, 1); + gup_put_folio(folio, 1, FOLL_PIN); +} +EXPORT_SYMBOL(unpin_folio); Can we restrict that to EXPORT_SYMBOL_GPL for now? memfd_pin_folios() uses EXPORT_SYMBOL_GPL... + /** * folio_add_pin - Try to get an additional pin on a pinned folio * @folio: The folio to be pinned @@ -488,6 +516,41 @@ void unpin_user_pages(struct page **pages, unsigned long npages) } EXPORT_SYMBOL(unpin_user_pages); +/** + * unpin_folios() - release an array of gup-pinned folios. + * @folios: array of folios to be marked dirty and released. + * @nfolios: number of folios in the @folios array. + * + * For each folio in the @folios array, release the folio using unpin_folio(). + * + * Please see the unpin_folio() documentation for details. 
+ */ +void unpin_folios(struct folio **folios, unsigned long nfolios) +{ + unsigned long i = 0, j; + + /* +* If this WARN_ON() fires, then the system *might* be leaking folios +* (by leaving them pinned), but probably not. More likely, gup/pup +* returned a hard -ERRNO error to the caller, who erroneously passed +* it here. +*/ + if (WARN_ON(IS_ERR_VALUE(nfolios))) + return; + + sanity_check_pinned_folios(folios, nfolios); + while (i < nfolios) { + for (j = i + 1; j < nfolios; j++) + if (folios[i] != folios[j]) + break; + + if (folios[i]) + gup_put_folio(folios[i], j - i, FOLL_PIN); + i = j; + } +} +EXPORT_SYMBOL(unpin_folios); Same thought here. -- Cheers, David / dhildenb
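The duplicate-batching loop in the unpin_folios() hunk quoted above is easy to check in isolation: consecutive repeats of the same folio pointer (which is how a large folio pinned once per subpage shows up in such an array) are released with a single call carrying the repeat count. This userspace model replaces gup_put_folio() with a hypothetical per-folio pin counter; only the control flow matches the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in folio: pincount models the outstanding FOLL_PIN references. */
struct folio { int pincount; };

/* Models gup_put_folio(folio, refs, FOLL_PIN): drop 'refs' pins at once. */
static void gup_put_folio(struct folio *folio, unsigned long refs)
{
	folio->pincount -= (int)refs;
}

/*
 * Same loop structure as the quoted unpin_folios(): scan forward for a
 * run of identical pointers, then release that whole run in one call.
 * NULL entries are skipped, as in the patch.
 */
static void unpin_folios(struct folio **folios, unsigned long nfolios)
{
	unsigned long i = 0, j;

	while (i < nfolios) {
		for (j = i + 1; j < nfolios; j++)
			if (folios[i] != folios[j])
				break;

		if (folios[i])
			gup_put_folio(folios[i], j - i);
		i = j;
	}
}
```

Note the batching only merges *adjacent* duplicates, which is exactly the layout produced when a pinned large folio occupies a contiguous range of array slots.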
Re: [PATCH v12 0/8] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios
On 29.03.24 06:38, Kasireddy, Vivek wrote: Hi David, On 25.02.24 08:56, Vivek Kasireddy wrote: Currently, some drivers (e.g, Udmabuf) that want to longterm-pin the pages/folios associated with a memfd, do so by simply taking a reference on them. This is not desirable because the pages/folios may reside in Movable zone or CMA block. Therefore, having drivers use memfd_pin_folios() API ensures that the folios are appropriately pinned via FOLL_PIN for longterm DMA. This patchset also introduces a few helpers and converts the Udmabuf driver to use folios and memfd_pin_folios() API to longterm-pin the folios for DMA. Two new Udmabuf selftests are also included to test the driver and the new API. --- Sorry Vivek, I got distracted. What's the state of this? I assume it's not in an mm tree yet. No problem. Right, they are not in any tree yet. The first two mm patches that add the unpin_folios() and check_and_migrate_movable_folios() helpers still need to be reviewed. I try to get this reviewed this week. If I fail to do that, please ping me. Ok, sounds good! .. as it's already Friday (and even a public Holiday today+Monday here), let me prioritize this for next week! -- Cheers, David / dhildenb
Re: [PATCH v12 0/8] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios
On 25.02.24 08:56, Vivek Kasireddy wrote: Currently, some drivers (e.g, Udmabuf) that want to longterm-pin the pages/folios associated with a memfd, do so by simply taking a reference on them. This is not desirable because the pages/folios may reside in Movable zone or CMA block. Therefore, having drivers use memfd_pin_folios() API ensures that the folios are appropriately pinned via FOLL_PIN for longterm DMA. This patchset also introduces a few helpers and converts the Udmabuf driver to use folios and memfd_pin_folios() API to longterm-pin the folios for DMA. Two new Udmabuf selftests are also included to test the driver and the new API. --- Sorry Vivek, I got distracted. What's the state of this? I assume it's not in an mm tree yet. I try to get this reviewed this week. If I fail to do that, please ping me. -- Cheers, David / dhildenb
Re: [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)
Hi, Sorry, I'm still not happy about the current state, because (1) the folio vs. pages handling is still mixed (2) we're returning+pinning a large folio multiple times. I can address (1) in a follow-up series and as far as (2) is concerned, my understanding is that we need to increase the folio's refcount as and when the folio's tail pages are used. Is this not the case? It appears this is what unpin_user_pages() expects as well. Do you see any concern with this? If you'd just pin the folio once, you'd also only have to unpin it once. But that raises a question: Is it a requirement for the user of this interface, being able to unmap+unpin each individual page? If you really want to handle each subpage possibly individually, then subpage or folio+offset makes more sense, agreed. See below if there is an easy way to clean this up. @@ -5,6 +5,7 @@ #include #include +#include #include #include #include @@ -17,6 +18,7 @@ #include #include #include +#include #include #include @@ -3410,3 +3412,156 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, &locked, gup_flags); } EXPORT_SYMBOL(pin_user_pages_unlocked); + +/** + * memfd_pin_folios() - pin folios associated with a memfd + * @memfd: the memfd whose folios are to be pinned + * @start: starting memfd offset + * @nr_pages: number of pages from start to pin We're not pinning pages. An inclusive range [start, end] would be clearer. Ok, I'll make this change in the next version. + * @folios: array that receives pointers to the folios pinned. + * Should be at-least nr_pages long. + * @offsets: array that receives offsets of pages in their folios. + * Should be at-least nr_pages long. See below, I'm wondering if this is really required once we return each folio only once. The offsets can be calculated by the caller (udmabuf) as well but doing so in this interface would prevent special handling in the caller for the hugetlb case. 
Please look at patch 5 in this series (udmabuf: Pin the pages using memfd_pin_folios() API (v5)) for more details as to what I mean. I'll have a look later to be reminded about the target use case :) + * + * It must be noted that the folios may be pinned for an indefinite amount + * of time. And, in most cases, the duration of time they may stay pinned + * would be controlled by the userspace. This behavior is effectively the + * same as using FOLL_LONGTERM with other GUP APIs. + * + * Returns number of folios pinned. This would be equal to the number of + * pages requested. If no folios were pinned, it returns -errno. + */ +long memfd_pin_folios(struct file *memfd, unsigned long start, + unsigned long nr_pages, struct folio **folios, + pgoff_t *offsets) +{ + unsigned long end = start + (nr_pages << PAGE_SHIFT) - 1; + unsigned int max_pgs, pgoff, pgshift = PAGE_SHIFT; + pgoff_t start_idx, end_idx, next_idx; + unsigned int flags, nr_folios, i, j; + struct folio *folio = NULL; + struct folio_batch fbatch; + struct page **pages; + struct hstate *h; + long ret; + + if (!nr_pages) + return -EINVAL; + + if (!memfd) + return -EINVAL; + + if (!shmem_file(memfd) && !is_file_hugepages(memfd)) + return -EINVAL; + + pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL); + if (!pages) + return -ENOMEM; + + if (is_file_hugepages(memfd)) { + h = hstate_file(memfd); + pgshift = huge_page_shift(h); + } + + flags = memalloc_pin_save(); + do { + i = 0; + start_idx = start >> pgshift; + end_idx = end >> pgshift; + if (is_file_hugepages(memfd)) { + start_idx <<= huge_page_order(h); + end_idx <<= huge_page_order(h); + } + + folio_batch_init(&fbatch); + while (start_idx <= end_idx) { + /* +* In most cases, we should be able to find the folios +* in the page cache. If we cannot find them for some +* reason, we try to allocate them and add them to the +* page cache. 
+*/ + nr_folios = filemap_get_folios_contig(memfd->f_mapping, + &start_idx, + end_idx, + &fbatch); + if (folio) { + folio_put(folio); + folio = NULL; + } + + next_idx = 0; + for
Re: [PATCH v7 0/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)
On 12.12.23 08:37, Vivek Kasireddy wrote: The first two patches were previously reviewed but not yet merged. These ones need to be merged first as the fourth patch depends on the changes introduced in them and they also fix bugs seen in very specific scenarios (running Qemu with hugetlb=on, blob=true and rebooting guest VM). The third patch introduces memfd_pin_folios() API and the fourth patch converts udmabuf driver to use folios. The fifth patch shows how the udmabuf driver can make use of the new API to longterm-pin the folios. The last patch adds two new udmabuf selftests to verify data coherency after potential page migration. We should probably get started with the first two patches, they are independent of the remaining discussion regarding memfd_pin_folios(). -- Cheers, David / dhildenb
Re: [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)
On 12.12.23 08:38, Vivek Kasireddy wrote: For drivers that would like to longterm-pin the folios associated with a memfd, the memfd_pin_folios() API provides an option to not only pin the folios via FOLL_PIN but also to check and migrate them if they reside in movable zone or CMA block. This API currently works with memfds but it should work with any files that belong to either shmemfs or hugetlbfs. Files belonging to other filesystems are rejected for now. The folios need to be located first before pinning them via FOLL_PIN. If they are found in the page cache, they can be immediately pinned. Otherwise, they need to be allocated using the filesystem specific APIs and then pinned. v2: - Drop gup_flags and improve comments and commit message (David) - Allocate a page if we cannot find in page cache for the hugetlbfs case as well (David) - Don't unpin pages if there is a migration related failure (David) - Drop the unnecessary nr_pages <= 0 check (Jason) - Have the caller of the API pass in file * instead of fd (Jason) v3: (David) - Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE (Build error reported by kernel test robot ) - Don't forget memalloc_pin_restore() on non-migration related errors - Improve the readability of the cleanup code associated with non-migration related errors - Augment the comments by describing FOLL_LONGTERM like behavior - Include the R-b tag from Jason v4: - Remove the local variable "page" and instead use 3 return statements in alloc_file_page() (David) - Add the R-b tag from David v5: (David) - For hugetlb case, ensure that we only obtain head pages from the mapping by using __filemap_get_folio() instead of find_get_page_flags() - Handle -EEXIST when two or more potential users try to simultaneously add a huge page to the mapping by forcing them to retry on failure v6: (Christoph) - Rename this API to memfd_pin_user_pages() to make it clear that it is intended for memfds - Move the memfd page allocation helper from 
gup.c to memfd.c - Fix indentation errors in memfd_pin_user_pages() - For contiguous ranges of folios, use a helper such as filemap_get_folios_contig() to lookup the page cache in batches v7: - Rename this API to memfd_pin_folios() and make it return folios and offsets instead of pages (David) - Don't continue processing the folios in the batch returned by filemap_get_folios_contig() if they do not have correct next_idx - Add the R-b tag from Christoph Sorry, I'm still not happy about the current state, because (1) the folio vs. pages handling is still mixed (2) we're returning+pinning a large folio multiple times. See below if there is an easy way to clean this up. @@ -5,6 +5,7 @@ #include #include +#include #include #include #include @@ -17,6 +18,7 @@ #include #include #include +#include #include #include @@ -3410,3 +3412,156 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, , gup_flags); } EXPORT_SYMBOL(pin_user_pages_unlocked); + +/** + * memfd_pin_folios() - pin folios associated with a memfd + * @memfd: the memfd whose folios are to be pinned + * @start: starting memfd offset + * @nr_pages: number of pages from start to pin We're not pinning pages. An inclusive range [start, end] would be clearer. + * @folios: array that receives pointers to the folios pinned. + * Should be at-least nr_pages long. + * @offsets:array that receives offsets of pages in their folios. + * Should be at-least nr_pages long. See below, I'm wondering if this is really required once we return each folio only once. + * + * Attempt to pin folios associated with a memfd; given that a memfd is + * either backed by shmem or hugetlb, the folios can either be found in + * the page cache or need to be allocated if necessary. Once the folios + * are located, they are all pinned via FOLL_PIN and the @offsets array + * is populated with offsets of the pages in their respective folios. 
+ * Therefore, for each page the caller requested, there will be a + * corresponding entry in both @folios and @offsets. And, eventually, + * these pinned folios need to be released either using unpin_user_pages() + * or unpin_user_page(). Oh, we don't have a folio function yet? Should be easy to add, and we'd really add them. + * + * It must be noted that the folios may be pinned for an indefinite amount + * of time. And, in most cases, the duration of time they may stay pinned + * would be controlled by the userspace. This behavior is effectively the + * same as using FOLL_LONGTERM with other GUP APIs. + * + * Returns number of folios pinned. This would be equal to the number of + * pages requested. If no folios were pinned, it returns -errno. + */ +long memfd_pin_folios(struct file *memfd, unsigned long start, + unsigned long nr_pages, struct folio **folios, +
Re: [PATCH v6 3/5] mm/gup: Introduce memfd_pin_user_pages() for pinning memfd pages (v6)
On 08.12.23 08:57, Kasireddy, Vivek wrote: Hi David, On 05.12.23 06:35, Vivek Kasireddy wrote: For drivers that would like to longterm-pin the pages associated with a memfd, the pin_user_pages_fd() API provides an option to not only pin the pages via FOLL_PIN but also to check and migrate them if they reside in movable zone or CMA block. This API currently works with memfds but it should work with any files that belong to either shmemfs or hugetlbfs. Files belonging to other filesystems are rejected for now. The pages need to be located first before pinning them via FOLL_PIN. If they are found in the page cache, they can be immediately pinned. Otherwise, they need to be allocated using the filesystem specific APIs and then pinned. v2: - Drop gup_flags and improve comments and commit message (David) - Allocate a page if we cannot find in page cache for the hugetlbfs case as well (David) - Don't unpin pages if there is a migration related failure (David) - Drop the unnecessary nr_pages <= 0 check (Jason) - Have the caller of the API pass in file * instead of fd (Jason) v3: (David) - Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE (Build error reported by kernel test robot ) - Don't forget memalloc_pin_restore() on non-migration related errors - Improve the readability of the cleanup code associated with non-migration related errors - Augment the comments by describing FOLL_LONGTERM like behavior - Include the R-b tag from Jason v4: - Remove the local variable "page" and instead use 3 return statements in alloc_file_page() (David) - Add the R-b tag from David v5: (David) - For hugetlb case, ensure that we only obtain head pages from the mapping by using __filemap_get_folio() instead of find_get_page_flags() - Handle -EEXIST when two or more potential users try to simultaneously add a huge page to the mapping by forcing them to retry on failure v6: (Christoph) - Rename this API to memfd_pin_user_pages() to make it clear that it is intended for 
memfds - Move the memfd page allocation helper from gup.c to memfd.c - Fix indentation errors in memfd_pin_user_pages() - For contiguous ranges of folios, use a helper such as filemap_get_folios_contig() to lookup the page cache in batches Cc: David Hildenbrand Cc: Christoph Hellwig Cc: Daniel Vetter Cc: Mike Kravetz Cc: Hugh Dickins Cc: Peter Xu Cc: Gerd Hoffmann Cc: Dongwon Kim Cc: Junxiao Chang Suggested-by: Jason Gunthorpe Reviewed-by: Jason Gunthorpe (v2) Reviewed-by: David Hildenbrand (v3) Signed-off-by: Vivek Kasireddy --- include/linux/memfd.h | 5 +++ include/linux/mm.h| 2 + mm/gup.c | 102 ++ mm/memfd.c| 34 ++ 4 files changed, 143 insertions(+) diff --git a/include/linux/memfd.h b/include/linux/memfd.h index e7abf6fa4c52..6fc0d1282151 100644 --- a/include/linux/memfd.h +++ b/include/linux/memfd.h @@ -6,11 +6,16 @@ #ifdef CONFIG_MEMFD_CREATE extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg); +extern struct page *memfd_alloc_page(struct file *memfd, pgoff_t idx); #else static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned int a) { return -EINVAL; } +static inline struct page *memfd_alloc_page(struct file *memfd, pgoff_t idx) +{ + return ERR_PTR(-EINVAL); +} #endif #endif /* __LINUX_MEMFD_H */ diff --git a/include/linux/mm.h b/include/linux/mm.h index 418d26608ece..ac69db45509f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2472,6 +2472,8 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); +long memfd_pin_user_pages(struct file *file, pgoff_t start, + unsigned long nr_pages, struct page **pages); int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages); diff --git a/mm/gup.c b/mm/gup.c index 231711efa390..eb93d1ec9dc6 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -5,6 +5,7 
@@ #include #include +#include #include #include #include @@ -17,6 +18,7 @@ #include #include #include +#include #include #include @@ -3410,3 +3412,103 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, , gup_flags); } EXPORT_SYMBOL(pin_user_pages_unlocked); + +/** + * memfd_pin_user_pages() - pin user pages associated with a memfd + * @memfd: the memfd whose pages are to be pinned + * @start: starting memfd offset + * @nr_pages: number of pages from start to pin + * @pages: array that receives pointers
Re: [PATCH v6 3/5] mm/gup: Introduce memfd_pin_user_pages() for pinning memfd pages (v6)
On 07.12.23 14:05, Jason Gunthorpe wrote: On Thu, Dec 07, 2023 at 10:44:14AM +0100, David Hildenbrand wrote: If you always want to return folios, then better name it "memfd_pin_user_folios" (or just "memfd_pin_folios") and pass in a range (instead of a nr_pages parameter), and somehow indicate to the caller how many folio were in that range, and if that range was fully covered. I think it makes sense to return folios from this interface; and considering my use-case, I'd like have this API return an error if it cannot pin (or allocate) the exact number of folios the caller requested. Okay, then better use folios. Assuming a caller puts in "start = X" and gets some large folio back. How is the caller supposed to know at which offset to look into that folio (IOW< which subpage)? For "pages" it was obvious (you get the actual subpages), but as soon as we return a large folio, some information is missing for the caller. How can the caller figure that out? This can only work if the memfd is required to only have full folios at aligned locations. Under that restriction computing the first folio offset is easy enough: folio offset = (start % folio size) But is that true for the memfds here? I assume folios are always naturally aligned, like: [ 2m ][ 2m ][1m][1m][ 2m ] ^f0 ^f1 ^f2 ^f3 ^f4 If you query the range "3m -> 7m", you get back f1,f2,f3,f4 and have to start in the middle of the first folio with offset 1m. From there, it is indeed simply continuing with the full folio size -- until the last folio, where you want to only process 1m. folio offset = (1m % 2m) would be correct in that case. -- Cheers, David / dhildenb
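The offset arithmetic in the example above is worth pinning down: with naturally aligned folios, the caller's byte offset into the first folio returned for a range is simply start modulo that folio's size. A small check against the [2m][2m][1m][1m][2m] layout sketched in the mail (folio names f0..f4 are from that sketch):

```c
#include <assert.h>

#define MiB (1UL << 20)

/*
 * With naturally aligned folios, the offset into the folio covering
 * 'start' is start % folio_size, as derived in the mail above.
 */
static unsigned long folio_offset(unsigned long start,
				  unsigned long folio_size)
{
	return start % folio_size;
}
```

For the "3m -> 7m" query: the first folio returned is the 2m folio f1 at 2m, so the starting offset is 1m; processing then continues at offset 0 in each subsequent folio.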
Re: [PATCH v6 3/5] mm/gup: Introduce memfd_pin_user_pages() for pinning memfd pages (v6)
On 07.12.23 06:09, Kasireddy, Vivek wrote: Hi David, On 05.12.23 06:35, Vivek Kasireddy wrote: For drivers that would like to longterm-pin the pages associated with a memfd, the pin_user_pages_fd() API provides an option to not only pin the pages via FOLL_PIN but also to check and migrate them if they reside in movable zone or CMA block. This API currently works with memfds but it should work with any files that belong to either shmemfs or hugetlbfs. Files belonging to other filesystems are rejected for now. The pages need to be located first before pinning them via FOLL_PIN. If they are found in the page cache, they can be immediately pinned. Otherwise, they need to be allocated using the filesystem specific APIs and then pinned. v2: - Drop gup_flags and improve comments and commit message (David) - Allocate a page if we cannot find in page cache for the hugetlbfs case as well (David) - Don't unpin pages if there is a migration related failure (David) - Drop the unnecessary nr_pages <= 0 check (Jason) - Have the caller of the API pass in file * instead of fd (Jason) v3: (David) - Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE (Build error reported by kernel test robot ) - Don't forget memalloc_pin_restore() on non-migration related errors - Improve the readability of the cleanup code associated with non-migration related errors - Augment the comments by describing FOLL_LONGTERM like behavior - Include the R-b tag from Jason v4: - Remove the local variable "page" and instead use 3 return statements in alloc_file_page() (David) - Add the R-b tag from David v5: (David) - For hugetlb case, ensure that we only obtain head pages from the mapping by using __filemap_get_folio() instead of find_get_page_flags() - Handle -EEXIST when two or more potential users try to simultaneously add a huge page to the mapping by forcing them to retry on failure v6: (Christoph) - Rename this API to memfd_pin_user_pages() to make it clear that it is intended for 
memfds - Move the memfd page allocation helper from gup.c to memfd.c - Fix indentation errors in memfd_pin_user_pages() - For contiguous ranges of folios, use a helper such as filemap_get_folios_contig() to lookup the page cache in batches Cc: David Hildenbrand Cc: Christoph Hellwig Cc: Daniel Vetter Cc: Mike Kravetz Cc: Hugh Dickins Cc: Peter Xu Cc: Gerd Hoffmann Cc: Dongwon Kim Cc: Junxiao Chang Suggested-by: Jason Gunthorpe Reviewed-by: Jason Gunthorpe (v2) Reviewed-by: David Hildenbrand (v3) Signed-off-by: Vivek Kasireddy --- include/linux/memfd.h | 5 +++ include/linux/mm.h| 2 + mm/gup.c | 102 ++ mm/memfd.c| 34 ++ 4 files changed, 143 insertions(+) diff --git a/include/linux/memfd.h b/include/linux/memfd.h index e7abf6fa4c52..6fc0d1282151 100644 --- a/include/linux/memfd.h +++ b/include/linux/memfd.h @@ -6,11 +6,16 @@ #ifdef CONFIG_MEMFD_CREATE extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg); +extern struct page *memfd_alloc_page(struct file *memfd, pgoff_t idx); #else static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned int a) { return -EINVAL; } +static inline struct page *memfd_alloc_page(struct file *memfd, pgoff_t idx) +{ + return ERR_PTR(-EINVAL); +} #endif #endif /* __LINUX_MEMFD_H */ diff --git a/include/linux/mm.h b/include/linux/mm.h index 418d26608ece..ac69db45509f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2472,6 +2472,8 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); +long memfd_pin_user_pages(struct file *file, pgoff_t start, + unsigned long nr_pages, struct page **pages); int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages); diff --git a/mm/gup.c b/mm/gup.c index 231711efa390..eb93d1ec9dc6 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -5,6 +5,7 
@@ #include #include +#include #include #include #include @@ -17,6 +18,7 @@ #include #include #include +#include #include #include @@ -3410,3 +3412,103 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, , gup_flags); } EXPORT_SYMBOL(pin_user_pages_unlocked); + +/** + * memfd_pin_user_pages() - pin user pages associated with a memfd + * @memfd: the memfd whose pages are to be pinned + * @start: starting memfd offset + * @nr_pages: number of pages from start to pin + * @pages: array that receives pointers to the pages pinned. + * Should be at-least nr_pages long.
Re: [PATCH v6 3/5] mm/gup: Introduce memfd_pin_user_pages() for pinning memfd pages (v6)
On 05.12.23 06:35, Vivek Kasireddy wrote: For drivers that would like to longterm-pin the pages associated with a memfd, the pin_user_pages_fd() API provides an option to not only pin the pages via FOLL_PIN but also to check and migrate them if they reside in movable zone or CMA block. This API currently works with memfds but it should work with any files that belong to either shmemfs or hugetlbfs. Files belonging to other filesystems are rejected for now. The pages need to be located first before pinning them via FOLL_PIN. If they are found in the page cache, they can be immediately pinned. Otherwise, they need to be allocated using the filesystem specific APIs and then pinned. v2: - Drop gup_flags and improve comments and commit message (David) - Allocate a page if we cannot find in page cache for the hugetlbfs case as well (David) - Don't unpin pages if there is a migration related failure (David) - Drop the unnecessary nr_pages <= 0 check (Jason) - Have the caller of the API pass in file * instead of fd (Jason) v3: (David) - Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE (Build error reported by kernel test robot ) - Don't forget memalloc_pin_restore() on non-migration related errors - Improve the readability of the cleanup code associated with non-migration related errors - Augment the comments by describing FOLL_LONGTERM like behavior - Include the R-b tag from Jason v4: - Remove the local variable "page" and instead use 3 return statements in alloc_file_page() (David) - Add the R-b tag from David v5: (David) - For hugetlb case, ensure that we only obtain head pages from the mapping by using __filemap_get_folio() instead of find_get_page_flags() - Handle -EEXIST when two or more potential users try to simultaneously add a huge page to the mapping by forcing them to retry on failure v6: (Christoph) - Rename this API to memfd_pin_user_pages() to make it clear that it is intended for memfds - Move the memfd page allocation helper from 
gup.c to memfd.c - Fix indentation errors in memfd_pin_user_pages() - For contiguous ranges of folios, use a helper such as filemap_get_folios_contig() to lookup the page cache in batches Cc: David Hildenbrand Cc: Christoph Hellwig Cc: Daniel Vetter Cc: Mike Kravetz Cc: Hugh Dickins Cc: Peter Xu Cc: Gerd Hoffmann Cc: Dongwon Kim Cc: Junxiao Chang Suggested-by: Jason Gunthorpe Reviewed-by: Jason Gunthorpe (v2) Reviewed-by: David Hildenbrand (v3) Signed-off-by: Vivek Kasireddy --- include/linux/memfd.h | 5 +++ include/linux/mm.h| 2 + mm/gup.c | 102 ++ mm/memfd.c| 34 ++ 4 files changed, 143 insertions(+) diff --git a/include/linux/memfd.h b/include/linux/memfd.h index e7abf6fa4c52..6fc0d1282151 100644 --- a/include/linux/memfd.h +++ b/include/linux/memfd.h @@ -6,11 +6,16 @@ #ifdef CONFIG_MEMFD_CREATE extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg); +extern struct page *memfd_alloc_page(struct file *memfd, pgoff_t idx); #else static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned int a) { return -EINVAL; } +static inline struct page *memfd_alloc_page(struct file *memfd, pgoff_t idx) +{ + return ERR_PTR(-EINVAL); +} #endif #endif /* __LINUX_MEMFD_H */ diff --git a/include/linux/mm.h b/include/linux/mm.h index 418d26608ece..ac69db45509f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2472,6 +2472,8 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); +long memfd_pin_user_pages(struct file *file, pgoff_t start, + unsigned long nr_pages, struct page **pages); int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages); diff --git a/mm/gup.c b/mm/gup.c index 231711efa390..eb93d1ec9dc6 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -5,6 +5,7 @@ #include #include +#include #include #include 
#include @@ -17,6 +18,7 @@ #include #include #include +#include #include #include @@ -3410,3 +3412,103 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, , gup_flags); } EXPORT_SYMBOL(pin_user_pages_unlocked); + +/** + * memfd_pin_user_pages() - pin user pages associated with a memfd + * @memfd: the memfd whose pages are to be pinned + * @start: starting memfd offset + * @nr_pages: number of pages from start to pin + * @pages: array that receives pointers to the pages pinned. + * Should be at-least nr_pages long. + * + * Attempt to pin pages associated with a memfd.
Re: [RFC PATCH 2/6] mm/gmem: add arch-independent abstraction to track address mapping status
On 02.12.23 15:50, Pedro Falcato wrote: On Fri, Dec 1, 2023 at 9:23 AM David Hildenbrand wrote: On 28.11.23 13:50, Weixi Zhu wrote: This patch adds an abstraction layer, struct vm_object, that maintains per-process virtual-to-physical mapping status stored in struct gm_mapping. For example, a virtual page may be mapped to a CPU physical page or to a device physical page. Struct vm_object effectively maintains an arch-independent page table, which is defined as a "logical page table". While arch-dependent page table used by a real MMU is named a "physical page table". The logical page table is useful if Linux core MM is extended to handle a unified virtual address space with external accelerators using customized MMUs. Which raises the question why we are dealing with anonymous memory at all? Why not go for shmem if you are already only special-casing VMAs with a MMAP flag right now? That would maybe avoid having to introduce controversial BSD design concepts into Linux, that feel like going a step backwards in time to me and adding *more* MM complexity. In this patch, struct vm_object utilizes a radix tree (xarray) to track where a virtual page is mapped to. This adds extra memory consumption from xarray, but provides a nice abstraction to isolate mapping status from the machine-dependent layer (PTEs). Besides supporting accelerators with external MMUs, struct vm_object is planned to further union with i_pages in struct address_mapping for file-backed memory. A file already has a tree structure (pagecache) to manage the pages that are theoretically mapped. It's easy to translate from a VMA to a page inside that tree structure that is currently not present in page tables. Why the need for that tree structure if you can just remove anon memory from the picture? The idea of struct vm_object is originated from FreeBSD VM design, which provides a unified abstraction for anonymous memory, file-backed memory, page cache and etc[1]. 
:/ Currently, Linux utilizes a set of hierarchical page walk functions to abstract page table manipulations of different CPU architectures. The problem happens when a device wants to reuse Linux MM code to manage its page table -- the device page table may not be accessible to the CPU. Existing solutions like Linux HMM utilize the MMU notifier mechanisms to invoke device-specific MMU functions, but rely on encoding the mapping status on the CPU page table entries. This entangles machine-independent code with machine-dependent code, and also brings unnecessary restrictions. Why? We have primitives to walk arch page tables in a non-arch specific fashion and are using them all over the place. We even have various mechanisms to map something into the page tables and get the CPU to fault on it, as if it is inaccessible (PROT_NONE as used for NUMA balancing, fake swap entries). The PTE size and format vary arch by arch, which harms the extensibility. Not really. We might have some features limited to some architectures because of the lack of PTE bits. And usually the problem is that people don't care enough about enabling these features on older architectures. If we ever *really* need more space for sw-defined data, it would be possible to allocate auxiliary data for page tables only where required (where the features apply), instead of crafting a completely new, auxiliary datastructure with its own locking. So far it was not required to enable the feature we need on the architectures we care about. [1] https://docs.freebsd.org/en/articles/vm-design/ In the cover letter you have: "The future plan of logical page table is to provide a generic abstraction layer that support common anonymous memory (I am looking at you, transparent huge pages) and file-backed memory." Which I doubt will happen; there is little interest in making anonymous memory management slower, more serialized, and wasting more memory on metadata.
Also worth noting that: 1) Mach VM (which FreeBSD inherited, from the old BSD) vm_objects aren't quite what's being stated here, rather they are somewhat replacements for both anon_vma and address_space[1]. Very similarly to Linux, they take pages from vm_objects and map them in page tables using pmap (the big difference is anon memory, which has its bookkeeping in page tables, on Linux) 2) These vm_objects were a horrendous mistake (see CoW chaining) and FreeBSD has to go to horrendous lengths to make them tolerable. The UVM paper/dissertation (by Charles Cranor) talks about these issues at length, and 20 years later it's still true. 3) Despite Linux MM having its warts, it's probably correct to consider it a solid improvement over FreeBSD MM or NetBSD UVM And, finally, randomly tacking on core MM concepts from other systems is at best a *really weird* idea. Particularly when they aren't even what was stated! Can you read my mind? :) thanks for noting all that, with which I 100% agree. -- Cheers, David / dhildenb
Re: [PATCH v5 3/5] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v5)
On 04.12.23 09:16, Christoph Hellwig wrote: On Thu, Nov 30, 2023 at 06:41:11AM +, Kasireddy, Vivek wrote: I see your concern. The word "file" does make it look like this API works with all kinds of files although it is meant to specifically work with files that belong to shmemfs or hugetlbfs. Since it is intended to work with memfds in particular, I'll rename this helper alloc_memfd_page(). I think it also makes sense to do s/file/memfd in this whole patch. Does this sound ok? That sounds much better, yes. And please also rename the new api to memfd_pin_user_pages. asserts that this is true). gup.c also seems like a very odd place for such a helper. I only created this helper to cleanly separate lookup and creation and to reduce the level of indentation in pin_user_pages_fd(). Anyway, would mm/memfd.c be a more appropriate location? I think so, but I'll defer to the MM maintainers. As mentioned above, this API is mainly intended for memfds and FWICS, memfds are backed by files from either shmemfs or hugetlbfs. Ok. With better naming this should be more obvious. All sounds reasonable to me! -- Cheers, David / dhildenb
Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices
On 01.12.23 03:44, zhuweixi wrote: Thanks! I hope you understood that that was a joke :) I am planning to present GMEM in Linux MM Alignment Sessions so I can collect more input from the mm developers. Sounds good. But please try inviting key HMM/driver developers as well. Most of the core-mm folks attending that meeting are not that familiar with these concepts and they are usually not happy about: (1) More core-MM complexity for things that can be neatly handled in separate subsystems with the existing infrastructure already. (2) One new way of doing things while the other things remain in place. (3) New MMAP flags. Usually you have a hard time getting this in. Sometimes, there are other ways (e.g., special-purpose filesystems). (4) Changing controversial core-mm design decisions to handle corner cases. -- Cheers, David / dhildenb
Re: [RFC PATCH 2/6] mm/gmem: add arch-independent abstraction to track address mapping status
On 28.11.23 13:50, Weixi Zhu wrote: This patch adds an abstraction layer, struct vm_object, that maintains per-process virtual-to-physical mapping status stored in struct gm_mapping. For example, a virtual page may be mapped to a CPU physical page or to a device physical page. Struct vm_object effectively maintains an arch-independent page table, which is defined as a "logical page table". While arch-dependent page table used by a real MMU is named a "physical page table". The logical page table is useful if Linux core MM is extended to handle a unified virtual address space with external accelerators using customized MMUs. Which raises the question why we are dealing with anonymous memory at all? Why not go for shmem if you are already only special-casing VMAs with a MMAP flag right now? That would maybe avoid having to introduce controversial BSD design concepts into Linux, that feel like going a step backwards in time to me and adding *more* MM complexity. In this patch, struct vm_object utilizes a radix tree (xarray) to track where a virtual page is mapped to. This adds extra memory consumption from xarray, but provides a nice abstraction to isolate mapping status from the machine-dependent layer (PTEs). Besides supporting accelerators with external MMUs, struct vm_object is planned to further union with i_pages in struct address_mapping for file-backed memory. A file already has a tree structure (pagecache) to manage the pages that are theoretically mapped. It's easy to translate from a VMA to a page inside that tree structure that is currently not present in page tables. Why the need for that tree structure if you can just remove anon memory from the picture? The idea of struct vm_object is originated from FreeBSD VM design, which provides a unified abstraction for anonymous memory, file-backed memory, page cache and etc[1]. 
:/ Currently, Linux utilizes a set of hierarchical page walk functions to abstract page table manipulations of different CPU architectures. The problem happens when a device wants to reuse Linux MM code to manage its page table -- the device page table may not be accessible to the CPU. Existing solutions like Linux HMM utilize the MMU notifier mechanisms to invoke device-specific MMU functions, but rely on encoding the mapping status on the CPU page table entries. This entangles machine-independent code with machine-dependent code, and also brings unnecessary restrictions. Why? We have primitives to walk arch page tables in a non-arch specific fashion and are using them all over the place. We even have various mechanisms to map something into the page tables and get the CPU to fault on it, as if it is inaccessible (PROT_NONE as used for NUMA balancing, fake swap entries). The PTE size and format vary arch by arch, which harms the extensibility. Not really. We might have some features limited to some architectures because of the lack of PTE bits. And usually the problem is that people don't care enough about enabling these features on older architectures. If we ever *really* need more space for sw-defined data, it would be possible to allocate auxiliary data for page tables only where required (where the features apply), instead of crafting a completely new, auxiliary datastructure with its own locking. So far it was not required to enable the feature we need on the architectures we care about. [1] https://docs.freebsd.org/en/articles/vm-design/ In the cover letter you have: "The future plan of logical page table is to provide a generic abstraction layer that support common anonymous memory (I am looking at you, transparent huge pages) and file-backed memory." Which I doubt will happen; there is little interest in making anonymous memory management slower, more serialized, and wasting more memory on metadata.
Note that you won't make many friends around here with statements like "To be honest, not using a logical page table for anonymous memory is why Linux THP fails compared with FreeBSD's superpage". I read one paper that makes such claims (I'm curious how you define "winning"), and am aware of some shortcomings. But I am not convinced that a second datastructure "is why Linux THP fails". It just requires some more work to get it sorted under Linux (e.g., allocate THP, PTE-map it and map inaccessible parts PROT_NONE, later collapse it in-place into a PMD), and so far, there was not a lot of interest that I am aware of to even start working on that. So if there is not enough pain for companies to even work on that in Linux, maybe FreeBSD superpages are "winning" "on paper" only? Remember that the target audience here are Linux developers. But yeah, this here is all designed around the idea "core MM is extended to handle a unified virtual address space with external accelerators using customized MMUs." and then trying to find other arguments why it's a good idea, without going too much into
Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices
On 29.11.23 09:27, zhuweixi wrote: Glad to hear that more sharable code is desirable. IMHO, for a common MM subsystem, it is more beneficial for GMEM to extend core MM instead of building a separate one. More core-mm complexity, awesome, we all love that! ;) -- Cheers, David / dhildenb
Re: [PATCH v4 3/5] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v4)
On 18.11.23 07:32, Vivek Kasireddy wrote: For drivers that would like to longterm-pin the pages associated with a file, the pin_user_pages_fd() API provides an option to not only pin the pages via FOLL_PIN but also to check and migrate them if they reside in movable zone or CMA block. This API currently works with files that belong to either shmem or hugetlbfs. Files belonging to other filesystems are rejected for now. The pages need to be located first before pinning them via FOLL_PIN. If they are found in the page cache, they can be immediately pinned. Otherwise, they need to be allocated using the filesystem specific APIs and then pinned. v2: - Drop gup_flags and improve comments and commit message (David) - Allocate a page if we cannot find in page cache for the hugetlbfs case as well (David) - Don't unpin pages if there is a migration related failure (David) - Drop the unnecessary nr_pages <= 0 check (Jason) - Have the caller of the API pass in file * instead of fd (Jason) v3: (David) - Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE (Build error reported by kernel test robot ) - Don't forget memalloc_pin_restore() on non-migration related errors - Improve the readability of the cleanup code associated with non-migration related errors - Augment the comments by describing FOLL_LONGTERM like behavior - Include the R-b tag from Jason v4: - Remove the local variable "page" and instead use 3 return statements in alloc_file_page() (David) - Add the R-b tag from David Cc: David Hildenbrand Cc: Daniel Vetter Cc: Mike Kravetz Cc: Hugh Dickins Cc: Peter Xu Cc: Gerd Hoffmann Cc: Dongwon Kim Cc: Junxiao Chang Suggested-by: Jason Gunthorpe Reviewed-by: Jason Gunthorpe (v2) Reviewed-by: David Hildenbrand (v3) Signed-off-by: Vivek Kasireddy --- [...] 
+static struct page *alloc_file_page(struct file *file, pgoff_t idx) +{ +#ifdef CONFIG_HUGETLB_PAGE + struct folio *folio; + int err; + + if (is_file_hugepages(file)) { + folio = alloc_hugetlb_folio_nodemask(hstate_file(file), +NUMA_NO_NODE, +NULL, +GFP_USER); + if (folio && folio_try_get(folio)) { + err = hugetlb_add_to_page_cache(folio, + file->f_mapping, + idx); + if (err) { + folio_put(folio); + free_huge_folio(folio); + return ERR_PTR(err); + } + return &folio->page; While looking at the user of pin_user_pages_fd(), I realized something: Assume idx is not aligned to the hugetlb page size. find_get_page_flags() would always return a tail page in that case, but you'd be returning the head page here. See pagecache_get_page()->folio_file_page(folio, index); + } + return ERR_PTR(-ENOMEM); + } +#endif + return shmem_read_mapping_page(file->f_mapping, idx); +} + +/** + * pin_user_pages_fd() - pin user pages associated with a file + * @file: the file whose pages are to be pinned + * @start: starting file offset + * @nr_pages: number of pages from start to pin + * @pages: array that receives pointers to the pages pinned. + * Should be at-least nr_pages long. + * + * Attempt to pin pages associated with a file that belongs to either shmem + * or hugetlb. The pages are either found in the page cache or allocated if + * necessary. Once the pages are located, they are all pinned via FOLL_PIN. + * And, these pinned pages need to be released either using unpin_user_pages() + * or unpin_user_page(). + * + * It must be noted that the pages may be pinned for an indefinite amount + * of time. And, in most cases, the duration of time they may stay pinned + * would be controlled by the userspace. This behavior is effectively the + * same as using FOLL_LONGTERM with other GUP APIs. + * + * Returns number of pages pinned. This would be equal to the number of + * pages requested. If no pages were pinned, it returns -errno.
+ */ +long pin_user_pages_fd(struct file *file, pgoff_t start, + unsigned long nr_pages, struct page **pages) +{ + struct page *page; + unsigned int flags, i; + long ret; + + if (start < 0) + return -EINVAL; + + if (!file) + return -EINVAL; + + if (!shmem_file(file) && !is_file_hugepages(file)) + return -EINVAL; + + flags = memalloc_pin_save(); + do { + for (i = 0; i < nr_pages; i++) { + /* +* In most cases, we should be able to find the page +
Re: [PATCH] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v3)
On 14.11.23 08:00, Vivek Kasireddy wrote: For drivers that would like to longterm-pin the pages associated with a file, the pin_user_pages_fd() API provides an option to not only pin the pages via FOLL_PIN but also to check and migrate them if they reside in movable zone or CMA block. This API currently works with files that belong to either shmem or hugetlbfs. Files belonging to other filesystems are rejected for now. The pages need to be located first before pinning them via FOLL_PIN. If they are found in the page cache, they can be immediately pinned. Otherwise, they need to be allocated using the filesystem specific APIs and then pinned. v2: - Drop gup_flags and improve comments and commit message (David) - Allocate a page if we cannot find in page cache for the hugetlbfs case as well (David) - Don't unpin pages if there is a migration related failure (David) - Drop the unnecessary nr_pages <= 0 check (Jason) - Have the caller of the API pass in file * instead of fd (Jason) v3: (David) - Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE (Build error reported by kernel test robot ) - Don't forget memalloc_pin_restore() on non-migration related errors - Improve the readability of the cleanup code associated with non-migration related errors - Augment the comments by describing FOLL_LONGTERM like behavior - Include the R-b tag from Jason Cc: David Hildenbrand Cc: Daniel Vetter Cc: Mike Kravetz Cc: Hugh Dickins Cc: Peter Xu Cc: Gerd Hoffmann Cc: Dongwon Kim Cc: Junxiao Chang Suggested-by: Jason Gunthorpe Reviewed-by: Jason Gunthorpe (v2) Signed-off-by: Vivek Kasireddy --- include/linux/mm.h | 2 + mm/gup.c | 109 + 2 files changed, 111 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 418d26608ece..1b675fa35059 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2472,6 +2472,8 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); long 
pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); +long pin_user_pages_fd(struct file *file, pgoff_t start, + unsigned long nr_pages, struct page **pages); int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages); diff --git a/mm/gup.c b/mm/gup.c index 231711efa390..b3af967cdff1 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -3410,3 +3410,112 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, , gup_flags); } EXPORT_SYMBOL(pin_user_pages_unlocked); + +static struct page *alloc_file_page(struct file *file, pgoff_t idx) +{ +#ifdef CONFIG_HUGETLB_PAGE + struct page *page = ERR_PTR(-ENOMEM); + struct folio *folio; + int err; + + if (is_file_hugepages(file)) { + folio = alloc_hugetlb_folio_nodemask(hstate_file(file), +NUMA_NO_NODE, +NULL, +GFP_USER); + if (folio && folio_try_get(folio)) { + page = &folio->page; + err = hugetlb_add_to_page_cache(folio, + file->f_mapping, + idx); + if (err) { + folio_put(folio); + free_huge_folio(folio); + page = ERR_PTR(err); + } + } + return page; You could avoid the "page" variable completely simply by using 3 return statements. LGTM, thanks Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 1/3] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v2)
On 06.11.23 07:15, Vivek Kasireddy wrote: For drivers that would like to longterm-pin the pages associated with a file, the pin_user_pages_fd() API provides an option to not only pin the pages via FOLL_PIN but also to check and migrate them if they reside in movable zone or CMA block. This API currently works with files that belong to either shmem or hugetlbfs. Files belonging to other filesystems are rejected for now. The pages need to be located first before pinning them via FOLL_PIN. If they are found in the page cache, they can be immediately pinned. Otherwise, they need to be allocated using the filesystem specific APIs and then pinned. v2: - Drop gup_flags and improve comments and commit message (David) - Allocate a page if we cannot find in page cache for the hugetlbfs case as well (David) - Don't unpin pages if there is a migration related failure (David) - Drop the unnecessary nr_pages <= 0 check (Jason) - Have the caller of the API pass in file * instead of fd (Jason) Cc: David Hildenbrand Cc: Daniel Vetter Cc: Mike Kravetz Cc: Hugh Dickins Cc: Peter Xu Cc: Gerd Hoffmann Cc: Dongwon Kim Cc: Junxiao Chang Suggested-by: Jason Gunthorpe Signed-off-by: Vivek Kasireddy --- include/linux/mm.h | 2 + mm/gup.c | 99 ++ 2 files changed, 101 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index bf5d0b1b16f4..f6cc17b14653 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2457,6 +2457,8 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); +long pin_user_pages_fd(struct file *file, pgoff_t start, + unsigned long nr_pages, struct page **pages); int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages); diff --git a/mm/gup.c b/mm/gup.c index 2f8a2d89fde1..d30b9dfebbb6 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ 
-3400,3 +3400,102 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, , gup_flags); } EXPORT_SYMBOL(pin_user_pages_unlocked); + +static struct page *alloc_file_page(struct file *file, pgoff_t idx) +{ + struct page *page = ERR_PTR(-ENOMEM); + struct folio *folio; + int err; + + if (shmem_file(file)) + return shmem_read_mapping_page(file->f_mapping, idx); + As the build reports indicate, this might have to be fenced with #ifdef CONFIG_HUGETLB_PAGE + folio = alloc_hugetlb_folio_nodemask(hstate_file(file), +NUMA_NO_NODE, +NULL, +GFP_USER); + if (folio && folio_try_get(folio)) { + page = &folio->page; + err = hugetlb_add_to_page_cache(folio, file->f_mapping, idx); + if (err) { + folio_put(folio); + free_huge_folio(folio); + page = ERR_PTR(err); + } + } + + return page; +} + +/** + * pin_user_pages_fd() - pin user pages associated with a file + * @file: the file whose pages are to be pinned + * @start: starting file offset + * @nr_pages: number of pages from start to pin + * @pages: array that receives pointers to the pages pinned. + * Should be at-least nr_pages long. + * + * Attempt to pin pages associated with a file that belongs to either shmem + * or hugetlbfs. The pages are either found in the page cache or allocated nit: s/hugetlbfs/hugetlb/ + * if necessary. Once the pages are located, they are all pinned via FOLL_PIN. + * And, these pinned pages need to be released using unpin_user_pages() or + * unpin_user_page(). + * It might be reasonable to add that the behavior mimics FOLL_LONGTERM semantics: the page may be held for an indefinite time period _often_ under userspace control. + * Returns number of pages pinned. This would be equal to the number of + * pages requested. If no pages were pinned, it returns -errno.
+ */ +long pin_user_pages_fd(struct file *file, pgoff_t start, + unsigned long nr_pages, struct page **pages) +{ + struct page *page; + unsigned int flags, i; + long ret; + + if (start < 0) + return -EINVAL; + + if (!file) + return -EINVAL; + + if (!shmem_file(file) && !is_file_hugepages(file)) + return -EINVAL; + + flags = memalloc_pin_save(); + do { + for (i = 0; i < nr_pages; i++) { + /* +* In most cases, we should be able to find the page +
Re: [PATCH v1 1/3] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages
On 03.10.23 09:44, Vivek Kasireddy wrote: For drivers that would like to longterm-pin the pages associated with a file, the pin_user_pages_fd() API provides an option to not only FOLL_PIN the pages but also to check and migrate them if they reside in movable zone or CMA block. For now, this API can only work with files belonging to shmem or hugetlbfs given that the udmabuf driver is the only user. Maybe add "Other files are rejected.". Wasn't clear to me before I looked into the code. It must be noted that the pages associated with hugetlbfs files are expected to be found in the page cache. An error is returned if they are not found. However, shmem pages can be swapped in or allocated if they are not present in the page cache. Cc: David Hildenbrand Cc: Daniel Vetter Cc: Mike Kravetz Cc: Hugh Dickins Cc: Peter Xu Cc: Gerd Hoffmann Cc: Dongwon Kim Cc: Junxiao Chang Suggested-by: Jason Gunthorpe Signed-off-by: Vivek Kasireddy --- include/linux/mm.h | 2 ++ mm/gup.c | 87 ++ 2 files changed, 89 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index bf5d0b1b16f4..af2121fb8101 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2457,6 +2457,8 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); +long pin_user_pages_fd(int fd, pgoff_t start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages); int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages); diff --git a/mm/gup.c b/mm/gup.c index 2f8a2d89fde1..e34b77a15fa8 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -3400,3 +3400,90 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, &locked, gup_flags); } EXPORT_SYMBOL(pin_user_pages_unlocked); + This does look quite neat, nice! Let's take a closer look ... 
+/** + * pin_user_pages_fd() - pin user pages associated with a file + * @fd: the fd whose pages are to be pinned + * @start: starting file offset + * @nr_pages: number of pages from start to pin + * @gup_flags: flags modifying pin behaviour ^ I assume we should drop that. At least for now the flags are completely unused. And most likely we would want a different set of flags later (GUPFD_ ...). + * @pages: array that receives pointers to the pages pinned. + * Should be at least nr_pages long. + * + * Attempt to pin (and migrate) pages associated with a file belonging to I'd drop the "and migrate" part, it's more of an implementation detail. + * either shmem or hugetlbfs. An error is returned if pages associated with + * hugetlbfs files are not present in the page cache. However, shmem pages + * are swapped in or allocated if they are not present in the page cache. Why don't we do the same for hugetlbfs? Would make the interface more streamlined. Certainly add that pinned pages have to be released using unpin_user_pages(). + * + * Returns number of pages pinned. This would be equal to the number of + * pages requested. + * If nr_pages is 0 or negative, returns 0. If no pages were pinned, returns + * -errno. + */ +long pin_user_pages_fd(int fd, pgoff_t start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages) +{ + struct page *page; + struct file *filep; + unsigned int flags, i; + long ret; + + if (nr_pages <= 0) + return 0; I think we should just forbid that and use a WARN_ON_ONCE() here / return -EINVAL. So we'll never end up returning 0. 
+ if (!is_valid_gup_args(pages, NULL, &gup_flags, FOLL_PIN)) + return 0; + + if (start < 0) + return -EINVAL; + + filep = fget(fd); + if (!filep) + return -EINVAL; + + if (!shmem_file(filep) && !is_file_hugepages(filep)) + return -EINVAL; + + flags = memalloc_pin_save(); + do { + for (i = 0; i < nr_pages; i++) { + if (shmem_mapping(filep->f_mapping)) { + page = shmem_read_mapping_page(filep->f_mapping, + start + i); + if (IS_ERR(page)) { + ret = PTR_ERR(page); + goto err; + } + } else { + page = find_get_page_flags(filep->f_mapping, + start + i, +
Re: [PATCH v1 0/3] udmabuf: Add support for page migration out of movable zone or CMA
I think it makes sense to have a generic (non-GUP) version of check_and_migrate_movable_pages() available in migration.h that drivers can use to ensure that they don't break memory hotunplug accidentally. Definitely not. Either use the VMA and pin_user_pages(), or implement pin_user_pages_fd() in core code. Do not open code something wonky in drivers. Agreed. pin_user_pages_fd() might become relevant in the context of vfio/mdev + KVM gmem -- don't mmap guest memory but instead provide it via a special memfd to the kernel. So there might be value in having such a core infrastructure. -- Cheers, David / dhildenb
Re: [PATCH v1 0/3] udmabuf: Add support for page migration out of movable zone or CMA
On 24.08.23 20:30, Jason Gunthorpe wrote: On Thu, Aug 24, 2023 at 08:30:17PM +0200, David Hildenbrand wrote: On 24.08.23 08:31, Kasireddy, Vivek wrote: Hi David, - Add a new API to the backing store/allocator to longterm-pin the page. For example, something along the lines of shmem_pin_mapping_page_longterm() for shmem as suggested by Daniel. A similar one needs to be added for hugetlbfs as well. This may also be reasonable. Sounds reasonable to keep the old API (that we unfortunately have) working. I agree; I'd like to avoid adding new APIs unless absolutely necessary. Given this, and considering the options I have mentioned earlier, what would be your recommendation for how page migration needs to be done in udmabuf driver? I guess using proper APIs for shmem and hugetlb. So, turning roughly what you have in patch#1 for now into common code, and only calling into that from udmabuf. This is a lot of work for an obscure uapi :\ Well, what can we otherwise do to *not* break existing users? I'm not happy about this either. Of course, we can come up with a new uapi, but we have to handle the old uapi somehow. Sure, we can simply always fail when we detect ZONE_MOVABLE or MIGRATE_CMA. Maybe that keeps at least some use cases working. -- Cheers, David / dhildenb
Re: [PATCH v1 0/3] udmabuf: Add support for page migration out of movable zone or CMA
On 24.08.23 08:31, Kasireddy, Vivek wrote: Hi David, - Add a new API to the backing store/allocator to longterm-pin the page. For example, something along the lines of shmem_pin_mapping_page_longterm() for shmem as suggested by Daniel. A similar one needs to be added for hugetlbfs as well. This may also be reasonable. Sounds reasonable to keep the old API (that we unfortunately have) working. I agree; I'd like to avoid adding new APIs unless absolutely necessary. Given this, and considering the options I have mentioned earlier, what would be your recommendation for how page migration needs to be done in udmabuf driver? I guess using proper APIs for shmem and hugetlb. So, turning roughly what you have in patch#1 for now into common code, and only calling into that from udmabuf. -- Cheers, David / dhildenb
Re: [PATCH v1 0/3] udmabuf: Add support for page migration out of movable zone or CMA
- Add a new API to the backing store/allocator to longterm-pin the page. For example, something along the lines of shmem_pin_mapping_page_longterm() for shmem as suggested by Daniel. A similar one needs to be added for hugetlbfs as well. This may also be reasonable. Sounds reasonable to keep the old API (that we unfortunately have) working. -- Cheers, David / dhildenb
Re: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)
On 04.08.23 08:39, Kasireddy, Vivek wrote: Hi Alistair, David, Jason, Right, the "the zero pages are changed into writable pages" in your above comment just might not apply, because there won't be any page replacement (hopefully :) ). If the page replacement does not happen when there are new writes to the area where the hole previously existed, then would we still get an invalidate when this happens? Is there any other way to get notified when the zeroed page is written to if the invalidate does not get triggered? What David is saying is that memfd does not use the zero page optimization for hole punches. Any access to the memory, including read-only access through hmm_range_fault() will allocate unique pages. Since there is no zero page and no zero-page replacement there is no issue with invalidations. It looks like even with hmm_range_fault(), the invalidate does not get triggered when the hole is refilled with new pages because of writes. This is probably because hmm_range_fault() does not fault in any pages that get invalidated later when writes occur. hmm_range_fault() returns the current content of the VMAs, or it faults. If it returns pages then it came from one of these two places. If your VMA is incoherent with what you are doing then you have bigger problems, or maybe you found a bug. Note it will only fault in pages if HMM_PFN_REQ_FAULT is specified. You are setting that however you aren't setting HMM_PFN_REQ_WRITE which is what would trigger a fault to bring in the new pages. Does setting that fix the issue you are seeing? No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the issue. 
Although, I do not have THP enabled (or built-in), shmem does not evict the pages after hole punch as noted in the comment in shmem_fallocate(): if ((u64)unmap_end > (u64)unmap_start) unmap_mapping_range(mapping, unmap_start, 1 + unmap_end - unmap_start, 0); shmem_truncate_range(inode, offset, offset + len - 1); /* No need to unmap again: hole-punching leaves COWed pages */ As a result, the pfn is still valid and the pte is pte_present() and pte_write(). This is the reason why adding in HMM_PFN_REQ_WRITE does not help; Just to understand your setup: you are definitely using a MAP_SHARED shmem mapping, and not accidentally a MAP_PRIVATE mapping? -- Cheers, David / dhildenb
Re: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)
On 03.08.23 14:14, Jason Gunthorpe wrote: On Thu, Aug 03, 2023 at 07:35:51AM +, Kasireddy, Vivek wrote: Hi Jason, Right, the "the zero pages are changed into writable pages" in your above comment just might not apply, because there won't be any page replacement (hopefully :) ). If the page replacement does not happen when there are new writes to the area where the hole previously existed, then would we still get an invalidate when this happens? Is there any other way to get notified when the zeroed page is written to if the invalidate does not get triggered? What David is saying is that memfd does not use the zero page optimization for hole punches. Any access to the memory, including read-only access through hmm_range_fault() will allocate unique pages. Since there is no zero page and no zero-page replacement there is no issue with invalidations. It looks like even with hmm_range_fault(), the invalidate does not get triggered when the hole is refilled with new pages because of writes. This is probably because hmm_range_fault() does not fault in any pages that get invalidated later when writes occur. hmm_range_fault() returns the current content of the VMAs, or it faults. If it returns pages then it came from one of these two places. If your VMA is incoherent with what you are doing then you have bigger problems, or maybe you found a bug. The above log messages are seen immediately after the hole is punched. As you can see, hmm_range_fault() returns the pfns of old pages and not zero pages. And, I see the below messages (with patch #2 in this series applied) as the hole is refilled after writes: I don't know what you are doing, but it is something wrong or you've found a bug in the memfds. Maybe THP is involved? 
I recently had to dig that out for an internal discussion: "Currently when truncating shmem file, if the range is partial of THP (start or end is in the middle of THP), the pages actually will just get cleared rather than being freed unless the range cover the whole THP. Even though all the subpages are truncated (randomly or sequentially), the THP may still be kept in page cache. This might be fine for some usecases which prefer preserving THP." My recollection is that this behavior was never changed. https://lore.kernel.org/all/1575420174-19171-1-git-send-email-yang@linux.alibaba.com/ -- Cheers, David / dhildenb
Re: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)
On 01.08.23 14:26, Jason Gunthorpe wrote: On Tue, Aug 01, 2023 at 02:26:03PM +0200, David Hildenbrand wrote: On 01.08.23 14:23, Jason Gunthorpe wrote: On Tue, Aug 01, 2023 at 02:22:12PM +0200, David Hildenbrand wrote: On 01.08.23 14:19, Jason Gunthorpe wrote: On Tue, Aug 01, 2023 at 05:32:38AM +0000, Kasireddy, Vivek wrote: You get another invalidate because the memfd removes the zero pages that hmm_range_fault installed in the PTEs before replacing them with actual writable pages. Then you do the move, and another hmm_range_fault, and basically the whole thing over again. Except this time instead of returning zero pages it returns actual writable page. Ok, when I tested earlier (by registering an invalidate callback) but without hmm_range_fault(), I did not find this additional invalidate getting triggered. Let me try with hmm_range_fault() and see if everything works as expected. Thank you for your help. If you do not get an invalidate then there is a pretty serious bug in the mm that needs fixing. Anything hmm_range_fault() returns must be invalidated if the underlying CPU mapping changes for any reason. Since hmm_range_fault() will populate zero pages when reading from a hole in a memfd, it must also get an invalidation when the zero pages are changed into writable pages. Can you point me at the code that returns that (shared) zero page? It calls handle_mm_fault() - shouldn't that do it? Same as if the CPU read faulted the page? To the best of my knowledge, the shared zeropage is only used in MAP_PRIVATE|MAP_ANON mappings and in weird DAX mappings. If that changed, we have to fix FOLL_PIN|FOLL_LONGTERM for MAP_SHARED VMAs. If you read-fault on a memfd hole, you should get a proper "zeroed" pagecache page that effectively "filled that hole" -- so there is no file hole anymore. Sounds fine then :) Right, the "the zero pages are changed into writable pages" in your above comment just might not apply, because there won't be any page replacement (hopefully :) ). 
-- Cheers, David / dhildenb
Re: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)
On 01.08.23 14:23, Jason Gunthorpe wrote: On Tue, Aug 01, 2023 at 02:22:12PM +0200, David Hildenbrand wrote: On 01.08.23 14:19, Jason Gunthorpe wrote: On Tue, Aug 01, 2023 at 05:32:38AM +0000, Kasireddy, Vivek wrote: You get another invalidate because the memfd removes the zero pages that hmm_range_fault installed in the PTEs before replacing them with actual writable pages. Then you do the move, and another hmm_range_fault, and basically the whole thing over again. Except this time instead of returning zero pages it returns actual writable page. Ok, when I tested earlier (by registering an invalidate callback) but without hmm_range_fault(), I did not find this additional invalidate getting triggered. Let me try with hmm_range_fault() and see if everything works as expected. Thank you for your help. If you do not get an invalidate then there is a pretty serious bug in the mm that needs fixing. Anything hmm_range_fault() returns must be invalidated if the underlying CPU mapping changes for any reason. Since hmm_range_fault() will populate zero pages when reading from a hole in a memfd, it must also get an invalidation when the zero pages are changed into writable pages. Can you point me at the code that returns that (shared) zero page? It calls handle_mm_fault() - shouldn't that do it? Same as if the CPU read faulted the page? To the best of my knowledge, the shared zeropage is only used in MAP_PRIVATE|MAP_ANON mappings and in weird DAX mappings. If that changed, we have to fix FOLL_PIN|FOLL_LONGTERM for MAP_SHARED VMAs. If you read-fault on a memfd hole, you should get a proper "zeroed" pagecache page that effectively "filled that hole" -- so there is no file hole anymore. -- Cheers, David / dhildenb
Re: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)
On 01.08.23 14:19, Jason Gunthorpe wrote: On Tue, Aug 01, 2023 at 05:32:38AM +0000, Kasireddy, Vivek wrote: You get another invalidate because the memfd removes the zero pages that hmm_range_fault installed in the PTEs before replacing them with actual writable pages. Then you do the move, and another hmm_range_fault, and basically the whole thing over again. Except this time instead of returning zero pages it returns actual writable page. Ok, when I tested earlier (by registering an invalidate callback) but without hmm_range_fault(), I did not find this additional invalidate getting triggered. Let me try with hmm_range_fault() and see if everything works as expected. Thank you for your help. If you do not get an invalidate then there is a pretty serious bug in the mm that needs fixing. Anything hmm_range_fault() returns must be invalidated if the underlying CPU mapping changes for any reason. Since hmm_range_fault() will populate zero pages when reading from a hole in a memfd, it must also get an invalidation when the zero pages are changed into writable pages. Can you point me at the code that returns that (shared) zero page? -- Cheers, David / dhildenb
Re: [PATCH v2 4/4] perf/core: use vma_is_initial_stack() and vma_is_initial_heap()
On 19.07.23 09:51, Kefeng Wang wrote: Use the helpers to simplify code, also kill unneeded goto cpy_name. Cc: Peter Zijlstra Cc: Arnaldo Carvalho de Melo Signed-off-by: Kefeng Wang --- kernel/events/core.c | 22 +++--- 1 file changed, 7 insertions(+), 15 deletions(-) diff --git a/kernel/events/core.c b/kernel/events/core.c index 78ae7b6f90fd..d59f6327472f 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -8685,22 +8685,14 @@ static void perf_event_mmap_event(struct perf_mmap_event *mmap_event) } name = (char *)arch_vma_name(vma); - if (name) - goto cpy_name; - - if (vma->vm_start <= vma->vm_mm->start_brk && - vma->vm_end >= vma->vm_mm->brk) { - name = "[heap]"; - goto cpy_name; + if (!name) { + if (vma_is_initial_heap(vma)) + name = "[heap]"; + else if (vma_is_initial_stack(vma)) + name = "[stack]"; + else + name = "//anon"; } - if (vma->vm_start <= vma->vm_mm->start_stack && - vma->vm_end >= vma->vm_mm->start_stack) { - name = "[stack]"; - goto cpy_name; - } - - name = "//anon"; - goto cpy_name; If you're removing that goto, maybe also worth removing the goto at the end of the previous if branch. Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 3/4] selinux: use vma_is_initial_stack() and vma_is_initial_heap()
On 19.07.23 09:51, Kefeng Wang wrote: Use the helpers to simplify code. Cc: Paul Moore Cc: Stephen Smalley Cc: Eric Paris Acked-by: Paul Moore Signed-off-by: Kefeng Wang --- security/selinux/hooks.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index d06e350fedee..ee8575540a8e 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -3762,13 +3762,10 @@ static int selinux_file_mprotect(struct vm_area_struct *vma, if (default_noexec && (prot & PROT_EXEC) && !(vma->vm_flags & VM_EXEC)) { int rc = 0; - if (vma->vm_start >= vma->vm_mm->start_brk && - vma->vm_end <= vma->vm_mm->brk) { + if (vma_is_initial_heap(vma)) { rc = avc_has_perm(sid, sid, SECCLASS_PROCESS, PROCESS__EXECHEAP, NULL); - } else if (!vma->vm_file && - ((vma->vm_start <= vma->vm_mm->start_stack && -vma->vm_end >= vma->vm_mm->start_stack) || + } else if (!vma->vm_file && (vma_is_initial_stack(vma) || vma_is_stack_for_current(vma))) { rc = avc_has_perm(sid, sid, SECCLASS_PROCESS, PROCESS__EXECSTACK, NULL); Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 2/4] drm/amdkfd: use vma_is_initial_stack() and vma_is_initial_heap()
On 19.07.23 09:51, Kefeng Wang wrote: Use the helpers to simplify code. Cc: Felix Kuehling Cc: Alex Deucher Cc: "Christian König" Cc: "Pan, Xinhui" Cc: David Airlie Cc: Daniel Vetter Signed-off-by: Kefeng Wang --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 5ff1a5a89d96..0b7bfbd0cb66 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -2621,10 +2621,7 @@ svm_range_get_range_boundaries(struct kfd_process *p, int64_t addr, return -EFAULT; } - *is_heap_stack = (vma->vm_start <= vma->vm_mm->brk && - vma->vm_end >= vma->vm_mm->start_brk) || -(vma->vm_start <= vma->vm_mm->start_stack && - vma->vm_end >= vma->vm_mm->start_stack); + *is_heap_stack = vma_is_initial_heap(vma) || vma_is_initial_stack(vma); start_limit = max(vma->vm_start >> PAGE_SHIFT, (unsigned long)ALIGN_DOWN(addr, 2UL << 8)); Certainly a valid refactoring, although questionable if such code should care about that. Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 1/4] mm: factor out VMA stack and heap checks
On 19.07.23 09:51, Kefeng Wang wrote: Factor out VMA stack and heap checks and name them vma_is_initial_stack() and vma_is_initial_heap() for general use. Cc: Christian Göttsche Cc: David Hildenbrand Signed-off-by: Kefeng Wang --- [...] diff --git a/include/linux/mm.h b/include/linux/mm.h index 2dd73e4f3d8e..51f8c573db74 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -822,6 +822,27 @@ static inline bool vma_is_anonymous(struct vm_area_struct *vma) return !vma->vm_ops; } Worth adding a similar comment like for vma_is_initial_stack() ? +static inline bool vma_is_initial_heap(const struct vm_area_struct *vma) +{ + return vma->vm_start <= vma->vm_mm->brk && + vma->vm_end >= vma->vm_mm->start_brk; +} + +/* + * Indicate if the VMA is a stack for the given task; for + * /proc/PID/maps that is the stack of the main task. + */ +static inline bool vma_is_initial_stack(const struct vm_area_struct *vma) +{ + /* +* We make no effort to guess what a given thread considers to be +* its "stack". It's not even well-defined for programs written +* in languages like Go. +*/ + return vma->vm_start <= vma->vm_mm->start_stack && + vma->vm_end >= vma->vm_mm->start_stack; +} + static inline bool vma_is_temporary_stack(struct vm_area_struct *vma) { int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP); Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 2/2] udmabuf: Add back support for mapping hugetlb pages (v2)
On 18.07.23 10:26, Vivek Kasireddy wrote: A user or admin can configure a VMM (Qemu) Guest's memory to be backed by hugetlb pages for various reasons. However, a Guest OS would still allocate (and pin) buffers that are backed by regular 4k sized pages. In order to map these buffers and create dma-bufs for them on the Host, we first need to find the hugetlb pages where the buffer allocations are located and then determine the offsets of individual chunks (within those pages) and use this information to eventually populate a scatterlist. Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500 options were passed to the Host kernel and Qemu was launched with these relevant options: qemu-system-x86_64 -m 4096m -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080 -display gtk,gl=on -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M -machine memory-backend=mem1 Replacing -display gtk,gl=on with -display gtk,gl=off above would exercise the mmap handler. v2: Updated get_sg_table() to manually populate the scatterlist for both huge page and non-huge-page cases. Cc: David Hildenbrand Cc: Mike Kravetz Cc: Hugh Dickins Cc: Peter Xu Cc: Jason Gunthorpe Cc: Gerd Hoffmann Cc: Dongwon Kim Cc: Junxiao Chang Signed-off-by: Vivek Kasireddy --- drivers/dma-buf/udmabuf.c | 84 +-- 1 file changed, 71 insertions(+), 13 deletions(-) LGTM, in general. But I really hope Mike can comment. -- Cheers, David / dhildenb
Re: [PATCH v2 1/2] udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap
On 18.07.23 10:26, Vivek Kasireddy wrote: Add VM_PFNMAP to vm_flags in the mmap handler to ensure that the mappings would be managed without using struct page. And, in the vm_fault handler, use vmf_insert_pfn to share the page's pfn to userspace instead of directly sharing the page (via struct page *). Cc: David Hildenbrand Cc: Mike Kravetz Cc: Hugh Dickins Cc: Peter Xu Cc: Jason Gunthorpe Cc: Gerd Hoffmann Cc: Dongwon Kim Cc: Junxiao Chang Suggested-by: David Hildenbrand Signed-off-by: Vivek Kasireddy --- drivers/dma-buf/udmabuf.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c index c40645999648..820c993c8659 100644 --- a/drivers/dma-buf/udmabuf.c +++ b/drivers/dma-buf/udmabuf.c @@ -35,12 +35,13 @@ static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf) struct vm_area_struct *vma = vmf->vma; struct udmabuf *ubuf = vma->vm_private_data; pgoff_t pgoff = vmf->pgoff; + unsigned long pfn; if (pgoff >= ubuf->pagecount) return VM_FAULT_SIGBUS; - vmf->page = ubuf->pages[pgoff]; - get_page(vmf->page); - return 0; + + pfn = page_to_pfn(ubuf->pages[pgoff]); + return vmf_insert_pfn(vma, vmf->address, pfn); } static const struct vm_operations_struct udmabuf_vm_ops = { @@ -56,6 +57,7 @@ static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct *vma) vma->vm_ops = &udmabuf_vm_ops; vma->vm_private_data = ubuf; + vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP); return 0; } Looks reasonable to me. Acked-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH 1/5] mm: introduce vma_is_stack() and vma_is_heap()
On 12.07.23 16:38, Kefeng Wang wrote: Introduce the two helpers for general use. Signed-off-by: Kefeng Wang --- include/linux/mm.h | 12 1 file changed, 12 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 1462cf15badf..0bbeb31ac750 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -926,6 +926,18 @@ static inline bool vma_is_anonymous(struct vm_area_struct *vma) return !vma->vm_ops; } +static inline bool vma_is_heap(struct vm_area_struct *vma) +{ + return vma->vm_start <= vma->vm_mm->brk && + vma->vm_end >= vma->vm_mm->start_brk; +} + +static inline bool vma_is_stack(struct vm_area_struct *vma) +{ + return vma->vm_start <= vma->vm_mm->start_stack && + vma->vm_end >= vma->vm_mm->start_stack; +} + static inline bool vma_is_temporary_stack(struct vm_area_struct *vma) { int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP); Looking at the comments in patch #3, should these functions be called vma_is_initial_heap / vma_is_initial_stack ? -- Cheers, David / dhildenb
Re: [PATCH 2/5] mm: use vma_is_stack() and vma_is_heap()
On 12.07.23 16:38, Kefeng Wang wrote: Use the helpers to simplify code. Signed-off-by: Kefeng Wang --- fs/proc/task_mmu.c | 24 fs/proc/task_nommu.c | 15 +-- 2 files changed, 5 insertions(+), 34 deletions(-) Please squash patch #1 and this patch and call it something like "mm: factor out VMA stack and heap checks" And then, maybe also keep the comments in these functions, they sound reasonable to have. -- Cheers, David / dhildenb
Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
On 27.06.23 08:37, Kasireddy, Vivek wrote: Hi David, Hi! sorry for taking a bit longer to reply lately. [...] Sounds right, maybe it needs to go back to the old GUP solution, though, as mmu notifiers are also mm-based not fd-based. Or to be explicit, I think it'll be pin_user_pages(FOLL_LONGTERM) with the new API. It'll also solve the movable pages issue on pinning. It better should be pin_user_pages(FOLL_LONGTERM). But I'm afraid we cannot achieve that without breaking the existing kernel interface ... Yeah, as you suggest, we unfortunately cannot go back to using GUP without breaking udmabuf_create UAPI that expects memfds and file offsets. So we might have to implement the same page migration as gup does on FOLL_LONGTERM here ... maybe there are more such cases/drivers that actually require that handling when simply taking pages out of the memfd, believing they can hold on to them forever. IIUC, I don't think just handling the page migration in udmabuf is going to cut it. It might require active cooperation of the Guest GPU driver as well if this is even feasible. The idea is, that once you extract the page from the memfd and it resides somewhere bad (MIGRATE_CMA, ZONE_MOVABLE), you trigger page migration. Essentially what migrate_longterm_unpinnable_pages() does: Why would the guest driver have to be involved? It shouldn't care about page migration in the hypervisor. [...] balloon, and then using that memory for communicating with the device] Maybe it's all fine with udmabuf because of the way it is setup/torn down by the guest driver. Unfortunately I can't tell. Here are the functions used by virtio-gpu (Guest GPU driver) to allocate pages for its resources: __drm_gem_shmem_create: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_gem_shmem_helper.c#L97 Interestingly, the comment in the above function says that the pages should not be allocated from the MOVABLE zone. 
It doesn't add GFP_MOVABLE, so pages don't end up in ZONE_MOVABLE/MIGRATE_CMA *in the guest*. But we care about the ZONE_MOVABLE /MIGRATE_CMA *in the host*. (what the guest does is right, though) IOW, what udmabuf does with guest memory on the hypervisor side, not the guest driver on the guest side. The pages along with their dma addresses are then extracted and shared with Qemu using these two functions: drm_gem_get_pages: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_gem.c#L534 virtio_gpu_object_shmem_init: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/virtio/virtgpu_object.c#L135 ^ so these two target the guest driver as well, right? IOW, there is a memfd (shmem) in the guest that the guest driver uses to allocate pages from and there is the memfd in the hypervisor to back guest RAM. The latter gets registered with udmabuf. Qemu then translates the dma addresses into file offsets and creates udmabufs -- as an optimization to avoid data copies only if blob is set to true. If the guest OS doesn't end up freeing/reallocating that memory while it's registered with udmabuf in the hypervisor, then we should be fine. Because that way, the guest won't end up trigger MADV_DONTNEED by "accident". -- Cheers, David / dhildenb
Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
On 26.06.23 19:52, Peter Xu wrote: On Mon, Jun 26, 2023 at 07:45:37AM +0000, Kasireddy, Vivek wrote: Hi Peter, On Fri, Jun 23, 2023 at 06:13:02AM +0000, Kasireddy, Vivek wrote: Hi David, The first patch ensures that the mappings needed for handling mmap operation would be managed by using the pfn instead of struct page. The second patch restores support for mapping hugetlb pages where subpages of a hugepage are not directly used anymore (main reason for revert) and instead the hugetlb pages and the relevant offsets are used to populate the scatterlist for dma-buf export and for mmap operation. Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500 options were passed to the Host kernel and Qemu was launched with these relevant options: qemu-system-x86_64 -m 4096m -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080 -display gtk,gl=on -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M -machine memory-backend=mem1 Replacing -display gtk,gl=on with -display gtk,gl=off above would exercise the mmap handler. While I think the VM_PFNMAP approach is much better and should fix that issue at hand, I thought more about missing memlock support and realized that we might have to fix something else. So I'm going to raise the issue here. I think udmabuf chose the wrong interface to do what it's doing, that makes it harder to fix it eventually. Instead of accepting a range in a memfd, it should just have accepted a user space address range and then used pin_user_pages(FOLL_WRITE|FOLL_LONGTERM) to longterm-pin the pages "officially". Udmabuf indeed started off by using user space address range and GUP but the dma-buf subsystem maintainer had concerns with that approach in v2. It also had support for mlock in that version. Here is v2 and the relevant conversation: https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2 So what's the issue? Udmabuf effectively pins pages longterm ("possibly forever") simply by grabbing a reference on them. 
These pages might easily reside in ZONE_MOVABLE or in MIGRATE_CMA pageblocks. So what udmabuf does is break memory hotunplug and CMA, because it turns pages that have to remain movable unmovable. In the pin_user_pages(FOLL_LONGTERM) case we make sure to migrate these pages. See mm/gup.c:check_and_migrate_movable_pages() and especially folio_is_longterm_pinnable(). We'd probably have to implement something similar for udmabuf, where we detect such unpinnable pages and migrate them. The pages udmabuf pins are only those associated with Guest (GPU driver/virtio-gpu) resources (or buffers allocated and pinned from shmem via drm GEM). Some resources are short-lived, and some are long-lived and whenever a resource gets destroyed, the pages are unpinned. And, not all resources have their pages pinned. The resource that is pinned for the longest duration is the FB and that's because it is updated every ~16ms (assuming 1920x1080@60) by the Guest GPU driver. We can certainly pin/unpin the FB after it is accessed on the Host as a workaround, but I guess that may not be very efficient given the amount of churn it would create. Also, as far as migration or S3/S4 is concerned, my understanding is that all the Guest resources are destroyed and recreated again. So, wouldn't something similar happen during memory hotunplug? For example, pairing udmabuf with vfio (which pins pages using pin_user_pages(FOLL_LONGTERM)) in QEMU will most probably not work in all cases: if udmabuf longterm pinned the pages "the wrong way", vfio will fail to migrate them during FOLL_LONGTERM and consequently fail pin_user_pages(). As long as udmabuf holds a reference on these pages, that will never succeed. Dma-buf rules (for exporters) indicate that the pages only need to be pinned during the map_attachment phase (and until unmap attachment happens). In other words, only when the sg_table is created by udmabuf. 
I guess one option would be to not hold any references during UDMABUF_CREATE and only grab references to the pages (as and when they get used) during this step. Would this help? IIUC the refcount is needed, otherwise I don't see what would protect the page from being freed and even reused elsewhere before map_attachment(). It seems the previous concern on using gup was mainly fork(), if this is it: https://patchwork.freedesktop.org/patch/210992/?series=39879=2#comment_414213 Could it also be guarded by just making sure the pages are MAP_SHARED when creating the udmabuf, if fork() is a requirement of the feature? I had a feeling that the userspace still needs to always do the right thing to make it work, even using pure PFN mappings. For instance, what if the user app just punches a hole in the shmem/hugetlbfs file after creating the udmabuf (I see that F_SEAL_SHRINK is required, but at least not F_SEAL_WRITE with current impl), and faults a new page into the page cache? IIUC, Qemu creates and owns the memfd that is
Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
On 22.06.23 09:27, Vivek Kasireddy wrote: The first patch ensures that the mappings needed for handling mmap operation would be managed by using the pfn instead of struct page. The second patch restores support for mapping hugetlb pages where subpages of a hugepage are not directly used anymore (main reason for revert) and instead the hugetlb pages and the relevant offsets are used to populate the scatterlist for dma-buf export and for mmap operation. Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500 options were passed to the Host kernel and Qemu was launched with these relevant options: qemu-system-x86_64 -m 4096m -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080 -display gtk,gl=on -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M -machine memory-backend=mem1 Replacing -display gtk,gl=on with -display gtk,gl=off above would exercise the mmap handler. While I think the VM_PFNMAP approach is much better and should fix that issue at hand, I thought more about missing memlock support and realized that we might have to fix something else. So I'm going to raise the issue here. I think udmabuf chose the wrong interface to do what it's doing, which makes it harder to fix it eventually. Instead of accepting a range in a memfd, it should just have accepted a user space address range and then used pin_user_pages(FOLL_WRITE|FOLL_LONGTERM) to longterm-pin the pages "officially". So what's the issue? udmabuf effectively pins pages longterm ("possibly forever") simply by grabbing a reference on them. These pages might easily reside in ZONE_MOVABLE or in MIGRATE_CMA pageblocks. So what udmabuf does is break memory hotunplug and CMA, because it turns pages that have to remain movable unmovable. In the pin_user_pages(FOLL_LONGTERM) case we make sure to migrate these pages. See mm/gup.c:check_and_migrate_movable_pages() and especially folio_is_longterm_pinnable(). 
We'd probably have to implement something similar for udmabuf, where we detect such unpinnable pages and migrate them. For example, pairing udmabuf with vfio (which pins pages using pin_user_pages(FOLL_LONGTERM)) in QEMU will most probably not work in all cases: if udmabuf longterm pinned the pages "the wrong way", vfio will fail to migrate them during FOLL_LONGTERM and consequently fail pin_user_pages(). As long as udmabuf holds a reference on these pages, that will never succeed. There are *probably* more issues on the QEMU side when udmabuf is paired with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for virtio-balloon, virtio-mem, postcopy live migration, ... for example, in the vfio/vdpa case we make sure that we disallow most of these, because otherwise there can be an accidental "disconnect" between the pages mapped into the VM (guest view) and the pages mapped into the IOMMU (device view), for example, after a reboot. -- Cheers, David / dhildenb
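For reference, the FOLL_LONGTERM handling David points at can be sketched roughly as follows. This is simplified pseudocode of the mm/gup.c flow, not the literal kernel code; pin_pages() here stands in for the real pinning internals:

```c
/* Simplified pseudocode of the FOLL_LONGTERM slow path in mm/gup.c:
 * pin, then check each page with folio_is_longterm_pinnable(); if any
 * page lives in ZONE_MOVABLE or a MIGRATE_CMA pageblock, everything is
 * unpinned, those pages are migrated to pinnable memory, and the pin
 * is retried. */
long pin_longterm(unsigned long start, unsigned long nr, struct page **pages)
{
	long rc;

	do {
		rc = pin_pages(start, nr, FOLL_PIN | FOLL_LONGTERM, pages);
		if (rc <= 0)
			return rc;
		/* Returns 0 if all pages are pinnable, -EAGAIN after it
		 * unpinned everything and kicked off migration, or
		 * another error on migration failure. */
		rc = check_and_migrate_movable_pages(nr, pages);
	} while (rc == -EAGAIN);

	return rc ? rc : nr;
}
```

udmabuf bypasses this entirely by taking plain references, which is why its pins can strand movable pages.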
Re: [PATCH] udmabuf: revert 'Add support for mapping hugepages (v4)'
Skimming over the shmem_read_mapping_page() users, I assume most of them use a VM_PFNMAP mapping (or don't mmap them at all), where we won't be messing with the struct page at all. (That might even allow you to mmap hugetlb sub-pages, because the struct page -- and mapcount -- will be ignored completely and not touched.) Oh, are you suggesting that if we do vma->vm_flags |= VM_PFNMAP in the mmap handler (mmap_udmabuf) and also do vmf_insert_pfn(vma, vmf->address, page_to_pfn(page)) instead of vmf->page = ubuf->pages[pgoff]; get_page(vmf->page); in the vma fault handler (udmabuf_vm_fault), we can avoid most of the pitfalls you have identified -- including with the usage of hugetlb subpages? Yes, that's my thinking, but I have to do my homework first to see if that would really work for hugetlb. The thing is, I kind-of consider what udmabuf does a layer violation: we have a filesystem (shmem/hugetlb) that should handle mappings to user space. Yet, a driver decides to bypass that and simply map the pages ordinarily to user space. (revealed by the fact that hugetlb never maps sub-pages but udmabuf decides to do so) In an ideal world everybody would simply mmap() the original memfd, but thinking about offset+size configuration within the memfd that might not always be desirable. As a workaround, we could mmap() only the PFNs, leaving the struct page unaffected. I'll have to look closer into that. -- Cheers, David / dhildenb
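Put together, the change being discussed could look roughly like the following pseudocode sketch (hypothetical, not an actual patch; udmabuf_vm_ops is assumed to point at the fault handler):

```c
/* Pseudocode sketch of the VM_PFNMAP idea: map raw PFNs so the core mm
 * never touches the struct page, its refcount or its mapcount -- which
 * is what would make even hugetlb sub-pages mappable. */
static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
{
	struct udmabuf *ubuf = vmf->vma->vm_private_data;

	if (vmf->pgoff >= ubuf->pagecount)
		return VM_FAULT_SIGBUS;

	/* Insert a bare PFN instead of vmf->page + get_page(). */
	return vmf_insert_pfn(vmf->vma, vmf->address,
			      page_to_pfn(ubuf->pages[vmf->pgoff]));
}

static int mmap_udmabuf(struct udmabuf *ubuf, struct vm_area_struct *vma)
{
	vma->vm_ops = &udmabuf_vm_ops;	/* ops containing the handler above */
	vma->vm_private_data = ubuf;
	vma->vm_flags |= VM_PFNMAP;	/* special mapping: no struct page */
	return 0;
}
```

With VM_PFNMAP set, fork(), rmap and mapcount accounting all leave the backing pages alone, sidestepping the sub-page pitfalls discussed above.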
Re: [PATCH] udmabuf: revert 'Add support for mapping hugepages (v4)'
On 13.06.23 10:26, Kasireddy, Vivek wrote: Hi David, On 12.06.23 09:10, Kasireddy, Vivek wrote: Hi Mike, Hi Vivek, Sorry for the late reply; I just got back from vacation. If it is unsafe to directly use the subpages of a hugetlb page, then reverting this patch seems like the only option for addressing this issue immediately. So, this patch is Acked-by: Vivek Kasireddy As far as the use-case is concerned, there are two main users of the udmabuf driver: Qemu and CrosVM VMMs. However, it appears Qemu is the only one that uses hugetlb pages (when hugetlb=on is set) as the backing store for Guest (Linux, Android and Windows) system memory. The main goal is to share the pages associated with the Guest allocated framebuffer (FB) with the Host GPU driver and other components in a zero-copy way. To that end, the guest GPU driver (virtio-gpu) allocates 4k size pages (associated with the FB) and pins them before sharing the (guest) physical (or dma) addresses (and lengths) with Qemu. Qemu then translates the addresses into file offsets and shares these offsets with udmabuf. Is my understanding correct, that we can effectively long-term pin (worse than mlock) 64 MiB per UDMABUF_CREATE, allowing eventually !root The 64 MiB limit is the theoretical upper bound that we have not seen hit in practice. Typically, for a 1920x1080 resolution (commonly used in Guests), the size of the FB is ~8 MB (1920x1080x4). And, most modern Graphics compositors flip between two FBs. Okay, but users with privileges to open that file can just create as many as they want? I think I'll have to play with it. users ll /dev/udmabuf crw-rw 1 root kvm 10, 125 12. Jun 08:12 /dev/udmabuf to bypass their effective MEMLOCK limit, fragmenting physical memory and breaking swap? Right, it does not look like the mlock limits are honored. That should be added. 
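Honoring the MEMLOCK limit could plausibly look something like the following pseudocode sketch (an assumption on my part, not a proposed patch), using the existing account_locked_vm() helper from mm/util.c the way vfio-style drivers charge pinned pages:

```c
/* Pseudocode sketch: charge the pages a udmabuf pins against the
 * creating process's RLIMIT_MEMLOCK, so unprivileged users cannot pin
 * unbounded amounts of memory. Field names here are hypothetical. */
static int udmabuf_account_pages(struct udmabuf *ubuf, unsigned long npages)
{
	int ret;

	/* Charges npages against current->mm->locked_vm and fails with
	 * -ENOMEM once RLIMIT_MEMLOCK would be exceeded (callers with
	 * CAP_IPC_LOCK are exempt). */
	ret = account_locked_vm(current->mm, npages, true);
	if (ret)
		return ret;
	ubuf->acct_pages = npages;	/* undo via account_locked_vm(mm,
					 * npages, false) on destroy */
	return 0;
}
```

The bookkeeping would need to survive until buffer teardown, which is why the page count is remembered on the udmabuf itself.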
Regarding the udmabuf_vm_fault(), I assume we're mapping pages we obtained from the memfd ourselves into a special VMA (mmap() of the mmap operation is really needed only if any component on the Host needs CPU access to the buffer. But in most scenarios, we try to ensure direct GPU access (h/w acceleration via gl) to these pages. udmabuf). I'm not sure how well shmem pages are prepared for getting mapped by someone else into an arbitrary VMA (page->index?). Most drm/gpu drivers use shmem pages as the backing store for FBs and other buffers and also provide mmap capability. What concerns do you see with this approach? Are these mmapping the pages the way udmabuf maps these pages (IOW, on-demand fault where the core-mm will adjust the mapcount etc)? Skimming over the shmem_read_mapping_page() users, I assume most of them use a VM_PFNMAP mapping (or don't mmap them at all), where we won't be messing with the struct page at all. (That might even allow you to mmap hugetlb sub-pages, because the struct page -- and mapcount -- will be ignored completely and not touched.) ... also, just imagine someone doing FALLOC_FL_PUNCH_HOLE / ftruncate() on the memfd. What's mapped into the memfd no longer corresponds to what's pinned / mapped into the VMA. IIUC, making use of the DMA_BUF_IOCTL_SYNC ioctl would help with any coherency issues: https://www.kernel.org/doc/html/v6.2/driver-api/dma-buf.html#c.dma_buf_sync Would it as of now? udmabuf_create() pulls the shmem pages out of the memfd, not sure how DMA_BUF_IOCTL_SYNC would help to update that whenever the pages inside the memfd would change (e.g., FALLOC_FL_PUNCH_HOLE + realloc). But that's most probably simply "not supported". Was linux-mm (and especially shmem maintainers, ccing Hugh) involved in the upstreaming of udmabuf? It does not appear so from the link below although other key lists were cc'd: https://patchwork.freedesktop.org/patch/246100/?series=39879=7 That's unfortunate :( -- Cheers, David / dhildenb
Re: [PATCH] udmabuf: revert 'Add support for mapping hugepages (v4)'
On 12.06.23 09:10, Kasireddy, Vivek wrote: Hi Mike, Hi Vivek, Sorry for the late reply; I just got back from vacation. If it is unsafe to directly use the subpages of a hugetlb page, then reverting this patch seems like the only option for addressing this issue immediately. So, this patch is Acked-by: Vivek Kasireddy As far as the use-case is concerned, there are two main users of the udmabuf driver: Qemu and CrosVM VMMs. However, it appears Qemu is the only one that uses hugetlb pages (when hugetlb=on is set) as the backing store for Guest (Linux, Android and Windows) system memory. The main goal is to share the pages associated with the Guest allocated framebuffer (FB) with the Host GPU driver and other components in a zero-copy way. To that end, the guest GPU driver (virtio-gpu) allocates 4k size pages (associated with the FB) and pins them before sharing the (guest) physical (or dma) addresses (and lengths) with Qemu. Qemu then translates the addresses into file offsets and shares these offsets with udmabuf. Is my understanding correct, that we can effectively long-term pin (worse than mlock) 64 MiB per UDMABUF_CREATE, allowing eventually !root users ll /dev/udmabuf crw-rw 1 root kvm 10, 125 12. Jun 08:12 /dev/udmabuf to bypass their effective MEMLOCK limit, fragmenting physical memory and breaking swap? Regarding the udmabuf_vm_fault(), I assume we're mapping pages we obtained from the memfd ourselves into a special VMA (mmap() of the udmabuf). I'm not sure how well shmem pages are prepared for getting mapped by someone else into an arbitrary VMA (page->index?). ... also, just imagine someone doing FALLOC_FL_PUNCH_HOLE / ftruncate() on the memfd. What's mapped into the memfd no longer corresponds to what's pinned / mapped into the VMA. Was linux-mm (and especially shmem maintainers, ccing Hugh) involved in the upstreaming of udmabuf? -- Cheers, David / dhildenb
Re: [PATCH] mm: fix hugetlb page unmap count balance issue
On 17.05.23 00:34, Mike Kravetz wrote: On 05/15/23 10:04, Mike Kravetz wrote: On 05/12/23 16:29, Mike Kravetz wrote: On 05/12/23 14:26, James Houghton wrote: On Fri, May 12, 2023 at 12:20 AM Junxiao Chang wrote: This alone doesn't fix mapcounting for PTE-mapped HugeTLB pages. You need something like [1]. I can resend it if that's what we should be doing, but this mapcounting scheme doesn't work when the page structs have been freed. It seems like it was a mistake to include support for hugetlb memfds in udmabuf. IIUC, it was added with commit 16c243e99d33 ("udmabuf: Add support for mapping hugepages (v4)"). Looks like it was never sent to linux-mm? That is unfortunate as hugetlb vmemmap freeing went in at about the same time. And, as you have noted, udmabuf will not work if hugetlb vmemmap freeing is enabled. Sigh! Trying to think of a way forward. -- Mike Kravetz [1]: https://lore.kernel.org/linux-mm/20230306230004.1387007-2-jthough...@google.com/ - James Adding people and list on Cc: involved with commit 16c243e99d33. There are several issues with trying to map tail pages of hugetlb pages not taken into account with udmabuf. James spent quite a bit of time trying to understand and address all the issues with the HGM code. While using the scheme proposed by James may be an approach to the mapcount issue, there are also other issues that need attention. For example, I do not see how the fault code checks the state of the hugetlb page (such as poison) as none of that state is carried in tail pages. The more I think about it, the more I think udmabuf should treat hugetlb pages as hugetlb pages. They should be mapped at the appropriate level in the page table. Of course, this would impose new restrictions on the API (mmap and ioctl) that may break existing users. I have no idea how extensively udmabuf is being used with hugetlb mappings. 
Verified that using udmabuf on a hugetlb mapping with vmemmap optimization will BUG as: [14106.812312] BUG: unable to handle page fault for address: ea000a7c4030 [14106.813704] #PF: supervisor write access in kernel mode [14106.814791] #PF: error_code(0x0003) - permissions violation [14106.815921] PGD 27fff9067 P4D 27fff9067 PUD 27fff8067 PMD 17ec34067 PTE 800285dab021 [14106.818489] Oops: 0003 [#1] PREEMPT SMP PTI [14106.819345] CPU: 2 PID: 2313 Comm: udmabuf Not tainted 6.4.0-rc1-next-20230508+ #44 [14106.820906] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014 [14106.822679] RIP: 0010:page_add_file_rmap+0x2e/0x270 I started looking more closely at the driver and I do not fully understand the usage model. I took clues from the selftest and driver. It seems the first step is to create a buffer via the UDMABUF_CREATE ioctl. This will copy 4K pages from the page cache to an array associated with a file. I did note that hugetlb and shm behavior is different here as the driver can not add missing hugetlb pages to the cache as it does with shm. However, what seems more concerning is that there is nothing to prevent the pages from being replaced in the cache before being added to a udmabuf mapping. This means the udmabuf mapping and the original memfd could be operating on a different set of pages. Is this acceptable, or somehow prevented? In my role, I am more interested in udmabuf handling of hugetlb pages. Trying to use individual 4K pages of hugetlb pages is something that should be avoided here. Would it be acceptable to change code so that only whole hugetlb pages are used by udmabuf? If not, then perhaps the existing hugetlb support can be removed? I'm wondering if that VMA shouldn't be some kind of special mapping (VM_PFNMAP), such that the struct page is entirely ignored? I'm quite confused and concerned when I read that code (what the hell is it doing with shmem/hugetlb pages? why does the mapcount even get adjusted?) 
This all has a bad smell to it, I hope I'm missing something important ... -- Cheers, David / dhildenb
Re: [PATCH v5 1/6] mm/gup: remove unused vmas parameter from get_user_pages()
On 16.05.23 16:30, Sean Christopherson wrote: On Tue, May 16, 2023, David Hildenbrand wrote: On 15.05.23 21:07, Sean Christopherson wrote: On Sun, May 14, 2023, Lorenzo Stoakes wrote: diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index cb5c13eee193..eaa5bb8dbadc 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2477,7 +2477,7 @@ static inline int check_user_page_hwpoison(unsigned long addr) { int rc, flags = FOLL_HWPOISON | FOLL_WRITE; - rc = get_user_pages(addr, 1, flags, NULL, NULL); + rc = get_user_pages(addr, 1, flags, NULL); return rc == -EHWPOISON; Unrelated to this patch, I think there's a pre-existing bug here. If gup() returns a valid page, KVM will leak the refcount and unintentionally pin the page. That's When passing NULL as "pages" to get_user_pages(), __get_user_pages_locked() won't set FOLL_GET. As FOLL_PIN is also not set, we won't be messing with the mapcount of the page. For completeness: s/mapcount/refcount/ :) Ah, that's what I'm missing. -- Thanks, David / dhildenb
Re: [PATCH v5 1/6] mm/gup: remove unused vmas parameter from get_user_pages()
On 15.05.23 21:07, Sean Christopherson wrote: On Sun, May 14, 2023, Lorenzo Stoakes wrote: No invocation of get_user_pages() uses the vmas parameter, so remove it. The GUP API is confusing and caveated. Recent changes have done much to improve that, however there is more we can do. Exporting vmas is a prime target as the caller has to be extremely careful to preclude their use after the mmap_lock has expired or otherwise be left with dangling pointers. Removing the vmas parameter focuses the GUP functions upon their primary purpose - pinning (and outputting) pages as well as performing the actions implied by the input flags. This is part of a patch series aiming to remove the vmas parameter altogether. Suggested-by: Matthew Wilcox (Oracle) Acked-by: Greg Kroah-Hartman Acked-by: David Hildenbrand Reviewed-by: Jason Gunthorpe Acked-by: Christian König (for radeon parts) Acked-by: Jarkko Sakkinen Signed-off-by: Lorenzo Stoakes --- arch/x86/kernel/cpu/sgx/ioctl.c | 2 +- drivers/gpu/drm/radeon/radeon_ttm.c | 2 +- drivers/misc/sgi-gru/grufault.c | 2 +- include/linux/mm.h | 3 +-- mm/gup.c | 9 +++------ mm/gup_test.c | 5 ++--- virt/kvm/kvm_main.c | 2 +- 7 files changed, 10 insertions(+), 15 deletions(-) Acked-by: Sean Christopherson (KVM) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index cb5c13eee193..eaa5bb8dbadc 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2477,7 +2477,7 @@ static inline int check_user_page_hwpoison(unsigned long addr) { int rc, flags = FOLL_HWPOISON | FOLL_WRITE; - rc = get_user_pages(addr, 1, flags, NULL, NULL); + rc = get_user_pages(addr, 1, flags, NULL); return rc == -EHWPOISON; Unrelated to this patch, I think there's a pre-existing bug here. If gup() returns a valid page, KVM will leak the refcount and unintentionally pin the page. That's When passing NULL as "pages" to get_user_pages(), __get_user_pages_locked() won't set FOLL_GET. As FOLL_PIN is also not set, we won't be messing with the mapcount of the page. 
So even if get_user_pages() returns "1", we should be fine. Or am I misunderstanding your concern? At least hva_to_pfn_slow() most certainly didn't return "1" if we end up calling check_user_page_hwpoison(), so nothing would have been pinned there as well. -- Thanks, David / dhildenb
Re: [PATCH v3 1/7] mm/gup: remove unused vmas parameter from get_user_pages()
On 15.04.23 14:09, Lorenzo Stoakes wrote: No invocation of get_user_pages() uses the vmas parameter, so remove it. The GUP API is confusing and caveated. Recent changes have done much to improve that, however there is more we can do. Exporting vmas is a prime target as the caller has to be extremely careful to preclude their use after the mmap_lock has expired or otherwise be left with dangling pointers. Removing the vmas parameter focuses the GUP functions upon their primary purpose - pinning (and outputting) pages as well as performing the actions implied by the input flags. This is part of a patch series aiming to remove the vmas parameter altogether. Signed-off-by: Lorenzo Stoakes Suggested-by: Matthew Wilcox (Oracle) Acked-by: Greg Kroah-Hartman --- Acked-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH mm-unstable v1 16/20] mm/frame-vector: remove FOLL_FORCE usage
On 29.11.22 10:08, Hans Verkuil wrote: On 29/11/2022 09:48, David Hildenbrand wrote: On 28.11.22 23:59, Andrew Morton wrote: On Mon, 28 Nov 2022 09:18:47 +0100 David Hildenbrand wrote: Less chances of things going wrong that way. Just mention in the v2 cover letter that the first patch was added to make it easy to backport that fix without being hampered by merge conflicts if it was added after your frame_vector.c patch. Yes, that's the way I would naturally do it; however, Andrew prefers delta updates for minor changes. @Andrew, whatever you prefer! I'm inclined to let things sit as they are. Cross-tree conflicts happen, and Linus handles them. I'll flag this (very simple) conflict in the pull request, if MM merges second. If v4l merges second then hopefully they will do the same. But this one is so simple that Linus hardly needs our help. It's not about cross-tree conflicts, it's about the fact that my patch is a fix that needs to be backported to older kernels. It should apply cleanly to those older kernels if my patch goes in first, but if it is the other way around I would have to make a new patch for the stable kernels. IIUC, the conflict will be resolved at merge time and the merge resolution will be part of the merge commit. It doesn't matter in which order the patches go upstream, the merge commit resolves the problematic overlap. So your patch will be upstream as intended, where it can be cleanly backported. Hope I am not twisting reality ;) -- Thanks, David / dhildenb
Re: [PATCH mm-unstable v1 16/20] mm/frame-vector: remove FOLL_FORCE usage
On 28.11.22 23:59, Andrew Morton wrote: On Mon, 28 Nov 2022 09:18:47 +0100 David Hildenbrand wrote: Less chances of things going wrong that way. Just mention in the v2 cover letter that the first patch was added to make it easy to backport that fix without being hampered by merge conflicts if it was added after your frame_vector.c patch. Yes, that's the way I would naturally do it; however, Andrew prefers delta updates for minor changes. @Andrew, whatever you prefer! I'm inclined to let things sit as they are. Cross-tree conflicts happen, and Linus handles them. I'll flag this (very simple) conflict in the pull request, if MM merges second. If v4l merges second then hopefully they will do the same. But this one is so simple that Linus hardly needs our help. But Linus won't be editing changelogs so that the changelog makes more sense after both trees are joined. I'm inclined to let the changelog sit as it is as well. Works for me. Thanks Andrew! -- Thanks, David / dhildenb
Re: [PATCH mm-unstable v1 16/20] mm/frame-vector: remove FOLL_FORCE usage
On 28.11.22 09:17, Hans Verkuil wrote: Hi David, On 27/11/2022 11:35, David Hildenbrand wrote: On 16.11.22 11:26, David Hildenbrand wrote: FOLL_FORCE is really only for ptrace access. According to commit 707947247e95 ("media: videobuf2-vmalloc: get_userptr: buffers are always writable"), get_vaddr_frames() currently pins all pages writable as a workaround for issues with read-only buffers. FOLL_FORCE, however, seems to be a legacy leftover as it predates commit 707947247e95 ("media: videobuf2-vmalloc: get_userptr: buffers are always writable"). Let's just remove it. Once the read-only buffer issue has been resolved, FOLL_WRITE could again be set depending on the DMA direction. Cc: Hans Verkuil Cc: Marek Szyprowski Cc: Tomasz Figa Cc: Marek Szyprowski Cc: Mauro Carvalho Chehab Signed-off-by: David Hildenbrand --- drivers/media/common/videobuf2/frame_vector.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/media/common/videobuf2/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c index 542dde9d2609..062e98148c53 100644 --- a/drivers/media/common/videobuf2/frame_vector.c +++ b/drivers/media/common/videobuf2/frame_vector.c @@ -50,7 +50,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames, start = untagged_addr(start); ret = pin_user_pages_fast(start, nr_frames, - FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM, + FOLL_WRITE | FOLL_LONGTERM, (struct page **)(vec->ptrs)); if (ret > 0) { vec->got_ref = true; Hi Andrew, see the discussion at [1] regarding a conflict and how to proceed with upstreaming. The conflict would be easy to resolve, however, also the patch description doesn't make sense anymore with [1]. Might it be easier and less confusing if you post a v2 of this series with my patch first? That way it is clear that 1) my patch has to come first, and 2) that it is part of a single series and should be merged by the mm subsystem. Less chances of things going wrong that way. 
Just mention in the v2 cover letter that the first patch was added to make it easy to backport that fix without being hampered by merge conflicts if it was added after your frame_vector.c patch. Yes, that's the way I would naturally do it; however, Andrew prefers delta updates for minor changes. @Andrew, whatever you prefer! Thanks! -- Thanks, David / dhildenb
Re: [PATCH mm-unstable v1 16/20] mm/frame-vector: remove FOLL_FORCE usage
On 16.11.22 11:26, David Hildenbrand wrote: FOLL_FORCE is really only for ptrace access. According to commit 707947247e95 ("media: videobuf2-vmalloc: get_userptr: buffers are always writable"), get_vaddr_frames() currently pins all pages writable as a workaround for issues with read-only buffers. FOLL_FORCE, however, seems to be a legacy leftover as it predates commit 707947247e95 ("media: videobuf2-vmalloc: get_userptr: buffers are always writable"). Let's just remove it. Once the read-only buffer issue has been resolved, FOLL_WRITE could again be set depending on the DMA direction. Cc: Hans Verkuil Cc: Marek Szyprowski Cc: Tomasz Figa Cc: Marek Szyprowski Cc: Mauro Carvalho Chehab Signed-off-by: David Hildenbrand --- drivers/media/common/videobuf2/frame_vector.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/media/common/videobuf2/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c index 542dde9d2609..062e98148c53 100644 --- a/drivers/media/common/videobuf2/frame_vector.c +++ b/drivers/media/common/videobuf2/frame_vector.c @@ -50,7 +50,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames, start = untagged_addr(start); ret = pin_user_pages_fast(start, nr_frames, - FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM, + FOLL_WRITE | FOLL_LONGTERM, (struct page **)(vec->ptrs)); if (ret > 0) { vec->got_ref = true; Hi Andrew, see the discussion at [1] regarding a conflict and how to proceed with upstreaming. The conflict would be easy to resolve, however, also the patch description doesn't make sense anymore with [1]. On top of mm-unstable, reverting this patch and applying [1] gives me an updated patch: From 1e66c25f1467c1f1e5f275312f2c6df29308d4df Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 16 Nov 2022 11:26:55 +0100 Subject: [PATCH] mm/frame-vector: remove FOLL_FORCE usage GUP now supports reliable R/O long-term pinning in COW mappings, such that we break COW early. 
MAP_SHARED VMAs only use the shared zeropage so far in one corner case (DAXFS file with holes), which can be ignored because GUP does not support long-term pinning in fsdax (see check_vma_flags()). Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop using FOLL_FORCE, which is really only for ptrace access. Reviewed-by: Daniel Vetter Acked-by: Hans Verkuil Cc: Hans Verkuil Cc: Marek Szyprowski Cc: Tomasz Figa Cc: Marek Szyprowski Cc: Mauro Carvalho Chehab Signed-off-by: David Hildenbrand --- drivers/media/common/videobuf2/frame_vector.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/media/common/videobuf2/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c index aad72640f055..8606fdacf5b8 100644 --- a/drivers/media/common/videobuf2/frame_vector.c +++ b/drivers/media/common/videobuf2/frame_vector.c @@ -41,7 +41,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames, bool write, int ret_pin_user_pages_fast = 0; int ret = 0; int err; - unsigned int gup_flags = FOLL_FORCE | FOLL_LONGTERM; + unsigned int gup_flags = FOLL_LONGTERM; if (nr_frames == 0) return 0; -- 2.38.1 Please let me know how you want to proceed. Ideally, you'd pick up [1] and apply this updated patch. Also, please tell me if I should send this updated patch in a separate mail (e.g., as reply to this mail). [1] https://lkml.kernel.org/r/71bdd3cf-b044-3f12-df58-7c16d5749...@xs4all.nl -- Thanks, David / dhildenb
Re: [PATCH RFC 16/19] mm/frame-vector: remove FOLL_FORCE usage
On 22.11.22 15:07, Hans Verkuil wrote: On 11/22/22 13:38, David Hildenbrand wrote: On 22.11.22 13:25, Hans Verkuil wrote: Hi Tomasz, David, On 11/8/22 05:45, Tomasz Figa wrote: Hi David, On Tue, Nov 8, 2022 at 1:19 AM David Hildenbrand wrote: FOLL_FORCE is really only for debugger access. According to commit 707947247e95 ("media: videobuf2-vmalloc: get_userptr: buffers are always writable"), the pinned pages are always writable. Actually that patch is only a workaround to temporarily disable support for read-only pages as they seemed to suffer from some corruption issues in the retrieved user pages. We expect to support read-only pages as hardware input after. That said, FOLL_FORCE doesn't sound like the right thing even in that case, but I don't know the background behind it being added here in the first place. +Hans Verkuil +Marek Szyprowski do you happen to remember anything about it? I tracked the use of 'force' all the way back to the first git commit (2.6.12-rc1) in the very old video-buf.c. So it is very, very old and the reason is lost in the mists of time. I'm not sure if the 'force' argument of get_user_pages() at that time even meant the same as FOLL_FORCE today. From what I can tell it has just been faithfully used ever since, but I have my doubt that anyone understands the reason behind it since it was never explained. Looking at this old LWN article https://lwn.net/Articles/28548/ suggests that it might be related to calling get_user_pages for write buffers (non-zero write argument) where you also want to be able to read from the buffer. That is certainly something that some drivers need to do post-capture fixups. But 'force' was also always set for read buffers, and I don't know if that was something that was actually needed, or just laziness. I assume that removing FOLL_FORCE from 'FOLL_FORCE|FOLL_WRITE' will still allow drivers to read from the buffer? Yes. 
The only problematic corner case I can imagine is if someone has a VMA without write permissions (no PROT_WRITE/VM_WRITE) and wants to pin user space pages as a read buffer. We'd now specify FOLL_WRITE without FOLL_FORCE and GUP would reject that: write access without write permissions is invalid. I do not believe this will be an issue. There would be no way around "fixing" this implementation to not specify FOLL_WRITE when only reading from user-space pages. Not sure what the implications are regarding that corruption that was mentioned in 707947247e95. Before 707947247e95 the FOLL_WRITE flag was only set for write buffers (i.e. video capture, DMA_FROM_DEVICE), not for read buffers (video output, DMA_TO_DEVICE). In the video output case there should never be any need for drivers to write to the buffer to the best of my knowledge. But I have had some complaints about that commit that it causes problems in some scenarios, and it has been on my todo list for quite some time now to dig deeper into this. I probably should prioritize this for this or next week. Having said that, I assume such a scenario is unlikely -- but you might know better how user space usually uses this interface. There would be three options: 1) Leave the FOLL_FORCE hack in for now, which I *really* want to avoid. 2) Remove FOLL_FORCE and see if anybody even notices (this patch) and leave the implementation as is for now. 3) Remove FOLL_FORCE and fix up the implementation to only specify FOLL_WRITE if the pages will actually get written to. 3) would most probably be ideal; however, I am no expert on that code and can't do it (707947247e95 confuses me). So naive me would go with 2) first. Option 3 would be best. And 707947247e95 confuses me as well, and I actually wrote it :-) I am wondering whether it was addressed at the right level, but as I said, I need to dig a bit deeper into this. Cool, let me know if I can help! -- Thanks, David / dhildenb
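Option 3) above boils down to deriving the GUP flags from the DMA direction instead of hard-coding FOLL_FORCE | FOLL_WRITE. A minimal userspace sketch of that flag selection follows; the FOLL_* values are illustrative stand-ins (the real ones live in include/linux/mm.h), and the helper name is hypothetical — frame_vector would pass the result to pin_user_pages_fast():

```c
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's FOLL_* flag values. */
#define FOLL_WRITE    0x01u
#define FOLL_LONGTERM 0x10000u

/*
 * Option 3: request write access only when the device will actually
 * write to the buffer (capture, DMA_FROM_DEVICE). Read buffers
 * (output, DMA_TO_DEVICE) get pinned R/O, so neither FOLL_WRITE nor
 * FOLL_FORCE is needed for them.
 */
static unsigned int vb2_gup_flags(bool device_writes_to_buffer)
{
	unsigned int gup_flags = FOLL_LONGTERM;

	if (device_writes_to_buffer)
		gup_flags |= FOLL_WRITE;
	return gup_flags;
}
```

With this shape, a read-only userspace mapping can back an output buffer again, because GUP is no longer asked for write access it cannot grant.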
Re: [PATCH RFC 16/19] mm/frame-vector: remove FOLL_FORCE usage
On 22.11.22 13:25, Hans Verkuil wrote: Hi Tomasz, David, On 11/8/22 05:45, Tomasz Figa wrote: Hi David, On Tue, Nov 8, 2022 at 1:19 AM David Hildenbrand wrote: FOLL_FORCE is really only for debugger access. According to commit 707947247e95 ("media: videobuf2-vmalloc: get_userptr: buffers are always writable"), the pinned pages are always writable. Actually that patch is only a workaround to temporarily disable support for read-only pages as they seemed to suffer from some corruption issues in the retrieved user pages. We expect to support read-only pages as hardware input afterwards. That said, FOLL_FORCE doesn't sound like the right thing even in that case, but I don't know the background behind it being added here in the first place. +Hans Verkuil +Marek Szyprowski do you happen to remember anything about it? I tracked the use of 'force' all the way back to the first git commit (2.6.12-rc1) in the very old video-buf.c. So it is very, very old and the reason is lost in the mists of time. I'm not sure if the 'force' argument of get_user_pages() at that time even meant the same as FOLL_FORCE today. From what I can tell it has just been faithfully used ever since, but I have my doubts that anyone understands the reason behind it since it was never explained. Looking at this old LWN article https://lwn.net/Articles/28548/ suggests that it might be related to calling get_user_pages for write buffers (non-zero write argument) where you also want to be able to read from the buffer. That is certainly something that some drivers need to do for post-capture fixups. But 'force' was also always set for read buffers, and I don't know if that was something that was actually needed, or just laziness. I assume that removing FOLL_FORCE from 'FOLL_FORCE|FOLL_WRITE' will still allow drivers to read from the buffer? Yes. The only problematic corner case I can imagine is if someone has a VMA without write permissions (no PROT_WRITE/VM_WRITE) and wants to pin user space pages as a read buffer.
We'd now specify FOLL_WRITE without FOLL_FORCE and GUP would reject that: write access without write permissions is invalid. There would be no way around "fixing" this implementation to not specify FOLL_WRITE when only reading from user-space pages. Not sure what the implications are regarding that corruption that was mentioned in 707947247e95. Having said that, I assume such a scenario is unlikely -- but you might know better how user space usually uses this interface. There would be three options: 1) Leave the FOLL_FORCE hack in for now, which I *really* want to avoid. 2) Remove FOLL_FORCE and see if anybody even notices (this patch) and leave the implementation as is for now. 3) Remove FOLL_FORCE and fix up the implementation to only specify FOLL_WRITE if the pages will actually get written to. 3) would most probably be ideal; however, I am no expert on that code and can't do it (707947247e95 confuses me). So naive me would go with 2) first. -- Thanks, David / dhildenb
Re: [PATCH mm-unstable v1 20/20] mm: rename FOLL_FORCE to FOLL_PTRACE
On 16.11.22 19:16, Linus Torvalds wrote: On Wed, Nov 16, 2022 at 2:30 AM David Hildenbrand wrote: Let's make it clearer that functionality provided by FOLL_FORCE is really only for ptrace access. I'm not super-happy about this one. I do understand the "let's rename the bit so that no new user shows up". And it's true that the main traditional use is ptrace. But from the patch itself it becomes obvious that no, it's not *just* ptrace. At least not yet. It's used for get_arg_page(), which uses it to basically look up (and install) pages in the newly created VM. Now, I'm not entirely sure why it even uses FOLL_FORCE - I think it might be historical, because the target should always be the new stack vma. Following the history of it is a bit of a mess, because there's a number of renamings and re-organizations, but it seems to go back to 2007 and commit b6a2fea39318 ("mm: variable length argument support"). Right. Before that commit, we kept our own array of "this is the set of pages that I will install in the new VM". That commit basically just inserts the pages directly into the VM instead, getting rid of the array size limitation. So at a minimum, I think that FOLL_FORCE would need to be removed before any renaming to FOLL_PTRACE, because that's not some kind of small random case. It *might* be as simple as just removing it, but maybe there's some reason for having it that I don't immediately see. Right, I have the same feeling. It might just be a copy-and-paste legacy leftover. There _are_ also small random cases too, like get_cmdline(). Maybe that counts as ptrace, but the execve() case most definitely does not. I agree. I'd suggest moving forward without this (last) patch for now and figuring out how to further clean up FOLL_FORCE usage on top. @Andrew, if you intend to put this into mm-unstable, please drop the last patch for now. -- Thanks, David / dhildenb
[PATCH mm-unstable v1 20/20] mm: rename FOLL_FORCE to FOLL_PTRACE
Let's make it clearer that functionality provided by FOLL_FORCE is really only for ptrace access. Prevent accidental re-use in drivers by renaming FOLL_FORCE to FOLL_PTRACE: git grep -l 'FOLL_FORCE' | xargs sed -i 's/FOLL_FORCE/FOLL_PTRACE/g' In the future, we might want to use a separate set of flags for the access_vm interface: most FOLL_* flags don't apply and we mostly only want to pass FOLL_PTRACE and FOLL_WRITE. Suggested-by: Christoph Hellwig Cc: Oleg Nesterov Cc: Richard Henderson Cc: Ivan Kokshaysky Cc: Matt Turner Cc: Catalin Marinas Cc: Will Deacon Cc: Thomas Bogendoerfer Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Christophe Leroy Cc: "David S. Miller" Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: "H. Peter Anvin" Cc: Richard Weinberger Cc: Anton Ivanov Cc: Johannes Berg Cc: Eric Biederman Cc: Kees Cook Cc: Alexander Viro Cc: Peter Zijlstra Cc: Arnaldo Carvalho de Melo Cc: Mark Rutland Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Namhyung Kim Cc: Mike Kravetz Cc: Muchun Song Cc: Kentaro Takeda Cc: Tetsuo Handa Cc: Paul Moore Cc: James Morris Cc: "Serge E. 
Hallyn" Signed-off-by: David Hildenbrand --- arch/alpha/kernel/ptrace.c| 6 +++--- arch/arm64/kernel/mte.c | 2 +- arch/ia64/kernel/ptrace.c | 10 +- arch/mips/kernel/ptrace32.c | 4 ++-- arch/mips/math-emu/dsemul.c | 2 +- arch/powerpc/kernel/ptrace/ptrace32.c | 4 ++-- arch/sparc/kernel/ptrace_32.c | 4 ++-- arch/sparc/kernel/ptrace_64.c | 8 arch/x86/kernel/step.c| 2 +- arch/x86/um/ptrace_32.c | 2 +- arch/x86/um/ptrace_64.c | 2 +- fs/exec.c | 2 +- fs/proc/base.c| 2 +- include/linux/mm.h| 8 kernel/events/uprobes.c | 4 ++-- kernel/ptrace.c | 12 ++-- mm/gup.c | 28 +-- mm/huge_memory.c | 8 mm/hugetlb.c | 2 +- mm/memory.c | 4 ++-- mm/util.c | 4 ++-- security/tomoyo/domain.c | 2 +- 22 files changed, 61 insertions(+), 61 deletions(-) diff --git a/arch/alpha/kernel/ptrace.c b/arch/alpha/kernel/ptrace.c index a1a239ea002d..55def6479ff2 100644 --- a/arch/alpha/kernel/ptrace.c +++ b/arch/alpha/kernel/ptrace.c @@ -158,7 +158,7 @@ static inline int read_int(struct task_struct *task, unsigned long addr, int * data) { int copied = access_process_vm(task, addr, data, sizeof(int), - FOLL_FORCE); + FOLL_PTRACE); return (copied == sizeof(int)) ? 0 : -EIO; } @@ -166,7 +166,7 @@ static inline int write_int(struct task_struct *task, unsigned long addr, int data) { int copied = access_process_vm(task, addr, &data, sizeof(int), - FOLL_FORCE | FOLL_WRITE); + FOLL_PTRACE | FOLL_WRITE); return (copied == sizeof(int)) ? 0 : -EIO; } @@ -284,7 +284,7 @@ long arch_ptrace(struct task_struct *child, long request, case PTRACE_PEEKTEXT: /* read word at location addr.
*/ case PTRACE_PEEKDATA: copied = ptrace_access_vm(child, addr, &tmp, sizeof(tmp), - FOLL_FORCE); + FOLL_PTRACE); ret = -EIO; if (copied != sizeof(tmp)) break; diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c index 7467217c1eaf..fa29fecaedbc 100644 --- a/arch/arm64/kernel/mte.c +++ b/arch/arm64/kernel/mte.c @@ -525,7 +525,7 @@ int mte_ptrace_copy_tags(struct task_struct *child, long request, int ret; struct iovec kiov; struct iovec __user *uiov = (void __user *)data; - unsigned int gup_flags = FOLL_FORCE; + unsigned int gup_flags = FOLL_PTRACE; if (!system_supports_mte()) return -EIO; diff --git a/arch/ia64/kernel/ptrace.c b/arch/ia64/kernel/ptrace.c index ab8aeb34d1d9..3781db1f506c 100644 --- a/arch/ia64/kernel/ptrace.c +++ b/arch/ia64/kernel/ptrace.c @@ -452,7 +452,7 @@ ia64_peek (struct task_struct *child, struct switch_stack *child_stack, return 0; } } - copied = access_process_vm(child, addr, &ret, sizeof(ret), FOLL_FORCE); + copied = access_process_vm(child, addr, &ret, sizeof(ret), FOLL_PTRACE); if (copied != sizeof(ret)) return -EIO; *val = ret; @@ -489,7 +489,7 @@ ia64_poke (struct task_struct *child, struct switch_stack *child_stack, } } } else if (access_process_vm(child, addr, &val, sizeof(val), - FOLL_FORCE | FOLL_WRITE) +
[PATCH mm-unstable v1 19/20] habanalabs: remove FOLL_FORCE usage
FOLL_FORCE is really only for ptrace access. As we unpin the pinned pages using unpin_user_pages_dirty_lock(true), the assumption is that all these pages are writable. FOLL_FORCE in this case seems to be due to copy-and-past from other drivers. Let's just remove it. Acked-by: Oded Gabbay Cc: Oded Gabbay Cc: Arnd Bergmann Cc: Greg Kroah-Hartman Signed-off-by: David Hildenbrand --- drivers/misc/habanalabs/common/memory.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/misc/habanalabs/common/memory.c b/drivers/misc/habanalabs/common/memory.c index ef28f3b37b93..e35cca96bbef 100644 --- a/drivers/misc/habanalabs/common/memory.c +++ b/drivers/misc/habanalabs/common/memory.c @@ -2312,8 +2312,7 @@ static int get_user_memory(struct hl_device *hdev, u64 addr, u64 size, if (!userptr->pages) return -ENOMEM; - rc = pin_user_pages_fast(start, npages, -FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM, + rc = pin_user_pages_fast(start, npages, FOLL_WRITE | FOLL_LONGTERM, userptr->pages); if (rc != npages) { -- 2.38.1
[PATCH mm-unstable v1 17/20] drm/exynos: remove FOLL_FORCE usage
FOLL_FORCE is really only for ptrace access. As we unpin the pinned pages using unpin_user_pages_dirty_lock(true), the assumption is that all these pages are writable. FOLL_FORCE in this case seems to be a legacy leftover. Let's just remove it. Cc: Inki Dae Cc: Seung-Woo Kim Cc: Kyungmin Park Cc: David Airlie Cc: Daniel Vetter Cc: Krzysztof Kozlowski Signed-off-by: David Hildenbrand --- drivers/gpu/drm/exynos/exynos_drm_g2d.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/exynos/exynos_drm_g2d.c b/drivers/gpu/drm/exynos/exynos_drm_g2d.c index 471fd6c8135f..e19c2ceb3759 100644 --- a/drivers/gpu/drm/exynos/exynos_drm_g2d.c +++ b/drivers/gpu/drm/exynos/exynos_drm_g2d.c @@ -477,7 +477,7 @@ static dma_addr_t *g2d_userptr_get_dma_addr(struct g2d_data *g2d, } ret = pin_user_pages_fast(start, npages, - FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM, + FOLL_WRITE | FOLL_LONGTERM, g2d_userptr->pages); if (ret != npages) { DRM_DEV_ERROR(g2d->dev, -- 2.38.1
[PATCH mm-unstable v1 18/20] RDMA/hw/qib/qib_user_pages: remove FOLL_FORCE usage
FOLL_FORCE is really only for ptrace access. As we unpin the pinned pages using unpin_user_pages_dirty_lock(true), the assumption is that all these pages are writable. FOLL_FORCE in this case seems to be a legacy leftover. Let's just remove it. Cc: Dennis Dalessandro Cc: Jason Gunthorpe Cc: Leon Romanovsky Signed-off-by: David Hildenbrand --- drivers/infiniband/hw/qib/qib_user_pages.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c index f4b5f05058e4..f693bc753b6b 100644 --- a/drivers/infiniband/hw/qib/qib_user_pages.c +++ b/drivers/infiniband/hw/qib/qib_user_pages.c @@ -110,7 +110,7 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages, for (got = 0; got < num_pages; got += ret) { ret = pin_user_pages(start_page + got * PAGE_SIZE, num_pages - got, -FOLL_LONGTERM | FOLL_WRITE | FOLL_FORCE, +FOLL_LONGTERM | FOLL_WRITE, p + got, NULL); if (ret < 0) { mmap_read_unlock(current->mm); -- 2.38.1
[PATCH mm-unstable v1 09/20] mm/gup: reliable R/O long-term pinning in COW mappings
We already support reliable R/O pinning of anonymous memory. However, assume we end up pinning (R/O long-term) a pagecache page or the shared zeropage inside a writable private ("COW") mapping. The next write access will trigger a write-fault and replace the pinned page by an exclusive anonymous page in the process page tables to break COW: the pinned page no longer corresponds to the page mapped into the process' page table. Now that FAULT_FLAG_UNSHARE can break COW on anything mapped into a COW mapping, let's properly break COW first before R/O long-term pinning something that's not an exclusive anon page inside a COW mapping. FAULT_FLAG_UNSHARE will break COW and map an exclusive anon page instead that can get pinned safely. With this change, we can stop using FOLL_FORCE|FOLL_WRITE for reliable R/O long-term pinning in COW mappings. With this change, the new R/O long-term pinning tests for non-anonymous memory succeed: # [RUN] R/O longterm GUP pin ... with shared zeropage ok 151 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with memfd ok 152 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with tmpfile ok 153 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with huge zeropage ok 154 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with memfd hugetlb (2048 kB) ok 155 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with memfd hugetlb (1048576 kB) ok 156 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with shared zeropage ok 157 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with memfd ok 158 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with tmpfile ok 159 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with huge zeropage ok 160 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (2048 kB) ok 161 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... 
with memfd hugetlb (1048576 kB) ok 162 Longterm R/O pin is reliable Note 1: We don't care about short-term R/O pinning, because such pins have snapshot semantics: they are not supposed to observe modifications that happen after pinning. As one example, assume we start direct I/O to read from a page and store page content into a file: modifications to page content after starting direct I/O are not guaranteed to end up in the file. So even if we'd pin the shared zeropage, the end result would be as expected -- getting zeroes stored to the file. Note 2: For shared mappings we'll now always fall back to the slow path to look up the VMA when R/O long-term pinning. While that's the necessary price we have to pay right now, it's actually not that bad in practice: most FOLL_LONGTERM users already specify FOLL_WRITE, for example, along with FOLL_FORCE because they tried dealing with COW mappings correctly ... Note 3: For users that use FOLL_LONGTERM right now without FOLL_WRITE, such as VFIO, we'd now no longer pin the shared zeropage. Instead, we'd populate exclusive anon pages that we can pin. There was a concern that this could affect the memlock limit of existing setups. For example, a VM running with VFIO could run into the memlock limit and fail to run. However, we essentially had the same behavior already in commit 17839856fd58 ("gup: document and work around "COW can break either way" issue") which got merged into some enterprise distros, and there were not any such complaints. So most probably, we're fine.
Signed-off-by: David Hildenbrand --- include/linux/mm.h | 27 --- mm/gup.c | 10 +- mm/huge_memory.c | 2 +- mm/hugetlb.c | 7 --- 4 files changed, 34 insertions(+), 12 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 6bd2ee5872dd..e8cc838f42f9 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3095,8 +3095,12 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags) * Must be called with the (sub)page that's actually referenced via the * page table entry, which might not necessarily be the head page for a * PTE-mapped THP. + * + * If the vma is NULL, we're coming from the GUP-fast path and might have + * to fallback to the slow path just to lookup the vma. */ -static inline bool gup_must_unshare(unsigned int flags, struct page *page) +static inline bool gup_must_unshare(struct vm_area_struct *vma, + unsigned int flags, struct page *page) { /* * FOLL_WRITE is implicitly handled correctly as the page table entry @@ -3109,8 +3113,25 @@ static inline bool gup_must_unshare(unsigned int flags, struct page *page) * Note: PageAnon(page) is stable until the page is actually getting * freed. */ - if (!PageAnon(page)) - return false; + if (!PageAnon(page)) { +
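The gup_must_unshare() hunk above gates the new behavior on PageAnon(); the overall decision can be condensed into a small predicate. This is a simplified userspace model under stated assumptions: the FOLL_* values and the boolean parameters are stand-ins for the real flags and page/VMA state, and the real function additionally has to fall back to the slow path on GUP-fast (where the VMA is unknown):

```c
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's FOLL_* flag values. */
#define FOLL_WRITE    0x01u
#define FOLL_LONGTERM 0x10000u
#define FOLL_PIN      0x40000u

/*
 * Simplified model of the R/O-pin unshare decision: FOLL_WRITE pins
 * already break COW via the ordinary write-fault path, so only R/O
 * pins are interesting. Anonymous pages must be unshared unless they
 * are already marked anon-exclusive; non-anonymous pages (pagecache,
 * shared zeropage) only need COW broken early for long-term pins
 * inside a writable private ("COW") mapping.
 */
static bool must_unshare_for_pin(unsigned int flags, bool page_is_anon,
				 bool page_anon_exclusive,
				 bool vma_is_cow_mapping)
{
	if (flags & FOLL_WRITE)
		return false;	/* write fault already broke COW */
	if (!page_is_anon) {
		/* Only R/O *long-term* pins need COW broken early ... */
		if (!(flags & FOLL_PIN) || !(flags & FOLL_LONGTERM))
			return false;
		/* ... and only inside writable private mappings. */
		return vma_is_cow_mapping;
	}
	/* Anonymous pages may only be pinned R/O if exclusive. */
	return !page_anon_exclusive;
}
```

When the predicate fires, the pin triggers FAULT_FLAG_UNSHARE, which maps an exclusive anon page that can then be pinned safely.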
[PATCH mm-unstable v1 13/20] media: videobuf-dma-sg: remove FOLL_FORCE usage
GUP now supports reliable R/O long-term pinning in COW mappings, such that we break COW early. MAP_SHARED VMAs only use the shared zeropage so far in one corner case (DAXFS file with holes), which can be ignored because GUP does not support long-term pinning in fsdax (see check_vma_flags()). Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop using FOLL_FORCE, which is really only for ptrace access. Cc: Mauro Carvalho Chehab Signed-off-by: David Hildenbrand --- drivers/media/v4l2-core/videobuf-dma-sg.c | 14 +- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c index f75e5eedeee0..234e9f647c96 100644 --- a/drivers/media/v4l2-core/videobuf-dma-sg.c +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c @@ -151,17 +151,16 @@ static void videobuf_dma_init(struct videobuf_dmabuf *dma) static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma, int direction, unsigned long data, unsigned long size) { + unsigned int gup_flags = FOLL_LONGTERM; unsigned long first, last; - int err, rw = 0; - unsigned int flags = FOLL_FORCE; + int err; dma->direction = direction; switch (dma->direction) { case DMA_FROM_DEVICE: - rw = READ; + gup_flags |= FOLL_WRITE; break; case DMA_TO_DEVICE: - rw = WRITE; break; default: BUG(); @@ -177,14 +176,11 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma, if (NULL == dma->pages) return -ENOMEM; - if (rw == READ) - flags |= FOLL_WRITE; - dprintk(1, "init user [0x%lx+0x%lx => %lu pages]\n", data, size, dma->nr_pages); - err = pin_user_pages(data & PAGE_MASK, dma->nr_pages, -flags | FOLL_LONGTERM, dma->pages, NULL); + err = pin_user_pages(data & PAGE_MASK, dma->nr_pages, gup_flags, +dma->pages, NULL); if (err != dma->nr_pages) { dma->nr_pages = (err >= 0) ? err : 0; -- 2.38.1
[PATCH mm-unstable v1 15/20] media: pci/ivtv: remove FOLL_FORCE usage
FOLL_FORCE is really only for ptrace access. R/O pinning a page is supposed to fail if the VMA misses proper access permissions (no VM_READ). Let's just remove FOLL_FORCE usage here; there would have to be a pretty good reason to allow arbitrary drivers to R/O pin pages in a PROT_NONE VMA. Most probably, FOLL_FORCE usage is just some legacy leftover. Cc: Andy Walls Cc: Mauro Carvalho Chehab Signed-off-by: David Hildenbrand --- drivers/media/pci/ivtv/ivtv-udma.c | 2 +- drivers/media/pci/ivtv/ivtv-yuv.c | 5 ++--- 2 files changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/media/pci/ivtv/ivtv-udma.c b/drivers/media/pci/ivtv/ivtv-udma.c index 210be8290f24..99b9f55ca829 100644 --- a/drivers/media/pci/ivtv/ivtv-udma.c +++ b/drivers/media/pci/ivtv/ivtv-udma.c @@ -115,7 +115,7 @@ int ivtv_udma_setup(struct ivtv *itv, unsigned long ivtv_dest_addr, /* Pin user pages for DMA Xfer */ err = pin_user_pages_unlocked(user_dma.uaddr, user_dma.page_count, - dma->map, FOLL_FORCE); + dma->map, 0); if (user_dma.page_count != err) { IVTV_DEBUG_WARN("failed to map user pages, returned %d instead of %d\n", diff --git a/drivers/media/pci/ivtv/ivtv-yuv.c b/drivers/media/pci/ivtv/ivtv-yuv.c index 4ba10c34a16a..582146f8d70d 100644 --- a/drivers/media/pci/ivtv/ivtv-yuv.c +++ b/drivers/media/pci/ivtv/ivtv-yuv.c @@ -63,12 +63,11 @@ static int ivtv_yuv_prep_user_dma(struct ivtv *itv, struct ivtv_user_dma *dma, /* Pin user pages for DMA Xfer */ y_pages = pin_user_pages_unlocked(y_dma.uaddr, - y_dma.page_count, &dma->map[0], FOLL_FORCE); + y_dma.page_count, &dma->map[0], 0); uv_pages = 0; /* silence gcc. value is set and consumed only if: */ if (y_pages == y_dma.page_count) { uv_pages = pin_user_pages_unlocked(uv_dma.uaddr, - uv_dma.page_count, &dma->map[y_pages], - FOLL_FORCE); + uv_dma.page_count, &dma->map[y_pages], 0); } if (y_pages != y_dma.page_count || uv_pages != uv_dma.page_count) { -- 2.38.1
[PATCH mm-unstable v1 16/20] mm/frame-vector: remove FOLL_FORCE usage
FOLL_FORCE is really only for ptrace access. According to commit 707947247e95 ("media: videobuf2-vmalloc: get_userptr: buffers are always writable"), get_vaddr_frames() currently pins all pages writable as a workaround for issues with read-only buffers. FOLL_FORCE, however, seems to be a legacy leftover as it predates commit 707947247e95 ("media: videobuf2-vmalloc: get_userptr: buffers are always writable"). Let's just remove it. Once the read-only buffer issue has been resolved, FOLL_WRITE could again be set depending on the DMA direction. Cc: Hans Verkuil Cc: Marek Szyprowski Cc: Tomasz Figa Cc: Marek Szyprowski Cc: Mauro Carvalho Chehab Signed-off-by: David Hildenbrand --- drivers/media/common/videobuf2/frame_vector.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/media/common/videobuf2/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c index 542dde9d2609..062e98148c53 100644 --- a/drivers/media/common/videobuf2/frame_vector.c +++ b/drivers/media/common/videobuf2/frame_vector.c @@ -50,7 +50,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames, start = untagged_addr(start); ret = pin_user_pages_fast(start, nr_frames, - FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM, + FOLL_WRITE | FOLL_LONGTERM, (struct page **)(vec->ptrs)); if (ret > 0) { vec->got_ref = true; -- 2.38.1
[PATCH mm-unstable v1 14/20] drm/etnaviv: remove FOLL_FORCE usage
GUP now supports reliable R/O long-term pinning in COW mappings, such that we break COW early. MAP_SHARED VMAs only use the shared zeropage so far in one corner case (DAXFS file with holes), which can be ignored because GUP does not support long-term pinning in fsdax (see check_vma_flags()). commit cd5297b0855f ("drm/etnaviv: Use FOLL_FORCE for userptr") documents that FOLL_FORCE | FOLL_WRITE was really only used for reliable R/O pinning. Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop using FOLL_FORCE, which is really only for ptrace access. Cc: Daniel Vetter Cc: Lucas Stach Cc: Russell King Cc: Christian Gmeiner Cc: David Airlie Signed-off-by: David Hildenbrand --- drivers/gpu/drm/etnaviv/etnaviv_gem.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem.c b/drivers/gpu/drm/etnaviv/etnaviv_gem.c index cc386f8a7116..efe2240945d0 100644 --- a/drivers/gpu/drm/etnaviv/etnaviv_gem.c +++ b/drivers/gpu/drm/etnaviv/etnaviv_gem.c @@ -638,6 +638,7 @@ static int etnaviv_gem_userptr_get_pages(struct etnaviv_gem_object *etnaviv_obj) struct page **pvec = NULL; struct etnaviv_gem_userptr *userptr = &etnaviv_obj->userptr; int ret, pinned = 0, npages = etnaviv_obj->base.size >> PAGE_SHIFT; + unsigned int gup_flags = FOLL_LONGTERM; might_lock_read(&current->mm->mmap_lock); @@ -648,14 +649,15 @@ static int etnaviv_gem_userptr_get_pages(struct etnaviv_gem_object *etnaviv_obj) if (!pvec) return -ENOMEM; + if (!userptr->ro) + gup_flags |= FOLL_WRITE; + do { unsigned num_pages = npages - pinned; uint64_t ptr = userptr->ptr + pinned * PAGE_SIZE; struct page **pages = pvec + pinned; - ret = pin_user_pages_fast(ptr, num_pages, - FOLL_WRITE | FOLL_FORCE | FOLL_LONGTERM, - pages); + ret = pin_user_pages_fast(ptr, num_pages, gup_flags, pages); if (ret < 0) { unpin_user_pages(pvec, pinned); kvfree(pvec); -- 2.38.1
[PATCH mm-unstable v1 12/20] RDMA/siw: remove FOLL_FORCE usage
GUP now supports reliable R/O long-term pinning in COW mappings, such that we break COW early. MAP_SHARED VMAs only use the shared zeropage so far in one corner case (DAXFS file with holes), which can be ignored because GUP does not support long-term pinning in fsdax (see check_vma_flags()). Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop using FOLL_FORCE, which is really only for ptrace access. Cc: Bernard Metzler Cc: Jason Gunthorpe Cc: Leon Romanovsky Signed-off-by: David Hildenbrand --- drivers/infiniband/sw/siw/siw_mem.c | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c index 61c17db70d65..b2b33dd3b4fa 100644 --- a/drivers/infiniband/sw/siw/siw_mem.c +++ b/drivers/infiniband/sw/siw/siw_mem.c @@ -368,7 +368,7 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable) struct mm_struct *mm_s; u64 first_page_va; unsigned long mlock_limit; - unsigned int foll_flags = FOLL_WRITE; + unsigned int foll_flags = FOLL_LONGTERM; int num_pages, num_chunks, i, rv = 0; if (!can_do_mlock()) @@ -391,8 +391,8 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable) mmgrab(mm_s); - if (!writable) - foll_flags |= FOLL_FORCE; + if (writable) + foll_flags |= FOLL_WRITE; mmap_read_lock(mm_s); @@ -423,8 +423,7 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable) while (nents) { struct page **plist = &umem->page_chunk[i].plist[got]; - rv = pin_user_pages(first_page_va, nents, - foll_flags | FOLL_LONGTERM, + rv = pin_user_pages(first_page_va, nents, foll_flags, plist, NULL); if (rv < 0) goto out_sem_up; -- 2.38.1
[PATCH mm-unstable v1 11/20] RDMA/usnic: remove FOLL_FORCE usage
GUP now supports reliable R/O long-term pinning in COW mappings, such that we break COW early. MAP_SHARED VMAs only use the shared zeropage so far in one corner case (DAXFS file with holes), which can be ignored because GUP does not support long-term pinning in fsdax (see check_vma_flags()). Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop using FOLL_FORCE, which is really only for ptrace access. Cc: Christian Benvenuti Cc: Nelson Escobar Cc: Jason Gunthorpe Cc: Leon Romanovsky Signed-off-by: David Hildenbrand --- drivers/infiniband/hw/usnic/usnic_uiom.c | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c index 67923ced6e2d..c301b3be9f30 100644 --- a/drivers/infiniband/hw/usnic/usnic_uiom.c +++ b/drivers/infiniband/hw/usnic/usnic_uiom.c @@ -85,6 +85,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable, int dmasync, struct usnic_uiom_reg *uiomr) { struct list_head *chunk_list = &uiomr->chunk_list; + unsigned int gup_flags = FOLL_LONGTERM; struct page **page_list; struct scatterlist *sg; struct usnic_uiom_chunk *chunk; @@ -96,7 +97,6 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable, int off; int i; dma_addr_t pa; - unsigned int gup_flags; struct mm_struct *mm; /* @@ -131,8 +131,8 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable, goto out; } - gup_flags = FOLL_WRITE; - gup_flags |= (writable) ? 0 : FOLL_FORCE; + if (writable) + gup_flags |= FOLL_WRITE; cur_base = addr & PAGE_MASK; ret = 0; @@ -140,8 +140,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable, ret = pin_user_pages(cur_base, min_t(unsigned long, npages, PAGE_SIZE / sizeof(struct page *)), - gup_flags | FOLL_LONGTERM, - page_list, NULL); + gup_flags, page_list, NULL); if (ret < 0) goto out; -- 2.38.1
[PATCH mm-unstable v1 10/20] RDMA/umem: remove FOLL_FORCE usage
GUP now supports reliable R/O long-term pinning in COW mappings, such that we break COW early. MAP_SHARED VMAs only use the shared zeropage so far in one corner case (DAXFS file with holes), which can be ignored because GUP does not support long-term pinning in fsdax (see check_vma_flags()). Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop using FOLL_FORCE, which is really only for ptrace access. Tested-by: Leon Romanovsky # Over mlx4 and mlx5. Cc: Jason Gunthorpe Cc: Leon Romanovsky Signed-off-by: David Hildenbrand --- drivers/infiniband/core/umem.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 86d479772fbc..755a9c57db6f 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -156,7 +156,7 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr, struct mm_struct *mm; unsigned long npages; int pinned, ret; - unsigned int gup_flags = FOLL_WRITE; + unsigned int gup_flags = FOLL_LONGTERM; /* * If the combination of the addr and size requested for this memory @@ -210,8 +210,8 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr, cur_base = addr & PAGE_MASK; - if (!umem->writable) - gup_flags |= FOLL_FORCE; + if (umem->writable) + gup_flags |= FOLL_WRITE; while (npages) { cond_resched(); @@ -219,7 +219,7 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr, min_t(unsigned long, npages, PAGE_SIZE / sizeof(struct page *)), - gup_flags | FOLL_LONGTERM, page_list); + gup_flags, page_list); if (pinned < 0) { ret = pinned; goto umem_release; -- 2.38.1
[PATCH mm-unstable v1 07/20] mm: don't call vm_ops->huge_fault() in wp_huge_pmd()/wp_huge_pud() for private mappings
If we already have a PMD/PUD mapped write-protected in a private mapping and we want to break COW either due to FAULT_FLAG_WRITE or FAULT_FLAG_UNSHARE, there is no need to inform the file system just like on the PTE path. Let's just split (->zap) + fallback in that case. This is a preparation for more generic FAULT_FLAG_UNSHARE support in COW mappings. Signed-off-by: David Hildenbrand --- mm/memory.c | 24 +++- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index c35e6cd32b6a..d47ad33c6487 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4802,6 +4802,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf) static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) { const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE; + vm_fault_t ret; if (vma_is_anonymous(vmf->vma)) { if (likely(!unshare) && @@ -4809,11 +4810,13 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) return handle_userfault(vmf, VM_UFFD_WP); return do_huge_pmd_wp_page(vmf); } - if (vmf->vma->vm_ops->huge_fault) { - vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD); - if (!(ret & VM_FAULT_FALLBACK)) - return ret; + if (vmf->vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) { + if (vmf->vma->vm_ops->huge_fault) { + ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD); + if (!(ret & VM_FAULT_FALLBACK)) + return ret; + } } /* COW or write-notify handled on pte level: split pmd. 
*/ @@ -4839,14 +4842,17 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud) { #if defined(CONFIG_TRANSPARENT_HUGEPAGE) &&\ defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) + vm_fault_t ret; + /* No support for anonymous transparent PUD pages yet */ if (vma_is_anonymous(vmf->vma)) goto split; - if (vmf->vma->vm_ops->huge_fault) { - vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD); - - if (!(ret & VM_FAULT_FALLBACK)) - return ret; + if (vmf->vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) { + if (vmf->vma->vm_ops->huge_fault) { + ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD); + if (!(ret & VM_FAULT_FALLBACK)) + return ret; + } } split: /* COW or write-notify not handled on PUD level: split pud.*/ -- 2.38.1
[PATCH mm-unstable v1 08/20] mm: extend FAULT_FLAG_UNSHARE support to anything in a COW mapping
Extend FAULT_FLAG_UNSHARE to break COW on anything mapped into a COW (i.e., private writable) mapping and adjust the documentation accordingly. FAULT_FLAG_UNSHARE will now also break COW when encountering the shared zeropage, a pagecache page, a PFNMAP, ... inside a COW mapping, by properly replacing the mapped page/pfn by a private copy (an exclusive anonymous page). Note that only do_wp_page() needs care: hugetlb_wp() already handles FAULT_FLAG_UNSHARE correctly. wp_huge_pmd()/wp_huge_pud() also handles it correctly, for example, splitting the huge zeropage on FAULT_FLAG_UNSHARE such that we can handle FAULT_FLAG_UNSHARE on the PTE level. This change is a requirement for reliable long-term R/O pinning in COW mappings. Signed-off-by: David Hildenbrand --- include/linux/mm_types.h | 8 mm/memory.c | 4 2 files changed, 4 insertions(+), 8 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 5e7f4fac1e78..5e9aaad8c7b2 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1037,9 +1037,9 @@ typedef struct { * @FAULT_FLAG_REMOTE: The fault is not for current task/mm. * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch. * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals. - * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to unshare (and mark - * exclusive) a possibly shared anonymous page that is - * mapped R/O. + * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to break COW in a + * COW mapping, making sure that an exclusive anon page is + * mapped after the fault. * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached. *We should only access orig_pte if this flag set. * @@ -1064,7 +1064,7 @@ typedef struct { * * The combination FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE is illegal. * FAULT_FLAG_UNSHARE is ignored and treated like an ordinary read fault when - * no existing R/O-mapped anonymous page is encountered. 
+ * applied to mappings that are not COW mappings. */ enum fault_flag { FAULT_FLAG_WRITE = 1 << 0, diff --git a/mm/memory.c b/mm/memory.c index d47ad33c6487..56b21ab1e4d2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3432,10 +3432,6 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) } wp_page_reuse(vmf); return 0; - } else if (unshare) { - /* No anonymous page -> nothing to do. */ - pte_unmap_unlock(vmf->pte, vmf->ptl); - return 0; } copy: /* -- 2.38.1
[PATCH mm-unstable v1 06/20] mm: rework handling in do_wp_page() based on private vs. shared mappings
We want to extend FAULT_FLAG_UNSHARE support to anything mapped into a COW mapping (pagecache page, zeropage, PFN, ...), not just anonymous pages. Let's prepare for that by handling shared mappings first such that we can handle private mappings last. While at it, use folio-based functions instead of page-based functions where we touch the code either way. Signed-off-by: David Hildenbrand --- mm/memory.c | 38 +- 1 file changed, 17 insertions(+), 21 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index c4fa378ec2a0..c35e6cd32b6a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3342,7 +3342,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) { const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE; struct vm_area_struct *vma = vmf->vma; - struct folio *folio; + struct folio *folio = NULL; if (likely(!unshare)) { if (userfaultfd_pte_wp(vma, *vmf->pte)) { @@ -3360,13 +3360,12 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) } vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte); - if (!vmf->page) { - if (unlikely(unshare)) { - /* No anonymous page -> nothing to do. */ - pte_unmap_unlock(vmf->pte, vmf->ptl); - return 0; - } + /* +* Shared mapping: we are guaranteed to have VM_WRITE and +* FAULT_FLAG_WRITE set at this point. +*/ + if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) { /* * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a * VM_PFNMAP VMA. @@ -3374,20 +3373,19 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) * We should not cow pages in a shared writeable mapping. * Just mark the pages writable and/or call ops->pfn_mkwrite. */ - if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == -(VM_WRITE|VM_SHARED)) + if (!vmf->page) return wp_pfn_shared(vmf); - - pte_unmap_unlock(vmf->pte, vmf->ptl); - return wp_page_copy(vmf); + return wp_page_shared(vmf); } + if (vmf->page) + folio = page_folio(vmf->page); + /* -* Take out anonymous pages first, anonymous shared vmas are -* not dirty accountable. 
+* Private mapping: create an exclusive anonymous page copy if reuse +* is impossible. We might miss VM_WRITE for FOLL_FORCE handling. */ - folio = page_folio(vmf->page); - if (folio_test_anon(folio)) { + if (folio && folio_test_anon(folio)) { /* * If the page is exclusive to this process we must reuse the * page without further checks. @@ -3438,19 +3436,17 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) /* No anonymous page -> nothing to do. */ pte_unmap_unlock(vmf->pte, vmf->ptl); return 0; - } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == - (VM_WRITE|VM_SHARED))) { - return wp_page_shared(vmf); } copy: /* * Ok, we need to copy. Oh, well.. */ - get_page(vmf->page); + if (folio) + folio_get(folio); pte_unmap_unlock(vmf->pte, vmf->ptl); #ifdef CONFIG_KSM - if (PageKsm(vmf->page)) + if (folio && folio_test_ksm(folio)) count_vm_event(COW_KSM); #endif return wp_page_copy(vmf); -- 2.38.1
[PATCH mm-unstable v1 05/20] mm: add early FAULT_FLAG_WRITE consistency checks
Let's catch abuse of FAULT_FLAG_WRITE early, such that we don't have to care in all other handlers and might get "surprises" if we forget to do so. Write faults without VM_MAYWRITE don't make any sense, and our maybe_mkwrite() logic could have hidden such abuse for now. Write faults without VM_WRITE on something that is not a COW mapping is similarly broken, and e.g., do_wp_page() could end up placing an anonymous page into a shared mapping, which would be bad. This is a preparation for reliable R/O long-term pinning of pages in private mappings, whereby we want to make sure that we will never break COW in a read-only private mapping. Signed-off-by: David Hildenbrand --- mm/memory.c | 8 1 file changed, 8 insertions(+) diff --git a/mm/memory.c b/mm/memory.c index e014435a87db..c4fa378ec2a0 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5170,6 +5170,14 @@ static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma, */ if (!is_cow_mapping(vma->vm_flags)) *flags &= ~FAULT_FLAG_UNSHARE; + } else if (*flags & FAULT_FLAG_WRITE) { + /* Write faults on read-only mappings are impossible ... */ + if (WARN_ON_ONCE(!(vma->vm_flags & VM_MAYWRITE))) + return VM_FAULT_SIGSEGV; + /* ... and FOLL_FORCE only applies to COW mappings. */ + if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE) && +!is_cow_mapping(vma->vm_flags))) + return VM_FAULT_SIGSEGV; } return 0; } -- 2.38.1
[PATCH mm-unstable v1 02/20] selftests/vm: cow: basic COW tests for non-anonymous pages
Let's add basic tests for COW with non-anonymous pages in private mappings: write access should properly trigger COW and result in the private changes not being visible through other page mappings. Especially, add tests for: * Zeropage * Huge zeropage * Ordinary pagecache pages via memfd and tmpfile() * Hugetlb pages via memfd Fortunately, all tests pass. Signed-off-by: David Hildenbrand --- tools/testing/selftests/vm/cow.c | 338 ++- 1 file changed, 337 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/vm/cow.c b/tools/testing/selftests/vm/cow.c index d202bfd63585..fb07bd44529c 100644 --- a/tools/testing/selftests/vm/cow.c +++ b/tools/testing/selftests/vm/cow.c @@ -19,6 +19,7 @@ #include #include #include +#include #include "local_config.h" #ifdef LOCAL_CONFIG_HAVE_LIBURING @@ -35,6 +36,7 @@ static size_t thpsize; static int nr_hugetlbsizes; static size_t hugetlbsizes[10]; static int gup_fd; +static bool has_huge_zeropage; static void detect_thpsize(void) { @@ -64,6 +66,31 @@ static void detect_thpsize(void) close(fd); } +static void detect_huge_zeropage(void) +{ + int fd = open("/sys/kernel/mm/transparent_hugepage/use_zero_page", + O_RDONLY); + size_t enabled = 0; + char buf[15]; + int ret; + + if (fd < 0) + return; + + ret = pread(fd, buf, sizeof(buf), 0); + if (ret > 0 && ret < sizeof(buf)) { + buf[ret] = 0; + + enabled = strtoul(buf, NULL, 10); + if (enabled == 1) { + has_huge_zeropage = true; + ksft_print_msg("[INFO] huge zeropage is enabled\n"); + } + } + + close(fd); +} + static void detect_hugetlbsizes(void) { DIR *dir = opendir("/sys/kernel/mm/hugepages/"); @@ -1148,6 +1175,312 @@ static int tests_per_anon_test_case(void) return tests; } +typedef void (*non_anon_test_fn)(char *mem, const char *smem, size_t size); + +static void test_cow(char *mem, const char *smem, size_t size) +{ + char *old = malloc(size); + + /* Backup the original content. */ + memcpy(old, smem, size); + + /* Modify the page. 
*/ + memset(mem, 0xff, size); + + /* See if we still read the old values via the other mapping. */ + ksft_test_result(!memcmp(smem, old, size), +"Other mapping not modified\n"); + free(old); +} + +static void run_with_zeropage(non_anon_test_fn fn, const char *desc) +{ + char *mem, *smem, tmp; + + ksft_print_msg("[RUN] %s ... with shared zeropage\n", desc); + + mem = mmap(NULL, pagesize, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANON, -1, 0); + if (mem == MAP_FAILED) { + ksft_test_result_fail("mmap() failed\n"); + return; + } + + smem = mmap(NULL, pagesize, PROT_READ, MAP_PRIVATE | MAP_ANON, -1, 0); + if (smem == MAP_FAILED) { + ksft_test_result_fail("mmap() failed\n"); + goto munmap; + } + + /* Read from the page to populate the shared zeropage. */ + tmp = *mem + *smem; + asm volatile("" : "+r" (tmp)); + + fn(mem, smem, pagesize); +munmap: + munmap(mem, pagesize); + if (smem != MAP_FAILED) + munmap(smem, pagesize); +} + +static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc) +{ + char *mem, *smem, *mmap_mem, *mmap_smem, tmp; + size_t mmap_size; + int ret; + + ksft_print_msg("[RUN] %s ... with huge zeropage\n", desc); + + if (!has_huge_zeropage) { + ksft_test_result_skip("Huge zeropage not enabled\n"); + return; + } + + /* For alignment purposes, we need twice the thp size. */ + mmap_size = 2 * thpsize; + mmap_mem = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + if (mmap_mem == MAP_FAILED) { + ksft_test_result_fail("mmap() failed\n"); + return; + } + mmap_smem = mmap(NULL, mmap_size, PROT_READ, +MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + if (mmap_smem == MAP_FAILED) { + ksft_test_result_fail("mmap() failed\n"); + goto munmap; + } + + /* We need a THP-aligned memory area. 
*/ + mem = (char *)(((uintptr_t)mmap_mem + thpsize) & ~(thpsize - 1)); + smem = (char *)(((uintptr_t)mmap_smem + thpsize) & ~(thpsize - 1)); + + ret = madvise(mem, thpsize, MADV_HUGEPAGE); + ret |= madvise(smem, thpsize, MADV_HUGEPAGE); + if (ret) { + ksft_test_result_fail("MADV_HUGEPAGE failed\n"); + goto munmap; +
[PATCH mm-unstable v1 04/20] mm: add early FAULT_FLAG_UNSHARE consistency checks
For now, FAULT_FLAG_UNSHARE only applies to anonymous pages, which implies a COW mapping. Let's hide FAULT_FLAG_UNSHARE early if we're not dealing with a COW mapping, such that we treat it like a read fault as documented and don't have to worry about the flag throughout all fault handlers. While at it, centralize the check for mutual exclusion of FAULT_FLAG_UNSHARE and FAULT_FLAG_WRITE and just drop the check that either flag is set in the WP handler. Signed-off-by: David Hildenbrand --- mm/huge_memory.c | 3 --- mm/hugetlb.c | 5 - mm/memory.c | 23 --- 3 files changed, 20 insertions(+), 11 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index ed12cd3acbfd..68d00196b519 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1267,9 +1267,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd); VM_BUG_ON_VMA(!vma->anon_vma, vma); - VM_BUG_ON(unshare && (vmf->flags & FAULT_FLAG_WRITE)); - VM_BUG_ON(!unshare && !(vmf->flags & FAULT_FLAG_WRITE)); - if (is_huge_zero_pmd(orig_pmd)) goto fallback; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 1de986c62976..383b26069b33 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5314,9 +5314,6 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long haddr = address & huge_page_mask(h); struct mmu_notifier_range range; - VM_BUG_ON(unshare && (flags & FOLL_WRITE)); - VM_BUG_ON(!unshare && !(flags & FOLL_WRITE)); - /* * hugetlb does not support FOLL_FORCE-style write faults that keep the * PTE mapped R/O such as maybe_mkwrite() would do. @@ -5326,8 +5323,6 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma, /* Let's take out MAP_SHARED mappings first. 
*/ if (vma->vm_flags & VM_MAYSHARE) { - if (unlikely(unshare)) - return 0; set_huge_ptep_writable(vma, haddr, ptep); return 0; } diff --git a/mm/memory.c b/mm/memory.c index 2d453736f87c..e014435a87db 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3344,9 +3344,6 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) struct vm_area_struct *vma = vmf->vma; struct folio *folio; - VM_BUG_ON(unshare && (vmf->flags & FAULT_FLAG_WRITE)); - VM_BUG_ON(!unshare && !(vmf->flags & FAULT_FLAG_WRITE)); - if (likely(!unshare)) { if (userfaultfd_pte_wp(vma, *vmf->pte)) { pte_unmap_unlock(vmf->pte, vmf->ptl); @@ -5161,6 +5158,22 @@ static void lru_gen_exit_fault(void) } #endif /* CONFIG_LRU_GEN */ +static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma, + unsigned int *flags) +{ + if (unlikely(*flags & FAULT_FLAG_UNSHARE)) { + if (WARN_ON_ONCE(*flags & FAULT_FLAG_WRITE)) + return VM_FAULT_SIGSEGV; + /* +* FAULT_FLAG_UNSHARE only applies to COW mappings. Let's +* just treat it like an ordinary read-fault otherwise. +*/ + if (!is_cow_mapping(vma->vm_flags)) + *flags &= ~FAULT_FLAG_UNSHARE; + } + return 0; +} + /* * By the time we get here, we already hold the mm semaphore * @@ -5177,6 +5190,10 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, count_vm_event(PGFAULT); count_memcg_event_mm(vma->vm_mm, PGFAULT); + ret = sanitize_fault_flags(vma, ); + if (ret) + return ret; + if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE, flags & FAULT_FLAG_INSTRUCTION, flags & FAULT_FLAG_REMOTE)) -- 2.38.1
[PATCH mm-unstable v1 03/20] selftests/vm: cow: R/O long-term pinning reliability tests for non-anon pages
Let's test whether R/O long-term pinning is reliable for non-anonymous memory: when R/O long-term pinning a page, the expectation is that we break COW early before pinning, such that actual write access via the page tables won't break COW later and end up replacing the R/O-pinned page in the page table. Consequently, R/O long-term pinning in private mappings would only target exclusive anonymous pages. For now, all tests fail: # [RUN] R/O longterm GUP pin ... with shared zeropage not ok 151 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with memfd not ok 152 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with tmpfile not ok 153 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with huge zeropage not ok 154 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with memfd hugetlb (2048 kB) not ok 155 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with memfd hugetlb (1048576 kB) not ok 156 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with shared zeropage not ok 157 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with memfd not ok 158 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with tmpfile not ok 159 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with huge zeropage not ok 160 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (2048 kB) not ok 161 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... 
with memfd hugetlb (1048576 kB) not ok 162 Longterm R/O pin is reliable Signed-off-by: David Hildenbrand --- tools/testing/selftests/vm/cow.c | 28 +++- 1 file changed, 27 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/vm/cow.c b/tools/testing/selftests/vm/cow.c index fb07bd44529c..73e05b52c49e 100644 --- a/tools/testing/selftests/vm/cow.c +++ b/tools/testing/selftests/vm/cow.c @@ -561,6 +561,7 @@ static void test_iouring_fork(char *mem, size_t size) #endif /* LOCAL_CONFIG_HAVE_LIBURING */ enum ro_pin_test { + RO_PIN_TEST, RO_PIN_TEST_SHARED, RO_PIN_TEST_PREVIOUSLY_SHARED, RO_PIN_TEST_RO_EXCLUSIVE, @@ -593,6 +594,8 @@ static void do_test_ro_pin(char *mem, size_t size, enum ro_pin_test test, } switch (test) { + case RO_PIN_TEST: + break; case RO_PIN_TEST_SHARED: case RO_PIN_TEST_PREVIOUSLY_SHARED: /* @@ -1193,6 +1196,16 @@ static void test_cow(char *mem, const char *smem, size_t size) free(old); } +static void test_ro_pin(char *mem, const char *smem, size_t size) +{ + do_test_ro_pin(mem, size, RO_PIN_TEST, false); +} + +static void test_ro_fast_pin(char *mem, const char *smem, size_t size) +{ + do_test_ro_pin(mem, size, RO_PIN_TEST, true); +} + static void run_with_zeropage(non_anon_test_fn fn, const char *desc) { char *mem, *smem, tmp; @@ -1433,7 +1446,7 @@ struct non_anon_test_case { }; /* - * Test cases that target any pages in private mappings that are non anonymous: + * Test cases that target any pages in private mappings that are not anonymous: * pages that may get shared via COW ndependent of fork(). This includes * the shared zeropage(s), pagecache pages, ... */ @@ -1446,6 +1459,19 @@ static const struct non_anon_test_case non_anon_test_cases[] = { "Basic COW", test_cow, }, + /* +* Take a R/O longterm pin. When modifying the page via the page table, +* the page content change must be visible via the pin. +*/ + { + "R/O longterm GUP pin", + test_ro_pin, + }, + /* Same as above, but using GUP-fast. 
*/ + { + "R/O longterm GUP-fast pin", + test_ro_fast_pin, + }, }; static void run_non_anon_test_case(struct non_anon_test_case const *test_case) -- 2.38.1
[PATCH mm-unstable v1 01/20] selftests/vm: anon_cow: prepare for non-anonymous COW tests
Originally, the plan was to have separate tests for testing COW of non-anonymous (e.g., shared zeropage) pages. Turns out that we'd need a lot of similar functionality and that there isn't a really good reason to separate it. So let's prepare for non-anon tests by renaming to "cow". Signed-off-by: David Hildenbrand --- tools/testing/selftests/vm/.gitignore | 2 +- tools/testing/selftests/vm/Makefile | 10 tools/testing/selftests/vm/check_config.sh| 4 +-- .../selftests/vm/{anon_cow.c => cow.c}| 25 +++ tools/testing/selftests/vm/run_vmtests.sh | 8 +++--- 5 files changed, 27 insertions(+), 22 deletions(-) rename tools/testing/selftests/vm/{anon_cow.c => cow.c} (97%) diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore index 8a536c731e3c..ee8c41c998e6 100644 --- a/tools/testing/selftests/vm/.gitignore +++ b/tools/testing/selftests/vm/.gitignore @@ -1,5 +1,5 @@ # SPDX-License-Identifier: GPL-2.0-only -anon_cow +cow hugepage-mmap hugepage-mremap hugepage-shm diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile index 0986bd60c19f..89c14e41bd43 100644 --- a/tools/testing/selftests/vm/Makefile +++ b/tools/testing/selftests/vm/Makefile @@ -27,7 +27,7 @@ MAKEFLAGS += --no-builtin-rules CFLAGS = -Wall -I $(top_srcdir) -I $(top_srcdir)/usr/include $(EXTRA_CFLAGS) $(KHDR_INCLUDES) LDLIBS = -lrt -lpthread -TEST_GEN_FILES = anon_cow +TEST_GEN_FILES = cow TEST_GEN_FILES += compaction_test TEST_GEN_FILES += gup_test TEST_GEN_FILES += hmm-tests @@ -99,7 +99,7 @@ TEST_FILES += va_128TBswitch.sh include ../lib.mk -$(OUTPUT)/anon_cow: vm_util.c +$(OUTPUT)/cow: vm_util.c $(OUTPUT)/khugepaged: vm_util.c $(OUTPUT)/ksm_functional_tests: vm_util.c $(OUTPUT)/madv_populate: vm_util.c @@ -156,8 +156,8 @@ warn_32bit_failure: endif endif -# ANON_COW_EXTRA_LIBS may get set in local_config.mk, or it may be left empty. 
-$(OUTPUT)/anon_cow: LDLIBS += $(ANON_COW_EXTRA_LIBS) +# cow_EXTRA_LIBS may get set in local_config.mk, or it may be left empty. +$(OUTPUT)/cow: LDLIBS += $(COW_EXTRA_LIBS) $(OUTPUT)/mlock-random-test $(OUTPUT)/memfd_secret: LDLIBS += -lcap @@ -170,7 +170,7 @@ local_config.mk local_config.h: check_config.sh EXTRA_CLEAN += local_config.mk local_config.h -ifeq ($(ANON_COW_EXTRA_LIBS),) +ifeq ($(COW_EXTRA_LIBS),) all: warn_missing_liburing warn_missing_liburing: diff --git a/tools/testing/selftests/vm/check_config.sh b/tools/testing/selftests/vm/check_config.sh index 9a44c6520925..bcba3af0acea 100644 --- a/tools/testing/selftests/vm/check_config.sh +++ b/tools/testing/selftests/vm/check_config.sh @@ -21,11 +21,11 @@ $CC -c $tmpfile_c -o $tmpfile_o >/dev/null 2>&1 if [ -f $tmpfile_o ]; then echo "#define LOCAL_CONFIG_HAVE_LIBURING 1" > $OUTPUT_H_FILE -echo "ANON_COW_EXTRA_LIBS = -luring" > $OUTPUT_MKFILE +echo "COW_EXTRA_LIBS = -luring" > $OUTPUT_MKFILE else echo "// No liburing support found" > $OUTPUT_H_FILE echo "# No liburing support found, so:" > $OUTPUT_MKFILE -echo "ANON_COW_EXTRA_LIBS = " >> $OUTPUT_MKFILE +echo "COW_EXTRA_LIBS = ">> $OUTPUT_MKFILE fi rm ${tmpname}.* diff --git a/tools/testing/selftests/vm/anon_cow.c b/tools/testing/selftests/vm/cow.c similarity index 97% rename from tools/testing/selftests/vm/anon_cow.c rename to tools/testing/selftests/vm/cow.c index bbb251eb5025..d202bfd63585 100644 --- a/tools/testing/selftests/vm/anon_cow.c +++ b/tools/testing/selftests/vm/cow.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-only /* - * COW (Copy On Write) tests for anonymous memory. + * COW (Copy On Write) tests. * * Copyright 2022, Red Hat, Inc. * @@ -986,7 +986,11 @@ struct test_case { test_fn fn; }; -static const struct test_case test_cases[] = { +/* + * Test cases that are specific to anonymous pages: pages in private mappings + * that may get shared via COW during fork(). 
+ */ +static const struct test_case anon_test_cases[] = { /* * Basic COW tests for fork() without any GUP. If we miss to break COW, * either the child can observe modifications by the parent or the @@ -1104,7 +1108,7 @@ static const struct test_case test_cases[] = { }, }; -static void run_test_case(struct test_case const *test_case) +static void run_anon_test_case(struct test_case const *test_case) { int i; @@ -1125,15 +1129,17 @@ static void run_test_case(struct test_case const *test_case) hugetlbsizes[i]); } -static void run_test_cases(void) +static void run_anon_test_cases(void) { int i; - for (i = 0; i < ARRAY_SIZE(test_cases); i++) - run_test_case(_cases[i]); + k
[PATCH mm-unstable v1 00/20] mm/gup: remove FOLL_FORCE usage from drivers (reliable R/O long-term pinning)
Until now, we did not support reliable R/O long-term pinning in COW mappings. That means, if we trigger R/O long-term pinning in a MAP_PRIVATE mapping, we could end up pinning the (R/O-mapped) shared zeropage or a pagecache page. The next write access would trigger a write fault and replace the pinned page by an exclusive anonymous page in the process page table; whatever the process would write to that private page copy would not be visible to the owner of the previous page pin: for example, RDMA could read stale data. The end result is essentially an unexpected and hard-to-debug memory corruption. Some drivers tried working around that limitation by using "FOLL_FORCE|FOLL_WRITE|FOLL_LONGTERM" for R/O long-term pinning. FOLL_WRITE would trigger a write fault, if required, and break COW before pinning the page. FOLL_FORCE is required because the VMA might lack write permissions, and drivers wanted to make that work as well, just like one would expect (no write access, but still triggering a write access to break COW). However, that is not a practical solution, because (1) Drivers that don't stick to that undocumented and debatable pattern would still run into that issue. For example, VFIO only uses FOLL_LONGTERM for R/O long-term pinning. (2) Using FOLL_WRITE just to work around a COW mapping + page pinning limitation is unintuitive. FOLL_WRITE would, for example, mark the page softdirty or trigger uffd-wp, even though there actually isn't going to be any write access. (3) The purpose of FOLL_FORCE is debug access, not access by arbitrary drivers that lack VMA permissions. So instead, make R/O long-term pinning work as expected, by breaking COW in a COW mapping early, such that we can remove any FOLL_FORCE usage from drivers and make FOLL_FORCE ptrace-specific (renaming it to FOLL_PTRACE). More details in patch #8. Patches #1--#3 add COW tests for non-anonymous pages. 
Patches #4--#7 prepare core MM for extended FAULT_FLAG_UNSHARE support in COW mappings. Patch #8 implements reliable R/O long-term pinning in COW mappings Patches #9--#19 remove any FOLL_FORCE usage from drivers. Patch #20 renames FOLL_FORCE to FOLL_PTRACE. I'm refraining from CCing all driver/arch maintainers on the whole patch set, but only CC them on the cover letter and the applicable patch (I know, I know, someone is always unhappy ... sorry). RFC -> v1: * Use term "ptrace" instead of "debuggers" in patch descriptions * Added ACK/Tested-by * "mm/frame-vector: remove FOLL_FORCE usage" -> Adjust description * "mm: rename FOLL_FORCE to FOLL_PTRACE" -> Added David Hildenbrand (20): selftests/vm: anon_cow: prepare for non-anonymous COW tests selftests/vm: cow: basic COW tests for non-anonymous pages selftests/vm: cow: R/O long-term pinning reliability tests for non-anon pages mm: add early FAULT_FLAG_UNSHARE consistency checks mm: add early FAULT_FLAG_WRITE consistency checks mm: rework handling in do_wp_page() based on private vs. 
shared mappings mm: don't call vm_ops->huge_fault() in wp_huge_pmd()/wp_huge_pud() for private mappings mm: extend FAULT_FLAG_UNSHARE support to anything in a COW mapping mm/gup: reliable R/O long-term pinning in COW mappings RDMA/umem: remove FOLL_FORCE usage RDMA/usnic: remove FOLL_FORCE usage RDMA/siw: remove FOLL_FORCE usage media: videobuf-dma-sg: remove FOLL_FORCE usage drm/etnaviv: remove FOLL_FORCE usage media: pci/ivtv: remove FOLL_FORCE usage mm/frame-vector: remove FOLL_FORCE usage drm/exynos: remove FOLL_FORCE usage RDMA/hw/qib/qib_user_pages: remove FOLL_FORCE usage habanalabs: remove FOLL_FORCE usage mm: rename FOLL_FORCE to FOLL_PTRACE arch/alpha/kernel/ptrace.c| 6 +- arch/arm64/kernel/mte.c | 2 +- arch/ia64/kernel/ptrace.c | 10 +- arch/mips/kernel/ptrace32.c | 4 +- arch/mips/math-emu/dsemul.c | 2 +- arch/powerpc/kernel/ptrace/ptrace32.c | 4 +- arch/sparc/kernel/ptrace_32.c | 4 +- arch/sparc/kernel/ptrace_64.c | 8 +- arch/x86/kernel/step.c| 2 +- arch/x86/um/ptrace_32.c | 2 +- arch/x86/um/ptrace_64.c | 2 +- drivers/gpu/drm/etnaviv/etnaviv_gem.c | 8 +- drivers/gpu/drm/exynos/exynos_drm_g2d.c | 2 +- drivers/infiniband/core/umem.c| 8 +- drivers/infiniband/hw/qib/qib_user_pages.c| 2 +- drivers/infiniband/hw/usnic/usnic_uiom.c | 9 +- drivers/infiniband/sw/siw/siw_mem.c | 9 +- drivers/media/common/videobuf2/frame_vector.c | 2 +- drivers/media/pci/ivtv/ivtv-udma.c| 2 +- drivers/media/pci/ivtv/ivtv-yuv.c | 5 +- drivers/media/v4l2-core/videobuf-dma-sg.c | 14 +- dri
Re: [PATCH] mm/memory: Return vm_fault_t result from migrate_to_ram() callback
On 14.11.22 12:55, Alistair Popple wrote: The migrate_to_ram() callback should always succeed, but in rare cases can fail usually returning VM_FAULT_SIGBUS. Commit 16ce101db85d ("mm/memory.c: fix race when faulting a device private page") incorrectly stopped passing the return code up the stack. Fix this by setting the ret variable, restoring the previous behaviour on migrate_to_ram() failure. Fixes: 16ce101db85d ("mm/memory.c: fix race when faulting a device private page") Signed-off-by: Alistair Popple Cc: Ralph Campbell Cc: John Hubbard Cc: Alex Sierra Cc: Ben Skeggs Cc: Felix Kuehling Cc: Lyude Paul Cc: Jason Gunthorpe Cc: Michael Ellerman --- Hi Andrew, I noticed my series needs another small fix which I'm hoping you can apply for v6.1-rcX. Thanks (and sorry for not catching this sooner). --- mm/memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memory.c b/mm/memory.c index f88c351aecd4..8a6d5c823f91 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3763,7 +3763,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) */ get_page(vmf->page); pte_unmap_unlock(vmf->pte, vmf->ptl); - vmf->page->pgmap->ops->migrate_to_ram(vmf); + ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); put_page(vmf->page); } else if (is_hwpoison_entry(entry)) { ret = VM_FAULT_HWPOISON; Acked-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH RFC 00/19] mm/gup: remove FOLL_FORCE usage from drivers (reliable R/O long-term pinning)
On 14.11.22 07:03, Christoph Hellwig wrote: On Mon, Nov 07, 2022 at 09:27:23AM -0800, Linus Torvalds wrote: And I'd really love to just have this long saga ended, and FOLL_FORCE finally relegated to purely ptrace accesses. At that point we should also rename it to FOLL_PTRACE to make that very clear, and also break anything in-flight accidentally readding it, which I'd otherwise expect to happen. Good idea; I'll include a patch in v1. -- Thanks, David / dhildenb
Re: [PATCH RFC 16/19] mm/frame-vector: remove FOLL_FORCE usage
On 08.11.22 05:45, Tomasz Figa wrote: Hi David, Hi Tomasz, thanks for looking into this! On Tue, Nov 8, 2022 at 1:19 AM David Hildenbrand wrote: FOLL_FORCE is really only for debugger access. According to commit 707947247e95 ("media: videobuf2-vmalloc: get_userptr: buffers are always writable"), the pinned pages are always writable. As reference, the cover letter of the series: https://lkml.kernel.org/r/20221107161740.144456-1-da...@redhat.com Actually that patch is only a workaround to temporarily disable support for read-only pages as they seemed to suffer from some corruption issues in the retrieved user pages. We expect to support read-only pages as hardware input after. That said, FOLL_FORCE doesn't sound like the right thing even in that case, but I don't know the background behind it being added here in the first place. +Hans Verkuil +Marek Szyprowski do you happen to remember anything about it? Maybe I mis-interpreted 707947247e95; re-reading it again, I am not quite sure what the actual problem is and how it relates to GUP FOLL_WRITE handling. FOLL_FORCE already was in place before 707947247e95 and should be removed. What I understood is "Just always call vb2_create_framevec() with FOLL_WRITE to always allow writing to the buffers.". If the pinned page is never written to via the obtained GUP reference: * FOLL_WRITE should not be set * FOLL_FORCE should not be set * We should not dirty the page when unpinning If the pinned page may be written to via the obtained GUP reference: * FOLL_WRITE should be set * FOLL_FORCE should not be set * We should dirty the page when unpinning If the function is called for both, we should think about doing it conditional based on a "write" variable, like pre-707947247e95 did. @Hans, any insight? -- Thanks, David / dhildenb
Re: [PATCH RFC 00/19] mm/gup: remove FOLL_FORCE usage from drivers (reliable R/O long-term pinning)
On 07.11.22 18:27, Linus Torvalds wrote: On Mon, Nov 7, 2022 at 8:18 AM David Hildenbrand wrote: So instead, make R/O long-term pinning work as expected, by breaking COW in a COW mapping early, such that we can remove any FOLL_FORCE usage from drivers. Nothing makes me unhappy from a quick scan through these patches. And I'd really love to just have this long saga ended, and FOLL_FORCE finally relegated to purely ptrace accesses. So an enthusiastic Ack from me. Thanks Linus! My hope is that we can remove it from all drivers and not have to leave it in for some corner cases; so far it looks promising. -- Thanks, David / dhildenb
Re: [PATCH RFC 05/19] mm: add early FAULT_FLAG_WRITE consistency checks
On 07.11.22 20:03, Nadav Amit wrote: On Nov 7, 2022, at 8:17 AM, David Hildenbrand wrote: !! External Email Let's catch abuse of FAULT_FLAG_WRITE early, such that we don't have to care in all other handlers and might get "surprises" if we forget to do so. Write faults without VM_MAYWRITE don't make any sense, and our maybe_mkwrite() logic could have hidden such abuse for now. Write faults without VM_WRITE on something that is not a COW mapping is similarly broken, and e.g., do_wp_page() could end up placing an anonymous page into a shared mapping, which would be bad. This is a preparation for reliable R/O long-term pinning of pages in private mappings, whereby we want to make sure that we will never break COW in a read-only private mapping. Signed-off-by: David Hildenbrand --- mm/memory.c | 8 1 file changed, 8 insertions(+) diff --git a/mm/memory.c b/mm/memory.c index fe131273217a..826353da7b23 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5159,6 +5159,14 @@ static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma, */ if (!is_cow_mapping(vma->vm_flags)) *flags &= ~FAULT_FLAG_UNSHARE; + } else if (*flags & FAULT_FLAG_WRITE) { + /* Write faults on read-only mappings are impossible ... */ + if (WARN_ON_ONCE(!(vma->vm_flags & VM_MAYWRITE))) + return VM_FAULT_SIGSEGV; + /* ... and FOLL_FORCE only applies to COW mappings. */ + if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE) && +!is_cow_mapping(vma->vm_flags))) + return VM_FAULT_SIGSEGV; Not sure about the WARN_*(). Seems as if it might trigger in benign even if rare scenarios, e.g., mprotect() racing with page-fault. We most certainly would want to catch any such broken/racy cases. There are no benign cases I could possibly think of. Page faults need the mmap lock in read. mprotect() / VMA changes need the mmap lock in write. Whoever calls handle_mm_fault() is supposed to properly check VMA permissions. -- Thanks, David / dhildenb
[PATCH RFC 18/19] RDMA/hw/qib/qib_user_pages: remove FOLL_FORCE usage
FOLL_FORCE is really only for debugger access. As we unpin the pinned pages using unpin_user_pages_dirty_lock(true), the assumption is that all these pages are writable. FOLL_FORCE in this case seems to be a legacy leftover. Let's just remove it. Cc: Dennis Dalessandro Cc: Jason Gunthorpe Cc: Leon Romanovsky Signed-off-by: David Hildenbrand --- drivers/infiniband/hw/qib/qib_user_pages.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c index f4b5f05058e4..f693bc753b6b 100644 --- a/drivers/infiniband/hw/qib/qib_user_pages.c +++ b/drivers/infiniband/hw/qib/qib_user_pages.c @@ -110,7 +110,7 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages, for (got = 0; got < num_pages; got += ret) { ret = pin_user_pages(start_page + got * PAGE_SIZE, num_pages - got, -FOLL_LONGTERM | FOLL_WRITE | FOLL_FORCE, +FOLL_LONGTERM | FOLL_WRITE, p + got, NULL); if (ret < 0) { mmap_read_unlock(current->mm); -- 2.38.1
[PATCH RFC 19/19] habanalabs: remove FOLL_FORCE usage
FOLL_FORCE is really only for debugger access. As we unpin the pinned pages using unpin_user_pages_dirty_lock(true), the assumption is that all these pages are writable. FOLL_FORCE in this case seems to be due to copy-and-past from other drivers. Let's just remove it. Cc: Oded Gabbay Cc: Arnd Bergmann Cc: Greg Kroah-Hartman Signed-off-by: David Hildenbrand --- drivers/misc/habanalabs/common/memory.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/misc/habanalabs/common/memory.c b/drivers/misc/habanalabs/common/memory.c index ef28f3b37b93..e35cca96bbef 100644 --- a/drivers/misc/habanalabs/common/memory.c +++ b/drivers/misc/habanalabs/common/memory.c @@ -2312,8 +2312,7 @@ static int get_user_memory(struct hl_device *hdev, u64 addr, u64 size, if (!userptr->pages) return -ENOMEM; - rc = pin_user_pages_fast(start, npages, -FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM, + rc = pin_user_pages_fast(start, npages, FOLL_WRITE | FOLL_LONGTERM, userptr->pages); if (rc != npages) { -- 2.38.1
[PATCH RFC 16/19] mm/frame-vector: remove FOLL_FORCE usage
FOLL_FORCE is really only for debugger access. According to commit 707947247e95 ("media: videobuf2-vmalloc: get_userptr: buffers are always writable"), the pinned pages are always writable. FOLL_FORCE in this case seems to be a legacy leftover. Let's just remove it. Cc: Tomasz Figa Cc: Marek Szyprowski Cc: Mauro Carvalho Chehab Signed-off-by: David Hildenbrand --- drivers/media/common/videobuf2/frame_vector.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/media/common/videobuf2/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c index 542dde9d2609..062e98148c53 100644 --- a/drivers/media/common/videobuf2/frame_vector.c +++ b/drivers/media/common/videobuf2/frame_vector.c @@ -50,7 +50,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames, start = untagged_addr(start); ret = pin_user_pages_fast(start, nr_frames, - FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM, + FOLL_WRITE | FOLL_LONGTERM, (struct page **)(vec->ptrs)); if (ret > 0) { vec->got_ref = true; -- 2.38.1
[PATCH RFC 17/19] drm/exynos: remove FOLL_FORCE usage
FOLL_FORCE is really only for debugger access. As we unpin the pinned pages using unpin_user_pages_dirty_lock(true), the assumption is that all these pages are writable. FOLL_FORCE in this case seems to be a legacy leftover. Let's just remove it. Cc: Inki Dae Cc: Seung-Woo Kim Cc: Kyungmin Park Cc: David Airlie Cc: Daniel Vetter Cc: Krzysztof Kozlowski Signed-off-by: David Hildenbrand --- drivers/gpu/drm/exynos/exynos_drm_g2d.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/exynos/exynos_drm_g2d.c b/drivers/gpu/drm/exynos/exynos_drm_g2d.c index 471fd6c8135f..e19c2ceb3759 100644 --- a/drivers/gpu/drm/exynos/exynos_drm_g2d.c +++ b/drivers/gpu/drm/exynos/exynos_drm_g2d.c @@ -477,7 +477,7 @@ static dma_addr_t *g2d_userptr_get_dma_addr(struct g2d_data *g2d, } ret = pin_user_pages_fast(start, npages, - FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM, + FOLL_WRITE | FOLL_LONGTERM, g2d_userptr->pages); if (ret != npages) { DRM_DEV_ERROR(g2d->dev, -- 2.38.1
[PATCH RFC 15/19] media: pci/ivtv: remove FOLL_FORCE usage
FOLL_FORCE is really only for debugger access. R/O pinning a page is supposed to fail if the VMA misses proper access permissions (no VM_READ). Let's just remove FOLL_FORCE usage here; there would have to be a pretty good reason to allow arbitrary drivers to R/O pin pages in a PROT_NONE VMA. Most probably, FOLL_FORCE usage is just some legacy leftover. Cc: Andy Walls Cc: Mauro Carvalho Chehab Signed-off-by: David Hildenbrand --- drivers/media/pci/ivtv/ivtv-udma.c | 2 +- drivers/media/pci/ivtv/ivtv-yuv.c | 5 ++--- 2 files changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/media/pci/ivtv/ivtv-udma.c b/drivers/media/pci/ivtv/ivtv-udma.c index 210be8290f24..99b9f55ca829 100644 --- a/drivers/media/pci/ivtv/ivtv-udma.c +++ b/drivers/media/pci/ivtv/ivtv-udma.c @@ -115,7 +115,7 @@ int ivtv_udma_setup(struct ivtv *itv, unsigned long ivtv_dest_addr, /* Pin user pages for DMA Xfer */ err = pin_user_pages_unlocked(user_dma.uaddr, user_dma.page_count, - dma->map, FOLL_FORCE); + dma->map, 0); if (user_dma.page_count != err) { IVTV_DEBUG_WARN("failed to map user pages, returned %d instead of %d\n", diff --git a/drivers/media/pci/ivtv/ivtv-yuv.c b/drivers/media/pci/ivtv/ivtv-yuv.c index 4ba10c34a16a..582146f8d70d 100644 --- a/drivers/media/pci/ivtv/ivtv-yuv.c +++ b/drivers/media/pci/ivtv/ivtv-yuv.c @@ -63,12 +63,11 @@ static int ivtv_yuv_prep_user_dma(struct ivtv *itv, struct ivtv_user_dma *dma, /* Pin user pages for DMA Xfer */ y_pages = pin_user_pages_unlocked(y_dma.uaddr, - y_dma.page_count, &dma->map[0], FOLL_FORCE); + y_dma.page_count, &dma->map[0], 0); uv_pages = 0; /* silence gcc. value is set and consumed only if: */ if (y_pages == y_dma.page_count) { uv_pages = pin_user_pages_unlocked(uv_dma.uaddr, - uv_dma.page_count, &dma->map[y_pages], - FOLL_FORCE); + uv_dma.page_count, &dma->map[y_pages], 0); } if (y_pages != y_dma.page_count || uv_pages != uv_dma.page_count) { -- 2.38.1
[PATCH RFC 14/19] drm/etnaviv: remove FOLL_FORCE usage
GUP now supports reliable R/O long-term pinning in COW mappings, such that we break COW early. MAP_SHARED VMAs only use the shared zeropage so far in one corner case (DAXFS file with holes), which can be ignored because GUP does not support long-term pinning in fsdax (see check_vma_flags()). commit cd5297b0855f ("drm/etnaviv: Use FOLL_FORCE for userptr") documents that FOLL_FORCE | FOLL_WRITE was really only used for reliable R/O pinning. Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop using FOLL_FORCE, which is really only for debugger access. Cc: Daniel Vetter Cc: Lucas Stach Cc: Russell King Cc: Christian Gmeiner Cc: David Airlie Signed-off-by: David Hildenbrand --- drivers/gpu/drm/etnaviv/etnaviv_gem.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem.c b/drivers/gpu/drm/etnaviv/etnaviv_gem.c index cc386f8a7116..efe2240945d0 100644 --- a/drivers/gpu/drm/etnaviv/etnaviv_gem.c +++ b/drivers/gpu/drm/etnaviv/etnaviv_gem.c @@ -638,6 +638,7 @@ static int etnaviv_gem_userptr_get_pages(struct etnaviv_gem_object *etnaviv_obj) struct page **pvec = NULL; struct etnaviv_gem_userptr *userptr = &etnaviv_obj->userptr; int ret, pinned = 0, npages = etnaviv_obj->base.size >> PAGE_SHIFT; + unsigned int gup_flags = FOLL_LONGTERM; might_lock_read(&current->mm->mmap_lock); @@ -648,14 +649,15 @@ static int etnaviv_gem_userptr_get_pages(struct etnaviv_gem_object *etnaviv_obj) if (!pvec) return -ENOMEM; + if (!userptr->ro) + gup_flags |= FOLL_WRITE; + do { unsigned num_pages = npages - pinned; uint64_t ptr = userptr->ptr + pinned * PAGE_SIZE; struct page **pages = pvec + pinned; - ret = pin_user_pages_fast(ptr, num_pages, - FOLL_WRITE | FOLL_FORCE | FOLL_LONGTERM, - pages); + ret = pin_user_pages_fast(ptr, num_pages, gup_flags, pages); if (ret < 0) { unpin_user_pages(pvec, pinned); kvfree(pvec); -- 2.38.1
[PATCH RFC 13/19] media: videobuf-dma-sg: remove FOLL_FORCE usage
GUP now supports reliable R/O long-term pinning in COW mappings, such that we break COW early. MAP_SHARED VMAs only use the shared zeropage so far in one corner case (DAXFS file with holes), which can be ignored because GUP does not support long-term pinning in fsdax (see check_vma_flags()). Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop using FOLL_FORCE, which is really only for debugger access. Cc: Mauro Carvalho Chehab Signed-off-by: David Hildenbrand --- drivers/media/v4l2-core/videobuf-dma-sg.c | 14 +- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c index f75e5eedeee0..234e9f647c96 100644 --- a/drivers/media/v4l2-core/videobuf-dma-sg.c +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c @@ -151,17 +151,16 @@ static void videobuf_dma_init(struct videobuf_dmabuf *dma) static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma, int direction, unsigned long data, unsigned long size) { + unsigned int gup_flags = FOLL_LONGTERM; unsigned long first, last; - int err, rw = 0; - unsigned int flags = FOLL_FORCE; + int err; dma->direction = direction; switch (dma->direction) { case DMA_FROM_DEVICE: - rw = READ; + gup_flags |= FOLL_WRITE; break; case DMA_TO_DEVICE: - rw = WRITE; break; default: BUG(); @@ -177,14 +176,11 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma, if (NULL == dma->pages) return -ENOMEM; - if (rw == READ) - flags |= FOLL_WRITE; - dprintk(1, "init user [0x%lx+0x%lx => %lu pages]\n", data, size, dma->nr_pages); - err = pin_user_pages(data & PAGE_MASK, dma->nr_pages, -flags | FOLL_LONGTERM, dma->pages, NULL); + err = pin_user_pages(data & PAGE_MASK, dma->nr_pages, gup_flags, +dma->pages, NULL); if (err != dma->nr_pages) { dma->nr_pages = (err >= 0) ? err : 0; -- 2.38.1
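The direction-to-flags mapping in the videobuf patch above can be summarized standalone (illustrative constants, not the kernel's):

```c
#include <assert.h>

/* Illustrative stand-ins for kernel constants. */
#define FOLL_WRITE    0x1u
#define FOLL_LONGTERM 0x2u

enum dma_data_direction { DMA_TO_DEVICE, DMA_FROM_DEVICE };

/*
 * DMA_FROM_DEVICE means the device writes into the pinned user
 * pages, so the pin must be a writable one; DMA_TO_DEVICE only
 * reads from them and a R/O long-term pin suffices.
 */
static unsigned int videobuf_gup_flags(enum dma_data_direction dir)
{
	unsigned int gup_flags = FOLL_LONGTERM;

	if (dir == DMA_FROM_DEVICE)
		gup_flags |= FOLL_WRITE;
	return gup_flags;
}
```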
[PATCH RFC 12/19] RDMA/siw: remove FOLL_FORCE usage
GUP now supports reliable R/O long-term pinning in COW mappings, such that we break COW early. MAP_SHARED VMAs only use the shared zeropage so far in one corner case (DAXFS file with holes), which can be ignored because GUP does not support long-term pinning in fsdax (see check_vma_flags()). Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop using FOLL_FORCE, which is really only for debugger access. Cc: Bernard Metzler Cc: Jason Gunthorpe Cc: Leon Romanovsky Signed-off-by: David Hildenbrand --- drivers/infiniband/sw/siw/siw_mem.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c index 61c17db70d65..b2b33dd3b4fa 100644 --- a/drivers/infiniband/sw/siw/siw_mem.c +++ b/drivers/infiniband/sw/siw/siw_mem.c @@ -368,7 +368,7 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable) struct mm_struct *mm_s; u64 first_page_va; unsigned long mlock_limit; - unsigned int foll_flags = FOLL_WRITE; + unsigned int foll_flags = FOLL_LONGTERM; int num_pages, num_chunks, i, rv = 0; if (!can_do_mlock()) @@ -391,8 +391,8 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable) mmgrab(mm_s); - if (!writable) - foll_flags |= FOLL_FORCE; + if (writable) + foll_flags |= FOLL_WRITE; mmap_read_lock(mm_s); @@ -423,8 +423,7 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable) while (nents) { struct page **plist = &umem->page_chunk[i].plist[got]; - rv = pin_user_pages(first_page_va, nents, - foll_flags | FOLL_LONGTERM, + rv = pin_user_pages(first_page_va, nents, foll_flags, plist, NULL); if (rv < 0) goto out_sem_up; -- 2.38.1
[PATCH RFC 11/19] RDMA/usnic: remove FOLL_FORCE usage
GUP now supports reliable R/O long-term pinning in COW mappings, such that we break COW early. MAP_SHARED VMAs only use the shared zeropage so far in one corner case (DAXFS file with holes), which can be ignored because GUP does not support long-term pinning in fsdax (see check_vma_flags()). Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop using FOLL_FORCE, which is really only for debugger access. Cc: Christian Benvenuti Cc: Nelson Escobar Cc: Jason Gunthorpe Cc: Leon Romanovsky Signed-off-by: David Hildenbrand --- drivers/infiniband/hw/usnic/usnic_uiom.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c index 67923ced6e2d..c301b3be9f30 100644 --- a/drivers/infiniband/hw/usnic/usnic_uiom.c +++ b/drivers/infiniband/hw/usnic/usnic_uiom.c @@ -85,6 +85,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable, int dmasync, struct usnic_uiom_reg *uiomr) { struct list_head *chunk_list = &uiomr->chunk_list; + unsigned int gup_flags = FOLL_LONGTERM; struct page **page_list; struct scatterlist *sg; struct usnic_uiom_chunk *chunk; @@ -96,7 +97,6 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable, int off; int i; dma_addr_t pa; - unsigned int gup_flags; struct mm_struct *mm; /* @@ -131,8 +131,8 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable, goto out; } - gup_flags = FOLL_WRITE; - gup_flags |= (writable) ? 0 : FOLL_FORCE; + if (writable) + gup_flags |= FOLL_WRITE; cur_base = addr & PAGE_MASK; ret = 0; @@ -140,8 +140,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable, ret = pin_user_pages(cur_base, min_t(unsigned long, npages, PAGE_SIZE / sizeof(struct page *)), -gup_flags | FOLL_LONGTERM, -page_list, NULL); +gup_flags, page_list, NULL); if (ret < 0) goto out; -- 2.38.1
[PATCH RFC 10/19] RDMA/umem: remove FOLL_FORCE usage
GUP now supports reliable R/O long-term pinning in COW mappings, such that we break COW early. MAP_SHARED VMAs only use the shared zeropage so far in one corner case (DAXFS file with holes), which can be ignored because GUP does not support long-term pinning in fsdax (see check_vma_flags()). Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop using FOLL_FORCE, which is really only for debugger access. Cc: Jason Gunthorpe Cc: Leon Romanovsky Signed-off-by: David Hildenbrand --- drivers/infiniband/core/umem.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 86d479772fbc..755a9c57db6f 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -156,7 +156,7 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr, struct mm_struct *mm; unsigned long npages; int pinned, ret; - unsigned int gup_flags = FOLL_WRITE; + unsigned int gup_flags = FOLL_LONGTERM; /* * If the combination of the addr and size requested for this memory @@ -210,8 +210,8 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr, cur_base = addr & PAGE_MASK; - if (!umem->writable) - gup_flags |= FOLL_FORCE; + if (umem->writable) + gup_flags |= FOLL_WRITE; while (npages) { cond_resched(); @@ -219,7 +219,7 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr, min_t(unsigned long, npages, PAGE_SIZE / sizeof(struct page *)), - gup_flags | FOLL_LONGTERM, page_list); + gup_flags, page_list); if (pinned < 0) { ret = pinned; goto umem_release; -- 2.38.1
[PATCH RFC 08/19] mm: extend FAULT_FLAG_UNSHARE support to anything in a COW mapping
Extend FAULT_FLAG_UNSHARE to break COW on anything mapped into a COW (i.e., private writable) mapping and adjust the documentation accordingly. FAULT_FLAG_UNSHARE will now also break COW when encountering the shared zeropage, a pagecache page, a PFNMAP, ... inside a COW mapping, by properly replacing the mapped page/pfn by a private copy (an exclusive anonymous page). Note that only do_wp_page() needs care: hugetlb_wp() already handles FAULT_FLAG_UNSHARE correctly. wp_huge_pmd()/wp_huge_pud() also handles it correctly, for example, splitting the huge zeropage on FAULT_FLAG_UNSHARE such that we can handle FAULT_FLAG_UNSHARE on the PTE level. This change is a requirement for reliable long-term R/O pinning in COW mappings. Signed-off-by: David Hildenbrand --- include/linux/mm_types.h | 8 mm/memory.c | 4 2 files changed, 4 insertions(+), 8 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 834022721bc6..3f9fa01a3e24 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -965,9 +965,9 @@ typedef struct { * @FAULT_FLAG_REMOTE: The fault is not for current task/mm. * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch. * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals. - * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to unshare (and mark - * exclusive) a possibly shared anonymous page that is - * mapped R/O. + * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to break COW in a + * COW mapping, making sure that an exclusive anon page is + * mapped after the fault. * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached. *We should only access orig_pte if this flag set. * @@ -992,7 +992,7 @@ typedef struct { * * The combination FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE is illegal. * FAULT_FLAG_UNSHARE is ignored and treated like an ordinary read fault when - * no existing R/O-mapped anonymous page is encountered. 
+ * applied to mappings that are not COW mappings. */ enum fault_flag { FAULT_FLAG_WRITE = 1 << 0, diff --git a/mm/memory.c b/mm/memory.c index d2f9673755be..73ed83def548 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3431,10 +3431,6 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) } wp_page_reuse(vmf); return 0; - } else if (unshare) { - /* No anonymous page -> nothing to do. */ - pte_unmap_unlock(vmf->pte, vmf->ptl); - return 0; } copy: /* -- 2.38.1
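For reference, the "COW (i.e., private writable) mapping" test this series keeps leaning on boils down to: not MAP_SHARED, but writable or at least re-mappable as writable. A userspace sketch of the predicate, with illustrative VM_* bit values:

```c
#include <assert.h>

/* Illustrative stand-ins for the kernel's VM_* bits. */
#define VM_SHARED   0x1ul
#define VM_MAYWRITE 0x2ul

/*
 * Sketch of is_cow_mapping(): a private mapping where writes must be
 * resolved by copying into an exclusive anonymous page rather than
 * modifying the underlying file/zeropage.
 */
static int is_cow_mapping(unsigned long vm_flags)
{
	return (vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}
```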
[PATCH RFC 09/19] mm/gup: reliable R/O long-term pinning in COW mappings
We already support reliable R/O pinning of anonymous memory. However, assume we end up pinning (R/O long-term) a pagecache page or the shared zeropage inside a writable private ("COW") mapping. The next write access will trigger a write-fault and replace the pinned page by an exclusive anonymous page in the process page tables to break COW: the pinned page no longer corresponds to the page mapped into the process' page table. Now that FAULT_FLAG_UNSHARE can break COW on anything mapped into a COW mapping, let's properly break COW first before R/O long-term pinning something that's not an exclusive anon page inside a COW mapping. FAULT_FLAG_UNSHARE will break COW and map an exclusive anon page instead that can get pinned safely. With this change, we can stop using FOLL_FORCE|FOLL_WRITE for reliable R/O long-term pinning in COW mappings. With this change, the new R/O long-term pinning tests for non-anonymous memory succeed: # [RUN] R/O longterm GUP pin ... with shared zeropage ok 151 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with memfd ok 152 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with tmpfile ok 153 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with huge zeropage ok 154 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with memfd hugetlb (2048 kB) ok 155 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with memfd hugetlb (1048576 kB) ok 156 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with shared zeropage ok 157 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with memfd ok 158 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with tmpfile ok 159 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with huge zeropage ok 160 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (2048 kB) ok 161 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... 
with memfd hugetlb (1048576 kB) ok 162 Longterm R/O pin is reliable Note 1: We don't care about short-term R/O pins, because they have snapshot semantics: they are not supposed to observe modifications that happen after pinning. As one example, assume we start direct I/O to read from a page and store page content into a file: modifications to page content after starting direct I/O are not guaranteed to end up in the file. So even if we'd pin the shared zeropage, the end result would be as expected -- getting zeroes stored to the file. Note 2: For shared mappings we'll now always fall back to the slow path to look up the VMA when R/O long-term pinning. While that's the necessary price we have to pay right now, it's actually not that bad in practice: most FOLL_LONGTERM users already specify FOLL_WRITE, for example, along with FOLL_FORCE because they tried dealing with COW mappings correctly ... Note 3: For users that use FOLL_LONGTERM right now without FOLL_WRITE, such as VFIO, we'd now no longer pin the shared zeropage. Instead, we'd populate exclusive anon pages that we can pin. There was a concern that this could affect the memlock limit of existing setups. For example, a VM running with VFIO could run into the memlock limit and fail to run. However, we essentially had the same behavior already in commit 17839856fd58 ("gup: document and work around "COW can break either way" issue") which got merged into some enterprise distros, and there were not any such complaints. So most probably, we're fine. 
Signed-off-by: David Hildenbrand --- include/linux/mm.h | 27 --- mm/gup.c | 10 +- mm/huge_memory.c | 2 +- mm/hugetlb.c | 7 --- 4 files changed, 34 insertions(+), 12 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 517c8cc8ccb9..3252ed88b472 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3002,8 +3002,12 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags) * Must be called with the (sub)page that's actually referenced via the * page table entry, which might not necessarily be the head page for a * PTE-mapped THP. + * + * If the vma is NULL, we're coming from the GUP-fast path and might have + * to fallback to the slow path just to lookup the vma. */ -static inline bool gup_must_unshare(unsigned int flags, struct page *page) +static inline bool gup_must_unshare(struct vm_area_struct *vma, + unsigned int flags, struct page *page) { /* * FOLL_WRITE is implicitly handled correctly as the page table entry @@ -3016,8 +3020,25 @@ static inline bool gup_must_unshare(unsigned int flags, struct page *page) * Note: PageAnon(page) is stable until the page is actually getting * freed. */ - if (!PageAnon(page)) - return false; + if (!PageAnon(page)) { +
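Condensed, the decision this patch implements for R/O long-term pins can be modeled in userspace. This is a simplification of the real gup_must_unshare()/FAULT_FLAG_UNSHARE interplay, not kernel code:

```c
#include <assert.h>

/*
 * Simplified model: when long-term pinning R/O inside a COW mapping,
 * anything that is not an exclusive anonymous page (shared zeropage,
 * pagecache page, shared anon page) must first be replaced via
 * FAULT_FLAG_UNSHARE, so a later COW break cannot leave the pin
 * pointing at a page no longer mapped into the process.
 */
static int needs_unshare_before_ro_pin(int cow_mapping, int anon_page,
				       int anon_exclusive)
{
	if (!cow_mapping)
		return 0;	/* shared mapping: no COW break can occur */
	if (anon_page && anon_exclusive)
		return 0;	/* already stable for this process */
	return 1;		/* zeropage/pagecache/shared anon: unshare */
}
```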
[PATCH RFC 07/19] mm: don't call vm_ops->huge_fault() in wp_huge_pmd()/wp_huge_pud() for private mappings
If we already have a PMD/PUD mapped write-protected in a private mapping and we want to break COW either due to FAULT_FLAG_WRITE or FAULT_FLAG_UNSHARE, there is no need to inform the file system just like on the PTE path. Let's just split (->zap) + fallback in that case. This is a preparation for more generic FAULT_FLAG_UNSHARE support in COW mappings. Signed-off-by: David Hildenbrand --- mm/memory.c | 24 +++- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 41e4c697033a..d2f9673755be 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4791,6 +4791,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf) static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) { const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE; + vm_fault_t ret; if (vma_is_anonymous(vmf->vma)) { if (likely(!unshare) && @@ -4798,11 +4799,13 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) return handle_userfault(vmf, VM_UFFD_WP); return do_huge_pmd_wp_page(vmf); } - if (vmf->vma->vm_ops->huge_fault) { - vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD); - if (!(ret & VM_FAULT_FALLBACK)) - return ret; + if (vmf->vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) { + if (vmf->vma->vm_ops->huge_fault) { + ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD); + if (!(ret & VM_FAULT_FALLBACK)) + return ret; + } } /* COW or write-notify handled on pte level: split pmd. 
*/ @@ -4828,14 +4831,17 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud) { #if defined(CONFIG_TRANSPARENT_HUGEPAGE) &&\ defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) + vm_fault_t ret; + /* No support for anonymous transparent PUD pages yet */ if (vma_is_anonymous(vmf->vma)) goto split; - if (vmf->vma->vm_ops->huge_fault) { - vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD); - - if (!(ret & VM_FAULT_FALLBACK)) - return ret; + if (vmf->vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) { + if (vmf->vma->vm_ops->huge_fault) { + ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD); + if (!(ret & VM_FAULT_FALLBACK)) + return ret; + } } split: /* COW or write-notify not handled on PUD level: split pud.*/ -- 2.38.1
[PATCH RFC 06/19] mm: rework handling in do_wp_page() based on private vs. shared mappings
We want to extent FAULT_FLAG_UNSHARE support to anything mapped into a COW mapping (pagecache page, zeropage, PFN, ...), not just anonymous pages. Let's prepare for that by handling shared mappings first such that we can handle private mappings last. While at it, use folio-based functions instead of page-based functions where we touch the code either way. Signed-off-by: David Hildenbrand --- mm/memory.c | 38 +- 1 file changed, 17 insertions(+), 21 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 826353da7b23..41e4c697033a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3341,7 +3341,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) { const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE; struct vm_area_struct *vma = vmf->vma; - struct folio *folio; + struct folio *folio = NULL; if (likely(!unshare)) { if (userfaultfd_pte_wp(vma, *vmf->pte)) { @@ -3359,13 +3359,12 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) } vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte); - if (!vmf->page) { - if (unlikely(unshare)) { - /* No anonymous page -> nothing to do. */ - pte_unmap_unlock(vmf->pte, vmf->ptl); - return 0; - } + /* +* Shared mapping: we are guaranteed to have VM_WRITE and +* FAULT_FLAG_WRITE set at this point. +*/ + if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) { /* * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a * VM_PFNMAP VMA. @@ -3373,20 +3372,19 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) * We should not cow pages in a shared writeable mapping. * Just mark the pages writable and/or call ops->pfn_mkwrite. */ - if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == -(VM_WRITE|VM_SHARED)) + if (!vmf->page) return wp_pfn_shared(vmf); - - pte_unmap_unlock(vmf->pte, vmf->ptl); - return wp_page_copy(vmf); + return wp_page_shared(vmf); } + if (vmf->page) + folio = page_folio(vmf->page); + /* -* Take out anonymous pages first, anonymous shared vmas are -* not dirty accountable. 
+* Private mapping: create an exclusive anonymous page copy if reuse +* is impossible. We might miss VM_WRITE for FOLL_FORCE handling. */ - folio = page_folio(vmf->page); - if (folio_test_anon(folio)) { + if (folio && folio_test_anon(folio)) { /* * If the page is exclusive to this process we must reuse the * page without further checks. @@ -3437,19 +3435,17 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) /* No anonymous page -> nothing to do. */ pte_unmap_unlock(vmf->pte, vmf->ptl); return 0; - } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == - (VM_WRITE|VM_SHARED))) { - return wp_page_shared(vmf); } copy: /* * Ok, we need to copy. Oh, well.. */ - get_page(vmf->page); + if (folio) + folio_get(folio); pte_unmap_unlock(vmf->pte, vmf->ptl); #ifdef CONFIG_KSM - if (PageKsm(vmf->page)) + if (folio && folio_test_ksm(folio)) count_vm_event(COW_KSM); #endif return wp_page_copy(vmf); -- 2.38.1
[PATCH RFC 04/19] mm: add early FAULT_FLAG_UNSHARE consistency checks
For now, FAULT_FLAG_UNSHARE only applies to anonymous pages, which
implies a COW mapping. Let's hide FAULT_FLAG_UNSHARE early if we're not
dealing with a COW mapping, such that we treat it like a read fault as
documented and don't have to worry about the flag throughout all fault
handlers.

While at it, centralize the check for mutual exclusion of
FAULT_FLAG_UNSHARE and FAULT_FLAG_WRITE and just drop the check that
either flag is set in the WP handler.

Signed-off-by: David Hildenbrand
---
 mm/huge_memory.c |  3 ---
 mm/hugetlb.c     |  5 -----
 mm/memory.c      | 23 ++++++++++++++++++++---
 3 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1d47b3f7b877..7173756d6868 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1267,9 +1267,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
 	VM_BUG_ON_VMA(!vma->anon_vma, vma);
 
-	VM_BUG_ON(unshare && (vmf->flags & FAULT_FLAG_WRITE));
-	VM_BUG_ON(!unshare && !(vmf->flags & FAULT_FLAG_WRITE));
-
 	if (is_huge_zero_pmd(orig_pmd))
 		goto fallback;
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index be572af75d9c..3672c7e06748 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5316,9 +5316,6 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long haddr = address & huge_page_mask(h);
 	struct mmu_notifier_range range;
 
-	VM_BUG_ON(unshare && (flags & FOLL_WRITE));
-	VM_BUG_ON(!unshare && !(flags & FOLL_WRITE));
-
 	/*
 	 * hugetlb does not support FOLL_FORCE-style write faults that keep the
 	 * PTE mapped R/O such as maybe_mkwrite() would do.
@@ -5328,8 +5325,6 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 	/* Let's take out MAP_SHARED mappings first. */
 	if (vma->vm_flags & VM_MAYSHARE) {
-		if (unlikely(unshare))
-			return 0;
 		set_huge_ptep_writable(vma, haddr, ptep);
 		return 0;
 	}
 
diff --git a/mm/memory.c b/mm/memory.c
index 78e2c58f6f31..fe131273217a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3343,9 +3343,6 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio;
 
-	VM_BUG_ON(unshare && (vmf->flags & FAULT_FLAG_WRITE));
-	VM_BUG_ON(!unshare && !(vmf->flags & FAULT_FLAG_WRITE));
-
 	if (likely(!unshare)) {
 		if (userfaultfd_pte_wp(vma, *vmf->pte)) {
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -5150,6 +5147,22 @@ static void lru_gen_exit_fault(void)
 }
 #endif /* CONFIG_LRU_GEN */
 
+static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma,
+				       unsigned int *flags)
+{
+	if (unlikely(*flags & FAULT_FLAG_UNSHARE)) {
+		if (WARN_ON_ONCE(*flags & FAULT_FLAG_WRITE))
+			return VM_FAULT_SIGSEGV;
+		/*
+		 * FAULT_FLAG_UNSHARE only applies to COW mappings. Let's
+		 * just treat it like an ordinary read-fault otherwise.
+		 */
+		if (!is_cow_mapping(vma->vm_flags))
+			*flags &= ~FAULT_FLAG_UNSHARE;
+	}
+	return 0;
+}
+
 /*
  * By the time we get here, we already hold the mm semaphore
  *
@@ -5166,6 +5179,10 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	count_vm_event(PGFAULT);
 	count_memcg_event_mm(vma->vm_mm, PGFAULT);
 
+	ret = sanitize_fault_flags(vma, &flags);
+	if (ret)
+		return ret;
+
 	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
 					    flags & FAULT_FLAG_INSTRUCTION,
 					    flags & FAULT_FLAG_REMOTE))
-- 
2.38.1
[PATCH RFC 05/19] mm: add early FAULT_FLAG_WRITE consistency checks
Let's catch abuse of FAULT_FLAG_WRITE early, such that we don't have to
care in all other handlers and might get "surprises" if we forget to do
so.

Write faults without VM_MAYWRITE don't make any sense, and our
maybe_mkwrite() logic could have hidden such abuse for now.

Write faults without VM_WRITE on something that is not a COW mapping is
similarly broken, and e.g., do_wp_page() could end up placing an
anonymous page into a shared mapping, which would be bad.

This is a preparation for reliable R/O long-term pinning of pages in
private mappings, whereby we want to make sure that we will never break
COW in a read-only private mapping.

Signed-off-by: David Hildenbrand
---
 mm/memory.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index fe131273217a..826353da7b23 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5159,6 +5159,14 @@ static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma,
 		 */
 		if (!is_cow_mapping(vma->vm_flags))
 			*flags &= ~FAULT_FLAG_UNSHARE;
+	} else if (*flags & FAULT_FLAG_WRITE) {
+		/* Write faults on read-only mappings are impossible ... */
+		if (WARN_ON_ONCE(!(vma->vm_flags & VM_MAYWRITE)))
+			return VM_FAULT_SIGSEGV;
+		/* ... and FOLL_FORCE only applies to COW mappings. */
+		if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE) &&
+				 !is_cow_mapping(vma->vm_flags)))
+			return VM_FAULT_SIGSEGV;
 	}
 	return 0;
 }
-- 
2.38.1
[PATCH RFC 03/19] selftests/vm: cow: R/O long-term pinning reliability tests for non-anon pages
Let's test whether R/O long-term pinning is reliable for non-anonymous
memory: when R/O long-term pinning a page, the expectation is that we
break COW early before pinning, such that actual write access via the
page tables won't break COW later and end up replacing the R/O-pinned
page in the page table.

Consequently, R/O long-term pinning in private mappings would only
target exclusive anonymous pages.

For now, all tests fail:

# [RUN] R/O longterm GUP pin ... with shared zeropage
not ok 151 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with memfd
not ok 152 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with tmpfile
not ok 153 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with huge zeropage
not ok 154 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with memfd hugetlb (2048 kB)
not ok 155 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with memfd hugetlb (1048576 kB)
not ok 156 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with shared zeropage
not ok 157 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with memfd
not ok 158 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with tmpfile
not ok 159 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with huge zeropage
not ok 160 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (2048 kB)
not ok 161 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (1048576 kB)
not ok 162 Longterm R/O pin is reliable

Signed-off-by: David Hildenbrand
---
 tools/testing/selftests/vm/cow.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vm/cow.c b/tools/testing/selftests/vm/cow.c
index 93c643bcdcf5..40ba45d0c6b4 100644
--- a/tools/testing/selftests/vm/cow.c
+++ b/tools/testing/selftests/vm/cow.c
@@ -534,6 +534,7 @@ static void test_iouring_fork(char *mem, size_t size)
 #endif /* LOCAL_CONFIG_HAVE_LIBURING */
 
 enum ro_pin_test {
+	RO_PIN_TEST,
 	RO_PIN_TEST_SHARED,
 	RO_PIN_TEST_PREVIOUSLY_SHARED,
 	RO_PIN_TEST_RO_EXCLUSIVE,
@@ -566,6 +567,8 @@ static void do_test_ro_pin(char *mem, size_t size, enum ro_pin_test test,
 	}
 
 	switch (test) {
+	case RO_PIN_TEST:
+		break;
 	case RO_PIN_TEST_SHARED:
 	case RO_PIN_TEST_PREVIOUSLY_SHARED:
 		/*
@@ -1150,6 +1153,16 @@ static void test_cow(char *mem, const char *smem, size_t size)
 	free(old);
 }
 
+static void test_ro_pin(char *mem, const char *smem, size_t size)
+{
+	do_test_ro_pin(mem, size, RO_PIN_TEST, false);
+}
+
+static void test_ro_fast_pin(char *mem, const char *smem, size_t size)
+{
+	do_test_ro_pin(mem, size, RO_PIN_TEST, true);
+}
+
 static void run_with_zeropage(non_anon_test_fn fn, const char *desc)
 {
 	char *mem, *smem, tmp;
@@ -1390,7 +1403,7 @@ struct non_anon_test_case {
 };
 
 /*
- * Test cases that target any pages in private mappings that are non anonymous:
+ * Test cases that target any pages in private mappings that are not anonymous:
  * pages that may get shared via COW independent of fork(). This includes
  * the shared zeropage(s), pagecache pages, ...
  */
@@ -1403,6 +1416,19 @@ static const struct non_anon_test_case non_anon_test_cases[] = {
 		"Basic COW",
 		test_cow,
 	},
+	/*
+	 * Take a R/O longterm pin. When modifying the page via the page table,
+	 * the page content change must be visible via the pin.
+	 */
+	{
+		"R/O longterm GUP pin",
+		test_ro_pin,
+	},
+	/* Same as above, but using GUP-fast. */
+	{
+		"R/O longterm GUP-fast pin",
+		test_ro_fast_pin,
+	},
 };
 
 static void run_non_anon_test_case(struct non_anon_test_case const *test_case)
-- 
2.38.1