Re: [RFC PATCH RESEND 00/28] per-VMA locks proposal

2022-09-27 Thread Suren Baghdasaryan
On Sun, Sep 11, 2022 at 2:35 AM Vlastimil Babka  wrote:
>
> On 9/2/22 01:26, Suren Baghdasaryan wrote:
> > On Thu, Sep 1, 2022 at 1:58 PM Kent Overstreet
> >  wrote:
> >>
> >> On Thu, Sep 01, 2022 at 10:34:48AM -0700, Suren Baghdasaryan wrote:
> >> > Resending to fix the issue with the In-Reply-To tag in the original
> >> > submission at [4].
> >> >
> >> > This is a proof of concept for per-vma locks idea that was discussed
> >> > during SPF [1] discussion at LSF/MM this year [2], which concluded with
> >> > suggestion that “a reader/writer semaphore could be put into the VMA
> >> > itself; that would have the effect of using the VMA as a sort of range
> >> > lock. There would still be contention at the VMA level, but it would
> >> > be an improvement.” This patchset implements this suggested approach.
> >> >
> >> > When handling page faults we lookup the VMA that contains the faulting
> >> > page under RCU protection and try to acquire its lock. If that fails we
> >> > fall back to using mmap_lock, similar to how SPF handled this situation.
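
(For illustration, a rough sketch of the fault path described above: look up
the VMA under RCU, try its lock, and fall back to mmap_lock on failure.
Helper names such as lock_vma_under_rcu(), vma_read_unlock() and
FAULT_FLAG_VMA_LOCK follow the later per-VMA lock postings collected below;
the wrapper itself is hypothetical, not actual patch code.)

static vm_fault_t fault_path_sketch(struct mm_struct *mm, unsigned long addr,
                                    unsigned int flags, struct pt_regs *regs)
{
        struct vm_area_struct *vma;
        vm_fault_t fault;

        /* Fast path: find and read-lock the VMA under RCU only. */
        vma = lock_vma_under_rcu(mm, addr);
        if (vma) {
                fault = handle_mm_fault(vma, addr,
                                        flags | FAULT_FLAG_VMA_LOCK, regs);
                vma_read_unlock(vma);
                if (!(fault & VM_FAULT_RETRY))
                        return fault;
                /* The per-VMA lock was not enough; retry under mmap_lock. */
        }

        /* Slow path: classic mmap_lock-protected fault handling. */
        mmap_read_lock(mm);
        vma = find_vma(mm, addr);
        fault = vma ? handle_mm_fault(vma, addr, flags, regs)
                    : VM_FAULT_SIGSEGV;
        mmap_read_unlock(mm);
        return fault;
}
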
> >> >
> >> > One notable way the implementation deviates from the proposal is the way
> >> > VMAs are marked as locked. Because during some of mm updates multiple
> >> > VMAs need to be locked until the end of the update (e.g. vma_merge,
> >> > split_vma, etc). Tracking all the locked VMAs, avoiding recursive locks
> >> > and other complications would make the code more complex. Therefore we
> >> > provide a way to "mark" VMAs as locked and then unmark all locked VMAs
> >> > all at once. This is done using two sequence numbers - one in the
> >> > vm_area_struct and one in the mm_struct. VMA is considered locked when
> >> > these sequence numbers are equal. To mark a VMA as locked we set the
> >> > sequence number in vm_area_struct to be equal to the sequence number
> >> > in mm_struct. To unlock all VMAs we increment mm_struct's seq number.
> >> > This allows for an efficient way to track locked VMAs and to drop the
> >> > locks on all VMAs at the end of the update.
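
(To make the sequence-number scheme concrete, a minimal sketch; the structure
and function names are simplified illustrations, not the actual patch code.)

struct mm_sketch {
        int mm_lock_seq;        /* bumped when the mmap_lock writer is done */
};

struct vma_sketch {
        struct rw_semaphore lock;
        int vm_lock_seq;        /* == mm_lock_seq  =>  VMA is marked locked */
        struct mm_sketch *vm_mm;
};

/* Mark one VMA locked; may block until current readers drain. */
static void vma_mark_locked(struct vma_sketch *vma)
{
        int seq = READ_ONCE(vma->vm_mm->mm_lock_seq);

        if (vma->vm_lock_seq == seq)    /* already marked in this update cycle */
                return;
        down_write(&vma->lock);
        vma->vm_lock_seq = seq;
        up_write(&vma->lock);
}

/* Drop the mark from every locked VMA at once: one increment, no list. */
static void vma_mark_unlocked_all(struct mm_sketch *mm)
{
        WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
}
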
> >>
> >> I like it - the sequence numbers are a stroke of genius. For what it's
> >> doing the patchset seems almost small.
> >
> > Thanks for reviewing it!
> >
> >>
> >> Two complaints so far:
> >>  - I don't like the vma_mark_locked() name. To me it says that the caller
> >>    already took or is taking the lock and this function is just marking
> >>    that we're holding the lock, but it's really taking a different type of
> >>    lock. But this function can block, it really is taking a lock, so it
> >>    should say that.
> >>
> >>    This is AFAIK a new concept, not sure I'm going to have anything good
> >>    either, but perhaps vma_lock_multiple()?
> >
> > I'm open to name suggestions but vma_lock_multiple() is a bit
> > confusing to me. Will wait for more suggestions.
>
> Well, it does act like a vma_write_lock(), no? So why not that name. The
> checking function for it is even called vma_assert_write_locked().
>
> We just don't provide a single vma_write_unlock(), but a
> vma_mark_unlocked_all(), that could be instead named e.g.
> vma_write_unlock_all().
> But it's called on a mm, so maybe e.g. mm_vma_write_unlock_all()?

Thank you for your suggestions, Vlastimil! vma_write_lock() sounds
good to me. For vma_mark_unlocked_all() replacement, I would prefer
vma_write_unlock_all() which keeps the vma_write_XXX naming pattern to
indicate that these are operating on the same locks. If the fact that
it accepts mm_struct as a parameter is an issue then maybe
vma_write_unlock_mm() ?

>
>


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-10 Thread Suren Baghdasaryan
On Tue, Jan 10, 2023 at 12:04 AM Vlastimil Babka  wrote:
>
> On 1/9/23 21:53, Suren Baghdasaryan wrote:
> > rw_semaphore is a sizable structure of 40 bytes and consumes
> > considerable space for each vm_area_struct. However vma_lock has
> > two important specifics which can be used to replace rw_semaphore
> > with a simpler structure:
> > 1. Readers never wait. They try to take the vma_lock and fall back to
> > mmap_lock if that fails.
> > 2. Only one writer at a time will ever try to write-lock a vma_lock
> > because writers first take mmap_lock in write mode.
> > Because of these requirements, full rw_semaphore functionality is not
> > needed and we can replace rw_semaphore with an atomic variable.
> > When a reader takes read lock, it increments the atomic unless the
> > value is negative. If that fails read-locking is aborted and mmap_lock
> > is used instead.
> > When writer takes write lock, it resets atomic value to -1 if the
> > current value is 0 (no readers). Since all writers take mmap_lock in
> > write mode first, there can be only one writer at a time. If there
> > are readers, writer will place itself into a wait queue using new
> > mm_struct.vma_writer_wait waitqueue head. The last reader to release
> > the vma_lock will signal the writer to wake up.
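
(A simplified sketch of the reader/writer protocol described above; the
vm_lock->count and vma_writer_wait names approximate the patch, and the
overflow and vm_lock_seq checks are omitted.)

static bool vma_read_trylock_sketch(struct vm_area_struct *vma)
{
        /* Readers never wait: bump the count unless a writer holds it (< 0). */
        if (!atomic_inc_unless_negative(&vma->vm_lock->count))
                return false;   /* caller falls back to mmap_lock */
        /* (overflow and vm_lock_seq checks from the patch omitted here) */
        return true;
}

static void vma_write_lock_sketch(struct vm_area_struct *vma)
{
        /* Only one writer can be here: it already holds mmap_lock for write. */
        if (atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0)
                return;         /* no readers, write lock taken */

        /* Readers are active: sleep until the last one wakes us up. */
        wait_event(vma->vm_mm->vma_writer_wait,
                   atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0);
}

static void vma_read_unlock_sketch(struct vm_area_struct *vma)
{
        /* The reader that drops the count to zero signals a waiting writer. */
        if (atomic_dec_and_test(&vma->vm_lock->count))
                wake_up(&vma->vm_mm->vma_writer_wait);
}
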
> > vm_lock_seq is also moved into vma_lock and along with atomic_t they
> > are nicely packed and consume 8 bytes, bringing the overhead of
> > vma_lock from 44 to 16 bytes:
> >
> > slabinfo before the changes:
> >  ...: ...
> > vm_area_struct...152   532 : ...
> >
> > slabinfo with vma_lock:
> >  ...: ...
> > rw_semaphore  ...  8  5121 : ...
>
> I guess the cache is called vma_lock, not rw_semaphore?

Yes, sorry. Copy/paste error when combining the results. The numbers
though look correct, so I did not screw up that part :)

>
> > vm_area_struct...160   512 : ...
> >
> > Assuming 40000 vm_area_structs, memory consumption would be:
> > baseline: 6040kB
> > vma_lock (vm_area_structs+vma_lock): 6280kB+316kB=6596kB
> > Total increase: 556kB
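
(For reference, the arithmetic behind these numbers, assuming the slab
geometry from the slabinfo above: 53 x 152-byte objects per 2-page slab
before; 51 x 160-byte objects per 2-page slab plus 512 x 8-byte vma_lock
objects per 1-page slab after:

  40000 / 53 per 8kB slab  ~= 755 slabs -> 6040kB (baseline)
  40000 / 51 per 8kB slab  ~= 785 slabs -> 6280kB (vm_area_struct)
  40000 / 512 per 4kB slab ~=  79 slabs ->  316kB (vma_lock)

6280kB + 316kB = 6596kB, i.e. 556kB more than the 6040kB baseline.)
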
> >
> > atomic_t might overflow if there are many competing readers, therefore
> > vma_read_trylock() implements an overflow check and if that occurs it
> > restores the previous value and exits with a failure to lock.
> >
> > Signed-off-by: Suren Baghdasaryan 
>
> This patch is an interesting addition indeed, but I can't help but
> think it obsoletes the previous one :) We allocate an extra 8 bytes slab
> object for the lock, and the pointer to it is also 8 bytes, and requires an
> indirection. The vma_lock cache is not cacheline aligned (otherwise it would
> be a major waste), so we have potential false sharing with up to 7 other
> vma_lock's.

True, I thought long and hard about combining the last two patches but
decided to keep them separate to document the intent. The previous
patch splits the lock for performance reasons and this one is focused
on memory consumption. I'm open to changing this if it's confusing.

> I'd expect if the vma_lock was placed with the relatively cold fields of
> vm_area_struct, it shouldn't cause much cache ping pong when working with
> that vma. Even if we don't cache align the vma to save memory (would be 192
> bytes instead of 160 when aligned) and place the vma_lock and the cold
> fields at the end of the vma, it may be false sharing the cacheline with the
> next vma in the slab.

I would love to combine the vma_lock with vm_area_struct and I spent
several days trying different combinations to achieve decent
performance. My best achieved result was when I placed the vm_lock
into the third cache line at offset 192 and allocated vm_area_structs
from cache-aligned slab (horrible memory waste with each vma consuming
256 bytes). Even then I see regression in pft-threads test on a NUMA
machine (where cache-bouncing problem is most pronounced):

This is the result with split vma locks (current version). The higher the
number, the better:

                          BASE                      PVL
Hmean faults/sec-1    469201.7282 (   0.00%)   464453.3976 *  -1.01%*
Hmean faults/sec-4   1754465.6221 (   0.00%)  1660688.0452 *  -5.35%*
Hmean faults/sec-7   2808141.6711 (   0.00%)  2688910.6458 *  -4.25%*
Hmean faults/sec-12  3750307.7553 (   0.00%)  3863490.2057 *   3.02%*
Hmean faults/sec-21  4145672.4677 (   0.00%)  3904532.7241 *  -5.82%*
Hmean faults/sec-30  3775722.5726 (   0.00%)  3923225.3734 *   3.91%*
Hmean faults/sec-48  4152563.5864 (   0.00%)  4783720.6811 *  15.20%*
Hmean faults/sec-56  4163868.7111 (   0.00%)  4851473.7241 *  16.51%*

Here are results 

Re: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK

2023-01-10 Thread Suren Baghdasaryan
On Tue, Jan 10, 2023 at 4:39 PM Davidlohr Bueso  wrote:
>
> On Mon, 09 Jan 2023, Suren Baghdasaryan wrote:
>
> >This configuration variable will be used to build the support for VMA
> >locking during page fault handling.
> >
> >This is enabled by default on supported architectures with SMP and MMU
> >set.
> >
> >The architecture support is needed since the page fault handler is called
> >from the architecture's page faulting code which needs modifications to
> >handle faults under VMA lock.
>
> I don't think that per-vma locking should be something that is
> user-configurable. It should just be dependent on the arch. So maybe just
> remove CONFIG_PER_VMA_LOCK?

Thanks for the suggestion! I would be happy to make that change if
there are no objections. I think the only pushback might have been the
vma size increase but with the latest optimization in the last patch
maybe that's less of an issue?

>
> Thanks,
> Davidlohr
>


[PATCH 03/41] maple_tree: Fix freeing of nodes in rcu mode

2023-01-09 Thread Suren Baghdasaryan
From: Liam Howlett 

The walk to destroy the nodes was not always setting the node type and
would result in a destroy method potentially using the values as nodes.
Avoid this by setting the correct node types.  This is necessary for the
RCU mode of the maple tree.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett 
Signed-off-by: Suren Baghdasaryan 
---
 lib/maple_tree.c | 73 
 1 file changed, 62 insertions(+), 11 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index a748938ad2e9..a11eea943f8d 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -897,6 +897,44 @@ static inline void ma_set_meta(struct maple_node *mn, enum 
maple_type mt,
meta->end = end;
 }
 
+/*
+ * mas_clear_meta() - clear the metadata information of a node, if it exists
+ * @mas: The maple state
+ * @mn: The maple node
+ * @mt: The maple node type
+ * @offset: The offset of the highest sub-gap in this node.
+ * @end: The end of the data in this node.
+ */
+static inline void mas_clear_meta(struct ma_state *mas, struct maple_node *mn,
+ enum maple_type mt)
+{
+   struct maple_metadata *meta;
+   unsigned long *pivots;
+   void __rcu **slots;
+   void *next;
+
+   switch (mt) {
+   case maple_range_64:
+   pivots = mn->mr64.pivot;
+   if (unlikely(pivots[MAPLE_RANGE64_SLOTS - 2])) {
+   slots = mn->mr64.slot;
+   next = mas_slot_locked(mas, slots,
+  MAPLE_RANGE64_SLOTS - 1);
+   if (unlikely((mte_to_node(next) &&
+		 mte_node_type(next))))
+	   return; /* The last slot is a node, no metadata */
+   }
+   fallthrough;
+   case maple_arange_64:
+   meta = ma_meta(mn, mt);
+   break;
+   default:
+   return;
+   }
+
+   meta->gap = 0;
+   meta->end = 0;
+}
+
 /*
  * ma_meta_end() - Get the data end of a node from the metadata
  * @mn: The maple node
@@ -5448,20 +5486,22 @@ static inline int mas_rev_alloc(struct ma_state *mas, 
unsigned long min,
  * mas_dead_leaves() - Mark all leaves of a node as dead.
  * @mas: The maple state
  * @slots: Pointer to the slot array
+ * @type: The maple node type
  *
  * Must hold the write lock.
  *
  * Return: The number of leaves marked as dead.
  */
 static inline
-unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots)
+unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots,
+ enum maple_type mt)
 {
struct maple_node *node;
enum maple_type type;
void *entry;
int offset;
 
-   for (offset = 0; offset < mt_slot_count(mas->node); offset++) {
+   for (offset = 0; offset < mt_slots[mt]; offset++) {
entry = mas_slot_locked(mas, slots, offset);
type = mte_node_type(entry);
node = mte_to_node(entry);
@@ -5480,14 +5520,13 @@ unsigned char mas_dead_leaves(struct ma_state *mas, 
void __rcu **slots)
 
 static void __rcu **mas_dead_walk(struct ma_state *mas, unsigned char offset)
 {
-   struct maple_node *node, *next;
+   struct maple_node *next;
void __rcu **slots = NULL;
 
next = mas_mn(mas);
do {
-   mas->node = ma_enode_ptr(next);
-   node = mas_mn(mas);
-   slots = ma_slots(node, node->type);
+   mas->node = mt_mk_node(next, next->type);
+   slots = ma_slots(next, next->type);
next = mas_slot_locked(mas, slots, offset);
offset = 0;
} while (!ma_is_leaf(next->type));
@@ -5551,11 +5590,14 @@ static inline void __rcu **mas_destroy_descend(struct 
ma_state *mas,
node = mas_mn(mas);
slots = ma_slots(node, mte_node_type(mas->node));
next = mas_slot_locked(mas, slots, 0);
-   if ((mte_dead_node(next)))
+   if ((mte_dead_node(next))) {
+   mte_to_node(next)->type = mte_node_type(next);
next = mas_slot_locked(mas, slots, 1);
+   }
 
mte_set_node_dead(mas->node);
node->type = mte_node_type(mas->node);
+   mas_clear_meta(mas, node, node->type);
node->piv_parent = prev;
node->parent_slot = offset;
offset = 0;
@@ -5575,13 +5617,18 @@ static void mt_destroy_walk(struct maple_enode *enode, 
unsigned char ma_flags,
 
MA_STATE(mas, , 0, 0);
 
-   if (mte_is_leaf(enode))
+   mas.node = enode;
+   if (mte_is_leaf(enode)) {
+   node->type = mte_node_type(enode);
goto free_leaf;
+   }
 
+

[PATCH 06/41] maple_tree: Add smp_rmb() to dead node detection

2023-01-09 Thread Suren Baghdasaryan
From: "Liam R. Howlett" 

Add an smp_rmb() before reading the parent pointer to ensure that
anything read from the node prior to the parent pointer hasn't been
reordered ahead of this check.

This is necessary for RCU mode.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett 
Signed-off-by: Suren Baghdasaryan 
---
 lib/maple_tree.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 8066fb1e8ec9..80ca28b656d3 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -535,9 +535,11 @@ static inline struct maple_node *mte_parent(const struct 
maple_enode *enode)
  */
 static inline bool ma_dead_node(const struct maple_node *node)
 {
-   struct maple_node *parent = (void *)((unsigned long)
-node->parent & ~MAPLE_NODE_MASK);
+   struct maple_node *parent;
 
+   /* Do not reorder reads from the node prior to the parent check */
+   smp_rmb();
+   parent = (void *)((unsigned long) node->parent & ~MAPLE_NODE_MASK);
return (parent == node);
 }
 
@@ -552,6 +554,8 @@ static inline bool mte_dead_node(const struct maple_enode 
*enode)
struct maple_node *parent, *node;
 
node = mte_to_node(enode);
+   /* Do not reorder reads from the node prior to the parent check */
+   smp_rmb();
parent = mte_parent(enode);
return (parent == node);
 }
-- 
2.39.0



[PATCH 13/41] mm: introduce vma->vm_flags modifier functions

2023-01-09 Thread Suren Baghdasaryan
To keep vma locking correctness when vm_flags are modified, add modifier
functions to be used whenever flags are updated.
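
For example (illustrative only; the actual conversions happen in a later
patch of this series), code that previously modified the flags directly:

	vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND;

would now use the locked modifier:

	set_vm_flags(vma, VM_DONTCOPY | VM_DONTEXPAND);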

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm.h   | 38 ++
 include/linux/mm_types.h |  8 +++-
 2 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ec2c4c227d51..35cf0a6cbcc2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -702,6 +702,44 @@ static inline void vma_init(struct vm_area_struct *vma, 
struct mm_struct *mm)
vma_init_lock(vma);
 }
 
+/* Use when VMA is not part of the VMA tree and needs no locking */
+static inline
+void init_vm_flags(struct vm_area_struct *vma, unsigned long flags)
+{
+   WRITE_ONCE(vma->vm_flags, flags);
+}
+
+/* Use when VMA is part of the VMA tree and needs appropriate locking */
+static inline
+void reset_vm_flags(struct vm_area_struct *vma, unsigned long flags)
+{
+   vma_write_lock(vma);
+   init_vm_flags(vma, flags);
+}
+
+static inline
+void set_vm_flags(struct vm_area_struct *vma, unsigned long flags)
+{
+   vma_write_lock(vma);
+   vma->vm_flags |= flags;
+}
+
+static inline
+void clear_vm_flags(struct vm_area_struct *vma, unsigned long flags)
+{
+   vma_write_lock(vma);
+   vma->vm_flags &= ~flags;
+}
+
+static inline
+void mod_vm_flags(struct vm_area_struct *vma,
+ unsigned long set, unsigned long clear)
+{
+   vma_write_lock(vma);
+   vma->vm_flags |= set;
+   vma->vm_flags &= ~clear;
+}
+
 static inline void vma_set_anonymous(struct vm_area_struct *vma)
 {
vma->vm_ops = NULL;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5f7c5ca89931..0d27edd3e63a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -553,7 +553,13 @@ struct vm_area_struct {
 * See vmf_insert_mixed_prot() for discussion.
 */
pgprot_t vm_page_prot;
-   unsigned long vm_flags; /* Flags, see mm.h. */
+
+   /*
+* Flags, see mm.h.
+* WARNING! Do not modify directly to keep correct VMA locking.
+* Use {init|reset|set|clear|mod}_vm_flags() functions instead.
+*/
+   unsigned long vm_flags;
 
 #ifdef CONFIG_PER_VMA_LOCK
int vm_lock_seq;
-- 
2.39.0



[PATCH 14/41] mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK

2023-01-09 Thread Suren Baghdasaryan
To simplify the usage of VM_LOCKED_CLEAR_MASK in clear_vm_flags(),
replace it with VM_LOCKED_MASK bitmask and convert all users.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm.h | 4 ++--
 kernel/fork.c  | 2 +-
 mm/hugetlb.c   | 4 ++--
 mm/mlock.c | 6 +++---
 mm/mmap.c  | 6 +++---
 mm/mremap.c| 2 +-
 6 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 35cf0a6cbcc2..2b16d45b75a6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -416,8 +416,8 @@ extern unsigned int kobjsize(const void *objp);
 /* This mask defines which mm->def_flags a process can inherit its parent */
 #define VM_INIT_DEF_MASK   VM_NOHUGEPAGE
 
-/* This mask is used to clear all the VMA flags used by mlock */
-#define VM_LOCKED_CLEAR_MASK   (~(VM_LOCKED | VM_LOCKONFAULT))
+/* This mask represents all the VMA flag bits used by mlock */
+#define VM_LOCKED_MASK (VM_LOCKED | VM_LOCKONFAULT)
 
 /* Arch-specific flags to clear when updating VM flags on protection change */
 #ifndef VM_ARCH_CLEAR
diff --git a/kernel/fork.c b/kernel/fork.c
index c026d75108b3..1591dd8a0745 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -674,7 +674,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
tmp->anon_vma = NULL;
} else if (anon_vma_fork(tmp, mpnt))
goto fail_nomem_anon_vma_fork;
-   tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
+   clear_vm_flags(tmp, VM_LOCKED_MASK);
file = tmp->vm_file;
if (file) {
struct address_space *mapping = file->f_mapping;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index db895230ee7e..24861cbfa2b1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6950,8 +6950,8 @@ static unsigned long page_table_shareable(struct 
vm_area_struct *svma,
unsigned long s_end = sbase + PUD_SIZE;
 
/* Allow segments to share if only one is marked locked */
-   unsigned long vm_flags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
-   unsigned long svm_flags = svma->vm_flags & VM_LOCKED_CLEAR_MASK;
+   unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED_MASK;
+   unsigned long svm_flags = svma->vm_flags & ~VM_LOCKED_MASK;
 
/*
 * match the virtual addresses, permission and the alignment of the
diff --git a/mm/mlock.c b/mm/mlock.c
index 7032f6dd0ce1..06aa9e204fac 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -490,7 +490,7 @@ static int apply_vma_lock_flags(unsigned long start, size_t 
len,
prev = mas_prev(&mas, 0);
 
for (nstart = start ; ; ) {
-   vm_flags_t newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
+   vm_flags_t newflags = vma->vm_flags & ~VM_LOCKED_MASK;
 
newflags |= flags;
 
@@ -662,7 +662,7 @@ static int apply_mlockall_flags(int flags)
struct vm_area_struct *vma, *prev = NULL;
vm_flags_t to_add = 0;
 
-   current->mm->def_flags &= VM_LOCKED_CLEAR_MASK;
+   current->mm->def_flags &= ~VM_LOCKED_MASK;
if (flags & MCL_FUTURE) {
current->mm->def_flags |= VM_LOCKED;
 
@@ -682,7 +682,7 @@ static int apply_mlockall_flags(int flags)
mas_for_each(&mas, vma, ULONG_MAX) {
vm_flags_t newflags;
 
-   newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
+   newflags = vma->vm_flags & ~VM_LOCKED_MASK;
newflags |= to_add;
 
/* Ignore errors */
diff --git a/mm/mmap.c b/mm/mmap.c
index 9db37adfc00a..5c4b608edde9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2721,7 +2721,7 @@ unsigned long mmap_region(struct file *file, unsigned 
long addr,
if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
is_vm_hugetlb_page(vma) ||
vma == get_gate_vma(current->mm))
-   vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+   clear_vm_flags(vma, VM_LOCKED_MASK);
else
mm->locked_vm += (len >> PAGE_SHIFT);
}
@@ -3392,8 +3392,8 @@ static struct vm_area_struct *__install_special_mapping(
vma->vm_start = addr;
vma->vm_end = addr + len;
 
-   vma->vm_flags = vm_flags | mm->def_flags | VM_DONTEXPAND | VM_SOFTDIRTY;
-   vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+   init_vm_flags(vma, (vm_flags | mm->def_flags |
+ VM_DONTEXPAND | VM_SOFTDIRTY) & ~VM_LOCKED_MASK);
vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
 
vma->vm_ops = ops;
diff --git a/mm/mremap.c b/mm/mremap.c
index fe587c5d6591..5f6f9931bff1 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -686,7 +686,7 @@ static unsigned long move_vma(struct vm_area_str

[PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page

2023-01-09 Thread Suren Baghdasaryan
Protect VMA from concurrent page fault handler while collapsing a huge
page. Page fault handler needs a stable PMD to use PTL and relies on
per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
not be detected by a page fault handler without proper locking.

Signed-off-by: Suren Baghdasaryan 
---
 mm/khugepaged.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5376246a3052..d8d0647f0c2c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1032,6 +1032,7 @@ static int collapse_huge_page(struct mm_struct *mm, 
unsigned long address,
if (result != SCAN_SUCCEED)
goto out_up_write;
 
+   vma_write_lock(vma);
anon_vma_lock_write(vma->anon_vma);
 
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm,
@@ -1503,6 +1504,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, 
unsigned long addr,
goto drop_hpage;
}
 
+   /* Lock the vma before taking i_mmap and page table locks */
+   vma_write_lock(vma);
+
/*
 * We need to lock the mapping so that from here on, only GUP-fast and
 * hardware page walks can access the parts of the page tables that
@@ -1690,6 +1694,7 @@ static int retract_page_tables(struct address_space 
*mapping, pgoff_t pgoff,
result = SCAN_PTE_UFFD_WP;
goto unlock_next;
}
+   vma_write_lock(vma);
collapse_and_free_pmd(mm, vma, addr, pmd);
if (!cc->is_khugepaged && is_target)
result = set_huge_pmd(vma, addr, pmd, hpage);
-- 
2.39.0



[PATCH 00/41] Per-VMA locks

2023-01-09 Thread Suren Baghdasaryan
A process with N threads performing zerocopy receive and 1 thread performing
madvise() with the write lock taken (e.g. needs to change vm_flags) will
result in all N-1 receive threads blocking until the madvise is done.
Conversely, on a busy process receiving a lot of data, an madvise operation
that does need to take the mmap lock in write mode will need to wait for all
of the receives to be done - a lose:lose proposition. Per-VMA locking
_removes_ this source of contention entirely, by definition.
There are other benefits for receive as well, chiefly a reduction in
cacheline bouncing across receiving threads for locking/unlocking the
single mmap lock. On an RPC style synthetic workload with 4KB RPCs:
1a) The find+lock+unlock VMA path in the base case, without the per-vma
lock patchset, is about 0.7% of cycles as measured by perf.
1b) mmap_read_lock + mmap_read_unlock in the base case is about 0.5%
cycles overall - most of this is within the TCP read hotpath (a small
fraction is 'other' usage in the system).
2a) The find+lock+unlock VMA path, with the per-vma patchset and a trivial
patch written to take advantage of it in TCP, is about 0.4% of cycles
(down from 0.7% above)
2b) mmap_read_lock + mmap_read_unlock in the per-vma patchset is < 0.1%
cycles and is out of the TCP read hotpath entirely (down from 0.5% before,
the remaining usage is the 'other' usage in the system).
So, in addition to entirely removing an onerous source of contention, it
also reduces the CPU cycles of TCP receive zerocopy by about 0.5%+
(compared to overall cycles in perf) for the 'small' RPC scenario.

The patchset structure is:
0001-0007: Enable maple-tree RCU mode
0008-0038: Main per-vma locks patchset
0039-0040: Performance optimizations
0041: Memory overhead optimization

Branch for testing is posted at:
https://github.com/surenbaghdasaryan/linux/tree/per_vma_lock

The patchset applies cleanly over Linus' tree at:
commit b7bfaa761d76 ("Linux 6.2-rc3")

[1] https://lore.kernel.org/all/20220901173516.702122-1-sur...@google.com/
[2] https://lwn.net/Articles/906852/
[3] https://lore.kernel.org/all/20220128131006.67712-1-mic...@lespinasse.org/
[4] https://lwn.net/Articles/893906/

Laurent Dufour (1):
  powerc/mm: try VMA lock-based page fault handling first

Liam Howlett (4):
  maple_tree: Be more cautious about dead nodes
  maple_tree: Detect dead nodes in mas_start()
  maple_tree: Fix freeing of nodes in rcu mode
  maple_tree: remove extra smp_wmb() from mas_dead_leaves()

Liam R. Howlett (3):
  maple_tree: Fix write memory barrier of nodes once dead for RCU mode
  maple_tree: Add smp_rmb() to dead node detection
  mm: Enable maple tree RCU mode by default.

Michel Lespinasse (1):
  mm: rcu safe VMA freeing

Suren Baghdasaryan (32):
  mm: introduce CONFIG_PER_VMA_LOCK
  mm: move mmap_lock assert function definitions
  mm: export dump_mm()
  mm: add per-VMA lock and helper functions to control it
  mm: introduce vma->vm_flags modifier functions
  mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK
  mm: replace vma->vm_flags direct modifications with modifier calls
  mm: replace vma->vm_flags indirect modification in ksm_madvise
  mm/mmap: move VMA locking before anon_vma_lock_write call
  mm/khugepaged: write-lock VMA while collapsing a huge page
  mm/mmap: write-lock VMAs before merging, splitting or expanding them
  mm/mmap: write-lock VMAs in vma_adjust
  mm/mmap: write-lock VMAs affected by VMA expansion
  mm/mremap: write-lock VMA while remapping it to a new address range
  mm: write-lock VMAs before removing them from VMA tree
  mm: conditionally write-lock VMA in free_pgtables
  mm/mmap: write-lock adjacent VMAs if they can grow into unmapped area
  kernel/fork: assert no VMA readers during its destruction
  mm/mmap: prevent pagefault handler from racing with mmu_notifier
registration
  mm: introduce lock_vma_under_rcu to be used from arch-specific code
  mm: fall back to mmap_lock if vma->anon_vma is not yet set
  mm: add FAULT_FLAG_VMA_LOCK flag
  mm: prevent do_swap_page from handling page faults under VMA lock
  mm: prevent userfaults to be handled under per-vma lock
  mm: introduce per-VMA lock statistics
  x86/mm: try VMA lock-based page fault handling first
  arm64/mm: try VMA lock-based page fault handling first
  mm: introduce mod_vm_flags_nolock
  mm: avoid assertion in untrack_pfn
  kernel/fork: throttle call_rcu() calls in vm_area_free
  mm: separate vma->lock from vm_area_struct
  mm: replace rw_semaphore with atomic_t in vma_lock

 arch/arm/kernel/process.c |   2 +-
 arch/arm64/Kconfig|   1 +
 arch/arm64/mm/fault.c |  36 
 arch/ia64/mm/init.c   |   8 +-
 arch/loongarch/include/asm/tlb.h  |   2 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c|   5 +-
 arch/powerpc/kvm/book3s_xive_native.c |   2 +-
 arch/powerpc/mm/book3s64/subpage_prot.c   |   2 +-
 arch/powerpc/mm/fault.c   

[PATCH 01/41] maple_tree: Be more cautious about dead nodes

2023-01-09 Thread Suren Baghdasaryan
From: Liam Howlett 

ma_pivots() and ma_data_end() may be called with a dead node.  Ensure to
that the node isn't dead before using the returned values.

This is necessary for RCU mode of the maple tree.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett 
Signed-off-by: Suren Baghdasaryan 
---
 lib/maple_tree.c | 53 +++-
 1 file changed, 43 insertions(+), 10 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 26e2045d3cda..ff9f04e0150d 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -540,6 +540,7 @@ static inline bool ma_dead_node(const struct maple_node 
*node)
 
return (parent == node);
 }
+
 /*
  * mte_dead_node() - check if the @enode is dead.
  * @enode: The encoded maple node
@@ -621,6 +622,8 @@ static inline unsigned int mas_alloc_req(const struct 
ma_state *mas)
  * @node - the maple node
  * @type - the node type
  *
+ * In the event of a dead node, this array may be %NULL
+ *
  * Return: A pointer to the maple node pivots
  */
 static inline unsigned long *ma_pivots(struct maple_node *node,
@@ -1091,8 +1094,11 @@ static int mas_ascend(struct ma_state *mas)
a_type = mas_parent_enum(mas, p_enode);
a_node = mte_parent(p_enode);
a_slot = mte_parent_slot(p_enode);
-   pivots = ma_pivots(a_node, a_type);
a_enode = mt_mk_node(a_node, a_type);
+   pivots = ma_pivots(a_node, a_type);
+
+   if (unlikely(ma_dead_node(a_node)))
+   return 1;
 
if (!set_min && a_slot) {
set_min = true;
@@ -1398,6 +1404,9 @@ static inline unsigned char ma_data_end(struct maple_node 
*node,
 {
unsigned char offset;
 
+   if (!pivots)
+   return 0;
+
if (type == maple_arange_64)
return ma_meta_end(node, type);
 
@@ -1433,6 +1442,9 @@ static inline unsigned char mas_data_end(struct ma_state 
*mas)
return ma_meta_end(node, type);
 
pivots = ma_pivots(node, type);
+   if (unlikely(ma_dead_node(node)))
+   return 0;
+
offset = mt_pivots[type] - 1;
if (likely(!pivots[offset]))
return ma_meta_end(node, type);
@@ -4504,6 +4516,9 @@ static inline int mas_prev_node(struct ma_state *mas, 
unsigned long min)
node = mas_mn(mas);
slots = ma_slots(node, mt);
pivots = ma_pivots(node, mt);
+   if (unlikely(ma_dead_node(node)))
+   return 1;
+
mas->max = pivots[offset];
if (offset)
mas->min = pivots[offset - 1] + 1;
@@ -4525,6 +4540,9 @@ static inline int mas_prev_node(struct ma_state *mas, 
unsigned long min)
slots = ma_slots(node, mt);
pivots = ma_pivots(node, mt);
offset = ma_data_end(node, mt, pivots, mas->max);
+   if (unlikely(ma_dead_node(node)))
+   return 1;
+
if (offset)
mas->min = pivots[offset - 1] + 1;
 
@@ -4573,6 +4591,7 @@ static inline int mas_next_node(struct ma_state *mas, 
struct maple_node *node,
struct maple_enode *enode;
int level = 0;
unsigned char offset;
+   unsigned char node_end;
enum maple_type mt;
void __rcu **slots;
 
@@ -4596,7 +4615,11 @@ static inline int mas_next_node(struct ma_state *mas, 
struct maple_node *node,
node = mas_mn(mas);
mt = mte_node_type(mas->node);
pivots = ma_pivots(node, mt);
-   } while (unlikely(offset == ma_data_end(node, mt, pivots, mas->max)));
+   node_end = ma_data_end(node, mt, pivots, mas->max);
+   if (unlikely(ma_dead_node(node)))
+   return 1;
+
+   } while (unlikely(offset == node_end));
 
slots = ma_slots(node, mt);
pivot = mas_safe_pivot(mas, pivots, ++offset, mt);
@@ -4612,6 +4635,9 @@ static inline int mas_next_node(struct ma_state *mas, 
struct maple_node *node,
mt = mte_node_type(mas->node);
slots = ma_slots(node, mt);
pivots = ma_pivots(node, mt);
+   if (unlikely(ma_dead_node(node)))
+   return 1;
+
offset = 0;
pivot = pivots[0];
}
@@ -4658,16 +4684,18 @@ static inline void *mas_next_nentry(struct ma_state 
*mas,
return NULL;
}
 
-   pivots = ma_pivots(node, type);
slots = ma_slots(node, type);
-   mas->index = mas_safe_min(mas, pivots, mas->offset);
-   if (ma_dead_node(node))
+   pivots = ma_pivots(node, type);
+   count = ma_data_end(node, type, pivots, mas->max);
+   if (unlikely(ma_dead_node(node)))
return NULL;
 
+   mas->index = mas_safe_min(mas, pivots, mas->offset);
+   if (

[PATCH 02/41] maple_tree: Detect dead nodes in mas_start()

2023-01-09 Thread Suren Baghdasaryan
From: Liam Howlett 

When initially starting a search, the root node may already be in the
process of being replaced in RCU mode.  Detect and restart the walk if
this is the case.  This is necessary for RCU mode of the maple tree.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett 
Signed-off-by: Suren Baghdasaryan 
---
 lib/maple_tree.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index ff9f04e0150d..a748938ad2e9 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -1359,11 +1359,15 @@ static inline struct maple_enode *mas_start(struct 
ma_state *mas)
mas->depth = 0;
mas->offset = 0;
 
+retry:
root = mas_root(mas);
/* Tree with nodes */
if (likely(xa_is_node(root))) {
mas->depth = 1;
mas->node = mte_safe_root(root);
+   if (mte_dead_node(mas->node))
+   goto retry;
+
return NULL;
}
 
-- 
2.39.0



[PATCH 05/41] maple_tree: Fix write memory barrier of nodes once dead for RCU mode

2023-01-09 Thread Suren Baghdasaryan
From: "Liam R. Howlett" 

During the development of the maple tree, the strategy of freeing
multiple nodes changed and, in the process, the pivots were reused to
store pointers to dead nodes.  To ensure the readers see accurate
pivots, the writers need to mark the nodes as dead and call smp_wmb() to
ensure any readers can identify the node as dead before using the pivot
values.

There were two places where the old method of marking the node as dead
without smp_wmb() were being used, which resulted in RCU readers seeing
the wrong pivot value before seeing the node was dead.  Fix this race
condition by using mte_set_node_dead() which has the smp_wmb() call to
ensure the race is closed.

Add a WARN_ON() to the ma_free_rcu() call to ensure all nodes being
freed are marked as dead to ensure there are no other call paths besides
the two updated paths.

This is necessary for the RCU mode of the maple tree.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett 
Signed-off-by: Suren Baghdasaryan 
---
 lib/maple_tree.c |  7 +--
 tools/testing/radix-tree/maple.c | 16 
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index d85291b19f86..8066fb1e8ec9 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -179,7 +179,7 @@ static void mt_free_rcu(struct rcu_head *head)
  */
 static void ma_free_rcu(struct maple_node *node)
 {
-   node->parent = ma_parent_ptr(node);
+   WARN_ON(node->parent != ma_parent_ptr(node));
call_rcu(&node->rcu, mt_free_rcu);
 }
 
@@ -1775,8 +1775,10 @@ static inline void mas_replace(struct ma_state *mas, 
bool advanced)
rcu_assign_pointer(slots[offset], mas->node);
}
 
-   if (!advanced)
+   if (!advanced) {
+   mte_set_node_dead(old_enode);
mas_free(mas, old_enode);
+   }
 }
 
 /*
@@ -4217,6 +4219,7 @@ static inline bool mas_wr_node_store(struct ma_wr_state 
*wr_mas)
 done:
mas_leaf_set_meta(mas, newnode, dst_pivots, maple_leaf_64, new_end);
if (in_rcu) {
+   mte_set_node_dead(mas->node);
mas->node = mt_mk_node(newnode, wr_mas->type);
mas_replace(mas, false);
} else {
diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index 81fa7ec2e66a..2539ad6c4777 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -108,6 +108,7 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, mn->slot[1] != NULL);
MT_BUG_ON(mt, mas_allocated() != 0);
 
+   mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
mas.node = MAS_START;
mas_nomem(, GFP_KERNEL);
@@ -160,6 +161,7 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, mas_allocated() != i);
MT_BUG_ON(mt, !mn);
MT_BUG_ON(mt, not_empty(mn));
+   mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
 
@@ -192,6 +194,7 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, not_empty(mn));
MT_BUG_ON(mt, mas_allocated() != i - 1);
MT_BUG_ON(mt, !mn);
+   mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
 
@@ -210,6 +213,7 @@ static noinline void check_new_node(struct maple_tree *mt)
mn = mas_pop_node();
MT_BUG_ON(mt, not_empty(mn));
MT_BUG_ON(mt, mas_allocated() != j - 1);
+   mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
MT_BUG_ON(mt, mas_allocated() != 0);
@@ -233,6 +237,7 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, mas_allocated() != i - j);
mn = mas_pop_node();
MT_BUG_ON(mt, not_empty(mn));
+   mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
MT_BUG_ON(mt, mas_allocated() != i - j - 1);
}
@@ -269,6 +274,7 @@ static noinline void check_new_node(struct maple_tree *mt)
mn = mas_pop_node(); /* get the next node. */
MT_BUG_ON(mt, mn == NULL);
MT_BUG_ON(mt, not_empty(mn));
+   mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
MT_BUG_ON(mt, mas_allocated() != 0);
@@ -294,6 +300,7 @@ static noinline void check_new_node(struct maple_tree *mt)
mn = mas_pop_node(); /* get the next node. */
MT_BUG_ON(mt, mn == NULL);
MT_BUG_ON(mt, not_emp

[PATCH 04/41] maple_tree: remove extra smp_wmb() from mas_dead_leaves()

2023-01-09 Thread Suren Baghdasaryan
From: Liam Howlett 

The call to mte_set_dead_node() before the smp_wmb() already calls
smp_wmb() so this is not needed.  This is an optimization for the RCU
mode of the maple tree.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett 
Signed-off-by: Suren Baghdasaryan 
---
 lib/maple_tree.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index a11eea943f8d..d85291b19f86 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5510,7 +5510,6 @@ unsigned char mas_dead_leaves(struct ma_state *mas, void 
__rcu **slots,
break;
 
mte_set_node_dead(entry);
-   smp_wmb(); /* Needed for RCU */
node->type = type;
rcu_assign_pointer(slots[offset], node);
}
-- 
2.39.0



[PATCH 07/41] mm: Enable maple tree RCU mode by default.

2023-01-09 Thread Suren Baghdasaryan
From: "Liam R. Howlett" 

Use the maple tree in RCU mode for VMA tracking.  This is necessary for
the use of per-VMA locking.  RCU mode is enabled by default but disabled
when exiting an mm and for the new tree during a fork.

Also enable RCU for the tree used in munmap operations to ensure the
nodes remain valid for readers.

Signed-off-by: Liam R. Howlett 
Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm_types.h | 3 ++-
 kernel/fork.c| 3 +++
 mm/mmap.c| 4 +++-
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3b8475007734..4b6bce73fbb4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -810,7 +810,8 @@ struct mm_struct {
unsigned long cpu_bitmap[];
 };
 
-#define MM_MT_FLAGS(MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN)
+#define MM_MT_FLAGS(MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN | \
+MT_FLAGS_USE_RCU)
 extern struct mm_struct init_mm;
 
 /* Pointer magic because the dynamic array size confuses some compilers. */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9f7fe3541897..58aab6c889a4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -617,6 +617,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
if (retval)
goto out;
 
+   mt_clear_in_rcu(mas.tree);
mas_for_each(&old_mas, mpnt, ULONG_MAX) {
struct file *file;
 
@@ -703,6 +704,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
retval = arch_dup_mmap(oldmm, mm);
 loop_out:
mas_destroy(&mas);
+   if (!retval)
+   mt_set_in_rcu(mas.tree);
 out:
mmap_write_unlock(mm);
flush_tlb_mm(oldmm);
diff --git a/mm/mmap.c b/mm/mmap.c
index 87d929316d57..9db37adfc00a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2304,7 +2304,8 @@ do_mas_align_munmap(struct ma_state *mas, struct 
vm_area_struct *vma,
int count = 0;
int error = -ENOMEM;
MA_STATE(mas_detach, &mt_detach, 0, 0);
-   mt_init_flags(&mt_detach, MT_FLAGS_LOCK_EXTERN);
+   mt_init_flags(&mt_detach, mas->tree->ma_flags &
+		  (MT_FLAGS_LOCK_MASK | MT_FLAGS_USE_RCU));
mt_set_external_lock(&mt_detach, &mm->mmap_lock);
 
if (mas_preallocate(mas, vma, GFP_KERNEL))
@@ -3091,6 +3092,7 @@ void exit_mmap(struct mm_struct *mm)
 */
set_bit(MMF_OOM_SKIP, &mm->flags);
mmap_write_lock(mm);
+   mt_clear_in_rcu(&mm->mm_mt);
free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
  USER_PGTABLES_CEILING);
tlb_finish_mmu(&tlb);
-- 
2.39.0



[PATCH 11/41] mm: export dump_mm()

2023-01-09 Thread Suren Baghdasaryan
mmap_assert_write_locked() will be used in the next patch to ensure
vma write lock is taken only under mmap_lock exclusive lock. Because
mmap_assert_write_locked() uses dump_mm() and there are cases when
vma write lock is taken from inside a module, it's necessary to export
dump_mm() function.

Signed-off-by: Suren Baghdasaryan 
---
 mm/debug.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/debug.c b/mm/debug.c
index 7f8e5f744e42..b6e9e53469d1 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -215,6 +215,7 @@ void dump_mm(const struct mm_struct *mm)
mm->def_flags, &mm->def_flags
);
 }
+EXPORT_SYMBOL(dump_mm);
 
 static bool page_init_poisoning __read_mostly = true;
 
-- 
2.39.0



[PATCH 10/41] mm: move mmap_lock assert function definitions

2023-01-09 Thread Suren Baghdasaryan
Move mmap_lock assert function definitions up so that they can be used
by other mmap_lock routines.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mmap_lock.h | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 96e113e23d04..e49ba91bb1f0 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -60,6 +60,18 @@ static inline void __mmap_lock_trace_released(struct 
mm_struct *mm, bool write)
 
 #endif /* CONFIG_TRACING */
 
+static inline void mmap_assert_locked(struct mm_struct *mm)
+{
+   lockdep_assert_held(&mm->mmap_lock);
+   VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
+}
+
+static inline void mmap_assert_write_locked(struct mm_struct *mm)
+{
+   lockdep_assert_held_write(&mm->mmap_lock);
+   VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
+}
+
 static inline void mmap_init_lock(struct mm_struct *mm)
 {
init_rwsem(&mm->mmap_lock);
@@ -150,18 +162,6 @@ static inline void mmap_read_unlock_non_owner(struct 
mm_struct *mm)
up_read_non_owner(&mm->mmap_lock);
 }
 
-static inline void mmap_assert_locked(struct mm_struct *mm)
-{
-   lockdep_assert_held(&mm->mmap_lock);
-   VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
-}
-
-static inline void mmap_assert_write_locked(struct mm_struct *mm)
-{
-   lockdep_assert_held_write(&mm->mmap_lock);
-   VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
-}
-
 static inline int mmap_lock_is_contended(struct mm_struct *mm)
 {
return rwsem_is_contended(&mm->mmap_lock);
-- 
2.39.0



[PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call

2023-01-09 Thread Suren Baghdasaryan
Move VMA flag modification (which now implies VMA locking) before
anon_vma_lock_write to match the locking order of page fault handler.

Signed-off-by: Suren Baghdasaryan 
---
 mm/mmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index fa994ae903d9..53d885e70a54 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2953,13 +2953,13 @@ static int do_brk_flags(struct ma_state *mas, struct 
vm_area_struct *vma,
if (mas_preallocate(mas, vma, GFP_KERNEL))
goto unacct_fail;
 
+   set_vm_flags(vma, VM_SOFTDIRTY);
vma_adjust_trans_huge(vma, vma->vm_start, addr + len, 0);
if (vma->anon_vma) {
anon_vma_lock_write(vma->anon_vma);
anon_vma_interval_tree_pre_update_vma(vma);
}
vma->vm_end = addr + len;
-   set_vm_flags(vma, VM_SOFTDIRTY);
mas_store_prealloc(mas, vma);
 
if (vma->anon_vma) {
-- 
2.39.0



[PATCH 19/41] mm/mmap: write-lock VMAs before merging, splitting or expanding them

2023-01-09 Thread Suren Baghdasaryan
Decisions about whether VMAs can be merged, split or expanded must be
made while VMAs are protected from the changes which can affect that
decision. For example, vma_merge uses vma->anon_vma in its decision
whether the VMA can be merged. Meanwhile, page fault handler changes
vma->anon_vma during COW operation.
Write-lock all VMAs which might be affected by a merge or split operation
before making decision how such operations should be performed.

Not sure if expansion really needs this, just being paranoid. Otherwise
mmap_region and vm_brk_flags might end up not locking them.

Signed-off-by: Suren Baghdasaryan 
---
 mm/mmap.c | 23 ---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 53d885e70a54..f6ca4a87f9e2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -254,8 +254,11 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 */
mas_set(&mas, oldbrk);
next = mas_find(&mas, newbrk - 1 + PAGE_SIZE + stack_guard_gap);
-   if (next && newbrk + PAGE_SIZE > vm_start_gap(next))
-   goto out;
+   if (next) {
+   vma_write_lock(next);
+   if (newbrk + PAGE_SIZE > vm_start_gap(next))
+   goto out;
+   }
 
brkvma = mas_prev(&mas, mm->start_brk);
/* Ok, looks good - let it rip. */
@@ -1017,10 +1020,17 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
if (vm_flags & VM_SPECIAL)
return NULL;
 
+   if (prev)
+   vma_write_lock(prev);
next = find_vma(mm, prev ? prev->vm_end : 0);
mid = next;
-   if (next && next->vm_end == end)/* cases 6, 7, 8 */
+   if (next)
+   vma_write_lock(next);
+   if (next && next->vm_end == end) {  /* cases 6, 7, 8 */
next = find_vma(mm, next->vm_end);
+   if (next)
+   vma_write_lock(next);
+   }
 
/* verify some invariant that must be enforced by the caller */
VM_WARN_ON(prev && addr <= prev->vm_start);
@@ -2198,6 +2208,7 @@ int __split_vma(struct mm_struct *mm, struct 
vm_area_struct *vma,
int err;
validate_mm_mt(mm);
 
+   vma_write_lock(vma);
if (vma->vm_ops && vma->vm_ops->may_split) {
err = vma->vm_ops->may_split(vma, addr);
if (err)
@@ -2564,6 +2575,8 @@ unsigned long mmap_region(struct file *file, unsigned 
long addr,
 
/* Attempt to expand an old mapping */
/* Check next */
+   if (next)
+   vma_write_lock(next);
if (next && next->vm_start == end && !vma_policy(next) &&
can_vma_merge_before(next, vm_flags, NULL, file, pgoff+pglen,
 NULL_VM_UFFD_CTX, NULL)) {
@@ -2573,6 +2586,8 @@ unsigned long mmap_region(struct file *file, unsigned 
long addr,
}
 
/* Check prev */
+   if (prev)
+   vma_write_lock(prev);
if (prev && prev->vm_end == addr && !vma_policy(prev) &&
(vma ? can_vma_merge_after(prev, vm_flags, vma->anon_vma, file,
   pgoff, vma->vm_userfaultfd_ctx, NULL) :
@@ -2942,6 +2957,8 @@ static int do_brk_flags(struct ma_state *mas, struct 
vm_area_struct *vma,
if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
return -ENOMEM;
 
+   if (vma)
+   vma_write_lock(vma);
/*
 * Expand the existing vma if possible; Note that singular lists do not
 * occur after forking, so the expand will only happen on new VMAs.
-- 
2.39.0



[PATCH 20/41] mm/mmap: write-lock VMAs in vma_adjust

2023-01-09 Thread Suren Baghdasaryan
vma_adjust modifies a VMA and possibly its neighbors. Write-lock them
before making the modifications.

Signed-off-by: Suren Baghdasaryan 
---
 mm/mmap.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index f6ca4a87f9e2..1e2154137631 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -614,6 +614,12 @@ inline int vma_expand(struct ma_state *mas, struct 
vm_area_struct *vma,
  * The following helper function should be used when such adjustments
  * are necessary.  The "insert" vma (if any) is to be inserted
  * before we drop the necessary locks.
+ * 'expand' vma is always locked before it's passed to __vma_adjust()
+ * from vma_merge() because vma should not change from the moment
+ * can_vma_merge_{before|after} decision is made.
+ * 'insert' vma is used only by __split_vma() and it's always a brand
+ * new vma which is not yet added into mm's vma tree, therefore no need
+ * to lock it.
  */
 int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
@@ -633,6 +639,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long 
start,
MA_STATE(mas, &mm->mm_mt, 0, 0);
struct vm_area_struct *exporter = NULL, *importer = NULL;
 
+   vma_write_lock(vma);
+   if (next)
+   vma_write_lock(next);
+
if (next && !insert) {
if (end >= next->vm_end) {
/*
@@ -676,8 +686,11 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long 
start,
 * If next doesn't have anon_vma, import from vma after
 * next, if the vma overlaps with it.
 */
-   if (remove_next == 2 && !next->anon_vma)
+   if (remove_next == 2 && !next->anon_vma) {
exporter = next_next;
+   if (exporter)
+   vma_write_lock(exporter);
+   }
 
} else if (end > next->vm_start) {
/*
@@ -850,6 +863,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long 
start,
if (remove_next == 2) {
remove_next = 1;
next = next_next;
+   if (next)
+   vma_write_lock(next);
goto again;
}
}
-- 
2.39.0



[PATCH 24/41] mm: conditionally write-lock VMA in free_pgtables

2023-01-09 Thread Suren Baghdasaryan
Normally free_pgtables needs to lock affected VMAs except for the case
when VMAs were isolated under VMA write-lock. munmap() does just that,
isolating while holding appropriate locks and then downgrading mmap_lock
and dropping per-VMA locks before freeing page tables.
Add a parameter to free_pgtables and unmap_region for such scenario.

Signed-off-by: Suren Baghdasaryan 
---
 mm/internal.h |  2 +-
 mm/memory.c   |  6 +-
 mm/mmap.c | 18 --
 3 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index bcf75a8b032d..5ea4ff1a70e7 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -87,7 +87,7 @@ void folio_activate(struct folio *folio);
 
 void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
   struct vm_area_struct *start_vma, unsigned long floor,
-  unsigned long ceiling);
+  unsigned long ceiling, bool lock_vma);
 void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
 
 struct zap_details;
diff --git a/mm/memory.c b/mm/memory.c
index 2fabf89b2be9..9ece18548db1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -348,7 +348,7 @@ void free_pgd_range(struct mmu_gather *tlb,
 
 void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
   struct vm_area_struct *vma, unsigned long floor,
-  unsigned long ceiling)
+  unsigned long ceiling, bool lock_vma)
 {
MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
 
@@ -366,6 +366,8 @@ void free_pgtables(struct mmu_gather *tlb, struct 
maple_tree *mt,
 * Hide vma from rmap and truncate_pagecache before freeing
 * pgtables
 */
+   if (lock_vma)
+   vma_write_lock(vma);
unlink_anon_vmas(vma);
unlink_file_vma(vma);
 
@@ -380,6 +382,8 @@ void free_pgtables(struct mmu_gather *tlb, struct 
maple_tree *mt,
   && !is_vm_hugetlb_page(next)) {
vma = next;
next = mas_find(&mas, ceiling - 1);
+   if (lock_vma)
+   vma_write_lock(vma);
unlink_anon_vmas(vma);
unlink_file_vma(vma);
}
diff --git a/mm/mmap.c b/mm/mmap.c
index be289e0b693b..0d767ce043af 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -78,7 +78,7 @@ core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 
0644);
 static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
struct vm_area_struct *vma, struct vm_area_struct *prev,
struct vm_area_struct *next, unsigned long start,
-   unsigned long end);
+   unsigned long end, bool lock_vma);
 
 static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags)
 {
@@ -2202,7 +2202,7 @@ static inline void remove_mt(struct mm_struct *mm, struct 
ma_state *mas)
 static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
struct vm_area_struct *vma, struct vm_area_struct *prev,
struct vm_area_struct *next,
-   unsigned long start, unsigned long end)
+   unsigned long start, unsigned long end, bool lock_vma)
 {
struct mmu_gather tlb;
 
@@ -2211,7 +2211,8 @@ static void unmap_region(struct mm_struct *mm, struct 
maple_tree *mt,
update_hiwater_rss(mm);
unmap_vmas(&tlb, mt, vma, start, end);
free_pgtables(&tlb, mt, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
-next ? next->vm_start : USER_PGTABLES_CEILING);
+next ? next->vm_start : USER_PGTABLES_CEILING,
+lock_vma);
tlb_finish_mmu(&tlb);
 }
 
@@ -2468,7 +2469,11 @@ do_mas_align_munmap(struct ma_state *mas, struct 
vm_area_struct *vma,
mmap_write_downgrade(mm);
}
 
-   unmap_region(mm, &mt_detach, vma, prev, next, start, end);
+   /*
+* We can free page tables without locking the vmas because they were
+* isolated before we downgraded mmap_lock and dropped per-vma locks.
+*/
+   unmap_region(mm, &mt_detach, vma, prev, next, start, end, !downgrade);
/* Statistics and freeing VMAs */
mas_set(&mas_detach, start);
remove_mt(mm, &mas_detach);
@@ -2785,7 +2790,8 @@ unsigned long mmap_region(struct file *file, unsigned 
long addr,
vma->vm_file = NULL;
 
/* Undo any partial mapping done by a device driver. */
-   unmap_region(mm, mas.tree, vma, prev, next, vma->vm_start, vma->vm_end);
+   unmap_region(mm, mas.tree, vma, prev, next, vma->vm_start, vma->vm_end,
+true);
if (file && (vm_flags & VM_SHARED))
mapping_unmap_writable(file->f_mapping);
 free_vma:
@@ -

[PATCH 25/41] mm/mmap: write-lock adjacent VMAs if they can grow into unmapped area

2023-01-09 Thread Suren Baghdasaryan
While unmapping VMAs, adjacent VMAs might be able to grow into the area
being unmapped. In such cases write-lock adjacent VMAs to prevent this
growth.

Signed-off-by: Suren Baghdasaryan 
---
 mm/mmap.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 0d767ce043af..30c7d1c5206e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2461,11 +2461,13 @@ do_mas_align_munmap(struct ma_state *mas, struct 
vm_area_struct *vma,
 * down_read(mmap_lock) and collide with the VMA we are about to unmap.
 */
if (downgrade) {
-   if (next && (next->vm_flags & VM_GROWSDOWN))
+   if (next && (next->vm_flags & VM_GROWSDOWN)) {
+   vma_write_lock(next);
downgrade = false;
-   else if (prev && (prev->vm_flags & VM_GROWSUP))
+   } else if (prev && (prev->vm_flags & VM_GROWSUP)) {
+   vma_write_lock(prev);
downgrade = false;
-   else
+   } else
mmap_write_downgrade(mm);
}
 
-- 
2.39.0



[PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK

2023-01-09 Thread Suren Baghdasaryan
This configuration variable will be used to build the support for VMA
locking during page fault handling.

This is enabled by default on supported architectures with SMP and MMU
set.

The architecture support is needed since the page fault handler is called
from the architecture's page faulting code which needs modifications to
handle faults under VMA lock.

Signed-off-by: Suren Baghdasaryan 
---
 mm/Kconfig | 13 +
 1 file changed, 13 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index ff7b209dec05..0aeca3794972 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1183,6 +1183,19 @@ config LRU_GEN_STATS
  This option has a per-memcg and per-node memory overhead.
 # }
 
+config ARCH_SUPPORTS_PER_VMA_LOCK
+   def_bool n
+
+config PER_VMA_LOCK
+   bool "Per-vma locking support"
+   default y
+   depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
+   help
+ Allow per-vma locking during page fault handling.
+
+ This feature allows locking each virtual memory area separately when
+ handling page faults instead of taking mmap_lock.
+
 source "mm/damon/Kconfig"
 
 endmenu
-- 
2.39.0



[PATCH 09/41] mm: rcu safe VMA freeing

2023-01-09 Thread Suren Baghdasaryan
From: Michel Lespinasse 

This prepares for page fault handling under VMA lock, looking up VMAs
under protection of an rcu read lock, instead of the usual mmap read lock.

Signed-off-by: Michel Lespinasse 
Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm_types.h | 13 ++---
 kernel/fork.c| 13 +
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4b6bce73fbb4..d5cdec1314fe 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -535,9 +535,16 @@ struct anon_vma_name {
 struct vm_area_struct {
/* The first cache line has the info for VMA tree walking. */
 
-   unsigned long vm_start; /* Our start address within vm_mm. */
-   unsigned long vm_end;   /* The first byte after our end address
-  within vm_mm. */
+   union {
+   struct {
+   /* VMA covers [vm_start; vm_end) addresses within mm */
+   unsigned long vm_start;
+   unsigned long vm_end;
+   };
+#ifdef CONFIG_PER_VMA_LOCK
+   struct rcu_head vm_rcu; /* Used for deferred freeing. */
+#endif
+   };
 
struct mm_struct *vm_mm;/* The address space we belong to. */
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 58aab6c889a4..5986817f393c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -479,10 +479,23 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct 
*orig)
return new;
 }
 
+#ifdef CONFIG_PER_VMA_LOCK
+static void __vm_area_free(struct rcu_head *head)
+{
+   struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
+ vm_rcu);
+   kmem_cache_free(vm_area_cachep, vma);
+}
+#endif
+
 void vm_area_free(struct vm_area_struct *vma)
 {
free_anon_vma_name(vma);
+#ifdef CONFIG_PER_VMA_LOCK
+   call_rcu(&vma->vm_rcu, __vm_area_free);
+#else
kmem_cache_free(vm_area_cachep, vma);
+#endif
 }
 
 static void account_kernel_stack(struct task_struct *tsk, int account)
-- 
2.39.0



[PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-09 Thread Suren Baghdasaryan
Introduce a per-VMA rw_semaphore to be used during page fault handling
instead of mmap_lock. Because there are cases when multiple VMAs need
to be exclusively locked during VMA tree modifications, instead of the
usual lock/unlock pattern we mark a VMA as locked by taking the per-VMA lock
exclusively and setting vma->lock_seq to the current mm->lock_seq. When
mmap_write_lock holder is done with all modifications and drops mmap_lock,
it will increment mm->lock_seq, effectively unlocking all VMAs marked as
locked.
VMA lock is placed on the cache line boundary so that its 'count' field
falls into the first cache line while the rest of the fields fall into
the second cache line. This lets the 'count' field be cached with
other frequently accessed fields and used quickly in the uncontended
case, while 'owner' and other fields used in the contended case will not
invalidate the first cache line while waiting on the lock.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm.h| 80 +++
 include/linux/mm_types.h  |  8 
 include/linux/mmap_lock.h | 13 +++
 kernel/fork.c |  4 ++
 mm/init-mm.c  |  3 ++
 5 files changed, 108 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3f196e4d66d..ec2c4c227d51 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -612,6 +612,85 @@ struct vm_operations_struct {
  unsigned long addr);
 };
 
+#ifdef CONFIG_PER_VMA_LOCK
+static inline void vma_init_lock(struct vm_area_struct *vma)
+{
+   init_rwsem(&vma->lock);
+   vma->vm_lock_seq = -1;
+}
+
+static inline void vma_write_lock(struct vm_area_struct *vma)
+{
+   int mm_lock_seq;
+
+   mmap_assert_write_locked(vma->vm_mm);
+
+   /*
+* current task is holding mmap_write_lock, both vma->vm_lock_seq and
+* mm->mm_lock_seq can't be concurrently modified.
+*/
+   mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
+   if (vma->vm_lock_seq == mm_lock_seq)
+   return;
+
+   down_write(&vma->lock);
+   vma->vm_lock_seq = mm_lock_seq;
+   up_write(&vma->lock);
+}
+
+/*
+ * Try to read-lock a vma. The function is allowed to occasionally yield false
+ * locked result to avoid performance overhead, in which case we fall back to
+ * using mmap_lock. The function should never yield false unlocked result.
+ */
+static inline bool vma_read_trylock(struct vm_area_struct *vma)
+{
+   /* Check before locking. A race might cause false locked result. */
+   if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
+   return false;
+
+   if (unlikely(down_read_trylock(&vma->lock) == 0))
+   return false;
+
+   /*
+* Overflow might produce false locked result.
+* False unlocked result is impossible because we modify and check
+* vma->vm_lock_seq under vma->lock protection and mm->mm_lock_seq
+* modification invalidates all existing locks.
+*/
+   if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
+   up_read(&vma->lock);
+   return false;
+   }
+   return true;
+}
+
+static inline void vma_read_unlock(struct vm_area_struct *vma)
+{
+   up_read(&vma->lock);
+}
+
+static inline void vma_assert_write_locked(struct vm_area_struct *vma)
+{
+   mmap_assert_write_locked(vma->vm_mm);
+   /*
+* current task is holding mmap_write_lock, both vma->vm_lock_seq and
+* mm->mm_lock_seq can't be concurrently modified.
+*/
+   VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), 
vma);
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+static inline void vma_init_lock(struct vm_area_struct *vma) {}
+static inline void vma_write_lock(struct vm_area_struct *vma) {}
+static inline bool vma_read_trylock(struct vm_area_struct *vma)
+   { return false; }
+static inline void vma_read_unlock(struct vm_area_struct *vma) {}
+static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
+
+#endif /* CONFIG_PER_VMA_LOCK */
+
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 {
static const struct vm_operations_struct dummy_vm_ops = {};
@@ -620,6 +699,7 @@ static inline void vma_init(struct vm_area_struct *vma, 
struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
+   vma_init_lock(vma);
 }
 
 static inline void vma_set_anonymous(struct vm_area_struct *vma)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d5cdec1314fe..5f7c5ca89931 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -555,6 +555,11 @@ struct vm_area_struct {
pgprot_t vm_page_prot;
unsigned long vm_flags; /* Flags, see mm.h. */
 
+#ifdef CONFIG_PER_VMA_
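
The diffstat above also touches include/linux/mmap_lock.h, which is where the
"unlock all marked VMAs" side described in the changelog lives. As a minimal
sketch only (the helper name vma_write_unlock_mm() is taken from its later use
in mm_drop_all_locks(); the exact body is an assumption), the unlock-all step
amounts to bumping the per-mm sequence number:

    static inline void vma_write_unlock_mm(struct mm_struct *mm)
    {
            mmap_assert_write_locked(mm);
            /*
             * Incrementing mm_lock_seq invalidates the vm_lock_seq value
             * recorded by every vma_write_lock() call made while mmap_lock
             * was held for writing, unlocking all those VMAs at once.
             */
            WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
    }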

[PATCH 15/41] mm: replace vma->vm_flags direct modifications with modifier calls

2023-01-09 Thread Suren Baghdasaryan
Replace direct modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.

Signed-off-by: Suren Baghdasaryan 
---
 arch/arm/kernel/process.c  |  2 +-
 arch/ia64/mm/init.c|  8 
 arch/loongarch/include/asm/tlb.h   |  2 +-
 arch/powerpc/kvm/book3s_xive_native.c  |  2 +-
 arch/powerpc/mm/book3s64/subpage_prot.c|  2 +-
 arch/powerpc/platforms/book3s/vas-api.c|  2 +-
 arch/powerpc/platforms/cell/spufs/file.c   | 14 +++---
 arch/s390/mm/gmap.c|  3 +--
 arch/x86/entry/vsyscall/vsyscall_64.c  |  2 +-
 arch/x86/kernel/cpu/sgx/driver.c   |  2 +-
 arch/x86/kernel/cpu/sgx/virt.c |  2 +-
 arch/x86/mm/pat/memtype.c  |  6 +++---
 arch/x86/um/mem_32.c   |  2 +-
 drivers/acpi/pfr_telemetry.c   |  2 +-
 drivers/android/binder.c   |  3 +--
 drivers/char/mspec.c   |  2 +-
 drivers/crypto/hisilicon/qm.c  |  2 +-
 drivers/dax/device.c   |  2 +-
 drivers/dma/idxd/cdev.c|  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c|  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c   |  4 ++--
 drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c  |  4 ++--
 drivers/gpu/drm/amd/amdkfd/kfd_events.c|  4 ++--
 drivers/gpu/drm/amd/amdkfd/kfd_process.c   |  4 ++--
 drivers/gpu/drm/drm_gem.c  |  2 +-
 drivers/gpu/drm/drm_gem_dma_helper.c   |  3 +--
 drivers/gpu/drm/drm_gem_shmem_helper.c |  2 +-
 drivers/gpu/drm/drm_vm.c   |  8 
 drivers/gpu/drm/etnaviv/etnaviv_gem.c  |  2 +-
 drivers/gpu/drm/exynos/exynos_drm_gem.c|  4 ++--
 drivers/gpu/drm/gma500/framebuffer.c   |  2 +-
 drivers/gpu/drm/i810/i810_dma.c|  2 +-
 drivers/gpu/drm/i915/gem/i915_gem_mman.c   |  4 ++--
 drivers/gpu/drm/mediatek/mtk_drm_gem.c |  2 +-
 drivers/gpu/drm/msm/msm_gem.c  |  2 +-
 drivers/gpu/drm/omapdrm/omap_gem.c |  3 +--
 drivers/gpu/drm/rockchip/rockchip_drm_gem.c|  3 +--
 drivers/gpu/drm/tegra/gem.c|  5 ++---
 drivers/gpu/drm/ttm/ttm_bo_vm.c|  3 +--
 drivers/gpu/drm/virtio/virtgpu_vram.c  |  2 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c   |  2 +-
 drivers/gpu/drm/xen/xen_drm_front_gem.c|  3 +--
 drivers/hsi/clients/cmt_speech.c   |  2 +-
 drivers/hwtracing/intel_th/msu.c   |  2 +-
 drivers/hwtracing/stm/core.c   |  2 +-
 drivers/infiniband/hw/hfi1/file_ops.c  |  4 ++--
 drivers/infiniband/hw/mlx5/main.c  |  4 ++--
 drivers/infiniband/hw/qib/qib_file_ops.c   | 13 ++---
 drivers/infiniband/hw/usnic/usnic_ib_verbs.c   |  2 +-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c|  2 +-
 .../media/common/videobuf2/videobuf2-dma-contig.c  |  2 +-
 drivers/media/common/videobuf2/videobuf2-vmalloc.c |  2 +-
 drivers/media/v4l2-core/videobuf-dma-contig.c  |  2 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c  |  4 ++--
 drivers/media/v4l2-core/videobuf-vmalloc.c |  2 +-
 drivers/misc/cxl/context.c |  2 +-
 drivers/misc/habanalabs/common/memory.c|  2 +-
 drivers/misc/habanalabs/gaudi/gaudi.c  |  4 ++--
 drivers/misc/habanalabs/gaudi2/gaudi2.c|  8 
 drivers/misc/habanalabs/goya/goya.c|  4 ++--
 drivers/misc/ocxl/context.c|  4 ++--
 drivers/misc/ocxl/sysfs.c  |  2 +-
 drivers/misc/open-dice.c   |  6 +++---
 drivers/misc/sgi-gru/grufile.c |  4 ++--
 drivers/misc/uacce/uacce.c |  2 +-
 drivers/sbus/char/oradax.c |  2 +-
 drivers/scsi/cxlflash/ocxl_hw.c|  2 +-
 drivers/scsi/sg.c  |  2 +-
 drivers/staging/media/atomisp/pci/hmm/hmm_bo.c |  2 +-
 drivers/staging/media/deprecated/meye/meye.c   |  4 ++--
 .../media/deprecated/stkwebcam/stk-webcam.c|  2 +-
 drivers/target/target_core_user.c  |  2 +-
 drivers/uio/uio.c  |  2 +-
 drivers/usb/core/devio.c   |  3 +--
 drivers/usb/mon/mon_bin.c  |  3 +--
 drivers/vdpa/vdpa_user/iova_domain.c   |  2 +-
 drivers/vfio/pci/vfio_pci_core.c   |  2 +-
 drivers/vhost/vdpa.c   |  2 +-
 drivers/video/fbdev/68328f
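
For context, the vm_flags modifier helpers these call sites are converted to
come from an earlier patch in the series and are not shown here. A minimal
sketch of what such helpers could look like, assuming they simply write-lock
the VMA before touching the flags (the exact bodies in the series may differ):

    /* Sketch only; the set_vm_flags()/clear_vm_flags() bodies are assumed. */
    static inline void set_vm_flags(struct vm_area_struct *vma,
                                    unsigned long flags)
    {
            vma_write_lock(vma);
            vma->vm_flags |= flags;
    }

    static inline void clear_vm_flags(struct vm_area_struct *vma,
                                      unsigned long flags)
    {
            vma_write_lock(vma);
            vma->vm_flags &= ~flags;
    }

Keeping every vm_flags update behind such helpers is what lets the locking
rule ("writers must hold the VMA write lock") be enforced in one place.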

[PATCH 16/41] mm: replace vma->vm_flags indirect modification in ksm_madvise

2023-01-09 Thread Suren Baghdasaryan
Replace indirect modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness. Add a BUG_ON check in ksm_madvise() to catch indirect
vm_flags modification attempts.

Signed-off-by: Suren Baghdasaryan 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 5 -
 arch/s390/mm/gmap.c| 5 -
 mm/khugepaged.c| 2 ++
 mm/ksm.c   | 2 ++
 4 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 1d67baa5557a..325a7a47d348 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -393,6 +393,7 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
 {
unsigned long gfn = memslot->base_gfn;
unsigned long end, start = gfn_to_hva(kvm, gfn);
+   unsigned long vm_flags;
int ret = 0;
struct vm_area_struct *vma;
int merge_flag = (merge) ? MADV_MERGEABLE : MADV_UNMERGEABLE;
@@ -409,12 +410,14 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
ret = H_STATE;
break;
}
+   vm_flags = vma->vm_flags;
ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
- merge_flag, &vma->vm_flags);
+ merge_flag, &vm_flags);
if (ret) {
ret = H_STATE;
break;
}
+   reset_vm_flags(vma, vm_flags);
start = vma->vm_end;
} while (end > vma->vm_end);
 
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 3811d6c86d09..e47387f8be6d 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2587,14 +2587,17 @@ int gmap_mark_unmergeable(void)
 {
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
+   unsigned long vm_flags;
int ret;
VMA_ITERATOR(vmi, mm, 0);
 
for_each_vma(vmi, vma) {
+   vm_flags = vma->vm_flags;
ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
- MADV_UNMERGEABLE, &vma->vm_flags);
+ MADV_UNMERGEABLE, &vm_flags);
if (ret)
return ret;
+   reset_vm_flags(vma, vm_flags);
}
mm->def_flags &= ~VM_MERGEABLE;
return 0;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5cb401aa2b9d..5376246a3052 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -352,6 +352,8 @@ struct attribute_group khugepaged_attr_group = {
 int hugepage_madvise(struct vm_area_struct *vma,
 unsigned long *vm_flags, int advice)
 {
+   /* vma->vm_flags can be changed only using modifier functions */
+   BUG_ON(vm_flags == &vma->vm_flags);
switch (advice) {
case MADV_HUGEPAGE:
 #ifdef CONFIG_S390
diff --git a/mm/ksm.c b/mm/ksm.c
index dd02780c387f..d05c41b289db 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2471,6 +2471,8 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long 
start,
struct mm_struct *mm = vma->vm_mm;
int err;
 
+   /* vma->vm_flags can be changed only using modifier functions */
+   BUG_ON(vm_flags == &vma->vm_flags);
switch (advice) {
case MADV_MERGEABLE:
/*
-- 
2.39.0
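
reset_vm_flags() used above is another of those modifier helpers. Judging by
the call sites (save vm_flags, let ksm_madvise() update the local copy, then
write the result back), a plausible minimal form would be:

    /* Sketch only; the real helper comes from the modifier-functions patch. */
    static inline void reset_vm_flags(struct vm_area_struct *vma,
                                      unsigned long flags)
    {
            vma_write_lock(vma);
            vma->vm_flags = flags;
    }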



[PATCH 21/41] mm/mmap: write-lock VMAs affected by VMA expansion

2023-01-09 Thread Suren Baghdasaryan
vma_expand changes VMA boundaries and might result in freeing an adjacent
VMA. Write-lock affected VMAs to prevent concurrent page faults.

Signed-off-by: Suren Baghdasaryan 
---
 mm/mmap.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index 1e2154137631..ff02cb51e7e7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -544,6 +544,7 @@ inline int vma_expand(struct ma_state *mas, struct 
vm_area_struct *vma,
if (mas_preallocate(mas, vma, GFP_KERNEL))
goto nomem;
 
+   vma_write_lock(vma);
vma_adjust_trans_huge(vma, start, end, 0);
 
if (file) {
@@ -590,6 +591,7 @@ inline int vma_expand(struct ma_state *mas, struct 
vm_area_struct *vma,
}
 
if (remove_next) {
+   vma_write_lock(next);
if (file) {
uprobe_munmap(next, next->vm_start, next->vm_end);
fput(file);
-- 
2.39.0



[PATCH 22/41] mm/mremap: write-lock VMA while remapping it to a new address range

2023-01-09 Thread Suren Baghdasaryan
Write-lock the VMA before copying it and also write-lock the new VMA
that copy_vma produces.

Signed-off-by: Suren Baghdasaryan 
Reviewed-by: Laurent Dufour 
---
 mm/mmap.c   | 1 +
 mm/mremap.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index ff02cb51e7e7..da1908730828 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3261,6 +3261,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct 
**vmap,
get_file(new_vma->vm_file);
if (new_vma->vm_ops && new_vma->vm_ops->open)
new_vma->vm_ops->open(new_vma);
+   vma_write_lock(new_vma);
if (vma_link(mm, new_vma))
goto out_vma_link;
*need_rmap_locks = false;
diff --git a/mm/mremap.c b/mm/mremap.c
index 2ccdd1561f5b..d24a79bcb1a1 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -622,6 +622,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
return -ENOMEM;
}
 
+   vma_write_lock(vma);
new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
   &need_rmap_locks);
-- 
2.39.0



[PATCH 23/41] mm: write-lock VMAs before removing them from VMA tree

2023-01-09 Thread Suren Baghdasaryan
Write-locking VMAs before isolating them ensures that page fault
handlers don't operate on isolated VMAs.

Signed-off-by: Suren Baghdasaryan 
---
 mm/mmap.c  | 2 ++
 mm/nommu.c | 5 +
 2 files changed, 7 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index da1908730828..be289e0b693b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -448,6 +448,7 @@ void vma_mas_store(struct vm_area_struct *vma, struct 
ma_state *mas)
  */
 void vma_mas_remove(struct vm_area_struct *vma, struct ma_state *mas)
 {
+   vma_write_lock(vma);
trace_vma_mas_szero(mas->tree, vma->vm_start, vma->vm_end - 1);
mas->index = vma->vm_start;
mas->last = vma->vm_end - 1;
@@ -2300,6 +2301,7 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct 
*vma,
 static inline int munmap_sidetree(struct vm_area_struct *vma,
   struct ma_state *mas_detach)
 {
+   vma_write_lock(vma);
mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);
if (mas_store_gfp(mas_detach, vma, GFP_KERNEL))
return -ENOMEM;
diff --git a/mm/nommu.c b/mm/nommu.c
index b3154357ced5..7ae91337ef14 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -552,6 +552,7 @@ void vma_mas_store(struct vm_area_struct *vma, struct 
ma_state *mas)
 
 void vma_mas_remove(struct vm_area_struct *vma, struct ma_state *mas)
 {
+   vma_write_lock(vma);
mas->index = vma->vm_start;
mas->last = vma->vm_end - 1;
mas_store_prealloc(mas, NULL);
@@ -1551,6 +1552,10 @@ void exit_mmap(struct mm_struct *mm)
mmap_write_lock(mm);
for_each_vma(vmi, vma) {
cleanup_vma_from_mm(vma);
+   /*
+* No need to lock VMA because this is the only mm user and no
+* page fault handler can race with it.
+*/
delete_vma(mm, vma);
cond_resched();
}
-- 
2.39.0



[PATCH 26/41] kernel/fork: assert no VMA readers during its destruction

2023-01-09 Thread Suren Baghdasaryan
Assert there are no holders of VMA lock for reading when it is about to be
destroyed.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm.h | 8 
 kernel/fork.c  | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 594e835bad9c..c464fc8a514c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -680,6 +680,13 @@ static inline void vma_assert_write_locked(struct 
vm_area_struct *vma)
VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), 
vma);
 }
 
+static inline void vma_assert_no_reader(struct vm_area_struct *vma)
+{
+   VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
+ vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
+ vma);
+}
+
 #else /* CONFIG_PER_VMA_LOCK */
 
 static inline void vma_init_lock(struct vm_area_struct *vma) {}
@@ -688,6 +695,7 @@ static inline bool vma_read_trylock(struct vm_area_struct 
*vma)
{ return false; }
 static inline void vma_read_unlock(struct vm_area_struct *vma) {}
 static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
+static inline void vma_assert_no_reader(struct vm_area_struct *vma) {}
 
 #endif /* CONFIG_PER_VMA_LOCK */
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 1591dd8a0745..6d9f14e55ecf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -485,6 +485,8 @@ static void __vm_area_free(struct rcu_head *head)
 {
struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
  vm_rcu);
+   /* The vma should either have no lock holders or be write-locked. */
+   vma_assert_no_reader(vma);
kmem_cache_free(vm_area_cachep, vma);
 }
 #endif
-- 
2.39.0



[PATCH 27/41] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration

2023-01-09 Thread Suren Baghdasaryan
Page fault handlers might need to fire MMU notifications while a new
notifier is being registered. Modify mm_take_all_locks to write-lock all
VMAs and prevent this race with fault handlers that would hold VMA locks.
VMAs are locked before i_mmap_rwsem and anon_vma to keep the same
locking order as in page fault handlers.

Signed-off-by: Suren Baghdasaryan 
---
 mm/mmap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index 30c7d1c5206e..a256deca0bc0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3566,6 +3566,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct 
address_space *mapping)
  * of mm/rmap.c:
  *   - all hugetlbfs_i_mmap_rwsem_key locks (aka mapping->i_mmap_rwsem for
  * hugetlb mapping);
+ *   - all vmas marked locked
  *   - all i_mmap_rwsem locks;
  *   - all anon_vma->rwseml
  *
@@ -3591,6 +3592,7 @@ int mm_take_all_locks(struct mm_struct *mm)
mas_for_each(&mas, vma, ULONG_MAX) {
if (signal_pending(current))
goto out_unlock;
+   vma_write_lock(vma);
if (vma->vm_file && vma->vm_file->f_mapping &&
is_vm_hugetlb_page(vma))
vm_lock_mapping(mm, vma->vm_file->f_mapping);
@@ -3677,6 +3679,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
if (vma->vm_file && vma->vm_file->f_mapping)
vm_unlock_mapping(vma->vm_file->f_mapping);
}
+   vma_write_unlock_mm(mm);
 
mutex_unlock(&mm_all_locks_mutex);
 }
-- 
2.39.0



[PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code

2023-01-09 Thread Suren Baghdasaryan
Introduce lock_vma_under_rcu function to lookup and lock a VMA during
page fault handling. When the VMA is not found, cannot be locked, or
changes after being locked, the function returns NULL. The lookup is performed
under RCU protection to prevent the found VMA from being destroyed before
the VMA lock is acquired. VMA lock statistics are updated according to
the results.
For now only anonymous VMAs can be searched this way. In other cases the
function returns NULL.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm.h |  3 +++
 mm/memory.c| 51 ++
 2 files changed, 54 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c464fc8a514c..d0fddf6a1de9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -687,6 +687,9 @@ static inline void vma_assert_no_reader(struct 
vm_area_struct *vma)
  vma);
 }
 
+struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
+ unsigned long address);
+
 #else /* CONFIG_PER_VMA_LOCK */
 
 static inline void vma_init_lock(struct vm_area_struct *vma) {}
diff --git a/mm/memory.c b/mm/memory.c
index 9ece18548db1..a658e26d965d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5242,6 +5242,57 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, 
unsigned long address,
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);
 
+#ifdef CONFIG_PER_VMA_LOCK
+/*
+ * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
+ * stable and not isolated. If the VMA is not found or is being modified the
+ * function returns NULL.
+ */
+struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
+ unsigned long address)
+{
+   MA_STATE(mas, &mm->mm_mt, address, address);
+   struct vm_area_struct *vma, *validate;
+
+   rcu_read_lock();
+   vma = mas_walk(&mas);
+retry:
+   if (!vma)
+   goto inval;
+
+   /* Only anonymous vmas are supported for now */
+   if (!vma_is_anonymous(vma))
+   goto inval;
+
+   if (!vma_read_trylock(vma))
+   goto inval;
+
+   /* Check since vm_start/vm_end might change before we lock the VMA */
+   if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
+   vma_read_unlock(vma);
+   goto inval;
+   }
+
+   /* Check if the VMA got isolated after we found it */
+   mas.index = address;
+   validate = mas_walk(&mas);
+   if (validate != vma) {
+   vma_read_unlock(vma);
+   count_vm_vma_lock_event(VMA_LOCK_MISS);
+   /* The area was replaced with another one. */
+   vma = validate;
+   goto retry;
+   }
+
+   rcu_read_unlock();
+   return vma;
+inval:
+   rcu_read_unlock();
+   count_vm_vma_lock_event(VMA_LOCK_ABORT);
+   return NULL;
+}
+#endif /* CONFIG_PER_VMA_LOCK */
+
 #ifndef __PAGETABLE_P4D_FOLDED
 /*
  * Allocate p4d page table.
-- 
2.39.0



[PATCH 30/41] mm: add FAULT_FLAG_VMA_LOCK flag

2023-01-09 Thread Suren Baghdasaryan
Add a new flag to distinguish page faults handled under protection of
per-vma lock.

Signed-off-by: Suren Baghdasaryan 
Reviewed-by: Laurent Dufour 
---
 include/linux/mm.h   | 3 ++-
 include/linux/mm_types.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d0fddf6a1de9..2e3be1d45371 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -467,7 +467,8 @@ static inline bool fault_flag_allow_retry_first(enum 
fault_flag flags)
{ FAULT_FLAG_USER,  "USER" }, \
{ FAULT_FLAG_REMOTE,"REMOTE" }, \
{ FAULT_FLAG_INSTRUCTION,   "INSTRUCTION" }, \
-   { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }
+   { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }, \
+   { FAULT_FLAG_VMA_LOCK,  "VMA_LOCK" }
 
 /*
  * vm_fault is filled by the pagefault handler and passed to the vma's
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0d27edd3e63a..fce9113d979c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1103,6 +1103,7 @@ enum fault_flag {
FAULT_FLAG_INTERRUPTIBLE =  1 << 9,
FAULT_FLAG_UNSHARE =1 << 10,
FAULT_FLAG_ORIG_PTE_VALID = 1 << 11,
+   FAULT_FLAG_VMA_LOCK =   1 << 12,
 };
 
 typedef unsigned int __bitwise zap_flags_t;
-- 
2.39.0



[PATCH 31/41] mm: prevent do_swap_page from handling page faults under VMA lock

2023-01-09 Thread Suren Baghdasaryan
Due to the possibility of do_swap_page dropping mmap_lock, abort fault
handling under VMA lock and retry holding mmap_lock. This can be handled
more gracefully in the future.

Signed-off-by: Suren Baghdasaryan 
Reviewed-by: Laurent Dufour 
---
 mm/memory.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 2560524ad7f4..20806bc8b4eb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3707,6 +3707,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (!pte_unmap_same(vmf))
goto out;
 
+   if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+   ret = VM_FAULT_RETRY;
+   goto out;
+   }
+
entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry))) {
if (is_migration_entry(entry)) {
-- 
2.39.0



[PATCH 32/41] mm: prevent userfaults to be handled under per-vma lock

2023-01-09 Thread Suren Baghdasaryan
Due to the possibility of handle_userfault dropping mmap_lock, avoid fault
handling under VMA lock and retry holding mmap_lock. This can be handled
more gracefully in the future.

Signed-off-by: Suren Baghdasaryan 
Suggested-by: Peter Xu 
---
 mm/memory.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 20806bc8b4eb..12508f4d845a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5273,6 +5273,13 @@ struct vm_area_struct *lock_vma_under_rcu(struct 
mm_struct *mm,
if (!vma->anon_vma)
goto inval;
 
+   /*
+* Due to the possibility of userfault handler dropping mmap_lock, avoid
+* it for now and fall back to page fault handling under mmap_lock.
+*/
+   if (userfaultfd_armed(vma))
+   goto inval;
+
if (!vma_read_trylock(vma))
goto inval;
 
-- 
2.39.0



[PATCH 33/41] mm: introduce per-VMA lock statistics

2023-01-09 Thread Suren Baghdasaryan
Add a new CONFIG_PER_VMA_LOCK_STATS config option to dump extra
statistics about handling page fault under VMA lock.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/vm_event_item.h | 6 ++
 include/linux/vmstat.h| 6 ++
 mm/Kconfig.debug  | 8 
 mm/vmstat.c   | 6 ++
 4 files changed, 26 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 7f5d1caf5890..8abfa1240040 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -149,6 +149,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_X86
DIRECT_MAP_LEVEL2_SPLIT,
DIRECT_MAP_LEVEL3_SPLIT,
+#endif
+#ifdef CONFIG_PER_VMA_LOCK_STATS
+   VMA_LOCK_SUCCESS,
+   VMA_LOCK_ABORT,
+   VMA_LOCK_RETRY,
+   VMA_LOCK_MISS,
 #endif
NR_VM_EVENT_ITEMS
 };
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 19cf5b6892ce..fed855bae6d8 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -125,6 +125,12 @@ static inline void vm_events_fold_cpu(int cpu)
 #define count_vm_tlb_events(x, y) do { (void)(y); } while (0)
 #endif
 
+#ifdef CONFIG_PER_VMA_LOCK_STATS
+#define count_vm_vma_lock_event(x) count_vm_event(x)
+#else
+#define count_vm_vma_lock_event(x) do {} while (0)
+#endif
+
 #define __count_zid_vm_events(item, zid, delta) \
__count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta)
 
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index fca699ad1fb0..32a93b064590 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -207,3 +207,11 @@ config PTDUMP_DEBUGFS
  kernel.
 
  If in doubt, say N.
+
+
+config PER_VMA_LOCK_STATS
+   bool "Statistics for per-vma locks"
+   depends on PER_VMA_LOCK
+   default y
+   help
+ Statistics for per-vma locks.
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1ea6a5ce1c41..4f1089a1860e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1399,6 +1399,12 @@ const char * const vmstat_text[] = {
"direct_map_level2_splits",
"direct_map_level3_splits",
 #endif
+#ifdef CONFIG_PER_VMA_LOCK_STATS
+   "vma_lock_success",
+   "vma_lock_abort",
+   "vma_lock_retry",
+   "vma_lock_miss",
+#endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
-- 
2.39.0



[PATCH 36/41] powerpc/mm: try VMA lock-based page fault handling first

2023-01-09 Thread Suren Baghdasaryan
From: Laurent Dufour 

Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.
Copied from "x86/mm: try VMA lock-based page fault handling first"

Signed-off-by: Laurent Dufour 
Signed-off-by: Suren Baghdasaryan 
---
 arch/powerpc/mm/fault.c| 41 ++
 arch/powerpc/platforms/powernv/Kconfig |  1 +
 arch/powerpc/platforms/pseries/Kconfig |  1 +
 3 files changed, 43 insertions(+)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 2bef19cc1b98..f92f8956d5f2 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -469,6 +469,44 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned 
long address,
if (is_exec)
flags |= FAULT_FLAG_INSTRUCTION;
 
+#ifdef CONFIG_PER_VMA_LOCK
+   if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+   goto lock_mmap;
+
+   vma = lock_vma_under_rcu(mm, address);
+   if (!vma)
+   goto lock_mmap;
+
+   if (unlikely(access_pkey_error(is_write, is_exec,
+  (error_code & DSISR_KEYFAULT), vma))) {
+   int rc = bad_access_pkey(regs, address, vma);
+
+   vma_read_unlock(vma);
+   return rc;
+   }
+
+   if (unlikely(access_error(is_write, is_exec, vma))) {
+   int rc = bad_access(regs, address);
+
+   vma_read_unlock(vma);
+   return rc;
+   }
+
+   fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, 
regs);
+   vma_read_unlock(vma);
+
+   if (!(fault & VM_FAULT_RETRY)) {
+   count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+   goto done;
+   }
+   count_vm_vma_lock_event(VMA_LOCK_RETRY);
+
+   if (fault_signal_pending(fault, regs))
+   return user_mode(regs) ? 0 : SIGBUS;
+
+lock_mmap:
+#endif /* CONFIG_PER_VMA_LOCK */
+
/* When running in the kernel we expect faults to occur only to
 * addresses in user space.  All other faults represent errors in the
 * kernel and should generate an OOPS.  Unfortunately, in the case of an
@@ -545,6 +583,9 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned 
long address,
 
mmap_read_unlock(current->mm);
 
+#ifdef CONFIG_PER_VMA_LOCK
+done:
+#endif
if (unlikely(fault & VM_FAULT_ERROR))
return mm_fault_error(regs, address, fault);
 
diff --git a/arch/powerpc/platforms/powernv/Kconfig 
b/arch/powerpc/platforms/powernv/Kconfig
index ae248a161b43..70a46acc70d6 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -16,6 +16,7 @@ config PPC_POWERNV
select PPC_DOORBELL
select MMU_NOTIFIER
select FORCE_SMP
+   select ARCH_SUPPORTS_PER_VMA_LOCK
default y
 
 config OPAL_PRD
diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index a3b4d99567cb..e036a04ff1ca 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -21,6 +21,7 @@ config PPC_PSERIES
select HOTPLUG_CPU
select FORCE_SMP
select SWIOTLB
+   select ARCH_SUPPORTS_PER_VMA_LOCK
default y
 
 config PARAVIRT
-- 
2.39.0



[PATCH 29/41] mm: fall back to mmap_lock if vma->anon_vma is not yet set

2023-01-09 Thread Suren Baghdasaryan
When vma->anon_vma is not set, pagefault handler will set it by either
reusing anon_vma of an adjacent VMA if VMAs are compatible or by
allocating a new one. find_mergeable_anon_vma() walks VMA tree to find
a compatible adjacent VMA and that requires not only the faulting VMA
to be stable but also the tree structure and other VMAs inside that tree.
Therefore locking just the faulting VMA is not enough for this search.
Fall back to taking mmap_lock when vma->anon_vma is not set. This
situation happens only on the first page fault and should not affect
overall performance.

Signed-off-by: Suren Baghdasaryan 
---
 mm/memory.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index a658e26d965d..2560524ad7f4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5264,6 +5264,10 @@ struct vm_area_struct *lock_vma_under_rcu(struct 
mm_struct *mm,
if (!vma_is_anonymous(vma))
goto inval;
 
+   /* find_mergeable_anon_vma uses adjacent vmas which are not locked */
+   if (!vma->anon_vma)
+   goto inval;
+
if (!vma_read_trylock(vma))
goto inval;
 
-- 
2.39.0



[PATCH 35/41] arm64/mm: try VMA lock-based page fault handling first

2023-01-09 Thread Suren Baghdasaryan
Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.

Signed-off-by: Suren Baghdasaryan 
---
 arch/arm64/Kconfig|  1 +
 arch/arm64/mm/fault.c | 36 
 2 files changed, 37 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 03934808b2ed..829fa6d14a36 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -95,6 +95,7 @@ config ARM64
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_SUPPORTS_PAGE_TABLE_CHECK
+   select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
select ARCH_WANT_DEFAULT_BPF_JIT
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 596f46dabe4e..833fa8bab291 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -535,6 +535,9 @@ static int __kprobes do_page_fault(unsigned long far, 
unsigned long esr,
unsigned long vm_flags;
unsigned int mm_flags = FAULT_FLAG_DEFAULT;
unsigned long addr = untagged_addr(far);
+#ifdef CONFIG_PER_VMA_LOCK
+   struct vm_area_struct *vma;
+#endif
 
if (kprobe_page_fault(regs, esr))
return 0;
@@ -585,6 +588,36 @@ static int __kprobes do_page_fault(unsigned long far, 
unsigned long esr,
 
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
 
+#ifdef CONFIG_PER_VMA_LOCK
+   if (!(mm_flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+   goto lock_mmap;
+
+   vma = lock_vma_under_rcu(mm, addr);
+   if (!vma)
+   goto lock_mmap;
+
+   if (!(vma->vm_flags & vm_flags)) {
+   vma_read_unlock(vma);
+   goto lock_mmap;
+   }
+   fault = handle_mm_fault(vma, addr & PAGE_MASK,
+   mm_flags | FAULT_FLAG_VMA_LOCK, regs);
+   vma_read_unlock(vma);
+
+   if (!(fault & VM_FAULT_RETRY)) {
+   count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+   goto done;
+   }
+   count_vm_vma_lock_event(VMA_LOCK_RETRY);
+
+   /* Quick path to respond to signals */
+   if (fault_signal_pending(fault, regs)) {
+   if (!user_mode(regs))
+   goto no_context;
+   return 0;
+   }
+lock_mmap:
+#endif /* CONFIG_PER_VMA_LOCK */
/*
 * As per x86, we may deadlock here. However, since the kernel only
 * validly references user space from well defined areas of the code,
@@ -628,6 +661,9 @@ static int __kprobes do_page_fault(unsigned long far, 
unsigned long esr,
}
mmap_read_unlock(mm);
 
+#ifdef CONFIG_PER_VMA_LOCK
+done:
+#endif
/*
 * Handle the "normal" (no error) case first.
 */
-- 
2.39.0



[PATCH 34/41] x86/mm: try VMA lock-based page fault handling first

2023-01-09 Thread Suren Baghdasaryan
Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.

Signed-off-by: Suren Baghdasaryan 
---
 arch/x86/Kconfig|  1 +
 arch/x86/mm/fault.c | 36 
 2 files changed, 37 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3604074a878b..3647f7bdb110 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -27,6 +27,7 @@ config X86_64
# Options that are inherently 64-bit kernel only:
select ARCH_HAS_GIGANTIC_PAGE
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
+   select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_USE_CMPXCHG_LOCKREF
select HAVE_ARCH_SOFT_DIRTY
select MODULES_USE_ELF_RELA
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7b0d4ab894c8..983266e7c49b 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -19,6 +19,7 @@
 #include  /* faulthandler_disabled()  */
 #include  /* 
efi_crash_gracefully_on_page_fault()*/
 #include 
+#include   /* find_and_lock_vma() */
 
 #include /* boot_cpu_has, ...*/
 #include  /* dotraplinkage, ...   */
@@ -1354,6 +1355,38 @@ void do_user_addr_fault(struct pt_regs *regs,
}
 #endif
 
+#ifdef CONFIG_PER_VMA_LOCK
+   if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+   goto lock_mmap;
+
+   vma = lock_vma_under_rcu(mm, address);
+   if (!vma)
+   goto lock_mmap;
+
+   if (unlikely(access_error(error_code, vma))) {
+   vma_read_unlock(vma);
+   goto lock_mmap;
+   }
+   fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, 
regs);
+   vma_read_unlock(vma);
+
+   if (!(fault & VM_FAULT_RETRY)) {
+   count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+   goto done;
+   }
+   count_vm_vma_lock_event(VMA_LOCK_RETRY);
+
+   /* Quick path to respond to signals */
+   if (fault_signal_pending(fault, regs)) {
+   if (!user_mode(regs))
+   kernelmode_fixup_or_oops(regs, error_code, address,
+SIGBUS, BUS_ADRERR,
+ARCH_DEFAULT_PKEY);
+   return;
+   }
+lock_mmap:
+#endif /* CONFIG_PER_VMA_LOCK */
+
/*
 * Kernel-mode access to the user address space should only occur
 * on well-defined single instructions listed in the exception
@@ -1454,6 +1487,9 @@ void do_user_addr_fault(struct pt_regs *regs,
}
 
mmap_read_unlock(mm);
+#ifdef CONFIG_PER_VMA_LOCK
+done:
+#endif
if (likely(!(fault & VM_FAULT_ERROR)))
return;
 
-- 
2.39.0



[PATCH 37/41] mm: introduce mod_vm_flags_nolock

2023-01-09 Thread Suren Baghdasaryan
In cases when VMA flags are modified after VMA was isolated and mmap_lock
was downgraded, flags modifications do not require per-VMA locking and
an attempt to lock the VMA would result in an assertion because mmap
write lock is not held.
Introduce mod_vm_flags_nolock to be used in such situations.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e3be1d45371..7d436a5027cc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -743,6 +743,14 @@ void clear_vm_flags(struct vm_area_struct *vma, unsigned 
long flags)
vma->vm_flags &= ~flags;
 }
 
+static inline
+void mod_vm_flags_nolock(struct vm_area_struct *vma,
+ unsigned long set, unsigned long clear)
+{
+   vma->vm_flags |= set;
+   vma->vm_flags &= ~clear;
+}
+
 static inline
 void mod_vm_flags(struct vm_area_struct *vma,
  unsigned long set, unsigned long clear)
-- 
2.39.0



[PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-09 Thread Suren Baghdasaryan
call_rcu() can take a long time when callback offloading is enabled.
Its use in the vm_area_free can cause regressions in the exit path when
multiple VMAs are being freed. To minimize that impact, place VMAs into
a list and free them in groups using one call_rcu() call per group.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm.h   |  1 +
 include/linux/mm_types.h | 19 +--
 kernel/fork.c| 68 +++-
 mm/init-mm.c |  3 ++
 mm/mmap.c|  1 +
 5 files changed, 82 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3158f33e268c..50c7a6dd9c7a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -250,6 +250,7 @@ void setup_initial_init_mm(void *start_code, void *end_code,
 struct vm_area_struct *vm_area_alloc(struct mm_struct *);
 struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
 void vm_area_free(struct vm_area_struct *);
+void drain_free_vmas(struct mm_struct *mm);
 
 #ifndef CONFIG_MMU
 extern struct rb_root nommu_region_tree;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fce9113d979c..c0e6c8e4700b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -592,8 +592,18 @@ struct vm_area_struct {
/* Information about our backing store: */
unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
   units */
-   struct file * vm_file;  /* File we map to (can be NULL). */
-   void * vm_private_data; /* was vm_pte (shared mem) */
+   union {
+   struct {
+   /* File we map to (can be NULL). */
+   struct file *vm_file;
+
+   /* was vm_pte (shared mem) */
+   void *vm_private_data;
+   };
+#ifdef CONFIG_PER_VMA_LOCK
+   struct list_head vm_free_list;
+#endif
+   };
 
 #ifdef CONFIG_ANON_VMA_NAME
/*
@@ -693,6 +703,11 @@ struct mm_struct {
  */
 #ifdef CONFIG_PER_VMA_LOCK
int mm_lock_seq;
+   struct {
+   struct list_head head;
+   spinlock_t lock;
+   int size;
+   } vma_free_list;
 #endif
 
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 6d9f14e55ecf..97f2b751f88d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -481,26 +481,75 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct 
*orig)
 }
 
 #ifdef CONFIG_PER_VMA_LOCK
-static void __vm_area_free(struct rcu_head *head)
+static inline void __vm_area_free(struct vm_area_struct *vma)
 {
-   struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
- vm_rcu);
/* The vma should either have no lock holders or be write-locked. */
vma_assert_no_reader(vma);
kmem_cache_free(vm_area_cachep, vma);
 }
-#endif
+
+static void vma_free_rcu_callback(struct rcu_head *head)
+{
+   struct vm_area_struct *first_vma;
+   struct vm_area_struct *vma, *vma2;
+
+   first_vma = container_of(head, struct vm_area_struct, vm_rcu);
+   list_for_each_entry_safe(vma, vma2, &first_vma->vm_free_list, 
vm_free_list)
+   __vm_area_free(vma);
+   __vm_area_free(first_vma);
+}
+
+void drain_free_vmas(struct mm_struct *mm)
+{
+   struct vm_area_struct *first_vma;
+   LIST_HEAD(to_destroy);
+
+   spin_lock(&mm->vma_free_list.lock);
+   list_splice_init(&mm->vma_free_list.head, &to_destroy);
+   mm->vma_free_list.size = 0;
+   spin_unlock(&mm->vma_free_list.lock);
+
+   if (list_empty(&to_destroy))
+   return;
+
+   first_vma = list_first_entry(&to_destroy, struct vm_area_struct, 
vm_free_list);
+   /* Remove the head which is allocated on the stack */
+   list_del(&to_destroy);
+
+   call_rcu(&first_vma->vm_rcu, vma_free_rcu_callback);
+}
+
+#define VM_AREA_FREE_LIST_MAX  32
+
+void vm_area_free(struct vm_area_struct *vma)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   bool drain;
+
+   free_anon_vma_name(vma);
+
+   spin_lock(&mm->vma_free_list.lock);
+   list_add(&vma->vm_free_list, &mm->vma_free_list.head);
+   mm->vma_free_list.size++;
+   drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
+   spin_unlock(&mm->vma_free_list.lock);
+
+   if (drain)
+   drain_free_vmas(mm);
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+void drain_free_vmas(struct mm_struct *mm) {}
 
 void vm_area_free(struct vm_area_struct *vma)
 {
free_anon_vma_name(vma);
-#ifdef CONFIG_PER_VMA_LOCK
-   call_rcu(&vma->vm_rcu, __vm_area_free);
-#else
kmem_cache_free(vm_area_cachep, vma);
-#endif
 }
 
+#endif /* CONFIG_PER_VMA_LOCK */
+
 static void account_kernel_stack(struct task_struct *tsk, int account)
 {
if (IS_ENABLED(CONFIG_VMAP_
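
The single mm/mmap.c hunk in the diffstat is not shown; presumably it drains
the per-mm free list once the address space teardown is complete, along the
lines of (call site assumed for illustration):

    /* e.g. at the end of exit_mmap(), after all VMAs were vm_area_free()d */
    drain_free_vmas(mm);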

[PATCH 38/41] mm: avoid assertion in untrack_pfn

2023-01-09 Thread Suren Baghdasaryan
untrack_pfn can be called after VMA was isolated and mmap_lock downgraded.
An attempt to lock affected VMA would cause an assertion, therefore
use mod_vm_flags_nolock in such situations.

Signed-off-by: Suren Baghdasaryan 
---
 arch/x86/mm/pat/memtype.c | 10 +++---
 include/linux/mm.h|  2 +-
 include/linux/pgtable.h   |  5 +++--
 mm/memory.c   | 15 ---
 mm/memremap.c |  4 ++--
 mm/mmap.c |  4 ++--
 6 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 9e490a372896..f71c8381430b 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -1045,7 +1045,7 @@ void track_pfn_insert(struct vm_area_struct *vma, 
pgprot_t *prot, pfn_t pfn)
  * can be for the entire vma (in which case pfn, size are zero).
  */
 void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
-unsigned long size)
+unsigned long size, bool lock_vma)
 {
resource_size_t paddr;
unsigned long prot;
@@ -1064,8 +1064,12 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned 
long pfn,
size = vma->vm_end - vma->vm_start;
}
free_pfn_range(paddr, size);
-   if (vma)
-   clear_vm_flags(vma, VM_PAT);
+   if (vma) {
+   if (lock_vma)
+   clear_vm_flags(vma, VM_PAT);
+   else
+   mod_vm_flags_nolock(vma, 0, VM_PAT);
+   }
 }
 
 /*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7d436a5027cc..3158f33e268c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2135,7 +2135,7 @@ void zap_page_range_single(struct vm_area_struct *vma, 
unsigned long address,
   unsigned long size, struct zap_details *details);
 void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
struct vm_area_struct *start_vma, unsigned long start,
-   unsigned long end);
+   unsigned long end, bool lock_vma);
 
 struct mmu_notifier_range;
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 1159b25b0542..eaa831bd675d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1214,7 +1214,8 @@ static inline int track_pfn_copy(struct vm_area_struct 
*vma)
  * can be for the entire vma (in which case pfn, size are zero).
  */
 static inline void untrack_pfn(struct vm_area_struct *vma,
-  unsigned long pfn, unsigned long size)
+  unsigned long pfn, unsigned long size,
+  bool lock_vma)
 {
 }
 
@@ -1232,7 +1233,7 @@ extern void track_pfn_insert(struct vm_area_struct *vma, 
pgprot_t *prot,
 pfn_t pfn);
 extern int track_pfn_copy(struct vm_area_struct *vma);
 extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
-   unsigned long size);
+   unsigned long size, bool lock_vma);
 extern void untrack_pfn_moved(struct vm_area_struct *vma);
 #endif
 
diff --git a/mm/memory.c b/mm/memory.c
index 12508f4d845a..5c7d5eaa60d8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1610,7 +1610,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 static void unmap_single_vma(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start_addr,
unsigned long end_addr,
-   struct zap_details *details)
+   struct zap_details *details, bool lock_vma)
 {
unsigned long start = max(vma->vm_start, start_addr);
unsigned long end;
@@ -1625,7 +1625,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
uprobe_munmap(vma, start, end);
 
if (unlikely(vma->vm_flags & VM_PFNMAP))
-   untrack_pfn(vma, 0, 0);
+   untrack_pfn(vma, 0, 0, lock_vma);
 
if (start != end) {
if (unlikely(is_vm_hugetlb_page(vma))) {
@@ -1672,7 +1672,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
  */
 void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
struct vm_area_struct *vma, unsigned long start_addr,
-   unsigned long end_addr)
+   unsigned long end_addr, bool lock_vma)
 {
struct mmu_notifier_range range;
struct zap_details details = {
@@ -1686,7 +1686,8 @@ void unmap_vmas(struct mmu_gather *tlb, struct maple_tree 
*mt,
start_addr, end_addr);
mmu_notifier_invalidate_range_start(&range);
do {
-   unmap_single_vma(tlb, vma, start_addr, end_addr, &details);
+   unmap_single_vma(tlb, vma, start_addr, end_addr, &details,
+lock_vma);
} while ((vma = mas_find(&mas, end_addr - 1)) != NULL);
mmu_notifier_invalidate_range_end(&range);
 }
@@ -1715,7 +1716,7 @@ void zap_page_range(struct vm_area_struct *vma, unsi

[PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-09 Thread Suren Baghdasaryan
rw_semaphore is a sizable structure of 40 bytes and consumes
considerable space for each vm_area_struct. However vma_lock has
two important specifics which can be used to replace rw_semaphore
with a simpler structure:
1. Readers never wait. They try to take the vma_lock and fall back to
mmap_lock if that fails.
2. Only one writer at a time will ever try to write-lock a vma_lock
because writers first take mmap_lock in write mode.
Because of these requirements, full rw_semaphore functionality is not
needed and we can replace rw_semaphore with an atomic variable.
When a reader takes read lock, it increments the atomic unless the
value is negative. If that fails read-locking is aborted and mmap_lock
is used instead.
When writer takes write lock, it resets atomic value to -1 if the
current value is 0 (no readers). Since all writers take mmap_lock in
write mode first, there can be only one writer at a time. If there
are readers, writer will place itself into a wait queue using new
mm_struct.vma_writer_wait waitqueue head. The last reader to release
the vma_lock will signal the writer to wake up.
vm_lock_seq is also moved into vma_lock and along with atomic_t they
are nicely packed and consume 8 bytes, bringing the overhead of
vma_lock from 44 to 16 bytes:

slabinfo before the changes:
 ...: ...
vm_area_struct...152   532 : ...

slabinfo with vma_lock:
 ...: ...
rw_semaphore  ...  8  5121 : ...
vm_area_struct...160   512 : ...

Assuming 4 vm_area_structs, memory consumption would be:
baseline: 6040kB
vma_lock (vm_area_structs+vma_lock): 6280kB+316kB=6596kB
Total increase: 556kB

atomic_t might overflow if there are many competing readers, therefore
vma_read_trylock() implements an overflow check, and if that occurs it
restores the previous value and exits with a failure to lock.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm.h   | 37 +
 include/linux/mm_types.h | 10 --
 kernel/fork.c|  6 +++---
 mm/init-mm.c |  2 ++
 4 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d40bf8a5e19e..294dd44b2198 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -627,12 +627,16 @@ static inline void vma_write_lock(struct vm_area_struct 
*vma)
 * mm->mm_lock_seq can't be concurrently modified.
 */
mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
-   if (vma->vm_lock_seq == mm_lock_seq)
+   if (vma->vm_lock->lock_seq == mm_lock_seq)
return;
 
-   down_write(&vma->vm_lock->lock);
-   vma->vm_lock_seq = mm_lock_seq;
-   up_write(&vma->vm_lock->lock);
+   if (atomic_cmpxchg(&vma->vm_lock->count, 0, -1))
+   wait_event(vma->vm_mm->vma_writer_wait,
+  atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0);
+   vma->vm_lock->lock_seq = mm_lock_seq;
+   /* Write barrier to ensure lock_seq change is visible before count */
+   smp_wmb();
+   atomic_set(&vma->vm_lock->count, 0);
 }
 
 /*
@@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct 
*vma)
 static inline bool vma_read_trylock(struct vm_area_struct *vma)
 {
/* Check before locking. A race might cause false locked result. */
-   if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
+   if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
return false;
 
-   if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
+   if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
return false;
 
+   /* If atomic_t overflows, restore and fail to lock. */
+   if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
+   if (atomic_dec_and_test(&vma->vm_lock->count))
+   wake_up(&vma->vm_mm->vma_writer_wait);
+   return false;
+   }
+
/*
 * Overflow might produce false locked result.
 * False unlocked result is impossible because we modify and check
 * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
 * modification invalidates all existing locks.
 */
-   if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
-   up_read(&vma->vm_lock->lock);
+   if (unlikely(vma->vm_lock->lock_seq == 
READ_ONCE(vma->vm_mm->mm_lock_seq))) {
+   if (atomic_dec_and_test(&vma->vm_lock->count))
+   wake_up(&vma->vm_mm->vma_writer_wait);
return false;
}
return true;
@@ -664,7 +676,8 @@ static inline bool vma_read_trylock(struct vm_area_struct 
*vma)
 
 static inline void vma_read_unlock(struct vm_area_struct *vma)
 {
-   up_read(
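
To complete the picture from the changelog (the last reader wakes a waiting
writer), the reworked reader-unlock side would be along these lines, a sketch
inferred from the error paths shown in vma_read_trylock() above:

    static inline void vma_read_unlock(struct vm_area_struct *vma)
    {
            /* The last reader to drop the lock wakes up a writer blocked
             * in vma_write_lock(), if there is one. */
            if (atomic_dec_and_test(&vma->vm_lock->count))
                    wake_up(&vma->vm_mm->vma_writer_wait);
    }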

[PATCH 40/41] mm: separate vma->lock from vm_area_struct

2023-01-09 Thread Suren Baghdasaryan
vma->lock being part of the vm_area_struct causes performance regression
during page faults because during contention its count and owner fields
are constantly updated and having other parts of vm_area_struct used
during page fault handling next to them causes constant cache line
bouncing. Fix that by moving the lock outside of the vm_area_struct.
All attempts to keep vma->lock inside vm_area_struct in a separate
cache line still produce performance regression especially on NUMA
machines. Smallest regression was achieved when lock is placed in the
fourth cache line but that bloats vm_area_struct to 256 bytes.
Considering performance and memory impact, separate lock looks like
the best option. It increases memory footprint of each VMA but that
will be addressed in the next patch.
Note that after this change vma_init() does not allocate or
initialize vma->lock anymore. A number of drivers allocate a pseudo
VMA on the stack but they never use the VMA's lock, therefore it does
not need to be allocated. The future drivers which might need the VMA
lock should use vm_area_alloc()/vm_area_free() to allocate it.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm.h   | 25 ++--
 include/linux/mm_types.h |  6 ++-
 kernel/fork.c| 82 
 3 files changed, 74 insertions(+), 39 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 50c7a6dd9c7a..d40bf8a5e19e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -615,11 +615,6 @@ struct vm_operations_struct {
 };
 
 #ifdef CONFIG_PER_VMA_LOCK
-static inline void vma_init_lock(struct vm_area_struct *vma)
-{
-   init_rwsem(&vma->lock);
-   vma->vm_lock_seq = -1;
-}
 
 static inline void vma_write_lock(struct vm_area_struct *vma)
 {
@@ -635,9 +630,9 @@ static inline void vma_write_lock(struct vm_area_struct 
*vma)
if (vma->vm_lock_seq == mm_lock_seq)
return;
 
-   down_write(&vma->lock);
+   down_write(&vma->vm_lock->lock);
vma->vm_lock_seq = mm_lock_seq;
-   up_write(&vma->lock);
+   up_write(&vma->vm_lock->lock);
 }
 
 /*
@@ -651,17 +646,17 @@ static inline bool vma_read_trylock(struct vm_area_struct 
*vma)
if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
return false;
 
-   if (unlikely(down_read_trylock(&vma->lock) == 0))
+   if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
return false;
 
/*
 * Overflow might produce false locked result.
 * False unlocked result is impossible because we modify and check
-* vma->vm_lock_seq under vma->lock protection and mm->mm_lock_seq
+* vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
 * modification invalidates all existing locks.
 */
if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
-   up_read(&vma->lock);
+   up_read(&vma->vm_lock->lock);
return false;
}
return true;
@@ -669,7 +664,7 @@ static inline bool vma_read_trylock(struct vm_area_struct 
*vma)
 
 static inline void vma_read_unlock(struct vm_area_struct *vma)
 {
-   up_read(&vma->lock);
+   up_read(&vma->vm_lock->lock);
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -684,7 +679,7 @@ static inline void vma_assert_write_locked(struct 
vm_area_struct *vma)
 
 static inline void vma_assert_no_reader(struct vm_area_struct *vma)
 {
-   VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
+   VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock) &&
  vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
  vma);
 }
@@ -694,7 +689,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct 
*mm,
 
 #else /* CONFIG_PER_VMA_LOCK */
 
-static inline void vma_init_lock(struct vm_area_struct *vma) {}
 static inline void vma_write_lock(struct vm_area_struct *vma) {}
 static inline bool vma_read_trylock(struct vm_area_struct *vma)
{ return false; }
@@ -704,6 +698,10 @@ static inline void vma_assert_no_reader(struct 
vm_area_struct *vma) {}
 
 #endif /* CONFIG_PER_VMA_LOCK */
 
+/*
+ * WARNING: vma_init does not initialize vma->vm_lock.
+ * Use vm_area_alloc()/vm_area_free() if vma needs locking.
+ */
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 {
static const struct vm_operations_struct dummy_vm_ops = {};
@@ -712,7 +710,6 @@ static inline void vma_init(struct vm_area_struct *vma, 
struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
-   vma_init_lock(vma);
 }
 
 /* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c0e6c8e4700b..faa61b400f9b
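
The kernel/fork.c side of this patch is not shown in full. A minimal sketch of
how a separately allocated lock could be wired into vm_area_alloc(), assuming
a dedicated slab cache named vma_lock_cachep and a helper vma_lock_alloc()
(both names are illustrative, not necessarily those used in the patch):

    static bool vma_lock_alloc(struct vm_area_struct *vma)
    {
            vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
            if (!vma->vm_lock)
                    return false;

            init_rwsem(&vma->vm_lock->lock);
            vma->vm_lock_seq = -1;
            return true;
    }

    struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
    {
            struct vm_area_struct *vma;

            vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
            if (!vma)
                    return NULL;

            vma_init(vma, mm);
            /* vma_init() no longer initializes the lock; do it here. */
            if (!vma_lock_alloc(vma)) {
                    kmem_cache_free(vm_area_cachep, vma);
                    return NULL;
            }
            return vma;
    }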

Re: [PATCH 13/41] mm: introduce vma->vm_flags modifier functions

2023-01-11 Thread Suren Baghdasaryan
On Wed, Jan 11, 2023 at 12:19 PM Davidlohr Bueso  wrote:
>
> On Wed, 11 Jan 2023, Suren Baghdasaryan wrote:
>
> >On Wed, Jan 11, 2023 at 8:13 AM Davidlohr Bueso  wrote:
> >>
> >> On Mon, 09 Jan 2023, Suren Baghdasaryan wrote:
> >>
> >> >To keep vma locking correctness when vm_flags are modified, add modifier
> >> >functions to be used whenever flags are updated.
> >>
> >> How about moving this patch and the ones that follow out of this series,
> >> into a preliminary patchset? It would reduce the amount of noise in the
> >> per-vma lock changes, which would then only be adding the needed
> >> vma_write_lock()ing.
> >
> >How about moving those prerequisite patches to the beginning of the
> >patchset (before maple_tree RCU changes)? I feel like they do belong
> >in the patchset because as a standalone patchset it would be unclear
> >why I'm adding all these accessor functions and introducing this
> >churn. Would that be acceptable?
>
> imo the abstraction of vm_flags handling is worth being standalone and is
> easier to be picked up before a more complex locking scheme change. But
> either way, it's up to you.

I see your point. Ok, if you think it makes sense as a stand-alone
patch I can post it separately in the next version.
Thanks,
Suren.

>
> Thanks,
> Davidlohr
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kernel-team+unsubscr...@android.com.
>


Re: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK

2023-01-11 Thread Suren Baghdasaryan
On Wed, Jan 11, 2023 at 9:37 AM Michal Hocko  wrote:
>
> On Wed 11-01-23 09:04:41, Suren Baghdasaryan wrote:
> > On Wed, Jan 11, 2023 at 8:44 AM Michal Hocko  wrote:
> > >
> > > On Wed 11-01-23 08:28:49, Suren Baghdasaryan wrote:
> > > [...]
> > > > Anyhow. Sounds like the overhead of the current design is small enough
> > > > to remove CONFIG_PER_VMA_LOCK and let it depend only on architecture
> > > > support?
> > >
> > > Yes. Further optimizations can be done on top. Let's not over optimize
> > > at this stage.
> >
> > Sure, I won't optimize any further.
> > Just to expand on your question. Original design would be problematic
> > for embedded systems like Android. It notoriously has a high number of
> > VMAs due to anonymous VMAs being named, which prevents them from
> > merging.
>
> What is the usual number of VMAs in that environment?

I've seen some games which had over 4000 VMAs but that's on the upper
side. In my calculations I used 4 VMAs as a ballpark number, and rough
calculations showed that before the size optimization memory
consumption would increase by ~2M (it would vary a bit depending on
the lock placement in vm_area_struct). In Android, the performance
team flags any change that exceeds 500KB, so it would raise questions.

>
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK

2023-01-11 Thread Suren Baghdasaryan
On Wed, Jan 11, 2023 at 8:44 AM Michal Hocko  wrote:
>
> On Wed 11-01-23 08:28:49, Suren Baghdasaryan wrote:
> [...]
> > Anyhow. Sounds like the overhead of the current design is small enough
> > to remove CONFIG_PER_VMA_LOCK and let it depend only on architecture
> > support?
>
> Yes. Further optimizations can be done on top. Let's not over optimize
> at this stage.

Sure, I won't optimize any further.
Just to expand on your question: the original design would be problematic
for embedded systems like Android. It notoriously has a high number of
VMAs due to anonymous VMAs being named, which prevents them from
merging. A 2M per-process increase would raise questions, therefore I
felt the need to optimize the memory overhead, which is done in the
last patch.
Thanks for the feedback!
Thanks for the feedback!

> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK

2023-01-11 Thread Suren Baghdasaryan
On Wed, Jan 11, 2023 at 10:03 AM Michal Hocko  wrote:
>
> On Wed 11-01-23 09:49:08, Suren Baghdasaryan wrote:
> > On Wed, Jan 11, 2023 at 9:37 AM Michal Hocko  wrote:
> > >
> > > On Wed 11-01-23 09:04:41, Suren Baghdasaryan wrote:
> > > > On Wed, Jan 11, 2023 at 8:44 AM Michal Hocko  wrote:
> > > > >
> > > > > On Wed 11-01-23 08:28:49, Suren Baghdasaryan wrote:
> > > > > [...]
> > > > > > Anyhow. Sounds like the overhead of the current design is small 
> > > > > > enough
> > > > > > to remove CONFIG_PER_VMA_LOCK and let it depend only on architecture
> > > > > > support?
> > > > >
> > > > > Yes. Further optimizations can be done on top. Let's not over optimize
> > > > > at this stage.
> > > >
> > > > Sure, I won't optimize any further.
> > > > Just to expand on your question. Original design would be problematic
> > > > for embedded systems like Android. It notoriously has a high number of
> > > > VMAs due to anonymous VMAs being named, which prevents them from
> > > > merging.
> > >
> > > What is the usual number of VMAs in that environment?
> >
> > I've seen some games which had over 4000 VMAs but that's on the upper
> > side. In my calculations I used 4 VMAs as a ballpark number and
> > rough calculations before size optimization would increase memory
> > consumption by ~2M (depending on the lock placement in vm_area_struct
> > it would vary a bit). In Android, the performance team flags any
> > change that exceeds 500KB, so it would raise questions.
>
> Thanks, that is useful information! This is just slightly off-topic
> but I am wondering how much memory those vma names consume. Are there
> that many unique names, or do they just happen to alternate so that
> neighboring ones tend to be different?

Good question. I don't have a ready answer to that but will try to
collect some stats. I know that many names are standardized but
haven't looked at how they are distributed in the address space. Will
follow up once I collect the data.
Thanks,
Suren.

> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 13/41] mm: introduce vma->vm_flags modifier functions

2023-01-11 Thread Suren Baghdasaryan
On Wed, Jan 11, 2023 at 8:13 AM Davidlohr Bueso  wrote:
>
> On Mon, 09 Jan 2023, Suren Baghdasaryan wrote:
>
> >To keep vma locking correctness when vm_flags are modified, add modifier
> >functions to be used whenever flags are updated.
>
> How about moving this patch and the ones that follow out of this series,
> into a preliminary patchset? It would reduce the amount of noise in the
> per-vma lock changes, which would then only be adding the needed
> vma_write_lock()ing.

How about moving those prerequisite patches to the beginning of the
patchset (before maple_tree RCU changes)? I feel like they do belong
in the patchset because as a standalone patchset it would be unclear
why I'm adding all these accessor functions and introducing this
churn. Would that be acceptable?

>
> Thanks,
> Davidlohr
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kernel-team+unsubscr...@android.com.
>


Re: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK

2023-01-11 Thread Suren Baghdasaryan
On Wed, Jan 11, 2023 at 2:03 AM David Laight  wrote:
>
> From: Ingo Molnar
> > Sent: 11 January 2023 09:54
> >
> > * Michal Hocko  wrote:
> >
> > > On Tue 10-01-23 16:44:42, Suren Baghdasaryan wrote:
> > > > On Tue, Jan 10, 2023 at 4:39 PM Davidlohr Bueso  
> > > > wrote:
> > > > >
> > > > > On Mon, 09 Jan 2023, Suren Baghdasaryan wrote:
> > > > >
> > > > > >This configuration variable will be used to build the support for VMA
> > > > > >locking during page fault handling.
> > > > > >
> > > > > >This is enabled by default on supported architectures with SMP and 
> > > > > >MMU
> > > > > >set.
> > > > > >
> > > > > >The architecture support is needed since the page fault handler is 
> > > > > >called
> > > > > >from the architecture's page faulting code which needs modifications 
> > > > > >to
> > > > > >handle faults under VMA lock.
> > > > >
> > > > > I don't think that per-vma locking should be something that is 
> > > > > user-configurable.
> > > > > It should just be depdendant on the arch. So maybe just remove 
> > > > > CONFIG_PER_VMA_LOCK?
> > > >
> > > > Thanks for the suggestion! I would be happy to make that change if
> > > > there are no objections. I think the only pushback might have been the
> > > > vma size increase but with the latest optimization in the last patch
> > > > maybe that's less of an issue?
> > >
> > > Has vma size ever been a real problem? Sure there might be a lot of those
> > > but your patch increases it by rwsem (without the last patch) which is
> > > something like 40B on top of 136B vma so we are talking about 400B in
> > > total which even with wild mapcount limits shouldn't really be
> > > prohibitive. With a default map count limit we are talking about 2M
> > > increase at most (per address space).
> > >
> > > Or are you aware of any specific usecases where vma size is a real
> > > problem?

Well, when fixing the cacheline bouncing problem in the initial design
I was adding 44 bytes to the 152-byte vm_area_struct (CONFIG_NUMA enabled),
pushing it just above 192 bytes while allocating these structures
from a cache-aligned slab (keeping the lock in a separate cacheline to
prevent cacheline bouncing). That would use the whole 256 bytes per
VMA, which did make me nervous. The current design, with no need to
cache-align vm_area_structs and with the 44-byte overhead trimmed down to
16 bytes, seems much more palatable.

> >
> > 40 bytes for the rwsem, plus the patch also adds a 32-bit sequence counter:
> >
> >   + int vm_lock_seq;
> >   + struct rw_semaphore lock;
> >
> > So it's +44 bytes.

Correct.

>
> Depending on whether vm_lock_seq goes into a padding hole or not,
> it will be 40 or 48 bytes.
>
> But if these structures are allocated individually (not an array)
> then it depends on how many items kmalloc() fits into a page (or 2, 4).

Yep. Depends on how we arrange the fields.

Anyhow. Sounds like the overhead of the current design is small enough
to remove CONFIG_PER_VMA_LOCK and let it depend only on architecture
support?
Thanks,
Suren.

>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 
> 1PT, UK
> Registration No: 1397386 (Wales)
>


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-16 Thread Suren Baghdasaryan
On Mon, Jan 16, 2023 at 3:15 AM Hyeonggon Yoo <42.hye...@gmail.com> wrote:
>
> On Mon, Jan 09, 2023 at 12:53:36PM -0800, Suren Baghdasaryan wrote:
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index d40bf8a5e19e..294dd44b2198 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -627,12 +627,16 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> >  * mm->mm_lock_seq can't be concurrently modified.
> >  */
> >   mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> > - if (vma->vm_lock_seq == mm_lock_seq)
> > + if (vma->vm_lock->lock_seq == mm_lock_seq)
> >   return;
> >
> > - down_write(&vma->vm_lock->lock);
> > - vma->vm_lock_seq = mm_lock_seq;
> > - up_write(&vma->vm_lock->lock);
> > + if (atomic_cmpxchg(&vma->vm_lock->count, 0, -1))
> > + wait_event(vma->vm_mm->vma_writer_wait,
> > +atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0);
> > + vma->vm_lock->lock_seq = mm_lock_seq;
> > + /* Write barrier to ensure lock_seq change is visible before count */
> > + smp_wmb();
> > + atomic_set(&vma->vm_lock->count, 0);
> >  }
> >
> >  /*
> > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> >  {
> >   /* Check before locking. A race might cause false locked result. */
> > - if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > + if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> >   return false;
> >
> > - if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > + if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> >   return false;
> >
> > + /* If atomic_t overflows, restore and fail to lock. */
> > + if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > + wake_up(&vma->vm_mm->vma_writer_wait);
> > + return false;
> > + }
> > +
> >   /*
> >* Overflow might produce false locked result.
> >* False unlocked result is impossible because we modify and check
> >* vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> >* modification invalidates all existing locks.
> >*/
> > - if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > - up_read(&vma->vm_lock->lock);
> > + if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > + wake_up(&vma->vm_mm->vma_writer_wait);
> >   return false;
> >   }
>
> With this change readers can cause writers to starve.
> What about checking waitqueue_active() before or after increasing
> vma->vm_lock->count?

The readers are in the page fault path, which is the fast path, while
writers performing updates are in the slow path. Therefore I *think*
starving writers should not be a big issue. So far in benchmarks I
haven't seen issues with that, but maybe there is such a case?

>
> --
> Thanks,
> Hyeonggon
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kernel-team+unsubscr...@android.com.
>


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-16 Thread Suren Baghdasaryan
On Mon, Jan 16, 2023 at 8:14 PM Matthew Wilcox  wrote:
>
> On Mon, Jan 16, 2023 at 11:14:38AM +, Hyeonggon Yoo wrote:
> > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > >  {
> > > /* Check before locking. A race might cause false locked result. */
> > > -   if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > +   if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > return false;
> > >
> > > -   if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > +   if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > > return false;
> > >
> > > +   /* If atomic_t overflows, restore and fail to lock. */
> > > +   if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > > +   if (atomic_dec_and_test(&vma->vm_lock->count))
> > > +   wake_up(&vma->vm_mm->vma_writer_wait);
> > > +   return false;
> > > +   }
> > > +
> > > /*
> > >  * Overflow might produce false locked result.
> > >  * False unlocked result is impossible because we modify and check
> > >  * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > >  * modification invalidates all existing locks.
> > >  */
> > > -   if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > -   up_read(&vma->vm_lock->lock);
> > > +   if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > +   if (atomic_dec_and_test(&vma->vm_lock->count))
> > > +   wake_up(&vma->vm_mm->vma_writer_wait);
> > > return false;
> > > }
> >
> > With this change readers can cause writers to starve.
> > What about checking waitqueue_active() before or after increasing
> > vma->vm_lock->count?
>
> I don't understand how readers can starve a writer.  Readers do
> atomic_inc_unless_negative() so a writer can always force readers
> to fail.

I think the point here was that if page faults keep occurring and they
prevent vm_lock->count from reaching 0 then a writer will be blocked
and there is no reader throttling mechanism (no max time that the writer
will be waiting).


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-16 Thread Suren Baghdasaryan
On Mon, Jan 16, 2023 at 9:46 PM Matthew Wilcox  wrote:
>
> On Mon, Jan 16, 2023 at 08:34:36PM -0800, Suren Baghdasaryan wrote:
> > On Mon, Jan 16, 2023 at 8:14 PM Matthew Wilcox  wrote:
> > >
> > > On Mon, Jan 16, 2023 at 11:14:38AM +, Hyeonggon Yoo wrote:
> > > > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > > > >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > > > >  {
> > > > > /* Check before locking. A race might cause false locked result. */
> > > > > -   if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > +   if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > return false;
> > > > >
> > > > > -   if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > > > +   if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > > > > return false;
> > > > >
> > > > > +   /* If atomic_t overflows, restore and fail to lock. */
> > > > > +   if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > > > > +   if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > +   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > +   return false;
> > > > > +   }
> > > > > +
> > > > > /*
> > > > >  * Overflow might produce false locked result.
> > > > >  * False unlocked result is impossible because we modify and check
> > > > >  * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > > > >  * modification invalidates all existing locks.
> > > > >  */
> > > > > -   if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > -   up_read(&vma->vm_lock->lock);
> > > > > +   if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > +   if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > +   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > return false;
> > > > > }
> > > >
> > > > With this change readers can cause writers to starve.
> > > > What about checking waitqueue_active() before or after increasing
> > > > vma->vm_lock->count?
> > >
> > > I don't understand how readers can starve a writer.  Readers do
> > > atomic_inc_unless_negative() so a writer can always force readers
> > > to fail.
> >
> > I think the point here was that if page faults keep occuring and they
> > prevent vm_lock->count from reaching 0 then a writer will be blocked
> > and there is no reader throttling mechanism (no max time that writer
> > will be waiting).
>
> Perhaps I misunderstood your description; I thought that a _waiting_
> writer would make the count negative, not a successfully acquiring
> writer.

A waiting writer does not modify the counter; instead it's placed on
the wait queue, and the last reader, which sets the count to 0 while
releasing its read lock, will wake it up. Once the writer is woken it
will try to set the count to negative; if successful it owns the
lock, otherwise it goes back to sleep.


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 10:12 AM Jann Horn  wrote:
>
> On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  wrote:
> > rw_semaphore is a sizable structure of 40 bytes and consumes
> > considerable space for each vm_area_struct. However vma_lock has
> > two important specifics which can be used to replace rw_semaphore
> > with a simpler structure:
> [...]
> >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> >  {
> > -   up_read(&vma->vm_lock->lock);
> > +   if (atomic_dec_and_test(&vma->vm_lock->count))
> > +   wake_up(&vma->vm_mm->vma_writer_wait);
> >  }
>
> I haven't properly reviewed this, but this bit looks like a
> use-after-free because you're accessing the vma after dropping your
> reference on it. You'd have to first look up the vma->vm_mm, then do
> the atomic_dec_and_test(), and afterwards do the wake_up() without
> touching the vma. Or alternatively wrap the whole thing in an RCU
> read-side critical section if the VMA is freed with RCU delay.

vm_lock->count does not control the lifetime of the VMA; it's a
counter of how many readers took the lock, or negative if the lock
is write-locked.
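
For reference, the lock word being discussed looks roughly like this
(field names are taken from the quoted hunks above; the exact layout in
the posted patch may differ):

	struct vma_lock {
		int lock_seq;	/* snapshot of mm->mm_lock_seq when write-locked */
		atomic_t count;	/* number of readers, or negative when write-locked */
	};

vma->vm_lock points at this structure, and mm->vma_writer_wait is the
wait queue a blocked writer sleeps on until the last reader drops the
count to zero.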


Re: [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 12:28 PM Jann Horn  wrote:
>
> On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko  wrote:
> > On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > > Protect VMA from concurrent page fault handler while collapsing a huge
> > > page. Page fault handler needs a stable PMD to use PTL and relies on
> > > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > > not be detected by a page fault handler without proper locking.
> >
> > I am struggling with this changelog. Maybe because my recollection of
> > the THP collapsing subtleties is weak. But aren't you just trying to say
> > that the current #PF handling and THP collapsing need to be mutually
> > exclusive currently so in order to keep that assumption you have mark
> > the vma write locked?
> >
> > Also it is not really clear to me how that handles other vmas which can
> > share the same thp?
>
> It's not about the hugepage itself, it's about how the THP collapse
> operation frees page tables.
>
> Before this series, page tables can be walked under any one of the
> mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
> unlinks and frees page tables, it must ensure that all of those either
> are locked or don't exist. This series adds a fourth lock under which
> page tables can be traversed, and so khugepaged must also lock out that one.
>
> There is a codepath in khugepaged that iterates through all mappings
> of a file to zap page tables (retract_page_tables()), which locks each
> visited mm with mmap_write_trylock() and now also does
> vma_write_lock().
>
>
> I think one aspect of this patch that might cause trouble later on, if
> support for non-anonymous VMAs is added, is that retract_page_tables()
> now does vma_write_lock() while holding the mapping lock; the page
> fault handling path would probably take the locks the other way
> around, leading to a deadlock? So the vma_write_lock() in
> retract_page_tables() might have to become a trylock later on.
>
> Related: Please add the new VMA lock to the big lock ordering comments
> at the top of mm/rmap.c. (And maybe later mm/filemap.c, if/when you
> add file VMA support.)

Thanks for the clarifications and the warning. I'll add appropriate
comments and will take this deadlocking scenario into account when
later implementing support for file-backed page faults.


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 7:12 AM Michal Hocko  wrote:
>
> On Tue 17-01-23 16:04:26, Michal Hocko wrote:
> > On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > to be exclusively locked during VMA tree modifications, instead of the
> > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock
> > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > locked.
> >
> > I have to say I was struggling a bit with the above and only understood
> > what you mean by reading the patch several times. I would phrase it like
> > this (feel free to use if you consider this to be an improvement).
> >
> > Introduce a per-VMA rw_semaphore. The lock implementation relies on a
> > per-vma and per-mm sequence counters to note exclusive locking:
> > - read lock - (implemented by vma_read_trylock) requires the the
> >   vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
> >   differ. If they match then there must be a vma exclusive lock
> >   held somewhere.
> > - read unlock - (implemented by vma_read_unlock) is a trivial
> >   vma->lock unlock.
> > - write lock - (vma_write_lock) requires the mmap_lock to be
> >   held exclusively and the current mm counter is noted to the vma
> >   side. This will allow multiple vmas to be locked under a single
> >   mmap_lock write lock (e.g. during vma merging). The vma counter
> >   is modified under exclusive vma lock.
>
> Didn't realize one more thing.
> Unlike standard write lock this implementation allows to be
> called multiple times under a single mmap_lock. In a sense
> it is more of mark_vma_potentially_modified than a lock.

In the RFC it was called vma_mark_locked() originally and renames were
discussed in the email thread ending here:
https://lore.kernel.org/all/621612d7-c537-3971-9520-a3dec7b43...@suse.cz/.
If other names are preferable I'm open to changing them.

>
> > - write unlock - (vma_write_unlock_mm) is a batch release of all
> >   vma locks held. It doesn't pair with a specific
> >   vma_write_lock! It is done before exclusive mmap_lock is
> >   released by incrementing mm sequence counter (mm_lock_seq).
> >   - write downgrade - if the mmap_lock is downgraded to the read
> > lock all vma write locks are released as well (effectivelly
> > same as write unlock).
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 7:47 AM Michal Hocko  wrote:
>
> On Mon 09-01-23 12:53:23, Suren Baghdasaryan wrote:
> > Introduce lock_vma_under_rcu function to lookup and lock a VMA during
> > page fault handling. When VMA is not found, can't be locked or changes
> > after being locked, the function returns NULL. The lookup is performed
> > under RCU protection to prevent the found VMA from being destroyed before
> > the VMA lock is acquired. VMA lock statistics are updated according to
> > the results.
> > For now only anonymous VMAs can be searched this way. In other cases the
> > function returns NULL.
>
> Could you describe why only anonymous vmas are handled at this stage and
> what (roughly) has to be done to support other vmas? lock_vma_under_rcu
> doesn't seem to have any anonymous vma specific requirements AFAICS.

TBH I haven't spent too much time looking into file-backed page faults
yet, but a couple of tasks I can think of are:
- Ensure that all vma->vm_ops->fault() handlers do not rely on
mmap_lock being read-locked;
- vma->vm_file freeing, like VMA freeing, will need to be done after an
RCU grace period since page fault handlers use it. This will require some
caution because simply adding it into __vm_area_free() called via
call_rcu() will cause the corresponding fops->release() to be called
asynchronously. I had to solve this issue with the out-of-tree SPF
implementation when asynchronously called snd_pcm_release() was
problematic.

I'm sure I'm missing more potential issues and maybe Matthew and
Michel can pinpoint more things to resolve here?

>
> Also isn't lock_vma_under_rcu effectively find_read_lock_vma? Not that
> the naming is really the most important part but the rcu locking is
> internal to the function so why should we spread this implementation
> detail to the world...

I wanted the name to indicate that the lookup is done with no locks
held. But I'm open to suggestions.

>
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  include/linux/mm.h |  3 +++
> >  mm/memory.c| 51 ++
> >  2 files changed, 54 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index c464fc8a514c..d0fddf6a1de9 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -687,6 +687,9 @@ static inline void vma_assert_no_reader(struct 
> > vm_area_struct *vma)
> > vma);
> >  }
> >
> > +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > +   unsigned long address);
> > +
> >  #else /* CONFIG_PER_VMA_LOCK */
> >
> >  static inline void vma_init_lock(struct vm_area_struct *vma) {}
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 9ece18548db1..a658e26d965d 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5242,6 +5242,57 @@ vm_fault_t handle_mm_fault(struct vm_area_struct 
> > *vma, unsigned long address,
> >  }
> >  EXPORT_SYMBOL_GPL(handle_mm_fault);
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +/*
> > + * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
> > + * stable and not isolated. If the VMA is not found or is being modified the
> > + * function returns NULL.
> > + */
> > +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > +   unsigned long address)
> > +{
> > + MA_STATE(mas, &mm->mm_mt, address, address);
> > + struct vm_area_struct *vma, *validate;
> > +
> > + rcu_read_lock();
> > + vma = mas_walk(&mas);
> > +retry:
> > + if (!vma)
> > + goto inval;
> > +
> > + /* Only anonymous vmas are supported for now */
> > + if (!vma_is_anonymous(vma))
> > + goto inval;
> > +
> > + if (!vma_read_trylock(vma))
> > + goto inval;
> > +
> > + /* Check since vm_start/vm_end might change before we lock the VMA */
> > + if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> > + vma_read_unlock(vma);
> > + goto inval;
> > + }
> > +
> > + /* Check if the VMA got isolated after we found it */
> > + mas.index = address;
> > + validate = mas_walk(&mas);
> > + if (validate != vma) {
> > + vma_read_unlock(vma);
> > + count_vm_vma_lock_event(VMA_LOCK_MISS);
> > + /* The area was replaced with another one. */
> > + vma = validate;
> > + goto retry;
> > + }
> > +
> > + rcu_read_unlock();
> > + return vma;
> > +inval:
> > + rcu_read_unlock();
> > + count_vm_vma_lock_event(VMA_LOCK_ABORT);
> > + return NULL;
> > +}
> > +#endif /* CONFIG_PER_VMA_LOCK */
> > +
> >  #ifndef __PAGETABLE_P4D_FOLDED
> >  /*
> >   * Allocate p4d page table.
> > --
> > 2.39.0
>
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox  wrote:
>
> On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn  wrote:
> > >
> > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  
> > > wrote:
> > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > considerable space for each vm_area_struct. However vma_lock has
> > > > two important specifics which can be used to replace rw_semaphore
> > > > with a simpler structure:
> > > [...]
> > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > >  {
> > > > -   up_read(&vma->vm_lock->lock);
> > > > +   if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > +   wake_up(&vma->vm_mm->vma_writer_wait);
> > > >  }
> > >
> > > I haven't properly reviewed this, but this bit looks like a
> > > use-after-free because you're accessing the vma after dropping your
> > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > read-side critical section if the VMA is freed with RCU delay.
> >
> > vm_lock->count does not control the lifetime of the VMA, it's a
> > counter of how many readers took the lock or it's negative if the lock
> > is write-locked.
>
> Yes, but ...
>
> Task A:
> atomic_dec_and_test(>vm_lock->count)
> Task B:
> munmap()
> write lock
> free VMA
> synchronize_rcu()
> VMA is really freed
> wake_up(>vm_mm->vma_writer_wait);
>
> ... vma is freed.
>
> Now, I think this doesn't occur.  I'm pretty sure that every caller of
> vma_read_unlock() is holding the RCU read lock.  But maybe we should
> have that assertion?

Yep, that's what this patch is doing
https://lore.kernel.org/all/20230109205336.3665937-27-sur...@google.com/
by calling vma_assert_no_reader() from __vm_area_free().

>


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 10:47 AM Matthew Wilcox  wrote:
>
> On Tue, Jan 17, 2023 at 10:36:42AM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox  wrote:
> > >
> > > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn  wrote:
> > > > >
> > > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  
> > > > > wrote:
> > > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > > two important specifics which can be used to replace rw_semaphore
> > > > > > with a simpler structure:
> > > > > [...]
> > > > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > >  {
> > > > > > -   up_read(&vma->vm_lock->lock);
> > > > > > +   if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > +   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > >  }
> > > > >
> > > > > I haven't properly reviewed this, but this bit looks like a
> > > > > use-after-free because you're accessing the vma after dropping your
> > > > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > > read-side critical section if the VMA is freed with RCU delay.
> > > >
> > > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > > counter of how many readers took the lock or it's negative if the lock
> > > > is write-locked.
> > >
> > > Yes, but ...
> > >
> > > Task A:
> > > atomic_dec_and_test(>vm_lock->count)
> > > Task B:
> > > munmap()
> > > write lock
> > > free VMA
> > > synchronize_rcu()
> > > VMA is really freed
> > > wake_up(>vm_mm->vma_writer_wait);
> > >
> > > ... vma is freed.
> > >
> > > Now, I think this doesn't occur.  I'm pretty sure that every caller of
> > > vma_read_unlock() is holding the RCU read lock.  But maybe we should
> > > have that assertion?
> >
> > Yep, that's what this patch is doing
> > https://lore.kernel.org/all/20230109205336.3665937-27-sur...@google.com/
> > by calling vma_assert_no_reader() from __vm_area_free().
>
> That's not enough though.  Task A still has a pointer to vma after it
> has called atomic_dec_and_test(), even after vma has been freed by
> Task B, and before Task A dereferences vma->vm_mm.

Ah, I see your point now. I guess I'll have to store vma->vm_mm in a
local variable and call mmgrab() before atomic_dec_and_test(), then
use it in wake_up() and call mmdrop(). Is that what you are thinking?


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 1:46 PM Jann Horn  wrote:
>
> On Tue, Jan 17, 2023 at 10:28 PM Suren Baghdasaryan  wrote:
> > On Tue, Jan 17, 2023 at 10:03 AM Jann Horn  wrote:
> > >
> > > +locking maintainers
> >
> > Thanks! I'll CC the locking maintainers in the next posting.
> >
> > >
> > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  
> > > wrote:
> > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock
> > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > mmap_write_lock holder is done with all modifications and drops 
> > > > mmap_lock,
> > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > locked.
> > > [...]
> > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > +{
> > > > +   up_read(&vma->lock);
> > > > +}
> > >
> > > One thing that might be gnarly here is that I think you might not be
> > > allowed to use up_read() to fully release ownership of an object -
> > > from what I remember, I think that up_read() (unlike something like
> > > spin_unlock()) can access the lock object after it's already been
> > > acquired by someone else. So if you want to protect against concurrent
> > > deletion, this might have to be something like:
> > >
> > > rcu_read_lock(); /* keeps vma alive */
> > > up_read(&vma->lock);
> > > rcu_read_unlock();
> >
> > But for deleting VMA one would need to write-lock the vma->lock first,
> > which I assume can't happen until this up_read() is complete. Is that
> > assumption wrong?
>
> __up_read() does:
>
> rwsem_clear_reader_owned(sem);
> > tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
> DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
> if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
>   RWSEM_FLAG_WAITERS)) {
>   clear_nonspinnable(sem);
>   rwsem_wake(sem);
> }
>
> The atomic_long_add_return_release() is the point where we are doing
> the main lock-releasing.
>
> So if a reader dropped the read-lock while someone else was waiting on
> the lock (RWSEM_FLAG_WAITERS) and no other readers were holding the
> lock together with it, the reader also does clear_nonspinnable() and
> rwsem_wake() afterwards.
> But in rwsem_down_write_slowpath(), after we've set
> RWSEM_FLAG_WAITERS, we can return successfully immediately once
> rwsem_try_write_lock() sees that there are no active readers or
> writers anymore (if RWSEM_LOCK_MASK is unset and the cmpxchg
> succeeds). We're not necessarily waiting for the "nonspinnable" bit or
> the wake.
>
> So yeah, I think down_write() can return successfully before up_read()
> is done with its memory accesses.
>
> (Spinlocks are different - the kernel relies on being able to drop
> references via spin_unlock() in some places.)

Thanks for bringing this up. I can add rcu_read_{lock,unlock}() as you
suggested and that would fix the issue, because we free VMAs from
call_rcu(). However this feels to me like an issue of rw_semaphore
design: this locking pattern is unsafe and might lead to UAF.
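
To make the suggested fix concrete, here is a minimal sketch of the
RCU-wrapped unlock, assuming (as in this series) that VMAs are freed via
call_rcu() so an RCU read-side section keeps the rwsem's memory alive:

	static inline void vma_read_unlock(struct vm_area_struct *vma)
	{
		/* Keep the VMA, and thus its rwsem, alive until up_read() is done. */
		rcu_read_lock();
		up_read(&vma->lock);
		rcu_read_unlock();
	}

The cost is one extra rcu_read_lock()/rcu_read_unlock() pair on the
unlock fast path.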


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 10:23 AM Matthew Wilcox  wrote:
>
> On Mon, Jan 16, 2023 at 09:58:35PM -0800, Suren Baghdasaryan wrote:
> > On Mon, Jan 16, 2023 at 9:46 PM Matthew Wilcox  wrote:
> > >
> > > On Mon, Jan 16, 2023 at 08:34:36PM -0800, Suren Baghdasaryan wrote:
> > > > On Mon, Jan 16, 2023 at 8:14 PM Matthew Wilcox  
> > > > wrote:
> > > > >
> > > > > On Mon, Jan 16, 2023 at 11:14:38AM +, Hyeonggon Yoo wrote:
> > > > > > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > > > > > >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > > > > > >  {
> > > > > > > /* Check before locking. A race might cause false locked result. */
> > > > > > > -   if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > > > +   if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > > > return false;
> > > > > > >
> > > > > > > -   if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > > > > > +   if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > > > > > > return false;
> > > > > > >
> > > > > > > +   /* If atomic_t overflows, restore and fail to lock. */
> > > > > > > +   if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > > > > > > +   if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > > +   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > > +   return false;
> > > > > > > +   }
> > > > > > > +
> > > > > > > /*
> > > > > > >  * Overflow might produce false locked result.
> > > > > > >  * False unlocked result is impossible because we modify and check
> > > > > > >  * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > > > > > >  * modification invalidates all existing locks.
> > > > > > >  */
> > > > > > > -   if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > > > -   up_read(&vma->vm_lock->lock);
> > > > > > > +   if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > > > +   if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > > +   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > > return false;
> > > > > > > }
> > > > > >
> > > > > > With this change readers can cause writers to starve.
> > > > > > What about checking waitqueue_active() before or after increasing
> > > > > > vma->vm_lock->count?
> > > > >
> > > > > I don't understand how readers can starve a writer.  Readers do
> > > > > atomic_inc_unless_negative() so a writer can always force readers
> > > > > to fail.
> > > >
> > > > I think the point here was that if page faults keep occuring and they
> > > > prevent vm_lock->count from reaching 0 then a writer will be blocked
> > > > and there is no reader throttling mechanism (no max time that writer
> > > > will be waiting).
> > >
> > > Perhaps I misunderstood your description; I thought that a _waiting_
> > > writer would make the count negative, not a successfully acquiring
> > > writer.
> >
> > A waiting writer does not modify the counter, instead it's placed on
> > the wait queue and the last reader which sets the count to 0 while
> > releasing its read lock will wake it up. Once the writer is woken it
> > will try to set the count to negative and if successful will own the
> > lock, otherwise it goes back to sleep.
>
> Then yes, that's a starvable lock.  Preventing starvation on the mmap
> sem was the original motivation for making rwsems non-starvable, so
> changing that behaviour now seems like a bad idea.  For efficiency, I'd
> suggest that a waiting writer set the top bit of the counter.  That way,
> all new readers will back off without needing to check a second variable
> and old readers will know that they *may* need to do the wakeup when
> atomic_sub_return_release() is negative.
>
> (rwsem.c has a more complex bitfield, but I don't think we need to go
> that far; the important point is that the waiting writer indicates its
> presence in the count field so that readers can modify their behaviour)

Got it. Ok, I think we can figure something out to check if there are
waiting write-lockers and prevent new readers from taking the lock.


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 11:00 AM Jann Horn  wrote:
>
> On Tue, Jan 17, 2023 at 7:55 PM Suren Baghdasaryan  wrote:
> > On Tue, Jan 17, 2023 at 10:47 AM Matthew Wilcox  wrote:
> > >
> > > On Tue, Jan 17, 2023 at 10:36:42AM -0800, Suren Baghdasaryan wrote:
> > > > On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox  
> > > > wrote:
> > > > >
> > > > > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > > > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn  wrote:
> > > > > > >
> > > > > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan 
> > > > > > >  wrote:
> > > > > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > > > > two important specifics which can be used to replace 
> > > > > > > > rw_semaphore
> > > > > > > > with a simpler structure:
> > > > > > > [...]
> > > > > > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > > > >  {
> > > > > > > > -   up_read(&vma->vm_lock->lock);
> > > > > > > > +   if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > > > +   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > > >  }
> > > > > > >
> > > > > > > I haven't properly reviewed this, but this bit looks like a
> > > > > > > use-after-free because you're accessing the vma after dropping 
> > > > > > > your
> > > > > > > reference on it. You'd have to first look up the vma->vm_mm, then 
> > > > > > > do
> > > > > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > > > > read-side critical section if the VMA is freed with RCU delay.
> > > > > >
> > > > > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > > > > counter of how many readers took the lock or it's negative if the 
> > > > > > lock
> > > > > > is write-locked.
> > > > >
> > > > > Yes, but ...
> > > > >
> > > > > Task A:
> > > > > atomic_dec_and_test(>vm_lock->count)
> > > > > Task B:
> > > > > munmap()
> > > > > write lock
> > > > > free VMA
> > > > > synchronize_rcu()
> > > > > VMA is really freed
> > > > > wake_up(>vm_mm->vma_writer_wait);
> > > > >
> > > > > ... vma is freed.
> > > > >
> > > > > Now, I think this doesn't occur.  I'm pretty sure that every caller of
> > > > > vma_read_unlock() is holding the RCU read lock.  But maybe we should
> > > > > have that assertion?
> > > >
> > > > Yep, that's what this patch is doing
> > > > https://lore.kernel.org/all/20230109205336.3665937-27-sur...@google.com/
> > > > by calling vma_assert_no_reader() from __vm_area_free().
> > >
> > > That's not enough though.  Task A still has a pointer to vma after it
> > > has called atomic_dec_and_test(), even after vma has been freed by
> > > Task B, and before Task A dereferences vma->vm_mm.
> >
> > Ah, I see your point now. I guess I'll have to store vma->vm_mm in a
> > local variable and call mmgrab() before atomic_dec_and_test(), then
> > use it in wake_up() and call mmdrop(). Is that what you are thinking?
>
> You shouldn't need mmgrab()/mmdrop(), because whoever is calling you
> for page fault handling must be keeping the mm_struct alive.

Good point. Will update in the next revision to store mm before
dropping the count. Thanks for all the comments folks!


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 12:31 PM Michal Hocko  wrote:
>
> On Tue 17-01-23 10:28:40, Suren Baghdasaryan wrote:
> [...]
> > > Then yes, that's a starvable lock.  Preventing starvation on the mmap
> > > sem was the original motivation for making rwsems non-starvable, so
> > > changing that behaviour now seems like a bad idea.  For efficiency, I'd
> > > suggest that a waiting writer set the top bit of the counter.  That way,
> > > all new readers will back off without needing to check a second variable
> > > and old readers will know that they *may* need to do the wakeup when
> > > atomic_sub_return_release() is negative.
> > >
> > > (rwsem.c has a more complex bitfield, but I don't think we need to go
> > > that far; the important point is that the waiting writer indicates its
> > > presence in the count field so that readers can modify their behaviour)
> >
> > Got it. Ok, I think we can figure something out to check if there are
> > waiting write-lockers and prevent new readers from taking the lock.
>
> Reinventing locking primitives is a ticket to weird bugs. I would stick
> with the rwsem and deal with performance fallouts after it is clear that
> the core idea is generally acceptable and based on actual real life
> numbers. This whole thing is quite big enough that we do not have to go
> through "is this new synchronization primitive correct and behaving
> reasonably" exercise.

Point taken. That's one of the reasons I kept this patch separate.
I'll drop this last patch from the series for now. One correction
though: this would not be a performance fallout but a memory
consumption fallout.

>
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 7:04 AM Michal Hocko  wrote:
>
> On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > instead of mmap_lock. Because there are cases when multiple VMAs need
> > to be exclusively locked during VMA tree modifications, instead of the
> > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock
> > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > locked.
>
> I have to say I was struggling a bit with the above and only understood
> what you mean by reading the patch several times. I would phrase it like
> this (feel free to use if you consider this to be an improvement).
>
> Introduce a per-VMA rw_semaphore. The lock implementation relies on a
> per-vma and per-mm sequence counters to note exclusive locking:
> - read lock - (implemented by vma_read_trylock) requires the the
>   vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
>   differ. If they match then there must be a vma exclusive lock
>   held somewhere.
> - read unlock - (implemented by vma_read_unlock) is a trivial
>   vma->lock unlock.
> - write lock - (vma_write_lock) requires the mmap_lock to be
>   held exclusively and the current mm counter is noted to the vma
>   side. This will allow multiple vmas to be locked under a single
>   mmap_lock write lock (e.g. during vma merging). The vma counter
>   is modified under exclusive vma lock.
> - write unlock - (vma_write_unlock_mm) is a batch release of all
>   vma locks held. It doesn't pair with a specific
>   vma_write_lock! It is done before exclusive mmap_lock is
>   released by incrementing mm sequence counter (mm_lock_seq).
> - write downgrade - if the mmap_lock is downgraded to the read
>   lock all vma write locks are released as well (effectivelly
>   same as write unlock).

Thanks for the suggestion, Michal. I'll definitely reuse your description.

>
> > VMA lock is placed on the cache line boundary so that its 'count' field
> > falls into the first cache line while the rest of the fields fall into
> > the second cache line. This lets the 'count' field to be cached with
> > other frequently accessed fields and used quickly in uncontended case
> > while 'owner' and other fields used in the contended case will not
> > invalidate the first cache line while waiting on the lock.
> >
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  include/linux/mm.h| 80 +++
> >  include/linux/mm_types.h  |  8 
> >  include/linux/mmap_lock.h | 13 +++
> >  kernel/fork.c |  4 ++
> >  mm/init-mm.c  |  3 ++
> >  5 files changed, 108 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index f3f196e4d66d..ec2c4c227d51 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -612,6 +612,85 @@ struct vm_operations_struct {
> > unsigned long addr);
> >  };
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +static inline void vma_init_lock(struct vm_area_struct *vma)
> > +{
> > + init_rwsem(&vma->lock);
> > + vma->vm_lock_seq = -1;
> > +}
> > +
> > +static inline void vma_write_lock(struct vm_area_struct *vma)
> > +{
> > + int mm_lock_seq;
> > +
> > + mmap_assert_write_locked(vma->vm_mm);
> > +
> > + /*
> > +  * current task is holding mmap_write_lock, both vma->vm_lock_seq and
> > +  * mm->mm_lock_seq can't be concurrently modified.
> > +  */
> > + mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> > + if (vma->vm_lock_seq == mm_lock_seq)
> > + return;
> > +
> > + down_write(&vma->lock);
> > + vma->vm_lock_seq = mm_lock_seq;
> > + up_write(&vma->lock);
> > +}
> > +
> > +/*
> > + * Try to read-lock a vma. The function is allowed to occasionally yield 
> > false
> > + * locked result to avoid performance overhead, in which case we fall back 
> > to
> > + * using mmap_lock. The function should never yield false unlocked result.
> > + */
> > +static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > +{
> > + /* Check bef

Re: [PATCH 40/41] mm: separate vma->lock from vm_area_struct

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 10:34 AM Jann Horn  wrote:
>
> On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  wrote:
> > vma->lock being part of the vm_area_struct causes performance regression
> > during page faults because during contention its count and owner fields
> > are constantly updated and having other parts of vm_area_struct used
> > during page fault handling next to them causes constant cache line
> > bouncing. Fix that by moving the lock outside of the vm_area_struct.
> > All attempts to keep vma->lock inside vm_area_struct in a separate
> > cache line still produce performance regression especially on NUMA
> > machines. Smallest regression was achieved when lock is placed in the
> > fourth cache line but that bloats vm_area_struct to 256 bytes.
>
> Just checking: When you tested putting the lock in different cache
> lines, did you force the slab allocator to actually store the
> vm_area_struct with cacheline alignment (by setting SLAB_HWCACHE_ALIGN
> on the slab or with a cacheline_aligned_in_smp on the struct
> definition)?

Yep, I tried all these combinations and still saw a noticeable regression.
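
For readers without the patch in front of them, the split being discussed is
roughly the following (field layout approximate):

	struct vma_lock {
		struct rw_semaphore lock;
	};

	struct vm_area_struct {
		/* ... other fields ... */
		int vm_lock_seq;
		struct vma_lock *vm_lock;	/* separately allocated */
		/* ... other fields ... */
	};

Keeping only a pointer and the sequence number in vm_area_struct moves the
contended lock cache line out of the structure entirely, instead of trying
to pad it into its own cache line inside a 256-byte vm_area_struct.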


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko  wrote:
>
> On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > call_rcu() can take a long time when callback offloading is enabled.
> > Its use in the vm_area_free can cause regressions in the exit path when
> > multiple VMAs are being freed.
>
> What kind of regressions.
>
> > To minimize that impact, place VMAs into
> > a list and free them in groups using one call_rcu() call per group.
>
> Please add some data to justify this additional complexity.

Sorry, I should have done that in the first place. A 4.3% regression was
noticed when running the execl test from the unixbench suite. The spawn
test also showed a 1.6% regression. Profiling revealed that vma freeing
was taking longer due to call_rcu(), which is slow when RCU callback
offloading is enabled. I asked Paul McKenney and he explained that
because the callbacks are offloaded to some other kthread, possibly
running on some other CPU, it is necessary to use explicit locking.
Locking on a per-call_rcu() basis would result in excessive contention
during callback flooding. So, by batching the call_rcu() work we cut that
overhead and reduce this lock contention.


> --
> Michal Hocko
> SUSE Labs
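
A simplified model of that batching, not the actual patch (which threads the
list through the VMAs themselves and avoids the extra allocation); the
vma_free_batch struct, the vm_free_list member, the flush helper and the
vm_area_cachep name are illustrative only:

	struct vma_free_batch {
		struct rcu_head rcu;
		struct list_head head;
	};

	static void vm_area_free_batch_cb(struct rcu_head *rcu)
	{
		struct vma_free_batch *b = container_of(rcu, struct vma_free_batch, rcu);
		struct vm_area_struct *vma, *next;

		/* One grace period has now elapsed for every VMA in the batch. */
		list_for_each_entry_safe(vma, next, &b->head, vm_free_list)
			kmem_cache_free(vm_area_cachep, vma);
		kfree(b);
	}

	/* Called once a caller-side list of to-be-freed VMAs reaches batch size. */
	static void vm_area_free_flush(struct list_head *to_free)
	{
		struct vma_free_batch *b = kmalloc(sizeof(*b), GFP_KERNEL);

		if (!b)
			return;		/* a real implementation needs a fallback */
		INIT_LIST_HEAD(&b->head);
		list_splice_init(to_free, &b->head);
		call_rcu(&b->rcu, vm_area_free_batch_cb);
	}

Either way, the point is the same: one call_rcu() (and therefore one round
of callback-offload locking) per group of VMAs instead of one per VMA.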


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 7:07 AM Michal Hocko  wrote:
>
> On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 5986817f393c..c026d75108b3 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct 
> > vm_area_struct *orig)
> >*/
> >   *new = data_race(*orig);
> >   INIT_LIST_HEAD(>anon_vma_chain);
> > + vma_init_lock(new);
> >   dup_anon_vma_name(orig, new);
> >   }
> >   return new;
> > @@ -1145,6 +1146,9 @@ static struct mm_struct *mm_init(struct mm_struct 
> > *mm, struct task_struct *p,
> >   seqcount_init(>write_protect_seq);
> >   mmap_init_lock(mm);
> >   INIT_LIST_HEAD(>mmlist);
> > +#ifdef CONFIG_PER_VMA_LOCK
> > + WRITE_ONCE(mm->mm_lock_seq, 0);
> > +#endif
>
> The mm shouldn't be visible so why WRITE_ONCE?

True. Will change to a simple assignment.

>
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 10:36 AM Jann Horn  wrote:
>
> On Tue, Jan 17, 2023 at 7:31 PM Matthew Wilcox  wrote:
> >
> > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn  wrote:
> > > >
> > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  
> > > > wrote:
> > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > two important specifics which can be used to replace rw_semaphore
> > > > > with a simpler structure:
> > > > [...]
> > > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > >  {
> > > > > -   up_read(>vm_lock->lock);
> > > > > +   if (atomic_dec_and_test(>vm_lock->count))
> > > > > +   wake_up(>vm_mm->vma_writer_wait);
> > > > >  }
> > > >
> > > > I haven't properly reviewed this, but this bit looks like a
> > > > use-after-free because you're accessing the vma after dropping your
> > > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > read-side critical section if the VMA is freed with RCU delay.
> > >
> > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > counter of how many readers took the lock or it's negative if the lock
> > > is write-locked.
> >
> > Yes, but ...
> >
> > Task A:
> > atomic_dec_and_test(>vm_lock->count)
> > Task B:
> > munmap()
> > write lock
> > free VMA
> > synchronize_rcu()
> > VMA is really freed
> > wake_up(>vm_mm->vma_writer_wait);
> >
> > ... vma is freed.
> >
> > Now, I think this doesn't occur.  I'm pretty sure that every caller of
> > vma_read_unlock() is holding the RCU read lock.  But maybe we should
> > have that assertion?
>
> I don't see that. When do_user_addr_fault() is calling
> vma_read_unlock(), there's no RCU read lock held, right?

We free VMAs using call_rcu() after removing them from the VMA tree. OTOH
page fault handlers are searching for VMAs from inside an RCU read
section and calling vma_read_unlock() from there, see
https://lore.kernel.org/all/20230109205336.3665937-29-sur...@google.com/.
Once we take the VMA read-lock, it ensures that the VMA can't be
write-locked, and anyone destroying or isolating the VMA needs to
write-lock it first.
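
In other words, the update side roughly observes this ordering (sketch;
remove_vma_locked() is a made-up name standing in for the munmap/isolate
paths):

	static void remove_vma_locked(struct vm_area_struct *vma)
	{
		mmap_assert_write_locked(vma->vm_mm);
		vma_write_lock(vma);	/* waits for current readers to drain */
		/* ... unlink the VMA from the maple tree ... */
		vm_area_free(vma);	/* actual freeing deferred via call_rcu() */
	}

so a page-fault handler that won vma_read_trylock() either blocks the write
lock above or has already seen the VMA disappear on its revalidation walk.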


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 1:54 PM Matthew Wilcox  wrote:
>
> On Tue, Jan 17, 2023 at 01:21:47PM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 7:12 AM Michal Hocko  wrote:
> > >
> > > On Tue 17-01-23 16:04:26, Michal Hocko wrote:
> > > > On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA 
> > > > > lock
> > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. 
> > > > > When
> > > > > mmap_write_lock holder is done with all modifications and drops 
> > > > > mmap_lock,
> > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked 
> > > > > as
> > > > > locked.
> > > >
> > > > I have to say I was struggling a bit with the above and only understood
> > > > what you mean by reading the patch several times. I would phrase it like
> > > > this (feel free to use if you consider this to be an improvement).
> > > >
> > > > Introduce a per-VMA rw_semaphore. The lock implementation relies on a
> > > > per-vma and per-mm sequence counters to note exclusive locking:
> > > > - read lock - (implemented by vma_read_trylock) requires the
> > > >   vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
> > > >   differ. If they match then there must be a vma exclusive lock
> > > >   held somewhere.
> > > > - read unlock - (implemented by vma_read_unlock) is a trivial
> > > >   vma->lock unlock.
> > > > - write lock - (vma_write_lock) requires the mmap_lock to be
> > > >   held exclusively and the current mm counter is noted to the 
> > > > vma
> > > >   side. This will allow multiple vmas to be locked under a 
> > > > single
> > > >   mmap_lock write lock (e.g. during vma merging). The vma 
> > > > counter
> > > >   is modified under exclusive vma lock.
> > >
> > > Didn't realize one more thing.
> > > Unlike standard write lock this implementation allows to be
> > > called multiple times under a single mmap_lock. In a sense
> > > it is more of mark_vma_potentially_modified than a lock.
> >
> > In the RFC it was called vma_mark_locked() originally and renames were
> > discussed in the email thread ending here:
> > https://lore.kernel.org/all/621612d7-c537-3971-9520-a3dec7b43...@suse.cz/.
> > If other names are preferable I'm open to changing them.
>
> I don't want to bikeshed this, but rather than locking it seems to be
> more:
>
> vma_start_read()
> vma_end_read()
> vma_start_write()
> vma_end_write()
> vma_downgrade_write()

A couple of corrections: we would have vma_start_tryread() rather than
vma_start_read(), and we would need vma_end_write_all(). Also there is
no vma_downgrade_write(); mmap_write_downgrade() simply does
vma_end_write_all().
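
For reference, a rough sketch of the write side under the
sequence-count scheme with that naming (simplified, not the exact
patch code):

/* Requires mmap_lock held for writing; may be called for many VMAs. */
static inline void vma_start_write(struct vm_area_struct *vma)
{
	int mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);

	/* Already marked in this mmap_lock write cycle? Nothing to do. */
	if (vma->vm_lock_seq == mm_lock_seq)
		return;

	/* Wait out current readers, then publish the new sequence number. */
	down_write(&vma->lock);
	vma->vm_lock_seq = mm_lock_seq;
	up_write(&vma->lock);
}

/*
 * Batch "unlock": invalidates every vm_lock_seq noted above. Called
 * while still holding mmap_lock for writing, just before dropping it.
 */
static inline void vma_end_write_all(struct mm_struct *mm)
{
	WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
}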

>
> ... and that these are _implemented_ with locks (in part) is an
> implementation detail?
>
> Would that reduce people's confusion?
>
> > >
> > > > - write unlock - (vma_write_unlock_mm) is a batch release of all
> > > >   vma locks held. It doesn't pair with a specific
> > > >   vma_write_lock! It is done before exclusive mmap_lock is
> > > >   released by incrementing mm sequence counter (mm_lock_seq).
> > > >   - write downgrade - if the mmap_lock is downgraded to the read
> > > > lock all vma write locks are released as well (effectively
> > > > same as write unlock).
> > > --
> > > Michal Hocko
> > > SUSE Labs


Re: [PATCH 32/41] mm: prevent userfaults to be handled under per-vma lock

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 12:36 PM Jann Horn  wrote:
>
> On Tue, Jan 17, 2023 at 8:51 PM Jann Horn  wrote:
> > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  wrote:
> > > Due to the possibility of handle_userfault dropping mmap_lock, avoid fault
> > > handling under VMA lock and retry holding mmap_lock. This can be handled
> > > more gracefully in the future.
> > >
> > > Signed-off-by: Suren Baghdasaryan 
> > > Suggested-by: Peter Xu 
> > > ---
> > >  mm/memory.c | 7 +++
> > >  1 file changed, 7 insertions(+)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 20806bc8b4eb..12508f4d845a 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -5273,6 +5273,13 @@ struct vm_area_struct *lock_vma_under_rcu(struct 
> > > mm_struct *mm,
> > > if (!vma->anon_vma)
> > > goto inval;
> > >
> > > +   /*
> > > +* Due to the possibility of userfault handler dropping 
> > > mmap_lock, avoid
> > > +* it for now and fall back to page fault handling under 
> > > mmap_lock.
> > > +*/
> > > +   if (userfaultfd_armed(vma))
> > > +   goto inval;
> >
> > This looks racy wrt concurrent userfaultfd_register(). I think you'll
> > want to do the userfaultfd_armed(vma) check _after_ locking the VMA,
>
> I still think this change is needed...

Yes, I think you are right. I'll move the check after locking the VMA. Thanks!
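
I.e., roughly this ordering in lock_vma_under_rcu() (sketch of the
reworked hunk, untested):

	if (!vma_read_trylock(vma))
		goto inval;

	/*
	 * Check userfaultfd only after taking the VMA read lock: a racing
	 * userfaultfd_register() must write-lock the VMA before setting
	 * __VM_UFFD_FLAGS, so it can no longer change this underneath us.
	 */
	if (userfaultfd_armed(vma)) {
		vma_read_unlock(vma);
		goto inval;
	}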

>
> > and ensure that the userfaultfd code write-locks the VMA before
> > changing the __VM_UFFD_FLAGS in vma->vm_flags.
>
> Ah, but now I see you already took care of this half of the issue with
> the reset_vm_flags() change in
> https://lore.kernel.org/linux-mm/20230109205336.3665937-16-sur...@google.com/
> .
>
>
> > > if (!vma_read_trylock(vma))
> > > goto inval;
> > >
> > > --
> > > 2.39.0
> > >


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 10:03 AM Jann Horn  wrote:
>
> +locking maintainers

Thanks! I'll CC the locking maintainers in the next posting.

>
> On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  wrote:
> > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > instead of mmap_lock. Because there are cases when multiple VMAs need
> > to be exclusively locked during VMA tree modifications, instead of the
> > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > locked.
> [...]
> > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > +{
> > +   up_read(&vma->lock);
> > +}
>
> One thing that might be gnarly here is that I think you might not be
> allowed to use up_read() to fully release ownership of an object -
> from what I remember, I think that up_read() (unlike something like
> spin_unlock()) can access the lock object after it's already been
> acquired by someone else. So if you want to protect against concurrent
> deletion, this might have to be something like:
>
> rcu_read_lock(); /* keeps vma alive */
> up_read(&vma->lock);
> rcu_read_unlock();

But to delete a VMA one would need to write-lock vma->lock first,
which I assume can't happen until this up_read() is complete. Is that
assumption wrong?

>
> But I'm not entirely sure about that, the locking folks might know better.
>
> Also, it might not matter given that the rw_semaphore part is removed
> in the current patch 41/41 anyway...

This does matter because Michal suggested dropping that last 41/41
patch for now.


Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 7:16 AM Michal Hocko  wrote:
>
> On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote:
> > Move VMA flag modification (which now implies VMA locking) before
> > anon_vma_lock_write to match the locking order of page fault handler.
>
> Does this changelog assumes per vma locking in the #PF?

Hmm, you are right. Page fault handlers do not use per-vma locks yet
but the changelog already talks about that. Maybe I should change it
to simply:
```
Move VMA flag modification (which now implies VMA locking) before
vma_adjust_trans_huge() to ensure the modifications are done after VMA
has been locked.
```
Is that better?

>
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 26/41] kernel/fork: assert no VMA readers during its destruction

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 7:42 AM 'Michal Hocko' via kernel-team
 wrote:
>
> On Mon 09-01-23 12:53:21, Suren Baghdasaryan wrote:
> > Assert there are no holders of VMA lock for reading when it is about to be
> > destroyed.
> >
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  include/linux/mm.h | 8 
> >  kernel/fork.c  | 2 ++
> >  2 files changed, 10 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 594e835bad9c..c464fc8a514c 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -680,6 +680,13 @@ static inline void vma_assert_write_locked(struct 
> > vm_area_struct *vma)
> >   VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), 
> > vma);
> >  }
> >
> > +static inline void vma_assert_no_reader(struct vm_area_struct *vma)
> > +{
> > + VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
> > +   vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
> > +   vma);
>
> Do we really need to check for vm_lock_seq? rwsem_is_locked should tell
> us something is wrong on its own, no? This could be somebody racing with
> the vma destruction and using the write lock. Unlikely but I do not see
> why to narrow debugging scope.

I wanted to ensure there are no page fault handlers (read-lockers)
left when we are destroying the VMA, and rwsem_is_locked(&vma->lock)
alone could also trigger if someone were concurrently calling
vma_write_lock(). But we don't expect anyone to be write-locking the
VMA while we are destroying it, so you are right, I'm overcomplicating
things here. I think I can get rid of vma_assert_no_reader() and add
VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock), vma) directly in
__vm_area_free(). WDYT?


> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kernel-team+unsubscr...@android.com.
>


Re: [PATCH 13/41] mm: introduce vma->vm_flags modifier functions

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 7:15 AM 'Michal Hocko' via kernel-team
 wrote:
>
> On Tue 17-01-23 16:09:03, Michal Hocko wrote:
> > On Mon 09-01-23 12:53:08, Suren Baghdasaryan wrote:
> > > To keep vma locking correctness when vm_flags are modified, add modifier
> > > functions to be used whenever flags are updated.
> > >
> > > Signed-off-by: Suren Baghdasaryan 
> > > ---
> > >  include/linux/mm.h   | 38 ++
> > >  include/linux/mm_types.h |  8 +++-
> > >  2 files changed, 45 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index ec2c4c227d51..35cf0a6cbcc2 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -702,6 +702,44 @@ static inline void vma_init(struct vm_area_struct 
> > > *vma, struct mm_struct *mm)
> > > vma_init_lock(vma);
> > >  }
> > >
> > > +/* Use when VMA is not part of the VMA tree and needs no locking */
> > > +static inline
> > > +void init_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > > +{
> > > +   WRITE_ONCE(vma->vm_flags, flags);
> > > +}
> >
> > Why do we need WRITE_ONCE here? Isn't vma invisible during its
> > initialization?

Ack. Will change to a simple assignment.

> >
> > > +
> > > +/* Use when VMA is part of the VMA tree and needs appropriate locking */
> > > +static inline
> > > +void reset_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > > +{
> > > +   vma_write_lock(vma);
> > > +   init_vm_flags(vma, flags);
> > > +}
> > > +
> > > +static inline
> > > +void set_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > > +{
> > > +   vma_write_lock(vma);
> > > +   vma->vm_flags |= flags;
> > > +}
> > > +
> > > +static inline
> > > +void clear_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > > +{
> > > +   vma_write_lock(vma);
> > > +   vma->vm_flags &= ~flags;
> > > +}
> > > +
> > > +static inline
> > > +void mod_vm_flags(struct vm_area_struct *vma,
> > > + unsigned long set, unsigned long clear)
> > > +{
> > > +   vma_write_lock(vma);
> > > +   vma->vm_flags |= set;
> > > +   vma->vm_flags &= ~clear;
> > > +}
> > > +
> >
> > This is rather unusual pattern. There is no note about locking involved
> > in the naming and also why is the locking part of this interface in the
> > first place? I can see reason for access functions to actually check for
> > lock asserts.
>
> OK, it took me a while but it is clear to me now. The confusion comes
> from the naming vma_write_lock is not a lock in its usual terms. It is
> more of a vma_mark_modified with side effects to read locking which is a
> real lock. With that it makes more sense to have this done in these
> helpers rather than requiring all users to keep this subtlety in mind.

If renaming vma-locking primitives the way Matthew suggested in
https://lore.kernel.org/all/y8czmt01z1fvv...@casper.infradead.org/
makes it easier to read/understand, I'm all for it. Let's discuss the
naming in that email thread because that's where these functions are
introduced.

>
> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kernel-team+unsubscr...@android.com.
>


Re: [PATCH 09/41] mm: rcu safe VMA freeing

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 6:25 AM Michal Hocko  wrote:
>
> On Mon 09-01-23 12:53:04, Suren Baghdasaryan wrote:
> [...]
> >  void vm_area_free(struct vm_area_struct *vma)
> >  {
> >   free_anon_vma_name(vma);
> > +#ifdef CONFIG_PER_VMA_LOCK
> > + call_rcu(&vma->vm_rcu, __vm_area_free);
> > +#else
> >   kmem_cache_free(vm_area_cachep, vma);
> > +#endif
>
> Is it safe to have vma with already freed vma_name? I suspect this is
> safe because of mmap_lock but is there any reason to split the freeing
> process and have this potential UAF lurking?

It should be safe because the VMA is either locked or has been
isolated while locked, so no page fault handler should have access to
it. But you are right, moving free_anon_vma_name() into
__vm_area_free() does seem safer. Will make the change in the next rev.
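
I.e., roughly (a sketch; the RCU callback would own the whole teardown):

static void __vm_area_free(struct rcu_head *head)
{
	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
						  vm_rcu);

	/* Freed only after the grace period, together with the VMA itself. */
	free_anon_vma_name(vma);
	kmem_cache_free(vm_area_cachep, vma);
}

void vm_area_free(struct vm_area_struct *vma)
{
#ifdef CONFIG_PER_VMA_LOCK
	call_rcu(&vma->vm_rcu, __vm_area_free);
#else
	free_anon_vma_name(vma);
	kmem_cache_free(vm_area_cachep, vma);
#endif
}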

>
> >  }
> >
> >  static void account_kernel_stack(struct task_struct *tsk, int account)
> > --
> > 2.39.0
>
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-19 Thread Suren Baghdasaryan
On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko  wrote:
>
> On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > call_rcu() can take a long time when callback offloading is enabled.
> > Its use in the vm_area_free can cause regressions in the exit path when
> > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > a list and free them in groups using one call_rcu() call per group.
>
> After some more clarification I can understand how call_rcu might not be
> super happy about thousands of callbacks to be invoked and I do agree
> that this is not really optimal.
>
> On the other hand I do not like this solution much either.
> VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> much with processes with a huge number of vmas either. It would still be
> in thousands of callbacks to be scheduled without a good reason.
>
> Instead, are there any other cases than remove_vma that need this
> batching? We could easily just link all the vmas into linked list and
> use a single call_rcu instead, no? This would both simplify the
> implementation, remove the scaling issue as well and we do not have to
> argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.

Yes, I agree the solution is not stellar. I wanted something simple
but this is probably too simple. OTOH keeping all dead vm_area_structs
on the list without hooking up a shrinker (additional complexity) does
not sound too appealing either. WDYT about time-domain throttling that
limits draining the list to, say, once per second, like this:

void vm_area_free(struct vm_area_struct *vma)
{
   struct mm_struct *mm = vma->vm_mm;
   bool drain;

   free_anon_vma_name(vma);

   spin_lock(&mm->vma_free_list.lock);
   list_add(&vma->vm_free_list, &mm->vma_free_list.head);
   mm->vma_free_list.size++;
-   drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
+   drain = jiffies > mm->last_drain_tm + HZ;

   spin_unlock(&mm->vma_free_list.lock);

-   if (drain)
+   if (drain) {
  drain_free_vmas(mm);
+ mm->last_drain_tm = jiffies;
+   }
}

Ultimately we want to prevent very frequent call_rcu() calls, so
throttling in the time domain seems appropriate. That's the simplest
way I can think of to address your concern about a quick spike in VMA
freeing. It does not place any restriction on the list size, so we
might accumulate excessive dead vm_area_structs if, after a large
spike, there are no further vm_area_free() calls. I don't know whether
that's a real problem, so I'm not sure we should be addressing it at
this time. WDYT?


>
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call

2023-01-19 Thread Suren Baghdasaryan
On Thu, Jan 19, 2023 at 1:31 AM Michal Hocko  wrote:
>
> On Wed 18-01-23 13:48:13, Suren Baghdasaryan wrote:
> > On Wed, Jan 18, 2023 at 1:33 PM Michal Hocko  wrote:
> [...]
> > > So it will become:
> > > Move VMA flag modification (which now implies VMA locking) before
> > > vma_adjust_trans_huge() to ensure the modifications are done after VMA
> > > has been locked. Because vma_adjust_trans_huge() modifies the VMA and such
> > > modifications should be done under VMA write-lock protection.
> > >
> > > which is effectivelly saying
> > > vma_adjust_trans_huge() modifies the VMA and such modifications should
> > > be done under VMA write-lock protection so move VMA flag modifications
> > > before so all of them are covered by the same write protection.
> > >
> > > right?
> >
> > Yes, and the wording in the latter version is simpler to understand
> > IMO, so I would like to adopt it. Do you agree?
>
> of course.

Will update in the next respin. Thanks!

> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-19 Thread Suren Baghdasaryan
On Thu, Jan 19, 2023 at 11:20 AM Paul E. McKenney  wrote:
>
> On Thu, Jan 19, 2023 at 10:52:03AM -0800, Suren Baghdasaryan wrote:
> > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko  wrote:
> > >
> > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > a list and free them in groups using one call_rcu() call per group.
> > >
> > > After some more clarification I can understand how call_rcu might not be
> > > super happy about thousands of callbacks to be invoked and I do agree
> > > that this is not really optimal.
> > >
> > > On the other hand I do not like this solution much either.
> > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > much with processes with a huge number of vmas either. It would still be
> > > in thousands of callbacks to be scheduled without a good reason.
> > >
> > > Instead, are there any other cases than remove_vma that need this
> > > batching? We could easily just link all the vmas into linked list and
> > > use a single call_rcu instead, no? This would both simplify the
> > > implementation, remove the scaling issue as well and we do not have to
> > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> >
> > Yes, I agree the solution is not stellar. I wanted something simple
> > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > on the list without hooking up a shrinker (additional complexity) does
> > not sound too appealing either. WDYT about time domain throttling to
> > limit draining the list to say once per second like this:
> >
> > void vm_area_free(struct vm_area_struct *vma)
> > {
> >struct mm_struct *mm = vma->vm_mm;
> >bool drain;
> >
> >free_anon_vma_name(vma);
> >
> >    spin_lock(&mm->vma_free_list.lock);
> >    list_add(&vma->vm_free_list, &mm->vma_free_list.head);
> >mm->vma_free_list.size++;
> > -   drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
> > +   drain = jiffies > mm->last_drain_tm + HZ;
> >
> >    spin_unlock(&mm->vma_free_list.lock);
> >
> > -   if (drain)
> > +   if (drain) {
> >   drain_free_vmas(mm);
> > + mm->last_drain_tm = jiffies;
> > +   }
> > }
> >
> > Ultimately we want to prevent very frequent call_rcu() calls, so
> > throttling in the time domain seems appropriate. That's the simplest
> > way I can think of to address your concern about a quick spike in VMA
> > freeing. It does not place any restriction on the list size and we
> > might have excessive dead vm_area_structs if after a large spike there
> > are no vm_area_free() calls but I don't know if that's a real problem,
> > so not sure we should be addressing it at this time. WDYT?
>
> Just to double-check, we really did try the very frequent call_rcu()
> invocations and we really did see a problem, correct?

Correct. More specifically with CONFIG_RCU_NOCB_CPU=y we saw
regressions when a process exits and all its VMAs get destroyed,
causing a flood of call_rcu()'s.

>
> Although it is not perfect, call_rcu() is designed to take a fair amount
> of abuse.  So if we didn't see a real problem, the frequent call_rcu()
> invocations might be a bit simpler.
>
> Thanx, Paul


[PATCH v2 4/6] mm: replace vma->vm_flags indirect modification in ksm_madvise

2023-01-25 Thread Suren Baghdasaryan
Replace indirect modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness. Add a BUG_ON check in ksm_madvise() to catch indirect
vm_flags modification attempts.

Signed-off-by: Suren Baghdasaryan 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 5 -
 arch/s390/mm/gmap.c| 5 -
 mm/khugepaged.c| 2 ++
 mm/ksm.c   | 2 ++
 4 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 1d67baa5557a..325a7a47d348 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -393,6 +393,7 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
 {
unsigned long gfn = memslot->base_gfn;
unsigned long end, start = gfn_to_hva(kvm, gfn);
+   unsigned long vm_flags;
int ret = 0;
struct vm_area_struct *vma;
int merge_flag = (merge) ? MADV_MERGEABLE : MADV_UNMERGEABLE;
@@ -409,12 +410,14 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
ret = H_STATE;
break;
}
+   vm_flags = vma->vm_flags;
ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
- merge_flag, &vma->vm_flags);
+ merge_flag, &vm_flags);
if (ret) {
ret = H_STATE;
break;
}
+   reset_vm_flags(vma, vm_flags);
start = vma->vm_end;
} while (end > vma->vm_end);
 
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 3a695b8a1e3c..d5eb47dcdacb 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2587,14 +2587,17 @@ int gmap_mark_unmergeable(void)
 {
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
+   unsigned long vm_flags;
int ret;
VMA_ITERATOR(vmi, mm, 0);
 
for_each_vma(vmi, vma) {
+   vm_flags = vma->vm_flags;
ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
- MADV_UNMERGEABLE, &vma->vm_flags);
+ MADV_UNMERGEABLE, &vm_flags);
if (ret)
return ret;
+   reset_vm_flags(vma, vm_flags);
}
mm->def_flags &= ~VM_MERGEABLE;
return 0;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 8abc59345bf2..76b24cd0c179 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -354,6 +354,8 @@ struct attribute_group khugepaged_attr_group = {
 int hugepage_madvise(struct vm_area_struct *vma,
 unsigned long *vm_flags, int advice)
 {
+   /* vma->vm_flags can be changed only using modifier functions */
+   BUG_ON(vm_flags == &vma->vm_flags);
switch (advice) {
case MADV_HUGEPAGE:
 #ifdef CONFIG_S390
diff --git a/mm/ksm.c b/mm/ksm.c
index 04f1c8c2df11..992b2be9f5e6 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2573,6 +2573,8 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long 
start,
struct mm_struct *mm = vma->vm_mm;
int err;
 
+   /* vma->vm_flags can be changed only using modifier functions */
+   BUG_ON(vm_flags == &vma->vm_flags);
switch (advice) {
case MADV_MERGEABLE:
/*
-- 
2.39.1



[PATCH v2 1/6] mm: introduce vma->vm_flags modifier functions

2023-01-25 Thread Suren Baghdasaryan
vm_flags are among VMA attributes which affect decisions like VMA merging
and splitting. Therefore all vm_flags modifications are performed after
taking exclusive mmap_lock to prevent vm_flags updates racing with such
operations. Introduce modifier functions for vm_flags to be used whenever
flags are updated. This way we can better check and control correct
locking behavior during these updates.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm.h   | 37 +
 include/linux/mm_types.h |  8 +++-
 2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c2f62bdce134..b71f2809caac 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -627,6 +627,43 @@ static inline void vma_init(struct vm_area_struct *vma, 
struct mm_struct *mm)
	INIT_LIST_HEAD(&vma->anon_vma_chain);
 }
 
+/* Use when VMA is not part of the VMA tree and needs no locking */
+static inline void init_vm_flags(struct vm_area_struct *vma,
+unsigned long flags)
+{
+   vma->vm_flags = flags;
+}
+
+/* Use when VMA is part of the VMA tree and modifications need coordination */
+static inline void reset_vm_flags(struct vm_area_struct *vma,
+ unsigned long flags)
+{
+   mmap_assert_write_locked(vma->vm_mm);
+   init_vm_flags(vma, flags);
+}
+
+static inline void set_vm_flags(struct vm_area_struct *vma,
+   unsigned long flags)
+{
+   mmap_assert_write_locked(vma->vm_mm);
+   vma->vm_flags |= flags;
+}
+
+static inline void clear_vm_flags(struct vm_area_struct *vma,
+ unsigned long flags)
+{
+   mmap_assert_write_locked(vma->vm_mm);
+   vma->vm_flags &= ~flags;
+}
+
+static inline void mod_vm_flags(struct vm_area_struct *vma,
+   unsigned long set, unsigned long clear)
+{
+   mmap_assert_write_locked(vma->vm_mm);
+   vma->vm_flags |= set;
+   vma->vm_flags &= ~clear;
+}
+
 static inline void vma_set_anonymous(struct vm_area_struct *vma)
 {
vma->vm_ops = NULL;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2d6d790d9bed..6c7c70bf50dd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -491,7 +491,13 @@ struct vm_area_struct {
 * See vmf_insert_mixed_prot() for discussion.
 */
pgprot_t vm_page_prot;
-   unsigned long vm_flags; /* Flags, see mm.h. */
+
+   /*
+* Flags, see mm.h.
+* WARNING! Do not modify directly.
+* Use {init|reset|set|clear|mod}_vm_flags() functions instead.
+*/
+   unsigned long vm_flags;
 
/*
 * For areas with an address space and backing store,
-- 
2.39.1



[PATCH v2 3/6] mm: replace vma->vm_flags direct modifications with modifier calls

2023-01-25 Thread Suren Baghdasaryan
Replace direct modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.

Signed-off-by: Suren Baghdasaryan 
---
 arch/arm/kernel/process.c  |  2 +-
 arch/ia64/mm/init.c|  8 
 arch/loongarch/include/asm/tlb.h   |  2 +-
 arch/powerpc/kvm/book3s_xive_native.c  |  2 +-
 arch/powerpc/mm/book3s64/subpage_prot.c|  2 +-
 arch/powerpc/platforms/book3s/vas-api.c|  2 +-
 arch/powerpc/platforms/cell/spufs/file.c   | 14 +++---
 arch/s390/mm/gmap.c|  3 +--
 arch/x86/entry/vsyscall/vsyscall_64.c  |  2 +-
 arch/x86/kernel/cpu/sgx/driver.c   |  2 +-
 arch/x86/kernel/cpu/sgx/virt.c |  2 +-
 arch/x86/mm/pat/memtype.c  |  6 +++---
 arch/x86/um/mem_32.c   |  2 +-
 drivers/acpi/pfr_telemetry.c   |  2 +-
 drivers/android/binder.c   |  3 +--
 drivers/char/mspec.c   |  2 +-
 drivers/crypto/hisilicon/qm.c  |  2 +-
 drivers/dax/device.c   |  2 +-
 drivers/dma/idxd/cdev.c|  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c|  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c   |  4 ++--
 drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c  |  4 ++--
 drivers/gpu/drm/amd/amdkfd/kfd_events.c|  4 ++--
 drivers/gpu/drm/amd/amdkfd/kfd_process.c   |  4 ++--
 drivers/gpu/drm/drm_gem.c  |  2 +-
 drivers/gpu/drm/drm_gem_dma_helper.c   |  3 +--
 drivers/gpu/drm/drm_gem_shmem_helper.c |  2 +-
 drivers/gpu/drm/drm_vm.c   |  8 
 drivers/gpu/drm/etnaviv/etnaviv_gem.c  |  2 +-
 drivers/gpu/drm/exynos/exynos_drm_gem.c|  4 ++--
 drivers/gpu/drm/gma500/framebuffer.c   |  2 +-
 drivers/gpu/drm/i810/i810_dma.c|  2 +-
 drivers/gpu/drm/i915/gem/i915_gem_mman.c   |  4 ++--
 drivers/gpu/drm/mediatek/mtk_drm_gem.c |  2 +-
 drivers/gpu/drm/msm/msm_gem.c  |  2 +-
 drivers/gpu/drm/omapdrm/omap_gem.c |  3 +--
 drivers/gpu/drm/rockchip/rockchip_drm_gem.c|  3 +--
 drivers/gpu/drm/tegra/gem.c|  5 ++---
 drivers/gpu/drm/ttm/ttm_bo_vm.c|  3 +--
 drivers/gpu/drm/virtio/virtgpu_vram.c  |  2 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c   |  2 +-
 drivers/gpu/drm/xen/xen_drm_front_gem.c|  3 +--
 drivers/hsi/clients/cmt_speech.c   |  2 +-
 drivers/hwtracing/intel_th/msu.c   |  2 +-
 drivers/hwtracing/stm/core.c   |  2 +-
 drivers/infiniband/hw/hfi1/file_ops.c  |  4 ++--
 drivers/infiniband/hw/mlx5/main.c  |  4 ++--
 drivers/infiniband/hw/qib/qib_file_ops.c   | 13 ++---
 drivers/infiniband/hw/usnic/usnic_ib_verbs.c   |  2 +-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c|  2 +-
 .../media/common/videobuf2/videobuf2-dma-contig.c  |  2 +-
 drivers/media/common/videobuf2/videobuf2-vmalloc.c |  2 +-
 drivers/media/v4l2-core/videobuf-dma-contig.c  |  2 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c  |  4 ++--
 drivers/media/v4l2-core/videobuf-vmalloc.c |  2 +-
 drivers/misc/cxl/context.c |  2 +-
 drivers/misc/habanalabs/common/memory.c|  2 +-
 drivers/misc/habanalabs/gaudi/gaudi.c  |  4 ++--
 drivers/misc/habanalabs/gaudi2/gaudi2.c|  8 
 drivers/misc/habanalabs/goya/goya.c|  4 ++--
 drivers/misc/ocxl/context.c|  4 ++--
 drivers/misc/ocxl/sysfs.c  |  2 +-
 drivers/misc/open-dice.c   |  4 ++--
 drivers/misc/sgi-gru/grufile.c |  4 ++--
 drivers/misc/uacce/uacce.c |  2 +-
 drivers/sbus/char/oradax.c |  2 +-
 drivers/scsi/cxlflash/ocxl_hw.c|  2 +-
 drivers/scsi/sg.c  |  2 +-
 drivers/staging/media/atomisp/pci/hmm/hmm_bo.c |  2 +-
 drivers/staging/media/deprecated/meye/meye.c   |  4 ++--
 .../media/deprecated/stkwebcam/stk-webcam.c|  2 +-
 drivers/target/target_core_user.c  |  2 +-
 drivers/uio/uio.c  |  2 +-
 drivers/usb/core/devio.c   |  3 +--
 drivers/usb/mon/mon_bin.c  |  3 +--
 drivers/vdpa/vdpa_user/iova_domain.c   |  2 +-
 drivers/vfio/pci/vfio_pci_core.c   |  2 +-
 drivers/vhost/vdpa.c   |  2 +-
 drivers/video/fbdev/68328f

[PATCH v2 2/6] mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK

2023-01-25 Thread Suren Baghdasaryan
To simplify the usage of VM_LOCKED_CLEAR_MASK in clear_vm_flags(),
replace it with VM_LOCKED_MASK bitmask and convert all users.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm.h | 4 ++--
 kernel/fork.c  | 2 +-
 mm/hugetlb.c   | 4 ++--
 mm/mlock.c | 6 +++---
 mm/mmap.c  | 6 +++---
 mm/mremap.c| 2 +-
 6 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b71f2809caac..da62bdd627bf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -421,8 +421,8 @@ extern unsigned int kobjsize(const void *objp);
 /* This mask defines which mm->def_flags a process can inherit its parent */
 #define VM_INIT_DEF_MASK   VM_NOHUGEPAGE
 
-/* This mask is used to clear all the VMA flags used by mlock */
-#define VM_LOCKED_CLEAR_MASK   (~(VM_LOCKED | VM_LOCKONFAULT))
+/* This mask represents all the VMA flag bits used by mlock */
+#define VM_LOCKED_MASK (VM_LOCKED | VM_LOCKONFAULT)
 
 /* Arch-specific flags to clear when updating VM flags on protection change */
 #ifndef VM_ARCH_CLEAR
diff --git a/kernel/fork.c b/kernel/fork.c
index 6683c1b0f460..03d472051236 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -669,7 +669,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
tmp->anon_vma = NULL;
} else if (anon_vma_fork(tmp, mpnt))
goto fail_nomem_anon_vma_fork;
-   tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
+   clear_vm_flags(tmp, VM_LOCKED_MASK);
file = tmp->vm_file;
if (file) {
struct address_space *mapping = file->f_mapping;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d20c8b09890e..4ecdbad9a451 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6973,8 +6973,8 @@ static unsigned long page_table_shareable(struct 
vm_area_struct *svma,
unsigned long s_end = sbase + PUD_SIZE;
 
/* Allow segments to share if only one is marked locked */
-   unsigned long vm_flags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
-   unsigned long svm_flags = svma->vm_flags & VM_LOCKED_CLEAR_MASK;
+   unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED_MASK;
+   unsigned long svm_flags = svma->vm_flags & ~VM_LOCKED_MASK;
 
/*
 * match the virtual addresses, permission and the alignment of the
diff --git a/mm/mlock.c b/mm/mlock.c
index 0336f52e03d7..5c4fff93cd6b 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -497,7 +497,7 @@ static int apply_vma_lock_flags(unsigned long start, size_t 
len,
if (vma->vm_start != tmp)
return -ENOMEM;
 
-   newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
+   newflags = vma->vm_flags & ~VM_LOCKED_MASK;
newflags |= flags;
/* Here we know that  vma->vm_start <= nstart < vma->vm_end. */
tmp = vma->vm_end;
@@ -661,7 +661,7 @@ static int apply_mlockall_flags(int flags)
struct vm_area_struct *vma, *prev = NULL;
vm_flags_t to_add = 0;
 
-   current->mm->def_flags &= VM_LOCKED_CLEAR_MASK;
+   current->mm->def_flags &= ~VM_LOCKED_MASK;
if (flags & MCL_FUTURE) {
current->mm->def_flags |= VM_LOCKED;
 
@@ -681,7 +681,7 @@ static int apply_mlockall_flags(int flags)
for_each_vma(vmi, vma) {
vm_flags_t newflags;
 
-   newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
+   newflags = vma->vm_flags & ~VM_LOCKED_MASK;
newflags |= to_add;
 
/* Ignore errors */
diff --git a/mm/mmap.c b/mm/mmap.c
index d4abc6feced1..323bd253b25a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2671,7 +2671,7 @@ unsigned long mmap_region(struct file *file, unsigned 
long addr,
if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
is_vm_hugetlb_page(vma) ||
vma == get_gate_vma(current->mm))
-   vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+   clear_vm_flags(vma, VM_LOCKED_MASK);
else
mm->locked_vm += (len >> PAGE_SHIFT);
}
@@ -3340,8 +3340,8 @@ static struct vm_area_struct *__install_special_mapping(
vma->vm_start = addr;
vma->vm_end = addr + len;
 
-   vma->vm_flags = vm_flags | mm->def_flags | VM_DONTEXPAND | VM_SOFTDIRTY;
-   vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+   init_vm_flags(vma, (vm_flags | mm->def_flags |
+ VM_DONTEXPAND | VM_SOFTDIRTY) & ~VM_LOCKED_MASK);
vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
 
vma->vm_ops = ops;
diff --git a/mm/mremap.c b/mm/mremap.c
index 1b3ee02bead7..35db9752c

[PATCH v2 0/6] introduce vm_flags modifier functions

2023-01-25 Thread Suren Baghdasaryan
This patchset was originally published as a part of per-VMA locking [1] and
was split out after a suggestion that it's viable on its own and to facilitate
the review process. It is now a prerequisite for the next version of the
per-VMA lock patchset, which reuses the vm_flags modifier functions to lock
the VMA when vm_flags are being updated.

VMA vm_flags modifications are usually done under exclusive mmap_lock
protection because this attribute affects other decisions like VMA merging
or splitting, and races should be prevented. Introduce vm_flags modifier
functions to enforce correct locking.

[1] https://lore.kernel.org/all/20230109205336.3665937-1-sur...@google.com/

The patchset applies cleanly over mm-unstable branch of mm tree.

My apologies for an extremely large distribution list. The patch touches
lots of files and many are in arch/ and drivers/.

Suren Baghdasaryan (6):
  mm: introduce vma->vm_flags modifier functions
  mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK
  mm: replace vma->vm_flags direct modifications with modifier calls
  mm: replace vma->vm_flags indirect modification in ksm_madvise
  mm: introduce mod_vm_flags_nolock and use it in untrack_pfn
  mm: export dump_mm()

 arch/arm/kernel/process.c |  2 +-
 arch/ia64/mm/init.c   |  8 +--
 arch/loongarch/include/asm/tlb.h  |  2 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c|  5 +-
 arch/powerpc/kvm/book3s_xive_native.c |  2 +-
 arch/powerpc/mm/book3s64/subpage_prot.c   |  2 +-
 arch/powerpc/platforms/book3s/vas-api.c   |  2 +-
 arch/powerpc/platforms/cell/spufs/file.c  | 14 ++---
 arch/s390/mm/gmap.c   |  8 +--
 arch/x86/entry/vsyscall/vsyscall_64.c |  2 +-
 arch/x86/kernel/cpu/sgx/driver.c  |  2 +-
 arch/x86/kernel/cpu/sgx/virt.c|  2 +-
 arch/x86/mm/pat/memtype.c | 14 +++--
 arch/x86/um/mem_32.c  |  2 +-
 drivers/acpi/pfr_telemetry.c  |  2 +-
 drivers/android/binder.c  |  3 +-
 drivers/char/mspec.c  |  2 +-
 drivers/crypto/hisilicon/qm.c |  2 +-
 drivers/dax/device.c  |  2 +-
 drivers/dma/idxd/cdev.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c   |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c |  4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_events.c   |  4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_process.c  |  4 +-
 drivers/gpu/drm/drm_gem.c |  2 +-
 drivers/gpu/drm/drm_gem_dma_helper.c  |  3 +-
 drivers/gpu/drm/drm_gem_shmem_helper.c|  2 +-
 drivers/gpu/drm/drm_vm.c  |  8 +--
 drivers/gpu/drm/etnaviv/etnaviv_gem.c |  2 +-
 drivers/gpu/drm/exynos/exynos_drm_gem.c   |  4 +-
 drivers/gpu/drm/gma500/framebuffer.c  |  2 +-
 drivers/gpu/drm/i810/i810_dma.c   |  2 +-
 drivers/gpu/drm/i915/gem/i915_gem_mman.c  |  4 +-
 drivers/gpu/drm/mediatek/mtk_drm_gem.c|  2 +-
 drivers/gpu/drm/msm/msm_gem.c |  2 +-
 drivers/gpu/drm/omapdrm/omap_gem.c|  3 +-
 drivers/gpu/drm/rockchip/rockchip_drm_gem.c   |  3 +-
 drivers/gpu/drm/tegra/gem.c   |  5 +-
 drivers/gpu/drm/ttm/ttm_bo_vm.c   |  3 +-
 drivers/gpu/drm/virtio/virtgpu_vram.c |  2 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c  |  2 +-
 drivers/gpu/drm/xen/xen_drm_front_gem.c   |  3 +-
 drivers/hsi/clients/cmt_speech.c  |  2 +-
 drivers/hwtracing/intel_th/msu.c  |  2 +-
 drivers/hwtracing/stm/core.c  |  2 +-
 drivers/infiniband/hw/hfi1/file_ops.c |  4 +-
 drivers/infiniband/hw/mlx5/main.c |  4 +-
 drivers/infiniband/hw/qib/qib_file_ops.c  | 13 +++--
 drivers/infiniband/hw/usnic/usnic_ib_verbs.c  |  2 +-
 .../infiniband/hw/vmw_pvrdma/pvrdma_verbs.c   |  2 +-
 .../common/videobuf2/videobuf2-dma-contig.c   |  2 +-
 .../common/videobuf2/videobuf2-vmalloc.c  |  2 +-
 drivers/media/v4l2-core/videobuf-dma-contig.c |  2 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c |  4 +-
 drivers/media/v4l2-core/videobuf-vmalloc.c|  2 +-
 drivers/misc/cxl/context.c|  2 +-
 drivers/misc/habanalabs/common/memory.c   |  2 +-
 drivers/misc/habanalabs/gaudi/gaudi.c |  4 +-
 drivers/misc/habanalabs/gaudi2/gaudi2.c   |  8 +--
 drivers/misc/habanalabs/goya/goya.c   |  4 +-
 drivers/misc/ocxl/context.c   |  4 +-
 drivers/misc/ocxl/sysfs.c |  2 +-
 drivers/misc/open-dice.c  |  4 +-
 drivers/misc/sgi-gru/grufile.c|  4 +-
 drivers/misc/uacce/uacce.c|  2 +-
 drivers/sbus/char/oradax.c|  2 +-
 drivers/scsi/cxlflash/ocxl_hw.c   |  2 +-
 drivers/

[PATCH v2 6/6] mm: export dump_mm()

2023-01-25 Thread Suren Baghdasaryan
mmap_assert_write_locked() is used in the vm_flags modifier functions.
Because mmap_assert_write_locked() uses dump_mm() and vm_flags are
sometimes modified from inside a module, it is necessary to export the
dump_mm() function.

Signed-off-by: Suren Baghdasaryan 
---
 mm/debug.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/debug.c b/mm/debug.c
index 9d3d893dc7f4..96d594e16292 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -215,6 +215,7 @@ void dump_mm(const struct mm_struct *mm)
	mm->def_flags, &mm->def_flags
);
 }
+EXPORT_SYMBOL(dump_mm);
 
 static bool page_init_poisoning __read_mostly = true;
 
-- 
2.39.1



[PATCH v2 5/6] mm: introduce mod_vm_flags_nolock and use it in untrack_pfn

2023-01-25 Thread Suren Baghdasaryan
In cases when VMA flags are modified after the VMA has been isolated and
mmap_lock has been downgraded, flag modifications would trigger an
assertion failure because the mmap write lock is no longer held.
Introduce mod_vm_flags_nolock to be used in such situations.
Pass a hint to untrack_pfn to conditionally use mod_vm_flags_nolock for
flags modification and to avoid assertion.

Signed-off-by: Suren Baghdasaryan 
---
 arch/x86/mm/pat/memtype.c | 10 +++---
 include/linux/mm.h| 12 +---
 include/linux/pgtable.h   |  5 +++--
 mm/memory.c   | 13 +++--
 mm/memremap.c |  4 ++--
 mm/mmap.c | 16 ++--
 6 files changed, 38 insertions(+), 22 deletions(-)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index ae9645c900fa..d8adc0b42cf2 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -1046,7 +1046,7 @@ void track_pfn_insert(struct vm_area_struct *vma, 
pgprot_t *prot, pfn_t pfn)
  * can be for the entire vma (in which case pfn, size are zero).
  */
 void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
-unsigned long size)
+unsigned long size, bool mm_wr_locked)
 {
resource_size_t paddr;
unsigned long prot;
@@ -1065,8 +1065,12 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned 
long pfn,
size = vma->vm_end - vma->vm_start;
}
free_pfn_range(paddr, size);
-   if (vma)
-   clear_vm_flags(vma, VM_PAT);
+   if (vma) {
+   if (mm_wr_locked)
+   clear_vm_flags(vma, VM_PAT);
+   else
+   mod_vm_flags_nolock(vma, 0, VM_PAT);
+   }
 }
 
 /*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 55335edd1373..48d49930c411 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -656,12 +656,18 @@ static inline void clear_vm_flags(struct vm_area_struct 
*vma,
vma->vm_flags &= ~flags;
 }
 
+static inline void mod_vm_flags_nolock(struct vm_area_struct *vma,
+  unsigned long set, unsigned long clear)
+{
+   vma->vm_flags |= set;
+   vma->vm_flags &= ~clear;
+}
+
 static inline void mod_vm_flags(struct vm_area_struct *vma,
unsigned long set, unsigned long clear)
 {
mmap_assert_write_locked(vma->vm_mm);
-   vma->vm_flags |= set;
-   vma->vm_flags &= ~clear;
+   mod_vm_flags_nolock(vma, set, clear);
 }
 
 static inline void vma_set_anonymous(struct vm_area_struct *vma)
@@ -2087,7 +2093,7 @@ static inline void zap_vma_pages(struct vm_area_struct 
*vma)
 }
 void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
struct vm_area_struct *start_vma, unsigned long start,
-   unsigned long end);
+   unsigned long end, bool mm_wr_locked);
 
 struct mmu_notifier_range;
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5fd45454c073..c63cd44777ec 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1185,7 +1185,8 @@ static inline int track_pfn_copy(struct vm_area_struct 
*vma)
  * can be for the entire vma (in which case pfn, size are zero).
  */
 static inline void untrack_pfn(struct vm_area_struct *vma,
-  unsigned long pfn, unsigned long size)
+  unsigned long pfn, unsigned long size,
+  bool mm_wr_locked)
 {
 }
 
@@ -1203,7 +1204,7 @@ extern void track_pfn_insert(struct vm_area_struct *vma, 
pgprot_t *prot,
 pfn_t pfn);
 extern int track_pfn_copy(struct vm_area_struct *vma);
 extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
-   unsigned long size);
+   unsigned long size, bool mm_wr_locked);
 extern void untrack_pfn_moved(struct vm_area_struct *vma);
 #endif
 
diff --git a/mm/memory.c b/mm/memory.c
index d6902065e558..5b11b50e2c4a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1613,7 +1613,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 static void unmap_single_vma(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start_addr,
unsigned long end_addr,
-   struct zap_details *details)
+   struct zap_details *details, bool mm_wr_locked)
 {
unsigned long start = max(vma->vm_start, start_addr);
unsigned long end;
@@ -1628,7 +1628,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
uprobe_munmap(vma, start, end);
 
if (unlikely(vma->vm_flags & VM_PFNMAP))
-   untrack_pfn(vma, 0, 0);
+   untrack_pfn(vma, 0, 0, mm_wr_locked);
 
if (start != end) {
if (unlikely(is_vm_hugetlb_page(vma))) {
@@ -1675,7 +1675,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
  */
 void unmap_vmas

Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-20 Thread Suren Baghdasaryan
On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox  wrote:
>
> On Fri, Jan 20, 2023 at 09:17:46AM -0800, Suren Baghdasaryan wrote:
> > On Fri, Jan 20, 2023 at 9:08 AM Liam R. Howlett  
> > wrote:
> > >
> > > * Matthew Wilcox  [230120 11:50]:
> > > > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > > > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan 
> > > > >  wrote:
> > > > > >
> > > > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko  
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > > > > call_rcu() can take a long time when callback offloading is 
> > > > > > > > > > enabled.
> > > > > > > > > > Its use in the vm_area_free can cause regressions in the 
> > > > > > > > > > exit path when
> > > > > > > > > > multiple VMAs are being freed. To minimize that impact, 
> > > > > > > > > > place VMAs into
> > > > > > > > > > a list and free them in groups using one call_rcu() call 
> > > > > > > > > > per group.
> > > > > > > > >
> > > > > > > > > After some more clarification I can understand how call_rcu 
> > > > > > > > > might not be
> > > > > > > > > super happy about thousands of callbacks to be invoked and I 
> > > > > > > > > do agree
> > > > > > > > > that this is not really optimal.
> > > > > > > > >
> > > > > > > > > On the other hand I do not like this solution much either.
> > > > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help 
> > > > > > > > > all that
> > > > > > > > > much with processes with a huge number of vmas either. It 
> > > > > > > > > would still be
> > > > > > > > > in housands of callbacks to be scheduled without a good 
> > > > > > > > > reason.
> > > > > > > > >
> > > > > > > > > Instead, are there any other cases than remove_vma that need 
> > > > > > > > > this
> > > > > > > > > batching? We could easily just link all the vmas into linked 
> > > > > > > > > list and
> > > > > > > > > use a single call_rcu instead, no? This would both simplify 
> > > > > > > > > the
> > > > > > > > > implementation, remove the scaling issue as well and we do 
> > > > > > > > > not have to
> > > > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or 
> > > > > > > > > epsilon + 1.
> > > > > > > >
> > > > > > > > Yes, I agree the solution is not stellar. I wanted something 
> > > > > > > > simple
> > > > > > > > but this is probably too simple. OTOH keeping all dead 
> > > > > > > > vm_area_structs
> > > > > > > > on the list without hooking up a shrinker (additional 
> > > > > > > > complexity) does
> > > > > > > > not sound too appealing either.
> > > > > > >
> > > > > > > I suspect you have missed my idea. I do not really want to keep 
> > > > > > > the list
> > > > > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > > > > remove_vma and then call_rcu the whole list at once after the 
> > > > > > > whole list
> > > > > > > (be it from exit_mmap or remove_mt). See?
> > > > > >
> > > > > > Yes, I understood your idea but keeping dead objects until the 
> > > > > > process
> > > > > > exits even when the system is low on memory (no shrinkers attached)
> > > > > > seems too wastef

Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-20 Thread Suren Baghdasaryan
On Fri, Jan 20, 2023 at 9:21 AM Paul E. McKenney  wrote:
>
> On Fri, Jan 20, 2023 at 04:49:42PM +, Matthew Wilcox wrote:
> > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan  
> > > wrote:
> > > >
> > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko  wrote:
> > > > >
> > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > > call_rcu() can take a long time when callback offloading is 
> > > > > > > > enabled.
> > > > > > > > Its use in the vm_area_free can cause regressions in the exit 
> > > > > > > > path when
> > > > > > > > multiple VMAs are being freed. To minimize that impact, place 
> > > > > > > > VMAs into
> > > > > > > > a list and free them in groups using one call_rcu() call per 
> > > > > > > > group.
> > > > > > >
> > > > > > > After some more clarification I can understand how call_rcu might 
> > > > > > > not be
> > > > > > > super happy about thousands of callbacks to be invoked and I do 
> > > > > > > agree
> > > > > > > that this is not really optimal.
> > > > > > >
> > > > > > > On the other hand I do not like this solution much either.
> > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all 
> > > > > > > that
> > > > > > > much with processes with a huge number of vmas either. It would 
> > > > > > > still be
> > > > > > > in housands of callbacks to be scheduled without a good reason.
> > > > > > >
> > > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > > batching? We could easily just link all the vmas into linked list 
> > > > > > > and
> > > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > > implementation, remove the scaling issue as well and we do not 
> > > > > > > have to
> > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon 
> > > > > > > + 1.
> > > > > >
> > > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > > but this is probably too simple. OTOH keeping all dead 
> > > > > > vm_area_structs
> > > > > > on the list without hooking up a shrinker (additional complexity) 
> > > > > > does
> > > > > > not sound too appealing either.
> > > > >
> > > > > I suspect you have missed my idea. I do not really want to keep the 
> > > > > list
> > > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > > remove_vma and then call_rcu the whole list at once after the whole 
> > > > > list
> > > > > (be it from exit_mmap or remove_mt). See?
> > > >
> > > > Yes, I understood your idea but keeping dead objects until the process
> > > > exits even when the system is low on memory (no shrinkers attached)
> > > > seems too wasteful. If we do this I would advocate for attaching a
> > > > shrinker.
> > >
> > > Maybe even simpler, since we are hit with this VMA freeing flood
> > > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > > vm_area_free to batch the destruction and all other cases call
> > > call_rcu()? I don't think there will be other cases of VMA destruction
> > > floods.
> >
> > ... or have two different call_rcu functions; one for munmap() and
> > one for exit.  It'd be nice to use kmem_cache_free_bulk().
>
> Good point, kfree_rcu(p, r) where "r" is the name of the rcu_head
> structure's field, is much more cache-efficient.
>
> The penalty is that there is no callback function to do any cleanup.
> There is just a kfree()/kvfree (bulk version where applicable),
> nothing else.

If Liam's suggestion works then we won't need anything additional. We
will free the vm_area_structs directly on process exit and use
call_rcu() in all other cases. Let's see if Michal knows of any case
which still needs an RCU grace period during exit_mmap.
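
Something along these lines (a sketch; the name and the "unreachable"
hint are placeholders, not an existing API):

void vm_area_free_hinted(struct vm_area_struct *vma, bool unreachable)
{
	if (unreachable) {
		/* exit_mmap(): the mm is going away, no concurrent lookups */
		free_anon_vma_name(vma);
		kmem_cache_free(vm_area_cachep, vma);
	} else {
		/* munmap() etc.: defer freeing past the RCU grace period */
		call_rcu(&vma->vm_rcu, __vm_area_free);
	}
}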

>
> Thanx, Paul
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kernel-team+unsubscr...@android.com.
>


Re: [PATCH v2 1/6] mm: introduce vma->vm_flags modifier functions

2023-01-25 Thread Suren Baghdasaryan
On Wed, Jan 25, 2023 at 10:33 AM Matthew Wilcox  wrote:
>
> On Wed, Jan 25, 2023 at 12:38:46AM -0800, Suren Baghdasaryan wrote:
> > +/* Use when VMA is not part of the VMA tree and needs no locking */
> > +static inline void init_vm_flags(struct vm_area_struct *vma,
> > +  unsigned long flags)
> > +{
> > + vma->vm_flags = flags;
>
> vm_flags are supposed to have type vm_flags_t.  That's not been
> fully realised yet, but perhaps we could avoid making it worse?
>
> >   pgprot_t vm_page_prot;
> > - unsigned long vm_flags; /* Flags, see mm.h. */
> > +
> > + /*
> > +  * Flags, see mm.h.
> > +  * WARNING! Do not modify directly.
> > +  * Use {init|reset|set|clear|mod}_vm_flags() functions instead.
> > +  */
> > + unsigned long vm_flags;
>
> Including changing this line to vm_flags_t

Good point. Will make the change. Thanks!
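
For reference, the change would look roughly like this (vm_flags_t is
already typedef'ed to unsigned long in mm_types.h; sketch only):

	/* in struct vm_area_struct */
	vm_flags_t vm_flags;

/* ...and the wrappers take vm_flags_t as well */
static inline void init_vm_flags(struct vm_area_struct *vma,
				 vm_flags_t flags)
{
	vma->vm_flags = flags;
}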


Re: [PATCH v3 2/7] mm: introduce vma->vm_flags wrapper functions

2023-01-25 Thread Suren Baghdasaryan
On Wed, Jan 25, 2023 at 4:24 PM Andrew Morton  wrote:
>
> On Wed, 25 Jan 2023 15:35:49 -0800 Suren Baghdasaryan  
> wrote:
>
> > vm_flags are among VMA attributes which affect decisions like VMA merging
> > and splitting. Therefore all vm_flags modifications are performed after
> > taking exclusive mmap_lock to prevent vm_flags updates racing with such
> > operations. Introduce modifier functions for vm_flags to be used whenever
> > flags are updated. This way we can better check and control correct
> > locking behavior during these updates.
> >
> > ...
> >
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > +static inline void init_vm_flags(struct vm_area_struct *vma,
> > +static inline void reset_vm_flags(struct vm_area_struct *vma,
> > +static inline void set_vm_flags(struct vm_area_struct *vma,
> > +static inline void clear_vm_flags(struct vm_area_struct *vma,
> > +static inline void mod_vm_flags(struct vm_area_struct *vma,
>
> vm_flags_init(), vm_flags_reset(), etc?
>
> This would be more idiomatic and I do think the most-significant-first
> naming style is preferable.

Thanks for the suggestion! I will rename them in the next version.

>
> --
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kernel-team+unsubscr...@android.com.
>


[PATCH v3 0/7] introduce vm_flags modifier functions

2023-01-25 Thread Suren Baghdasaryan
This patchset was originally published as a part of per-VMA locking [1] and
was split out after a suggestion that it's viable on its own and to facilitate
the review process. It is now a prerequisite for the next version of the
per-VMA lock patchset, which reuses vm_flags modifier functions to lock the
VMA when vm_flags are being updated.

VMA vm_flags modifications are usually done under exclusive mmap_lock
protection because this attribute affects other decisions like VMA merging
or splitting, and races should be prevented. Introduce vm_flags modifier
functions to enforce correct locking.

The patchset applies cleanly over mm-unstable branch of mm tree.

My apologies for an extremely large distribution list. The patch touches
lots of files and many are in arch/ and drivers/.

Changes since v2 [2]
- Add an additional patch to convert vma assignment to a memcpy.
- Change vm_flags to a union of __private and const fields,
per Peter Zijlstra and Matthew Wilcox.
- Changed vm_flags type to vm_flags_t, per Matthew Wilcox.
- Removed unnecessary BUG_ON's from ksm_madvise and hugepage_madvise,
per Michal Hocko.
- Documented mod_vm_flags_nolock usage, per Michal Hocko.

[1] https://lore.kernel.org/all/20230109205336.3665937-1-sur...@google.com/
[2] https://lore.kernel.org/lkml/20230125083851.27759-1-sur...@google.com/

Suren Baghdasaryan (7):
  kernel/fork: convert vma assignment to a memcpy
  mm: introduce vma->vm_flags wrapper functions
  mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK
  mm: replace vma->vm_flags direct modifications with modifier calls
  mm: replace vma->vm_flags indirect modification in ksm_madvise
  mm: introduce mod_vm_flags_nolock and use it in untrack_pfn
  mm: export dump_mm()

 arch/arm/kernel/process.c |  2 +-
 arch/ia64/mm/init.c   |  8 +--
 arch/loongarch/include/asm/tlb.h  |  2 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c|  5 +-
 arch/powerpc/kvm/book3s_xive_native.c |  2 +-
 arch/powerpc/mm/book3s64/subpage_prot.c   |  2 +-
 arch/powerpc/platforms/book3s/vas-api.c   |  2 +-
 arch/powerpc/platforms/cell/spufs/file.c  | 14 ++---
 arch/s390/mm/gmap.c   |  8 ++-
 arch/x86/entry/vsyscall/vsyscall_64.c |  2 +-
 arch/x86/kernel/cpu/sgx/driver.c  |  2 +-
 arch/x86/kernel/cpu/sgx/virt.c|  2 +-
 arch/x86/mm/pat/memtype.c | 14 +++--
 arch/x86/um/mem_32.c  |  2 +-
 drivers/acpi/pfr_telemetry.c  |  2 +-
 drivers/android/binder.c  |  3 +-
 drivers/char/mspec.c  |  2 +-
 drivers/crypto/hisilicon/qm.c |  2 +-
 drivers/dax/device.c  |  2 +-
 drivers/dma/idxd/cdev.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c   |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c |  4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_events.c   |  4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_process.c  |  4 +-
 drivers/gpu/drm/drm_gem.c |  2 +-
 drivers/gpu/drm/drm_gem_dma_helper.c  |  3 +-
 drivers/gpu/drm/drm_gem_shmem_helper.c|  2 +-
 drivers/gpu/drm/drm_vm.c  |  8 +--
 drivers/gpu/drm/etnaviv/etnaviv_gem.c |  2 +-
 drivers/gpu/drm/exynos/exynos_drm_gem.c   |  4 +-
 drivers/gpu/drm/gma500/framebuffer.c  |  2 +-
 drivers/gpu/drm/i810/i810_dma.c   |  2 +-
 drivers/gpu/drm/i915/gem/i915_gem_mman.c  |  4 +-
 drivers/gpu/drm/mediatek/mtk_drm_gem.c|  2 +-
 drivers/gpu/drm/msm/msm_gem.c |  2 +-
 drivers/gpu/drm/omapdrm/omap_gem.c|  3 +-
 drivers/gpu/drm/rockchip/rockchip_drm_gem.c   |  3 +-
 drivers/gpu/drm/tegra/gem.c   |  5 +-
 drivers/gpu/drm/ttm/ttm_bo_vm.c   |  3 +-
 drivers/gpu/drm/virtio/virtgpu_vram.c |  2 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c  |  2 +-
 drivers/gpu/drm/xen/xen_drm_front_gem.c   |  3 +-
 drivers/hsi/clients/cmt_speech.c  |  2 +-
 drivers/hwtracing/intel_th/msu.c  |  2 +-
 drivers/hwtracing/stm/core.c  |  2 +-
 drivers/infiniband/hw/hfi1/file_ops.c |  4 +-
 drivers/infiniband/hw/mlx5/main.c |  4 +-
 drivers/infiniband/hw/qib/qib_file_ops.c  | 13 ++---
 drivers/infiniband/hw/usnic/usnic_ib_verbs.c  |  2 +-
 .../infiniband/hw/vmw_pvrdma/pvrdma_verbs.c   |  2 +-
 .../common/videobuf2/videobuf2-dma-contig.c   |  2 +-
 .../common/videobuf2/videobuf2-vmalloc.c  |  2 +-
 drivers/media/v4l2-core/videobuf-dma-contig.c |  2 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c |  4 +-
 drivers/media/v4l2-core/videobuf-vmalloc.c|  2 +-
 drivers/misc/cxl/context.c|  2 +-
 drivers/misc/habanalabs/common/memory.c   |  2 +-
 drivers/misc/habanalabs/gaudi/gaudi.c |  4 +-
 dri

Re: [PATCH v3 1/7] kernel/fork: convert vma assignment to a memcpy

2023-01-25 Thread Suren Baghdasaryan
On Wed, Jan 25, 2023 at 4:22 PM Andrew Morton  wrote:
>
> On Wed, 25 Jan 2023 15:35:48 -0800 Suren Baghdasaryan  
> wrote:
>
> > Convert vma assignment in vm_area_dup() to a memcpy() to prevent compiler
> > errors when we add a const modifier to vma->vm_flags.
> >
> > ...
> >
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -482,7 +482,7 @@ struct vm_area_struct *vm_area_dup(struct 
> > vm_area_struct *orig)
> >* orig->shared.rb may be modified concurrently, but the clone
> >* will be reinitialized.
> >*/
> > - *new = data_race(*orig);
> > + memcpy(new, orig, sizeof(*new));
>
> The data_race() removal is unchangelogged?

True. I'll add a note in the changelog about that. Ideally I would
like to preserve it but I could not find a way to do that.

>
> >   INIT_LIST_HEAD(&new->anon_vma_chain);
> >   dup_anon_vma_name(orig, new);
> >   }
>


[PATCH v3 4/7] mm: replace vma->vm_flags direct modifications with modifier calls

2023-01-25 Thread Suren Baghdasaryan
Replace direct modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.
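A representative example of the mechanical pattern applied throughout the
series (flag names vary per call site; this hunk is illustrative, not taken
from any specific file):

-	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_DONTEXPAND | VM_DONTDUMP);

-	vma->vm_flags &= ~VM_MAYWRITE;
+	clear_vm_flags(vma, VM_MAYWRITE);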

Signed-off-by: Suren Baghdasaryan 
Acked-by: Michal Hocko 
---
 arch/arm/kernel/process.c  |  2 +-
 arch/ia64/mm/init.c|  8 
 arch/loongarch/include/asm/tlb.h   |  2 +-
 arch/powerpc/kvm/book3s_xive_native.c  |  2 +-
 arch/powerpc/mm/book3s64/subpage_prot.c|  2 +-
 arch/powerpc/platforms/book3s/vas-api.c|  2 +-
 arch/powerpc/platforms/cell/spufs/file.c   | 14 +++---
 arch/s390/mm/gmap.c|  3 +--
 arch/x86/entry/vsyscall/vsyscall_64.c  |  2 +-
 arch/x86/kernel/cpu/sgx/driver.c   |  2 +-
 arch/x86/kernel/cpu/sgx/virt.c |  2 +-
 arch/x86/mm/pat/memtype.c  |  6 +++---
 arch/x86/um/mem_32.c   |  2 +-
 drivers/acpi/pfr_telemetry.c   |  2 +-
 drivers/android/binder.c   |  3 +--
 drivers/char/mspec.c   |  2 +-
 drivers/crypto/hisilicon/qm.c  |  2 +-
 drivers/dax/device.c   |  2 +-
 drivers/dma/idxd/cdev.c|  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c|  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c   |  4 ++--
 drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c  |  4 ++--
 drivers/gpu/drm/amd/amdkfd/kfd_events.c|  4 ++--
 drivers/gpu/drm/amd/amdkfd/kfd_process.c   |  4 ++--
 drivers/gpu/drm/drm_gem.c  |  2 +-
 drivers/gpu/drm/drm_gem_dma_helper.c   |  3 +--
 drivers/gpu/drm/drm_gem_shmem_helper.c |  2 +-
 drivers/gpu/drm/drm_vm.c   |  8 
 drivers/gpu/drm/etnaviv/etnaviv_gem.c  |  2 +-
 drivers/gpu/drm/exynos/exynos_drm_gem.c|  4 ++--
 drivers/gpu/drm/gma500/framebuffer.c   |  2 +-
 drivers/gpu/drm/i810/i810_dma.c|  2 +-
 drivers/gpu/drm/i915/gem/i915_gem_mman.c   |  4 ++--
 drivers/gpu/drm/mediatek/mtk_drm_gem.c |  2 +-
 drivers/gpu/drm/msm/msm_gem.c  |  2 +-
 drivers/gpu/drm/omapdrm/omap_gem.c |  3 +--
 drivers/gpu/drm/rockchip/rockchip_drm_gem.c|  3 +--
 drivers/gpu/drm/tegra/gem.c|  5 ++---
 drivers/gpu/drm/ttm/ttm_bo_vm.c|  3 +--
 drivers/gpu/drm/virtio/virtgpu_vram.c  |  2 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c   |  2 +-
 drivers/gpu/drm/xen/xen_drm_front_gem.c|  3 +--
 drivers/hsi/clients/cmt_speech.c   |  2 +-
 drivers/hwtracing/intel_th/msu.c   |  2 +-
 drivers/hwtracing/stm/core.c   |  2 +-
 drivers/infiniband/hw/hfi1/file_ops.c  |  4 ++--
 drivers/infiniband/hw/mlx5/main.c  |  4 ++--
 drivers/infiniband/hw/qib/qib_file_ops.c   | 13 ++---
 drivers/infiniband/hw/usnic/usnic_ib_verbs.c   |  2 +-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c|  2 +-
 .../media/common/videobuf2/videobuf2-dma-contig.c  |  2 +-
 drivers/media/common/videobuf2/videobuf2-vmalloc.c |  2 +-
 drivers/media/v4l2-core/videobuf-dma-contig.c  |  2 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c  |  4 ++--
 drivers/media/v4l2-core/videobuf-vmalloc.c |  2 +-
 drivers/misc/cxl/context.c |  2 +-
 drivers/misc/habanalabs/common/memory.c|  2 +-
 drivers/misc/habanalabs/gaudi/gaudi.c  |  4 ++--
 drivers/misc/habanalabs/gaudi2/gaudi2.c|  8 
 drivers/misc/habanalabs/goya/goya.c|  4 ++--
 drivers/misc/ocxl/context.c|  4 ++--
 drivers/misc/ocxl/sysfs.c  |  2 +-
 drivers/misc/open-dice.c   |  4 ++--
 drivers/misc/sgi-gru/grufile.c |  4 ++--
 drivers/misc/uacce/uacce.c |  2 +-
 drivers/sbus/char/oradax.c |  2 +-
 drivers/scsi/cxlflash/ocxl_hw.c|  2 +-
 drivers/scsi/sg.c  |  2 +-
 drivers/staging/media/atomisp/pci/hmm/hmm_bo.c |  2 +-
 drivers/staging/media/deprecated/meye/meye.c   |  4 ++--
 .../media/deprecated/stkwebcam/stk-webcam.c|  2 +-
 drivers/target/target_core_user.c  |  2 +-
 drivers/uio/uio.c  |  2 +-
 drivers/usb/core/devio.c   |  3 +--
 drivers/usb/mon/mon_bin.c  |  3 +--
 drivers/vdpa/vdpa_user/iova_domain.c   |  2 +-
 drivers/vfio/pci/vfio_pci_core.c   |  2 +-
 drivers/vhost/vdpa.c   |  2 +-
 drivers/vi

[PATCH v3 3/7] mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK

2023-01-25 Thread Suren Baghdasaryan
To simplify the usage of VM_LOCKED_CLEAR_MASK in clear_vm_flags(),
replace it with the VM_LOCKED_MASK bitmask and convert all users.

Signed-off-by: Suren Baghdasaryan 
Acked-by: Michal Hocko 
---
 include/linux/mm.h | 4 ++--
 kernel/fork.c  | 2 +-
 mm/hugetlb.c   | 4 ++--
 mm/mlock.c | 6 +++---
 mm/mmap.c  | 6 +++---
 mm/mremap.c| 2 +-
 6 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf16ddd544a5..cccbc2811827 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -421,8 +421,8 @@ extern unsigned int kobjsize(const void *objp);
 /* This mask defines which mm->def_flags a process can inherit its parent */
 #define VM_INIT_DEF_MASK   VM_NOHUGEPAGE
 
-/* This mask is used to clear all the VMA flags used by mlock */
-#define VM_LOCKED_CLEAR_MASK   (~(VM_LOCKED | VM_LOCKONFAULT))
+/* This mask represents all the VMA flag bits used by mlock */
+#define VM_LOCKED_MASK (VM_LOCKED | VM_LOCKONFAULT)
 
 /* Arch-specific flags to clear when updating VM flags on protection change */
 #ifndef VM_ARCH_CLEAR
diff --git a/kernel/fork.c b/kernel/fork.c
index a531901859d9..4f097999a570 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -669,7 +669,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
tmp->anon_vma = NULL;
} else if (anon_vma_fork(tmp, mpnt))
goto fail_nomem_anon_vma_fork;
-   tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
+   clear_vm_flags(tmp, VM_LOCKED_MASK);
file = tmp->vm_file;
if (file) {
struct address_space *mapping = file->f_mapping;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d20c8b09890e..4ecdbad9a451 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6973,8 +6973,8 @@ static unsigned long page_table_shareable(struct 
vm_area_struct *svma,
unsigned long s_end = sbase + PUD_SIZE;
 
/* Allow segments to share if only one is marked locked */
-   unsigned long vm_flags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
-   unsigned long svm_flags = svma->vm_flags & VM_LOCKED_CLEAR_MASK;
+   unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED_MASK;
+   unsigned long svm_flags = svma->vm_flags & ~VM_LOCKED_MASK;
 
/*
 * match the virtual addresses, permission and the alignment of the
diff --git a/mm/mlock.c b/mm/mlock.c
index 0336f52e03d7..5c4fff93cd6b 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -497,7 +497,7 @@ static int apply_vma_lock_flags(unsigned long start, size_t 
len,
if (vma->vm_start != tmp)
return -ENOMEM;
 
-   newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
+   newflags = vma->vm_flags & ~VM_LOCKED_MASK;
newflags |= flags;
/* Here we know that  vma->vm_start <= nstart < vma->vm_end. */
tmp = vma->vm_end;
@@ -661,7 +661,7 @@ static int apply_mlockall_flags(int flags)
struct vm_area_struct *vma, *prev = NULL;
vm_flags_t to_add = 0;
 
-   current->mm->def_flags &= VM_LOCKED_CLEAR_MASK;
+   current->mm->def_flags &= ~VM_LOCKED_MASK;
if (flags & MCL_FUTURE) {
current->mm->def_flags |= VM_LOCKED;
 
@@ -681,7 +681,7 @@ static int apply_mlockall_flags(int flags)
for_each_vma(vmi, vma) {
vm_flags_t newflags;
 
-   newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
+   newflags = vma->vm_flags & ~VM_LOCKED_MASK;
newflags |= to_add;
 
/* Ignore errors */
diff --git a/mm/mmap.c b/mm/mmap.c
index d4abc6feced1..323bd253b25a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2671,7 +2671,7 @@ unsigned long mmap_region(struct file *file, unsigned 
long addr,
if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
is_vm_hugetlb_page(vma) ||
vma == get_gate_vma(current->mm))
-   vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+   clear_vm_flags(vma, VM_LOCKED_MASK);
else
mm->locked_vm += (len >> PAGE_SHIFT);
}
@@ -3340,8 +3340,8 @@ static struct vm_area_struct *__install_special_mapping(
vma->vm_start = addr;
vma->vm_end = addr + len;
 
-   vma->vm_flags = vm_flags | mm->def_flags | VM_DONTEXPAND | VM_SOFTDIRTY;
-   vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+   init_vm_flags(vma, (vm_flags | mm->def_flags |
+ VM_DONTEXPAND | VM_SOFTDIRTY) & ~VM_LOCKED_MASK);
vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
 
vma->vm_ops = ops;
diff --git a/mm/mremap.c b/mm/mremap.c
index 1b3e

Re: [PATCH v2 3/6] mm: replace vma->vm_flags direct modifications with modifier calls

2023-01-25 Thread Suren Baghdasaryan
On Wed, Jan 25, 2023 at 1:30 AM 'Michal Hocko' via kernel-team
 wrote:
>
> On Wed 25-01-23 00:38:48, Suren Baghdasaryan wrote:
> > Replace direct modifications to vma->vm_flags with calls to modifier
> > functions to be able to track flag changes and to keep vma locking
> > correctness.
>
> Is this manual (git grep)-based work or have you used Coccinelle for
> the patch generation?

It was a manual "search and replace" and in the process I temporarily
renamed vm_flags to ensure I did not miss any usage.

>
> My potentially incomplete check
> $ git grep ">[[:space:]]*vm_flags[[:space:]]*[&|^]="
>
> shows that nothing should be left after this. There are still quite a lot
> of direct checks of the flags (more than 600). Maybe it would be good to
> make the flags accessible only via accessors, which would also prevent any
> future direct setting of those flags in an uncontrolled way.

Yes, I think Peter's suggestion in the first patch would also require
that. Much more churn but probably worth it for the future
maintenance. I'll add a patch which converts all readers as well.

>
> Anyway
> Acked-by: Michal Hocko 

Thanks for all the reviews!

> --
> Michal Hocko
> SUSE Labs
>


Re: [PATCH v2 1/6] mm: introduce vma->vm_flags modifier functions

2023-01-25 Thread Suren Baghdasaryan
On Wed, Jan 25, 2023 at 1:10 AM Peter Zijlstra  wrote:
>
> On Wed, Jan 25, 2023 at 12:38:46AM -0800, Suren Baghdasaryan wrote:
>
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 2d6d790d9bed..6c7c70bf50dd 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -491,7 +491,13 @@ struct vm_area_struct {
> >* See vmf_insert_mixed_prot() for discussion.
> >*/
> >   pgprot_t vm_page_prot;
> > - unsigned long vm_flags; /* Flags, see mm.h. */
> > +
> > + /*
> > +  * Flags, see mm.h.
> > +  * WARNING! Do not modify directly.
> > +  * Use {init|reset|set|clear|mod}_vm_flags() functions instead.
> > +  */
> > + unsigned long vm_flags;
>
> We have __private and ACCESS_PRIVATE() to help with enforcing this.

Thanks for pointing this out, Peter! I guess for that I'll need to
convert all read accesses and provide get_vm_flags() too? That will
cause some additional churn (a quick search shows 801 hits over 248
files) but maybe it's worth it? I think Michal suggested that too in
another patch. Should I do that while we are at it?

>


Re: [PATCH v2 1/6] mm: introduce vma->vm_flags modifier functions

2023-01-25 Thread Suren Baghdasaryan
On Wed, Jan 25, 2023 at 10:37 AM Matthew Wilcox  wrote:
>
> On Wed, Jan 25, 2023 at 08:49:50AM -0800, Suren Baghdasaryan wrote:
> > On Wed, Jan 25, 2023 at 1:10 AM Peter Zijlstra  wrote:
> > > > + /*
> > > > +  * Flags, see mm.h.
> > > > +  * WARNING! Do not modify directly.
> > > > +  * Use {init|reset|set|clear|mod}_vm_flags() functions instead.
> > > > +  */
> > > > + unsigned long vm_flags;
> > >
> > > We have __private and ACCESS_PRIVATE() to help with enforcing this.
> >
> > Thanks for pointing this out, Peter! I guess for that I'll need to
> > convert all read accesses and provide get_vm_flags() too? That will
> > cause some additional churn (a quick search shows 801 hits over 248
> > files) but maybe it's worth it? I think Michal suggested that too in
> > another patch. Should I do that while we are at it?
>
> Here's a trick I saw somewhere in the VFS:
>
> union {
> const vm_flags_t vm_flags;
> vm_flags_t __private __vm_flags;
> };
>
> Now it can be read by anybody but written only by those using
> ACCESS_PRIVATE.

Huh, this is quite nice! I think it does not save us from the cases
when vma->vm_flags is passed by reference and modified indirectly,
like in ksm_madvise()? Though maybe such use cases are so rare (I found
only 2 cases) that we can ignore this?
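For reference, a minimal sketch of one way such a by-reference call site
could be handled under this scheme (illustrative only, assuming a
reset_vm_flags()-style helper that overwrites the flags wholesale; not the
exact hunk from the series):

	unsigned long vm_flags = vma->vm_flags;

	/* ksm_madvise() modifies the local copy instead of the VMA... */
	ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
			  MADV_UNMERGEABLE, &vm_flags);
	if (!ret)
		/* ...and the result is written back through a wrapper. */
		reset_vm_flags(vma, vm_flags);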


[PATCH v3 7/7] mm: export dump_mm()

2023-01-25 Thread Suren Baghdasaryan
mmap_assert_write_locked() is used in the vm_flags modifiers. Because
mmap_assert_write_locked() uses dump_mm() and vm_flags are sometimes
modified from inside a module, it is necessary to export the dump_mm()
function.

Signed-off-by: Suren Baghdasaryan 
Acked-by: Michal Hocko 
---
 mm/debug.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/debug.c b/mm/debug.c
index 9d3d893dc7f4..96d594e16292 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -215,6 +215,7 @@ void dump_mm(const struct mm_struct *mm)
mm->def_flags, &mm->def_flags
);
 }
+EXPORT_SYMBOL(dump_mm);
 
 static bool page_init_poisoning __read_mostly = true;
 
-- 
2.39.1



Re: [PATCH v3 2/7] mm: introduce vma->vm_flags wrapper functions

2023-01-25 Thread Suren Baghdasaryan
On Wed, Jan 25, 2023 at 4:28 PM Andrew Morton  wrote:
>
> On Wed, 25 Jan 2023 15:35:49 -0800 Suren Baghdasaryan  
> wrote:
>
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -491,7 +491,15 @@ struct vm_area_struct {
> >* See vmf_insert_mixed_prot() for discussion.
> >*/
> >   pgprot_t vm_page_prot;
> > - unsigned long vm_flags; /* Flags, see mm.h. */
> > +
> > + /*
> > +  * Flags, see mm.h.
> > +  * To modify use {init|reset|set|clear|mod}_vm_flags() functions.
> > +  */
> > + union {
> > + const vm_flags_t vm_flags;
> > + vm_flags_t __private __vm_flags;
> > + };
>
> Typically when making a change like this we'll rename the affected
> field/variable/function/etc.  This will reliably and deliberately break
> unconverted usage sites.
>
> This const trick will get us partway there, by breaking setters.  But
> renaming it will break both setters and getters.

My intent here is to break setters but to allow getters to keep
reading vma->vm_flags directly. We could provide get_vm_flags() and
convert all getters as well but it would introduce a huge additional
churn (800+ hits) with no obvious benefits, I think. Does that clarify
the intent of this trick?
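To make the effect concrete, a small sketch (not from the patch itself) of
what the union buys us:

	unsigned long flags = vma->vm_flags;	/* readers keep compiling unchanged */
	vma->vm_flags |= VM_LOCKED;		/* compile error: read-only member */
	set_vm_flags(vma, VM_LOCKED);		/* writers must use the wrappers */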



Re: [PATCH v2 5/6] mm: introduce mod_vm_flags_nolock and use it in untrack_pfn

2023-01-25 Thread Suren Baghdasaryan
On Wed, Jan 25, 2023 at 1:42 AM Michal Hocko  wrote:
>
> On Wed 25-01-23 00:38:50, Suren Baghdasaryan wrote:
> > In cases when VMA flags are modified after the VMA has been isolated and
> > mmap_lock was downgraded, the flag modifications would trigger an assertion
> > failure because the mmap write lock is not held.
> > Introduce mod_vm_flags_nolock to be used in such situations.
> > Pass a hint to untrack_pfn to conditionally use mod_vm_flags_nolock for
> > flags modification and to avoid the assertion.
>
> Neither the changelog nor the documentation of mod_vm_flags_nolock
> really explains when it is safe to use it. This is really important for
> future potential users.

True. I'll add clarification in the comments and in the changelog. Thanks!
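Something along these lines, perhaps (illustrative wording only; the helper
body is the one quoted below):

/*
 * Only use when the caller can guarantee that no other thread can access
 * the VMA, e.g. the VMA has already been isolated and mmap_lock was
 * downgraded or dropped for the final teardown, so
 * mmap_assert_write_locked() cannot be used.
 */
static inline void mod_vm_flags_nolock(struct vm_area_struct *vma,
				       unsigned long set, unsigned long clear)
{
	vma->vm_flags |= set;
	vma->vm_flags &= ~clear;
}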

>
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  arch/x86/mm/pat/memtype.c | 10 +++---
> >  include/linux/mm.h| 12 +---
> >  include/linux/pgtable.h   |  5 +++--
> >  mm/memory.c   | 13 +++--
> >  mm/memremap.c |  4 ++--
> >  mm/mmap.c | 16 ++--
> >  6 files changed, 38 insertions(+), 22 deletions(-)
> >
> > diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> > index ae9645c900fa..d8adc0b42cf2 100644
> > --- a/arch/x86/mm/pat/memtype.c
> > +++ b/arch/x86/mm/pat/memtype.c
> > @@ -1046,7 +1046,7 @@ void track_pfn_insert(struct vm_area_struct *vma, 
> > pgprot_t *prot, pfn_t pfn)
> >   * can be for the entire vma (in which case pfn, size are zero).
> >   */
> >  void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
> > -  unsigned long size)
> > +  unsigned long size, bool mm_wr_locked)
> >  {
> >   resource_size_t paddr;
> >   unsigned long prot;
> > @@ -1065,8 +1065,12 @@ void untrack_pfn(struct vm_area_struct *vma, 
> > unsigned long pfn,
> >   size = vma->vm_end - vma->vm_start;
> >   }
> >   free_pfn_range(paddr, size);
> > - if (vma)
> > - clear_vm_flags(vma, VM_PAT);
> > + if (vma) {
> > + if (mm_wr_locked)
> > + clear_vm_flags(vma, VM_PAT);
> > + else
> > + mod_vm_flags_nolock(vma, 0, VM_PAT);
> > + }
> >  }
> >
> >  /*
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 55335edd1373..48d49930c411 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -656,12 +656,18 @@ static inline void clear_vm_flags(struct 
> > vm_area_struct *vma,
> >   vma->vm_flags &= ~flags;
> >  }
> >
> > +static inline void mod_vm_flags_nolock(struct vm_area_struct *vma,
> > +unsigned long set, unsigned long clear)
> > +{
> > + vma->vm_flags |= set;
> > + vma->vm_flags &= ~clear;
> > +}
> > +
> >  static inline void mod_vm_flags(struct vm_area_struct *vma,
> >   unsigned long set, unsigned long clear)
> >  {
> >   mmap_assert_write_locked(vma->vm_mm);
> > - vma->vm_flags |= set;
> > - vma->vm_flags &= ~clear;
> > + mod_vm_flags_nolock(vma, set, clear);
> >  }
> >
> >  static inline void vma_set_anonymous(struct vm_area_struct *vma)
> > @@ -2087,7 +2093,7 @@ static inline void zap_vma_pages(struct 
> > vm_area_struct *vma)
> >  }
> >  void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
> >   struct vm_area_struct *start_vma, unsigned long start,
> > - unsigned long end);
> > + unsigned long end, bool mm_wr_locked);
> >
> >  struct mmu_notifier_range;
> >
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 5fd45454c073..c63cd44777ec 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -1185,7 +1185,8 @@ static inline int track_pfn_copy(struct 
> > vm_area_struct *vma)
> >   * can be for the entire vma (in which case pfn, size are zero).
> >   */
> >  static inline void untrack_pfn(struct vm_area_struct *vma,
> > -unsigned long pfn, unsigned long size)
> > +unsigned long pfn, unsigned long size,
> > +bool mm_wr_locked)
> >  {
> >  }
> >
> > @@ -1203,7 +1204,7 @@ extern void track_pfn_insert(struct vm_area_struct 
> > *vma, pgprot_t *prot,
> >pfn_t pfn);
> >  extern int track_pfn_copy
