[Devel] [PATCH VZ7] netfilter: nft_dynset: do not reject set updates with NFT_SET_EVAL

2024-05-28 Thread Pavel Tikhomirov
From: Pablo Neira Ayuso 

NFT_SET_EVAL is signalling the kernel that this sets can be updated from
the evaluation path, even if there are no expressions attached to the
element. Otherwise, set updates with no expressions fail. Update
description to describe the right semantics.

Fixes: 22fe54d5fefc ("netfilter: nf_tables: add support for dynamic set 
updates")
Signed-off-by: Pablo Neira Ayuso 

Note: This fixes the problem of adding dynamic set modification rules:

nft add table inet test-table
nft add set inet test-table test-set { typeof ip saddr \; flags timeout \; size 
1 \; }
nft add chain inet test-table test-chain { type filter hook input priority 
filter \; policy accept \; }
nft add rule inet test-table test-chain add @test-set { ip saddr timeout 2m }

Without this patch the last command fails.

Pinging the above setup will automaticly create set elements, e.g.:

nft list ruleset
table inet test-table {
set test-set {
typeof ip saddr
size 1
flags timeout
elements = { 172.29.1.154 expires 1m38s945ms,
 172.29.158.29 expires 1m45s348ms,
 192.168.16.220 expires 1m56s984ms }
}

chain test-chain {
type filter hook input priority filter; policy accept;
add @test-set { ip saddr timeout 2m }
}
}

https://virtuozzo.atlassian.net/browse/PSBM-156601
(cherry picked from commit 215a31f19dedd4e92a67cf5a9717ee898d012b3a)
Signed-off-by: Pavel Tikhomirov 
---
 include/uapi/linux/netfilter/nf_tables.h | 2 +-
 net/netfilter/nft_dynset.c   | 3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/netfilter/nf_tables.h 
b/include/uapi/linux/netfilter/nf_tables.h
index 98ad9c3d98f4..4ba705db4ae5 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -243,7 +243,7 @@ enum nft_rule_compat_attributes {
  * @NFT_SET_INTERVAL: set contains intervals
  * @NFT_SET_MAP: set is used as a dictionary
  * @NFT_SET_TIMEOUT: set uses timeouts
- * @NFT_SET_EVAL: set contains expressions for evaluation
+ * @NFT_SET_EVAL: set can be updated from the evaluation path
  */
 enum nft_set_flags {
NFT_SET_ANONYMOUS   = 0x1,
diff --git a/net/netfilter/nft_dynset.c b/net/netfilter/nft_dynset.c
index abf5b1cfdb3a..a32b39ad5d8d 100644
--- a/net/netfilter/nft_dynset.c
+++ b/net/netfilter/nft_dynset.c
@@ -178,8 +178,7 @@ static int nft_dynset_init(const struct nft_ctx *ctx,
priv->expr = nft_expr_init(ctx, tb[NFTA_DYNSET_EXPR]);
if (IS_ERR(priv->expr))
return PTR_ERR(priv->expr);
-   } else if (set->flags & NFT_SET_EVAL)
-   return -EINVAL;
+   }
 
nft_set_ext_prepare(>tmpl);
nft_set_ext_add_length(>tmpl, NFT_SET_EXT_KEY, set->klen);
-- 
2.44.0

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH VZ9] mm/memcontrol: Add page cache limit to cgroup-v2

2024-04-14 Thread Pavel Tikhomirov




On 22/03/2024 12:58, Pavel Tikhomirov wrote:

The interface is slightly reworked to be more v2 like:

- rename memory.cache.limit/usage_in_bytes -> memory.cache.max/current
- show "max" when uninitialized and allow to write it
- memcg_max_mutex with page_counter_set_max replaced with simple xchg
- we set limit first before looping and then try to enforce it if
   needed, no more enforce before setting logic
- retry reclaim couple of times if it fails to enforce the limit and
   then just give up (memory_max_write triggers oom in this case, but we
   probably do not want to trigger oom due to cache limit)

https://virtuozzo.atlassian.net/browse/PSBM-154207
Signed-off-by: Pavel Tikhomirov 

Feature: mm: Memory cgroup page cache limit
---
  mm/memcontrol.c | 75 +
  1 file changed, 75 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9000dc00ed03..28a39b50cbe1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -8810,3 +8810,78 @@ static int __init mem_cgroup_swap_init(void)
  subsys_initcall(mem_cgroup_swap_init);
  
  #endif /* CONFIG_SWAP */

+
+static int cache_max_show(struct seq_file *m, void *v)
+{
+   return seq_puts_memcg_tunable(m,
+   READ_ONCE(mem_cgroup_from_seq(m)->cache.max));
+}
+
+static ssize_t cache_max_write(struct kernfs_open_file *of,
+  char *buf, size_t nbytes, loff_t off)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+   unsigned int nr_reclaims = MAX_RECLAIM_RETRIES;
+   unsigned long max;
+   int err;
+
+   buf = strstrip(buf);
+   err = page_counter_memparse(buf, "max", );
+   if (err)
+   return err;
+
+   xchg(>cache.max, max);


For history:

Here is my understanding (from discussion with Johannes Weiner 
) of why we need xchg() in memory_max_write:


In page_counter_try_charge we do:

a) increase usage
-- implicit full memory barrier ---
b) check usage is within limit, else
c) revert usage

In page_counter_set_max we do:

A) check usage is within new limit, then
-- implicit full memory barrier ---
B) set new limit
-- implicit full memory barrier ---
C) check usage didn't grow under us, else
D) revert old limit

If at (b) we don't see concurrent limit change then at (C) we would see 
usage grow and vice versa.


And after switch to memory_max_write in cgroup-v2 code path (C) and (D) 
were replaced with:

C') check usage is within new limit, else
D') reclaim until is, or OOM if can't reach

So basically we need xchg() to sync with page_counter_try_charge.

For memcg->cache we don't have uses of page_counter_try_charge, so it 
should be safe to omit xchg() here (same for oom.guarantee).


But we also already have such unjustified xchg(), which is not strictly 
required, for swap.max in mainstream code, probably just copy-pasted 
from memory_max_write. And as it is not a fast-path we can just leave 
xchg() here and there unless we would really need to remove it for some 
reason.



+
+   for (;;) {
+   unsigned long nr_cache = page_counter_read(>cache);
+
+   if (nr_cache <= max)
+   break;
+
+   if (signal_pending(current))
+   break;
+
+   if (!nr_reclaims)
+   break;
+
+   if (!try_to_free_mem_cgroup_pages(memcg, nr_cache - max,
+   GFP_KERNEL, false))
+   nr_reclaims--;
+   }
+
+   memcg_wb_domain_size_changed(memcg);
+   return nbytes;
+}
+
+static u64 cache_current_read(struct cgroup_subsys_state *css,
+  struct cftype *cft)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+   return (u64)page_counter_read(>cache) * PAGE_SIZE;
+}
+
+static struct cftype cache_files[] = {
+   {
+   .name = "cache.max",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .seq_show = cache_max_show,
+   .write = cache_max_write,
+   },
+   {
+   .name = "cache.current",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .read_u64 = cache_current_read,
+   },
+   { } /* terminate */
+};
+
+static int __init mem_cgroup_cache_init(void)
+{
+   if (mem_cgroup_disabled())
+   return 0;
+
+   WARN_ON(cgroup_add_dfl_cftypes(_cgrp_subsys, cache_files));
+   return 0;
+}
+subsys_initcall(mem_cgroup_cache_init);


--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH VZ9 v2 1/3] memcg: fix oom_guarantee to be considered in global oom and in berserker

2024-03-25 Thread Pavel Tikhomirov
After rebase to VZ8 the code which sets memcg->overdraft was moved to
mem_cgroup_scan_tasks() which is completely wrong (for instance in VZ7
we had this code in oom_unlock(), so everywhere in oom we always had
->overdraft information properly set). Now we don't have proper refresh
of ->overdraft information in two cases: in global oom and in berserker.

Let's fix this by spliting the refresh code to separate function
refresh_mem_cgroup_overdraft() and call it where it is really needed
(where later in stack oom_badness uses the refreshed ->overdraft).

Fixes: c31dabeaf42d ("memcg: add oom_guarantee")
Signed-off-by: Pavel Tikhomirov 

Feature: mm/oom: OOM guarantee feature
---
 include/linux/memcontrol.h |  6 ++
 mm/memcontrol.c| 30 +++---
 mm/oom_kill.c  |  4 
 3 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4719eb84894f..9c2b8774639e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct mem_cgroup;
 struct obj_cgroup;
@@ -1004,6 +1005,7 @@ void mem_cgroup_print_oom_context(struct mem_cgroup 
*memcg,
 
 void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg);
 unsigned long mem_cgroup_overdraft(struct mem_cgroup *memcg);
+void refresh_mem_cgroup_overdraft(struct oom_control *oc);
 
 static inline void mem_cgroup_enter_user_fault(void)
 {
@@ -1529,6 +1531,10 @@ static inline unsigned long mem_cgroup_overdraft(struct 
mem_cgroup *memcg)
return 0;
 }
 
+static inline void refresh_mem_cgroup_overdraft(struct oom_control *oc)
+{
+}
+
 static inline void
 mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a0a7c3b4ed35..8ca27af8d902 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1308,17 +1308,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
for_each_mem_cgroup_tree(iter, memcg) {
struct css_task_iter it;
struct task_struct *task;
-   struct mem_cgroup *parent;
-
-   /*
-* Update overdraft of each cgroup under us. This
-* information will be used in oom_badness.
-*/
-   iter->overdraft = mem_cgroup_overdraft(iter);
-   parent = parent_mem_cgroup(iter);
-   if (parent && iter != memcg)
-   iter->overdraft = max(iter->overdraft,
-   parent->overdraft);
 
css_task_iter_start(>css, CSS_TASK_ITER_PROCS, );
while (!ret && (task = css_task_iter_next()))
@@ -1518,6 +1507,25 @@ unsigned long mem_cgroup_overdraft(struct mem_cgroup 
*memcg)
return usage > guarantee ? (usage - guarantee) : 0;
 }
 
+void refresh_mem_cgroup_overdraft(struct oom_control *oc)
+{
+   struct mem_cgroup *iter;
+
+   for_each_mem_cgroup_tree(iter, oc->memcg) {
+   struct mem_cgroup *parent;
+
+   /*
+* Update overdraft of each cgroup under us. This
+* information will be used in oom_badness.
+*/
+   iter->overdraft = mem_cgroup_overdraft(iter);
+   parent = parent_mem_cgroup(iter);
+   if (parent && iter != oc->memcg)
+   iter->overdraft = max(iter->overdraft,
+ parent->overdraft);
+   }
+}
+
 bool mem_cgroup_dcache_is_low(struct mem_cgroup *memcg, int 
vfs_cache_min_ratio)
 {
unsigned long anon, file, dcache;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1d44024d6c6f..d27e04295e7f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -378,6 +378,8 @@ static void select_bad_process(struct oom_control *oc)
 {
oc->chosen_points = LONG_MIN;
 
+   refresh_mem_cgroup_overdraft(oc);
+
if (is_memcg_oom(oc))
mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
else {
@@ -1073,6 +1075,8 @@ static void oom_berserker(struct oom_control *oc)
if (rage < 0)
return;
 
+   refresh_mem_cgroup_overdraft(oc);
+
/*
 * So, we are in rage. Kill (1 << rage) youngest tasks that are
 * as bad as the victim.
-- 
2.43.0

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH VZ9 v2 2/3] memcg: fix oom_guarantee overdraft calculation in cgroup-v2

2024-03-25 Thread Pavel Tikhomirov
The memsw counter in cgroup-v2 is reused by swap counter and does not
include memory usage, we need to replace it with memory + swap.

Fixes: c31dabeaf42d ("memcg: add oom_guarantee")
Signed-off-by: Pavel Tikhomirov 

Feature: mm/oom: OOM guarantee feature
---
 mm/memcontrol.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8ca27af8d902..fcc4b24ba330 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1503,7 +1503,10 @@ unsigned long mem_cgroup_overdraft(struct mem_cgroup 
*memcg)
return 0;
 
guarantee = READ_ONCE(memcg->oom_guarantee);
-   usage = page_counter_read(>memsw);
+   if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+   usage = page_counter_read(>memory) + 
page_counter_read(>swap);
+   else
+   usage = page_counter_read(>memsw);
return usage > guarantee ? (usage - guarantee) : 0;
 }
 
-- 
2.43.0

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH VZ9 v2 0/3] cgroup-v2: enable memory.oom_guarantee vz-specific cgroup property

2024-03-25 Thread Pavel Tikhomirov
https://virtuozzo.atlassian.net/browse/PSBM-154224

Pavel Tikhomirov (3):
  memcg: fix oom_guarantee to be considered in global oom and in
berserker
  memcg: fix oom_guarantee overdraft calculation in cgroup-v2
  mm/memcontrol: make memory.oom_guarantee file also available in
cgroup-v2

 include/linux/memcontrol.h |  8 -
 mm/memcontrol.c| 65 +++---
 mm/oom_kill.c  |  4 +++
 3 files changed, 64 insertions(+), 13 deletions(-)

-- 
2.43.0

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH VZ9 3/3] mm/memcontrol: make memory.oom_guarantee file also available in cgroup-v2

2024-03-25 Thread Pavel Tikhomirov
The interface is slightly reworked to be more v2 like:

- switch from "-1" to "max" for max value following cgroup-v2 style

While on it also:
- make oom_guarantee (unsigned long) similar to regular cgroup limits,
  as we never set more than PAGE_COUNTER_MAX there anyway
- set oom_guarantee with xchg(), this ensures that update is atomic

https://virtuozzo.atlassian.net/browse/PSBM-154224
Signed-off-by: Pavel Tikhomirov 

Feature: mm/oom: OOM guarantee feature
---
v2:
- correct commit message part about "max"
- make oom_guarantee unsigned long
- update oom_guarantee with xchg()
---
 include/linux/memcontrol.h |  2 +-
 mm/memcontrol.c| 30 ++
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9c2b8774639e..2aa330cb81b4 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -263,7 +263,7 @@ struct mem_cgroup {
struct vmpressure vmpressure;
 
unsigned long overdraft;
-   unsigned long long oom_guarantee;
+   unsigned long oom_guarantee;
 
/*
 * Should the OOM killer kill all belonging tasks, had it kill one?
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fcc4b24ba330..1722a49855ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5239,6 +5239,14 @@ static u64 mem_cgroup_oom_guarantee_read(struct 
cgroup_subsys_state *css,
return mem_cgroup_from_css(css)->oom_guarantee << PAGE_SHIFT;
 }
 
+static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value);
+
+static int memory_oom_guarantee_show(struct seq_file *m, void *v)
+{
+   return seq_puts_memcg_tunable(m,
+   READ_ONCE(mem_cgroup_from_seq(m)->oom_guarantee));
+}
+
 static ssize_t mem_cgroup_oom_guarantee_write(struct kernfs_open_file *kops,
char *buffer, size_t nbytes, loff_t off)
 {
@@ -5255,6 +5263,22 @@ static ssize_t mem_cgroup_oom_guarantee_write(struct 
kernfs_open_file *kops,
return nbytes;
 }
 
+static ssize_t memory_oom_guarantee_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+   unsigned long oom_guarantee;
+   int err;
+
+   buf = strstrip(buf);
+   err = page_counter_memparse(buf, "max", _guarantee);
+   if (err)
+   return err;
+
+   xchg(>oom_guarantee, oom_guarantee);
+   return nbytes;
+}
+
 #ifdef CONFIG_CLEANCACHE
 static u64 mem_cgroup_disable_cleancache_read(struct cgroup_subsys_state *css,
  struct cftype *cft)
@@ -7629,6 +7653,12 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NS_DELEGATABLE,
.write = memory_reclaim,
},
+   {
+   .name = "oom_guarantee",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .seq_show = memory_oom_guarantee_show,
+   .write = memory_oom_guarantee_write,
+   },
{ } /* terminate */
 };
 
-- 
2.43.0

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH VZ9 3/3] mm/memcontrol: make memory.oom_guarantee file also available in cgroup-v2

2024-03-22 Thread Pavel Tikhomirov




On 22/03/2024 13:21, Pavel Tikhomirov wrote:

The interface is slightly reworked to be more v2 like:

- show "max" when uninitialized and allow to write it


Slightly bad wording, it is 0 when uninitialized. Should be:

- show "max" when has max value and allow to write it



https://virtuozzo.atlassian.net/browse/PSBM-154224
Signed-off-by: Pavel Tikhomirov 

Feature: mm/oom: OOM guarantee feature
---
  mm/memcontrol.c | 30 ++
  1 file changed, 30 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fcc4b24ba330..f709ed6be891 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5239,6 +5239,14 @@ static u64 mem_cgroup_oom_guarantee_read(struct 
cgroup_subsys_state *css,
return mem_cgroup_from_css(css)->oom_guarantee << PAGE_SHIFT;
  }
  
+static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value);

+
+static int memory_oom_guarantee_show(struct seq_file *m, void *v)
+{
+   return seq_puts_memcg_tunable(m,
+   READ_ONCE(mem_cgroup_from_seq(m)->oom_guarantee));
+}
+
  static ssize_t mem_cgroup_oom_guarantee_write(struct kernfs_open_file *kops,
char *buffer, size_t nbytes, loff_t off)
  {
@@ -5255,6 +5263,22 @@ static ssize_t mem_cgroup_oom_guarantee_write(struct 
kernfs_open_file *kops,
return nbytes;
  }
  
+static ssize_t memory_oom_guarantee_write(struct kernfs_open_file *of,

+ char *buf, size_t nbytes, loff_t off)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+   unsigned long oom_guarantee;
+   int err;
+
+   buf = strstrip(buf);
+   err = page_counter_memparse(buf, "max", _guarantee);
+   if (err)
+   return err;
+
+   memcg->oom_guarantee = oom_guarantee;
+   return nbytes;
+}
+
  #ifdef CONFIG_CLEANCACHE
  static u64 mem_cgroup_disable_cleancache_read(struct cgroup_subsys_state *css,
  struct cftype *cft)
@@ -7629,6 +7653,12 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NS_DELEGATABLE,
.write = memory_reclaim,
},
+   {
+   .name = "oom_guarantee",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .seq_show = memory_oom_guarantee_show,
+   .write = memory_oom_guarantee_write,
+   },
{ } /* terminate */
  };
  


--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH VZ9 2/3] memcg: fix oom_guarantee overdraft calculation in cgroup-v2

2024-03-21 Thread Pavel Tikhomirov
The memsw counter in cgroup-v2 is reused by swap counter and does not
include memory usage, we need to replace it with memory + swap.

https://virtuozzo.atlassian.net/browse/PSBM-154224
Signed-off-by: Pavel Tikhomirov 

Feature: mm/oom: OOM guarantee feature
---
 mm/memcontrol.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8ca27af8d902..fcc4b24ba330 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1503,7 +1503,10 @@ unsigned long mem_cgroup_overdraft(struct mem_cgroup 
*memcg)
return 0;
 
guarantee = READ_ONCE(memcg->oom_guarantee);
-   usage = page_counter_read(>memsw);
+   if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+   usage = page_counter_read(>memory) + 
page_counter_read(>swap);
+   else
+   usage = page_counter_read(>memsw);
return usage > guarantee ? (usage - guarantee) : 0;
 }
 
-- 
2.43.0

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH VZ9 0/2] cgroup-v2: enable memory.oom_guarantee vz-specific cgroup property

2024-03-21 Thread Pavel Tikhomirov
https://virtuozzo.atlassian.net/browse/PSBM-153799

Pavel Tikhomirov (3):
  memcg: fix oom_guarantee to be considered in global oom and in
berserker
  memcg: fix oom_guarantee overdraft calculation in cgroup-v2
  mm/memcontrol: make memory.oom_guarantee file also available in
cgroup-v2

 include/linux/memcontrol.h |  6 
 mm/memcontrol.c| 65 +++---
 mm/oom_kill.c  |  4 +++
 3 files changed, 63 insertions(+), 12 deletions(-)

-- 
2.43.0

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH VZ9 3/3] mm/memcontrol: make memory.oom_guarantee file also available in cgroup-v2

2024-03-21 Thread Pavel Tikhomirov
The interface is slightly reworked to be more v2 like:

- show "max" when uninitialized and allow to write it

https://virtuozzo.atlassian.net/browse/PSBM-154224
Signed-off-by: Pavel Tikhomirov 

Feature: mm/oom: OOM guarantee feature
---
 mm/memcontrol.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fcc4b24ba330..f709ed6be891 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5239,6 +5239,14 @@ static u64 mem_cgroup_oom_guarantee_read(struct 
cgroup_subsys_state *css,
return mem_cgroup_from_css(css)->oom_guarantee << PAGE_SHIFT;
 }
 
+static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value);
+
+static int memory_oom_guarantee_show(struct seq_file *m, void *v)
+{
+   return seq_puts_memcg_tunable(m,
+   READ_ONCE(mem_cgroup_from_seq(m)->oom_guarantee));
+}
+
 static ssize_t mem_cgroup_oom_guarantee_write(struct kernfs_open_file *kops,
char *buffer, size_t nbytes, loff_t off)
 {
@@ -5255,6 +5263,22 @@ static ssize_t mem_cgroup_oom_guarantee_write(struct 
kernfs_open_file *kops,
return nbytes;
 }
 
+static ssize_t memory_oom_guarantee_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+   unsigned long oom_guarantee;
+   int err;
+
+   buf = strstrip(buf);
+   err = page_counter_memparse(buf, "max", _guarantee);
+   if (err)
+   return err;
+
+   memcg->oom_guarantee = oom_guarantee;
+   return nbytes;
+}
+
 #ifdef CONFIG_CLEANCACHE
 static u64 mem_cgroup_disable_cleancache_read(struct cgroup_subsys_state *css,
  struct cftype *cft)
@@ -7629,6 +7653,12 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NS_DELEGATABLE,
.write = memory_reclaim,
},
+   {
+   .name = "oom_guarantee",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .seq_show = memory_oom_guarantee_show,
+   .write = memory_oom_guarantee_write,
+   },
{ } /* terminate */
 };
 
-- 
2.43.0

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH VZ9 1/3] memcg: fix oom_guarantee to be considered in global oom and in berserker

2024-03-21 Thread Pavel Tikhomirov
After rebase to VZ8 the code which sets memcg->overdraft was moved to
mem_cgroup_scan_tasks() which is completely wrong (for instance in VZ7
we had this code in oom_unlock(), so everywhere in oom we always had
->overdraft information properly set). Now we don't have proper refresh
of ->overdraft information in two cases: in global oom and in berserker.

Let's fix this by spliting the refresh code to separate function
refresh_mem_cgroup_overdraft() and call it where it is really needed
(where later in stack oom_badness uses the refreshed ->overdraft).

https://virtuozzo.atlassian.net/browse/PSBM-154224
Fixes: c31dabeaf42d ("memcg: add oom_guarantee")
Signed-off-by: Pavel Tikhomirov 

Feature: mm/oom: OOM guarantee feature
---
 include/linux/memcontrol.h |  6 ++
 mm/memcontrol.c| 30 +++---
 mm/oom_kill.c  |  4 
 3 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4719eb84894f..9c2b8774639e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct mem_cgroup;
 struct obj_cgroup;
@@ -1004,6 +1005,7 @@ void mem_cgroup_print_oom_context(struct mem_cgroup 
*memcg,
 
 void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg);
 unsigned long mem_cgroup_overdraft(struct mem_cgroup *memcg);
+void refresh_mem_cgroup_overdraft(struct oom_control *oc);
 
 static inline void mem_cgroup_enter_user_fault(void)
 {
@@ -1529,6 +1531,10 @@ static inline unsigned long mem_cgroup_overdraft(struct 
mem_cgroup *memcg)
return 0;
 }
 
+static inline void refresh_mem_cgroup_overdraft(struct oom_control *oc)
+{
+}
+
 static inline void
 mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a0a7c3b4ed35..8ca27af8d902 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1308,17 +1308,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
for_each_mem_cgroup_tree(iter, memcg) {
struct css_task_iter it;
struct task_struct *task;
-   struct mem_cgroup *parent;
-
-   /*
-* Update overdraft of each cgroup under us. This
-* information will be used in oom_badness.
-*/
-   iter->overdraft = mem_cgroup_overdraft(iter);
-   parent = parent_mem_cgroup(iter);
-   if (parent && iter != memcg)
-   iter->overdraft = max(iter->overdraft,
-   parent->overdraft);
 
css_task_iter_start(>css, CSS_TASK_ITER_PROCS, );
while (!ret && (task = css_task_iter_next()))
@@ -1518,6 +1507,25 @@ unsigned long mem_cgroup_overdraft(struct mem_cgroup 
*memcg)
return usage > guarantee ? (usage - guarantee) : 0;
 }
 
+void refresh_mem_cgroup_overdraft(struct oom_control *oc)
+{
+   struct mem_cgroup *iter;
+
+   for_each_mem_cgroup_tree(iter, oc->memcg) {
+   struct mem_cgroup *parent;
+
+   /*
+* Update overdraft of each cgroup under us. This
+* information will be used in oom_badness.
+*/
+   iter->overdraft = mem_cgroup_overdraft(iter);
+   parent = parent_mem_cgroup(iter);
+   if (parent && iter != oc->memcg)
+   iter->overdraft = max(iter->overdraft,
+ parent->overdraft);
+   }
+}
+
 bool mem_cgroup_dcache_is_low(struct mem_cgroup *memcg, int 
vfs_cache_min_ratio)
 {
unsigned long anon, file, dcache;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1d44024d6c6f..d27e04295e7f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -378,6 +378,8 @@ static void select_bad_process(struct oom_control *oc)
 {
oc->chosen_points = LONG_MIN;
 
+   refresh_mem_cgroup_overdraft(oc);
+
if (is_memcg_oom(oc))
mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
else {
@@ -1073,6 +1075,8 @@ static void oom_berserker(struct oom_control *oc)
if (rage < 0)
return;
 
+   refresh_mem_cgroup_overdraft(oc);
+
/*
 * So, we are in rage. Kill (1 << rage) youngest tasks that are
 * as bad as the victim.
-- 
2.43.0

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH VZ9] mm/memcontrol: Add page cache limit to cgroup-v2

2024-03-21 Thread Pavel Tikhomirov
The interface is slightly reworked to be more v2 like:

- rename memory.cache.limit/usage_in_bytes -> memory.cache.max/current
- show "max" when uninitialized and allow to write it
- memcg_max_mutex with page_counter_set_max replaced with simple xchg
- we set limit first before looping and then try to enforce it if
  needed, no more enforce before setting logic
- retry reclaim couple of times if it fails to enforce the limit and
  then just give up (memory_max_write triggers oom in this case, but we
  probably do not want to trigger oom due to cache limit)

https://virtuozzo.atlassian.net/browse/PSBM-154207
Signed-off-by: Pavel Tikhomirov 

Feature: mm: Memory cgroup page cache limit
---
 mm/memcontrol.c | 75 +
 1 file changed, 75 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9000dc00ed03..28a39b50cbe1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -8810,3 +8810,78 @@ static int __init mem_cgroup_swap_init(void)
 subsys_initcall(mem_cgroup_swap_init);
 
 #endif /* CONFIG_SWAP */
+
+static int cache_max_show(struct seq_file *m, void *v)
+{
+   return seq_puts_memcg_tunable(m,
+   READ_ONCE(mem_cgroup_from_seq(m)->cache.max));
+}
+
+static ssize_t cache_max_write(struct kernfs_open_file *of,
+  char *buf, size_t nbytes, loff_t off)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+   unsigned int nr_reclaims = MAX_RECLAIM_RETRIES;
+   unsigned long max;
+   int err;
+
+   buf = strstrip(buf);
+   err = page_counter_memparse(buf, "max", );
+   if (err)
+   return err;
+
+   xchg(>cache.max, max);
+
+   for (;;) {
+   unsigned long nr_cache = page_counter_read(>cache);
+
+   if (nr_cache <= max)
+   break;
+
+   if (signal_pending(current))
+   break;
+
+   if (!nr_reclaims)
+   break;
+
+   if (!try_to_free_mem_cgroup_pages(memcg, nr_cache - max,
+   GFP_KERNEL, false))
+   nr_reclaims--;
+   }
+
+   memcg_wb_domain_size_changed(memcg);
+   return nbytes;
+}
+
+static u64 cache_current_read(struct cgroup_subsys_state *css,
+  struct cftype *cft)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+   return (u64)page_counter_read(>cache) * PAGE_SIZE;
+}
+
+static struct cftype cache_files[] = {
+   {
+   .name = "cache.max",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .seq_show = cache_max_show,
+   .write = cache_max_write,
+   },
+   {
+   .name = "cache.current",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .read_u64 = cache_current_read,
+   },
+   { } /* terminate */
+};
+
+static int __init mem_cgroup_cache_init(void)
+{
+   if (mem_cgroup_disabled())
+   return 0;
+
+   WARN_ON(cgroup_add_dfl_cftypes(_cgrp_subsys, cache_files));
+   return 0;
+}
+subsys_initcall(mem_cgroup_cache_init);
-- 
2.43.0

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH VZ9 2/2] sched/stat: make cpu.proc.stat file also available in cgroup-v2

2024-03-21 Thread Pavel Tikhomirov
We use the fact that cpuacct controller is implicitly enabled in
cgroupv-2 by previous patch. So we can just do everyting the same as we
do it on cgroup-v1 where we have both cpu and cpuacct controllers
mounted together.

Note: Stats in cpu.proc.stat give per-cpu per-cgroup usages which are
used in vcmmd to calculate per-numa cpu loads for it's balancing
algorithm. This used both for Containers and VMs.

https://virtuozzo.atlassian.net/browse/PSBM-154206
Signed-off-by: Pavel Tikhomirov 

Feature: sched: emulate virtual cpus for Containers
---
 kernel/sched/core.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 98e9f78e19cf..2235f1565078 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11835,6 +11835,10 @@ static struct cftype cpu_files[] = {
.write = cpu_uclamp_max_write,
},
 #endif
+   {
+   .name = "proc.stat",
+   .seq_show = cpu_cgroup_proc_stat_show,
+   },
{ } /* terminate */
 };
 
-- 
2.43.0

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH VZ9 0/2] cgroup-v2: enable cpu.proc.stat vz-specific cgroup property

2024-03-21 Thread Pavel Tikhomirov
https://virtuozzo.atlassian.net/browse/PSBM-154206
Signed-off-by: Pavel Tikhomirov 

Pavel Tikhomirov (2):
  cpuacct: implicitly enable cpuacct controller in cgroup-v2
  sched/stat: make cpu.proc.stat file also available in cgroup-v2

 kernel/sched/core.c| 4 
 kernel/sched/cpuacct.c | 3 +++
 2 files changed, 7 insertions(+)

-- 
2.43.0

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH VZ9 1/2] cpuacct: implicitly enable cpuacct controller in cgroup-v2

2024-03-21 Thread Pavel Tikhomirov
This way the cpuacct controller is transparent to userspace but can be
used to track cgroup cpu usages per-cpu, in next patch we will show
those usages in cpu.proc.stat file.

Notes:
- In case cpuacct is mounted in cgroup-v1 we won't have it in cgroup-v2.
- We also make this cgroup threaded to be able to make it implicit on
  default hierarchy. There should not be any problem as each thread
  accounts separately anyway.

https://virtuozzo.atlassian.net/browse/PSBM-154206
Signed-off-by: Pavel Tikhomirov 

Feature: sched: emulate virtual cpus for Containers
---
 kernel/sched/cpuacct.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 37c102354522..55dfd37567d8 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -362,6 +362,9 @@ struct cgroup_subsys cpuacct_cgrp_subsys = {
.css_free   = cpuacct_css_free,
.legacy_cftypes = files,
.early_init = true,
+
+   .implicit_on_dfl = true,
+   .threaded   = true,
 };
 
 extern struct task_group *css_tg(struct cgroup_subsys_state *css);
-- 
2.43.0



[Devel] [PATCH VZ9 2/2] mm: migrate page private for high-order folios in swap cache correctly

2024-03-06 Thread Pavel Tikhomirov
We have a check in do_swap_page that a page returned from
lookup_swap_cache should have its private field equal to the swap entry
it was looked up at (the page_private(page) != entry.val test). So
folio_migrate_mapping should preserve page private for each page of a
huge folio to satisfy this check; otherwise we get an infinite loop in:

  +-> mmap_read_lock
  +-> __get_user_pages_locked
+-> for-loop # taken once
  +-> __get_user_pages
+-> retry-loop # constantly spinning
  +-> faultin_page # return 0 to trigger retry
+-> handle_mm_fault
  +-> __handle_mm_fault
+-> handle_pte_fault
  +-> do_swap_page
+-> lookup_swap_cache # returns non-NULL
+-> if (swapcache)
  +-> if (!folio_test_swapcache || page_private(page) != entry.val)
+-> goto out_page
  +-> return 0

And this loop runs under mmap_lock, so we have the secondary problem of
multiple processes hanging while trying to take mmap_lock.

Note: We don't need this after we stop using private in this ms patch:
cfeed8ffe55b ("mm/swap: stop using page->private on tail pages for THP_SWAP")

https://virtuozzo.atlassian.net/browse/PSBM-153264
Signed-off-by: Pavel Tikhomirov 
---
 mm/migrate.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index d950f42c0708..357a0a4c59e6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -424,6 +424,10 @@ int folio_migrate_mapping(struct address_space *mapping,
if (folio_test_swapcache(folio)) {
folio_set_swapcache(newfolio);
newfolio->private = folio_get_private(folio);
+
+   /* Swap cache needs private for each page of huge page */
+   for (i = 1; i < nr; i++)
+   set_page_private(folio_page(newfolio, i), folio_page(folio, i)->private);
}
entries = nr;
} else {
-- 
2.43.0



[Devel] [PATCH VZ9 1/2] mm: migrate high-order folios in swap cache correctly

2024-03-06 Thread Pavel Tikhomirov
From: Charan Teja Kalla 

Large folios occupy N consecutive entries in the swap cache instead of
using multi-index entries like the page cache.  However, if a large folio
is re-added to the LRU list, it can be migrated.  The migration code was
not aware of the difference between the swap cache and the page cache and
assumed that a single xas_store() would be sufficient.

This leaves potentially many stale pointers to the now-migrated folio in
the swap cache, which can lead to almost arbitrary data corruption in the
future.  This can also manifest as infinite loops with the RCU read lock
held.

[wi...@infradead.org: modifications to the changelog & tweaked the fix]
Fixes: 3417013e0d18 ("mm/migrate: Add folio_migrate_mapping()")
Link: https://lkml.kernel.org/r/20231214045841.961776-1-wi...@infradead.org
Signed-off-by: Charan Teja Kalla 
Signed-off-by: Matthew Wilcox (Oracle) 
Reported-by: Charan Teja Kalla 
Closes: https://lkml.kernel.org/r/1700569840-17327-1-git-send-email-quic_chara...@quicinc.com
Cc: David Hildenbrand 
Cc: Johannes Weiner 
Cc: Kirill A. Shutemov 
Cc: Naoya Horiguchi 
Cc: Shakeel Butt 
Cc: 
Signed-off-by: Andrew Morton 

We have a check in do_swap_page that a page returned from
lookup_swap_cache should have the PG_swapcache bit set, but these
leftover stale pointers may be reused by a new folio without the
PG_swapcache bit, and that leads to an infinite loop in:

  +-> mmap_read_lock
+-> __get_user_pages_locked
  +-> for-loop # taken once
+-> __get_user_pages
  +-> retry-loop # constantly spinning
+-> faultin_page # return 0 to trigger retry
  +-> handle_mm_fault
+-> __handle_mm_fault
  +-> handle_pte_fault
+-> do_swap_page
  +-> lookup_swap_cache # returns non-NULL
  +-> if (swapcache)
+-> if (!folio_test_swapcache || page_private(page) != entry.val)
  +-> goto out_page
+-> return 0

(cherry picked from commit fc346d0a70a13d52fe1c4bc49516d83a42cd7c4c)
https://virtuozzo.atlassian.net/browse/PSBM-153264
Signed-off-by: Pavel Tikhomirov 
---
 mm/migrate.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index d36d945cf716..d950f42c0708 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -387,6 +387,7 @@ int folio_migrate_mapping(struct address_space *mapping,
int dirty;
int expected_count = folio_expected_refs(mapping, folio) + extra_count;
long nr = folio_nr_pages(folio);
+   long entries, i;
 
if (!mapping) {
/* Anonymous page without mapping */
@@ -424,8 +425,10 @@ int folio_migrate_mapping(struct address_space *mapping,
folio_set_swapcache(newfolio);
newfolio->private = folio_get_private(folio);
}
+   entries = nr;
} else {
VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
+   entries = 1;
}
 
/* Move dirty while page refs frozen and newpage not yet exposed */
@@ -435,7 +438,11 @@ int folio_migrate_mapping(struct address_space *mapping,
folio_set_dirty(newfolio);
}
 
-   xas_store(, newfolio);
+   /* Swap cache still stores N entries instead of a high-order entry */
+   for (i = 0; i < entries; i++) {
+   xas_store(, newfolio);
+   xas_next();
+   }
 
/*
 * Drop cache reference from old page by unfreezing
-- 
2.43.0



[Devel] [PATCH VZ9 0/2] mm: fix livelock caused by thp pages migration

2024-03-06 Thread Pavel Tikhomirov
https://virtuozzo.atlassian.net/browse/PSBM-153264

Charan Teja Kalla (1):
  mm: migrate high-order folios in swap cache correctly

Pavel Tikhomirov (1):
  mm: migrate page private for high-order folios in swap cache correctly

 mm/migrate.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

-- 
2.43.0



[Devel] [PATCH VZ9] mm/vmscan: add rcu_read_lock to replace released shrinker_rwsem

2024-02-05 Thread Pavel Tikhomirov
After commit [1] we release shrinker_rwsem while processing
do_shrink_slab for nfs; we need this to mitigate shrinker_rwsem being
blocked by a hang in the nfs shrinker.

After that we lack shrinker_rwsem and rcu_read_lock in these stacks:

  +-< rcu_dereference_protected(lockdep_is_held(&shrinker_rwsem))
+-< shrinker_info_protected
  +-< xchg_nr_deferred_memcg
+-< xchg_nr_deferred
  +-< do_shrink_slab
  +-< add_nr_deferred_memcg
+-< add_nr_deferred
  +-< do_shrink_slab

As these stacks only use info for reading, we can switch to
rcu_read_lock; we also need to switch rcu_dereference_protected ->
rcu_dereference_check.

https://virtuozzo.atlassian.net/browse/PSBM-153973
Fixes: c0efc56a8f844 ("mm: fix hanging shrinker management on long do_shrink_slab") [1]
Signed-off-by: Pavel Tikhomirov 
---
note: In vz7 we don't need it.
---
 mm/vmscan.c | 31 +++
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 847fc2354c3dc..1f2eacbad84d4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -218,6 +218,13 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
 lockdep_is_held(&shrinker_rwsem));
 }
 
+static struct shrinker_info *shrinker_info_check(struct mem_cgroup *memcg,
+int nid)
+{
+   return rcu_dereference_check(memcg->nodeinfo[nid]->shrinker_info,
+lockdep_is_held(&shrinker_rwsem));
+}
+
 static int expand_one_shrinker_info(struct mem_cgroup *memcg,
int map_size, int defer_size,
int old_map_size, int old_defer_size)
@@ -411,18 +418,34 @@ static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
   struct mem_cgroup *memcg)
 {
struct shrinker_info *info;
+   long ret;
 
-   info = shrinker_info_protected(memcg, nid);
-   return atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
+   /*
+* Need rcu lock here in case we've released shrinker_rwsem to prevent
+* hang on nfs before calling do_shrink_slab().
+*/
+   rcu_read_lock();
+   info = shrinker_info_check(memcg, nid);
+   ret = atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
+   rcu_read_unlock();
+   return ret;
 }
 
 static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
  struct mem_cgroup *memcg)
 {
struct shrinker_info *info;
+   long ret;
 
-   info = shrinker_info_protected(memcg, nid);
-   return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
+   /*
+* Need rcu lock here in case we've released shrinker_rwsem to prevent
+* hang on nfs before calling do_shrink_slab().
+*/
+   rcu_read_lock();
+   info = shrinker_info_check(memcg, nid);
+   ret = atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
+   rcu_read_unlock();
+   return ret;
 }
 
 void reparent_shrinker_deferred(struct mem_cgroup *memcg)
-- 
2.43.0



[Devel] [PATCH VZ9 2/2] mm/swap: convert add_to_swap_cache() to take a folio

2024-01-31 Thread Pavel Tikhomirov
From: "Matthew Wilcox (Oracle)" 

With all callers using folios, we can convert add_to_swap_cache() to take
a folio and use it throughout.

Link: https://lkml.kernel.org/r/20220902194653.1739778-13-wi...@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) 
Signed-off-by: Andrew Morton 

(cherry picked from commit a4c366f01f10073e0220656561b875627ff7cd90)
https://virtuozzo.atlassian.net/browse/PSBM-153264
Signed-off-by: Pavel Tikhomirov 
---
 mm/shmem.c  |  2 +-
 mm/swap.h   |  4 ++--
 mm/swap_state.c | 34 +-
 3 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 50ccad78b315f..fbfecb73e740b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1435,7 +1435,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 if (list_empty(&info->swaplist))
 list_add(&info->swaplist, &shmem_swaplist);
 
-   if (add_to_swap_cache(>page, swap,
+   if (add_to_swap_cache(folio, swap,
__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
NULL) == 0) {
spin_lock_irq(&info->lock);
diff --git a/mm/swap.h b/mm/swap.h
index f23941d6cc699..4c2b7f16783e4 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -34,7 +34,7 @@ extern struct address_space *swapper_spaces[];
 void show_swap_cache_info(void);
 bool add_to_swap(struct folio *folio);
 void *get_shadow_from_swap_cache(swp_entry_t entry);
-int add_to_swap_cache(struct page *page, swp_entry_t entry,
+int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
  gfp_t gfp, void **shadowp);
 void __delete_from_swap_cache(struct folio *folio,
  swp_entry_t entry, void *shadow);
@@ -126,7 +126,7 @@ static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
return NULL;
 }
 
-static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
+static inline int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
gfp_t gfp_mask, void **shadowp)
 {
return -1;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3a2323aba6f9b..9aed413b61e60 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -85,21 +85,21 @@ void *get_shadow_from_swap_cache(swp_entry_t entry)
  * add_to_swap_cache resembles filemap_add_folio on swapper_space,
  * but sets SwapCache flag and private instead of mapping and index.
  */
-int add_to_swap_cache(struct page *page, swp_entry_t entry,
+int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
gfp_t gfp, void **shadowp)
 {
struct address_space *address_space = swap_address_space(entry);
pgoff_t idx = swp_offset(entry);
-   XA_STATE_ORDER(xas, &address_space->i_pages, idx, compound_order(page));
-   unsigned long i, nr = thp_nr_pages(page);
+   XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio));
+   unsigned long i, nr = folio_nr_pages(folio);
void *old;
 
-   VM_BUG_ON_PAGE(!PageLocked(page), page);
-   VM_BUG_ON_PAGE(PageSwapCache(page), page);
-   VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+   VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+   VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
+   VM_BUG_ON_FOLIO(!folio_test_swapbacked(folio), folio);
 
-   page_ref_add(page, nr);
-   SetPageSwapCache(page);
+   folio_ref_add(folio, nr);
+   folio_set_swapcache(folio);
 
do {
xas_lock_irq(&xas);
@@ -107,19 +107,19 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry,
 if (xas_error(&xas))
goto unlock;
for (i = 0; i < nr; i++) {
-   VM_BUG_ON_PAGE(xas.xa_index != idx + i, page);
+   VM_BUG_ON_FOLIO(xas.xa_index != idx + i, folio);
old = xas_load(&xas);
if (xa_is_value(old)) {
if (shadowp)
*shadowp = old;
}
-   set_page_private(page + i, entry.val + i);
-   xas_store(&xas, page);
+   set_page_private(folio_page(folio, i), entry.val + i);
+   xas_store(&xas, folio);
xas_next(&xas);
}
address_space->nrpages += nr;
-   __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
-   __mod_lruvec_page_state(page, NR_SWAPCACHE, nr);
+   __node_stat_mod_folio(folio, NR_FILE_PAGES, nr);
+   __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr);
 unlock:
xas_unlock_irq(&xas);
} while (xas_nomem(&xas, gfp));
@@ -127,8 +127,8 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry,
if (!xas_error(&xas))
return 0;
 
-   ClearPageSwapCache(page);
-   page_ref_sub(page, nr);
+   folio_clear_swapcache(folio);
+   folio_ref_sub(folio, nr);

[Devel] [PATCH VZ9 1/2] mm/swap: convert __read_swap_cache_async() to use a folio

2024-01-31 Thread Pavel Tikhomirov
From: "Matthew Wilcox (Oracle)" 

Remove a few hidden (and one visible) calls to compound_head().

Link: https://lkml.kernel.org/r/20220902194653.1739778-12-wi...@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) 
Signed-off-by: Andrew Morton 

(cherry picked from commit a0d3374b070776e985bbd7b165b178fa688bf37a)
Change: Also update vz specific hunk SetPageActive->folio_set_active.

https://virtuozzo.atlassian.net/browse/PSBM-153264
Signed-off-by: Pavel Tikhomirov 
---
 mm/swap_state.c | 39 ---
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index b16dc348bbd26..3a2323aba6f9b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -423,7 +423,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
bool *new_page_allocated, bool activate)
 {
struct swap_info_struct *si;
-   struct page *page;
+   struct folio *folio;
void *shadow = NULL;
 
*new_page_allocated = false;
@@ -438,11 +438,11 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
si = get_swap_device(entry);
if (!si)
return NULL;
-   page = find_get_page(swap_address_space(entry),
-swp_offset(entry));
+   folio = filemap_get_folio(swap_address_space(entry),
+   swp_offset(entry));
put_swap_device(si);
-   if (page)
-   return page;
+   if (folio)
+   return folio_file_page(folio, swp_offset(entry));
 
/*
 * Just skip read ahead for unused swap slot.
@@ -460,8 +460,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 * before marking swap_map SWAP_HAS_CACHE, when -EEXIST will
 * cause any racers to loop around until we add it to cache.
 */
-   page = alloc_page_vma(gfp_mask, vma, addr);
-   if (!page)
+   folio = vma_alloc_folio(gfp_mask, 0, vma, addr, false);
+   if (!folio)
return NULL;
 
/*
@@ -471,7 +471,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
if (!err)
break;
 
-   put_page(page);
+   folio_put(folio);
if (err != -EEXIST)
return NULL;
 
@@ -489,32 +489,33 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 * The swap entry is ours to swap in. Prepare the new page.
 */
 
-   __SetPageLocked(page);
-   __SetPageSwapBacked(page);
+   __folio_set_locked(folio);
+   __folio_set_swapbacked(folio);
 
-   if (mem_cgroup_swapin_charge_page(page, NULL, gfp_mask, entry))
+   if (mem_cgroup_swapin_charge_page(&folio->page, NULL, gfp_mask, entry))
goto fail_unlock;
 
/* May fail (-ENOMEM) if XArray node allocation failed. */
-   if (add_to_swap_cache(page, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
+   if (add_to_swap_cache(&folio->page, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
goto fail_unlock;
 
mem_cgroup_swapin_uncharge_swap(entry);
 
if (shadow)
-   workingset_refault(page_folio(page), shadow);
+   workingset_refault(folio, shadow);
 
/* Caller will initiate read into locked page */
if (activate)
-   SetPageActive(page);
-   lru_cache_add(page);
+   folio_set_active(folio);
+   /* Caller will initiate read into locked folio */
+   folio_add_lru(folio);
*new_page_allocated = true;
-   return page;
+   return &folio->page;
 
 fail_unlock:
-   put_swap_page(page, entry);
-   unlock_page(page);
-   put_page(page);
+   put_swap_page(&folio->page, entry);
+   folio_unlock(folio);
+   folio_put(folio);
return NULL;
 }
 
-- 
2.43.0



[Devel] [PATCH VZ9 0/2] mm/swap: fix swapcache page/folio inconsistency

2024-01-31 Thread Pavel Tikhomirov
This ports the switching from page to folio in add_to_swap_cache().

We saw multiple different hangs on mmap_lock, where the task holding the
lock was livelocked spinning in this stack:

  +-> __x64_sys_ioctl
+-> kvm_vcpu_ioctl
  +-> kvm_arch_vcpu_ioctl_run
+-> vcpu_run
  +-> vcpu_enter_guest
+-> kvm_mmu_page_fault
  +-> kvm_tdp_page_fault
+-> kvm_faultin_pfn
  +-> __kvm_faultin_pfn
+-> hva_to_pfn
  +-> get_user_pages_unlocked

  +-> get_user_pages_unlocked
+-> mmap_read_lock # 1
+-> __get_user_pages_locked # 2
  +-> for-loop # taken once
+-> __get_user_pages
  +-> retry-loop # constantly spinning
+-> faultin_page # return 0 to trigger retry
  +-> handle_mm_fault
+-> __handle_mm_fault
  +-> handle_pte_fault
+-> do_swap_page
  +-> lookup_swap_cache # returns non-NULL
  +-> if (swapcache)
+-> if (!folio_test_swapcache || page_private(page) != entry.val)
  +-> goto out_page
+-> return 0

That can be caused by an inconsistency in how the swapcache flag is set
and read: PageSwapCache reads the flag from the folio, but
SetPageSwapCache/ClearPageSwapCache instead modify the flag on the page.

After applying these patches SetPageSwapCache/ClearPageSwapCache become
unused, so all paths now access this flag through the folio. With this
change I no longer see any hangs on mmap_lock (on the same test setup).

https://virtuozzo.atlassian.net/browse/PSBM-153264
Signed-off-by: Pavel Tikhomirov 

Matthew Wilcox (Oracle) (2):
  mm/swap: convert __read_swap_cache_async() to use a folio
  mm/swap: convert add_to_swap_cache() to take a folio

 mm/shmem.c  |  2 +-
 mm/swap.h   |  4 +--
 mm/swap_state.c | 71 +
 3 files changed, 39 insertions(+), 38 deletions(-)

-- 
2.43.0



[Devel] [PATCH VZ9 4/4] netfilter: bridge: replace physindev with physinif in nf_bridge_info

2024-01-22 Thread Pavel Tikhomirov
An skb can be added to a neigh->arp_queue while waiting for an arp
reply, where the original skb's skb->dev can differ from the neigh's
neigh->dev. For instance, when bridging a dnated skb from one veth to
another, the skb would be added to a neigh->arp_queue of the bridge.

As skb->dev can be reset back to nf_bridge->physindev and used, and as
there is no explicit mechanism that prevents this physindev from being
freed under us (for instance, neigh_flush_dev doesn't clean up skbs from
a different device's neigh queue), we can crash on e.g. this stack:

arp_process
  neigh_update
skb = __skb_dequeue(&neigh->arp_queue)
  neigh_resolve_output(..., skb)
...
  br_nf_dev_xmit
br_nf_pre_routing_finish_bridge_slow
  skb->dev = nf_bridge->physindev
  br_handle_frame_finish

Let's use a plain ifindex instead of a net_device link. To peek into the
original net_device we will use dev_get_by_index_rcu(). Thus either we
get the device and are safe to use it, or we don't and drop the skb.

Fixes: c4e70a87d975 ("netfilter: bridge: rename br_netfilter.c to br_netfilter_hooks.c")
Suggested-by: Florian Westphal 
Signed-off-by: Pavel Tikhomirov 
Signed-off-by: Pablo Neira Ayuso 

(cherry picked from commit 9874808878d9eed407e3977fd11fee49de1e1d86)
https://virtuozzo.atlassian.net/browse/PSBM-153269
Signed-off-by: Pavel Tikhomirov 
---
 include/linux/netfilter_bridge.h|  4 +--
 include/linux/skbuff.h  |  2 +-
 net/bridge/br_netfilter_hooks.c | 42 +++--
 net/bridge/br_netfilter_ipv6.c  | 14 +++---
 net/ipv4/netfilter/nf_reject_ipv4.c |  9 ---
 net/ipv6/netfilter/nf_reject_ipv6.c | 11 +---
 6 files changed, 61 insertions(+), 21 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index e927b9a15a556..743475ca7e9d5 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -42,7 +42,7 @@ static inline int nf_bridge_get_physinif(const struct sk_buff *skb)
if (!nf_bridge)
return 0;
 
-   return nf_bridge->physindev ? nf_bridge->physindev->ifindex : 0;
+   return nf_bridge->physinif;
 }
 
 static inline int nf_bridge_get_physoutif(const struct sk_buff *skb)
@@ -60,7 +60,7 @@ nf_bridge_get_physindev(const struct sk_buff *skb, struct net *net)
 {
const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
 
-   return nf_bridge ? nf_bridge->physindev : NULL;
+   return nf_bridge ? dev_get_by_index_rcu(net, nf_bridge->physinif) : NULL;
 }
 
 static inline struct net_device *
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a6d26a08a2557..4714ce0afffb3 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -263,7 +263,7 @@ struct nf_bridge_info {
u8  bridged_dnat:1;
u8  sabotage_in_done:1;
__u16   frag_max_size;
-   struct net_device   *physindev;
+   int physinif;
 
/* always valid & non-NULL from FORWARD on, for physdev match */
struct net_device   *physoutdev;
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 4bc6761517bbc..98598672f9382 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -278,8 +278,17 @@ int br_nf_pre_routing_finish_bridge(struct net *net, struct sock *sk, struct sk_buff *skb)
int ret;
 
if ((neigh->nud_state & NUD_CONNECTED) && neigh->hh.hh_len) {
+   struct net_device *br_indev;
+
+   br_indev = nf_bridge_get_physindev(skb, net);
+   if (!br_indev) {
+   neigh_release(neigh);
+   goto free_skb;
+   }
+
neigh_hh_bridge(&neigh->hh, skb);
-   skb->dev = nf_bridge->physindev;
+   skb->dev = br_indev;
+
ret = br_handle_frame_finish(net, sk, skb);
} else {
/* the neighbour function below overwrites the complete
@@ -351,12 +360,18 @@ br_nf_ipv4_daddr_was_changed(const struct sk_buff *skb,
  */
 static int br_nf_pre_routing_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
-   struct net_device *dev = skb->dev;
+   struct net_device *dev = skb->dev, *br_indev;
struct iphdr *iph = ip_hdr(skb);
struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
struct rtable *rt;
int err;
 
+   br_indev = nf_bridge_get_physindev(skb, net);
+   if (!br_indev) {
+   kfree_skb(skb);
+   return 0;
+   }
+
nf_bridge->frag_max_size = IPCB(skb)->frag_max_size;
 
if (nf_bridge->pkt_otherhost)

[Devel] [PATCH VZ9 3/4] netfilter: propagate net to nf_bridge_get_physindev

2024-01-22 Thread Pavel Tikhomirov
This is a preparation patch for replacing physindev with physinif on
nf_bridge_info structure. We will use dev_get_by_index_rcu to resolve
device, when needed, and it requires net to be available.

Signed-off-by: Pavel Tikhomirov 
Reviewed-by: Simon Horman 
Signed-off-by: Pablo Neira Ayuso 

(cherry picked from commit a54e72197037d2c9bfcd70dddaac8c8ccb5b41ba)
https://virtuozzo.atlassian.net/browse/PSBM-153269
Signed-off-by: Pavel Tikhomirov 
---
 include/linux/netfilter_bridge.h   |  2 +-
 net/ipv4/netfilter/nf_reject_ipv4.c|  2 +-
 net/ipv6/netfilter/nf_reject_ipv6.c|  2 +-
 net/netfilter/ipset/ip_set_hash_netiface.c |  8 
 net/netfilter/nf_log_syslog.c  | 13 +++--
 net/netfilter/nf_queue.c   |  2 +-
 net/netfilter/xt_physdev.c |  2 +-
 7 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index f980edfdd2783..e927b9a15a556 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -56,7 +56,7 @@ static inline int nf_bridge_get_physoutif(const struct sk_buff *skb)
 }
 
 static inline struct net_device *
-nf_bridge_get_physindev(const struct sk_buff *skb)
+nf_bridge_get_physindev(const struct sk_buff *skb, struct net *net)
 {
const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
 
diff --git a/net/ipv4/netfilter/nf_reject_ipv4.c b/net/ipv4/netfilter/nf_reject_ipv4.c
index f2edb40c0db00..63c91aa2375fb 100644
--- a/net/ipv4/netfilter/nf_reject_ipv4.c
+++ b/net/ipv4/netfilter/nf_reject_ipv4.c
@@ -286,7 +286,7 @@ void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb,
 * build the eth header using the original destination's MAC as the
 * source, and send the RST packet directly.
 */
-   br_indev = nf_bridge_get_physindev(oldskb);
+   br_indev = nf_bridge_get_physindev(oldskb, net);
if (br_indev) {
struct ethhdr *oeth = eth_hdr(oldskb);
 
diff --git a/net/ipv6/netfilter/nf_reject_ipv6.c b/net/ipv6/netfilter/nf_reject_ipv6.c
index dffedcded..b138ee4ebc6a2 100644
--- a/net/ipv6/netfilter/nf_reject_ipv6.c
+++ b/net/ipv6/netfilter/nf_reject_ipv6.c
@@ -353,7 +353,7 @@ void nf_send_reset6(struct net *net, struct sock *sk, struct sk_buff *oldskb,
 * build the eth header using the original destination's MAC as the
 * source, and send the RST packet directly.
 */
-   br_indev = nf_bridge_get_physindev(oldskb);
+   br_indev = nf_bridge_get_physindev(oldskb, net);
if (br_indev) {
struct ethhdr *oeth = eth_hdr(oldskb);
 
diff --git a/net/netfilter/ipset/ip_set_hash_netiface.c b/net/netfilter/ipset/ip_set_hash_netiface.c
index 0310732862362..bf1a3851ba5a7 100644
--- a/net/netfilter/ipset/ip_set_hash_netiface.c
+++ b/net/netfilter/ipset/ip_set_hash_netiface.c
@@ -138,9 +138,9 @@ hash_netiface4_data_next(struct hash_netiface4_elem *next,
 #include "ip_set_hash_gen.h"
 
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-static const char *get_physindev_name(const struct sk_buff *skb)
+static const char *get_physindev_name(const struct sk_buff *skb, struct net *net)
 {
-   struct net_device *dev = nf_bridge_get_physindev(skb);
+   struct net_device *dev = nf_bridge_get_physindev(skb, net);
 
return dev ? dev->name : NULL;
 }
@@ -177,7 +177,7 @@ hash_netiface4_kadt(struct ip_set *set, const struct sk_buff *skb,
 
if (opt->cmdflags & IPSET_FLAG_PHYSDEV) {
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-   const char *eiface = SRCDIR ? get_physindev_name(skb) :
+   const char *eiface = SRCDIR ? get_physindev_name(skb, xt_net(par)) :
  get_physoutdev_name(skb);
 
if (!eiface)
@@ -395,7 +395,7 @@ hash_netiface6_kadt(struct ip_set *set, const struct sk_buff *skb,
 
if (opt->cmdflags & IPSET_FLAG_PHYSDEV) {
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-   const char *eiface = SRCDIR ? get_physindev_name(skb) :
+   const char *eiface = SRCDIR ? get_physindev_name(skb, xt_net(par)) :
  get_physoutdev_name(skb);
 
if (!eiface)
diff --git a/net/netfilter/nf_log_syslog.c b/net/netfilter/nf_log_syslog.c
index b7e4e6791038e..baf609ff5fbda 100644
--- a/net/netfilter/nf_log_syslog.c
+++ b/net/netfilter/nf_log_syslog.c
@@ -117,7 +117,8 @@ nf_log_dump_packet_common(struct nf_log_buf *m, u8 pf,
  unsigned int hooknum, const struct sk_buff *skb,
  const struct net_device *in,
  const struct net_device *out,
- const struct nf_loginfo *loginfo, const char *prefix)
+ const struct nf_loginfo *loginfo, const char *prefix,
+ struct net *net)
 {

[Devel] [PATCH VZ9 0/4] netlink: bridge: fix nf_bridge->physindev use after free

2024-01-22 Thread Pavel Tikhomirov
This is a backport of mainstream version of the fix to vz9:
https://lore.kernel.org/netdev/2024050645.85637-1-ptikhomi...@virtuozzo.com/

https://virtuozzo.atlassian.net/browse/PSBM-153269

Pavel Tikhomirov (4):
  netfilter: nfnetlink_log: use proper helper for fetching physinif
  netfilter: nf_queue: remove excess nf_bridge variable
  netfilter: propagate net to nf_bridge_get_physindev
  netfilter: bridge: replace physindev with physinif in nf_bridge_info

 include/linux/netfilter_bridge.h   |  6 ++--
 include/linux/skbuff.h |  2 +-
 net/bridge/br_netfilter_hooks.c| 42 +-
 net/bridge/br_netfilter_ipv6.c | 14 +---
 net/ipv4/netfilter/nf_reject_ipv4.c|  9 +++--
 net/ipv6/netfilter/nf_reject_ipv6.c| 11 --
 net/netfilter/ipset/ip_set_hash_netiface.c |  8 ++---
 net/netfilter/nf_log_syslog.c  | 13 +++
 net/netfilter/nf_queue.c   |  6 ++--
 net/netfilter/nfnetlink_log.c  |  8 ++---
 net/netfilter/xt_physdev.c |  2 +-
 11 files changed, 80 insertions(+), 41 deletions(-)

-- 
2.43.0



[Devel] [PATCH VZ9 1/4] netfilter: nfnetlink_log: use proper helper for fetching physinif

2024-01-22 Thread Pavel Tikhomirov
We don't use physindev in __build_packet_message except for getting
physinif from it. So let's switch to nf_bridge_get_physinif to get what
we want directly.

Signed-off-by: Pavel Tikhomirov 
Reviewed-by: Simon Horman 
Signed-off-by: Pablo Neira Ayuso 

(cherry picked from commit c3f9fd54cd87233f53bdf0e191a86b3a5e960e02)
https://virtuozzo.atlassian.net/browse/PSBM-153269
Signed-off-by: Pavel Tikhomirov 
---
 net/netfilter/nfnetlink_log.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/nfnetlink_log.c b/net/netfilter/nfnetlink_log.c
index d97eb280cb2e8..8116beccdc1c4 100644
--- a/net/netfilter/nfnetlink_log.c
+++ b/net/netfilter/nfnetlink_log.c
@@ -499,7 +499,7 @@ __build_packet_message(struct nfnl_log_net *log,
 
htonl(br_port_get_rcu(indev)->br->dev->ifindex)))
goto nla_put_failure;
} else {
-   struct net_device *physindev;
+   int physinif;
 
/* Case 2: indev is bridge group, we need to look for
 * physical device (when called from ipv4) */
@@ -507,10 +507,10 @@ __build_packet_message(struct nfnl_log_net *log,
 htonl(indev->ifindex)))
goto nla_put_failure;
 
-   physindev = nf_bridge_get_physindev(skb);
-   if (physindev &&
+   physinif = nf_bridge_get_physinif(skb);
+   if (physinif &&
nla_put_be32(inst->skb, NFULA_IFINDEX_PHYSINDEV,
-htonl(physindev->ifindex)))
+htonl(physinif)))
goto nla_put_failure;
}
 #endif
-- 
2.43.0



[Devel] [PATCH VZ9 2/4] netfilter: nf_queue: remove excess nf_bridge variable

2024-01-22 Thread Pavel Tikhomirov
We don't really need the nf_bridge variable here. And
nf_bridge_info_exists is a better replacement for nf_bridge_info_get
when we are only checking for existence.

Signed-off-by: Pavel Tikhomirov 
Reviewed-by: Simon Horman 
Signed-off-by: Pablo Neira Ayuso 

(cherry picked from commit aeaa44075f8e49e2e0ad4507d925e690b7950145)
https://virtuozzo.atlassian.net/browse/PSBM-153269
Signed-off-by: Pavel Tikhomirov 
---
 net/netfilter/nf_queue.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/netfilter/nf_queue.c b/net/netfilter/nf_queue.c
index d7542c5c4ae61..0077354d3258c 100644
--- a/net/netfilter/nf_queue.c
+++ b/net/netfilter/nf_queue.c
@@ -82,10 +82,8 @@ static void __nf_queue_entry_init_physdevs(struct nf_queue_entry *entry)
 {
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
const struct sk_buff *skb = entry->skb;
-   struct nf_bridge_info *nf_bridge;
 
-   nf_bridge = nf_bridge_info_get(skb);
-   if (nf_bridge) {
+   if (nf_bridge_info_exists(skb)) {
entry->physin = nf_bridge_get_physindev(skb);
entry->physout = nf_bridge_get_physoutdev(skb);
} else {
-- 
2.43.0



[Devel] [PATCH VZ9] drivers/vhost: fix missing rcu_read_lock in vhost_work_queue

2024-01-21 Thread Pavel Tikhomirov
In this stack:

  +-> vhost_vsock_dev_ioctl
+-> vhost_vsock_start
  +-> vhost_work_queue
+-> xas_find
  +-> xas_load
+-> xas_start
  +-> xa_head
+-> rcu_dereference_check

We require either rcu_read_lock or xa_lock but have none. Let's fix it
by calling xa_find, which is a wrapper for xas_find that takes the rcu
read lock properly and also handles the xas_retry logic.

https://virtuozzo.atlassian.net/browse/PSBM-153264
Fixes: 5271bf51f1b83 ("ms/vhost: replace single worker pointer with xarray")
Signed-off-by: Pavel Tikhomirov 

Feature: vhost-blk: in-kernel accelerator for virtio-blk guests
---
 drivers/vhost/vhost.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c32557e279dfb..2f45c8d2b6fd6 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -257,10 +257,10 @@ static bool vhost_worker_queue(struct vhost_worker *worker,
 
 bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 {
-	XA_STATE(xas, &dev->worker_xa, 0);
 	struct vhost_worker *worker;
+	unsigned long i;
 
-	worker = xas_find(&xas, UINT_MAX);
+	worker = xa_find(&dev->worker_xa, &i, ULONG_MAX, XA_PRESENT);
if (!worker)
return false;
 
-- 
2.43.0



Re: [Devel] [PATCH vz7 v3 1/2] net, neigh: Fix null-ptr-deref in neigh_table_clear()

2024-01-16 Thread Pavel Tikhomirov

(for both patches)
Reviewed-by: Pavel Tikhomirov 

On 17/01/2024 03:31, Alexander Atanasov wrote:

From: Chen Zhongjin 

When the IPv6 module gets initialized but hits an error in the middle,
the kernel panics with:

KASAN: null-ptr-deref in range [0x0598-0x059f]
CPU: 1 PID: 361 Comm: insmod
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:__neigh_ifdown.isra.0+0x24b/0x370
RSP: 0018:888012677908 EFLAGS: 0202
...
Call Trace:
  
  neigh_table_clear+0x94/0x2d0
  ndisc_cleanup+0x27/0x40 [ipv6]
  inet6_init+0x21c/0x2cb [ipv6]
  do_one_initcall+0xd3/0x4d0
  do_init_module+0x1ae/0x670
...
Kernel panic - not syncing: Fatal exception

When ipv6 initialization fails, it will try to cleanup and calls:

neigh_table_clear()
   neigh_ifdown(tbl, NULL)
 pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev == NULL))
 # dev_net(NULL) triggers null-ptr-deref.

Fix it by passing NULL to pneigh_queue_purge() in neigh_ifdown() if dev
is NULL, so that the kernel does not panic immediately.

Fixes: 66ba215cb513 ("neigh: fix possible DoS due to net iface start/stop loop")
Signed-off-by: Chen Zhongjin 
Reviewed-by: Eric Dumazet 
Reviewed-by: Denis V. Lunev 
Link: https://lore.kernel.org/r/20221101121552.21890-1-chenzhong...@huawei.com
Signed-off-by: Jakub Kicinski 
(cherry picked from commit f8017317cb0b279b8ab98b0f3901a2e0ac880dad)
https://virtuozzo.atlassian.net/browse/PSBM-153018
https://bugs.openvz.org/browse/OVZ-7410
Signed-off-by: Alexander Atanasov 
---
  net/core/neighbour.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)


v1->v2: fixed wrong merge
v2->v3: backported one more patch - which is not in the vz7 tree but is in 
mainstream,
so it looked like two patches are made into one. Both touch the same code to 
increase
the confusion.


diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index d5b28c158144..ae6490f8d3ae 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -321,7 +321,7 @@ int neigh_ifdown(struct neigh_table *tbl, struct net_device *dev)
neigh_flush_dev(tbl, dev);
pneigh_ifdown(tbl, dev);
 	write_unlock_bh(&tbl->lock);
-	pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev));
+	pneigh_queue_purge(&tbl->proxy_queue, dev ? dev_net(dev) : NULL);
 	if (skb_queue_empty(&tbl->proxy_queue))
 		del_timer_sync(&tbl->proxy_timer);
return 0;


--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.


Re: [Devel] [PATCH VZ9 2/3] net: zerocopy over unix sockets

2024-01-16 Thread Pavel Tikhomirov



On 16/01/2024 19:26, Alexey Kuznetsov wrote:

Hello!

On Tue, Jan 16, 2024 at 1:13 PM Pavel Tikhomirov wrote:

--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1405,6 +1405,9 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
 (sk->sk_type == SOCK_DGRAM &&
  sk->sk_protocol == IPPROTO_UDP)))
   ret = -EOPNOTSUPP;
+ } else if (sk->sk_family == PF_UNIX) {
+ if (sk->sk_type == SOCK_DGRAM)
+ ret = -EOPNOTSUPP;


Likely we don't want to enable zerocopy for SOCK_SEQPACKET. Maybe change
to: "if (sk->sk_type != SOCK_STREAM)"


Indeed. Thanks.



+ if (unlikely(flags & MSG_ERRQUEUE))
+ return unix_recv_error(sk, msg, size);


In case of MSG_PEEK, receiving a zero-copy notification error message 
might not be what the user wants.


MSG_ERRQUEUE means user requested error messages, them and only them.
If MSG_PEEK (or another flag) is also set it means either peeking from
error queue or this flag
is ignored (like MSG_DONTWAIT) or rejected, whatever is appropriate. At
least it is how I originally designed
it. But in case it was messed up with time, I have to recheck current semantics
in tcp and align to it, whatever it is.


Ah sorry I was confused, you are right of course, when we ask for errors 
it is perfectly OK to receive an error.


Though yes, there would be no way to peek on error queue, so CRIU will 
not be able to handle such error messages, but that is different story.






+ int err = skb_orphan_frags_rx(skb, GFP_KERNEL);
+


I prefer not to put function call on the same line with variable


Ok-ok. If my old pants are old-fashioned granpa's style, we replace them.

But why that creepy checkpatch.pl does not complain? Is it broken and
already not haute couture? :-)


I don't think that it is a strict rule, there are many places where one 
can see such things in kernel code (e.g. osf_statfs, bpf_copy_from_user, 
update_parent_effective_cpumask or nsfs_init_fs_context). So I don't insist.


--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.


Re: [Devel] [PATCH vz7 v2] net: neigh: decrement the family specific qlen

2024-01-15 Thread Pavel Tikhomirov



On 16/01/2024 15:10, Alexander Atanasov wrote:

On 16.01.24 5:12, Pavel Tikhomirov wrote:
I'd prefer to have two separate clean patches ported instead of 
porting a merge of one into another (without a comment explaining 
the change).


commit 8207f253a097fe15c93d85ac15ebb73c5e39e1e1
Author: Thomas Zeitlhofer 
Date:   Tue Nov 15 23:09:41 2022 +0100

@@ -409,7 +427,8 @@ static int __neigh_ifdown(struct neigh_table *tbl, struct net_device *dev,

 	write_lock_bh(&tbl->lock);
 	neigh_flush_dev(tbl, dev, skip_perm);
 	pneigh_ifdown_and_unlock(tbl, dev);
-	pneigh_queue_purge(&tbl->proxy_queue, dev ? dev_net(dev) : NULL);
+	pneigh_queue_purge(&tbl->proxy_queue, dev ? dev_net(dev) : NULL,
+			   tbl->family);
 	if (skb_queue_empty_lockless(&tbl->proxy_queue))
 		del_timer_sync(&tbl->proxy_timer);
 	return 0;


The hunk about the device is in the commit I am backporting to vz7.
I made a mistake during the merge, so it was not in the v1 patch.

The other commit you talk about is:
commit f8017317cb0b279b8ab98b0f3901a2e0ac880dad
Author: Chen Zhongjin 
Date:   Tue Nov 1 20:15:52 2022 +0800


It brings the fix for the NULL dev case and predates the one I am backporting.

Both logs are from the mainstream tree.




You ported 8207f253 and merged f8017317 into it; I don't like that. I 
would have ported both patches sequentially instead, that's all I'm 
trying to say here.


When you port something cleanly, git-range-diff will be clean:

[snorch@turmoil vzkernel-vz7]$ git range-diff 8207f253a097fe15c93d85ac15ebb73c5e39e1e1^- HEAD^- | grep "++\|--\|-+\|+-"

[snorch@turmoil vzkernel-vz7]$

in your case it is:

[snorch@turmoil vzkernel-vz7]$ git range-diff 8207f253a097fe15c93d85ac15ebb73c5e39e1e1^- aatanasov-neigh-qlen-v2^- | grep "++\|--\|-+\|+-"

--	pneigh_queue_purge(&tbl->proxy_queue, dev ? dev_net(dev) : NULL);
+-	pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev));







On 15/01/2024 22:56, Alexander Atanasov wrote:

From: Thomas Zeitlhofer 

Commit 0ff4eb3d5ebb ("neighbour: make proxy_queue.qlen limit
per-device") introduced the length counter qlen in struct neigh_parms.
There are separate neigh_parms instances for IPv4/ARP and IPv6/ND, and
while the family specific qlen is incremented in pneigh_enqueue(), the
mentioned commit decrements always the IPv4/ARP specific qlen,
regardless of the currently processed family, in pneigh_queue_purge()
and neigh_proxy_process().

As a result, with IPv6/ND, the family specific qlen is only incremented
(and never decremented) until it exceeds PROXY_QLEN, and then, according
to the check in pneigh_enqueue(), neighbor solicitations are not
answered anymore. As an example, this is noted when using the
subnet-router anycast address to access a Linux router. After a certain
amount of time (in the observed case, qlen exceeded PROXY_QLEN after two
days), the Linux router stops answering neighbor solicitations for its
subnet-router anycast address and effectively becomes unreachable.

Another result with IPv6/ND is that the IPv4/ARP specific qlen is
decremented more often than incremented. This leads to negative qlen
values, as a signed integer has been used for the length counter qlen,
and potentially to an integer overflow.

Fix this by introducing the helper function neigh_parms_qlen_dec(),
which decrements the family specific qlen. Thereby, make use of the
existing helper function neigh_get_dev_parms_rcu(), whose definition
therefore needs to be placed earlier in neighbour.c. Take the family
member from struct neigh_table to determine the currently processed
family and appropriately call neigh_parms_qlen_dec() from
pneigh_queue_purge() and neigh_proxy_process().

Additionally, use an unsigned integer for the length counter qlen.

Fixes: 0ff4eb3d5ebb ("neighbour: make proxy_queue.qlen limit per-device")

Signed-off-by: Thomas Zeitlhofer 
Signed-off-by: David S. Miller 
(cherry picked from commit 8207f253a097fe15c93d85ac15ebb73c5e39e1e1)
https://virtuozzo.atlassian.net/browse/PSBM-153018
https://bugs.openvz.org/browse/OVZ-7410
Signed-off-by: Alexander Atanasov 
---
  include/net/neighbour.h |  2 +-
  net/core/neighbour.c    | 58 +
  2 files changed, 31 insertions(+), 29 deletions(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 2326044d5294..c63016cc4b27 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -79,7 +79,7 @@ struct neigh_parms {
  struct rcu_head rcu_head;
  int    reachable_time;
-    int    qlen;
+    u32    qlen;
  int    data[NEIGH_VAR_DATA_MAX];
  DECLARE_BITMAP(data_state, NEIGH_VAR_DATA_MAX);
  };
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index d5b28c158144..fa9da27c0676 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -223,7 +223,31 @@ static int neigh_del_timer(struct neighbour *n)
  return 0;
  }
-static void pneigh_queue_purge(struct 

Re: [Devel] [PATCH VZ9 2/3] net: zerocopy over unix sockets

2024-01-15 Thread Pavel Tikhomirov




On 09/01/2024 21:38, Alexey Kuznetsov wrote:

Observation is that af_unix sockets today became slower
and eat a lot more cpu than 100G ethernet. So, implement
MSG_ZEROCOPY over af_unix sockets to be able to talk to
local services without collapse of performance.

Unexpectedly, this makes sense! F.e. zerocopy cannot be done
in TCP over loopback, because skbs when passing over loopback
change ownership. But unix sockets traditionally implemented
in the way that skbs queued on destination sockets still preserve
ownership of source socket, so that completion notifications
can be delivered back to sender without problems.

Code looks neat and clean and might be submitted to mainstream
one day. A little bit of dirt is that it shares IP_RECVERR
control message, but rather this dirt is not mine, but
by original author, who used protocol level message for generic
socket level notifications without any good reasons, creating
a mess for users.

Signed-off-by: Alexey Kuznetsov 
---
  net/core/datagram.c |   2 +-
  net/core/sock.c |   3 ++
  net/unix/af_unix.c  | 108 +---
  3 files changed, 98 insertions(+), 15 deletions(-)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 8eb5999..980248b 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -640,7 +640,7 @@ int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
skb->data_len += copied;
skb->len += copied;
skb->truesize += truesize;
-	if (sk && sk->sk_type == SOCK_STREAM) {
+	if (sk && sk->sk_type == SOCK_STREAM && sk->sk_family != PF_UNIX) {
sk_wmem_queued_add(sk, truesize);
sk_mem_charge(sk, truesize);
} else {
diff --git a/net/core/sock.c b/net/core/sock.c
index 5d7e16a..fe8e102 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1405,6 +1405,9 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
  (sk->sk_type == SOCK_DGRAM &&
   sk->sk_protocol == IPPROTO_UDP)))
ret = -EOPNOTSUPP;
+   } else if (sk->sk_family == PF_UNIX) {
+   if (sk->sk_type == SOCK_DGRAM)
+   ret = -EOPNOTSUPP;


Likely we don't want to enable zerocopy for SOCK_SEQPACKET. Maybe change 
to: "if (sk->sk_type != SOCK_STREAM)"



} else if (sk->sk_family != PF_RDS) {
ret = -EOPNOTSUPP;
}
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 36c3cb9..e32fe47 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -115,6 +115,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include "scm.h"
  
@@ -2054,6 +2055,7 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,

  {
struct sock *sk = sock->sk;
struct sock *other = NULL;
+   struct ubuf_info *uarg = NULL;
int err, size;
struct sk_buff *skb;
int sent = 0;
@@ -2083,22 +2085,37 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
if (sk->sk_shutdown & SEND_SHUTDOWN)
goto pipe_err;
  
+	if ((msg->msg_flags & MSG_ZEROCOPY) && len && sock_flag(sk, SOCK_ZEROCOPY)) {

+   uarg = msg_zerocopy_alloc(sk, len);
+   if (!uarg) {
+   err = -ENOBUFS;
+   goto out_err;
+   }
+   }
+
while (sent < len) {
size = len - sent;
  
-		/* Keep two messages in the pipe so it schedules better */
-		size = min_t(int, size, (sk->sk_sndbuf >> 1) - 64);
+		if (!uarg) {
+			/* Keep two messages in the pipe so it schedules better */
+			size = min_t(int, size, (sk->sk_sndbuf >> 1) - 64);
+
+			/* allow fallback to order-0 allocations */
+			size = min_t(int, size, SKB_MAX_HEAD(0) + UNIX_SKB_FRAGS_SZ);
 
-		/* allow fallback to order-0 allocations */
-		size = min_t(int, size, SKB_MAX_HEAD(0) + UNIX_SKB_FRAGS_SZ);
+			data_len = max_t(int, 0, size - SKB_MAX_HEAD(0));
 
-		data_len = max_t(int, 0, size - SKB_MAX_HEAD(0));
+			data_len = min_t(size_t, size, PAGE_ALIGN(data_len));
 
-		data_len = min_t(size_t, size, PAGE_ALIGN(data_len));
+			skb = sock_alloc_send_pskb(sk, size - data_len, data_len,
+						   msg->msg_flags & MSG_DONTWAIT, &err,
+						   get_order(UNIX_SKB_FRAGS_SZ));
+		} else {
+			size = min_t(int, size, sk->sk_sndbuf);
+			skb = sock_alloc_send_pskb(sk, 0, 0,
+						   msg->msg_flags & MSG_DONTWAIT, &err, 0);
+		}
 
-		skb = 

Re: [Devel] [PATCH vz7 v2] net: neigh: decrement the family specific qlen

2024-01-15 Thread Pavel Tikhomirov
I'd prefer to have two separate clean patches ported instead of porting 
a merge of one into another (without a comment explaining the change).


On 15/01/2024 22:56, Alexander Atanasov wrote:

From: Thomas Zeitlhofer 

Commit 0ff4eb3d5ebb ("neighbour: make proxy_queue.qlen limit
per-device") introduced the length counter qlen in struct neigh_parms.
There are separate neigh_parms instances for IPv4/ARP and IPv6/ND, and
while the family specific qlen is incremented in pneigh_enqueue(), the
mentioned commit decrements always the IPv4/ARP specific qlen,
regardless of the currently processed family, in pneigh_queue_purge()
and neigh_proxy_process().

As a result, with IPv6/ND, the family specific qlen is only incremented
(and never decremented) until it exceeds PROXY_QLEN, and then, according
to the check in pneigh_enqueue(), neighbor solicitations are not
answered anymore. As an example, this is noted when using the
subnet-router anycast address to access a Linux router. After a certain
amount of time (in the observed case, qlen exceeded PROXY_QLEN after two
days), the Linux router stops answering neighbor solicitations for its
subnet-router anycast address and effectively becomes unreachable.

Another result with IPv6/ND is that the IPv4/ARP specific qlen is
decremented more often than incremented. This leads to negative qlen
values, as a signed integer has been used for the length counter qlen,
and potentially to an integer overflow.

Fix this by introducing the helper function neigh_parms_qlen_dec(),
which decrements the family specific qlen. Thereby, make use of the
existing helper function neigh_get_dev_parms_rcu(), whose definition
therefore needs to be placed earlier in neighbour.c. Take the family
member from struct neigh_table to determine the currently processed
family and appropriately call neigh_parms_qlen_dec() from
pneigh_queue_purge() and neigh_proxy_process().

Additionally, use an unsigned integer for the length counter qlen.

Fixes: 0ff4eb3d5ebb ("neighbour: make proxy_queue.qlen limit per-device")
Signed-off-by: Thomas Zeitlhofer 
Signed-off-by: David S. Miller 
(cherry picked from commit 8207f253a097fe15c93d85ac15ebb73c5e39e1e1)
https://virtuozzo.atlassian.net/browse/PSBM-153018
https://bugs.openvz.org/browse/OVZ-7410
Signed-off-by: Alexander Atanasov 
---
  include/net/neighbour.h |  2 +-
  net/core/neighbour.c| 58 +
  2 files changed, 31 insertions(+), 29 deletions(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 2326044d5294..c63016cc4b27 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -79,7 +79,7 @@ struct neigh_parms {
struct rcu_head rcu_head;
  
  	int	reachable_time;

-   int qlen;
+   u32 qlen;
int data[NEIGH_VAR_DATA_MAX];
DECLARE_BITMAP(data_state, NEIGH_VAR_DATA_MAX);
  };
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index d5b28c158144..fa9da27c0676 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -223,7 +223,31 @@ static int neigh_del_timer(struct neighbour *n)
return 0;
  }
  
-static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)

+static struct neigh_parms *neigh_get_dev_parms_rcu(struct net_device *dev,
+  int family)
+{
+   switch (family) {
+   case AF_INET:
+   return __in_dev_arp_parms_get_rcu(dev);
+   case AF_INET6:
+   return __in6_dev_nd_parms_get_rcu(dev);
+   }
+   return NULL;
+}
+
+static void neigh_parms_qlen_dec(struct net_device *dev, int family)
+{
+   struct neigh_parms *p;
+
+   rcu_read_lock();
+   p = neigh_get_dev_parms_rcu(dev, family);
+   if (p)
+   p->qlen--;
+   rcu_read_unlock();
+}
+
+static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net,
+  int family)
  {
struct sk_buff_head tmp;
unsigned long flags;
@@ -237,13 +261,7 @@ static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)
struct net_device *dev = skb->dev;
  
  		if (net == NULL || net_eq(dev_net(dev), net)) {

-   struct in_device *in_dev;
-
-   rcu_read_lock();
-   in_dev = __in_dev_get_rcu(dev);
-   if (in_dev)
-   in_dev->arp_parms->qlen--;
-   rcu_read_unlock();
+   neigh_parms_qlen_dec(dev, family);
__skb_unlink(skb, list);
__skb_queue_tail(, skb);
}
@@ -321,7 +339,8 @@ int neigh_ifdown(struct neigh_table *tbl, struct net_device *dev)
neigh_flush_dev(tbl, dev);
pneigh_ifdown(tbl, dev);
 	write_unlock_bh(&tbl->lock);
-	pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev));
+	pneigh_queue_purge(&tbl->proxy_queue, dev ? dev_net(dev) : NULL,
+			   tbl->family);

Re: [Devel] [PATCH vz7] net: neigh: decrement the family specific qlen

2024-01-15 Thread Pavel Tikhomirov
Do we also need f8017317cb0b2 ("net, neigh: Fix null-ptr-deref in 
neigh_table_clear()") while at it?


Looks good.

Reviewed-by: Pavel Tikhomirov 

On 15/01/2024 16:42, Alexander Atanasov wrote:

From: Thomas Zeitlhofer 

Commit 0ff4eb3d5ebb ("neighbour: make proxy_queue.qlen limit
per-device") introduced the length counter qlen in struct neigh_parms.
There are separate neigh_parms instances for IPv4/ARP and IPv6/ND, and
while the family specific qlen is incremented in pneigh_enqueue(), the
mentioned commit decrements always the IPv4/ARP specific qlen,
regardless of the currently processed family, in pneigh_queue_purge()
and neigh_proxy_process().

As a result, with IPv6/ND, the family specific qlen is only incremented
(and never decremented) until it exceeds PROXY_QLEN, and then, according
to the check in pneigh_enqueue(), neighbor solicitations are not
answered anymore. As an example, this is noted when using the
subnet-router anycast address to access a Linux router. After a certain
amount of time (in the observed case, qlen exceeded PROXY_QLEN after two
days), the Linux router stops answering neighbor solicitations for its
subnet-router anycast address and effectively becomes unreachable.

Another result with IPv6/ND is that the IPv4/ARP specific qlen is
decremented more often than incremented. This leads to negative qlen
values, as a signed integer has been used for the length counter qlen,
and potentially to an integer overflow.

Fix this by introducing the helper function neigh_parms_qlen_dec(),
which decrements the family specific qlen. Thereby, make use of the
existing helper function neigh_get_dev_parms_rcu(), whose definition
therefore needs to be placed earlier in neighbour.c. Take the family
member from struct neigh_table to determine the currently processed
family and appropriately call neigh_parms_qlen_dec() from
pneigh_queue_purge() and neigh_proxy_process().

Additionally, use an unsigned integer for the length counter qlen.

Fixes: 0ff4eb3d5ebb ("neighbour: make proxy_queue.qlen limit per-device")
Signed-off-by: Thomas Zeitlhofer 
Signed-off-by: David S. Miller 
(cherry picked from commit 8207f253a097fe15c93d85ac15ebb73c5e39e1e1)
https://virtuozzo.atlassian.net/browse/PSBM-153018
https://bugs.openvz.org/browse/OVZ-7410
Signed-off-by: Alexander Atanasov 
---
  include/net/neighbour.h |  2 +-
  net/core/neighbour.c| 57 +
  2 files changed, 30 insertions(+), 29 deletions(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 2326044d5294..c63016cc4b27 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -79,7 +79,7 @@ struct neigh_parms {
struct rcu_head rcu_head;
  
  	int	reachable_time;

-   int qlen;
+   u32 qlen;
int data[NEIGH_VAR_DATA_MAX];
DECLARE_BITMAP(data_state, NEIGH_VAR_DATA_MAX);
  };
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index d5b28c158144..caed48f39896 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -223,7 +223,31 @@ static int neigh_del_timer(struct neighbour *n)
return 0;
  }
  
-static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)

+static struct neigh_parms *neigh_get_dev_parms_rcu(struct net_device *dev,
+  int family)
+{
+   switch (family) {
+   case AF_INET:
+   return __in_dev_arp_parms_get_rcu(dev);
+   case AF_INET6:
+   return __in6_dev_nd_parms_get_rcu(dev);
+   }
+   return NULL;
+}
+
+static void neigh_parms_qlen_dec(struct net_device *dev, int family)
+{
+   struct neigh_parms *p;
+
+   rcu_read_lock();
+   p = neigh_get_dev_parms_rcu(dev, family);
+   if (p)
+   p->qlen--;
+   rcu_read_unlock();
+}
+
+static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net,
+  int family)
  {
struct sk_buff_head tmp;
unsigned long flags;
@@ -237,13 +261,7 @@ static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)
struct net_device *dev = skb->dev;
  
  		if (net == NULL || net_eq(dev_net(dev), net)) {

-   struct in_device *in_dev;
-
-   rcu_read_lock();
-   in_dev = __in_dev_get_rcu(dev);
-   if (in_dev)
-   in_dev->arp_parms->qlen--;
-   rcu_read_unlock();
+   neigh_parms_qlen_dec(dev, family);
__skb_unlink(skb, list);
__skb_queue_tail(, skb);
}
@@ -321,7 +339,7 @@ int neigh_ifdown(struct neigh_table *tbl, struct net_device *dev)
neigh_flush_dev(tbl, dev);
pneigh_ifdown(tbl, dev);
 	write_unlock_bh(&tbl->lock);
-	pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev));
+   pneigh_que

[Devel] [PATCH v3 VZ7] neighbour: purge nf_bridged skb from foreign device neigh

2024-01-07 Thread Pavel Tikhomirov
An skb can be added to a neigh->arp_queue while waiting for an arp
reply, where the original skb's skb->dev can be different from the
neigh's neigh->dev. For instance, in the case of bridging a dnated skb
from one veth to another, the skb would be added to the neigh->arp_queue
of the bridge.

There is no explicit mechanism that prevents the original skb->dev link
of such skb from being freed under us. For instance neigh_flush_dev does
not cleanup skbs from different device's neigh queue. But that original
link can be used and lead to crash on e.g. this stack:

arp_process
  neigh_update
skb = __skb_dequeue(>arp_queue)
  neigh_resolve_output(..., skb)
...
  br_nf_dev_xmit
br_nf_pre_routing_finish_bridge_slow
  skb->dev = nf_bridge->physindev
  br_handle_frame_finish

So let's improve neigh_flush_dev to also purge skbs when a device
equal to their skb->nf_bridge->physindev gets destroyed.

https://virtuozzo.atlassian.net/browse/PSBM-151735
Signed-off-by: Pavel Tikhomirov 
---
v2: take neigh->lock for queue modification
v3: put skb_peak under lock
---
 net/core/neighbour.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 7f197a63c780f..4f0ee6c04f1c2 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -43,6 +43,10 @@
 #include 
 #include 
 
+#include 
+#include 
+#include 
+
 #define DEBUG
 #define NEIGH_DEBUG 1
 #define neigh_dbg(level, fmt, ...) \
@@ -257,6 +261,28 @@ static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)
}
 }
 
+static void neigh_purge_nf_bridge_dev(struct neighbour *neigh, struct net_device *dev)
+{
+	struct sk_buff_head *list = &neigh->arp_queue;
+   struct nf_bridge_info *nf_bridge;
+   struct sk_buff *skb, *next;
+
+	write_lock(&neigh->lock);
+   skb = skb_peek(list);
+   while (skb) {
+   nf_bridge = nf_bridge_info_get(skb);
+
+   next = skb_peek_next(skb, list);
+   if (nf_bridge && nf_bridge->physindev == dev) {
+   __skb_unlink(skb, list);
+   neigh->arp_queue_len_bytes -= skb->truesize;
+   kfree_skb(skb);
+   }
+   skb = next;
+   }
+	write_unlock(&neigh->lock);
+}
+
 static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev)
 {
int i;
@@ -272,6 +298,7 @@ static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev)
while ((n = rcu_dereference_protected(*np,
 					lockdep_is_held(&tbl->lock))) != NULL) {
if (dev && n->dev != dev) {
+   neigh_purge_nf_bridge_dev(n, dev);
 			np = &n->next;
continue;
}
-- 
2.43.0



[Devel] [PATCH v2 VZ7] neighbour: purge nf_bridged skb from foreign device neigh

2024-01-07 Thread Pavel Tikhomirov
An skb can be added to a neigh->arp_queue while waiting for an arp
reply, where the original skb's skb->dev can be different from the
neigh's neigh->dev. For instance, in the case of bridging a dnated skb
from one veth to another, the skb would be added to the neigh->arp_queue
of the bridge.

There is no explicit mechanism that prevents the original skb->dev link
of such skb from being freed under us. For instance neigh_flush_dev does
not cleanup skbs from different device's neigh queue. But that original
link can be used and lead to crash on e.g. this stack:

arp_process
  neigh_update
skb = __skb_dequeue(>arp_queue)
  neigh_resolve_output(..., skb)
...
  br_nf_dev_xmit
br_nf_pre_routing_finish_bridge_slow
  skb->dev = nf_bridge->physindev
  br_handle_frame_finish

So let's improve neigh_flush_dev to also purge skbs when a device
equal to their skb->nf_bridge->physindev gets destroyed.

https://virtuozzo.atlassian.net/browse/PSBM-151735
Signed-off-by: Pavel Tikhomirov 
---
v2: take neigh->lock for queue modification
---
 net/core/neighbour.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 7f197a63c780f..2aa1c23a7f44d 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -43,6 +43,10 @@
 #include 
 #include 
 
+#include 
+#include 
+#include 
+
 #define DEBUG
 #define NEIGH_DEBUG 1
 #define neigh_dbg(level, fmt, ...) \
@@ -257,6 +261,27 @@ static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)
}
 }
 
+static void neigh_purge_nf_bridge_dev(struct neighbour *neigh, struct net_device *dev)
+{
+	struct sk_buff_head *list = &neigh->arp_queue;
+   struct sk_buff *skb = skb_peek(list), *next;
+   struct nf_bridge_info *nf_bridge;
+
+	write_lock(&neigh->lock);
+   while (skb) {
+   nf_bridge = nf_bridge_info_get(skb);
+
+   next = skb_peek_next(skb, list);
+   if (nf_bridge && nf_bridge->physindev == dev) {
+   __skb_unlink(skb, list);
+   neigh->arp_queue_len_bytes -= skb->truesize;
+   kfree_skb(skb);
+   }
+   skb = next;
+   }
+	write_unlock(&neigh->lock);
+}
+
 static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev)
 {
int i;
@@ -272,6 +297,7 @@ static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev)
while ((n = rcu_dereference_protected(*np,
 					lockdep_is_held(&tbl->lock))) != NULL) {
if (dev && n->dev != dev) {
+   neigh_purge_nf_bridge_dev(n, dev);
 			np = &n->next;
continue;
}
-- 
2.43.0



[Devel] [PATCH RH7] neighbour: purge nf_bridged skb from foreign device neigh

2024-01-04 Thread Pavel Tikhomirov
An skb can be added to a neigh->arp_queue while waiting for an arp
reply, where the original skb's skb->dev can be different from the
neigh's neigh->dev. For instance, in the case of bridging a dnated skb
from one veth to another, the skb would be added to the neigh->arp_queue
of the bridge.

There is no explicit mechanism that prevents the original skb->dev link
of such skb from being freed under us. For instance neigh_flush_dev does
not cleanup skbs from different device's neigh queue. But that original
link can be used and lead to crash on e.g. this stack:

arp_process
  neigh_update
skb = __skb_dequeue(>arp_queue)
  neigh_resolve_output(..., skb)
...
  br_nf_dev_xmit
br_nf_pre_routing_finish_bridge_slow
  skb->dev = nf_bridge->physindev
  br_handle_frame_finish

So let's improve neigh_flush_dev to also purge skbs when a device
equal to their skb->nf_bridge->physindev gets destroyed.

https://virtuozzo.atlassian.net/browse/PSBM-151735
Signed-off-by: Pavel Tikhomirov 
---
 net/core/neighbour.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 7f197a63c780f..e2e35d060ee0a 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -43,6 +43,10 @@
 #include 
 #include 
 
+#include 
+#include 
+#include 
+
 #define DEBUG
 #define NEIGH_DEBUG 1
 #define neigh_dbg(level, fmt, ...) \
@@ -257,6 +261,25 @@ static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)
}
 }
 
+static void neigh_purge_nf_bridge_dev(struct neighbour *neigh, struct net_device *dev)
+{
+	struct sk_buff_head *list = &neigh->arp_queue;
+   struct sk_buff *skb = skb_peek(list), *next;
+   struct nf_bridge_info *nf_bridge;
+
+   while (skb) {
+   nf_bridge = nf_bridge_info_get(skb);
+
+   next = skb_peek_next(skb, list);
+   if (nf_bridge && nf_bridge->physindev == dev) {
+   __skb_unlink(skb, list);
+   neigh->arp_queue_len_bytes -= skb->truesize;
+   kfree_skb(skb);
+   }
+   skb = next;
+   }
+}
+
 static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev)
 {
int i;
@@ -272,6 +295,7 @@ static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev)
while ((n = rcu_dereference_protected(*np,
 					lockdep_is_held(&tbl->lock))) != NULL) {
if (dev && n->dev != dev) {
+   neigh_purge_nf_bridge_dev(n, dev);
 			np = &n->next;
continue;
}
-- 
2.43.0



Re: [Devel] [PATCH] mm: Drop swap_cache_info reporting in vzstat

2024-01-01 Thread Pavel Tikhomirov

Looks good.

Reviewed-by: Pavel Tikhomirov 

While at it, can we also remove swap_cache_info_struct in 
include/linux/vzstat.h, as it looks unused?


On 29/12/2023 01:15, Konstantin Khorenko wrote:

Mainstream has dropped swap_cache_info statistics:
  442701e7058b ("mm/swap: remove swap_cache_info statistics")

So we are dropping reporting it via /proc/vz/stats interface.

We could leave the format of the /proc/vz/stats file the same (it is an
interface after all and should be stable), but since vz9 will have so many
changes, the vzstat utility also has to be rewritten, so it's a good time
to drop the junk code rather than preserve the interface file format.

Of course we bump the format version to 2.7.

This patch:
  * fully reverts 44a915de19aa ("mm: Export swap_cache_info struct and
variable")
  * removes couple of hunks in 5952b9c2f339 ("vzstat: Add vzstat module and
kstat interfaces") - related to swap_cache_info

https://virtuozzo.atlassian.net/browse/PSBM-152466

Signed-off-by: Konstantin Khorenko 

Feature: ve: extra statistics (mm, latency)
---
  include/linux/swap.h   |  9 -
  include/linux/vzstat.h |  7 ---
  kernel/ve/vzstat.c | 16 +---
  mm/swap_state.c|  3 ---
  4 files changed, 1 insertion(+), 34 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 9348574b5bb7..1af5b711bc00 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -715,14 +715,5 @@ static inline bool mem_cgroup_swap_full(struct page *page)
  }
  #endif
  
-struct swap_cache_info {
-   unsigned long add_total;
-   unsigned long del_total;
-   unsigned long find_success;
-   unsigned long find_total;
-};
-
-extern struct swap_cache_info swap_cache_info;
-
  #endif /* __KERNEL__*/
  #endif /* _LINUX_SWAP_H */
diff --git a/include/linux/vzstat.h b/include/linux/vzstat.h
index 505487117ca4..4a5738bd036f 100644
--- a/include/linux/vzstat.h
+++ b/include/linux/vzstat.h
@@ -14,13 +14,6 @@
  #include 
  #include 
  
-struct swap_cache_info_struct {
-   unsigned long add_total;
-   unsigned long del_total;
-   unsigned long find_success;
-   unsigned long find_total;
-};
-
  struct kstat_zone_avg {
unsigned long   free_pages_avg[3],
nr_active_avg[3],
diff --git a/kernel/ve/vzstat.c b/kernel/ve/vzstat.c
index 7ffd5ec7f2a3..73063a4189d7 100644
--- a/kernel/ve/vzstat.c
+++ b/kernel/ve/vzstat.c
@@ -27,7 +27,6 @@
  #include 
  #include 
  #include 
-#include 
  
  #include 
  
@@ -385,18 +384,6 @@ static void kernel_text_csum_seq_show(struct seq_file *m, void *v)

seq_printf(m, "kernel_text_csum_broken: %d\n", 0);
  }
  
-static void swap_cache_seq_show(struct seq_file *m, void *v)
-{
-   struct swap_cache_info *swpcache;
-
-   swpcache = &swap_cache_info;
-   seq_printf(m, "Swap cache: add %lu, del %lu, find %lu/%lu\n",
-   swpcache->add_total,
-   swpcache->del_total,
-   swpcache->find_success,
-   swpcache->find_total);
-}
-
  /*
   * Declare special structure to store summarized statistics. The 'struct zone'
   * is not used because of it's tremendous size.
@@ -668,14 +655,13 @@ static int stats_seq_show(struct seq_file *m, void *v)
  {
if (!v)
return 0;
-   seq_puts(m, "Version: 2.6\n");
+   seq_puts(m, "Version: 2.7\n");
cycles_per_jiffy_show(m, v);
jiffies_per_second_show(m, v);
seq_puts(m, "\nLoad info:\n");
task_counts_seq_show(m, v);
seq_puts(m, "\nMemory info:\n");
kernel_text_csum_seq_show(m, v);
-   swap_cache_seq_show(m, v);
mem_free_areas_show(m, v);
mem_avg_show(m, v);
mem_fails_show(m, v);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 034dd5d0c6d9..b16dc348bbd2 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -59,9 +59,6 @@ static bool enable_vma_readahead __read_mostly = true;
  #define GET_SWAP_RA_VAL(vma)  \
(atomic_long_read(&(vma)->swap_readahead_info) ? : 4)
  
-struct swap_cache_info swap_cache_info;
-EXPORT_SYMBOL_GPL(swap_cache_info);
-
  static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
  
  void show_swap_cache_info(void)


--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.


[Devel] [PATCH RH7] br_netfilter: add debug to see what happens in br_nf_pre_routing_finish

2023-12-22 Thread Pavel Tikhomirov
We see a packet with skb->dev already freed, which should have passed
this place in br_nf_pre_routing_finish, as it has nf_bridge->neigh_header
set to the src mac, which can only be set later in this stack in
br_nf_pre_routing_finish_bridge. So I print all the information I can
imagine might help in understanding this problem.

We see this skb taken from neigh->arp_queue and then crash on it, and
here:

  +-> br_nf_pre_routing
+-> br_nf_pre_routing_finish
  +-> skb->dev = nf_bridge->physindev #1
  +-> br_nf_pre_routing_finish_bridge #2
+-> nf_bridge->bridged_dnat = 1 #1
+-> neighbour->output #2
  +-> neigh_resolve_output
+-> neigh_event_send
  +-> __neigh_event_send
+-> __skb_queue_tail(&neigh->arp_queue, skb)

the skb can actually get to arp_queue. This is the second confirmation
that something wrong likely happened on this stack.

https://virtuozzo.atlassian.net/browse/PSBM-151735
Signed-off-by: Pavel Tikhomirov 
---
 net/bridge/br_netfilter_hooks.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index eb8ad039c343f..6258fe55a04eb 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -386,6 +386,17 @@ static int br_nf_pre_routing_finish(struct sock *sk, struct sk_buff *skb)
} else {
if (skb_dst(skb)->dev == dev) {
 bridged_dnat:
+   /*
+    * Debug for PSBM-151735, I'm really curious how we can get here.
+    */
+   WARN_ONCE(1, "br_nf_pre_routing_finish: saddr=%pI4 daddr=%pI4 nf_bridge_daddr=%pI4 "
+"dev=%s skb->dev=%s nf_bridge->indev=%s "
+"skb->dst->dev=%s "
+"err=%d indev_forward=%d\n",
+ &iph->saddr, &iph->daddr, &nf_bridge->ipv4_daddr,
+ dev->name, skb->dev->name, nf_bridge->physindev ? nf_bridge->physindev->name : "",
+ skb_dst(skb) && skb_dst(skb)->dev ? skb_dst(skb)->dev->name : "",
+ err, IN_DEV_FORWARD(__in_dev_get_rcu(dev)));
skb->dev = nf_bridge->physindev;
nf_bridge_update_protocol(skb);
nf_bridge_push_encap_header(skb);
-- 
2.43.0



Re: [Devel] [PATCH vz9] sched: Do not set LBF_NEED_BREAK flag if scanned all the tasks

2023-12-17 Thread Pavel Tikhomirov



On 16/12/2023 01:21, Konstantin Khorenko wrote:

On 15.12.2023 12:19, Pavel Tikhomirov wrote:



On 14/12/2023 06:01, Konstantin Khorenko wrote:

After ms commit b0defa7ae03e ("sched/fair: Make sure to try to detach at
least one movable task") detach_tasks() does not stop on the condition
(env->loop > env->loop_max) in case no movable task found.

Instead of that (if there are no movable tasks in the rq) exits always
happen on the loop_break check - thus with LBF_NEED_BREAK flag set.

It's not a problem for mainstream because load_balance() proceeds with
balancing in case LBF_NEED_BREAK is set only in case (env.loop <
busiest->nr_running), but in Virtuozzo kernel with CFS_CPULIMIT feature
right before that we try to move a whole task group (object of the
CFS_CPULIMIT feature) and resets env.loop to zero,
so we get a livelock here.

Resetting env.loop makes sense in case we have successfully moved some
tasks (in the scope of the task group), but if we failed to move any
task, no progress is expected during further balancing attempts.

Ways to fix it:
   1. In load_balance() restore old env.loop in case no tasks were moved
  by move_task_groups()
   2. Add a check in detach_tasks() to exit without LBF_NEED_BREAK 
flag in

  case we have scanned all tasks and have not found movable tasks.

Current patch implements the second way.

Caused by ms commit: b0defa7ae03e ("sched/fair: Make sure to try to
detach at least one movable task")

Fixes ms commit: bca010328248 ("sched: Port CONFIG_CFS_CPULIMIT 
feature")


this is not "ms" commit





Reviewed-by: Pavel Tikhomirov 


Signed-off-by: Konstantin Khorenko 

Feature: sched: ability to limit number of CPUs available to a CT
---
   kernel/sched/fair.c | 6 ++
   1 file changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9367d16a8d85..e068bb90f197 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8771,6 +8771,12 @@ static int detach_tasks(struct lb_env *env)
   if (env->loop > env->loop_max &&
   !(env->flags & LBF_ALL_PINNED))
   break;
+    /*
+ * Quit if we have scanned all tasks even in case we haven't
+ * found any movable task.
+ */
+    if (env->loop > env->src_rq->nr_running)
+    break;


The env->loop is clearly designed in load_balance function to be reused
on next iteration of "goto more_balance". The fact that we clear it to
zero in our code is really strange... Maybe we can have separate
env.vzloop variable for our part of the code (or we can properly
integrate with env.loop without zeroing it?)?


Well, in fact load_balance() contains 2 other places where env.loop is
reset to 0 as well and the execution is sent to "more_balance" after that.


Agree.



In general i understand the idea of resetting env.loop in our case as well:
we have moved the task group (our feature) and after that we would like to
balance the load again - so resetting loop counter looks logic.
The other question is that we reset env.loop even in case we have not 
really
moved any task group and thus restarting the balancing process is 
incorrect.


i thought about checking the result of move_task_groups() and if it's zero,
just restore the env.loop counter to the value before moving the task 
group attempt.
May be it is worth to do, but i just decided to resurrect the previous 
behavior -

thus do not set LBF_NEED_BREAK flag if we reach the end of the run queue.



Do we also need same crazy LBF_ALL_PINNED change to move_task_groups's
"if (env->loop > env->loop_max) break;" check? Similar to what
b0defa7ae03ec ("sched/fair: Make sure to try to detach at least one
movable task") does in detach_tasks to allow it to handle the case then
lru tail is pinned?


And again, i thought about it.
move_task_groups() increases env.loop only in case it finds a task to move,
so in case some tasks are pinned, env.loop is not increased and we will 
go further.


Agree.



So in move_task_groups() we exit either
1. we scanned all groups (and did not find enough tasks to move to fix 
the imbalance)

2. we scanned some groups and found/moved tasks > env.loop_max
3. we scanned some groups, found/moved some tasks and their number was 
enough to fix the imbalance


Seems reasonable. Thanks for explanation!




   /* take a breather every nr_migrate tasks */
   if (env->loop > env->loop_break) {




--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.


Re: [Devel] [PATCH vz9] sched: Do not set LBF_NEED_BREAK flag if scanned all the tasks

2023-12-15 Thread Pavel Tikhomirov




On 14/12/2023 06:01, Konstantin Khorenko wrote:

After ms commit b0defa7ae03e ("sched/fair: Make sure to try to detach at
least one movable task") detach_tasks() does not stop on the condition
(env->loop > env->loop_max) in case no movable task found.

Instead of that (if there are no movable tasks in the rq) exits always
happen on the loop_break check - thus with LBF_NEED_BREAK flag set.

It's not a problem for mainstream because load_balance() proceeds with
balancing in case LBF_NEED_BREAK is set only in case (env.loop <
busiest->nr_running), but in Virtuozzo kernel with CFS_CPULIMIT feature
right before that we try to move a whole task group (object of the
CFS_CPULIMIT feature) and resets env.loop to zero,
so we get a livelock here.

Resetting env.loop makes sense in case we have successfully moved some
tasks (in the scope of the task group), but if we failed to move any
task, no progress is expected during further balancing attempts.

Ways to fix it:
  1. In load_balance() restore old env.loop in case no tasks were moved
 by move_task_groups()
  2. Add a check in detach_tasks() to exit without LBF_NEED_BREAK flag in
 case we have scanned all tasks and have not found movable tasks.

Current patch implements the second way.

Caused by ms commit: b0defa7ae03e ("sched/fair: Make sure to try to
detach at least one movable task")

Fixes ms commit: bca010328248 ("sched: Port CONFIG_CFS_CPULIMIT feature")


this is not "ms" commit



Signed-off-by: Konstantin Khorenko 

Feature: sched: ability to limit number of CPUs available to a CT
---
  kernel/sched/fair.c | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9367d16a8d85..e068bb90f197 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8771,6 +8771,12 @@ static int detach_tasks(struct lb_env *env)
if (env->loop > env->loop_max &&
!(env->flags & LBF_ALL_PINNED))
break;
+   /*
+* Quit if we have scanned all tasks even in case we haven't
+* found any movable task.
+*/
+   if (env->loop > env->src_rq->nr_running)
+   break;


The env->loop is clearly designed in load_balance function to be reused 
on next iteration of "goto more_balance". The fact that we clear it to 
zero in our code is really strange... Maybe we can have separate 
env.vzloop variable for our part of the code (or we can properly 
integrate with env.loop without zeroing it?)?


Do we also need same crazy LBF_ALL_PINNED change to move_task_groups's 
"if (env->loop > env->loop_max) break;" check? Similar to what 
b0defa7ae03ec ("sched/fair: Make sure to try to detach at least one 
movable task") does in detach_tasks to allow it to handle the case then 
lru tail is pinned?


  
  		/* take a breather every nr_migrate tasks */

if (env->loop > env->loop_break) {


--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.


[Devel] [PATCH RH7] x86/kprobes: Fix kprobes instruction boundary check with CONFIG_RETHUNK

2023-12-07 Thread Pavel Tikhomirov
From: "Masami Hiramatsu (Google)" 

Since the CONFIG_RETHUNK and CONFIG_SLS will use INT3 for stopping
speculative execution after RET instruction, kprobes always fails to
check the probed instruction boundary by decoding the function body if
the probed address is after such sequence. (Note that some conditional
code blocks will be placed after function return, if compiler decides
it is not on the hot path.)

This is because kprobes expects that kgdb has put the INT3 as a software
breakpoint which replaces the original instruction. But these INT3s
serve no such purpose; there is no original instruction to recover.

To avoid this issue, kprobes checks whether the INT3 is owned by
kgdb or not, and if so, stop decoding and make it fail. The other
INT3 will come from CONFIG_RETHUNK/CONFIG_SLS and those can be
treated as a one-byte instruction.

Fixes: e463a09af2f0 ("x86: Add straight-line-speculation mitigation")
Suggested-by: Peter Zijlstra 
Signed-off-by: Masami Hiramatsu (Google) 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: sta...@vger.kernel.org
Link: 
https://lore.kernel.org/r/167146051026.1374301.392728975473572291.stgit@devnote3

Changes:
- s/INT3_INSN_OPCODE/BREAKPOINT_INSTRUCTION/
- include kgdb_has_hit_break implementation

If a function return (int3 instruction) gets compiled into the middle of
the function body, we fail to insert a kprobe at any point after it. To
be able to insert kprobes everywhere in the function code we need this
patch.

(cherry picked from commit 1993bf97992df2d560287f3c4120eda57426843d)
Signed-off-by: Pavel Tikhomirov 
---
 arch/x86/kernel/kprobes/core.c | 10 +++---
 include/linux/kgdb.h   |  1 +
 kernel/debug/debug_core.c  | 12 
 3 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 2be0dde130f14..9dc7f48e84ea9 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -48,6 +48,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -300,12 +301,15 @@ static int __kprobes can_probe(unsigned long paddr)
kernel_insn_init(, (void *)__addr, MAX_INSN_SIZE);
insn_get_length(&insn);
 
+#ifdef CONFIG_KGDB
/*
-* Another debugging subsystem might insert this breakpoint.
-* In that case, we can't recover it.
+* If there is a dynamically installed kgdb sw breakpoint,
+* this function should not be probed.
 */
-   if (insn.opcode.bytes[0] == BREAKPOINT_INSTRUCTION)
+   if (insn.opcode.bytes[0] == BREAKPOINT_INSTRUCTION &&
+   kgdb_has_hit_break(addr))
return 0;
+#endif
addr += insn.length;
}
 
diff --git a/include/linux/kgdb.h b/include/linux/kgdb.h
index 6b06d378f3dfe..47a8fd917b3b5 100644
--- a/include/linux/kgdb.h
+++ b/include/linux/kgdb.h
@@ -305,6 +305,7 @@ extern int kgdb_hex2mem(char *buf, char *mem, int count);
 
 extern int kgdb_isremovedbreak(unsigned long addr);
 extern void kgdb_schedule_breakpoint(void);
+extern int kgdb_has_hit_break(unsigned long addr);
 
 extern int
 kgdb_handle_exception(int ex_vector, int signo, int err_code,
diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
index 8865caec45fbc..e5872e38fae51 100644
--- a/kernel/debug/debug_core.c
+++ b/kernel/debug/debug_core.c
@@ -356,6 +356,18 @@ int kgdb_isremovedbreak(unsigned long addr)
return 0;
 }
 
+int kgdb_has_hit_break(unsigned long addr)
+{
+   int i;
+
+   for (i = 0; i < KGDB_MAX_BREAKPOINTS; i++) {
+   if (kgdb_break[i].state == BP_ACTIVE &&
+   kgdb_break[i].bpt_addr == addr)
+   return 1;
+   }
+   return 0;
+}
+
 int dbg_remove_all_break(void)
 {
int error;
-- 
2.43.0



[Devel] [PATCH RH7] mm/memcontrol: prohibit writing to memory.numa_migrate from container

2023-12-05 Thread Pavel Tikhomirov
We might want to put containers on designated numa nodes for optimal
performance; it would all be ruined if a container could force its memory
pages to move to any node it wants.

The memory.numa_migrate file was originally made for vcmmd, which works
from ve0, so we should be fine with this additional restriction.

Fixes: dfc0b63bfd50c ("mm: memcontrol: add memory.numa_migrate file")
https://virtuozzo.atlassian.net/browse/PSBM-152372
Signed-off-by: Pavel Tikhomirov 

Feature: mm: interface to migrate memory between NUMA nodes upon userspace 
request
---
 mm/memcontrol.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a2fe48d93d02a..5d65b523a0ec8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5812,8 +5812,8 @@ static int memcg_numa_migrate_pages(struct mem_cgroup *memcg,
  *
  * The call may be interrupted by a signal, in which case -EINTR is returned.
  */
-static int memcg_numa_migrate_write(struct cgroup *cont,
-   struct cftype *cft, const char *buf)
+static int __memcg_numa_migrate_write(struct cgroup *cont, struct cftype *cft,
+ const char *buf)
 {
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
NODEMASK_ALLOC(nodemask_t, target_nodes, GFP_KERNEL);
@@ -5851,6 +5851,14 @@ static int memcg_numa_migrate_write(struct cgroup *cont,
return ret;
 }
 
+static int memcg_numa_migrate_write(struct cgroup *cont, struct cftype *cft,
+   const char *buf)
+{
+   if (!ve_is_super(get_exec_env()))
+   return -EPERM;
+
+   return __memcg_numa_migrate_write(cont, cft, buf);
+}
+
 #endif /* CONFIG_NUMA */
 
 static inline void mem_cgroup_lru_names_not_uptodate(void)
-- 
2.43.0



[Devel] [PATCH RH9] mm/memcontrol: prohibit writing to memory.numa_migrate from container

2023-12-05 Thread Pavel Tikhomirov
We might want to put containers on designated numa nodes for optimal
performance; it would all be ruined if a container could force its memory
pages to move to any node it wants.

The memory.numa_migrate file was originally made for vcmmd, which works
from ve0, so we should be fine with this additional restriction.

Fixes: dfc0b63bfd50c ("mm: memcontrol: add memory.numa_migrate file")
https://virtuozzo.atlassian.net/browse/PSBM-152372
Signed-off-by: Pavel Tikhomirov 

Feature: mm: interface to migrate memory between NUMA nodes upon userspace 
request
---
 mm/memcontrol.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fd706aa1bbbd1..b7c6ee09ab9fc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4700,8 +4700,8 @@ static int memcg_numa_migrate_pages(struct mem_cgroup *memcg,
  *
  * The call may be interrupted by a signal, in which case -EINTR is returned.
  */
-static ssize_t memcg_numa_migrate_write(struct kernfs_open_file *of, char *buf,
-   size_t nbytes, loff_t off)
+static ssize_t __memcg_numa_migrate_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
 {
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
NODEMASK_ALLOC(nodemask_t, target_nodes, GFP_KERNEL);
@@ -4739,6 +4739,15 @@ static ssize_t memcg_numa_migrate_write(struct kernfs_open_file *of, char *buf,
return ret ?: nbytes;
 }
 
+static ssize_t memcg_numa_migrate_write(struct kernfs_open_file *of, char *buf,
+   size_t nbytes, loff_t off)
+{
+   if (!ve_is_super(get_exec_env()))
+   return -EPERM;
+
+   return __memcg_numa_migrate_write(of, buf, nbytes, off);
+}
+
 #endif /* CONFIG_NUMA */
 
 static const unsigned int memcg1_stats[] = {
-- 
2.43.0



[Devel] [PATCH RH7 1/3] netfilter: nf_tables: use call_rcu in netlink dumps

2023-11-01 Thread Pavel Tikhomirov
From: Florian Westphal 

We can make all dumps and lookups lockless.

Dumps currently only hold the nfnl mutex on the dump request itself.
Dumps can span multiple syscalls, dump continuation doesn't acquire the
nfnl mutex anywhere, i.e. the dump callbacks in nf_tables already use
rcu and never rely on nfnl mutex being held.

So, just switch all dumpers to rcu.

This requires taking a module reference before dropping the rcu lock
so rmmod is blocked, we also need to hold module reference over
the entire dump operation sequence. netlink already supports this
via the .module member in the netlink_dump_control struct.

For the non-dump case (i.e. lookup of a specific tables, chains, etc),
we need to switch to the _rcu list iteration primitives and make sure we
use GFP_ATOMIC.

This patch also adds the new nft_netlink_dump_start_rcu() helper that
takes care of the get-ref, drop-rcu-lock, start-dump, get-rcu-lock,
put-ref sequence.

The helper will be reused for all dumps.

Rationale in all dump requests is:

 - use the nft_netlink_dump_start_rcu helper added in first patch
 - use GFP_ATOMIC and rcu list iteration
 - switch to .call_rcu

... thus making all dumps in nf_tables not depend on the
nfnl mutex anymore.

In the nf_tables_getgen: This callback just fetches the current base
sequence, there is no need to serialize this with nfnl nft mutex.

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 

(cherry picked from commit d9adf22a2918832ac94335fddf1950b12543d0b5)
Changes:
- s/&net->nft.tables/&afi->tables/
- drop most of "elements" related hunks
- drop "obj" related hunks
- drop "flowtable" related hunks
- employ rcu in nf_tables_afinfo_lookup when needed

https://virtuozzo.atlassian.net/browse/PSBM-150147
Signed-off-by: Pavel Tikhomirov 
---
 net/netfilter/nf_tables_api.c | 135 ++
 1 file changed, 87 insertions(+), 48 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 8beed7bd2970..2a3680da9cd2 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -63,15 +63,17 @@ static struct nft_af_info *nft_afinfo_lookup(struct net *net, int family)
 {
struct nft_af_info *afi;
 
-   list_for_each_entry(afi, &net->nft.af_info, list) {
+   /* called either with rcu_read_lock or nfnl_lock held */
+   list_for_each_entry_rcu(afi, &net->nft.af_info, list) {
if (afi->family == family)
return afi;
}
+
return NULL;
 }
 
 static struct nft_af_info *
-nf_tables_afinfo_lookup(struct net *net, int family, bool autoload)
+nf_tables_afinfo_lookup(struct net *net, int family, bool autoload, bool rcu)
 {
struct nft_af_info *afi;
 
@@ -80,9 +82,17 @@ nf_tables_afinfo_lookup(struct net *net, int family, bool autoload)
return afi;
 #ifdef CONFIG_MODULES
if (autoload) {
-   nfnl_unlock(NFNL_SUBSYS_NFTABLES);
+   if (rcu) {
+   rcu_read_unlock();
+   } else {
+   nfnl_unlock(NFNL_SUBSYS_NFTABLES);
+   }
request_module("nft-afinfo-%u", family);
-   nfnl_lock(NFNL_SUBSYS_NFTABLES);
+   if (rcu) {
+   rcu_read_lock();
+   } else {
+   nfnl_lock(NFNL_SUBSYS_NFTABLES);
+   }
afi = nft_afinfo_lookup(net, family);
if (afi != NULL)
return ERR_PTR(-EAGAIN);
@@ -361,7 +371,7 @@ static struct nft_table *nft_table_lookup(const struct nft_af_info *afi,
 {
struct nft_table *table;
 
-   list_for_each_entry(table, &afi->tables, list) {
+   list_for_each_entry_rcu(table, &afi->tables, list) {
if (!nla_strcmp(nla, table->name) &&
nft_active_genmask(table, genmask))
return table;
@@ -540,6 +550,24 @@ static int nf_tables_dump_tables(struct sk_buff *skb,
return skb->len;
 }
 
+static int nft_netlink_dump_start_rcu(struct sock *nlsk, struct sk_buff *skb,
+ const struct nlmsghdr *nlh,
+ struct netlink_dump_control *c)
+{
+   int err;
+
+   if (!try_module_get(THIS_MODULE))
+   return -EINVAL;
+
+   rcu_read_unlock();
+   err = netlink_dump_start(nlsk, skb, nlh, c);
+   rcu_read_lock();
+   module_put(THIS_MODULE);
+
+   return err;
+}
+
+/* called with rcu_read_lock held */
 static int nf_tables_gettable(struct sock *nlsk, struct sk_buff *skb,
  const struct nlmsghdr *nlh,
  const struct nlattr * const nla[])
@@ -556,11 +584,13 @@ static int nf_tables_gettable(struct sock *nlsk, struct sk_buff *skb,
if (nlh->nlmsg_flags & NLM_F_DUMP) {
struct netlink_dump_control c = {

[Devel] [PATCH RH7 2/3] netfilter: nf_tables: fix oops during rule dump

2023-11-01 Thread Pavel Tikhomirov
From: Florian Westphal 

We can oops in nf_tables_fill_rule_info().

Its not possible to fetch previous element in rcu-protected lists
when deletions are not prevented somehow: list_del_rcu poisons
the ->prev pointer value.

Before rcu-conversion this was safe as dump operations did hold
nfnetlink mutex.

Pass previous rule as argument, obtained by keeping a pointer to
the previous rule during traversal.

Fixes: d9adf22a291883 ("netfilter: nf_tables: use call_rcu in netlink dumps")
Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 

(cherry picked from commit 2c82c7e724ff51cab78e1afd5c2aaa31994fe41e)
Changes:
- move hunks from __nf_tables_dump_rules to nf_tables_dump_rules

https://virtuozzo.atlassian.net/browse/PSBM-150147
Signed-off-by: Pavel Tikhomirov 
---
 net/netfilter/nf_tables_api.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 2a3680da9cd2..21b7b0f81f8f 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -1890,13 +1890,13 @@ static int nf_tables_fill_rule_info(struct sk_buff *skb, struct net *net,
u32 flags, int family,
const struct nft_table *table,
const struct nft_chain *chain,
-   const struct nft_rule *rule)
+   const struct nft_rule *rule,
+   const struct nft_rule *prule)
 {
struct nlmsghdr *nlh;
struct nfgenmsg *nfmsg;
const struct nft_expr *expr, *next;
struct nlattr *list;
-   const struct nft_rule *prule;
int type = event | NFNL_SUBSYS_NFTABLES << 8;
 
nlh = nlmsg_put(skb, portid, seq, type, sizeof(struct nfgenmsg),
@@ -1917,8 +1917,7 @@ static int nf_tables_fill_rule_info(struct sk_buff *skb, struct net *net,
 NFTA_RULE_PAD))
goto nla_put_failure;
 
-   if ((event != NFT_MSG_DELRULE) && (rule->list.prev != &chain->rules)) {
-   prule = list_entry(rule->list.prev, struct nft_rule, list);
+   if (event != NFT_MSG_DELRULE && prule) {
if (nla_put_be64(skb, NFTA_RULE_POSITION,
 cpu_to_be64(prule->handle),
 NFTA_RULE_PAD))
@@ -1967,7 +1966,7 @@ static int nf_tables_rule_notify(const struct nft_ctx *ctx,
 
err = nf_tables_fill_rule_info(skb, ctx->net, ctx->portid, ctx->seq,
   event, 0, ctx->afi->family, ctx->table,
-  ctx->chain, rule);
+  ctx->chain, rule, NULL);
if (err < 0) {
kfree_skb(skb);
goto err;
@@ -1996,7 +1995,7 @@ static int nf_tables_dump_rules(struct sk_buff *skb,
const struct nft_af_info *afi;
const struct nft_table *table;
const struct nft_chain *chain;
-   const struct nft_rule *rule;
+   const struct nft_rule *rule, *prule = NULL;
unsigned int idx = 0, s_idx = cb->args[0];
struct net *net = sock_net(skb->sk);
int family = nfmsg->nfgen_family;
@@ -2020,7 +2019,7 @@ static int nf_tables_dump_rules(struct sk_buff *skb,
 
list_for_each_entry_rcu(rule, &chain->rules, list) {
if (!nft_is_active(net, rule))
-   goto cont;
+   goto cont_skip;
if (idx < s_idx)
goto cont;
if (idx > s_idx)
@@ -2030,11 +2029,13 @@ static int nf_tables_dump_rules(struct sk_buff *skb,
cb->nlh->nlmsg_seq,
NFT_MSG_NEWRULE,
NLM_F_MULTI | NLM_F_APPEND,
-   afi->family, table, chain, rule) < 0)
+   afi->family, table, chain, rule, prule) < 0)
goto done;
 
nl_dump_check_consistent(cb, nlmsg_hdr(skb));
 cont:
+   prule = rule;
+cont_skip:
idx++;
}
}
@@ -2134,7 +2135,7 @@ static int nf_tables_getrule(struct sock *nlsk, struct sk_buff *skb,
 
err = nf_tables_fill_rule_inf

[Devel] [PATCH RH7 3/3] netfilter: nf_tables: use list_entry_rcu in nft_do_chain

2023-11-01 Thread Pavel Tikhomirov
We already use list_for_each_entry_continue_rcu two lines below, it is
thus logical to also use list_entry_rcu there.

https://virtuozzo.atlassian.net/browse/PSBM-150147
Signed-off-by: Pavel Tikhomirov 
---
 net/netfilter/nf_tables_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nf_tables_core.c b/net/netfilter/nf_tables_core.c
index 81ccbca32fa8..2fc814445af1 100644
--- a/net/netfilter/nf_tables_core.c
+++ b/net/netfilter/nf_tables_core.c
@@ -141,7 +141,7 @@ nft_do_chain(struct nft_pktinfo *pkt, const struct nf_hook_ops *ops)
nft_trace_init(&info, pkt, &regs.verdict, basechain);
 do_chain:
rulenum = 0;
-   rule = list_entry(&chain->rules, struct nft_rule, list);
+   rule = list_entry_rcu(&chain->rules, struct nft_rule, list);
 next_rule:
regs.verdict.code = NFT_CONTINUE;
list_for_each_entry_continue_rcu(rule, &chain->rules, list) {
-- 
2.41.0



[Devel] [PATCH RH7 0/3] netfilter: nf_tables: switch read path to rcu

2023-11-01 Thread Pavel Tikhomirov
We have a customer claiming that iptables-nft takes too long to list
rules from a container on big systems. So we remove the global nfnl_lock
from the read code paths and replace it with rcu to improve performance
for that case.

https://virtuozzo.atlassian.net/browse/PSBM-150147
Signed-off-by: Pavel Tikhomirov 

Florian Westphal (2):
  netfilter: nf_tables: use call_rcu in netlink dumps
  netfilter: nf_tables: fix oops during rule dump

Pavel Tikhomirov (1):
  netfilter: nf_tables: use list_entry_rcu in nft_do_chain

 net/netfilter/nf_tables_api.c  | 154 +
 net/netfilter/nf_tables_core.c |   2 +-
 2 files changed, 98 insertions(+), 58 deletions(-)

-- 
2.41.0



[Devel] [PATCH RH9 v2 3/5] mm/memcontrol: fix mem_cgroup_enough_memory on cgroup-v2

2023-09-05 Thread Pavel Tikhomirov
In cgroup-v2 memsw is not used anymore, instead we have swap and memory
separated from one another. So let's replace memsw to mem+swap.

https://jira.vzint.dev/browse/PSBM-149975
Signed-off-by: Pavel Tikhomirov 
---
v2: should use mem+swap instead of just mem here
---
 mm/memcontrol.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3a2099920756..ac327776144d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4283,7 +4283,12 @@ static int mem_cgroup_enough_memory(struct mem_cgroup *memcg, long pages)
long free;
 
/* unused memory */
-   free = memcg->memsw.max - page_counter_read(&memcg->memsw);
+   if (memcg->css.cgroup->root == &cgrp_dfl_root) {
+   free = memcg->memory.max - page_counter_read(&memcg->memory);
+   free += memcg->swap.max - page_counter_read(&memcg->swap);
+   } else {
+   free = memcg->memsw.max - page_counter_read(&memcg->memsw);
+   }
 
/* reclaimable slabs */
free += memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) >> PAGE_SHIFT;
-- 
2.41.0



[Devel] [PATCH RH9 v2 2/5] mm/memcontrol: fix available memory calculation in mem_cgroup_enough_memory

2023-09-05 Thread Pavel Tikhomirov
The idea of the mem_cgroup_enough_memory check is to return ENOMEM on
allocations when the container's memory cgroup is close to OOM.

In vz7 we have:
  free = memcg->memsw.limit - page_counter_read(&memcg->memsw);
In vz9:
  free = memcg->memsw.max - page_counter_read(&memcg->memory);

The error here is the memsw -> memory change in the rebase to vz8; after
it we do not care about swap usage at all. This allows allocations even
when all memory and swap are exhausted for a container, which is not
what we want.

https://jira.vzint.dev/browse/PSBM-149975
Fixes: 9417436f33d1 ("ve/mm: add heuristic check for memory overcommit")
Signed-off-by: Pavel Tikhomirov 
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9287eed944d5..3a2099920756 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4283,7 +4283,7 @@ static int mem_cgroup_enough_memory(struct mem_cgroup *memcg, long pages)
long free;
 
/* unused memory */
-	free = memcg->memsw.max - page_counter_read(&memcg->memory);
+	free = memcg->memsw.max - page_counter_read(&memcg->memsw);
 
/* reclaimable slabs */
free += memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) >> PAGE_SHIFT;
-- 
2.41.0



[Devel] [PATCH RH9 v2 5/5] cgroup: show and allow mounting of cgroup-v2 in containers

2023-09-05 Thread Pavel Tikhomirov
The change in is_virtualized_cgroot makes cgroup-v2 "virtualized" only
when it has enabled subsystems in it. And "virtualized" means that it
will be:
a) shown in cgroup-related proc files in containers,
b) allowed to be mounted in containers, and
c) required to have a separate root directory per container.

Note: we don't expect the cgroup-v2 root to change its enabled
controllers; it either has no controllers or has at least one all the
way.
Note: cgroup-v2 has one hierarchy for all controllers, so skipping misc
and debug for vzctl is not required anymore.

Add FS_VIRTUALIZED to cgroup2 to allow mounting in container.

https://jira.vzint.dev/browse/PSBM-149975
Signed-off-by: Pavel Tikhomirov 
---
 kernel/cgroup/cgroup.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 945130057a30..d902e675598d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2060,18 +2060,16 @@ struct ve_struct *get_curr_ve(void)
  */
 static inline bool is_virtualized_cgroot(struct cgroup_root *cgroot)
 {
-	/* Cgroup v2 */
-	if (cgroot == &cgrp_dfl_root)
-		return false;
-
+	if (cgroot != &cgrp_dfl_root) {
 #if IS_ENABLED(CONFIG_CGROUP_DEBUG)
-   if (cgroot->subsys_mask & (1 << debug_cgrp_id))
-   return false;
+   if (cgroot->subsys_mask & (1 << debug_cgrp_id))
+   return false;
 #endif
 #if IS_ENABLED(CONFIG_CGROUP_MISC)
-   if (cgroot->subsys_mask & (1 << misc_cgrp_id))
-   return false;
+   if (cgroot->subsys_mask & (1 << misc_cgrp_id))
+   return false;
 #endif
+   }
 
if (cgroot->subsys_mask)
return true;
@@ -2628,7 +2626,7 @@ static struct file_system_type cgroup2_fs_type = {
.init_fs_context= cgroup_init_fs_context,
.parameters = cgroup2_fs_parameters,
.kill_sb= cgroup_kill_sb,
-   .fs_flags   = FS_USERNS_MOUNT,
+   .fs_flags   = FS_USERNS_MOUNT | FS_VIRTUALIZED,
 };
 
 #ifdef CONFIG_CPUSETS
-- 
2.41.0



[Devel] [PATCH RH9 v2 1/5] cgroup: fix uninitialized use of ctx->root

2023-09-05 Thread Pavel Tikhomirov
In vfs_get_tree we explicitly check that fc->root is NULL before calling
the ->get_tree helper. So when mounting the cgroup2 filesystem, at the
beginning of cgroup_get_tree fc->root is uninitialized. We were lucky
that ve_hide_cgroups never dereferenced it on this code path, as
mounting cgroup2 from a container was prohibited, and from the host
ve_hide_cgroups does not dereference root.

But if we allow mounting the cgroup2 filesystem in a container, this use
of ctx->root in cgroup_get_tree will lead to a crash; let's fix it.

https://jira.vzint.dev/browse/PSBM-149975
Fixes: e8e4834b833c ("ve/cgroup: hide non-virtualized cgroups in container")
Signed-off-by: Pavel Tikhomirov 
---
 kernel/cgroup/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 3f685035076a..b0cf5cf66d20 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2543,7 +2543,7 @@ static int cgroup_get_tree(struct fs_context *fc)
struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
int ret;
 
-   if (ve_hide_cgroups(ctx->root))
+	if (ve_hide_cgroups(&cgrp_dfl_root))
return -EPERM;
 
cgrp_dfl_visible = true;
-- 
2.41.0



[Devel] [PATCH RH9 v2 4/5] cgroup: allow to write to cgroup.subtree_control in ct root cgroup

2023-09-05 Thread Pavel Tikhomirov
This way systemd in container on cgroup-v2 can enable the controllers it
requires.

https://jira.vzint.dev/browse/PSBM-149975
Signed-off-by: Pavel Tikhomirov 
---
 kernel/cgroup/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index b0cf5cf66d20..945130057a30 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5482,7 +5482,7 @@ static struct cftype cgroup_base_files[] = {
},
{
.name = "cgroup.subtree_control",
-   .flags = CFTYPE_NS_DELEGATABLE,
+   .flags = CFTYPE_NS_DELEGATABLE | CFTYPE_VE_WRITABLE,
.seq_show = cgroup_subtree_control_show,
.write = cgroup_subtree_control_write,
},
-- 
2.41.0



[Devel] [PATCH RH9 v2 0/5] enable cgroup-v2 in containers

2023-09-05 Thread Pavel Tikhomirov
To test this, one needs to boot with the systemd.unified_cgroup_hierarchy=1 option.

https://jira.vzint.dev/browse/PSBM-149975
Signed-off-by: Pavel Tikhomirov 

Pavel Tikhomirov (5):
  cgroup: fix uninitialized use of ctx->root
  mm/memcontrol: fix available memory calculation in
mem_cgroup_enough_memory
  mm/memcontrol: fix mem_cgroup_enough_memory on cgroup-v2
  cgroup: allow to write to cgroup.subtree_control in ct root cgroup
  cgroup: show and allow mounting of cgroup-v2 in containers

 kernel/cgroup/cgroup.c | 20 +---
 mm/memcontrol.c|  7 ++-
 2 files changed, 15 insertions(+), 12 deletions(-)

-- 
2.41.0



Re: [Devel] [PATCH vz7] cgroup_freezer: fix memory allocation context when checking freeze timeouts

2023-09-03 Thread Pavel Tikhomirov

Thanks for fixing this!

On 01.09.2023 20:44, Alexander Atanasov wrote:

check_freezer_timeout(...) is called under spin_lock_irq(...) from
update_if_frozen(...); the issue is that it does allocations in
GFP_KERNEL context. Fix that by using GFP_ATOMIC.

Fixes: aab9c2fafb6b ("cgroup_freezer: print information about unfreezable process")
https://jira.vzint.dev/browse/PSBM-150293


Reviewed-by: Pavel Tikhomirov 


Signed-off-by: Alexander Atanasov 
---
  kernel/cgroup_freezer.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index d4747ff98090..e5da9a63813d 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -269,7 +269,7 @@ static void check_freezer_timeout(struct cgroup *cgroup,
if (!__ratelimit(_timeout_rs))
return;
  
-	freezer_cg_name = kmalloc(PATH_MAX, GFP_KERNEL);

+   freezer_cg_name = kmalloc(PATH_MAX, GFP_ATOMIC);
if (!freezer_cg_name)
return;
  
@@ -283,7 +283,7 @@ static void check_freezer_timeout(struct cgroup *cgroup,

   freezer_cg_name, __freeze_timeout/HZ, tgid, task->comm);
  
  	entries = kmalloc(MAX_STACK_TRACE_DEPTH * sizeof(*entries),

- GFP_KERNEL);
+ GFP_ATOMIC);
if (!entries)
goto free_cg_name;
  


--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.


Re: [Devel] [PATCH v2 rh7] vznetstat: fix statistics output when getting stats for same device in parallel

2023-08-29 Thread Pavel Tikhomirov




On 29.08.2023 18:01, Konstantin Khorenko wrote:

Before this patch .get_stats() for venetdev was not protected against
racing. Documentation/networking/netdevices.txt states ndo_get_stats()
is to be synchronized by the dev_base_lock rwlock, but we only READ
stats, so readers can run in parallel.

At the same time the algorithm of gathering per-cpu stats into a single
value used a per-device structure, so in case .get_stats() is called in
parallel, stats could be corrupted.

Example, 2 cpu node:

cpu A   cpu B
.get_stats()
memset(dev->...->stats, 0)
stats.rx_bytes += cpuA.rx_bytes
.get_stats()
memset(dev->...->stats, 0)
stats.rx_bytes += cpuB.rx_bytes
return stats
stats.rx_bytes += cpuA.rx_bytes
stats.rx_bytes += cpuB.rx_bytes
return stats

=> the process running on cpu A will print rx_bytes taken only from
cpu B (so lower than expected), and the process running on cpu B will
print too-high numbers because cpu B's values are added twice.

Let's make venetdev provide ndo_get_stats64() callback which gets a
storage as an argument, thus does not suffer from a race of filling
a per-netdevice struct in parallel.

https://jira.vzint.dev/browse/PSBM-150027



Reviewed-by: Pavel Tikhomirov 


Signed-off-by: Konstantin Khorenko 
---
v2:
   * get rid of extra function venet_stats_one()
   * get rid of extra struct venet_device_stats and filling it

  drivers/net/venetdev.c | 20 
  include/linux/venet.h  |  1 -
  2 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/drivers/net/venetdev.c b/drivers/net/venetdev.c
index 3ca4f8a49a5c..9785b9ba7247 100644
--- a/drivers/net/venetdev.c
+++ b/drivers/net/venetdev.c
@@ -568,25 +568,21 @@ static int venet_xmit(struct sk_buff *skb, struct net_device *dev)
return 0;
  }
  
-static struct net_device_stats *get_stats(struct net_device *dev)

+static void venet_get_stats64(struct net_device *dev,
+ struct rtnl_link_stats64 *total)
  {
int i;
-   struct venet_stats *stats;
  
-	stats = (struct venet_stats *)dev->ml_priv;

-	memset(&stats->stats, 0, sizeof(struct net_device_stats));
for_each_possible_cpu(i) {
struct net_device_stats *dev_stats;
  
  		dev_stats = venet_stats(dev, i);

-   stats->stats.rx_bytes   += dev_stats->rx_bytes;
-   stats->stats.tx_bytes   += dev_stats->tx_bytes;
-   stats->stats.rx_packets += dev_stats->rx_packets;
-   stats->stats.tx_packets += dev_stats->tx_packets;
-   stats->stats.tx_dropped += dev_stats->tx_dropped;
+   total->rx_bytes += dev_stats->rx_bytes;
+   total->tx_bytes += dev_stats->tx_bytes;
+   total->rx_packets += dev_stats->rx_packets;
+   total->tx_packets += dev_stats->tx_packets;
+   total->tx_dropped += dev_stats->tx_dropped;
}
-
-	return &stats->stats;
  }
  
  /* Initialize the rest of the LOOPBACK device. */

@@ -712,7 +708,7 @@ static const struct ethtool_ops venet_ethtool_ops = {
  
  static const struct net_device_ops venet_netdev_ops = {

.ndo_start_xmit = venet_xmit,
-   .ndo_get_stats = get_stats,
+   .ndo_get_stats64 = venet_get_stats64,
.ndo_open = venet_open,
.ndo_stop = venet_close,
.ndo_init = venet_init_dev,
diff --git a/include/linux/venet.h b/include/linux/venet.h
index 51e0abeb03d7..f0a00832f6de 100644
--- a/include/linux/venet.h
+++ b/include/linux/venet.h
@@ -21,7 +21,6 @@
  struct ve_struct;
  struct venet_stat;
  struct venet_stats {
-   struct net_device_stats stats;
struct net_device_stats *real_stats;
  };
  


--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.


Re: [Devel] [PATCH rh7] vznetstat: fix statistics output when getting stats for same device in parallel

2023-08-28 Thread Pavel Tikhomirov




On 28.08.2023 23:54, Konstantin Khorenko wrote:

Before this patch .get_stats() for venetdev was not protected against
racing. Documentation/networking/netdevices.txt states ndo_get_stats()
is to be synchronized by the dev_base_lock rwlock, but we only READ
stats, so readers can run in parallel.

At the same time the algorithm of gathering per-cpu stats into a single
value used a per-device structure, so in case .get_stats() is called in
parallel, stats could be corrupted.

Example, 2 cpu node:

cpu A   cpu B
.get_stats()
memset(dev->...->stats, 0)
stats.rx_bytes += cpuA.rx_bytes
.get_stats()
memset(dev->...->stats, 0)
stats.rx_bytes += cpuB.rx_bytes
return stats
stats.rx_bytes += cpuA.rx_bytes
stats.rx_bytes += cpuB.rx_bytes
return stats

=> the process running on cpu A will print rx_bytes taken only from
cpu B (so lower than expected), and the process running on cpu B will
print too-high numbers because cpu B's values are added twice.

Let's rework the code to sum up per-cpu values into a local (on-stack)
structure. An extra 5 unsigned long variables seem fine to put on the
stack.

https://jira.vzint.dev/browse/PSBM-150027

Signed-off-by: Konstantin Khorenko 
---
  drivers/net/venetdev.c | 38 --
  include/linux/venet.h  | 10 +-
  2 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/drivers/net/venetdev.c b/drivers/net/venetdev.c
index 3ca4f8a49a5c..3ad67ee2f2b0 100644
--- a/drivers/net/venetdev.c
+++ b/drivers/net/venetdev.c
@@ -568,25 +568,35 @@ static int venet_xmit(struct sk_buff *skb, struct net_device *dev)
return 0;
  }
  
-static struct net_device_stats *get_stats(struct net_device *dev)

+static void venet_stats_one(struct venet_device_stats *result, struct net_device *dev)
  {
-   int i;
-   struct venet_stats *stats;
+   int cpu;
+
+   memset(result, 0, sizeof(struct venet_device_stats));
  
-	stats = (struct venet_stats *)dev->ml_priv;

-	memset(&stats->stats, 0, sizeof(struct net_device_stats));
-   for_each_possible_cpu(i) {
+   for_each_possible_cpu(cpu) {
struct net_device_stats *dev_stats;
  
-		dev_stats = venet_stats(dev, i);

-   stats->stats.rx_bytes   += dev_stats->rx_bytes;
-   stats->stats.tx_bytes   += dev_stats->tx_bytes;
-   stats->stats.rx_packets += dev_stats->rx_packets;
-   stats->stats.tx_packets += dev_stats->tx_packets;
-   stats->stats.tx_dropped += dev_stats->tx_dropped;
+   dev_stats = venet_stats(dev, cpu);
+   result->rx_bytes += dev_stats->rx_bytes;
+   result->tx_bytes += dev_stats->tx_bytes;
+   result->rx_packets += dev_stats->rx_packets;
+   result->tx_packets += dev_stats->tx_packets;
+   result->tx_dropped += dev_stats->tx_dropped;
}
+}
  
-	return &stats->stats;

+static void venet_get_stats64(struct net_device *dev,
+ struct rtnl_link_stats64 *total)
+{
+   struct venet_device_stats one;
+
+	venet_stats_one(&one, dev);


Having both venet_get_stats64 and venet_stats_one, and adding struct
venet_device_stats, seems excessive to me. Why can't we just fill
rtnl_link_stats64 directly?



+   total->rx_bytes = one.rx_bytes;
+   total->tx_bytes = one.tx_bytes;
+   total->rx_packets = one.rx_packets;
+   total->tx_packets = one.tx_packets;
+   total->tx_dropped = one.tx_dropped;
  }
  
  /* Initialize the rest of the LOOPBACK device. */

@@ -712,7 +722,7 @@ static const struct ethtool_ops venet_ethtool_ops = {
  
  static const struct net_device_ops venet_netdev_ops = {

.ndo_start_xmit = venet_xmit,
-   .ndo_get_stats = get_stats,
+   .ndo_get_stats64 = venet_get_stats64,
.ndo_open = venet_open,
.ndo_stop = venet_close,
.ndo_init = venet_init_dev,
diff --git a/include/linux/venet.h b/include/linux/venet.h
index 51e0abeb03d7..b54b0342a38a 100644
--- a/include/linux/venet.h
+++ b/include/linux/venet.h
@@ -21,10 +21,18 @@
  struct ve_struct;
  struct venet_stat;
  struct venet_stats {
-   struct net_device_stats stats;
struct net_device_stats *real_stats;
  };
  
+struct venet_device_stats {

+   unsigned long   rx_packets;
+   unsigned long   tx_packets;
+   unsigned long   rx_bytes;
+   unsigned long   tx_bytes;
+
+   unsigned long   tx_dropped;
+};
+
  struct ip_entry_struct
  {
struct ve_addr_struct   addr;


--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.


[Devel] [PATCH RH9 4/4] cgroup: show and allow mounting of cgroup-v2 in containers

2023-08-28 Thread Pavel Tikhomirov
The change in is_virtualized_cgroot makes cgroup-v2 "virtualized" only
when it has enabled subsystems in it. And "virtualized" means that it
will be:
a) shown in cgroup-related proc files in containers,
b) allowed to be mounted in containers, and
c) required to have a separate root directory per container.

Note: we don't expect the cgroup-v2 root to change its enabled
controllers; it either has no controllers or has at least one all the
way.
Note: cgroup-v2 has one hierarchy for all controllers, so skipping misc
and debug for vzctl is not required anymore.

Add FS_VIRTUALIZED to cgroup2 to allow mounting in container.

https://jira.vzint.dev/browse/PSBM-149975
Signed-off-by: Pavel Tikhomirov 
---
 kernel/cgroup/cgroup.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 945130057a30..d902e675598d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2060,18 +2060,16 @@ struct ve_struct *get_curr_ve(void)
  */
 static inline bool is_virtualized_cgroot(struct cgroup_root *cgroot)
 {
-	/* Cgroup v2 */
-	if (cgroot == &cgrp_dfl_root)
-		return false;
-
+	if (cgroot != &cgrp_dfl_root) {
 #if IS_ENABLED(CONFIG_CGROUP_DEBUG)
-   if (cgroot->subsys_mask & (1 << debug_cgrp_id))
-   return false;
+   if (cgroot->subsys_mask & (1 << debug_cgrp_id))
+   return false;
 #endif
 #if IS_ENABLED(CONFIG_CGROUP_MISC)
-   if (cgroot->subsys_mask & (1 << misc_cgrp_id))
-   return false;
+   if (cgroot->subsys_mask & (1 << misc_cgrp_id))
+   return false;
 #endif
+   }
 
if (cgroot->subsys_mask)
return true;
@@ -2628,7 +2626,7 @@ static struct file_system_type cgroup2_fs_type = {
.init_fs_context= cgroup_init_fs_context,
.parameters = cgroup2_fs_parameters,
.kill_sb= cgroup_kill_sb,
-   .fs_flags   = FS_USERNS_MOUNT,
+   .fs_flags   = FS_USERNS_MOUNT | FS_VIRTUALIZED,
 };
 
 #ifdef CONFIG_CPUSETS
-- 
2.41.0



[Devel] [PATCH RH9 3/4] cgroup: allow to write to cgroup.subtree_control in ct root cgroup

2023-08-28 Thread Pavel Tikhomirov
This way systemd in container on cgroup-v2 can enable the controllers it
requires.

https://jira.vzint.dev/browse/PSBM-149975
Signed-off-by: Pavel Tikhomirov 
---
 kernel/cgroup/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index b0cf5cf66d20..945130057a30 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5482,7 +5482,7 @@ static struct cftype cgroup_base_files[] = {
},
{
.name = "cgroup.subtree_control",
-   .flags = CFTYPE_NS_DELEGATABLE,
+   .flags = CFTYPE_NS_DELEGATABLE | CFTYPE_VE_WRITABLE,
.seq_show = cgroup_subtree_control_show,
.write = cgroup_subtree_control_write,
},
-- 
2.41.0



[Devel] [PATCH RH9 2/4] mm/memcontrol: fix mem_cgroup_enough_memory on cgroup-v2

2023-08-28 Thread Pavel Tikhomirov
In cgroup-v2 memsw is not used anymore; instead, swap and memory are
accounted separately from one another. So the memory limit should be
read from memory.max, not from memsw.max.

https://jira.vzint.dev/browse/PSBM-149975
Fixes: 9417436f33d1 ("ve/mm: add heuristic check for memory overcommit")
Signed-off-by: Pavel Tikhomirov 
---
 mm/memcontrol.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9287eed944d5..cd11065cbe0d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4283,7 +4283,10 @@ static int mem_cgroup_enough_memory(struct mem_cgroup *memcg, long pages)
long free;
 
/* unused memory */
-	free = memcg->memsw.max - page_counter_read(&memcg->memory);
+	if (memcg->css.cgroup->root == &cgrp_dfl_root)
+		free = memcg->memory.max - page_counter_read(&memcg->memory);
+	else
+		free = memcg->memsw.max - page_counter_read(&memcg->memory);
 
/* reclaimable slabs */
free += memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) >> PAGE_SHIFT;
-- 
2.41.0



[Devel] [PATCH RH9 1/4] cgroup: fix uninitialized use of ctx->root

2023-08-28 Thread Pavel Tikhomirov
In vfs_get_tree we explicitly check that fc->root is NULL before calling
the ->get_tree helper. So when mounting the cgroup2 filesystem, at the
beginning of cgroup_get_tree fc->root is uninitialized. We were lucky
that ve_hide_cgroups never dereferenced it on this code path, as
mounting cgroup2 from a container was prohibited, and from the host
ve_hide_cgroups does not dereference root.

But if we allow mounting the cgroup2 filesystem in a container, this use
of ctx->root in cgroup_get_tree will lead to a crash; let's fix it.

https://jira.vzint.dev/browse/PSBM-149975
Fixes: e8e4834b833c ("ve/cgroup: hide non-virtualized cgroups in container")
Signed-off-by: Pavel Tikhomirov 
---
 kernel/cgroup/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 3f685035076a..b0cf5cf66d20 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2543,7 +2543,7 @@ static int cgroup_get_tree(struct fs_context *fc)
struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
int ret;
 
-   if (ve_hide_cgroups(ctx->root))
+	if (ve_hide_cgroups(&cgrp_dfl_root))
return -EPERM;
 
cgrp_dfl_visible = true;
-- 
2.41.0



[Devel] [PATCH RH9 0/4] enable cgroup-v2 in containers

2023-08-28 Thread Pavel Tikhomirov
To test this, one needs to boot with the systemd.unified_cgroup_hierarchy=1 option.

https://jira.vzint.dev/browse/PSBM-149975
Signed-off-by: Pavel Tikhomirov 

Pavel Tikhomirov (4):
  cgroup: fix uninitialized use of ctx->root
  mm/memcontrol: fix mem_cgroup_enough_memory on cgroup-v2
  cgroup: allow to write to cgroup.subtree_control in ct root cgroup
  cgroup: show and allow mounting of cgroup-v2 in containers

 kernel/cgroup/cgroup.c | 20 +---
 mm/memcontrol.c|  5 -
 2 files changed, 13 insertions(+), 12 deletions(-)

-- 
2.41.0



Re: [Devel] [PATCH RH7 v2 0/2] fixes for memory cgroup id refcounting

2023-07-10 Thread Pavel Tikhomirov




On 11.07.2023 12:32, Pavel Tikhomirov wrote:

Complete memcg id refcount design:

- on cgroup online get css refcount and set memcg id refcount to 1
- on cgroup offline put memcg id refcount
- on swapout:
   - if cgroup memcg id refcount is 0 (not online), use parent cgroup
   - if cgroup memcg id refcount is >0, get it
   (previously in rh7 we got css refcount instead)
- on swapin put memcg id refcount
   (previously in rh7 we put css refcount instead)
- on charge move from one cgroup to another, put memcg id refcount on
the "from" cgroup and get it on the "to" cgroup
(previously in rh7 we put css refcount on "from" and got it on "to")
- on the last memcg id refcount put, also put the css refcount

Charge move is a bit tricky as we can move swap -> non-swap, thus we get
"to" memcg id refcount in mem_cgroup_move_charge_pte_range, and put
"from" memcg id refcount later in __mem_cgroup_clear_mc.


The part about "swap -> non-swap" is wrong, rewrite of this paragraph:

Charge move is a bit tricky: we get "to" memcg id refcount in
mem_cgroup_move_charge_pte_range, and put "from" memcg id refcount later
in __mem_cgroup_clear_mc.



So, in the first patch of this series we remove leftover memcg id
refcount put on memcg free path, and in the second patch we remove
leftover "to" css refcount get in task move code for moved swap.

https://jira.vzint.dev/browse/PSBM-148818
Signed-off-by: Pavel Tikhomirov 

Pavel Tikhomirov (2):
   mm/memcontrol: remove excess put of memory cgroup id refcount on css
 free
   mm/memcontrol: remove excess css_get on task migration with swap

  mm/memcontrol.c | 14 --
  1 file changed, 14 deletions(-)



--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.


Re: [Devel] [PATCH RH7] mm/memcontrol: remove excess put of memory cgroup id refcount on css free

2023-07-10 Thread Pavel Tikhomirov

please ignore, see v2

On 11.07.2023 11:47, Pavel Tikhomirov wrote:

The removed hunk comes from RHEL's partial porting of [1]: there they
added mem_cgroup_id_put on memcg free instead of offline. But they also
didn't do refcounting in mem_cgroup_id_put, only removed the id from the
idr.

Now we do actual refcounting in mem_cgroup_id_put, and the atomic_set(1)
and css_get in mem_cgroup_css_online are paired with mem_cgroup_id_put
in mem_cgroup_css_offline. The call in the free path is excess and we
should remove it.

This is not a critical excess refcount put, as it happens at the late
cgroup freeing stage, which can only be reached after this id refcount
reaches zero, which leads to css_put, which leads to freeing. So nobody
cares about the id refcount at this point.

https://jira.vzint.dev/browse/PSBM-148818

Fixes: 9dcef96ce3e2 ("ms/mm: memcontrol: fix cgroup creation failure after many small jobs") [1]
Signed-off-by: Pavel Tikhomirov 
---
  mm/memcontrol.c | 2 --
  1 file changed, 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3a5e89b913bd..c7c15c6dd6fb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6818,8 +6818,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
  
  	mem_cgroup_remove_from_trees(memcg);
  
-	mem_cgroup_id_put(memcg);

-
for_each_node(node)
free_mem_cgroup_per_zone_info(memcg, node);
  


--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.


[Devel] [PATCH RH7 v2 0/2] fixes for memory cgroup id refcounting

2023-07-10 Thread Pavel Tikhomirov
Complete memcg id refcount design:

- on cgroup online get css refcount and set memcg id refcount to 1
- on cgroup offline put memcg id refcount
- on swapout:
  - if cgroup memcg id refcount is 0 (not online), use parent cgroup
  - if cgroup memcg id refcount is >0, get it
  (previously in rh7 we got css refcount instead)
- on swapin put memcg id refcount
  (previously in rh7 we put css refcount instead)
- on charge move from one cgroup to another, put memcg id refcount on
the "from" cgroup and get it on the "to" cgroup
(previously in rh7 we put css refcount on "from" and got it on "to")
- on the last memcg id refcount put, also put the css refcount

Charge move is a bit tricky as we can move swap -> non-swap, thus we get
"to" memcg id refcount in mem_cgroup_move_charge_pte_range, and put
"from" memcg id refcount later in __mem_cgroup_clear_mc.

So, in the first patch of this series we remove leftover memcg id
refcount put on memcg free path, and in the second patch we remove
leftover "to" css refcount get in task move code for moved swap.

https://jira.vzint.dev/browse/PSBM-148818
Signed-off-by: Pavel Tikhomirov 

Pavel Tikhomirov (2):
  mm/memcontrol: remove excess put of memory cgroup id refcount on css
free
  mm/memcontrol: remove excess css_get on task migration with swap

 mm/memcontrol.c | 14 --
 1 file changed, 14 deletions(-)

-- 
2.41.0



[Devel] [PATCH RH7 v2 1/2] mm/memcontrol: remove excess put of memory cgroup id refcount on css free

2023-07-10 Thread Pavel Tikhomirov
The removed hunk comes from RHEL's partial porting of [1]: there they
added mem_cgroup_id_put on memcg free instead of offline. But they also
didn't do refcounting in mem_cgroup_id_put, only removed the id from the
idr.

Now we do actual refcounting in mem_cgroup_id_put, and the atomic_set(1)
and css_get in mem_cgroup_css_online are paired with mem_cgroup_id_put
in mem_cgroup_css_offline. The call in the free path is excess and we
should remove it.

This is not a critical excess refcount put, as it happens at the late
cgroup freeing stage, which can only be reached after this id refcount
reaches zero, which leads to css_put, which leads to freeing. So nobody
cares about the id refcount at this point.

https://jira.vzint.dev/browse/PSBM-148818

Fixes: 9dcef96ce3e2 ("ms/mm: memcontrol: fix cgroup creation failure after many small jobs") [1]
Signed-off-by: Pavel Tikhomirov 
---
 mm/memcontrol.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3a5e89b913bd..c7c15c6dd6fb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6818,8 +6818,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 
mem_cgroup_remove_from_trees(memcg);
 
-   mem_cgroup_id_put(memcg);
-
for_each_node(node)
free_mem_cgroup_per_zone_info(memcg, node);
 
-- 
2.41.0



[Devel] [PATCH RH7 v2 2/2] mm/memcontrol: remove excess css_get on task migration with swap

2023-07-10 Thread Pavel Tikhomirov
Before [1] we took a css refcount for each swap page; after it we take a
memory cgroup id refcount instead. That's why we don't need to get a css
refcount on the destination cgroup when moving a swap page to it.

https://jira.vzint.dev/browse/PSBM-148818

Fixes: 9dcef96ce3e2 ("ms/mm: memcontrol: fix cgroup creation failure after many small jobs") [1]
Fixes: 11870ae0f7b6 ("ms/mm: memcontrol: fix memcg id ref counter on swap charge move")
Signed-off-by: Pavel Tikhomirov 
---
 mm/memcontrol.c | 12 
 1 file changed, 12 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c7c15c6dd6fb..a2fe48d93d02 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4356,18 +4356,6 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
mem_cgroup_swap_statistics(from, false);
mem_cgroup_swap_statistics(to, true);
-   /*
-* This function is only called from task migration context now.
-* It postpones page_counter and refcount handling till the end
-* of task migration(mem_cgroup_clear_mc()) for performance
-* improvement. But we cannot postpone css_get(to)  because if
-* the process that has been moved to @to does swap-in, the
-* refcount of @to might be decreased to 0.
-*
-* We are in attach() phase, so the cgroup is guaranteed to be
-* alive, so we can just call css_get().
-*/
-   css_get(>css);
return 0;
}
return -EINVAL;
-- 
2.41.0



[Devel] [PATCH RH7] mm/memcontrol: remove excess put of memory cgroup id refcount on css free

2023-07-10 Thread Pavel Tikhomirov
The removed hunk comes from RHEL's partial porting of [1]: there they
added mem_cgroup_id_put on memcg free instead of offline. But they also
didn't do refcounting in mem_cgroup_id_put, only removed the id from the
idr.

Now we do actual refcounting in mem_cgroup_id_put, and the atomic_set(1)
and css_get in mem_cgroup_css_online are paired with mem_cgroup_id_put
in mem_cgroup_css_offline. The call in the free path is excess and we
should remove it.

This is not a critical excess refcount put, as it happens at the late
cgroup freeing stage, which can only be reached after this id refcount
reaches zero, which leads to css_put, which leads to freeing. So nobody
cares about the id refcount at this point.

https://jira.vzint.dev/browse/PSBM-148818

Fixes: 9dcef96ce3e2 ("ms/mm: memcontrol: fix cgroup creation failure after many small jobs") [1]
Signed-off-by: Pavel Tikhomirov 
---
 mm/memcontrol.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3a5e89b913bd..c7c15c6dd6fb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6818,8 +6818,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 
mem_cgroup_remove_from_trees(memcg);
 
-   mem_cgroup_id_put(memcg);
-
for_each_node(node)
free_mem_cgroup_per_zone_info(memcg, node);
 
-- 
2.41.0



[Devel] [PATCH RH7 v2] mm/memcontrol: do not put css when putting page to swap or moving swap page

2023-07-07 Thread Pavel Tikhomirov
When commit [1] was ported, we should not have removed css_get in
mem_cgroup_swapout: when a page goes from memory to swap we need to
keep a refcount on the css in case we later want to bring it back. In
vz7 we don't have commit [2], which in mainstream precedes [1]; that
is why mainstream [1] could remove css_get, as there is no need to
take a second reference for the same page. So the fix for [1] is to
bring back css_get.

When commit [3] was ported, we should have removed this css_get, since
we now take mem_cgroup_id_get instead. But we should not have added
css_put there, as it would be an excess put (removing a get and adding
a put is equivalent to a double put). In mainstream, [3] is preceded
by [2], so there they need to drop the css reference of a page put to
swap and replace it with an id reference. So the fix for [1] + [2] is
to remove the css_put.

When commit [4] was ported, we should have removed the css_put-s in
__mem_cgroup_clear_mc: after [3] we no longer hold a css reference for
swapped pages, so there is nothing to put when moving them to another
cgroup.

Now I also understand why the original mainstream patch replaces
css_put(from) with css_put(to): after [3] they likewise hold no
refcount for a page that is already in swap, but when they charge
pages to the destination cgroup they obviously take a css reference
for them, and since those pages are in swap they need to put that
reference.

https://jira.vzint.dev/browse/PSBM-148702

Fixes: d0f735e5df66 ("ms/mm: memcontrol: uncharge pages on swapout") [1]
e8ea14cc6ead ("mm: memcontrol: take a css reference for each charged page") [2]
Fixes: 9dcef96ce3e2 ("ms/mm: memcontrol: fix cgroup creation failure after many 
small jobs") [3]
Fixes: 11870ae0f7b6 ("ms/mm: memcontrol: fix memcg id ref counter on swap 
charge move") [4]
Signed-off-by: Pavel Tikhomirov 
---
 mm/memcontrol.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b2911329852f..3a5e89b913bd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7413,7 +7413,6 @@ static void __mem_cgroup_clear_mc(void)
 {
struct mem_cgroup *from = mc.from;
struct mem_cgroup *to = mc.to;
-   int i;
 
/* we must uncharge all the leftover precharges from mc.to */
if (mc.precharge) {
@@ -7436,9 +7435,6 @@ static void __mem_cgroup_clear_mc(void)
 
mem_cgroup_id_put_many(mc.from, mc.moved_swap);
 
-   for (i = 0; i < mc.moved_swap; i++)
-   css_put(&mc.from->css);
-
if (!mem_cgroup_is_root(mc.to)) {
/*
 * we charged both to->memory and to->memsw, so we
@@ -7837,9 +7833,6 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t 
entry)
 
mem_cgroup_charge_statistics(memcg, page, -1);
memcg_check_events(memcg, page);
-
-   if (!mem_cgroup_is_root(memcg))
-   css_put(&memcg->css);
 }
 
 /**
-- 
2.41.0



[Devel] [PATCH RH7] mm/memcontrol: do not put css when putting page to swap or moving swap page

2023-07-07 Thread Pavel Tikhomirov
When commit [1] was ported, we should not have removed css_get in
mem_cgroup_swapout: when a page goes from memory to swap we need to
keep a refcount on the css in case we later want to bring it back. In
vz7 we don't have commit [2], which in mainstream precedes [1]; that
is why mainstream [1] could remove css_get, as there is no need to
take a second reference for the same page. So the fix for [1] is to
bring back css_get.

When commit [3] was ported, we should have removed this css_get, since
we now take mem_cgroup_id_get instead. But we should not have added
css_put there, as it would be an excess put (removing a get and adding
a put is equivalent to a double put). In mainstream, [3] is preceded
by [2], so there they need to drop the css reference of a page put to
swap and replace it with an id reference. So the fix for [1] + [2] is
to remove the css_put.

When commit [4] was ported, we should have removed the css_put-s in
__mem_cgroup_clear_mc: after [3] we no longer hold a css reference for
swapped pages, so there is nothing to put when moving them to another
cgroup.

Now I also understand why the original mainstream patch replaces
css_put(from) with css_put(to): after [3] they likewise hold no
refcount for a page that is already in swap, but when they charge
pages to the destination cgroup they obviously take a css reference
for them, and since those pages are in swap they need to put that
reference.

https://jira.vzint.dev/browse/PSBM-148702

Fixes: d0f735e5df66 ("ms/mm: memcontrol: uncharge pages on swapout") [1]
e8ea14cc6ead ("mm: memcontrol: take a css reference for each charged page") [2]
Fixes: 9dcef96ce3e2 ("ms/mm: memcontrol: fix cgroup creation failure after many 
small jobs") [3]
Fixes: 11870ae0f7b6 ("ms/mm: memcontrol: fix memcg id ref counter on swap 
charge move") [4]
Signed-off-by: Pavel Tikhomirov 
---
 mm/memcontrol.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b2911329852f..8f74de3ebace 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7436,9 +7436,6 @@ static void __mem_cgroup_clear_mc(void)
 
mem_cgroup_id_put_many(mc.from, mc.moved_swap);
 
-   for (i = 0; i < mc.moved_swap; i++)
-   css_put(&mc.from->css);
-
if (!mem_cgroup_is_root(mc.to)) {
/*
 * we charged both to->memory and to->memsw, so we
@@ -7837,9 +7834,6 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t 
entry)
 
mem_cgroup_charge_statistics(memcg, page, -1);
memcg_check_events(memcg, page);
-
-   if (!mem_cgroup_is_root(memcg))
-   css_put(&memcg->css);
 }
 
 /**
-- 
2.41.0



[Devel] [PATCH RH7 9/9] mm: memcontrol: avoid unused function warning

2023-07-05 Thread Pavel Tikhomirov
From: Arnd Bergmann 

A bugfix in v4.8-rc2 introduced a harmless warning when
CONFIG_MEMCG_SWAP is disabled but CONFIG_MEMCG is enabled:

  mm/memcontrol.c:4085:27: error: 'mem_cgroup_id_get_online' defined but not 
used [-Werror=unused-function]
   static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)

This moves the function inside of the #ifdef block that hides the
calling function, to avoid the warning.

Fixes: 1f47b61fb407 ("mm: memcontrol: fix swap counter leak on swapout from 
offline cgroup")
Link: http://lkml.kernel.org/r/20160824113733.2776701-1-a...@arndb.de
Signed-off-by: Arnd Bergmann 
Acked-by: Michal Hocko 
Acked-by: Vladimir Davydov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.vzint.dev/browse/PSBM-147036

(cherry picked from commit 358c07fcc3b60ab08d77f1684de8bd81bcf49a1a)
Signed-off-by: Pavel Tikhomirov 
---
 mm/memcontrol.c | 36 ++--
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 098c18f4981e..5ea9dfcae678 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6688,24 +6688,6 @@ static void mem_cgroup_id_get_many(struct mem_cgroup 
*memcg, unsigned int n)
atomic_add(n, &memcg->id.ref);
 }
 
-static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
-{
-   while (!atomic_inc_not_zero(&memcg->id.ref)) {
-   /*
-* The root cgroup cannot be destroyed, so it's refcount must
-* always be >= 1.
-*/
-   if (WARN_ON_ONCE(memcg == root_mem_cgroup)) {
-   VM_BUG_ON(1);
-   break;
-   }
-   memcg = parent_mem_cgroup(memcg);
-   if (!memcg)
-   memcg = root_mem_cgroup;
-   }
-   return memcg;
-}
-
 static void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n)
 {
VM_BUG_ON(atomic_read(&memcg->id.ref) < n);
@@ -7785,6 +7767,24 @@ static void __init enable_swap_cgroup(void)
 #endif
 
 #ifdef CONFIG_MEMCG_SWAP
+static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
+{
+   while (!atomic_inc_not_zero(&memcg->id.ref)) {
+   /*
+* The root cgroup cannot be destroyed, so it's refcount must
+* always be >= 1.
+*/
+   if (WARN_ON_ONCE(memcg == root_mem_cgroup)) {
+   VM_BUG_ON(1);
+   break;
+   }
+   memcg = parent_mem_cgroup(memcg);
+   if (!memcg)
+   memcg = root_mem_cgroup;
+   }
+   return memcg;
+}
+
 /**
  * mem_cgroup_swapout - transfer a memsw charge to swap
  * @page: page whose memsw charge to transfer
-- 
2.40.1



[Devel] [PATCH RH7 6/9] memcg: remove memcg_cgroup::id from IDR on mem_cgroup_css_alloc() failure

2023-07-05 Thread Pavel Tikhomirov
From: Kirill Tkhai 

In case of memcg_online_kmem() failure, memcg_cgroup::id remains hashed
in mem_cgroup_idr even after memcg memory is freed.  This leads to leak
of ID in mem_cgroup_idr.

This patch adds removal into mem_cgroup_css_alloc(), which fixes the
problem.  For better readability, it adds a generic helper which is used
in mem_cgroup_alloc() and mem_cgroup_id_put_many() as well.

Link: 
http://lkml.kernel.org/r/152354470916.22460.14397070748001974638.stgit@localhost.localdomain
Fixes 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many 
small jobs")
Signed-off-by: Kirill Tkhai 
Acked-by: Johannes Weiner 
Acked-by: Vladimir Davydov 
Cc: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

Changes:
- skip hunk in mem_cgroup_css_alloc as there is no such error path yet,
patch is not strictly required but better have it as a small cleanup

https://jira.vzint.dev/browse/PSBM-147036

(cherry picked from commit 7e97de0b033bcac4fa9a35cef72e0c06e6a22c67)
Signed-off-by: Pavel Tikhomirov 
---
 mm/memcontrol.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8c14585f5e86..af99281292e5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6674,6 +6674,14 @@ unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
return memcg->id.id;
 }
 
+static void mem_cgroup_id_remove(struct mem_cgroup *memcg)
+{
+   if (memcg->id.id > 0) {
+   idr_remove(&mem_cgroup_idr, memcg->id.id);
+   memcg->id.id = 0;
+   }
+}
+
 static void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n)
 {
atomic_add(n, &memcg->id.ref);
@@ -6700,8 +6708,7 @@ static struct mem_cgroup *mem_cgroup_id_get_online(struct 
mem_cgroup *memcg)
 static void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n)
 {
if (atomic_sub_and_test(n, &memcg->id.ref)) {
-   idr_remove(&mem_cgroup_idr, memcg->id.id);
-   memcg->id.id = 0;
+   mem_cgroup_id_remove(memcg);
 
/* Memcg ID pins CSS */
css_put(&memcg->css);
@@ -6811,8 +6818,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
for_each_node(node)
free_mem_cgroup_per_zone_info(memcg, node);
 
-   if (memcg->id.id > 0)
-   idr_remove(&mem_cgroup_idr, memcg->id.id);
+   mem_cgroup_id_remove(memcg);
 fail:
kfree(memcg);
return NULL;
-- 
2.40.1



[Devel] [PATCH RH7 4/9] mm: memcontrol: fix memcg id ref counter on swap charge move

2023-07-05 Thread Pavel Tikhomirov
From: Vladimir Davydov 

Since commit 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure
after many small jobs") swap entries do not pin memcg->css.refcnt
directly.  Instead, they pin memcg->id.ref.  So we should adjust the
reference counters accordingly when moving swap charges between cgroups.

Fixes: 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many 
small jobs")
Link: 
http://lkml.kernel.org/r/9ce297c64954a42dc90b543bc76106c4a94f07e8.1470219853.git.vdavy...@virtuozzo.com
Signed-off-by: Vladimir Davydov 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Cc: [3.19+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

Changes:
original patch also did:
-   css_put_many(>css, mc.moved_swap);
+   css_put_many(>css, mc.moved_swap);
but this change seems strange and unrelated to me, let's skip it.

https://jira.vzint.dev/browse/PSBM-147036

(cherry picked from commit 615d66c37c755c49ce022c9e5ac0875d27d2603d)
Signed-off-by: Pavel Tikhomirov 
---
 mm/memcontrol.c | 24 +++-
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a9a1fb354d33..8b78c7c8d3e3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6674,9 +6674,9 @@ unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
return memcg->id.id;
 }
 
-static void mem_cgroup_id_get(struct mem_cgroup *memcg)
+static void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n)
 {
-   atomic_inc(&memcg->id.ref);
+   atomic_add(n, &memcg->id.ref);
 }
 
 static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
@@ -6697,9 +6697,9 @@ static struct mem_cgroup *mem_cgroup_id_get_online(struct 
mem_cgroup *memcg)
return memcg;
 }
 
-static void mem_cgroup_id_put(struct mem_cgroup *memcg)
+static void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n)
 {
-   if (atomic_dec_and_test(&memcg->id.ref)) {
+   if (atomic_sub_and_test(n, &memcg->id.ref)) {
idr_remove(&mem_cgroup_idr, memcg->id.id);
memcg->id.id = 0;
 
@@ -6708,6 +6708,16 @@ static void mem_cgroup_id_put(struct mem_cgroup *memcg)
}
 }
 
+static inline void mem_cgroup_id_get(struct mem_cgroup *memcg)
+{
+   mem_cgroup_id_get_many(memcg, 1);
+}
+
+static inline void mem_cgroup_id_put(struct mem_cgroup *memcg)
+{
+   mem_cgroup_id_put_many(memcg, 1);
+}
+
 /**
  * mem_cgroup_from_id - look up a memcg from a memcg id
  * @id: the memcg id to look up
@@ -7442,6 +7452,8 @@ static void __mem_cgroup_clear_mc(void)
if (!mem_cgroup_is_root(mc.from))
page_counter_uncharge(&mc.from->memsw, mc.moved_swap);
 
+   mem_cgroup_id_put_many(mc.from, mc.moved_swap);
+
for (i = 0; i < mc.moved_swap; i++)
css_put(&mc.from->css);
 
@@ -7452,7 +7464,9 @@ static void __mem_cgroup_clear_mc(void)
 */
page_counter_uncharge(&mc.to->memory, mc.moved_swap);
}
-   /* we've already done css_get(mc.to) */
+
+   mem_cgroup_id_get_many(mc.to, mc.moved_swap);
+
mc.moved_swap = 0;
}
if (do_swap_account) {
-- 
2.40.1



[Devel] [PATCH RH7 8/9] mm/memcontrol.c: fix a -Wunused-function warning

2023-07-05 Thread Pavel Tikhomirov
From: Qian Cai 

mem_cgroup_id_get() was introduced in commit 73f576c04b94 ("mm:memcontrol:
fix cgroup creation failure after many small jobs").

Later, it no longer has any user since the commits,

1f47b61fb407 ("mm: memcontrol: fix swap counter leak on swapout from offline 
cgroup")
58fa2a5512d9 ("mm: memcontrol: add sanity checks for memcg->id.ref on get/put")

so safe to remove it.

Link: http://lkml.kernel.org/r/1568648453-5482-1-git-send-email-...@lca.pw
Signed-off-by: Qian Cai 
Acked-by: Michal Hocko 
Cc: Johannes Weiner 
Cc: Vladimir Davydov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.vzint.dev/browse/PSBM-147036

(cherry picked from commit 4d0e3230a56a75bf26d74b3f4b6cb05a0e9fae60)
Signed-off-by: Pavel Tikhomirov 
---
 mm/memcontrol.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 42ca14e637ef..098c18f4981e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6717,11 +6717,6 @@ static void mem_cgroup_id_put_many(struct mem_cgroup 
*memcg, unsigned int n)
}
 }
 
-static inline void mem_cgroup_id_get(struct mem_cgroup *memcg)
-{
-   mem_cgroup_id_get_many(memcg, 1);
-}
-
 static inline void mem_cgroup_id_put(struct mem_cgroup *memcg)
 {
mem_cgroup_id_put_many(memcg, 1);
-- 
2.40.1



[Devel] [PATCH RH7 2/9] mm: memcontrol: fix cgroup creation failure after many small jobs

2023-07-05 Thread Pavel Tikhomirov
From: Johannes Weiner 

The memory controller has quite a bit of state that usually outlives the
cgroup and pins its CSS until said state disappears.  At the same time
it imposes a 16-bit limit on the CSS ID space to economically store IDs
in the wild.  Consequently, when we use cgroups to contain frequent but
small and short-lived jobs that leave behind some page cache, we quickly
run into the 64k limitations of outstanding CSSs.  Creating a new cgroup
fails with -ENOSPC while there are only a few, or even no user-visible
cgroups in existence.

Although pinning CSSs past cgroup removal is common, there are only two
instances that actually need an ID after a cgroup is deleted: cache
shadow entries and swapout records.

Cache shadow entries reference the ID weakly and can deal with the CSS
having disappeared when it's looked up later.  They pose no hurdle.

Swap-out records do need to pin the css to hierarchically attribute
swapins after the cgroup has been deleted; though the only pages that
remain swapped out after offlining are tmpfs/shmem pages.  And those
references are under the user's control, so they are manageable.

This patch introduces a private 16-bit memcg ID and switches swap and
cache shadow entries over to using that.  This ID can then be recycled
after offlining when the CSS remains pinned only by objects that don't
specifically need it.
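The scheme described above can be sketched with a tiny stand-in for the id space: id slots are handed out from a fixed table and recycled as soon as a group is offlined, even if the object itself stays pinned longer. A minimal C model with illustrative names (not the kernel idr API):

```c
#include <stddef.h>

/* Toy model of a small private ID space whose slots are recycled at
 * offline, independently of the object's own lifetime.  Names are
 * illustrative, not kernel code. */
#define ID_SPACE 4                      /* tiny stand-in for the 64k id space */

static void *id_table[ID_SPACE + 1];    /* models mem_cgroup_idr; ids start at 1 */

static int id_alloc(void *obj)
{
    for (int id = 1; id <= ID_SPACE; id++) {
        if (!id_table[id]) {
            id_table[id] = obj;
            return id;
        }
    }
    return -1;                          /* models -ENOSPC: id space depleted */
}

static void id_remove(int id)           /* called at offline, not at final free */
{
    id_table[id] = NULL;
}
```

Before the patch, `id_remove` effectively ran only at final free, so pinned-but-dead groups kept their slots and `id_alloc` hit `-1` exactly as in the reproducer below.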

This script demonstrates the problem by faulting one cache page in a new
cgroup and deleting it again:

  set -e
  mkdir -p pages
  for x in `seq 128000`; do
[ $((x % 1000)) -eq 0 ] && echo $x
mkdir /cgroup/foo
echo $$ >/cgroup/foo/cgroup.procs
echo trex >pages/$x
echo $$ >/cgroup/cgroup.procs
rmdir /cgroup/foo
  done

When run on an unpatched kernel, we eventually run out of possible IDs
even though there are no visible cgroups:

  [root@ham ~]# ./cssidstress.sh
  [...]
  65000
  mkdir: cannot create directory '/cgroup/foo': No space left on device

After this patch, the IDs get released upon cgroup destruction and the
cache and css objects get released once memory reclaim kicks in.

[han...@cmpxchg.org: init the IDR]
  Link: http://lkml.kernel.org/r/20160621154601.ga22...@cmpxchg.org
Fixes: b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined 
groups")
Link: http://lkml.kernel.org/r/20160617162516.gd19...@cmpxchg.org
Signed-off-by: Johannes Weiner 
Reported-by: John Garcia 
Reviewed-by: Vladimir Davydov 
Acked-by: Tejun Heo 
Cc: Nikolay Borisov 
Cc: [3.19+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

Changes:
- rhel almost ported it, so some things are already there
- remove synchronize_rcu() from rhel version
- skip mem_cgroup_try_charge_swap hunk
- the original patch didn't have the "if (!cont->parent) return 0;"
check in mem_cgroup_css_online; we must handle memcg->id.id before it
so as not to break refcounting on cgroups without a parent
(e.g. root_mem_cgroup)

https://jira.vzint.dev/browse/PSBM-147036

(cherry picked from commit 73f576c04b9410ed19660f74f97521bee6e1c546)
Signed-off-by: Pavel Tikhomirov 
---
 mm/memcontrol.c | 62 +
 1 file changed, 42 insertions(+), 20 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a334e9a1a311..6356b6532163 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -176,6 +176,11 @@ enum mem_cgroup_events_target {
 #define SOFTLIMIT_EVENTS_TARGET 1024
 #define NUMAINFO_EVENTS_TARGET 1024
 
+struct mem_cgroup_id {
+   int id;
+   atomic_t ref;
+};
+
 static void mem_cgroup_id_put(struct mem_cgroup *memcg);
 
 struct mem_cgroup_stat_cpu {
@@ -302,7 +307,7 @@ struct mem_cgroup {
struct cgroup_subsys_state css;
 
/* Private memcg ID. Used to ID objects that outlive the cgroup */
-   unsigned short id;
+   struct mem_cgroup_id id;
 
/*
 * the counter to account for memory usage
@@ -6665,14 +6670,24 @@ unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
if (mem_cgroup_disabled())
return 0;
-   return memcg->id;
+
+   return memcg->id.id;
+}
+
+static void mem_cgroup_id_get(struct mem_cgroup *memcg)
+{
+   atomic_inc(&memcg->id.ref);
 }
 
 static void mem_cgroup_id_put(struct mem_cgroup *memcg)
 {
-   idr_remove(&mem_cgroup_idr, memcg->id);
-   memcg->id = 0;
-   synchronize_rcu();
+   if (atomic_dec_and_test(&memcg->id.ref)) {
+   idr_remove(&mem_cgroup_idr, memcg->id.id);
+   memcg->id.id = 0;
+
+   /* Memcg ID pins CSS */
+   css_put(&memcg->css);
+   }
 }
 
 /**
@@ -6726,7 +6741,6 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 {
struct mem_cgroup *memcg;
size_t size;
-   int id;
int i, ret;
int node;
 
@@ -6737,14 +6751,12 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
if (!memcg)
return NULL;
 
-   id = idr_alloc(&mem_cgroup_i

[Devel] [PATCH RH7 3/9] mm: memcontrol: fix swap counter leak on swapout from offline cgroup

2023-07-05 Thread Pavel Tikhomirov
From: Vladimir Davydov 

An offline memory cgroup might have anonymous memory or shmem left
charged to it and no swap.  Since only swap entries pin the id of an
offline cgroup, such a cgroup will have no id and so an attempt to
swapout its anon/shmem will not store memory cgroup info in the swap
cgroup map.  As a result, memcg->swap or memcg->memsw will never get
uncharged from it and any of its ascendants.

Fix this by always charging swapout to the first ancestor cgroup that
hasn't released its id yet.
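The fix described above walks up the hierarchy until it finds a group whose id is still live. A simplified, non-atomic C sketch of what mem_cgroup_id_get_online() does, with illustrative names:

```c
#include <stddef.h>

/* Toy model: charge the swapout to the nearest ancestor whose id is
 * still live.  Illustrative names, not kernel code. */
struct toy_cg {
    int id_ref;              /* 0 once the group is offline (id released) */
    struct toy_cg *parent;   /* NULL for the root */
};

static struct toy_cg *get_online_ancestor(struct toy_cg *cg, struct toy_cg *root)
{
    while (cg->id_ref == 0) {            /* atomic_inc_not_zero() would fail */
        cg = cg->parent ? cg->parent : root;
        if (cg == root)
            break;                       /* the root is never destroyed */
    }
    cg->id_ref++;                        /* pin the id we settled on */
    return cg;
}
```

In the kernel the inc-not-zero and the walk are one atomic loop; the sketch only shows the hierarchical fallback.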

[han...@cmpxchg.org: add comment to mem_cgroup_swapout]
[vdavy...@virtuozzo.com: use WARN_ON_ONCE() in mem_cgroup_id_get_online()]
  Link: http://lkml.kernel.org/r/20160803123445.GJ13263@esperanza
Fixes: 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many 
small jobs")
Link: 
http://lkml.kernel.org/r/5336daa5c9a32e776067773d9da655d2dc126491.1470219853.git.vdavy...@virtuozzo.com
Signed-off-by: Vladimir Davydov 
Acked-by: Johannes Weiner 
Acked-by: Michal Hocko 
Cc: [3.19+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.vzint.dev/browse/PSBM-147036

(cherry picked from commit 1f47b61fb4077936465dcde872a4e5cc4fe708da)
Signed-off-by: Pavel Tikhomirov 
---
 mm/memcontrol.c | 39 ++-
 1 file changed, 34 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6356b6532163..a9a1fb354d33 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6679,6 +6679,24 @@ static void mem_cgroup_id_get(struct mem_cgroup *memcg)
atomic_inc(&memcg->id.ref);
 }
 
+static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
+{
+   while (!atomic_inc_not_zero(&memcg->id.ref)) {
+   /*
+* The root cgroup cannot be destroyed, so it's refcount must
+* always be >= 1.
+*/
+   if (WARN_ON_ONCE(memcg == root_mem_cgroup)) {
+   VM_BUG_ON(1);
+   break;
+   }
+   memcg = parent_mem_cgroup(memcg);
+   if (!memcg)
+   memcg = root_mem_cgroup;
+   }
+   return memcg;
+}
+
 static void mem_cgroup_id_put(struct mem_cgroup *memcg)
 {
if (atomic_dec_and_test(&memcg->id.ref)) {
@@ -7759,7 +,7 @@ static void __init enable_swap_cgroup(void)
  */
 void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 {
-   struct mem_cgroup *memcg;
+   struct mem_cgroup *memcg, *swap_memcg;
struct page_cgroup *pc;
unsigned short oldid;
 
@@ -7778,17 +7796,28 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t 
entry)
VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), page);
memcg = pc->mem_cgroup;
 
-   mem_cgroup_id_get(memcg);
-   oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));
+   /*
+* In case the memcg owning these pages has been offlined and doesn't
+* have an ID allocated to it anymore, charge the closest online
+* ancestor for the swap instead and transfer the memory+swap charge.
+*/
+   swap_memcg = mem_cgroup_id_get_online(memcg);
+   oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg));
VM_BUG_ON_PAGE(oldid, page);
-   mem_cgroup_swap_statistics(memcg, true);
-   this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PSWPOUT]);
+   mem_cgroup_swap_statistics(swap_memcg, true);
+   this_cpu_inc(swap_memcg->stat->events[MEM_CGROUP_EVENTS_PSWPOUT]);
 
pc->flags = 0;
 
if (!mem_cgroup_is_root(memcg))
page_counter_uncharge(&memcg->memory, 1);
 
+   if (memcg != swap_memcg) {
+   if (!mem_cgroup_is_root(swap_memcg))
+   page_counter_charge(&swap_memcg->memsw, 1);
+   page_counter_uncharge(&memcg->memsw, 1);
+   }
+
/* XXX: caller holds IRQ-safe mapping->tree_lock */
VM_BUG_ON(!irqs_disabled());
 
-- 
2.40.1



[Devel] [PATCH RH7 7/9] mm: memcontrol: add sanity checks for memcg->id.ref on get/put

2023-07-05 Thread Pavel Tikhomirov
From: Vladimir Davydov 

Link: 
http://lkml.kernel.org/r/1c5ddb1c171dbdfc3262252769d6138a29b35b70.1470219853.git.vdavy...@virtuozzo.com
Signed-off-by: Vladimir Davydov 
Acked-by: Johannes Weiner 
Acked-by: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.vzint.dev/browse/PSBM-147036

(cherry picked from commit 58fa2a5512d9f224775fb01433f195e639953c5f)
Signed-off-by: Pavel Tikhomirov 
---
 mm/memcontrol.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index af99281292e5..42ca14e637ef 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6684,6 +6684,7 @@ static void mem_cgroup_id_remove(struct mem_cgroup *memcg)
 
 static void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n)
 {
+   VM_BUG_ON(atomic_read(&memcg->id.ref) <= 0);
atomic_add(n, &memcg->id.ref);
 }
 
@@ -6707,6 +6708,7 @@ static struct mem_cgroup *mem_cgroup_id_get_online(struct 
mem_cgroup *memcg)
 
 static void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n)
 {
+   VM_BUG_ON(atomic_read(&memcg->id.ref) < n);
if (atomic_sub_and_test(n, >id.ref)) {
mem_cgroup_id_remove(memcg);
 
@@ -6965,7 +6967,7 @@ mem_cgroup_css_online(struct cgroup *cont)
memcg = mem_cgroup_from_cont(cont);
 
/* Online state pins memcg ID, memcg ID pins CSS */
-   mem_cgroup_id_get(memcg);
+   atomic_set(&memcg->id.ref, 1);
css_get(&memcg->css);
 
if (!cont->parent) {
-- 
2.40.1



[Devel] [PATCH RH7 5/9] mm/memcg: fix refcount error while moving and swapping

2023-07-05 Thread Pavel Tikhomirov
From: Hugh Dickins 

It was hard to keep a test running, moving tasks between memcgs with
move_charge_at_immigrate, while swapping: mem_cgroup_id_get_many()'s
refcount is discovered to be 0 (supposedly impossible), so it is then
forced to REFCOUNT_SATURATED, and after thousands of warnings in quick
succession, the test is at last put out of misery by being OOM killed.

This is because of the way moved_swap accounting was saved up until the
task move gets completed in __mem_cgroup_clear_mc(), deferred from when
mem_cgroup_move_swap_account() actually exchanged old and new ids.
Concurrent activity can free up swap quicker than the task is scanned,
bringing id refcount down 0 (which should only be possible when
offlining).

Just skip that optimization: do that part of the accounting immediately.
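The deferred-accounting bug above can be reduced to two counter traces: with the old code the swap record is switched to the destination at move time, but the matching id reference is only taken later in __mem_cgroup_clear_mc(), so a concurrent swap-free in between puts a reference the destination never took. A C sketch with illustrative names:

```c
/* Toy illustration of the race: 'to' starts with only its online id
 * reference.  Illustrative names, not kernel code. */
struct toy_cg { int id_ref; };

static int deferred_move_then_free(struct toy_cg *to)
{
    /* swap record now names 'to'; the id_get is postponed ... */
    to->id_ref--;            /* ... so a concurrent swap-free underflows */
    return to->id_ref;       /* can hit 0 while 'to' is still online */
}

static int immediate_move_then_free(struct toy_cg *to)
{
    to->id_ref++;            /* the fix: take the ref at exchange time */
    to->id_ref--;            /* the concurrent swap-free is now balanced */
    return to->id_ref;
}
```

In the kernel the underflow manifests as the "supposedly impossible" refcount-0 warning; the fix simply makes the get immediate.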

Fixes: 615d66c37c75 ("mm: memcontrol: fix memcg id ref counter on swap charge 
move")
Signed-off-by: Hugh Dickins 
Signed-off-by: Andrew Morton 
Reviewed-by: Alex Shi 
Cc: Johannes Weiner 
Cc: Alex Shi 
Cc: Shakeel Butt 
Cc: Michal Hocko 
Cc: 
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2007071431050.4726@eggly.anvils
Signed-off-by: Linus Torvalds 

https://jira.vzint.dev/browse/PSBM-147036

(cherry picked from commit 8d22a9351035ef2ff12ef163a1091b8b8cf1e49c)
Signed-off-by: Pavel Tikhomirov 
---
 mm/memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8b78c7c8d3e3..8c14585f5e86 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7465,7 +7465,6 @@ static void __mem_cgroup_clear_mc(void)
page_counter_uncharge(&mc.to->memory, mc.moved_swap);
}
 
-   mem_cgroup_id_get_many(mc.to, mc.moved_swap);
 
mc.moved_swap = 0;
}
@@ -7625,7 +7624,8 @@ put:  /* get_mctgt_type() gets the 
page */
ent = target.ent;
if (!mem_cgroup_move_swap_account(ent, mc.from, mc.to)) 
{
mc.precharge--;
-   /* we fixup refcnts and charges later. */
+   mem_cgroup_id_get_many(mc.to, 1);
+   /* we fixup other refcnts and charges later. */
mc.moved_swap++;
}
break;
-- 
2.40.1



[Devel] [PATCH RH7 0/9] memcg: release id when offlinging cgroup

2023-07-05 Thread Pavel Tikhomirov
We see that a container user can deplete the memory cgroup ids on the
system (64k) and prevent further memory cgroup creation. In a crash
dump collected by our customer in such a situation, mem_cgroup_idr is
full of cgroups from one container with the same exact path (the
cgroup of the docker service). The cgroups are not released because
they have kmem charges; this kmem charge is for a tmpfs dentry
allocated from the cgroup. (And on the vz7 kernel such a dentry seems
to be released only after unmounting the tmpfs or removing the
corresponding file from it.)

So there is a valid way to hold a kmem cgroup for a long time. A
similar issue was mentioned in mainstream, with page cache holding a
kmem cgroup for a long time, and the proposed way to deal with it is
the same: release the cgroup id early so that new cgroups can be
allocated immediately.

Reproduce:
https://git.vzint.dev/users/ptikhomirov/repos/helpers/browse/memcg-related/test-mycg-tmpfs.sh

After this fix, the number of memory cgroups in /proc/cgroups can
exceed 64k, as we allow memory cgroups to linger while their ids are
released.

Note: it may be a bad idea to let a container eat kernel memory with
such hanging cgroups, but I don't have a better approach yet.

https://jira.vzint.dev/browse/PSBM-147473
https://jira.vzint.dev/browse/PSBM-147036
Signed-off-by: Pavel Tikhomirov 

Arnd Bergmann (1):
  mm: memcontrol: avoid unused function warning

Hugh Dickins (1):
  mm/memcg: fix refcount error while moving and swapping

Johannes Weiner (2):
  mm: memcontrol: uncharge pages on swapout
  mm: memcontrol: fix cgroup creation failure after many small jobs

Kirill Tkhai (1):
  memcg: remove memcg_cgroup::id from IDR on mem_cgroup_css_alloc()
failure

Qian Cai (1):
  mm/memcontrol.c: fix a -Wunused-function warning

Vladimir Davydov (3):
  mm: memcontrol: fix swap counter leak on swapout from offline cgroup
  mm: memcontrol: fix memcg id ref counter on swap charge move
  mm: memcontrol: add sanity checks for memcg->id.ref on get/put

 mm/memcontrol.c | 134 ++--
 1 file changed, 106 insertions(+), 28 deletions(-)

-- 
2.40.1



[Devel] [PATCH RH7 1/9] mm: memcontrol: uncharge pages on swapout

2023-07-05 Thread Pavel Tikhomirov
From: Johannes Weiner 

This series gets rid of the remaining page_cgroup flags, thus cutting the
memcg per-page overhead down to one pointer.

This patch (of 4):

mem_cgroup_swapout() is called with exclusive access to the page at the
end of the page's lifetime.  Instead of clearing the PCG_MEMSW flag and
deferring the uncharge, just do it right away.  This allows follow-up
patches to simplify the uncharge code.

Signed-off-by: Johannes Weiner 
Cc: Hugh Dickins 
Acked-by: Michal Hocko 
Acked-by: Vladimir Davydov 
Reviewed-by: KAMEZAWA Hiroyuki 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.vzint.dev/browse/PSBM-147036

(cherry picked from commit 7bdd143c37e591c254d0991ac398a53f3f9ef1af)
Signed-off-by: Pavel Tikhomirov 
---
 mm/memcontrol.c | 20 +++-
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f8a530e29b7a..a334e9a1a311 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7741,6 +7741,7 @@ static void __init enable_swap_cgroup(void)
  */
 void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 {
+   struct mem_cgroup *memcg;
struct page_cgroup *pc;
unsigned short oldid;
 
@@ -7757,14 +7758,23 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t 
entry)
return;
 
VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), page);
+   memcg = pc->mem_cgroup;
 
-   oldid = swap_cgroup_record(entry, mem_cgroup_id(pc->mem_cgroup));
+   oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));
VM_BUG_ON_PAGE(oldid, page);
+   mem_cgroup_swap_statistics(memcg, true);
+   this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PSWPOUT]);
+
+   pc->flags = 0;
 
-   pc->flags &= ~PCG_MEMSW;
-   css_get(&pc->mem_cgroup->css);
-   mem_cgroup_swap_statistics(pc->mem_cgroup, true);
-   this_cpu_inc(pc->mem_cgroup->stat->events[MEM_CGROUP_EVENTS_PSWPOUT]);
+   if (!mem_cgroup_is_root(memcg))
+   page_counter_uncharge(>memory, 1);
+
+   /* XXX: caller holds IRQ-safe mapping->tree_lock */
+   VM_BUG_ON(!irqs_disabled());
+
+   mem_cgroup_charge_statistics(memcg, page, -1);
+   memcg_check_events(memcg, page);
 }
 
 /**
-- 
2.40.1




Re: [Devel] [PATCH vz7 v3] ploop: properly restore old delta kobject on replace error

2023-06-14 Thread Pavel Tikhomirov

On 15.06.2023 04:23, Alexander Atanasov wrote:

Current code removes the old_delta kobject before the new delta is
completely ready to be used. If there is an error while reading the BAT
(the "too short BAT" error we have seen), the new delta and its objects
are destroyed properly, but the original delta kobject is not restored,
so the pdelta sysfs dir stays empty and userspace tools get confused,
since they consult these files before performing operations on the
ploop device.

To fix this, instead of deleting the delta's kobject early, move the
deletion to the end and on error restore the original kobject.

Extract code to rename kobject into a function and cleanup its users.
Since kobject_add can fail, get an extra reference on error so later
kobject_del/kobject_put would not destroy it unexpectedly. Object
should be intact except that it would not be visible in sysfs.

Fixes:
5f3ee110e6f4 (ploop: Repopulate holes_bitmap on changing delta, 2019-04-17)

https://jira.vzint.dev/browse/PSBM-146797


Reviewed-by: Pavel Tikhomirov 


Signed-off-by: Alexander Atanasov 
---
  drivers/block/ploop/dev.c | 54 ---
  1 file changed, 28 insertions(+), 26 deletions(-)

v1->v2: Addressing review comments. Removed the rename to minimize the
possibility of errors. Added getting extra refs after a failed add.

v2->v3: Dropped the idea of extra refs since they are not necessary.
A double kobject_put is not an issue since the parent is cleared and
kset objects are not used.

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 6eb22168b5fe..0d6272b39863 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -3594,27 +3594,24 @@ static int ploop_replace_delta(struct ploop_device * 
plo, unsigned long arg)
if (err)
goto out_destroy;
  
-	kobject_del(_delta->kobj);

-
-   err = KOBJECT_ADD(>kobj, kobject_get(>kobj),
- "%d", delta->level);
-   /* _put below is a pair for _get for OLD delta */
-   kobject_put(>kobj);
-
-   if (err < 0) {
-   kobject_put(>kobj);
-   goto out_close;
-   }
-
ploop_quiesce(plo);
if (delta->ops->replace_delta) {
err = delta->ops->replace_delta(delta);
if (err) {
ploop_relax(plo);
-   goto out_kobj_del;
+   goto out_close;
}
}
  
+	/* Remove old delta kobj to avoid name collision with the new one */

+   kobject_del(_delta->kobj);
+   err = KOBJECT_ADD(>kobj, kobject_get(>kobj),
+ "%d", delta->level);
+   if (err < 0) {
+   /* _put for failed _ADD */
+   kobject_put(>kobj);
+   goto out_kobj_restore;
+   }
ploop_map_destroy(>map);
list_replace_init(_delta->list, >list);
ploop_delta_list_changed(plo);
@@ -3632,11 +3629,14 @@ static int ploop_replace_delta(struct ploop_device * 
plo, unsigned long arg)
old_delta->ops->stop(old_delta);
old_delta->ops->destroy(old_delta);
kobject_put(_delta->kobj);
-   return 0;
-
-out_kobj_del:
-   kobject_del(>kobj);
kobject_put(>kobj);
+   return 0;
+out_kobj_restore:
+   /* we haven't dropped our plo->kobj ref just add back */
+   err = KOBJECT_ADD(_delta->kobj, >kobj, "%d", old_delta->level);
+   if (err < 0)
+   /* Nothing we can do unfortunately */
+   PL_ERR(plo, "Failed to restore old delta kobject\n");
  out_close:
delta->ops->stop(delta);
  out_destroy:
@@ -3965,6 +3965,16 @@ static void renumber_deltas(struct ploop_device * plo)
}
  }
  
+/*

+ * Have to implement own version of kobject_rename since it is GPL only symbol
+ */
+static int ploop_rename_delta(struct ploop_delta *delta, int level, char *pref)
+{
+   kobject_del(>kobj);
+   return KOBJECT_ADD(>kobj, >plo->kobj,
+ "%s%d", pref ? : "", level);
+}
+
  static void rename_deltas(struct ploop_device * plo, int level)
  {
struct ploop_delta * delta;
@@ -3974,15 +3984,7 @@ static void rename_deltas(struct ploop_device * plo, int 
level)
  
  		if (delta->level < level)

continue;
-#if 0
-   /* Oops, kobject_rename() is not exported! */
-   sprintf(nname, "%d", delta->level);
-   err = kobject_rename(>kobj, nname);
-#else
-   kobject_del(>kobj);
-   err = KOBJECT_ADD(>kobj, >kobj,
- "%d", delta->level);
-#endif
+   err = ploop_rename_delta(delta, delta->level, NULL);
if (err)
PL_WARN(plo, "rename_deltas:

Re: [Devel] [PATCH vz7 v2] ploop: properly restore old delta kobject on replace error

2023-06-09 Thread Pavel Tikhomirov




On 09.06.2023 00:19, Alexander Atanasov wrote:

Current code removes the old_delta kobject before the new delta
is ready to be used in case there is an error while reading BAT,
the "too short BAT" error we have seen.


The above sentence seems incomplete, as in the opposite case, where
there was no error, the old_delta kobject would have been removed before
the new delta is ready anyway =)



The new delta and its objects are destroyed properly but the original
delta kobject is never added back, so the pdelta sysfs dir stays empty
and userspace tools get confused, since they consult these files before
performing operations on the ploop device. At a later point, when the
ploop device is destroyed, a second kobject_del will be called, and it
can lead to a crash or corrupted memory.

To fix this, instead of deleting the delta early, move the deletion to
the end; on error, restore the original.

Extract code to rename kobject into a function and cleanup its users.
Since kobject_add can fail, get an extra reference on error so later
kobject_del/kobject_put would not destroy it unexpectedly. Object
should be intact except that it would not be visible in sysfs.

Fixes:
5f3ee110e6f4 (ploop: Repopulate holes_bitmap on changing delta, 2019-04-17)

https://jira.vzint.dev/browse/PSBM-146797
Signed-off-by: Alexander Atanasov 
---
  drivers/block/ploop/dev.c | 63 +++
  1 file changed, 37 insertions(+), 26 deletions(-)

v1->v2: Addressing review comments. Removed the rename to minimize
possibility of errors. added getting extra refs after failed add.

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 6eb22168b5fe..2833865b087f 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -3557,6 +3557,16 @@ static int ploop_add_delta(struct ploop_device * plo, 
unsigned long arg)
return err;
  }
  
+/*

+ * Have to implement own version of kobject_rename since it is GPL only symbol
+ */
+static int ploop_rename_delta(struct ploop_delta *delta, int level, char *pref)
+{
+   kobject_del(>kobj);
+   return KOBJECT_ADD(>kobj, >plo->kobj,
+ "%s%d", pref ? : "", level);
+}
+


It would probably be cleaner to put this function just before the 
function using it (rename_deltas).



  static int ploop_replace_delta(struct ploop_device * plo, unsigned long arg)
  {
int err;
@@ -3594,27 +3604,25 @@ static int ploop_replace_delta(struct ploop_device * 
plo, unsigned long arg)
if (err)
goto out_destroy;
  
-	kobject_del(_delta->kobj);

-
-   err = KOBJECT_ADD(>kobj, kobject_get(>kobj),
- "%d", delta->level);
-   /* _put below is a pair for _get for OLD delta */
-   kobject_put(>kobj);
-
-   if (err < 0) {
-   kobject_put(>kobj);
-   goto out_close;
-   }
  


^ double newline after code removal


ploop_quiesce(plo);
if (delta->ops->replace_delta) {
err = delta->ops->replace_delta(delta);
if (err) {
ploop_relax(plo);
-   goto out_kobj_del;
+   goto out_close;
}
}
  
+	/* Remove old delta kobj to avoid name collision with the new one */

+   kobject_del(_delta->kobj);
+   err = KOBJECT_ADD(>kobj, kobject_get(>kobj),
+ "%d", delta->level);
+   if (err < 0) {
+   /* _put for failed _ADD */
+   kobject_put(>kobj);
+   goto out_kobj_restore;
+   }
ploop_map_destroy(>map);
list_replace_init(_delta->list, >list);
ploop_delta_list_changed(plo);
@@ -3632,11 +3640,19 @@ static int ploop_replace_delta(struct ploop_device * 
plo, unsigned long arg)
old_delta->ops->stop(old_delta);
old_delta->ops->destroy(old_delta);
kobject_put(_delta->kobj);
-   return 0;
-
-out_kobj_del:
-   kobject_del(>kobj);
kobject_put(>kobj);
+   return 0;
+out_kobj_restore:
+   /* we haven't dropped our plo->kobj ref just add back */
+   err = KOBJECT_ADD(_delta->kobj, >kobj, "%d", old_delta->level);
+   if (err < 0) {
+   PL_ERR(plo, "Failed to restore old delta kobject\n");
+   /*
+* Get an extra ref to the parent object as kobject_add does, so a
+* later kobject_del doesn't destroy it
+*/
+   kobject_get(>kobj);


A later kobject_del would not see the parent object, as the first
kobject_del did "kobj->parent = NULL;", so in this respect you should
not be taking a reference here.


note: The problem I was talking about in my previous message was the kobj->kset put.


+   }
  out_close:
delta->ops->stop(delta);
  out_destroy:
@@ -3974,17 +3990,12 @@ static void rename_deltas(struct ploop_device * plo, 
int level)
  
  		if (delta->level < level)

continue;
-#if 0
-   /* Oops, kobject_rename() is not 

Re: [Devel] [PATCH vz7 v2] ploop: increase logging on errors when opening new deltas

2023-06-08 Thread Pavel Tikhomirov

On 09.06.2023 00:16, Alexander Atanasov wrote:

Occasionally we get EBUSY, but it is a bit overused,
so it is not clear what it means.

Add more logging to catch the source of the error.

https://jira.vzint.dev/browse/PSBM-146836


Reviewed-by: Pavel Tikhomirov 


Signed-off-by: Alexander Atanasov 
---
  drivers/block/ploop/dev.c |  9 -
  drivers/block/ploop/fmt_ploop1.c  |  4 +++-
  drivers/block/ploop/io_kaio.c | 13 ++---
  drivers/block/ploop/io_kaio_map.c |  7 +--
  4 files changed, 26 insertions(+), 7 deletions(-)

v1->v2: addressing review comments - print the error value and fix coding style

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 2833865b087f..8e238eafd9f2 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -3582,8 +3582,11 @@ static int ploop_replace_delta(struct ploop_device * 
plo, unsigned long arg)
   sizeof(struct ploop_ctl_chunk)))
return -EFAULT;
  
-	if (plo->maintenance_type != PLOOP_MNTN_OFF)

+   if (plo->maintenance_type != PLOOP_MNTN_OFF) {
+   if (printk_ratelimit())
+   PL_WARN(plo, "Attempt to replace while in maintenance 
mode\n");
return -EBUSY;
+   }
  
  	old_delta = find_delta(plo, ctl.pctl_level);

if (old_delta == NULL)
@@ -3596,6 +3599,10 @@ static int ploop_replace_delta(struct ploop_device * 
plo, unsigned long arg)
if (IS_ERR(delta))
return PTR_ERR(delta);
  
+	WARN_ONCE(delta->ops != old_delta->ops,

+ "New delta uses different io %p vs %p\n",
+ delta->ops, old_delta->ops);
+
err = delta->ops->compose(delta, 1, );
if (err)
goto out_destroy;
diff --git a/drivers/block/ploop/fmt_ploop1.c b/drivers/block/ploop/fmt_ploop1.c
index e59a9eb50ac2..a0369db35c83 100644
--- a/drivers/block/ploop/fmt_ploop1.c
+++ b/drivers/block/ploop/fmt_ploop1.c
@@ -314,8 +314,10 @@ ploop1_open(struct ploop_delta * delta)
if (!(delta->flags & PLOOP_FMT_RDONLY)) {
pvd_header_set_disk_in_use(vh);
err = delta->io.ops->sync_write(>io, ph->dyn_page, 4096, 
0, 0);
-   if (err)
+   if (err) {
+   PL_ERR(delta->plo, "failed to update in use err=%d\n", 
err);
goto out_err;
+   }
}
  
  	delta->io.alloc_head = ph->alloc_head;

diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
index 35c0fad43baf..098f6c7b5f2d 100644
--- a/drivers/block/ploop/io_kaio.c
+++ b/drivers/block/ploop/io_kaio.c
@@ -966,17 +966,23 @@ static int kaio_open(struct ploop_io * io)
io->files.bdev = io->files.inode->i_sb->s_bdev;
  
  	err = io->ops->sync(io);

-   if (err)
+   if (err) {
+   PL_WARN(delta->plo, "open failed to sync err=%d\n", err);
return err;
+   }
  
  	mutex_lock(>files.inode->i_mutex);

err = kaio_invalidate_cache(io);
-   if (!err)
+   if (err)
+   PL_WARN(delta->plo, "invaldiate_cache failed err=%d\n", err);
+   else
err = ploop_kaio_open(file, delta->flags & PLOOP_FMT_RDONLY);
mutex_unlock(>files.inode->i_mutex);
  
-	if (err)

+   if (err) {
+   PL_WARN(delta->plo, "open failed err=%d\n", err);
return err;
+   }
  
  	io->files.em_tree = _em_tree;
  
@@ -988,6 +994,7 @@ static int kaio_open(struct ploop_io * io)

err = PTR_ERR(io->fsync_thread);
io->fsync_thread = NULL;
ploop_kaio_close(io->files.mapping, 0);
+   PL_WARN(delta->plo, "fsync thread start failed=%d\n", 
err);
return err;
}
  
diff --git a/drivers/block/ploop/io_kaio_map.c b/drivers/block/ploop/io_kaio_map.c

index d4ff39d95e74..59e35c562ef9 100644
--- a/drivers/block/ploop/io_kaio_map.c
+++ b/drivers/block/ploop/io_kaio_map.c
@@ -35,10 +35,13 @@ int ploop_kaio_open(struct file * file, int rdonly)
else
m->readers++;
} else {
-   if (m->readers)
+   if (m->readers) {
+   pr_warn("File is already active:%d\n",
+   m->readers);
err = -EBUSY;
-   else
+   } else {
m->readers = -1;
+   }
}
goto kaio_open_done;
  

Re: [Devel] [PATCH vz7] ploop: properly restore old delta kobject on replace error

2023-06-06 Thread Pavel Tikhomirov




On 22.05.2023 23:19, Alexander Atanasov wrote:

Current code removes the old_delta kobject before the new delta
is ready to be used in case there is an error while reading BAT,
the "too short BAT" error we have seen.
The new delta and its objects are destroyed properly but the original
delta kobject is never added back, so pdelta sysfs dir stay empty and
userspace tools get confused since they consult this files before
performing operations on the ploop device. At later point when ploop
device is destroyed a second kobject_del will be called and it can lead
to a crash or corrupted memory.

To fix this instead of deleting the delta early rename it with prefix
and move the deletion at the end, while on error rename back to original
name. This way if something goes wrong it will be visible in sysfs as
old_N, where N is the delta level.

Extract code to rename kobject into a function and cleanup its users.

Fixes:
commit 5f3ee110e6f4 ("ploop: Repopulate holes_bitmap on changing delta")

https://jira.vzint.dev/browse/PSBM-146797
Signed-off-by: Alexander Atanasov 
---
  drivers/block/ploop/dev.c | 47 ++-
  1 file changed, 26 insertions(+), 21 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 6eb22168b5fe..0ed63f9da1cb 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -3557,6 +3557,16 @@ static int ploop_add_delta(struct ploop_device * plo, 
unsigned long arg)
return err;
  }
  
+/*

+ * Have to implement own version of kobject_rename since it is GPL only symbol
+ */
+static int ploop_rename_delta(struct ploop_delta *delta, int level, char *pref)
+{
+   kobject_del(>kobj);
+   return KOBJECT_ADD(>kobj, >plo->kobj,
+ "%s%d", pref ? : "", level);
+}
+
  static int ploop_replace_delta(struct ploop_device * plo, unsigned long arg)
  {
int err;
@@ -3594,15 +3604,9 @@ static int ploop_replace_delta(struct ploop_device * 
plo, unsigned long arg)
if (err)
goto out_destroy;
  
-	kobject_del(_delta->kobj);

-
-   err = KOBJECT_ADD(>kobj, kobject_get(>kobj),
- "%d", delta->level);
-   /* _put below is a pair for _get for OLD delta */
-   kobject_put(>kobj);
-
+   err = ploop_rename_delta(old_delta, old_delta->level, "old_");
if (err < 0) {
-   kobject_put(>kobj);
+   PL_WARN(plo, "Failed to rename old delta kobj\n");
goto out_close;


I don't like this error path: if ploop_rename_delta failed, it means
that we did kobject_del(_delta->kobj) but didn't finish the
corresponding KOBJECT_ADD, and in out_close we would do
ploop_rename_delta again, doing a second kobject_del(_delta->kobj) and
a corresponding KOBJECT_ADD. This would do a double
kobject_del->kobj_kset_leave->kset_put; not sure whether it would break
something or not, though.


I mean, maybe it's better to preserve the original out_close, and add a
new label out_restore_rename.



}
  
@@ -3611,10 +3615,17 @@ static int ploop_replace_delta(struct ploop_device * plo, unsigned long arg)

err = delta->ops->replace_delta(delta);
if (err) {
ploop_relax(plo);
-   goto out_kobj_del;
+   goto out_close;
}
}
  
+	err = KOBJECT_ADD(>kobj, kobject_get(>kobj),

+ "%d", delta->level);
+   if (err < 0) {
+   /* _put for failed _ADD */
+   kobject_put(>kobj);
+   goto out_close;
+   }
ploop_map_destroy(>map);
list_replace_init(_delta->list, >list);
ploop_delta_list_changed(plo);
@@ -3631,13 +3642,15 @@ static int ploop_replace_delta(struct ploop_device * 
plo, unsigned long arg)
  
  	old_delta->ops->stop(old_delta);

old_delta->ops->destroy(old_delta);
+   kobject_del(_delta->kobj);
kobject_put(_delta->kobj);
+   kobject_put(>kobj);
return 0;
  
-out_kobj_del:

-   kobject_del(>kobj);
-   kobject_put(>kobj);
  out_close:
+   err = ploop_rename_delta(old_delta, old_delta->level, NULL);
+   if (err < 0)
+   PL_ERR(plo, "Failed to restore old delta kobj name\n");
delta->ops->stop(delta);
  out_destroy:
delta->ops->destroy(delta);
@@ -3974,15 +3987,7 @@ static void rename_deltas(struct ploop_device * plo, int 
level)
  
  		if (delta->level < level)

continue;
-#if 0
-   /* Oops, kobject_rename() is not exported! */
-   sprintf(nname, "%d", delta->level);
-   err = kobject_rename(>kobj, nname);
-#else
-   kobject_del(>kobj);
-   err = KOBJECT_ADD(>kobj, >kobj,
- "%d", delta->level);
-#endif
+   err = ploop_rename_delta(delta, delta->level, NULL);
if (err)
PL_WARN(plo, 

Re: [Devel] [PATCH vz7] ploop: increase logging on errors when opening new deltas

2023-06-06 Thread Pavel Tikhomirov

Looks good, see minor nits in inline comments.

On 22.05.2023 23:39, Alexander Atanasov wrote:

Occasionally we get EBUSY, but it is a bit overused,
so it is not clear what it means.

Add more logging to catch the source of the error.

https://jira.vzint.dev/browse/PSBM-146836
Signed-off-by: Alexander Atanasov 
---
  drivers/block/ploop/dev.c |  9 -
  drivers/block/ploop/fmt_ploop1.c  |  4 +++-
  drivers/block/ploop/io_kaio.c | 13 ++---
  drivers/block/ploop/io_kaio_map.c |  6 --
  4 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 6eb22168b5fe..75e427927713 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -3572,8 +3572,11 @@ static int ploop_replace_delta(struct ploop_device * 
plo, unsigned long arg)
   sizeof(struct ploop_ctl_chunk)))
return -EFAULT;
  
-	if (plo->maintenance_type != PLOOP_MNTN_OFF)

+   if (plo->maintenance_type != PLOOP_MNTN_OFF) {
+   if (printk_ratelimit())
+   PL_WARN(plo, "Attempt to replace while in maintenance 
mode\n");
return -EBUSY;
+   }
  
  	old_delta = find_delta(plo, ctl.pctl_level);

if (old_delta == NULL)
@@ -3586,6 +3589,10 @@ static int ploop_replace_delta(struct ploop_device * 
plo, unsigned long arg)
if (IS_ERR(delta))
return PTR_ERR(delta);
  
+	WARN_ONCE(delta->ops != old_delta->ops,

+ "New delta uses different io %p vs %p\n",
+ delta->ops, old_delta->ops);
+
err = delta->ops->compose(delta, 1, );
if (err)
goto out_destroy;
diff --git a/drivers/block/ploop/fmt_ploop1.c b/drivers/block/ploop/fmt_ploop1.c
index e59a9eb50ac2..a89804561e57 100644
--- a/drivers/block/ploop/fmt_ploop1.c
+++ b/drivers/block/ploop/fmt_ploop1.c
@@ -314,8 +314,10 @@ ploop1_open(struct ploop_delta * delta)
if (!(delta->flags & PLOOP_FMT_RDONLY)) {
pvd_header_set_disk_in_use(vh);
err = delta->io.ops->sync_write(>io, ph->dyn_page, 4096, 
0, 0);
-   if (err)
+   if (err) {
+   PL_ERR(delta->plo, "write failed updating in use\n");


Don't we need err printed here? (like in other places below where we 
have err printed)



goto out_err;
+   }
}
  
  	delta->io.alloc_head = ph->alloc_head;

diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
index ab93d2c70bc5..b7258252deb2 100644
--- a/drivers/block/ploop/io_kaio.c
+++ b/drivers/block/ploop/io_kaio.c
@@ -997,17 +997,23 @@ static int kaio_open(struct ploop_io * io)
io->files.bdev = io->files.inode->i_sb->s_bdev;
  
  	err = io->ops->sync(io);

-   if (err)
+   if (err) {
+   PL_WARN(delta->plo, "open failed to sync err=%d\n", err);
return err;
+   }
  
  	mutex_lock(>files.inode->i_mutex);

err = kaio_invalidate_cache(io);
-   if (!err)
+   if (err)
+   PL_WARN(delta->plo, "invaldiate_cache failed err=%d\n", err);
+   else
err = ploop_kaio_open(file, delta->flags & PLOOP_FMT_RDONLY);
mutex_unlock(>files.inode->i_mutex);
  
-	if (err)

+   if (err) {
+   PL_WARN(delta->plo, "open failed err=%d\n", err);
return err;
+   }
  
  	io->files.em_tree = _em_tree;
  
@@ -1019,6 +1025,7 @@ static int kaio_open(struct ploop_io * io)

err = PTR_ERR(io->fsync_thread);
io->fsync_thread = NULL;
ploop_kaio_close(io->files.mapping, 0);
+   PL_WARN(delta->plo, "fsync thread start failed=%d\n", 
err);
return err;
}
  
diff --git a/drivers/block/ploop/io_kaio_map.c b/drivers/block/ploop/io_kaio_map.c

index d4ff39d95e74..5a3d11e0f4a9 100644
--- a/drivers/block/ploop/io_kaio_map.c
+++ b/drivers/block/ploop/io_kaio_map.c
@@ -35,9 +35,11 @@ int ploop_kaio_open(struct file * file, int rdonly)
else
m->readers++;
} else {
-   if (m->readers)
+   if (m->readers) {
+   pr_warn("File is already active:%d\n",
+   m->readers);
err = -EBUSY;
-   else
+   } else
m->readers = -1;


I'd rather you follow the coding style here and add {} to both branches.

https://www.kernel.org/doc/html/v4.10/process/coding-style.html#placing-braces-and-spaces


}
goto kaio_open_done;


--
Best regards, Tikhomirov Pavel
Senior Software Developer, 

[Devel] [PATCH RH7] ve/cgroups: fix subgroups_limit error path handling

2023-05-18 Thread Pavel Tikhomirov
We do ida_simple_get on root->cgroup_ida just before checking
subgroups_limit, and in case subgroups_limit is reached we don't do the
corresponding ida_simple_remove to free the id. Let's fix it by jumping
to the proper goto label, err_free_id.

This may or may not be related to [1], found while investigating it.

https://jira.vzint.dev/browse/PSBM-147036 [1]
Fixes: 92faf0fad3e3 ("ve/cgroups: Introduce subgroups_limit control")
Signed-off-by: Pavel Tikhomirov 
---
 kernel/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f2952d7c18dc..3f8c49b9ebe0 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4888,7 +4888,7 @@ static long cgroup_create(struct cgroup *parent, struct 
dentry *dentry,
if (ve_root && ve_root->subgroups_limit > 0 &&
subgroups_count(ve_root) >= ve_root->subgroups_limit) {
err = -EACCES;
-   goto err_free_name;
+   goto err_free_id;
}
 
/*
-- 
2.39.2



Re: [Devel] [PATCH RH9 0/4] net/openvswitch: pull ms API

2023-04-26 Thread Pavel Tikhomirov

Ported cleanly. Would be nice to have a jira issue tracking this.

Reviewed-by: Pavel Tikhomirov 

On 25.04.2023 18:11, Andrey Zhadchenko wrote:

Revert our patch for openvswitch, apply the ones that got accepted
into mainstream:
https://lore.kernel.org/all/20220825020450.664147-1-andrey.zhadche...@virtuozzo.com/

Andrey Zhadchenko (4):
   Revert "net: openvswitch: add capability to specify ifindex of new
 links"
   openvswitch: fix memory leak at failed datapath creation
   openvswitch: allow specifying ifindex of new interfaces
   openvswitch: add OVS_DP_ATTR_PER_CPU_PIDS to get requests

  include/uapi/linux/openvswitch.h |  3 +++
  net/openvswitch/datapath.c   | 29 
  net/openvswitch/vport-internal_dev.c |  2 +-
  net/openvswitch/vport.h  |  4 ++--
  4 files changed, 23 insertions(+), 15 deletions(-)



--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.


Re: [Devel] [PATCH RH9 2/4] openvswitch: fix memory leak at failed datapath creation

2023-04-26 Thread Pavel Tikhomirov




On 25.04.2023 18:11, Andrey Zhadchenko wrote:

ovs_dp_cmd_new()->ovs_dp_change()->ovs_dp_set_upcall_portids()
allocates array via kmalloc.
If for some reason new_vport() fails during ovs_dp_cmd_new()
dp->upcall_portids must be freed.
Add missing kfree.

Kmemleak example:
unreferenced object 0x88800c382500 (size 64):
   comm "dump_state", pid 323, jiffies 4294955418 (age 104.347s)
   hex dump (first 32 bytes):
 5e c2 79 e4 1f 7a 38 c7 09 21 38 0c 80 88 ff ff  ^.y..z8..!8.
 03 00 00 00 0a 00 00 00 14 00 00 00 28 00 00 00  (...
   backtrace:
 [<71bebc9f>] ovs_dp_set_upcall_portids+0x38/0xa0
 [<0187d8bd>] ovs_dp_change+0x63/0xe0
 [<2397e446>] ovs_dp_cmd_new+0x1f0/0x380
 [] genl_family_rcv_msg_doit+0xea/0x150
 [<8f583bc4>] genl_rcv_msg+0xdc/0x1e0
 [] netlink_rcv_skb+0x50/0x100
 [<4959cece>] genl_rcv+0x24/0x40
 [<4699ac7f>] netlink_unicast+0x23e/0x360
 [] netlink_sendmsg+0x24e/0x4b0
 [<6f4aa380>] sock_sendmsg+0x62/0x70
 [] sys_sendmsg+0x230/0x270
 [<12dacf7d>] ___sys_sendmsg+0x88/0xd0
 [<11776020>] __sys_sendmsg+0x59/0xa0
 [<2e8f2dc1>] do_syscall_64+0x3b/0x90
 [<3243e7cb>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: b83d23a2a38b ("openvswitch: Introduce per-cpu upcall dispatch")
Acked-by: Aaron Conole 
Signed-off-by: Andrey Zhadchenko 
Link: 
https://lore.kernel.org/r/20220825020326.664073-1-andrey.zhadche...@virtuozzo.com
Signed-off-by: Jakub Kicinski 

(cherry picked from ms commit a87406f4adee9c53b311d8a1ba2849c69e29a6d0)


Do we have a jira issue for it?


Signed-off-by: Andrey Zhadchenko 
---
  net/openvswitch/datapath.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 7e8a39a35627..6c9d153afbee 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -1802,7 +1802,7 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct 
genl_info *info)
ovs_dp_reset_user_features(skb, info);
}
  
-		goto err_unlock_and_destroy_meters;

+   goto err_destroy_portids;
}
  
  	err = ovs_dp_cmd_fill_info(dp, reply, info->snd_portid,

@@ -1817,6 +1817,8 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct 
genl_info *info)
ovs_notify(_datapath_genl_family, reply, info);
return 0;
  
+err_destroy_portids:

+   kfree(rcu_dereference_raw(dp->upcall_portids));
  err_unlock_and_destroy_meters:
ovs_unlock();
ovs_meters_exit(dp);


--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.


Re: [Devel] [PATCH RH9 1/4] Revert "net: openvswitch: add capability to specify ifindex of new links"

2023-04-26 Thread Pavel Tikhomirov
Just a note: old criu will get EOPNOTSUPP on restore in create_one_vport 
on a new kernel.

Probably it's OK and we can afford it, as we have only released a beta 
version so far. And probably nobody will just update the kernel and 
reboot without updating criu.


On 25.04.2023 18:11, Andrey Zhadchenko wrote:

This reverts commit 757ebade1eec8c6a3d1a150c8bd6f564c939c058.
We should use the version upstream accepted

https://jira.vzint.dev/browse/PSBM-105844
Signed-off-by: Andrey Zhadchenko 
---
  net/openvswitch/datapath.c   | 16 ++--
  net/openvswitch/vport-internal_dev.c |  1 -
  net/openvswitch/vport.h  |  2 --
  3 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 8033c97a8d65..7e8a39a35627 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -1739,7 +1739,6 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct 
genl_info *info)
struct vport *vport;
struct ovs_net *ovs_net;
int err;
-   struct ovs_header *ovs_header = info->userhdr;
  
  	err = -EINVAL;

if (!a[OVS_DP_ATTR_NAME] || !a[OVS_DP_ATTR_UPCALL_PID])
@@ -1780,7 +1779,6 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct 
genl_info *info)
parms.dp = dp;
parms.port_no = OVSP_LOCAL;
parms.upcall_portids = a[OVS_DP_ATTR_UPCALL_PID];
-   parms.desired_ifindex = ovs_header->dp_ifindex;
  
  	/* So far only local changes have been made, now need the lock. */

ovs_lock();
@@ -2201,10 +2199,7 @@ static int ovs_vport_cmd_new(struct sk_buff *skb, struct 
genl_info *info)
if (!a[OVS_VPORT_ATTR_NAME] || !a[OVS_VPORT_ATTR_TYPE] ||
!a[OVS_VPORT_ATTR_UPCALL_PID])
return -EINVAL;
-
-   parms.type = nla_get_u32(a[OVS_VPORT_ATTR_TYPE]);
-
-   if (a[OVS_VPORT_ATTR_IFINDEX] && parms.type != OVS_VPORT_TYPE_INTERNAL)
+   if (a[OVS_VPORT_ATTR_IFINDEX])
return -EOPNOTSUPP;
  
  	port_no = a[OVS_VPORT_ATTR_PORT_NO]

@@ -2241,19 +2236,12 @@ static int ovs_vport_cmd_new(struct sk_buff *skb, 
struct genl_info *info)
}
  
  	parms.name = nla_data(a[OVS_VPORT_ATTR_NAME]);

+   parms.type = nla_get_u32(a[OVS_VPORT_ATTR_TYPE]);
parms.options = a[OVS_VPORT_ATTR_OPTIONS];
parms.dp = dp;
parms.port_no = port_no;
parms.upcall_portids = a[OVS_VPORT_ATTR_UPCALL_PID];
  
-	if (parms.type == OVS_VPORT_TYPE_INTERNAL) {

-   if (a[OVS_VPORT_ATTR_IFINDEX])
-   parms.desired_ifindex =
-   nla_get_u32(a[OVS_VPORT_ATTR_IFINDEX]);
-   else
-   parms.desired_ifindex = 0;
-   }
-
vport = new_vport();
err = PTR_ERR(vport);
if (IS_ERR(vport)) {
diff --git a/net/openvswitch/vport-internal_dev.c 
b/net/openvswitch/vport-internal_dev.c
index 1c25158fbdf2..1e5468137c88 100644
--- a/net/openvswitch/vport-internal_dev.c
+++ b/net/openvswitch/vport-internal_dev.c
@@ -157,7 +157,6 @@ static struct vport *internal_dev_create(const struct 
vport_parms *parms)
if (vport->port_no == OVSP_LOCAL)
vport->dev->features |= NETIF_F_NETNS_LOCAL;
  
-	dev->ifindex = parms->desired_ifindex;

rtnl_lock();
err = register_netdevice(vport->dev);
if (err)
diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
index 24e1cba2f1ac..9de5030d9801 100644
--- a/net/openvswitch/vport.h
+++ b/net/openvswitch/vport.h
@@ -98,8 +98,6 @@ struct vport_parms {
enum ovs_vport_type type;
struct nlattr *options;
  
-	int desired_ifindex;

-
/* For ovs_vport_alloc(). */
struct datapath *dp;
u16 port_no;


--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.


[Devel] [PATCH RH7 00/14] port move_mount_set_group and mount_setattr

2023-04-13 Thread Pavel Tikhomirov
We need this because in Virtuozzo criu, after the rebase to mainstream
criu in u20, we will switch to this new API for sharing group settings
across mounts.
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 

Aleksa Sarai (1):
  lib: introduce copy_struct_from_user() helper

Christian Brauner (6):
  namespace: take lock_mount_hash() directly when changing flags
  mount: make {lock,unlock}_mount_hash() static
  namespace: only take read lock in do_reconfigure_mnt()
  fs: split out functions to hold writers
  fs: add mount_setattr()
  fs: drop peer group ids under namespace lock

David Howells (3):
  VFS: Differentiate mount flags (MS_*) from internal superblock flags
  vfs: Undo an overly zealous MS_RDONLY -> SB_RDONLY conversion
  vfs: Separate changing mount flags full remount

Markus Trippelsdorf (1):
  VFS: Handle lazytime in do_mount()

Mimi Zohar (1):
  vfs: fix mounting a filesystem with i_version

Pavel Tikhomirov (2):
  mount: rename do_set_group to do_set_group_old
  move_mount: allow to add a mount into an existing group

 Documentation/filesystems/porting |   2 +-
 arch/x86/syscalls/syscall_32.tbl  |   1 +
 arch/x86/syscalls/syscall_64.tbl  |   1 +
 fs/mount.h|  10 -
 fs/namespace.c| 646 +-
 fs/super.c|  68 ++--
 include/linux/fs.h|  45 ++-
 include/linux/mount.h |   2 +-
 include/linux/uaccess.h   |  70 
 include/linux/vz_bitops.h |  11 +
 include/uapi/linux/fs.h   |   3 +-
 init/do_mounts.c  |   2 +-
 lib/strnlen_user.c|   8 +-
 lib/usercopy.c|  76 
 14 files changed, 780 insertions(+), 165 deletions(-)
 create mode 100644 include/linux/vz_bitops.h

-- 
2.39.2



[Devel] [PATCH RH7 03/14] lib: introduce copy_struct_from_user() helper

2023-04-13 Thread Pavel Tikhomirov
From: Aleksa Sarai 

A common pattern for syscall extensions is increasing the size of a
struct passed from userspace, such that the zero-value of the new fields
results in the old kernel behaviour (allowing for a mix of userspace and
kernel vintages to operate on one another in most cases).

While this interface exists for communication in both directions, only
one interface is straightforward to have reasonable semantics for
(userspace passing a struct to the kernel). For kernel returns to
userspace, what the correct semantics are (whether there should be an
error if userspace is unaware of a new extension) is very
syscall-dependent and thus probably cannot be unified between syscalls
(a good example of this problem is [1]).

Previously there was no common lib/ function that implemented
the necessary extension-checking semantics (and different syscalls
implemented them slightly differently or incompletely[2]). Future
patches replace common uses of this pattern to make use of
copy_struct_from_user().

Some in-kernel selftests ensure that the handling of alignment and
various byte patterns is identical to memchr_inv() usage.

[1]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
 robustify sched_read_attr() ABI logic and code")

[2]: For instance {sched_setattr,perf_event_open,clone3}(2) all do
 similar checks to copy_struct_from_user() while rt_sigprocmask(2)
 always rejects differently-sized struct arguments.

Suggested-by: Rasmus Villemoes 
Signed-off-by: Aleksa Sarai 
Reviewed-by: Kees Cook 
Reviewed-by: Christian Brauner 
Link: https://lore.kernel.org/r/20191001011055.19283-2-cyp...@cyphar.com
Signed-off-by: Christian Brauner 

- Need this for mount_setattr syscall.
- Dropped lib/test_user_copy.c hunks when rebasing.
- Move aligned_byte_mask to vz_bitops.h else something breaks on boot
after moving it to bitops.h.
- Add unsafe_get_user, user_access_begin and user_access_end helpers.

https://jira.vzint.dev/browse/PSBM-144416
(cherry picked from commit f5a1a536fa14895ccff4e94e6a5af90901ce86aa)
Signed-off-by: Pavel Tikhomirov 
---
 include/linux/uaccess.h   | 70 
 include/linux/vz_bitops.h | 11 ++
 lib/strnlen_user.c|  8 +
 lib/usercopy.c| 76 +++
 4 files changed, 158 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/vz_bitops.h

diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index b28cc68c1896..9f8da4aa31a1 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -112,6 +112,76 @@ static inline unsigned long __copy_from_user_nocache(void *to,
ret;\
})
 
+extern __must_check int check_zeroed_user(const void __user *from, size_t size);
+
+/**
+ * copy_struct_from_user: copy a struct from userspace
+ * @dst:   Destination address, in kernel space. This buffer must be @ksize
+ * bytes long.
+ * @ksize: Size of @dst struct.
+ * @src:   Source address, in userspace.
+ * @usize: (Alleged) size of @src struct.
+ *
+ * Copies a struct from userspace to kernel space, in a way that guarantees
+ * backwards-compatibility for struct syscall arguments (as long as future
+ * struct extensions are made such that all new fields are *appended* to the
+ * old struct, and zeroed-out new fields have the same meaning as the old
+ * struct).
+ *
+ * @ksize is just sizeof(*dst), and @usize should've been passed by userspace.
+ * The recommended usage is something like the following:
+ *
+ *   SYSCALL_DEFINE2(foobar, const struct foo __user *, uarg, size_t, usize)
+ *   {
+ *  int err;
+ *  struct foo karg = {};
+ *
+ *  if (usize > PAGE_SIZE)
+ *return -E2BIG;
+ *  if (usize < FOO_SIZE_VER0)
+ *return -EINVAL;
+ *
+ *  err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
+ *  if (err)
+ *return err;
+ *
+ *  // ...
+ *   }
+ *
+ * There are three cases to consider:
+ *  * If @usize == @ksize, then it's copied verbatim.
+ *  * If @usize < @ksize, then the userspace has passed an old struct to a
+ *newer kernel. The rest of the trailing bytes in @dst (@ksize - @usize)
+ *are to be zero-filled.
+ *  * If @usize > @ksize, then the userspace has passed a new struct to an
+ *older kernel. The trailing bytes unknown to the kernel (@usize - @ksize)
+ *are checked to ensure they are zeroed, otherwise -E2BIG is returned.
+ *
+ * Returns (in all cases, some data may have been copied):
+ *  * -E2BIG:  (@usize > @ksize) and there are non-zero trailing bytes in @src.
+ *  * -EFAULT: access to userspace failed.
+ */
+static __always_inline __must_check int
+copy_struct_from_user(void *dst, size_t ksize, const void __user *src,
+ size_t usize)
+{
+   size_t size = min(ksize, usize);
+   size_t rest = max(ksize, usize) - size;
+
+   /*

[Devel] [PATCH RH7 13/14] fs: add mount_setattr()

2023-04-13 Thread Pavel Tikhomirov
From: Christian Brauner 

This implements the missing mount_setattr() syscall. While the new mount
api allows to change the properties of a superblock there is currently
no way to change the properties of a mount or a mount tree using file
descriptors which the new mount api is based on. In addition the old
mount api has the restriction that mount options cannot be applied
recursively. This hasn't changed since changing mount options on a
per-mount basis was implemented in [1] and has been a frequent request
not just for convenience but also for security reasons. The legacy
mount syscall is unable to accommodate this behavior without introducing
a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND |
MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost
mount. Changing MS_REC to apply to the whole mount tree would mean
introducing a significant uapi change and would likely cause significant
regressions.

The new mount_setattr() syscall allows to recursively clear and set
mount options in one shot. Multiple calls to change mount options
requesting the same changes are idempotent:

int mount_setattr(int dfd, const char *path, unsigned flags,
  struct mount_attr *uattr, size_t usize);

Flags to modify path resolution behavior are specified in the @flags
argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW,
and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to
restrict path resolution as introduced with openat2() might be supported
in the future.

The mount_setattr() syscall can be expected to grow over time and is
designed with extensibility in mind. It follows the extensible syscall
pattern we have used with other syscalls such as openat2(), clone3(),
sched_{set,get}attr(), and others.
The set of mount options is passed in the uapi struct mount_attr which
currently has the following layout:

struct mount_attr {
__u64 attr_set;
__u64 attr_clr;
__u64 propagation;
__u64 userns_fd;
};

The @attr_set and @attr_clr members are used to clear and set mount
options. This way a user can e.g. request that a set of flags is to be
raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in
@attr_set while at the same time requesting that another set of flags is
to be lowered such as removing noexec from a mount tree by specifying
MOUNT_ATTR_NOEXEC in @attr_clr.

Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0,
not a bitmap, users wanting to transition to a different atime setting
cannot simply specify the atime setting in @attr_set, but must also
specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that
MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set
can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in
@attr_clr.

The @propagation field lets callers specify the propagation type of a
mount tree. Propagation is a single property that has four different
settings and as such is not really a flag argument but an enum.
Specifically, it would be unclear what setting and clearing propagation
settings in combination would amount to. The legacy mount() syscall thus
forbids the combination of multiple propagation settings too. The goal
is to keep the semantics of mount propagation somewhat simple as they
are overly complex as it is.

The @userns_fd field lets the user specify a user namespace whose idmapping
becomes the idmapping of the mount. This is implemented and explained in
detail in the next patch.

[1]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts at remount")

Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brau...@ubuntu.com
Cc: David Howells 
Cc: Aleksa Sarai 
Cc: Al Viro 
Cc: linux-fsde...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Reviewed-by: Christoph Hellwig 
Signed-off-by: Christian Brauner 

Changes: port syscall for x86 only, drop uapi hunks and ignore
MNT_NOSYMFOLLOW as it is not yet supported
(cherry picked from commit 2a1867219c7b27f928e2545782b86daaf9ad50bd)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 
---
 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 fs/namespace.c   | 312 +++
 3 files changed, 314 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 978f07cb0ea1..7978137648ed 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -373,6 +373,7 @@
 
 428	i386	open_tree		sys_open_tree
 429	i386	move_mount		sys_move_mount
+442	i386	mount_setattr		sys_mount_setattr
 
 510	i386	getluid			sys_getluid
 511	i386	setluid			sys_setluid
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 3c86abac9a50..7175e94309fd 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl

[Devel] [PATCH RH7 05/14] vfs: fix mounting a filesystem with i_version

2023-04-13 Thread Pavel Tikhomirov
From: Mimi Zohar 

The mount i_version flag is not enabled in the new sb_flags.  This patch
adds the missing SB_I_VERSION flag.

Fixes: e462ec5 "VFS: Differentiate mount flags (MS_*) from internal
   superblock flags"
Signed-off-by: Mimi Zohar 
Signed-off-by: Al Viro 

(cherry picked from commit 917086ff231f614e6705927d8fe3eb6aa74b21bf)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 
---
 fs/namespace.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 1a62ab03692e..b60e2eab0d85 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3586,7 +3586,8 @@ long do_mount(const char *dev_name, const char __user *dir_name,
SB_MANDLOCK |
SB_DIRSYNC |
SB_SILENT |
-   SB_POSIXACL);
+   SB_POSIXACL |
+   SB_I_VERSION);
 
if (flags & MS_REMOUNT)
retval = do_remount(&path, flags, sb_flags, mnt_flags,
-- 
2.39.2



[Devel] [PATCH RH7 08/14] vfs: Separate changing mount flags full remount

2023-04-13 Thread Pavel Tikhomirov
From: David Howells 

Separate just the changing of mount flags (MS_REMOUNT|MS_BIND) from full
remount because the mount data will get parsed with the new fs_context
stuff prior to doing a remount - and this causes the syscall to fail under
some circumstances.

To quote Eric's explanation:

  [...] mount(..., MS_REMOUNT|MS_BIND, ...) now validates the mount options
  string, which breaks systemd unit files with ProtectControlGroups=yes
  (e.g.  systemd-networkd.service) when systemd does the following to
  change a cgroup (v1) mount to read-only:

mount(NULL, "/run/systemd/unit-root/sys/fs/cgroup/systemd", NULL,
  MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REMOUNT|MS_BIND, NULL)

  ... when the kernel has CONFIG_CGROUPS=y but no cgroup subsystems
  enabled, since in that case the error "cgroup1: Need name or subsystem
  set" is hit when the mount options string is empty.

  Probably it doesn't make sense to validate the mount options string at
  all in the MS_REMOUNT|MS_BIND case, though maybe you had something else
  in mind.

This is also worthwhile doing because we will need to add a mount_setattr()
syscall to take over the remount-bind function.

Reported-by: Eric Biggers 
Signed-off-by: David Howells 
Signed-off-by: Al Viro 
Reviewed-by: David Howells 

(cherry picked from commit 43f5e655eff7e124d4e484515689cba374ab698e)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 
---
 fs/namespace.c| 146 ++
 include/linux/mount.h |   2 +-
 2 files changed, 93 insertions(+), 55 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 10e294cb8cda..1b7b71c46a22 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -285,13 +285,9 @@ static struct mount *alloc_vfsmnt(const char *name)
  * mnt_want/drop_write() will _keep_ the filesystem
  * r/w.
  */
-int __mnt_is_readonly(struct vfsmount *mnt)
+bool __mnt_is_readonly(struct vfsmount *mnt)
 {
-   if (mnt->mnt_flags & MNT_READONLY)
-   return 1;
-   if (mnt->mnt_sb->s_flags & MS_RDONLY)
-   return 1;
-   return 0;
+   return (mnt->mnt_flags & MNT_READONLY) || (mnt->mnt_sb->s_flags & MS_RDONLY);
 }
 EXPORT_SYMBOL_GPL(__mnt_is_readonly);
 
@@ -606,11 +602,12 @@ static int mnt_make_readonly(struct mount *mnt)
return ret;
 }
 
-static void __mnt_unmake_readonly(struct mount *mnt)
+static int __mnt_unmake_readonly(struct mount *mnt)
 {
lock_mount_hash();
mnt->mnt.mnt_flags &= ~MNT_READONLY;
unlock_mount_hash();
+   return 0;
 }
 
 int sb_prepare_remount_readonly(struct super_block *sb)
@@ -2657,21 +2654,91 @@ SYSCALL_DEFINE3(open_tree, int, dfd, const char __user *, filename, unsigned, fl
return fd;
 }
 
-static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
+/*
+ * Don't allow locked mount flags to be cleared.
+ *
+ * No locks need to be held here while testing the various MNT_LOCK
+ * flags because those flags can never be cleared once they are set.
+ */
+static bool can_change_locked_flags(struct mount *mnt, unsigned int mnt_flags)
+{
+   unsigned int fl = mnt->mnt.mnt_flags;
+
+   if ((fl & MNT_LOCK_READONLY) &&
+   !(mnt_flags & MNT_READONLY))
+   return false;
+
+   if ((fl & MNT_LOCK_NODEV) &&
+   !(mnt_flags & MNT_NODEV))
+   return false;
+
+   if ((fl & MNT_LOCK_NOSUID) &&
+   !(mnt_flags & MNT_NOSUID))
+   return false;
+
+   if ((fl & MNT_LOCK_NOEXEC) &&
+   !(mnt_flags & MNT_NOEXEC))
+   return false;
+
+   if ((fl & MNT_LOCK_ATIME) &&
+   ((fl & MNT_ATIME_MASK) != (mnt_flags & MNT_ATIME_MASK)))
+   return false;
+
+   return true;
+}
+
+static int change_mount_ro_state(struct mount *mnt, unsigned int mnt_flags)
 {
-   int error = 0;
-   int readonly_request = 0;
+   bool readonly_request = (mnt_flags & MNT_READONLY);
 
-   if (ms_flags & MS_RDONLY)
-   readonly_request = 1;
-   if (readonly_request == __mnt_is_readonly(mnt))
+   if (readonly_request == __mnt_is_readonly(&mnt->mnt))
return 0;
 
if (readonly_request)
-   error = mnt_make_readonly(real_mount(mnt));
-   else
-   __mnt_unmake_readonly(real_mount(mnt));
-   return error;
+   return mnt_make_readonly(mnt);
+
+   return __mnt_unmake_readonly(mnt);
+}
+
+/*
+ * Update the user-settable attributes on a mount.  The caller must hold
+ * sb->s_umount for writing.
+ */
+static void set_mount_attributes(struct mount *mnt, unsigned int mnt_flags)
+{
+   lock_mount_hash();
+   mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
+   mnt->mnt.mnt_flags = mnt_flags;
+   touch_mnt_namespace(mnt->mnt_ns);

[Devel] [PATCH RH7 04/14] VFS: Differentiate mount flags (MS_*) from internal superblock flags

2023-04-13 Thread Pavel Tikhomirov
From: David Howells 

Differentiate the MS_* flags passed to mount(2) from the internal flags set
in the super_block's s_flags.  s_flags are now called SB_*, with the names
and the values for the moment mirroring the MS_* flags that they're
equivalent to.

In this patch, just the headers are altered and some kernel code where
blind automated conversion isn't necessarily correct.

Note that this shows up some interesting issues:

 (1) Some MS_* flags get translated to MNT_* flags (such as MS_NODEV ->
 MNT_NODEV) without passing this on to the filesystem, but some
 filesystems set such flags anyway.

 (2) The ->remount_fs() methods of some filesystems adjust the *flags
 argument by setting MS_* flags in it, such as MS_NOATIME - but these
 flags are then scrubbed by do_remount_sb() (only the occupants of
 MS_RMT_MASK are permitted: MS_RDONLY, MS_SYNCHRONOUS, MS_MANDLOCK,
 MS_I_VERSION and MS_LAZYTIME)

I'm not sure what's the best way to solve all these cases.

Suggested-by: Al Viro 
Signed-off-by: David Howells 

Change: Get rid of our cmd variable in do_mount
(cherry picked from commit e462ec50cb5fad19f6003a3d8087f4a0945dd2b1)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 
---
 Documentation/filesystems/porting |  2 +-
 fs/namespace.c| 71 ---
 fs/super.c| 68 ++---
 include/linux/fs.h| 45 
 init/do_mounts.c  |  2 +-
 5 files changed, 108 insertions(+), 80 deletions(-)

diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index fbfc7338f2bd..680e3401f7d8 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -228,7 +228,7 @@ anything from oops to silent memory corruption.
 ---
 [mandatory]
 
-   FS_NOMOUNT is gone.  If you use it - just set MS_NOUSER in flags
+   FS_NOMOUNT is gone.  If you use it - just set SB_NOUSER in flags
 (see rootfs for one kind of solution and bdev/socket/pipe for another).
 
 ---
diff --git a/fs/namespace.c b/fs/namespace.c
index d10138869c91..1a62ab03692e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1065,7 +1065,7 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
if (!mnt)
return ERR_PTR(-ENOMEM);
 
-   if (flags & MS_KERNMOUNT)
+   if (flags & SB_KERNMOUNT)
mnt->mnt.mnt_flags = MNT_INTERNAL;
 
root = mount_fs(type, flags, name, data);
@@ -1121,7 +1121,7 @@ vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
return ERR_PTR(-EPERM);
 #endif
 
-   return vfs_kern_mount(type, MS_SUBMOUNT, name, data);
+   return vfs_kern_mount(type, SB_SUBMOUNT, name, data);
 }
 EXPORT_SYMBOL_GPL(vfs_submount);
 
@@ -1828,7 +1828,7 @@ static int do_umount(struct mount *mnt, int flags)
return -EPERM;
down_write(&sb->s_umount);
if (!(sb->s_flags & MS_RDONLY))
-   retval = do_remount_sb(sb, MS_RDONLY, NULL, 0);
+   retval = do_remount_sb(sb, SB_RDONLY, NULL, 0);
up_write(&sb->s_umount);
return retval;
}
@@ -2419,7 +2419,7 @@ static void unlock_mount(struct mountpoint *where)
 
 static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp)
 {
-   if (mnt->mnt.mnt_sb->s_flags & MS_NOUSER)
+   if (mnt->mnt.mnt_sb->s_flags & SB_NOUSER)
return -EINVAL;
 
if (S_ISDIR(mp->m_dentry->d_inode->i_mode) !=
@@ -2433,9 +2433,9 @@ static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp)
  * Sanity check the flags to change_mnt_propagation.
  */
 
-static int flags_to_propagation_type(int flags)
+static int flags_to_propagation_type(int ms_flags)
 {
-   int type = flags & ~(MS_REC | MS_SILENT);
+   int type = ms_flags & ~(MS_REC | MS_SILENT);
 
/* Fail if any non-propagation flags are set */
if (type & ~(MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
@@ -2449,18 +2449,18 @@ static int flags_to_propagation_type(int flags)
 /*
  * recursively change the type of the mountpoint.
  */
-static int do_change_type(struct path *path, int flag)
+static int do_change_type(struct path *path, int ms_flags)
 {
struct mount *m;
struct mount *mnt = real_mount(path->mnt);
-   int recurse = flag & MS_REC;
+   int recurse = ms_flags & MS_REC;
int type;
int err = 0;
 
if (path->dentry != path->mnt->mnt_root)
return -EINVAL;
 
-   type = flags_to_propagation_type(flag);
+   type = flags_to_propagation_type(ms_flags);
if (!type)
return -EINVAL;
 
@@ -2867,8 +2867,8 @@ static int do_check_and_remount_sb(struct s

[Devel] [PATCH RH7 10/14] mount: make {lock, unlock}_mount_hash() static

2023-04-13 Thread Pavel Tikhomirov
From: Christian Brauner 

The lock_mount_hash() and unlock_mount_hash() helpers are never called
outside a single file. Remove them from the header and make them static
to reflect this fact. There's no need to have them callable from other
places right now, as Christoph observed.

Link: https://lore.kernel.org/r/20210121131959.646623-31-christian.brau...@ubuntu.com
Cc: David Howells 
Cc: Al Viro 
Cc: linux-fsde...@vger.kernel.org
Suggested-by: Christoph Hellwig 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Christian Brauner 

(cherry picked from commit d033cb6784c4f3a19a593cfe11f850e476197388)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 
---
 fs/mount.h | 10 --
 fs/namespace.c | 10 ++
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index b07b3934b746..f968bb548685 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -132,16 +132,6 @@ static inline void get_mnt_ns(struct mnt_namespace *ns)
 
 extern seqlock_t mount_lock;
 
-static inline void lock_mount_hash(void)
-{
-   write_seqlock(&mount_lock);
-}
-
-static inline void unlock_mount_hash(void)
-{
-   write_sequnlock(&mount_lock);
-}
-
 struct proc_mounts {
struct seq_file m;
struct mnt_namespace *ns;
diff --git a/fs/namespace.c b/fs/namespace.c
index 52a879628aba..9e58a2ff16ea 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -84,6 +84,16 @@ EXPORT_SYMBOL_GPL(fs_kobj);
  */
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(mount_lock);
 
+static inline void lock_mount_hash(void)
+{
+   write_seqlock(&mount_lock);
+}
+
+static inline void unlock_mount_hash(void)
+{
+   write_sequnlock(&mount_lock);
+}
+
 static inline struct hlist_head *m_hash(struct vfsmount *mnt, struct dentry *dentry)
 {
unsigned long tmp = ((unsigned long)mnt / L1_CACHE_BYTES);
-- 
2.39.2



[Devel] [PATCH RH7 09/14] namespace: take lock_mount_hash() directly when changing flags

2023-04-13 Thread Pavel Tikhomirov
From: Christian Brauner 

Changing mount options always ends up taking lock_mount_hash() but when
MNT_READONLY is requested and neither the mount nor the superblock are
MNT_READONLY we end up taking the lock, dropping it, and retaking it to
change the other mount attributes. Instead, let's acquire the lock once
when changing the mount attributes. This simplifies the locking in these
codepath, makes them easier to reason about and avoids having to
reacquire the lock right after dropping it.

Link: https://lore.kernel.org/r/20210121131959.646623-30-christian.brau...@ubuntu.com
Cc: David Howells 
Cc: Al Viro 
Cc: linux-fsde...@vger.kernel.org
Reviewed-by: Christoph Hellwig 
Signed-off-by: Christian Brauner 

(cherry picked from commit 68847c941700475575ced191108971d26e82ae29)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 
---
 fs/namespace.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 1b7b71c46a22..52a879628aba 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -564,7 +564,6 @@ static int mnt_make_readonly(struct mount *mnt)
 {
int ret = 0;
 
-   lock_mount_hash();
mnt->mnt.mnt_flags |= MNT_WRITE_HOLD;
/*
 * After storing MNT_WRITE_HOLD, we'll read the counters. This store
@@ -598,18 +597,9 @@ static int mnt_make_readonly(struct mount *mnt)
 */
smp_wmb();
mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
-   unlock_mount_hash();
return ret;
 }
 
-static int __mnt_unmake_readonly(struct mount *mnt)
-{
-   lock_mount_hash();
-   mnt->mnt.mnt_flags &= ~MNT_READONLY;
-   unlock_mount_hash();
-   return 0;
-}
-
 int sb_prepare_remount_readonly(struct super_block *sb)
 {
struct mount *mnt;
@@ -2697,7 +2687,8 @@ static int change_mount_ro_state(struct mount *mnt, unsigned int mnt_flags)
if (readonly_request)
return mnt_make_readonly(mnt);
 
-   return __mnt_unmake_readonly(mnt);
+   mnt->mnt.mnt_flags &= ~MNT_READONLY;
+   return 0;
 }
 
 /*
@@ -2706,11 +2697,9 @@ static int change_mount_ro_state(struct mount *mnt, unsigned int mnt_flags)
  */
 static void set_mount_attributes(struct mount *mnt, unsigned int mnt_flags)
 {
-   lock_mount_hash();
mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
mnt->mnt.mnt_flags = mnt_flags;
touch_mnt_namespace(mnt->mnt_ns);
-   unlock_mount_hash();
 }
 
 /*
@@ -2734,9 +2723,11 @@ static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags)
return -EPERM;
 
down_write(&sb->s_umount);
+   lock_mount_hash();
ret = change_mount_ro_state(mnt, mnt_flags);
if (ret == 0)
set_mount_attributes(mnt, mnt_flags);
+   unlock_mount_hash();
up_write(&sb->s_umount);
return ret;
 }
@@ -2958,8 +2949,11 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
err = -EPERM;
if (ns_capable(sb->s_user_ns, CAP_SYS_ADMIN)) {
err = do_check_and_remount_sb(sb, sb_flags, data);
-   if (!err)
+   if (!err) {
+   lock_mount_hash();
set_mount_attributes(mnt, mnt_flags);
+   unlock_mount_hash();
+   }
}
up_write(&sb->s_umount);
return err;
-- 
2.39.2



[Devel] [PATCH RH7 07/14] vfs: Undo an overly zealous MS_RDONLY -> SB_RDONLY conversion

2023-04-13 Thread Pavel Tikhomirov
From: David Howells 

In do_mount() when the MS_* flags are being converted to MNT_* flags,
MS_RDONLY got accidentally converted to SB_RDONLY.

Undo this change.

Fixes: e462ec50cb5f ("VFS: Differentiate mount flags (MS_*) from internal 
superblock flags")
Signed-off-by: David Howells 
Signed-off-by: Linus Torvalds 

(cherry picked from commit a9e5b73288cf1595ac2e05cf1acd1924ceea05fa)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 
---
 fs/namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 33ccc11cf327..10e294cb8cda 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3570,7 +3570,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
mnt_flags |= MNT_NODIRATIME;
if (flags & MS_STRICTATIME)
mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
-   if (flags & SB_RDONLY)
+   if (flags & MS_RDONLY)
mnt_flags |= MNT_READONLY;
 
/* The default atime for remount is preservation */
-- 
2.39.2



[Devel] [PATCH RH7 01/14] mount: rename do_set_group to do_set_group_old

2023-04-13 Thread Pavel Tikhomirov
We have a VZ-only feature to copy mount sharing between mounts via the
mount syscall, used to handle mount restore efficiently in CRIU u15-u19.

In mainstream there is now a similar feature implemented through the
move_mount syscall.

To support both old CRIU and new CRIU (which uses the mainstream API) at
the same time, let's fix the name collision and keep both variants for
now; several updates later we can drop the old mount-syscall-based API.

https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 
---
 fs/namespace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index fcfe15ed28f2..94f1e308b354 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2939,7 +2939,7 @@ static inline int tree_contains_unbindable(struct mount *mnt)
return 0;
 }
 
-static int do_set_group(struct path *path, const char *sibling_name)
+static int do_set_group_old(struct path *path, const char *sibling_name)
 {
struct ve_struct *ve = get_exec_env();
struct mount *sibling, *mnt;
@@ -3525,7 +3525,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
else if (cmd & MS_MOVE)
retval = do_move_mount_old(&path, dev_name);
else if (cmd & MS_SET_GROUP)
-   retval = do_set_group(&path, dev_name);
+   retval = do_set_group_old(&path, dev_name);
else
retval = do_new_mount(&path, type_page, flags, mnt_flags,
  dev_name, data_page);
-- 
2.39.2



[Devel] [PATCH RH7 12/14] fs: split out functions to hold writers

2023-04-13 Thread Pavel Tikhomirov
From: Christian Brauner 

When a mount is marked read-only we set MNT_WRITE_HOLD on it if there
aren't currently any active writers. Split this logic out into simple
helpers that we can use in follow-up patches.

Link: https://lore.kernel.org/r/20210121131959.646623-33-christian.brau...@ubuntu.com
Cc: David Howells 
Cc: Al Viro 
Cc: linux-fsde...@vger.kernel.org
Suggested-by: Christoph Hellwig 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Christian Brauner 

(cherry picked from commit fbdc2f6c40f6528fa0db79c73e844451234f3e26)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 
---
 fs/namespace.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c03e21575d08..a40a217f9871 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -570,10 +570,8 @@ void mnt_drop_write_file(struct file *file)
 }
 EXPORT_SYMBOL(mnt_drop_write_file);
 
-static int mnt_make_readonly(struct mount *mnt)
+static inline int mnt_hold_writers(struct mount *mnt)
 {
-   int ret = 0;
-
mnt->mnt.mnt_flags |= MNT_WRITE_HOLD;
/*
 * After storing MNT_WRITE_HOLD, we'll read the counters. This store
@@ -598,15 +596,29 @@ static int mnt_make_readonly(struct mount *mnt)
 * we're counting up here.
 */
if (mnt_get_writers(mnt) > 0)
-   ret = -EBUSY;
-   else
-   mnt->mnt.mnt_flags |= MNT_READONLY;
+   return -EBUSY;
+
+   return 0;
+}
+
+static inline void mnt_unhold_writers(struct mount *mnt)
+{
/*
 * MNT_READONLY must become visible before ~MNT_WRITE_HOLD, so writers
 * that become unheld will see MNT_READONLY.
 */
smp_wmb();
mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
+}
+
+static int mnt_make_readonly(struct mount *mnt)
+{
+   int ret;
+
+   ret = mnt_hold_writers(mnt);
+   if (!ret)
+   mnt->mnt.mnt_flags |= MNT_READONLY;
+   mnt_unhold_writers(mnt);
return ret;
 }
 
-- 
2.39.2


