[Devel] [PATCH VZ7] netfilter: nft_dynset: do not reject set updates with NFT_SET_EVAL
From: Pablo Neira Ayuso

NFT_SET_EVAL is signalling the kernel that this set can be updated from
the evaluation path, even if there are no expressions attached to the
element. Otherwise, set updates with no expressions fail. Update the
description to describe the right semantics.

Fixes: 22fe54d5fefc ("netfilter: nf_tables: add support for dynamic set updates")
Signed-off-by: Pablo Neira Ayuso

Note: This fixes the problem of adding dynamic set modification rules:

  nft add table inet test-table
  nft add set inet test-table test-set { typeof ip saddr \; flags timeout \; size 1 \; }
  nft add chain inet test-table test-chain { type filter hook input priority filter \; policy accept \; }
  nft add rule inet test-table test-chain add @test-set { ip saddr timeout 2m }

Without this patch the last command fails. Pinging the above setup will
automatically create set elements, e.g.:

  nft list ruleset
  table inet test-table {
          set test-set {
                  typeof ip saddr
                  size 1
                  flags timeout
                  elements = { 172.29.1.154 expires 1m38s945ms,
                               172.29.158.29 expires 1m45s348ms,
                               192.168.16.220 expires 1m56s984ms }
          }

          chain test-chain {
                  type filter hook input priority filter; policy accept;
                  add @test-set { ip saddr timeout 2m }
          }
  }

https://virtuozzo.atlassian.net/browse/PSBM-156601

(cherry picked from commit 215a31f19dedd4e92a67cf5a9717ee898d012b3a)
Signed-off-by: Pavel Tikhomirov
---
 include/uapi/linux/netfilter/nf_tables.h | 2 +-
 net/netfilter/nft_dynset.c               | 3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index 98ad9c3d98f4..4ba705db4ae5 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -243,7 +243,7 @@ enum nft_rule_compat_attributes {
  * @NFT_SET_INTERVAL: set contains intervals
  * @NFT_SET_MAP: set is used as a dictionary
  * @NFT_SET_TIMEOUT: set uses timeouts
- * @NFT_SET_EVAL: set contains expressions for evaluation
+ * @NFT_SET_EVAL: set can be updated from the evaluation path
  */
 enum nft_set_flags {
         NFT_SET_ANONYMOUS               = 0x1,

diff --git a/net/netfilter/nft_dynset.c b/net/netfilter/nft_dynset.c
index abf5b1cfdb3a..a32b39ad5d8d 100644
--- a/net/netfilter/nft_dynset.c
+++ b/net/netfilter/nft_dynset.c
@@ -178,8 +178,7 @@ static int nft_dynset_init(const struct nft_ctx *ctx,
                 priv->expr = nft_expr_init(ctx, tb[NFTA_DYNSET_EXPR]);
                 if (IS_ERR(priv->expr))
                         return PTR_ERR(priv->expr);
-        } else if (set->flags & NFT_SET_EVAL)
-                return -EINVAL;
+        }

         nft_set_ext_prepare(&priv->tmpl);
         nft_set_ext_add_length(&priv->tmpl, NFT_SET_EXT_KEY, set->klen);
--
2.44.0

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
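The check removed by this patch can be sketched as a userspace C model (a sketch, not the kernel code: the `has_expr` flag stands in for `tb[NFTA_DYNSET_EXPR]` being present, and `NFT_SET_EVAL` mirrors its uapi value):

```c
#include <assert.h>

#define NFT_SET_EVAL 0x20	/* value as in the uapi nf_tables.h */
#ifndef EINVAL
#define EINVAL 22
#endif

/* Before the patch: a dynset rule with no expression attached was
 * rejected whenever the set had NFT_SET_EVAL set. */
static int dynset_init_old(unsigned int set_flags, int has_expr)
{
	if (!has_expr && (set_flags & NFT_SET_EVAL))
		return -EINVAL;
	return 0;
}

/* After the patch: plain "add @set { ... }" updates of an eval set are
 * accepted, matching the corrected NFT_SET_EVAL semantics. */
static int dynset_init_new(unsigned int set_flags, int has_expr)
{
	(void)set_flags;
	(void)has_expr;
	return 0;
}
```

This is exactly the failure mode of the `nft add rule ... add @test-set { ip saddr timeout 2m }` command above: no expression is attached, the set carries NFT_SET_EVAL (because it is updated from the packet path), and the old check returned -EINVAL.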
Re: [Devel] [PATCH VZ9] mm/memcontrol: Add page cache limit to cgroup-v2
On 22/03/2024 12:58, Pavel Tikhomirov wrote:

The interface is slightly reworked to be more v2 like:
- rename memory.cache.limit/usage_in_bytes -> memory.cache.max/current
- show "max" when uninitialized and allow to write it
- memcg_max_mutex with page_counter_set_max replaced with simple xchg
- we set the limit first before looping and then try to enforce it if
  needed, no more enforce-before-setting logic
- retry reclaim a couple of times if it fails to enforce the limit and
  then just give up (memory_max_write triggers oom in this case, but we
  probably do not want to trigger oom due to the cache limit)

https://virtuozzo.atlassian.net/browse/PSBM-154207

Signed-off-by: Pavel Tikhomirov
Feature: mm: Memory cgroup page cache limit
---
 mm/memcontrol.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9000dc00ed03..28a39b50cbe1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -8810,3 +8810,78 @@ static int __init mem_cgroup_swap_init(void)
 subsys_initcall(mem_cgroup_swap_init);
 #endif /* CONFIG_SWAP */
+
+static int cache_max_show(struct seq_file *m, void *v)
+{
+        return seq_puts_memcg_tunable(m,
+                READ_ONCE(mem_cgroup_from_seq(m)->cache.max));
+}
+
+static ssize_t cache_max_write(struct kernfs_open_file *of,
+                               char *buf, size_t nbytes, loff_t off)
+{
+        struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+        unsigned int nr_reclaims = MAX_RECLAIM_RETRIES;
+        unsigned long max;
+        int err;
+
+        buf = strstrip(buf);
+        err = page_counter_memparse(buf, "max", &max);
+        if (err)
+                return err;
+
+        xchg(&memcg->cache.max, max);

For history: here is my understanding (from a discussion with Johannes
Weiner) of why we need xchg() in memory_max_write.

In page_counter_try_charge we do:

  a) increase usage
     -- implicit full memory barrier --
  b) check usage is within the limit, else
  c) revert usage

In page_counter_set_max we do:

  A) check usage is within the new limit, then
     -- implicit full memory barrier --
  B) set the new limit
     -- implicit full memory barrier --
  C) check usage didn't grow under us, else
  D) revert to the old limit

If at (b) we don't see a concurrent limit change, then at (C) we would
see the usage grow, and vice versa.

After the switch to memory_max_write in the cgroup-v2 code path, (C)
and (D) were replaced with:

  C') check usage is within the new limit, else
  D') reclaim until it is, or OOM if we can't reach it

So basically we need xchg() to sync with page_counter_try_charge. For
memcg->cache we don't have uses of page_counter_try_charge, so it
should be safe to omit xchg() here (same for oom.guarantee). But we
also already have such an unjustified xchg(), which is not strictly
required, for swap.max in mainstream code, probably just copy-pasted
from memory_max_write. And as this is not a fast path we can just leave
xchg() here and there unless we really need to remove it for some
reason.

+
+        for (;;) {
+                unsigned long nr_cache = page_counter_read(&memcg->cache);
+
+                if (nr_cache <= max)
+                        break;
+
+                if (signal_pending(current))
+                        break;
+
+                if (!nr_reclaims)
+                        break;
+
+                if (!try_to_free_mem_cgroup_pages(memcg, nr_cache - max,
+                                                  GFP_KERNEL, false))
+                        nr_reclaims--;
+        }
+
+        memcg_wb_domain_size_changed(memcg);
+        return nbytes;
+}
+
+static u64 cache_current_read(struct cgroup_subsys_state *css,
+                              struct cftype *cft)
+{
+        struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+        return (u64)page_counter_read(&memcg->cache) * PAGE_SIZE;
+}
+
+static struct cftype cache_files[] = {
+        {
+                .name = "cache.max",
+                .flags = CFTYPE_NOT_ON_ROOT,
+                .seq_show = cache_max_show,
+                .write = cache_max_write,
+        },
+        {
+                .name = "cache.current",
+                .flags = CFTYPE_NOT_ON_ROOT,
+                .read_u64 = cache_current_read,
+        },
+        { }     /* terminate */
+};
+
+static int __init mem_cgroup_cache_init(void)
+{
+        if (mem_cgroup_disabled())
+                return 0;
+
+        WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, cache_files));
+        return 0;
+}
+subsys_initcall(mem_cgroup_cache_init);
--
2.43.0

--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.
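The charge/set-max interplay described above can be modeled in userspace C. This is a single-threaded sketch of the protocol, not the kernel's page_counter code; the names `try_charge` and `set_max` are illustrative and the barriers are only noted in comments:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one page counter: usage pages charged against a limit. */
struct counter {
	long usage;
	long max;
};

/* Mirrors the page_counter_try_charge() shape described above:
 * (a) bump usage first, (b) check the limit, (c) revert on failure.
 * In the kernel, step (a) is an atomic add that implies ordering. */
static bool try_charge(struct counter *c, long n)
{
	c->usage += n;			/* (a) increase usage */
	if (c->usage > c->max) {	/* (b) usage within limit? */
		c->usage -= n;		/* (c) revert usage */
		return false;
	}
	return true;
}

/* Mirrors the cgroup-v2 memory_max_write() shape: (B) publish the new
 * limit first (xchg in the kernel), then (C')/(D') handle usage that is
 * already above it -- here we just report how much needs reclaiming. */
static long set_max(struct counter *c, long new_max)
{
	c->max = new_max;			/* (B) set the new limit */
	if (c->usage > c->max)			/* (C') usage above new limit? */
		return c->usage - c->max;	/* (D') amount to reclaim */
	return 0;
}
```

The point of the ordering argument is visible even sequentially: either a racing charger sees the new limit at (b) and reverts, or the limit writer sees the grown usage at (C') and reclaims; neither side can miss both.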
[Devel] [PATCH VZ9 v2 1/3] memcg: fix oom_guarantee to be considered in global oom and in berserker
After the rebase to VZ8 the code which sets memcg->overdraft was moved to
mem_cgroup_scan_tasks(), which is completely wrong (for instance, in VZ7
we had this code in oom_unlock(), so everywhere in oom we always had the
->overdraft information properly set). Now we don't have a proper refresh
of the ->overdraft information in two cases: in global oom and in the
berserker. Let's fix this by splitting the refresh code into a separate
function, refresh_mem_cgroup_overdraft(), and calling it where it is
really needed (where later in the stack oom_badness uses the refreshed
->overdraft).

Fixes: c31dabeaf42d ("memcg: add oom_guarantee")
Signed-off-by: Pavel Tikhomirov
Feature: mm/oom: OOM guarantee feature
---
 include/linux/memcontrol.h |  6 ++++++
 mm/memcontrol.c            | 30 +++++++++++++++++++-----------
 mm/oom_kill.c              |  4 ++++
 3 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4719eb84894f..9c2b8774639e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -22,6 +22,7 @@
 #include
 #include
 #include
+#include

 struct mem_cgroup;
 struct obj_cgroup;
@@ -1004,6 +1005,7 @@ void mem_cgroup_print_oom_context(struct mem_cgroup *memcg,
 void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg);

 unsigned long mem_cgroup_overdraft(struct mem_cgroup *memcg);
+void refresh_mem_cgroup_overdraft(struct oom_control *oc);

 static inline void mem_cgroup_enter_user_fault(void)
 {
@@ -1529,6 +1531,10 @@ static inline unsigned long mem_cgroup_overdraft(struct mem_cgroup *memcg)
         return 0;
 }

+static inline void refresh_mem_cgroup_overdraft(struct oom_control *oc)
+{
+}
+
 static inline void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
 {

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a0a7c3b4ed35..8ca27af8d902 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1308,17 +1308,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
         for_each_mem_cgroup_tree(iter, memcg) {
                 struct css_task_iter it;
                 struct task_struct *task;
-                struct mem_cgroup *parent;
-
-                /*
-                 * Update overdraft of each cgroup under us. This
-                 * information will be used in oom_badness.
-                 */
-                iter->overdraft = mem_cgroup_overdraft(iter);
-                parent = parent_mem_cgroup(iter);
-                if (parent && iter != memcg)
-                        iter->overdraft = max(iter->overdraft,
-                                        parent->overdraft);

                 css_task_iter_start(&iter->css, CSS_TASK_ITER_PROCS, &it);
                 while (!ret && (task = css_task_iter_next(&it)))
@@ -1518,6 +1507,25 @@ unsigned long mem_cgroup_overdraft(struct mem_cgroup *memcg)
         return usage > guarantee ? (usage - guarantee) : 0;
 }

+void refresh_mem_cgroup_overdraft(struct oom_control *oc)
+{
+        struct mem_cgroup *iter;
+
+        for_each_mem_cgroup_tree(iter, oc->memcg) {
+                struct mem_cgroup *parent;
+
+                /*
+                 * Update overdraft of each cgroup under us. This
+                 * information will be used in oom_badness.
+                 */
+                iter->overdraft = mem_cgroup_overdraft(iter);
+                parent = parent_mem_cgroup(iter);
+                if (parent && iter != oc->memcg)
+                        iter->overdraft = max(iter->overdraft,
+                                        parent->overdraft);
+        }
+}
+
 bool mem_cgroup_dcache_is_low(struct mem_cgroup *memcg, int vfs_cache_min_ratio)
 {
         unsigned long anon, file, dcache;

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1d44024d6c6f..d27e04295e7f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -378,6 +378,8 @@ static void select_bad_process(struct oom_control *oc)
 {
         oc->chosen_points = LONG_MIN;

+        refresh_mem_cgroup_overdraft(oc);
+
         if (is_memcg_oom(oc))
                 mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
         else {
@@ -1073,6 +1075,8 @@ static void oom_berserker(struct oom_control *oc)
         if (rage < 0)
                 return;

+        refresh_mem_cgroup_overdraft(oc);
+
         /*
          * So, we are in rage. Kill (1 << rage) youngest tasks that are
          * as bad as the victim.
--
2.43.0
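The top-down propagation done by refresh_mem_cgroup_overdraft() can be sketched in userspace C (a sketch only: plain arrays stand in for the memcg tree, and `for_each_mem_cgroup_tree`'s pre-order walk is modeled by requiring `parent[i] < i`):

```c
#include <assert.h>

#define NGROUPS 4

/* Mirrors the mem_cgroup_overdraft() formula: how far usage exceeds
 * the cgroup's oom guarantee, or 0 if it is within the guarantee. */
static unsigned long overdraft_of(unsigned long usage, unsigned long guarantee)
{
	return usage > guarantee ? usage - guarantee : 0;
}

/* Walk the tree top-down; each group takes the max of its own overdraft
 * and its parent's already-refreshed one, like the kernel loop above.
 * parent[i] == -1 marks the root of the walk. */
static void refresh_overdraft(const int parent[NGROUPS],
			      const unsigned long usage[NGROUPS],
			      const unsigned long guarantee[NGROUPS],
			      unsigned long overdraft[NGROUPS])
{
	for (int i = 0; i < NGROUPS; i++) {
		overdraft[i] = overdraft_of(usage[i], guarantee[i]);
		if (parent[i] >= 0 && overdraft[parent[i]] > overdraft[i])
			overdraft[i] = overdraft[parent[i]];
	}
}
```

The propagation matters for oom_badness: a task in a well-behaved child cgroup still inherits its ancestor's overdraft, so the whole over-guarantee subtree is penalized consistently.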
[Devel] [PATCH VZ9 v2 2/3] memcg: fix oom_guarantee overdraft calculation in cgroup-v2
The memsw counter in cgroup-v2 is reused by the swap counter and does not
include memory usage, so we need to replace it with memory + swap.

Fixes: c31dabeaf42d ("memcg: add oom_guarantee")
Signed-off-by: Pavel Tikhomirov
Feature: mm/oom: OOM guarantee feature
---
 mm/memcontrol.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8ca27af8d902..fcc4b24ba330 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1503,7 +1503,10 @@ unsigned long mem_cgroup_overdraft(struct mem_cgroup *memcg)
         return 0;

         guarantee = READ_ONCE(memcg->oom_guarantee);
-        usage = page_counter_read(&memcg->memsw);
+        if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+                usage = page_counter_read(&memcg->memory) + page_counter_read(&memcg->swap);
+        else
+                usage = page_counter_read(&memcg->memsw);

         return usage > guarantee ? (usage - guarantee) : 0;
 }
--
2.43.0
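The difference between the two counter layouts can be sketched in userspace C (a sketch under the assumption stated in the patch: v1's memsw already counts memory + swap, while v2 keeps a separate swap counter):

```c
#include <assert.h>

/* cgroup-v1: the memsw counter already includes both memory and swap,
 * so it can be compared against the guarantee directly. */
static unsigned long overdraft_v1(unsigned long memsw, unsigned long guarantee)
{
	return memsw > guarantee ? memsw - guarantee : 0;
}

/* cgroup-v2: the swap counter holds swap only, so usage must be summed
 * explicitly before comparing -- the fix this patch makes. */
static unsigned long overdraft_v2(unsigned long memory, unsigned long swap,
				  unsigned long guarantee)
{
	unsigned long usage = memory + swap;

	return usage > guarantee ? usage - guarantee : 0;
}
```

Reading memsw on v2 would have compared swap alone against the guarantee, drastically under-reporting overdraft for mostly-resident workloads.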
[Devel] [PATCH VZ9 v2 0/3] cgroup-v2: enable memory.oom_guarantee vz-specific cgroup property
https://virtuozzo.atlassian.net/browse/PSBM-154224

Pavel Tikhomirov (3):
  memcg: fix oom_guarantee to be considered in global oom and in
    berserker
  memcg: fix oom_guarantee overdraft calculation in cgroup-v2
  mm/memcontrol: make memory.oom_guarantee file also available in
    cgroup-v2

 include/linux/memcontrol.h |  8 +++++---
 mm/memcontrol.c            | 65 +++++++++++++++++++++++++++++++++++---------
 mm/oom_kill.c              |  4 ++++
 3 files changed, 64 insertions(+), 13 deletions(-)

--
2.43.0
[Devel] [PATCH VZ9 3/3] mm/memcontrol: make memory.oom_guarantee file also available in cgroup-v2
The interface is slightly reworked to be more v2 like:
- switch from "-1" to "max" for max value following cgroup-v2 style

While on it also:
- make oom_guarantee (unsigned long) similar to regular cgroup limits,
  as we never set more than PAGE_COUNTER_MAX there anyway
- set oom_guarantee with xchg(), this ensures that the update is atomic

https://virtuozzo.atlassian.net/browse/PSBM-154224

Signed-off-by: Pavel Tikhomirov
Feature: mm/oom: OOM guarantee feature
---
v2:
- correct commit message part about "max"
- make oom_guarantee unsigned long
- update oom_guarantee with xchg()
---
 include/linux/memcontrol.h |  2 +-
 mm/memcontrol.c            | 30 ++++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9c2b8774639e..2aa330cb81b4 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -263,7 +263,7 @@ struct mem_cgroup {
         struct vmpressure vmpressure;

         unsigned long overdraft;
-        unsigned long long oom_guarantee;
+        unsigned long oom_guarantee;

         /*
          * Should the OOM killer kill all belonging tasks, had it kill one?

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fcc4b24ba330..1722a49855ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5239,6 +5239,14 @@ static u64 mem_cgroup_oom_guarantee_read(struct cgroup_subsys_state *css,
         return mem_cgroup_from_css(css)->oom_guarantee << PAGE_SHIFT;
 }

+static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value);
+
+static int memory_oom_guarantee_show(struct seq_file *m, void *v)
+{
+        return seq_puts_memcg_tunable(m,
+                READ_ONCE(mem_cgroup_from_seq(m)->oom_guarantee));
+}
+
 static ssize_t mem_cgroup_oom_guarantee_write(struct kernfs_open_file *kops,
                 char *buffer, size_t nbytes, loff_t off)
 {
@@ -5255,6 +5263,22 @@ static ssize_t mem_cgroup_oom_guarantee_write(struct kernfs_open_file *kops,
         return nbytes;
 }

+static ssize_t memory_oom_guarantee_write(struct kernfs_open_file *of,
+                char *buf, size_t nbytes, loff_t off)
+{
+        struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+        unsigned long oom_guarantee;
+        int err;
+
+        buf = strstrip(buf);
+        err = page_counter_memparse(buf, "max", &oom_guarantee);
+        if (err)
+                return err;
+
+        xchg(&memcg->oom_guarantee, oom_guarantee);
+        return nbytes;
+}
+
 #ifdef CONFIG_CLEANCACHE
 static u64 mem_cgroup_disable_cleancache_read(struct cgroup_subsys_state *css,
                 struct cftype *cft)
@@ -7629,6 +7653,12 @@ static struct cftype memory_files[] = {
                 .flags = CFTYPE_NS_DELEGATABLE,
                 .write = memory_reclaim,
         },
+        {
+                .name = "oom_guarantee",
+                .flags = CFTYPE_NOT_ON_ROOT,
+                .seq_show = memory_oom_guarantee_show,
+                .write = memory_oom_guarantee_write,
+        },
         { }     /* terminate */
 };
--
2.43.0
Re: [Devel] [PATCH VZ9 3/3] mm/memcontrol: make memory.oom_guarantee file also available in cgroup-v2
On 22/03/2024 13:21, Pavel Tikhomirov wrote:

The interface is slightly reworked to be more v2 like:
- show "max" when uninitialized and allow to write it

Slightly bad wording, it is 0 when uninitialized. Should be:

- show "max" when it has the max value and allow to write it

https://virtuozzo.atlassian.net/browse/PSBM-154224

Signed-off-by: Pavel Tikhomirov
Feature: mm/oom: OOM guarantee feature
---
 mm/memcontrol.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fcc4b24ba330..f709ed6be891 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5239,6 +5239,14 @@ static u64 mem_cgroup_oom_guarantee_read(struct cgroup_subsys_state *css,
         return mem_cgroup_from_css(css)->oom_guarantee << PAGE_SHIFT;
 }

+static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value);
+
+static int memory_oom_guarantee_show(struct seq_file *m, void *v)
+{
+        return seq_puts_memcg_tunable(m,
+                READ_ONCE(mem_cgroup_from_seq(m)->oom_guarantee));
+}
+
 static ssize_t mem_cgroup_oom_guarantee_write(struct kernfs_open_file *kops,
                 char *buffer, size_t nbytes, loff_t off)
 {
@@ -5255,6 +5263,22 @@ static ssize_t mem_cgroup_oom_guarantee_write(struct kernfs_open_file *kops,
         return nbytes;
 }

+static ssize_t memory_oom_guarantee_write(struct kernfs_open_file *of,
+                char *buf, size_t nbytes, loff_t off)
+{
+        struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+        unsigned long oom_guarantee;
+        int err;
+
+        buf = strstrip(buf);
+        err = page_counter_memparse(buf, "max", &oom_guarantee);
+        if (err)
+                return err;
+
+        memcg->oom_guarantee = oom_guarantee;
+        return nbytes;
+}
+
 #ifdef CONFIG_CLEANCACHE
 static u64 mem_cgroup_disable_cleancache_read(struct cgroup_subsys_state *css,
                 struct cftype *cft)
@@ -7629,6 +7653,12 @@ static struct cftype memory_files[] = {
                 .flags = CFTYPE_NS_DELEGATABLE,
                 .write = memory_reclaim,
         },
+        {
+                .name = "oom_guarantee",
+                .flags = CFTYPE_NOT_ON_ROOT,
+                .seq_show = memory_oom_guarantee_show,
+                .write = memory_oom_guarantee_write,
+        },
         { }     /* terminate */
 };

--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.
[Devel] [PATCH VZ9 2/3] memcg: fix oom_guarantee overdraft calculation in cgroup-v2
The memsw counter in cgroup-v2 is reused by the swap counter and does not
include memory usage, so we need to replace it with memory + swap.

https://virtuozzo.atlassian.net/browse/PSBM-154224

Signed-off-by: Pavel Tikhomirov
Feature: mm/oom: OOM guarantee feature
---
 mm/memcontrol.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8ca27af8d902..fcc4b24ba330 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1503,7 +1503,10 @@ unsigned long mem_cgroup_overdraft(struct mem_cgroup *memcg)
         return 0;

         guarantee = READ_ONCE(memcg->oom_guarantee);
-        usage = page_counter_read(&memcg->memsw);
+        if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+                usage = page_counter_read(&memcg->memory) + page_counter_read(&memcg->swap);
+        else
+                usage = page_counter_read(&memcg->memsw);

         return usage > guarantee ? (usage - guarantee) : 0;
 }
--
2.43.0
[Devel] [PATCH VZ9 0/2] cgroup-v2: enable memory.oom_guarantee vz-specific cgroup property
https://virtuozzo.atlassian.net/browse/PSBM-153799

Pavel Tikhomirov (3):
  memcg: fix oom_guarantee to be considered in global oom and in
    berserker
  memcg: fix oom_guarantee overdraft calculation in cgroup-v2
  mm/memcontrol: make memory.oom_guarantee file also available in
    cgroup-v2

 include/linux/memcontrol.h |  6 ++++++
 mm/memcontrol.c            | 65 +++++++++++++++++++++++++++++++++++---------
 mm/oom_kill.c              |  4 ++++
 3 files changed, 63 insertions(+), 12 deletions(-)

--
2.43.0
[Devel] [PATCH VZ9 3/3] mm/memcontrol: make memory.oom_guarantee file also available in cgroup-v2
The interface is slightly reworked to be more v2 like:
- show "max" when uninitialized and allow to write it

https://virtuozzo.atlassian.net/browse/PSBM-154224

Signed-off-by: Pavel Tikhomirov
Feature: mm/oom: OOM guarantee feature
---
 mm/memcontrol.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fcc4b24ba330..f709ed6be891 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5239,6 +5239,14 @@ static u64 mem_cgroup_oom_guarantee_read(struct cgroup_subsys_state *css,
         return mem_cgroup_from_css(css)->oom_guarantee << PAGE_SHIFT;
 }

+static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value);
+
+static int memory_oom_guarantee_show(struct seq_file *m, void *v)
+{
+        return seq_puts_memcg_tunable(m,
+                READ_ONCE(mem_cgroup_from_seq(m)->oom_guarantee));
+}
+
 static ssize_t mem_cgroup_oom_guarantee_write(struct kernfs_open_file *kops,
                 char *buffer, size_t nbytes, loff_t off)
 {
@@ -5255,6 +5263,22 @@ static ssize_t mem_cgroup_oom_guarantee_write(struct kernfs_open_file *kops,
         return nbytes;
 }

+static ssize_t memory_oom_guarantee_write(struct kernfs_open_file *of,
+                char *buf, size_t nbytes, loff_t off)
+{
+        struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+        unsigned long oom_guarantee;
+        int err;
+
+        buf = strstrip(buf);
+        err = page_counter_memparse(buf, "max", &oom_guarantee);
+        if (err)
+                return err;
+
+        memcg->oom_guarantee = oom_guarantee;
+        return nbytes;
+}
+
 #ifdef CONFIG_CLEANCACHE
 static u64 mem_cgroup_disable_cleancache_read(struct cgroup_subsys_state *css,
                 struct cftype *cft)
@@ -7629,6 +7653,12 @@ static struct cftype memory_files[] = {
                 .flags = CFTYPE_NS_DELEGATABLE,
                 .write = memory_reclaim,
         },
+        {
+                .name = "oom_guarantee",
+                .flags = CFTYPE_NOT_ON_ROOT,
+                .seq_show = memory_oom_guarantee_show,
+                .write = memory_oom_guarantee_write,
+        },
         { }     /* terminate */
 };
--
2.43.0
[Devel] [PATCH VZ9 1/3] memcg: fix oom_guarantee to be considered in global oom and in berserker
After the rebase to VZ8 the code which sets memcg->overdraft was moved to
mem_cgroup_scan_tasks(), which is completely wrong (for instance, in VZ7
we had this code in oom_unlock(), so everywhere in oom we always had the
->overdraft information properly set). Now we don't have a proper refresh
of the ->overdraft information in two cases: in global oom and in the
berserker. Let's fix this by splitting the refresh code into a separate
function, refresh_mem_cgroup_overdraft(), and calling it where it is
really needed (where later in the stack oom_badness uses the refreshed
->overdraft).

https://virtuozzo.atlassian.net/browse/PSBM-154224

Fixes: c31dabeaf42d ("memcg: add oom_guarantee")
Signed-off-by: Pavel Tikhomirov
Feature: mm/oom: OOM guarantee feature
---
 include/linux/memcontrol.h |  6 ++++++
 mm/memcontrol.c            | 30 +++++++++++++++++++-----------
 mm/oom_kill.c              |  4 ++++
 3 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4719eb84894f..9c2b8774639e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -22,6 +22,7 @@
 #include
 #include
 #include
+#include

 struct mem_cgroup;
 struct obj_cgroup;
@@ -1004,6 +1005,7 @@ void mem_cgroup_print_oom_context(struct mem_cgroup *memcg,
 void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg);

 unsigned long mem_cgroup_overdraft(struct mem_cgroup *memcg);
+void refresh_mem_cgroup_overdraft(struct oom_control *oc);

 static inline void mem_cgroup_enter_user_fault(void)
 {
@@ -1529,6 +1531,10 @@ static inline unsigned long mem_cgroup_overdraft(struct mem_cgroup *memcg)
         return 0;
 }

+static inline void refresh_mem_cgroup_overdraft(struct oom_control *oc)
+{
+}
+
 static inline void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
 {

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a0a7c3b4ed35..8ca27af8d902 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1308,17 +1308,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
         for_each_mem_cgroup_tree(iter, memcg) {
                 struct css_task_iter it;
                 struct task_struct *task;
-                struct mem_cgroup *parent;
-
-                /*
-                 * Update overdraft of each cgroup under us. This
-                 * information will be used in oom_badness.
-                 */
-                iter->overdraft = mem_cgroup_overdraft(iter);
-                parent = parent_mem_cgroup(iter);
-                if (parent && iter != memcg)
-                        iter->overdraft = max(iter->overdraft,
-                                        parent->overdraft);

                 css_task_iter_start(&iter->css, CSS_TASK_ITER_PROCS, &it);
                 while (!ret && (task = css_task_iter_next(&it)))
@@ -1518,6 +1507,25 @@ unsigned long mem_cgroup_overdraft(struct mem_cgroup *memcg)
         return usage > guarantee ? (usage - guarantee) : 0;
 }

+void refresh_mem_cgroup_overdraft(struct oom_control *oc)
+{
+        struct mem_cgroup *iter;
+
+        for_each_mem_cgroup_tree(iter, oc->memcg) {
+                struct mem_cgroup *parent;
+
+                /*
+                 * Update overdraft of each cgroup under us. This
+                 * information will be used in oom_badness.
+                 */
+                iter->overdraft = mem_cgroup_overdraft(iter);
+                parent = parent_mem_cgroup(iter);
+                if (parent && iter != oc->memcg)
+                        iter->overdraft = max(iter->overdraft,
+                                        parent->overdraft);
+        }
+}
+
 bool mem_cgroup_dcache_is_low(struct mem_cgroup *memcg, int vfs_cache_min_ratio)
 {
         unsigned long anon, file, dcache;

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1d44024d6c6f..d27e04295e7f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -378,6 +378,8 @@ static void select_bad_process(struct oom_control *oc)
 {
         oc->chosen_points = LONG_MIN;

+        refresh_mem_cgroup_overdraft(oc);
+
         if (is_memcg_oom(oc))
                 mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
         else {
@@ -1073,6 +1075,8 @@ static void oom_berserker(struct oom_control *oc)
         if (rage < 0)
                 return;

+        refresh_mem_cgroup_overdraft(oc);
+
         /*
          * So, we are in rage. Kill (1 << rage) youngest tasks that are
          * as bad as the victim.
--
2.43.0
[Devel] [PATCH VZ9] mm/memcontrol: Add page cache limit to cgroup-v2
The interface is slightly reworked to be more v2 like:
- rename memory.cache.limit/usage_in_bytes -> memory.cache.max/current
- show "max" when uninitialized and allow to write it
- memcg_max_mutex with page_counter_set_max replaced with simple xchg
- we set the limit first before looping and then try to enforce it if
  needed, no more enforce-before-setting logic
- retry reclaim a couple of times if it fails to enforce the limit and
  then just give up (memory_max_write triggers oom in this case, but we
  probably do not want to trigger oom due to the cache limit)

https://virtuozzo.atlassian.net/browse/PSBM-154207

Signed-off-by: Pavel Tikhomirov
Feature: mm: Memory cgroup page cache limit
---
 mm/memcontrol.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9000dc00ed03..28a39b50cbe1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -8810,3 +8810,78 @@ static int __init mem_cgroup_swap_init(void)
 subsys_initcall(mem_cgroup_swap_init);
 #endif /* CONFIG_SWAP */
+
+static int cache_max_show(struct seq_file *m, void *v)
+{
+        return seq_puts_memcg_tunable(m,
+                READ_ONCE(mem_cgroup_from_seq(m)->cache.max));
+}
+
+static ssize_t cache_max_write(struct kernfs_open_file *of,
+                               char *buf, size_t nbytes, loff_t off)
+{
+        struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+        unsigned int nr_reclaims = MAX_RECLAIM_RETRIES;
+        unsigned long max;
+        int err;
+
+        buf = strstrip(buf);
+        err = page_counter_memparse(buf, "max", &max);
+        if (err)
+                return err;
+
+        xchg(&memcg->cache.max, max);
+
+        for (;;) {
+                unsigned long nr_cache = page_counter_read(&memcg->cache);
+
+                if (nr_cache <= max)
+                        break;
+
+                if (signal_pending(current))
+                        break;
+
+                if (!nr_reclaims)
+                        break;
+
+                if (!try_to_free_mem_cgroup_pages(memcg, nr_cache - max,
+                                                  GFP_KERNEL, false))
+                        nr_reclaims--;
+        }
+
+        memcg_wb_domain_size_changed(memcg);
+        return nbytes;
+}
+
+static u64 cache_current_read(struct cgroup_subsys_state *css,
+                              struct cftype *cft)
+{
+        struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+        return (u64)page_counter_read(&memcg->cache) * PAGE_SIZE;
+}
+
+static struct cftype cache_files[] = {
+        {
+                .name = "cache.max",
+                .flags = CFTYPE_NOT_ON_ROOT,
+                .seq_show = cache_max_show,
+                .write = cache_max_write,
+        },
+        {
+                .name = "cache.current",
+                .flags = CFTYPE_NOT_ON_ROOT,
+                .read_u64 = cache_current_read,
+        },
+        { }     /* terminate */
+};
+
+static int __init mem_cgroup_cache_init(void)
+{
+        if (mem_cgroup_disabled())
+                return 0;
+
+        WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, cache_files));
+        return 0;
+}
+subsys_initcall(mem_cgroup_cache_init);
--
2.43.0
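The set-first-then-enforce loop in cache_max_write() can be sketched in userspace C. This is a model, not kernel code: `reclaim_fn` stands in for try_to_free_mem_cgroup_pages(), and the signal check is left out:

```c
#include <assert.h>

#define MAX_RECLAIM_RETRIES 5	/* same bound the kernel loop uses */

/* Set the limit first (already done by the caller in the kernel via
 * xchg), then retry reclaim a bounded number of times; on repeated
 * failure give up quietly instead of declaring OOM, as described in
 * the commit message. Returns the final cache usage. */
static unsigned long enforce_cache_max(unsigned long *nr_cache,
				       unsigned long max,
				       unsigned long (*reclaim_fn)(unsigned long want))
{
	unsigned int nr_reclaims = MAX_RECLAIM_RETRIES;

	for (;;) {
		if (*nr_cache <= max)
			break;		/* limit enforced */
		if (!nr_reclaims)
			break;		/* give up, no OOM */

		unsigned long freed = reclaim_fn(*nr_cache - max);

		*nr_cache -= freed;
		if (!freed)
			nr_reclaims--;	/* only failed attempts burn a retry */
	}
	return *nr_cache;
}

/* Two stand-in reclaimers for illustration. */
static unsigned long reclaim_nothing(unsigned long want) { (void)want; return 0; }
static unsigned long reclaim_ten(unsigned long want) { return want < 10 ? want : 10; }
```

Note the design choice mirrored here: because the limit is published before enforcement, concurrent charges already see the new maximum even if reclaim ultimately gives up.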
[Devel] [PATCH VZ9 2/2] sched/stat: make cpu.proc.stat file also available in cgroup-v2
We use the fact that the cpuacct controller is implicitly enabled in
cgroup-v2 by the previous patch. So we can just do everything the same
way we do it on cgroup-v1, where we have both the cpu and cpuacct
controllers mounted together.

Note: Stats in cpu.proc.stat give per-cpu per-cgroup usages which are
used in vcmmd to calculate per-NUMA cpu loads for its balancing
algorithm. This is used both for Containers and VMs.

https://virtuozzo.atlassian.net/browse/PSBM-154206

Signed-off-by: Pavel Tikhomirov
Feature: sched: emulate virtual cpus for Containers
---
 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 98e9f78e19cf..2235f1565078 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11835,6 +11835,10 @@ static struct cftype cpu_files[] = {
                 .write = cpu_uclamp_max_write,
         },
 #endif
+        {
+                .name = "proc.stat",
+                .seq_show = cpu_cgroup_proc_stat_show,
+        },
         { }     /* terminate */
 };
--
2.43.0
[Devel] [PATCH VZ9 0/2] cgroup-v2: enable cpu.proc.stat vz-specific cgroup property
https://virtuozzo.atlassian.net/browse/PSBM-154206

Signed-off-by: Pavel Tikhomirov

Pavel Tikhomirov (2):
  cpuacct: implicitly enable cpuacct controller in cgroup-v2
  sched/stat: make cpu.proc.stat file also available in cgroup-v2

 kernel/sched/core.c    | 4 ++++
 kernel/sched/cpuacct.c | 3 +++
 2 files changed, 7 insertions(+)

--
2.43.0
[Devel] [PATCH VZ9 1/2] cpuacct: implicitly enable cpuacct controller in cgroup-v2
This way the cpuacct controller is transparent to userspace but can be
used to track cgroup cpu usages per-cpu; in the next patch we will show
those usages in the cpu.proc.stat file.

Notes:
- In case cpuacct is mounted in cgroup-v1, we won't have it in
  cgroup-v2.
- We also make this cgroup threaded to be able to make it implicit on
  the default hierarchy. There should not be any problem, as each
  thread accounts separately anyway.

https://virtuozzo.atlassian.net/browse/PSBM-154206

Signed-off-by: Pavel Tikhomirov
Feature: sched: emulate virtual cpus for Containers
---
 kernel/sched/cpuacct.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 37c102354522..55dfd37567d8 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -362,6 +362,9 @@ struct cgroup_subsys cpuacct_cgrp_subsys = {
         .css_free       = cpuacct_css_free,
         .legacy_cftypes = files,
         .early_init     = true,
+
+        .implicit_on_dfl = true,
+        .threaded       = true,
 };

 extern struct task_group *css_tg(struct cgroup_subsys_state *css);
--
2.43.0
[Devel] [PATCH VZ9 2/2] mm: migrate page private for high-order folios in swap cache correctly
We have a check in do_swap_page that a page returned from
lookup_swap_cache should have its private field equal to the swapcache
index we searched it at (page_private(page) != entry.val). So
folio_migrate_mapping should preserve page private for each page of a
huge folio to satisfy this check, otherwise we get an infinite loop in:

+-> mmap_read_lock
 +-> __get_user_pages_locked
  +-> for-loop                    # taken once
   +-> __get_user_pages
    +-> retry-loop                # constantly spinning
     +-> faultin_page             # return 0 to trigger retry
      +-> handle_mm_fault
       +-> __handle_mm_fault
        +-> handle_pte_fault
         +-> do_swap_page
          +-> lookup_swap_cache   # returns non-NULL
          +-> if (swapcache)
           +-> if (!folio_test_swapcache || page_private(page) != entry.val)
            +-> goto out_page
          +-> return 0

And this loop is under mmap_lock, so we have the secondary problem of
multiple processes hanging on an attempt to take mmap_lock.

Note: We don't need this after we stop using private in this mainstream
patch: cfeed8ffe55b ("mm/swap: stop using page->private on tail pages
for THP_SWAP")

https://virtuozzo.atlassian.net/browse/PSBM-153264

Signed-off-by: Pavel Tikhomirov
---
 mm/migrate.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index d950f42c0708..357a0a4c59e6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -424,6 +424,10 @@ int folio_migrate_mapping(struct address_space *mapping,
         if (folio_test_swapcache(folio)) {
                 folio_set_swapcache(newfolio);
                 newfolio->private = folio_get_private(folio);
+
+                /* Swap cache needs private for each page of huge page */
+                for (i = 1; i < nr; i++)
+                        set_page_private(folio_page(newfolio, i), folio_page(folio, i)->private);
         }
         entries = nr;
 } else {
--
2.43.0
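The fix can be sketched in userspace C (a sketch: arrays of `private` values stand in for the old and new folios' per-page `->private` fields, which hold each subpage's swap entry):

```c
#include <assert.h>

/* When migrating a huge folio that sits in the swap cache, every
 * constituent page's ->private (its swap entry) must be copied to the
 * corresponding new page -- not just the head's, which is what the
 * unpatched code effectively did. */
static void migrate_private(const unsigned long *old_private,
			    unsigned long *new_private, long nr)
{
	new_private[0] = old_private[0];	/* head page, as before */

	/* Swap cache needs private for each page of the huge page */
	for (long i = 1; i < nr; i++)
		new_private[i] = old_private[i];
}
```

With only the head copied, do_swap_page's `page_private(page) != entry.val` check fails for every tail page, producing the retry loop shown above.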
[Devel] [PATCH VZ9 1/2] mm: migrate high-order folios in swap cache correctly
From: Charan Teja Kalla Large folios occupy N consecutive entries in the swap cache instead of using multi-index entries like the page cache. However, if a large folio is re-added to the LRU list, it can be migrated. The migration code was not aware of the difference between the swap cache and the page cache and assumed that a single xas_store() would be sufficient. This leaves potentially many stale pointers to the now-migrated folio in the swap cache, which can lead to almost arbitrary data corruption in the future. This can also manifest as infinite loops with the RCU read lock held. [wi...@infradead.org: modifications to the changelog & tweaked the fix] Fixes: 3417013e0d18 ("mm/migrate: Add folio_migrate_mapping()") Link: https://lkml.kernel.org/r/20231214045841.961776-1-wi...@infradead.org Signed-off-by: Charan Teja Kalla Signed-off-by: Matthew Wilcox (Oracle) Reported-by: Charan Teja Kalla Closes: https://lkml.kernel.org/r/1700569840-17327-1-git-send-email-quic_chara...@quicinc.com Cc: David Hildenbrand Cc: Johannes Weiner Cc: Kirill A. 
Shutemov Cc: Naoya Horiguchi Cc: Shakeel Butt Cc: Signed-off-by: Andrew Morton We have a check in do_swap_page that a page from lookup_swap_cache should have the PG_swapcache bit set, but these leftover stale pointers may be reused by a new folio without the PG_swapcache bit, and that leads to an infinite loop in: +-> mmap_read_lock +-> __get_user_pages_locked +-> for-loop # taken once +-> __get_user_pages +-> retry-loop # constantly spinning +-> faultin_page # return 0 to trigger retry +-> handle_mm_fault +-> __handle_mm_fault +-> handle_pte_fault +-> do_swap_page +-> lookup_swap_cache # returns non-NULL +-> if (swapcache) +-> if (!folio_test_swapcache || page_private(page) != entry.val) +-> goto out_page +-> return 0 (cherry picked from commit fc346d0a70a13d52fe1c4bc49516d83a42cd7c4c) https://virtuozzo.atlassian.net/browse/PSBM-153264 Signed-off-by: Pavel Tikhomirov --- mm/migrate.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/mm/migrate.c b/mm/migrate.c index d36d945cf716..d950f42c0708 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -387,6 +387,7 @@ int folio_migrate_mapping(struct address_space *mapping, int dirty; int expected_count = folio_expected_refs(mapping, folio) + extra_count; long nr = folio_nr_pages(folio); + long entries, i; if (!mapping) { /* Anonymous page without mapping */ @@ -424,8 +425,10 @@ int folio_migrate_mapping(struct address_space *mapping, folio_set_swapcache(newfolio); newfolio->private = folio_get_private(folio); } + entries = nr; } else { VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio); + entries = 1; } /* Move dirty while page refs frozen and newpage not yet exposed */ @@ -435,7 +438,11 @@ int folio_migrate_mapping(struct address_space *mapping, folio_set_dirty(newfolio); } - xas_store(&xas, newfolio); + /* Swap cache still stores N entries instead of a high-order entry */ + for (i = 0; i < entries; i++) { + xas_store(&xas, newfolio); + xas_next(&xas); + } /* * Drop cache reference from old page by unfreezing -- 2.43.0
[Devel] [PATCH VZ9 0/2] mm: fix livelock caused by THP page migration
https://virtuozzo.atlassian.net/browse/PSBM-153264 Charan Teja Kalla (1): mm: migrate high-order folios in swap cache correctly Pavel Tikhomirov (1): mm: migrate page private for high-order folios in swap cache correctly mm/migrate.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) -- 2.43.0
[Devel] [PATCH VZ9] mm/vmscan: add rcu_read_lock to replace released shrinker_rwsem
After commit [1] we release shrinker_rwsem for nfs while processing do_shrink_slab; we need this to mitigate shrinker_rwsem being blocked due to a hang in the nfs shrinker. After that we lack both shrinker_rwsem and rcu_read_lock in these stacks: +-< rcu_dereference_protected(lockdep_is_held(&shrinker_rwsem)) +-< shrinker_info_protected +-< xchg_nr_deferred_memcg +-< xchg_nr_deferred +-< do_shrink_slab +-< add_nr_deferred_memcg +-< add_nr_deferred +-< do_shrink_slab As these stacks only use info for read, we can switch to rcu_read_lock; we also need to switch rcu_dereference_protected -> rcu_dereference_check. https://virtuozzo.atlassian.net/browse/PSBM-153973 Fixes: c0efc56a8f844 ("mm: fix hanging shrinker management on long do_shrink_slab") [1] Signed-off-by: Pavel Tikhomirov --- note: In vz7 we don't need it. --- mm/vmscan.c | 31 +++++++++++++++++++++++++++---- 1 file changed, 27 insertions(+), 4 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 847fc2354c3dc..1f2eacbad84d4 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -218,6 +218,13 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg, lockdep_is_held(&shrinker_rwsem)); } +static struct shrinker_info *shrinker_info_check(struct mem_cgroup *memcg, +int nid) +{ + return rcu_dereference_check(memcg->nodeinfo[nid]->shrinker_info, +lockdep_is_held(&shrinker_rwsem)); +} + static int expand_one_shrinker_info(struct mem_cgroup *memcg, int map_size, int defer_size, int old_map_size, int old_defer_size) @@ -411,18 +418,34 @@ static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker, struct mem_cgroup *memcg) { struct shrinker_info *info; + long ret; - info = shrinker_info_protected(memcg, nid); - return atomic_long_xchg(&info->nr_deferred[shrinker->id], 0); + /* +* Need rcu lock here in case we've released shrinker_rwsem to prevent +* hang on nfs before calling do_shrink_slab(). +*/ + rcu_read_lock(); + info = shrinker_info_check(memcg, nid); + ret = atomic_long_xchg(&info->nr_deferred[shrinker->id], 0); + rcu_read_unlock(); + return ret; } static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker, struct mem_cgroup *memcg) { struct shrinker_info *info; + long ret; - info = shrinker_info_protected(memcg, nid); - return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]); + /* +* Need rcu lock here in case we've released shrinker_rwsem to prevent +* hang on nfs before calling do_shrink_slab(). +*/ + rcu_read_lock(); + info = shrinker_info_check(memcg, nid); + ret = atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]); + rcu_read_unlock(); + return ret; } void reparent_shrinker_deferred(struct mem_cgroup *memcg) -- 2.43.0
[Devel] [PATCH VZ9 2/2] mm/swap: convert add_to_swap_cache() to take a folio
From: "Matthew Wilcox (Oracle)" With all callers using folios, we can convert add_to_swap_cache() to take a folio and use it throughout. Link: https://lkml.kernel.org/r/20220902194653.1739778-13-wi...@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton (cherry picked from commit a4c366f01f10073e0220656561b875627ff7cd90) https://virtuozzo.atlassian.net/browse/PSBM-153264 Signed-off-by: Pavel Tikhomirov --- mm/shmem.c | 2 +- mm/swap.h | 4 ++-- mm/swap_state.c | 34 +- 3 files changed, 20 insertions(+), 20 deletions(-) diff --git a/mm/shmem.c b/mm/shmem.c index 50ccad78b315f..fbfecb73e740b 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1435,7 +1435,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) if (list_empty(>swaplist)) list_add(>swaplist, _swaplist); - if (add_to_swap_cache(>page, swap, + if (add_to_swap_cache(folio, swap, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN, NULL) == 0) { spin_lock_irq(>lock); diff --git a/mm/swap.h b/mm/swap.h index f23941d6cc699..4c2b7f16783e4 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -34,7 +34,7 @@ extern struct address_space *swapper_spaces[]; void show_swap_cache_info(void); bool add_to_swap(struct folio *folio); void *get_shadow_from_swap_cache(swp_entry_t entry); -int add_to_swap_cache(struct page *page, swp_entry_t entry, +int add_to_swap_cache(struct folio *folio, swp_entry_t entry, gfp_t gfp, void **shadowp); void __delete_from_swap_cache(struct folio *folio, swp_entry_t entry, void *shadow); @@ -126,7 +126,7 @@ static inline void *get_shadow_from_swap_cache(swp_entry_t entry) return NULL; } -static inline int add_to_swap_cache(struct page *page, swp_entry_t entry, +static inline int add_to_swap_cache(struct folio *folio, swp_entry_t entry, gfp_t gfp_mask, void **shadowp) { return -1; diff --git a/mm/swap_state.c b/mm/swap_state.c index 3a2323aba6f9b..9aed413b61e60 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -85,21 +85,21 @@ void 
*get_shadow_from_swap_cache(swp_entry_t entry) * add_to_swap_cache resembles filemap_add_folio on swapper_space, * but sets SwapCache flag and private instead of mapping and index. */ -int add_to_swap_cache(struct page *page, swp_entry_t entry, +int add_to_swap_cache(struct folio *folio, swp_entry_t entry, gfp_t gfp, void **shadowp) { struct address_space *address_space = swap_address_space(entry); pgoff_t idx = swp_offset(entry); - XA_STATE_ORDER(xas, _space->i_pages, idx, compound_order(page)); - unsigned long i, nr = thp_nr_pages(page); + XA_STATE_ORDER(xas, _space->i_pages, idx, folio_order(folio)); + unsigned long i, nr = folio_nr_pages(folio); void *old; - VM_BUG_ON_PAGE(!PageLocked(page), page); - VM_BUG_ON_PAGE(PageSwapCache(page), page); - VM_BUG_ON_PAGE(!PageSwapBacked(page), page); + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); + VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio); + VM_BUG_ON_FOLIO(!folio_test_swapbacked(folio), folio); - page_ref_add(page, nr); - SetPageSwapCache(page); + folio_ref_add(folio, nr); + folio_set_swapcache(folio); do { xas_lock_irq(); @@ -107,19 +107,19 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, if (xas_error()) goto unlock; for (i = 0; i < nr; i++) { - VM_BUG_ON_PAGE(xas.xa_index != idx + i, page); + VM_BUG_ON_FOLIO(xas.xa_index != idx + i, folio); old = xas_load(); if (xa_is_value(old)) { if (shadowp) *shadowp = old; } - set_page_private(page + i, entry.val + i); - xas_store(, page); + set_page_private(folio_page(folio, i), entry.val + i); + xas_store(, folio); xas_next(); } address_space->nrpages += nr; - __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr); - __mod_lruvec_page_state(page, NR_SWAPCACHE, nr); + __node_stat_mod_folio(folio, NR_FILE_PAGES, nr); + __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr); unlock: xas_unlock_irq(); } while (xas_nomem(, gfp)); @@ -127,8 +127,8 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, if (!xas_error()) return 0; - 
ClearPageSwapCache(page); - page_ref_sub(page, nr); +
[Devel] [PATCH VZ9 1/2] mm/swap: convert __read_swap_cache_async() to use a folio
From: "Matthew Wilcox (Oracle)" Remove a few hidden (and one visible) calls to compound_head(). Link: https://lkml.kernel.org/r/20220902194653.1739778-12-wi...@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton (cherry picked from commit a0d3374b070776e985bbd7b165b178fa688bf37a) Change: Also update vz specific hunk SetPageActive->folio_set_active. https://virtuozzo.atlassian.net/browse/PSBM-153264 Signed-off-by: Pavel Tikhomirov --- mm/swap_state.c | 39 --- 1 file changed, 20 insertions(+), 19 deletions(-) diff --git a/mm/swap_state.c b/mm/swap_state.c index b16dc348bbd26..3a2323aba6f9b 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -423,7 +423,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, bool *new_page_allocated, bool activate) { struct swap_info_struct *si; - struct page *page; + struct folio *folio; void *shadow = NULL; *new_page_allocated = false; @@ -438,11 +438,11 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, si = get_swap_device(entry); if (!si) return NULL; - page = find_get_page(swap_address_space(entry), -swp_offset(entry)); + folio = filemap_get_folio(swap_address_space(entry), + swp_offset(entry)); put_swap_device(si); - if (page) - return page; + if (folio) + return folio_file_page(folio, swp_offset(entry)); /* * Just skip read ahead for unused swap slot. @@ -460,8 +460,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, * before marking swap_map SWAP_HAS_CACHE, when -EEXIST will * cause any racers to loop around until we add it to cache. 
*/ - page = alloc_page_vma(gfp_mask, vma, addr); - if (!page) + folio = vma_alloc_folio(gfp_mask, 0, vma, addr, false); + if (!folio) return NULL; /* @@ -471,7 +471,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, if (!err) break; - put_page(page); + folio_put(folio); if (err != -EEXIST) return NULL; @@ -489,32 +489,33 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, * The swap entry is ours to swap in. Prepare the new page. */ - __SetPageLocked(page); - __SetPageSwapBacked(page); + __folio_set_locked(folio); + __folio_set_swapbacked(folio); - if (mem_cgroup_swapin_charge_page(page, NULL, gfp_mask, entry)) + if (mem_cgroup_swapin_charge_page(&folio->page, NULL, gfp_mask, entry)) goto fail_unlock; /* May fail (-ENOMEM) if XArray node allocation failed. */ - if (add_to_swap_cache(page, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow)) + if (add_to_swap_cache(&folio->page, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow)) goto fail_unlock; mem_cgroup_swapin_uncharge_swap(entry); if (shadow) - workingset_refault(page_folio(page), shadow); + workingset_refault(folio, shadow); /* Caller will initiate read into locked page */ if (activate) - SetPageActive(page); - lru_cache_add(page); + folio_set_active(folio); + /* Caller will initiate read into locked folio */ + folio_add_lru(folio); *new_page_allocated = true; - return page; + return &folio->page; fail_unlock: - put_swap_page(page, entry); - unlock_page(page); - put_page(page); + put_swap_page(&folio->page, entry); + folio_unlock(folio); + folio_put(folio); return NULL; } -- 2.43.0
[Devel] [PATCH VZ9 0/2] mm/swap: fix swapcache page/folio inconsistency
This ports the switching from page to folio in add_to_swap_cache(). We saw multiple different hangs on mmap_lock, where the task holding the lock was livelocked, spinning in this stack: +-> __x64_sys_ioctl +-> kvm_vcpu_ioctl +-> kvm_arch_vcpu_ioctl_run +-> vcpu_run +-> vcpu_enter_guest +-> kvm_mmu_page_fault +-> kvm_tdp_page_fault +-> kvm_faultin_pfn +-> __kvm_faultin_pfn +-> hva_to_pfn +-> get_user_pages_unlocked +-> get_user_pages_unlocked +-> mmap_read_lock # 1 +-> __get_user_pages_locked # 2 +-> for-loop # taken once +-> __get_user_pages +-> retry-loop # constantly spinning +-> faultin_page # return 0 to trigger retry +-> handle_mm_fault +-> __handle_mm_fault +-> handle_pte_fault +-> do_swap_page +-> lookup_swap_cache # returns non-NULL +-> if (swapcache) +-> if (!folio_test_swapcache || page_private(page) != entry.val) +-> goto out_page +-> return 0 That can be due to an inconsistency in swapcache flag setting/reading: one can see that PageSwapCache reads the flag from the folio, but SetPageSwapCache/ClearPageSwapCache instead affect the flag on the page. After applying these patches SetPageSwapCache/ClearPageSwapCache become unused, thus all paths now read this flag from the folio. With them I don't see any hangs on mmap_lock anymore (on the same test setup). https://virtuozzo.atlassian.net/browse/PSBM-153264 Signed-off-by: Pavel Tikhomirov Matthew Wilcox (Oracle) (2): mm/swap: convert __read_swap_cache_async() to use a folio mm/swap: convert add_to_swap_cache() to take a folio mm/shmem.c | 2 +- mm/swap.h | 4 +-- mm/swap_state.c | 71 ++++++++++++++++++++++++++++++++++++----------------------------------- 3 files changed, 39 insertions(+), 38 deletions(-) -- 2.43.0
[Devel] [PATCH VZ9 4/4] netfilter: bridge: replace physindev with physinif in nf_bridge_info
An skb can be added to a neigh->arp_queue while waiting for an arp reply. Where original skb's skb->dev can be different to neigh's neigh->dev. For instance in case of bridging dnated skb from one veth to another, the skb would be added to a neigh->arp_queue of the bridge. As skb->dev can be reset back to nf_bridge->physindev and used, and as there is no explicit mechanism that prevents this physindev from been freed under us (for instance neigh_flush_dev doesn't cleanup skbs from different device's neigh queue) we can crash on e.g. this stack: arp_process neigh_update skb = __skb_dequeue(>arp_queue) neigh_resolve_output(..., skb) ... br_nf_dev_xmit br_nf_pre_routing_finish_bridge_slow skb->dev = nf_bridge->physindev br_handle_frame_finish Let's use plain ifindex instead of net_device link. To peek into the original net_device we will use dev_get_by_index_rcu(). Thus either we get device and are safe to use it or we don't get it and drop skb. Fixes: c4e70a87d975 ("netfilter: bridge: rename br_netfilter.c to br_netfilter_hooks.c") Suggested-by: Florian Westphal Signed-off-by: Pavel Tikhomirov Signed-off-by: Pablo Neira Ayuso (cherry picked from commit 9874808878d9eed407e3977fd11fee49de1e1d86) https://virtuozzo.atlassian.net/browse/PSBM-153269 Signed-off-by: Pavel Tikhomirov --- include/linux/netfilter_bridge.h| 4 +-- include/linux/skbuff.h | 2 +- net/bridge/br_netfilter_hooks.c | 42 +++-- net/bridge/br_netfilter_ipv6.c | 14 +++--- net/ipv4/netfilter/nf_reject_ipv4.c | 9 --- net/ipv6/netfilter/nf_reject_ipv6.c | 11 +--- 6 files changed, 61 insertions(+), 21 deletions(-) diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h index e927b9a15a556..743475ca7e9d5 100644 --- a/include/linux/netfilter_bridge.h +++ b/include/linux/netfilter_bridge.h @@ -42,7 +42,7 @@ static inline int nf_bridge_get_physinif(const struct sk_buff *skb) if (!nf_bridge) return 0; - return nf_bridge->physindev ? 
nf_bridge->physindev->ifindex : 0; + return nf_bridge->physinif; } static inline int nf_bridge_get_physoutif(const struct sk_buff *skb) @@ -60,7 +60,7 @@ nf_bridge_get_physindev(const struct sk_buff *skb, struct net *net) { const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb); - return nf_bridge ? nf_bridge->physindev : NULL; + return nf_bridge ? dev_get_by_index_rcu(net, nf_bridge->physinif) : NULL; } static inline struct net_device * diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index a6d26a08a2557..4714ce0afffb3 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -263,7 +263,7 @@ struct nf_bridge_info { u8 bridged_dnat:1; u8 sabotage_in_done:1; __u16 frag_max_size; - struct net_device *physindev; + int physinif; /* always valid & non-NULL from FORWARD on, for physdev match */ struct net_device *physoutdev; diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c index 4bc6761517bbc..98598672f9382 100644 --- a/net/bridge/br_netfilter_hooks.c +++ b/net/bridge/br_netfilter_hooks.c @@ -278,8 +278,17 @@ int br_nf_pre_routing_finish_bridge(struct net *net, struct sock *sk, struct sk_ int ret; if ((neigh->nud_state & NUD_CONNECTED) && neigh->hh.hh_len) { + struct net_device *br_indev; + + br_indev = nf_bridge_get_physindev(skb, net); + if (!br_indev) { + neigh_release(neigh); + goto free_skb; + } + neigh_hh_bridge(>hh, skb); - skb->dev = nf_bridge->physindev; + skb->dev = br_indev; + ret = br_handle_frame_finish(net, sk, skb); } else { /* the neighbour function below overwrites the complete @@ -351,12 +360,18 @@ br_nf_ipv4_daddr_was_changed(const struct sk_buff *skb, */ static int br_nf_pre_routing_finish(struct net *net, struct sock *sk, struct sk_buff *skb) { - struct net_device *dev = skb->dev; + struct net_device *dev = skb->dev, *br_indev; struct iphdr *iph = ip_hdr(skb); struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb); struct rtable *rt; int err; + br_indev = nf_bridge_get_physindev(skb, 
net); + if (!br_indev) { + kfree_skb(skb); + return 0; + } + nf_bridge->frag_max_size = IPCB(skb)->frag_max_size; if (nf_bridge->pkt_otherhost)
[Devel] [PATCH VZ9 3/4] netfilter: propagate net to nf_bridge_get_physindev
This is a preparation patch for replacing physindev with physinif on nf_bridge_info structure. We will use dev_get_by_index_rcu to resolve device, when needed, and it requires net to be available. Signed-off-by: Pavel Tikhomirov Reviewed-by: Simon Horman Signed-off-by: Pablo Neira Ayuso (cherry picked from commit a54e72197037d2c9bfcd70dddaac8c8ccb5b41ba) https://virtuozzo.atlassian.net/browse/PSBM-153269 Signed-off-by: Pavel Tikhomirov --- include/linux/netfilter_bridge.h | 2 +- net/ipv4/netfilter/nf_reject_ipv4.c| 2 +- net/ipv6/netfilter/nf_reject_ipv6.c| 2 +- net/netfilter/ipset/ip_set_hash_netiface.c | 8 net/netfilter/nf_log_syslog.c | 13 +++-- net/netfilter/nf_queue.c | 2 +- net/netfilter/xt_physdev.c | 2 +- 7 files changed, 16 insertions(+), 15 deletions(-) diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h index f980edfdd2783..e927b9a15a556 100644 --- a/include/linux/netfilter_bridge.h +++ b/include/linux/netfilter_bridge.h @@ -56,7 +56,7 @@ static inline int nf_bridge_get_physoutif(const struct sk_buff *skb) } static inline struct net_device * -nf_bridge_get_physindev(const struct sk_buff *skb) +nf_bridge_get_physindev(const struct sk_buff *skb, struct net *net) { const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb); diff --git a/net/ipv4/netfilter/nf_reject_ipv4.c b/net/ipv4/netfilter/nf_reject_ipv4.c index f2edb40c0db00..63c91aa2375fb 100644 --- a/net/ipv4/netfilter/nf_reject_ipv4.c +++ b/net/ipv4/netfilter/nf_reject_ipv4.c @@ -286,7 +286,7 @@ void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb, * build the eth header using the original destination's MAC as the * source, and send the RST packet directly. 
*/ - br_indev = nf_bridge_get_physindev(oldskb); + br_indev = nf_bridge_get_physindev(oldskb, net); if (br_indev) { struct ethhdr *oeth = eth_hdr(oldskb); diff --git a/net/ipv6/netfilter/nf_reject_ipv6.c b/net/ipv6/netfilter/nf_reject_ipv6.c index dffedcded..b138ee4ebc6a2 100644 --- a/net/ipv6/netfilter/nf_reject_ipv6.c +++ b/net/ipv6/netfilter/nf_reject_ipv6.c @@ -353,7 +353,7 @@ void nf_send_reset6(struct net *net, struct sock *sk, struct sk_buff *oldskb, * build the eth header using the original destination's MAC as the * source, and send the RST packet directly. */ - br_indev = nf_bridge_get_physindev(oldskb); + br_indev = nf_bridge_get_physindev(oldskb, net); if (br_indev) { struct ethhdr *oeth = eth_hdr(oldskb); diff --git a/net/netfilter/ipset/ip_set_hash_netiface.c b/net/netfilter/ipset/ip_set_hash_netiface.c index 0310732862362..bf1a3851ba5a7 100644 --- a/net/netfilter/ipset/ip_set_hash_netiface.c +++ b/net/netfilter/ipset/ip_set_hash_netiface.c @@ -138,9 +138,9 @@ hash_netiface4_data_next(struct hash_netiface4_elem *next, #include "ip_set_hash_gen.h" #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER) -static const char *get_physindev_name(const struct sk_buff *skb) +static const char *get_physindev_name(const struct sk_buff *skb, struct net *net) { - struct net_device *dev = nf_bridge_get_physindev(skb); + struct net_device *dev = nf_bridge_get_physindev(skb, net); return dev ? dev->name : NULL; } @@ -177,7 +177,7 @@ hash_netiface4_kadt(struct ip_set *set, const struct sk_buff *skb, if (opt->cmdflags & IPSET_FLAG_PHYSDEV) { #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER) - const char *eiface = SRCDIR ? get_physindev_name(skb) : + const char *eiface = SRCDIR ? get_physindev_name(skb, xt_net(par)) : get_physoutdev_name(skb); if (!eiface) @@ -395,7 +395,7 @@ hash_netiface6_kadt(struct ip_set *set, const struct sk_buff *skb, if (opt->cmdflags & IPSET_FLAG_PHYSDEV) { #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER) - const char *eiface = SRCDIR ? 
get_physindev_name(skb) : + const char *eiface = SRCDIR ? get_physindev_name(skb, xt_net(par)) : get_physoutdev_name(skb); if (!eiface) diff --git a/net/netfilter/nf_log_syslog.c b/net/netfilter/nf_log_syslog.c index b7e4e6791038e..baf609ff5fbda 100644 --- a/net/netfilter/nf_log_syslog.c +++ b/net/netfilter/nf_log_syslog.c @@ -117,7 +117,8 @@ nf_log_dump_packet_common(struct nf_log_buf *m, u8 pf, unsigned int hooknum, const struct sk_buff *skb, const struct net_device *in, const struct net_device *out, - const struct nf_loginfo *loginfo, const char *prefix) + const struct nf_loginfo *loginfo, const char *prefix, + struct net *net) {
[Devel] [PATCH VZ9 0/4] netfilter: bridge: fix nf_bridge->physindev use after free
This is a backport of mainstream version of the fix to vz9: https://lore.kernel.org/netdev/2024050645.85637-1-ptikhomi...@virtuozzo.com/ https://virtuozzo.atlassian.net/browse/PSBM-153269 Pavel Tikhomirov (4): netfilter: nfnetlink_log: use proper helper for fetching physinif netfilter: nf_queue: remove excess nf_bridge variable netfilter: propagate net to nf_bridge_get_physindev netfilter: bridge: replace physindev with physinif in nf_bridge_info include/linux/netfilter_bridge.h | 6 ++-- include/linux/skbuff.h | 2 +- net/bridge/br_netfilter_hooks.c| 42 +- net/bridge/br_netfilter_ipv6.c | 14 +--- net/ipv4/netfilter/nf_reject_ipv4.c| 9 +++-- net/ipv6/netfilter/nf_reject_ipv6.c| 11 -- net/netfilter/ipset/ip_set_hash_netiface.c | 8 ++--- net/netfilter/nf_log_syslog.c | 13 +++ net/netfilter/nf_queue.c | 6 ++-- net/netfilter/nfnetlink_log.c | 8 ++--- net/netfilter/xt_physdev.c | 2 +- 11 files changed, 80 insertions(+), 41 deletions(-) -- 2.43.0 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH VZ9 1/4] netfilter: nfnetlink_log: use proper helper for fetching physinif
We don't use physindev in __build_packet_message except for getting physinif from it. So let's switch to nf_bridge_get_physinif to get what we want directly. Signed-off-by: Pavel Tikhomirov Reviewed-by: Simon Horman Signed-off-by: Pablo Neira Ayuso (cherry picked from commit c3f9fd54cd87233f53bdf0e191a86b3a5e960e02) https://virtuozzo.atlassian.net/browse/PSBM-153269 Signed-off-by: Pavel Tikhomirov --- net/netfilter/nfnetlink_log.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/net/netfilter/nfnetlink_log.c b/net/netfilter/nfnetlink_log.c index d97eb280cb2e8..8116beccdc1c4 100644 --- a/net/netfilter/nfnetlink_log.c +++ b/net/netfilter/nfnetlink_log.c @@ -499,7 +499,7 @@ __build_packet_message(struct nfnl_log_net *log, htonl(br_port_get_rcu(indev)->br->dev->ifindex))) goto nla_put_failure; } else { - struct net_device *physindev; + int physinif; /* Case 2: indev is bridge group, we need to look for * physical device (when called from ipv4) */ @@ -507,10 +507,10 @@ __build_packet_message(struct nfnl_log_net *log, htonl(indev->ifindex))) goto nla_put_failure; - physindev = nf_bridge_get_physindev(skb); - if (physindev && + physinif = nf_bridge_get_physinif(skb); + if (physinif && nla_put_be32(inst->skb, NFULA_IFINDEX_PHYSINDEV, -htonl(physindev->ifindex))) +htonl(physinif))) goto nla_put_failure; } #endif -- 2.43.0 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH VZ9 2/4] netfilter: nf_queue: remove excess nf_bridge variable
We don't really need nf_bridge variable here. And nf_bridge_info_exists is better replacement for nf_bridge_info_get in case we are only checking for existence. Signed-off-by: Pavel Tikhomirov Reviewed-by: Simon Horman Signed-off-by: Pablo Neira Ayuso (cherry picked from commit aeaa44075f8e49e2e0ad4507d925e690b7950145) https://virtuozzo.atlassian.net/browse/PSBM-153269 Signed-off-by: Pavel Tikhomirov --- net/netfilter/nf_queue.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/net/netfilter/nf_queue.c b/net/netfilter/nf_queue.c index d7542c5c4ae61..0077354d3258c 100644 --- a/net/netfilter/nf_queue.c +++ b/net/netfilter/nf_queue.c @@ -82,10 +82,8 @@ static void __nf_queue_entry_init_physdevs(struct nf_queue_entry *entry) { #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER) const struct sk_buff *skb = entry->skb; - struct nf_bridge_info *nf_bridge; - nf_bridge = nf_bridge_info_get(skb); - if (nf_bridge) { + if (nf_bridge_info_exists(skb)) { entry->physin = nf_bridge_get_physindev(skb); entry->physout = nf_bridge_get_physoutdev(skb); } else { -- 2.43.0 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH VZ9] drivers/vhost: fix missing rcu_read_lock in vhost_work_queue
In this stack: +-> vhost_vsock_dev_ioctl +-> vhost_vsock_start +-> vhost_work_queue +-> xas_find +-> xas_load +-> xas_start +-> xa_head +-> rcu_dereference_check We require either rcu_read_lock or xa_lock but have none. Let's fix it by calling xa_find, which is a wrapper for xas_find with proper rcu locking and also xas_retry logic. https://virtuozzo.atlassian.net/browse/PSBM-153264 Fixes: 5271bf51f1b83 ("ms/vhost: replace single worker pointer with xarray") Signed-off-by: Pavel Tikhomirov Feature: vhost-blk: in-kernel accelerator for virtio-blk guests --- drivers/vhost/vhost.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index c32557e279dfb..2f45c8d2b6fd6 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -257,10 +257,10 @@ static bool vhost_worker_queue(struct vhost_worker *worker, bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) { - XA_STATE(xas, &dev->worker_xa, 0); struct vhost_worker *worker; + unsigned long i; - worker = xas_find(&xas, UINT_MAX); + worker = xa_find(&dev->worker_xa, &i, ULONG_MAX, XA_PRESENT); if (!worker) return false; -- 2.43.0
Re: [Devel] [PATCH vz7 v3 1/2] net, neigh: Fix null-ptr-deref in neigh_table_clear()
(for both patches) Reviewed-by: Pavel Tikhomirov On 17/01/2024 03:31, Alexander Atanasov wrote: From: Chen Zhongjin When IPv6 module gets initialized but hits an error in the middle, kenel panic with: KASAN: null-ptr-deref in range [0x0598-0x059f] CPU: 1 PID: 361 Comm: insmod Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) RIP: 0010:__neigh_ifdown.isra.0+0x24b/0x370 RSP: 0018:888012677908 EFLAGS: 0202 ... Call Trace: neigh_table_clear+0x94/0x2d0 ndisc_cleanup+0x27/0x40 [ipv6] inet6_init+0x21c/0x2cb [ipv6] do_one_initcall+0xd3/0x4d0 do_init_module+0x1ae/0x670 ... Kernel panic - not syncing: Fatal exception When ipv6 initialization fails, it will try to cleanup and calls: neigh_table_clear() neigh_ifdown(tbl, NULL) pneigh_queue_purge(>proxy_queue, dev_net(dev == NULL)) # dev_net(NULL) triggers null-ptr-deref. Fix it by passing NULL to pneigh_queue_purge() in neigh_ifdown() if dev is NULL, to make kernel not panic immediately. Fixes: 66ba215cb513 ("neigh: fix possible DoS due to net iface start/stop loop") Signed-off-by: Chen Zhongjin Reviewed-by: Eric Dumazet Reviewed-by: Denis V. Lunev Link: https://lore.kernel.org/r/20221101121552.21890-1-chenzhong...@huawei.com Signed-off-by: Jakub Kicinski (cherry picked from commit f8017317cb0b279b8ab98b0f3901a2e0ac880dad) https://virtuozzo.atlassian.net/browse/PSBM-153018 https://bugs.openvz.org/browse/OVZ-7410 Signed-off-by: Alexander Atanasov --- net/core/neighbour.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) v1->v2: fixed wrong merge v2->v3: backported one more patch - which is not in the vz7 tree but is in mainstream, so it looked like two patches are made into one. Both touch the same code to increase the confusion. 
diff --git a/net/core/neighbour.c b/net/core/neighbour.c index d5b28c158144..ae6490f8d3ae 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -321,7 +321,7 @@ int neigh_ifdown(struct neigh_table *tbl, struct net_device *dev) neigh_flush_dev(tbl, dev); pneigh_ifdown(tbl, dev); write_unlock_bh(&tbl->lock); - pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev)); + pneigh_queue_purge(&tbl->proxy_queue, dev ? dev_net(dev) : NULL); if (skb_queue_empty(&tbl->proxy_queue)) del_timer_sync(&tbl->proxy_timer); return 0; -- Best regards, Tikhomirov Pavel Senior Software Developer, Virtuozzo.
Re: [Devel] [PATCH VZ9 2/3] net: zerocopy over unix sockets
On 16/01/2024 19:26, Alexey Kuznetsov wrote: Hello! On Tue, Jan 16, 2024 at 1:13 PM Pavel Tikhomirov wrote: --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1405,6 +1405,9 @@ int sk_setsockopt(struct sock *sk, int level, int optname, (sk->sk_type == SOCK_DGRAM && sk->sk_protocol == IPPROTO_UDP))) ret = -EOPNOTSUPP; + } else if (sk->sk_family == PF_UNIX) { + if (sk->sk_type == SOCK_DGRAM) + ret = -EOPNOTSUPP; Likely we don't want to enable zerocopy for SOCK_SEQPACKET. Maybe change to: "if (sk->sk_type != SOCK_STREAM)" Indeed. Thanks. + if (unlikely(flags & MSG_ERRQUEUE)) + return unix_recv_error(sk, msg, size); In case of MSG_PEEK it might be not what user wants to receive zero-copy notification error message. MSG_ERRQUEUE means user requested error messages, them and only them. If MSG_PEEK (or another flag) is also set it means either peeking from error queue or this flag is ignored (likeMSG_DONTWAIT) or rejected, whatever is appropriate. At least it is how I originally designed it. But in case it was messed up with time, I have to recheck current semantics in tcp and align to it, whatever it is. Ah sorry I was confused, you are right of course, when we ask for errors it is perfectly OK to receive an error. Though yes, there would be no way to peek on error queue, so CRIU will not be able to handle such error messages, but that is different story. + int err = skb_orphan_frags_rx(skb, GFP_KERNEL); + I prefer not to put function call on the same line with variable Ok-ok. If my old pants are old-fashioned granpa's style, we replace them. But why that creepy checkpatch.pl does not complain? Is it broken and already not haute couture? :-) I don't think that it is a strict rule, there are many places where one can see such things in kernel code (e.g. osf_statfs, bpf_copy_from_user, update_parent_effective_cpumask or nsfs_init_fs_context). So I don't insist. -- Best regards, Tikhomirov Pavel Senior Software Developer, Virtuozzo. 
___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH vz7 v2] net: neigh: decrement the family specific qlen
On 16/01/2024 15:10, Alexander Atanasov wrote: On 16.01.24 5:12, Pavel Tikhomirov wrote: I'd prefer to have two separate clean patches ported instead of porting a merge of one into another (without a change explaining comment). commit 8207f253a097fe15c93d85ac15ebb73c5e39e1e1 Author: Thomas Zeitlhofer Date:  Tue Nov 15 23:09:41 2022 +0100 @@ -409,7 +427,8 @@ static int __neigh_ifdown(struct neigh_table *tbl, struct net_device *dev,    write_lock_bh(&tbl->lock);    neigh_flush_dev(tbl, dev, skip_perm);    pneigh_ifdown_and_unlock(tbl, dev); -  pneigh_queue_purge(&tbl->proxy_queue, dev ? dev_net(dev) : NULL); +  pneigh_queue_purge(&tbl->proxy_queue, dev ? dev_net(dev) : NULL, + tbl->family);    if (skb_queue_empty_lockless(&tbl->proxy_queue))    del_timer_sync(&tbl->proxy_timer);    return 0; The hunk about the device is in the commit I am backporting to vz7. I made a mistake during the merge so it was not in the v1 patch. The other commit you talk about is: commit f8017317cb0b279b8ab98b0f3901a2e0ac880dad Author: Chen Zhongjin Date:  Tue Nov 1 20:15:52 2022 +0800 which brings the fix about NULL dev - it predates the one I am backporting. Both logs are from the mainstream tree. You ported 8207f253 and merged f8017317 into it, I don't like it. I would've ported both patches sequentially instead, that's all I try to say here. When you port something cleanly, git-range-diff will be clean: [snorch@turmoil vzkernel-vz7]$ git range-diff 8207f253a097fe15c93d85ac15ebb73c5e39e1e1^- HEAD^- | grep "++\|--\|-+\|+-" [snorch@turmoil vzkernel-vz7]$ in your case it is: [snorch@turmoil vzkernel-vz7]$ git range-diff 8207f253a097fe15c93d85ac15ebb73c5e39e1e1^- aatanasov-neigh-qlen-v2^- | grep "++\|--\|-+\|+-" -- pneigh_queue_purge(&tbl->proxy_queue, dev ? dev_net(dev) : NULL); +- pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev)); On 15/01/2024 22:56, Alexander Atanasov wrote: From: Thomas Zeitlhofer Commit 0ff4eb3d5ebb ("neighbour: make proxy_queue.qlen limit per-device") introduced the length counter qlen in struct neigh_parms.
There are separate neigh_parms instances for IPv4/ARP and IPv6/ND, and while the family specific qlen is incremented in pneigh_enqueue(), the mentioned commit decrements always the IPv4/ARP specific qlen, regardless of the currently processed family, in pneigh_queue_purge() and neigh_proxy_process(). As a result, with IPv6/ND, the family specific qlen is only incremented (and never decremented) until it exceeds PROXY_QLEN, and then, according to the check in pneigh_enqueue(), neighbor solicitations are not answered anymore. As an example, this is noted when using the subnet-router anycast address to access a Linux router. After a certain amount of time (in the observed case, qlen exceeded PROXY_QLEN after two days), the Linux router stops answering neighbor solicitations for its subnet-router anycast address and effectively becomes unreachable. Another result with IPv6/ND is that the IPv4/ARP specific qlen is decremented more often than incremented. This leads to negative qlen values, as a signed integer has been used for the length counter qlen, and potentially to an integer overflow. Fix this by introducing the helper function neigh_parms_qlen_dec(), which decrements the family specific qlen. Thereby, make use of the existing helper function neigh_get_dev_parms_rcu(), whose definition therefore needs to be placed earlier in neighbour.c. Take the family member from struct neigh_table to determine the currently processed family and appropriately call neigh_parms_qlen_dec() from pneigh_queue_purge() and neigh_proxy_process(). Additionally, use an unsigned integer for the length counter qlen. Fixes: 0ff4eb3d5ebb ("neighbour: make proxy_queue.qlen limit per-device") Signed-off-by: Thomas Zeitlhofer Signed-off-by: David S. 
Miller (cherry picked from commit 8207f253a097fe15c93d85ac15ebb73c5e39e1e1) https://virtuozzo.atlassian.net/browse/PSBM-153018 https://bugs.openvz.org/browse/OVZ-7410 Signed-off-by: Alexander Atanasov ---  include/net/neighbour.h | 2 +-  net/core/neighbour.c   | 58 +  2 files changed, 31 insertions(+), 29 deletions(-) diff --git a/include/net/neighbour.h b/include/net/neighbour.h index 2326044d5294..c63016cc4b27 100644 --- a/include/net/neighbour.h +++ b/include/net/neighbour.h @@ -79,7 +79,7 @@ struct neigh_parms {  struct rcu_head rcu_head;  int   reachable_time; -   int   qlen; +   u32   qlen;  int   data[NEIGH_VAR_DATA_MAX];  DECLARE_BITMAP(data_state, NEIGH_VAR_DATA_MAX);  }; diff --git a/net/core/neighbour.c b/net/core/neighbour.c index d5b28c158144..fa9da27c0676 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -223,7 +223,31 @@ static int neigh_del_timer(struct neighbour *n)  return 0;  } -static void pneigh_queue_purge(struct
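The git-range-diff check used in the review above can be reproduced on a throwaway repository. Everything below (paths, commit messages, branch names) is made up for illustration; the point is that `X^-` is shorthand for the single-commit range `X^..X`, and a clean cherry-pick yields no changed diff hunks for the grep to find.

```shell
# Sketch: verifying a clean backport with git range-diff (hypothetical repo).
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo && cd repo
git -c user.name=t -c user.email=t@t commit -q --allow-empty -m base
echo fix > f && git add f
git -c user.name=t -c user.email=t@t commit -q -m "fix: add f"
orig=$(git rev-parse HEAD)
# "backport" the fix onto the base commit
git checkout -q -b backport HEAD^
git -c user.name=t -c user.email=t@t cherry-pick "$orig" >/dev/null
# A clean port produces no "++/--/-+/+-" diff-of-diff lines, so grep finds
# nothing (compare the two invocations quoted in the review above).
git range-diff "$orig^-" "HEAD^-" | grep "++\|--\|-+\|+-" || echo "clean port"
```

A port that silently folds in extra changes would instead print the offending hunks, exactly as in the `aatanasov-neigh-qlen-v2` case quoted above.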
Re: [Devel] [PATCH VZ9 2/3] net: zerocopy over unix sockets
On 09/01/2024 21:38, Alexey Kuznetsov wrote: Observation is that af_unix sockets today became slower and eat a lot of more cpu than 100G ethernet. So, implement MSG_ZEROCOPY over af_unix sockets to be able to talk to local services without collapse of performance. Unexpectedly, this makes sense! F.e. zerocopy cannot be done in TCP over loopback, because skbs when passing over loopback change ownership. But unix sockets traditionally implemented in the way that skbs queued on destination sockets still preserve ownership of source socket, so that completion notifications can be delivered back to sender without problems. Code looks neat and clean and might be submitted to mainstream one day. A little bit of dirt is that it shares IP_RECVERR control message, but rather this dirt is not mine, but by original author, who used protocol level message for generic socket level notifications without any good reasons, creating a mess for users. Signed-off-by: Alexey Kuznetsov --- net/core/datagram.c | 2 +- net/core/sock.c | 3 ++ net/unix/af_unix.c | 108 +--- 3 files changed, 98 insertions(+), 15 deletions(-) diff --git a/net/core/datagram.c b/net/core/datagram.c index 8eb5999..980248b 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -640,7 +640,7 @@ int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb, skb->data_len += copied; skb->len += copied; skb->truesize += truesize; - if (sk && sk->sk_type == SOCK_STREAM) { + if (sk && sk->sk_type == SOCK_STREAM && sk->sk_family != PF_UNIX) { sk_wmem_queued_add(sk, truesize); sk_mem_charge(sk, truesize); } else { diff --git a/net/core/sock.c b/net/core/sock.c index 5d7e16a..fe8e102 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1405,6 +1405,9 @@ int sk_setsockopt(struct sock *sk, int level, int optname, (sk->sk_type == SOCK_DGRAM && sk->sk_protocol == IPPROTO_UDP))) ret = -EOPNOTSUPP; + } else if (sk->sk_family == PF_UNIX) { + if (sk->sk_type == SOCK_DGRAM) + ret = -EOPNOTSUPP; Likely we don't want to 
enable zerocopy for SOCK_SEQPACKET. Maybe change to: "if (sk->sk_type != SOCK_STREAM)" } else if (sk->sk_family != PF_RDS) { ret = -EOPNOTSUPP; } diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 36c3cb9..e32fe47 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -115,6 +115,7 @@ #include #include #include +#include #include "scm.h" @@ -2054,6 +2055,7 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg, { struct sock *sk = sock->sk; struct sock *other = NULL; + struct ubuf_info *uarg = NULL; int err, size; struct sk_buff *skb; int sent = 0; @@ -2083,22 +2085,37 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg, if (sk->sk_shutdown & SEND_SHUTDOWN) goto pipe_err; + if ((msg->msg_flags & MSG_ZEROCOPY) && len && sock_flag(sk, SOCK_ZEROCOPY)) { + uarg = msg_zerocopy_alloc(sk, len); + if (!uarg) { + err = -ENOBUFS; + goto out_err; + } + } + while (sent < len) { size = len - sent; - /* Keep two messages in the pipe so it schedules better */ - size = min_t(int, size, (sk->sk_sndbuf >> 1) - 64); + if (!uarg) { + /* Keep two messages in the pipe so it schedules better */ + size = min_t(int, size, (sk->sk_sndbuf >> 1) - 64); + + /* allow fallback to order-0 allocations */ + size = min_t(int, size, SKB_MAX_HEAD(0) + UNIX_SKB_FRAGS_SZ); - /* allow fallback to order-0 allocations */ - size = min_t(int, size, SKB_MAX_HEAD(0) + UNIX_SKB_FRAGS_SZ); + data_len = max_t(int, 0, size - SKB_MAX_HEAD(0)); - data_len = max_t(int, 0, size - SKB_MAX_HEAD(0)); + data_len = min_t(size_t, size, PAGE_ALIGN(data_len)); - data_len = min_t(size_t, size, PAGE_ALIGN(data_len)); + skb = sock_alloc_send_pskb(sk, size - data_len, data_len, + msg->msg_flags & MSG_DONTWAIT, &err, + get_order(UNIX_SKB_FRAGS_SZ)); + } else { + size = min_t(int, size, sk->sk_sndbuf); + skb = sock_alloc_send_pskb(sk, 0, 0, + msg->msg_flags & MSG_DONTWAIT, &err, 0); + } - skb =
Re: [Devel] [PATCH vz7 v2] net: neigh: decrement the family specific qlen
I'd prefer to have two separate clean patches ported instead of porting a merge of one into another (without a change explaining comment). On 15/01/2024 22:56, Alexander Atanasov wrote: From: Thomas Zeitlhofer Commit 0ff4eb3d5ebb ("neighbour: make proxy_queue.qlen limit per-device") introduced the length counter qlen in struct neigh_parms. There are separate neigh_parms instances for IPv4/ARP and IPv6/ND, and while the family specific qlen is incremented in pneigh_enqueue(), the mentioned commit decrements always the IPv4/ARP specific qlen, regardless of the currently processed family, in pneigh_queue_purge() and neigh_proxy_process(). As a result, with IPv6/ND, the family specific qlen is only incremented (and never decremented) until it exceeds PROXY_QLEN, and then, according to the check in pneigh_enqueue(), neighbor solicitations are not answered anymore. As an example, this is noted when using the subnet-router anycast address to access a Linux router. After a certain amount of time (in the observed case, qlen exceeded PROXY_QLEN after two days), the Linux router stops answering neighbor solicitations for its subnet-router anycast address and effectively becomes unreachable. Another result with IPv6/ND is that the IPv4/ARP specific qlen is decremented more often than incremented. This leads to negative qlen values, as a signed integer has been used for the length counter qlen, and potentially to an integer overflow. Fix this by introducing the helper function neigh_parms_qlen_dec(), which decrements the family specific qlen. Thereby, make use of the existing helper function neigh_get_dev_parms_rcu(), whose definition therefore needs to be placed earlier in neighbour.c. Take the family member from struct neigh_table to determine the currently processed family and appropriately call neigh_parms_qlen_dec() from pneigh_queue_purge() and neigh_proxy_process(). Additionally, use an unsigned integer for the length counter qlen. 
Fixes: 0ff4eb3d5ebb ("neighbour: make proxy_queue.qlen limit per-device") Signed-off-by: Thomas Zeitlhofer Signed-off-by: David S. Miller (cherry picked from commit 8207f253a097fe15c93d85ac15ebb73c5e39e1e1) https://virtuozzo.atlassian.net/browse/PSBM-153018 https://bugs.openvz.org/browse/OVZ-7410 Signed-off-by: Alexander Atanasov --- include/net/neighbour.h | 2 +- net/core/neighbour.c| 58 + 2 files changed, 31 insertions(+), 29 deletions(-) diff --git a/include/net/neighbour.h b/include/net/neighbour.h index 2326044d5294..c63016cc4b27 100644 --- a/include/net/neighbour.h +++ b/include/net/neighbour.h @@ -79,7 +79,7 @@ struct neigh_parms { struct rcu_head rcu_head; int reachable_time; - int qlen; + u32 qlen; int data[NEIGH_VAR_DATA_MAX]; DECLARE_BITMAP(data_state, NEIGH_VAR_DATA_MAX); }; diff --git a/net/core/neighbour.c b/net/core/neighbour.c index d5b28c158144..fa9da27c0676 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -223,7 +223,31 @@ static int neigh_del_timer(struct neighbour *n) return 0; } -static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net) +static struct neigh_parms *neigh_get_dev_parms_rcu(struct net_device *dev, + int family) +{ + switch (family) { + case AF_INET: + return __in_dev_arp_parms_get_rcu(dev); + case AF_INET6: + return __in6_dev_nd_parms_get_rcu(dev); + } + return NULL; +} + +static void neigh_parms_qlen_dec(struct net_device *dev, int family) +{ + struct neigh_parms *p; + + rcu_read_lock(); + p = neigh_get_dev_parms_rcu(dev, family); + if (p) + p->qlen--; + rcu_read_unlock(); +} + +static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net, + int family) { struct sk_buff_head tmp; unsigned long flags; @@ -237,13 +261,7 @@ static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net) struct net_device *dev = skb->dev; if (net == NULL || net_eq(dev_net(dev), net)) { - struct in_device *in_dev; - - rcu_read_lock(); - in_dev = __in_dev_get_rcu(dev); - if (in_dev) - 
in_dev->arp_parms->qlen--; - rcu_read_unlock(); + neigh_parms_qlen_dec(dev, family); __skb_unlink(skb, list); __skb_queue_tail(&tmp, skb); } @@ -321,7 +339,8 @@ int neigh_ifdown(struct neigh_table *tbl, struct net_device *dev) neigh_flush_dev(tbl, dev); pneigh_ifdown(tbl, dev); write_unlock_bh(&tbl->lock); - pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev)); + pneigh_queue_purge(&tbl->proxy_queue, dev ? dev_net(dev) : NULL, +
Re: [Devel] [PATCH vz7] net: neigh: decrement the family specific qlen
Do we also need f8017317cb0b2 ("net, neigh: Fix null-ptr-deref in neigh_table_clear()") while on it? Looks good. Reviewed-by: Pavel Tikhomirov On 15/01/2024 16:42, Alexander Atanasov wrote: From: Thomas Zeitlhofer Commit 0ff4eb3d5ebb ("neighbour: make proxy_queue.qlen limit per-device") introduced the length counter qlen in struct neigh_parms. There are separate neigh_parms instances for IPv4/ARP and IPv6/ND, and while the family specific qlen is incremented in pneigh_enqueue(), the mentioned commit decrements always the IPv4/ARP specific qlen, regardless of the currently processed family, in pneigh_queue_purge() and neigh_proxy_process(). As a result, with IPv6/ND, the family specific qlen is only incremented (and never decremented) until it exceeds PROXY_QLEN, and then, according to the check in pneigh_enqueue(), neighbor solicitations are not answered anymore. As an example, this is noted when using the subnet-router anycast address to access a Linux router. After a certain amount of time (in the observed case, qlen exceeded PROXY_QLEN after two days), the Linux router stops answering neighbor solicitations for its subnet-router anycast address and effectively becomes unreachable. Another result with IPv6/ND is that the IPv4/ARP specific qlen is decremented more often than incremented. This leads to negative qlen values, as a signed integer has been used for the length counter qlen, and potentially to an integer overflow. Fix this by introducing the helper function neigh_parms_qlen_dec(), which decrements the family specific qlen. Thereby, make use of the existing helper function neigh_get_dev_parms_rcu(), whose definition therefore needs to be placed earlier in neighbour.c. Take the family member from struct neigh_table to determine the currently processed family and appropriately call neigh_parms_qlen_dec() from pneigh_queue_purge() and neigh_proxy_process(). Additionally, use an unsigned integer for the length counter qlen. 
Fixes: 0ff4eb3d5ebb ("neighbour: make proxy_queue.qlen limit per-device") Signed-off-by: Thomas Zeitlhofer Signed-off-by: David S. Miller (cherry picked from commit 8207f253a097fe15c93d85ac15ebb73c5e39e1e1) https://virtuozzo.atlassian.net/browse/PSBM-153018 https://bugs.openvz.org/browse/OVZ-7410 Signed-off-by: Alexander Atanasov --- include/net/neighbour.h | 2 +- net/core/neighbour.c| 57 + 2 files changed, 30 insertions(+), 29 deletions(-) diff --git a/include/net/neighbour.h b/include/net/neighbour.h index 2326044d5294..c63016cc4b27 100644 --- a/include/net/neighbour.h +++ b/include/net/neighbour.h @@ -79,7 +79,7 @@ struct neigh_parms { struct rcu_head rcu_head; int reachable_time; - int qlen; + u32 qlen; int data[NEIGH_VAR_DATA_MAX]; DECLARE_BITMAP(data_state, NEIGH_VAR_DATA_MAX); }; diff --git a/net/core/neighbour.c b/net/core/neighbour.c index d5b28c158144..caed48f39896 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -223,7 +223,31 @@ static int neigh_del_timer(struct neighbour *n) return 0; } -static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net) +static struct neigh_parms *neigh_get_dev_parms_rcu(struct net_device *dev, + int family) +{ + switch (family) { + case AF_INET: + return __in_dev_arp_parms_get_rcu(dev); + case AF_INET6: + return __in6_dev_nd_parms_get_rcu(dev); + } + return NULL; +} + +static void neigh_parms_qlen_dec(struct net_device *dev, int family) +{ + struct neigh_parms *p; + + rcu_read_lock(); + p = neigh_get_dev_parms_rcu(dev, family); + if (p) + p->qlen--; + rcu_read_unlock(); +} + +static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net, + int family) { struct sk_buff_head tmp; unsigned long flags; @@ -237,13 +261,7 @@ static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net) struct net_device *dev = skb->dev; if (net == NULL || net_eq(dev_net(dev), net)) { - struct in_device *in_dev; - - rcu_read_lock(); - in_dev = __in_dev_get_rcu(dev); - if (in_dev) - 
in_dev->arp_parms->qlen--; - rcu_read_unlock(); + neigh_parms_qlen_dec(dev, family); __skb_unlink(skb, list); __skb_queue_tail(&tmp, skb); } @@ -321,7 +339,7 @@ int neigh_ifdown(struct neigh_table *tbl, struct net_device *dev) neigh_flush_dev(tbl, dev); pneigh_ifdown(tbl, dev); write_unlock_bh(&tbl->lock); - pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev)); + pneigh_que
[Devel] [PATCH v3 VZ7] neighbour: purge nf_bridged skb from foreign device neigh
An skb can be added to a neigh->arp_queue while waiting for an arp reply. The original skb's skb->dev can be different from the neigh's neigh->dev. For instance in case of bridging a dnated skb from one veth to another, the skb would be added to a neigh->arp_queue of the bridge. There is no explicit mechanism that prevents the original skb->dev link of such skb from being freed under us. For instance neigh_flush_dev does not clean up skbs from a different device's neigh queue. But that original link can be used and lead to a crash on e.g. this stack: arp_process neigh_update skb = __skb_dequeue(&neigh->arp_queue) neigh_resolve_output(..., skb) ... br_nf_dev_xmit br_nf_pre_routing_finish_bridge_slow skb->dev = nf_bridge->physindev br_handle_frame_finish So let's improve neigh_flush_dev to also purge skbs when a device equal to their skb->nf_bridge->physindev gets destroyed. https://virtuozzo.atlassian.net/browse/PSBM-151735 Signed-off-by: Pavel Tikhomirov --- v2: take neigh->lock for queue modification v3: put skb_peek under lock --- net/core/neighbour.c | 27 +++ 1 file changed, 27 insertions(+) diff --git a/net/core/neighbour.c b/net/core/neighbour.c index 7f197a63c780f..4f0ee6c04f1c2 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -43,6 +43,10 @@ #include #include +#include +#include +#include + #define DEBUG #define NEIGH_DEBUG 1 #define neigh_dbg(level, fmt, ...)
\ @@ -257,6 +261,28 @@ static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net) } } +static void neigh_purge_nf_bridge_dev(struct neighbour *neigh, struct net_device *dev) +{ + struct sk_buff_head *list = &neigh->arp_queue; + struct nf_bridge_info *nf_bridge; + struct sk_buff *skb, *next; + + write_lock(&neigh->lock); + skb = skb_peek(list); + while (skb) { + nf_bridge = nf_bridge_info_get(skb); + + next = skb_peek_next(skb, list); + if (nf_bridge && nf_bridge->physindev == dev) { + __skb_unlink(skb, list); + neigh->arp_queue_len_bytes -= skb->truesize; + kfree_skb(skb); + } + skb = next; + } + write_unlock(&neigh->lock); +} + static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev) { int i; @@ -272,6 +298,7 @@ static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev) while ((n = rcu_dereference_protected(*np, lockdep_is_held(&tbl->lock))) != NULL) { if (dev && n->dev != dev) { + neigh_purge_nf_bridge_dev(n, dev); np = &n->next; continue; } -- 2.43.0
[Devel] [PATCH v2 VZ7] neighbour: purge nf_bridged skb from foreign device neigh
An skb can be added to a neigh->arp_queue while waiting for an arp reply. The original skb's skb->dev can be different from the neigh's neigh->dev. For instance in case of bridging a dnated skb from one veth to another, the skb would be added to a neigh->arp_queue of the bridge. There is no explicit mechanism that prevents the original skb->dev link of such skb from being freed under us. For instance neigh_flush_dev does not clean up skbs from a different device's neigh queue. But that original link can be used and lead to a crash on e.g. this stack: arp_process neigh_update skb = __skb_dequeue(&neigh->arp_queue) neigh_resolve_output(..., skb) ... br_nf_dev_xmit br_nf_pre_routing_finish_bridge_slow skb->dev = nf_bridge->physindev br_handle_frame_finish So let's improve neigh_flush_dev to also purge skbs when a device equal to their skb->nf_bridge->physindev gets destroyed. https://virtuozzo.atlassian.net/browse/PSBM-151735 Signed-off-by: Pavel Tikhomirov --- v2: take neigh->lock for queue modification --- net/core/neighbour.c | 26 ++ 1 file changed, 26 insertions(+) diff --git a/net/core/neighbour.c b/net/core/neighbour.c index 7f197a63c780f..2aa1c23a7f44d 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -43,6 +43,10 @@ #include #include +#include +#include +#include + #define DEBUG #define NEIGH_DEBUG 1 #define neigh_dbg(level, fmt, ...)
\ @@ -257,6 +261,27 @@ static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net) } } +static void neigh_purge_nf_bridge_dev(struct neighbour *neigh, struct net_device *dev) +{ + struct sk_buff_head *list = &neigh->arp_queue; + struct sk_buff *skb = skb_peek(list), *next; + struct nf_bridge_info *nf_bridge; + + write_lock(&neigh->lock); + while (skb) { + nf_bridge = nf_bridge_info_get(skb); + + next = skb_peek_next(skb, list); + if (nf_bridge && nf_bridge->physindev == dev) { + __skb_unlink(skb, list); + neigh->arp_queue_len_bytes -= skb->truesize; + kfree_skb(skb); + } + skb = next; + } + write_unlock(&neigh->lock); +} + static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev) { int i; @@ -272,6 +297,7 @@ static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev) while ((n = rcu_dereference_protected(*np, lockdep_is_held(&tbl->lock))) != NULL) { if (dev && n->dev != dev) { + neigh_purge_nf_bridge_dev(n, dev); np = &n->next; continue; } -- 2.43.0
[Devel] [PATCH RH7] neighbour: purge nf_bridged skb from foreign device neigh
An skb can be added to a neigh->arp_queue while waiting for an arp reply. The original skb's skb->dev can be different from the neigh's neigh->dev. For instance in case of bridging a dnated skb from one veth to another, the skb would be added to a neigh->arp_queue of the bridge. There is no explicit mechanism that prevents the original skb->dev link of such skb from being freed under us. For instance neigh_flush_dev does not clean up skbs from a different device's neigh queue. But that original link can be used and lead to a crash on e.g. this stack: arp_process neigh_update skb = __skb_dequeue(&neigh->arp_queue) neigh_resolve_output(..., skb) ... br_nf_dev_xmit br_nf_pre_routing_finish_bridge_slow skb->dev = nf_bridge->physindev br_handle_frame_finish So let's improve neigh_flush_dev to also purge skbs when a device equal to their skb->nf_bridge->physindev gets destroyed. https://virtuozzo.atlassian.net/browse/PSBM-151735 Signed-off-by: Pavel Tikhomirov --- net/core/neighbour.c | 24 1 file changed, 24 insertions(+) diff --git a/net/core/neighbour.c b/net/core/neighbour.c index 7f197a63c780f..e2e35d060ee0a 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -43,6 +43,10 @@ #include #include +#include +#include +#include + #define DEBUG #define NEIGH_DEBUG 1 #define neigh_dbg(level, fmt, ...)
\ @@ -257,6 +261,25 @@ static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net) } } +static void neigh_purge_nf_bridge_dev(struct neighbour *neigh, struct net_device *dev) +{ + struct sk_buff_head *list = &neigh->arp_queue; + struct sk_buff *skb = skb_peek(list), *next; + struct nf_bridge_info *nf_bridge; + + while (skb) { + nf_bridge = nf_bridge_info_get(skb); + + next = skb_peek_next(skb, list); + if (nf_bridge && nf_bridge->physindev == dev) { + __skb_unlink(skb, list); + neigh->arp_queue_len_bytes -= skb->truesize; + kfree_skb(skb); + } + skb = next; + } +} + static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev) { int i; @@ -272,6 +295,7 @@ static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev) while ((n = rcu_dereference_protected(*np, lockdep_is_held(&tbl->lock))) != NULL) { if (dev && n->dev != dev) { + neigh_purge_nf_bridge_dev(n, dev); np = &n->next; continue; } -- 2.43.0
Re: [Devel] [PATCH] mm: Drop swap_cache_info reporting in vzstat
Looks good. Reviewed-by: Pavel Tikhomirov While on it, can we also remove swap_cache_info_struct in include/linux/vzstat.h as it looks unused? On 29/12/2023 01:15, Konstantin Khorenko wrote: Mainstream has dropped swap_cache_info statistics: 442701e7058b ("mm/swap: remove swap_cache_info statistics") So we are dropping reporting it via the /proc/vz/stats interface. We could leave the format of the /proc/vz/stats file the same (it is an interface after all, should be stable), but as in vz9 we'll have so many changes, the vzstat utility also has to be rewritten, so it's a good time to drop the junk code rather than preserve the interface file format. Of course we inc the format version to 2.7. This patch: * fully reverts 44a915de19aa ("mm: Export swap_cache_info struct and variable") * removes a couple of hunks in 5952b9c2f339 ("vzstat: Add vzstat module and kstat interfaces") - related to swap_cache_info https://virtuozzo.atlassian.net/browse/PSBM-152466 Signed-off-by: Konstantin Khorenko Feature: ve: extra statistics (mm, latency) --- include/linux/swap.h | 9 - include/linux/vzstat.h | 7 --- kernel/ve/vzstat.c | 16 +--- mm/swap_state.c| 3 --- 4 files changed, 1 insertion(+), 34 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 9348574b5bb7..1af5b711bc00 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -715,14 +715,5 @@ static inline bool mem_cgroup_swap_full(struct page *page) } #endif -struct swap_cache_info { - unsigned long add_total; - unsigned long del_total; - unsigned long find_success; - unsigned long find_total; -}; - -extern struct swap_cache_info swap_cache_info; - #endif /* __KERNEL__*/ #endif /* _LINUX_SWAP_H */ diff --git a/include/linux/vzstat.h b/include/linux/vzstat.h index 505487117ca4..4a5738bd036f 100644 --- a/include/linux/vzstat.h +++ b/include/linux/vzstat.h @@ -14,13 +14,6 @@ #include #include -struct swap_cache_info_struct { - unsigned long add_total; - unsigned long del_total; - unsigned long find_success; -
unsigned long find_total; -}; - struct kstat_zone_avg { unsigned long free_pages_avg[3], nr_active_avg[3], diff --git a/kernel/ve/vzstat.c b/kernel/ve/vzstat.c index 7ffd5ec7f2a3..73063a4189d7 100644 --- a/kernel/ve/vzstat.c +++ b/kernel/ve/vzstat.c @@ -27,7 +27,6 @@ #include #include #include -#include #include @@ -385,18 +384,6 @@ static void kernel_text_csum_seq_show(struct seq_file *m, void *v) seq_printf(m, "kernel_text_csum_broken: %d\n", 0); } -static void swap_cache_seq_show(struct seq_file *m, void *v) -{ - struct swap_cache_info *swpcache; - - swpcache = &swap_cache_info; - seq_printf(m, "Swap cache: add %lu, del %lu, find %lu/%lu\n", - swpcache->add_total, - swpcache->del_total, - swpcache->find_success, - swpcache->find_total); -} - /* * Declare special structure to store summarized statistics. The 'struct zone' * is not used because of it's tremendous size. @@ -668,14 +655,13 @@ static int stats_seq_show(struct seq_file *m, void *v) { if (!v) return 0; - seq_puts(m, "Version: 2.6\n"); + seq_puts(m, "Version: 2.7\n"); cycles_per_jiffy_show(m, v); jiffies_per_second_show(m, v); seq_puts(m, "\nLoad info:\n"); task_counts_seq_show(m, v); seq_puts(m, "\nMemory info:\n"); kernel_text_csum_seq_show(m, v); - swap_cache_seq_show(m, v); mem_free_areas_show(m, v); mem_avg_show(m, v); mem_fails_show(m, v); diff --git a/mm/swap_state.c b/mm/swap_state.c index 034dd5d0c6d9..b16dc348bbd2 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -59,9 +59,6 @@ static bool enable_vma_readahead __read_mostly = true; #define GET_SWAP_RA_VAL(vma) \ (atomic_long_read(&(vma)->swap_readahead_info) ? : 4) -struct swap_cache_info swap_cache_info; -EXPORT_SYMBOL_GPL(swap_cache_info); - static atomic_t swapin_readahead_hits = ATOMIC_INIT(4); void show_swap_cache_info(void) -- Best regards, Tikhomirov Pavel Senior Software Developer, Virtuozzo.
[Devel] [PATCH RH7] br_netfilter: add debug to see what happens in br_nf_pre_routing_finish
We see a packet with skb->dev already freed, which should have passed this place in br_nf_pre_routing_finish as it has nf_bridge->neigh_header set to the src mac, which can be set only later in this stack in br_nf_pre_routing_finish_bridge. So I try to print all information I can imagine to help understand this problem. We see this skb taken from neigh->arp_queue and then crash on it, and here: +-> br_nf_pre_routing +-> br_nf_pre_routing_finish +-> skb->dev = nf_bridge->physindev #1 +-> br_nf_pre_routing_finish_bridge #2 +-> nf_bridge->bridged_dnat = 1 #1 +-> neighbour->output #2 +-> neigh_resolve_output +-> neigh_event_send +-> __neigh_event_send +-> __skb_queue_tail(&neigh->arp_queue, skb) skb can actually get to arp_queue. This is the second confirmation that something wrong likely happened on this stack. https://virtuozzo.atlassian.net/browse/PSBM-151735 Signed-off-by: Pavel Tikhomirov --- net/bridge/br_netfilter_hooks.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c index eb8ad039c343f..6258fe55a04eb 100644 --- a/net/bridge/br_netfilter_hooks.c +++ b/net/bridge/br_netfilter_hooks.c @@ -386,6 +386,17 @@ static int br_nf_pre_routing_finish(struct sock *sk, struct sk_buff *skb) } else { if (skb_dst(skb)->dev == dev) { bridged_dnat: + /* +* Debug for PSBM-151735, I'm really curious how we can get here. +*/ + WARN_ONCE(1, "br_nf_pre_routing_finish: saddr=%pI4 daddr=%pI4 nf_bridge_daddr=%pI4 " +"dev=%s skb->dev=%s nf_bridge->indev=%s " +"skb->dst->dev=%s " +"err=%d indev_forward=%d\n", + &iph->saddr, &iph->daddr, &nf_bridge->ipv4_daddr, + dev->name, skb->dev->name, nf_bridge->physindev ? nf_bridge->physindev->name : "", + skb_dst(skb) && skb_dst(skb)->dev ?
skb_dst(skb)->dev->name : "", + err, IN_DEV_FORWARD(__in_dev_get_rcu(dev))); skb->dev = nf_bridge->physindev; nf_bridge_update_protocol(skb); nf_bridge_push_encap_header(skb); -- 2.43.0
Re: [Devel] [PATCH vz9] sched: Do not set LBF_NEED_BREAK flag if scanned all the tasks
On 16/12/2023 01:21, Konstantin Khorenko wrote: On 15.12.2023 12:19, Pavel Tikhomirov wrote: On 14/12/2023 06:01, Konstantin Khorenko wrote: After ms commit b0defa7ae03e ("sched/fair: Make sure to try to detach at least one movable task") detach_tasks() does not stop on the condition (env->loop > env->loop_max) in case no movable task found. Instead of that (if there are no movable tasks in the rq) exits always happen on the loop_break check - thus with LBF_NEED_BREAK flag set. It's not a problem for mainstream because load_balance() proceeds with balancing in case LBF_NEED_BREAK is set only in case (env.loop < busiest->nr_running), but in Virtuozzo kernel with CFS_CPULIMIT feature right before that we try to move a whole task group (object of the CFS_CPULIMIT feature) and resets env.loop to zero, so we get a livelock here. Resetting env.loop makes sense in case we have successfully moved some tasks (in the scope of the task group), but if we failed to move any task, no progress is expected during further balancing attempts. Ways to fix it:   1. In load_balance() restore old env.loop in case no tasks were moved  by move_task_groups()   2. Add a check in detach_tasks() to exit without LBF_NEED_BREAK flag in  case we have scanned all tasks and have not found movable tasks. Current patch implements the second way. 
Caused by ms commit: b0defa7ae03e ("sched/fair: Make sure to try to detach at least one movable task") Fixes ms commit: bca010328248 ("sched: Port CONFIG_CFS_CPULIMIT feature") this is not "ms" commit Reviewed-by: Pavel Tikhomirov Signed-off-by: Konstantin Khorenko Feature: sched: ability to limit number of CPUs available to a CT ---   kernel/sched/fair.c | 6 ++   1 file changed, 6 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9367d16a8d85..e068bb90f197 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8771,6 +8771,12 @@ static int detach_tasks(struct lb_env *env)   if (env->loop > env->loop_max &&   !(env->flags & LBF_ALL_PINNED))   break; +   /* + * Quit if we have scanned all tasks even in case we haven't + * found any movable task. + */ +   if (env->loop > env->src_rq->nr_running) +   break; The env->loop is clearly designed in the load_balance function to be reused on the next iteration of "goto more_balance". The fact that we clear it to zero in our code is really strange... Maybe we can have a separate env.vzloop variable for our part of the code (or we can properly integrate with env.loop without zeroing it?)? Well, in fact load_balance() contains 2 other places where env.loop is reset to 0 as well and the execution is sent to "more_balance" after that. Agree. In general I understand the idea of resetting env.loop in our case as well: we have moved the task group (our feature) and after that we would like to balance the load again - so resetting the loop counter looks logical. The other question is that we reset env.loop even in case we have not really moved any task group and thus restarting the balancing process is incorrect. I thought about checking the result of move_task_groups() and if it's zero, just restoring the env.loop counter to the value it had before the task group move attempt.
Maybe it is worth doing, but I just decided to resurrect the previous behavior - thus do not set the LBF_NEED_BREAK flag if we reach the end of the run queue. Do we also need the same crazy LBF_ALL_PINNED change for move_task_groups's "if (env->loop > env->loop_max) break;" check? Similar to what b0defa7ae03ec ("sched/fair: Make sure to try to detach at least one movable task") does in detach_tasks to allow it to handle the case when the lru tail is pinned? And again, I thought about it. move_task_groups() increases env.loop only in case it finds a task to move, so in case some tasks are pinned, env.loop is not increased and we will go further. Agree. So in move_task_groups() we exit when either: 1. we scanned all groups (and did not find enough tasks to move to fix the imbalance) 2. we scanned some groups and found/moved tasks > env.loop_max 3. we scanned some groups, found/moved some tasks and their number was enough to fix the imbalance Seems reasonable. Thanks for the explanation!   /* take a breather every nr_migrate tasks */   if (env->loop > env->loop_break) { -- Best regards, Tikhomirov Pavel Senior Software Developer, Virtuozzo. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH vz9] sched: Do not set LBF_NEED_BREAK flag if scanned all the tasks
On 14/12/2023 06:01, Konstantin Khorenko wrote: After ms commit b0defa7ae03e ("sched/fair: Make sure to try to detach at least one movable task") detach_tasks() does not stop on the condition (env->loop > env->loop_max) in case no movable task is found. Instead of that (if there are no movable tasks in the rq) exits always happen on the loop_break check - thus with the LBF_NEED_BREAK flag set. It's not a problem for mainstream because load_balance() proceeds with balancing in case LBF_NEED_BREAK is set only in case (env.loop < busiest->nr_running), but in the Virtuozzo kernel with the CFS_CPULIMIT feature right before that we try to move a whole task group (the object of the CFS_CPULIMIT feature) and reset env.loop to zero, so we get a livelock here. Resetting env.loop makes sense in case we have successfully moved some tasks (in the scope of the task group), but if we failed to move any task, no progress is expected during further balancing attempts. Ways to fix it: 1. In load_balance() restore the old env.loop in case no tasks were moved by move_task_groups() 2. Add a check in detach_tasks() to exit without the LBF_NEED_BREAK flag in case we have scanned all tasks and have not found movable tasks. The current patch implements the second way. Caused by ms commit: b0defa7ae03e ("sched/fair: Make sure to try to detach at least one movable task") Fixes ms commit: bca010328248 ("sched: Port CONFIG_CFS_CPULIMIT feature") this is not "ms" commit Signed-off-by: Konstantin Khorenko Feature: sched: ability to limit number of CPUs available to a CT --- kernel/sched/fair.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9367d16a8d85..e068bb90f197 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8771,6 +8771,12 @@ static int detach_tasks(struct lb_env *env) if (env->loop > env->loop_max && !(env->flags & LBF_ALL_PINNED)) break; + /* +* Quit if we have scanned all tasks even in case we haven't +* found any movable task.
+*/ + if (env->loop > env->src_rq->nr_running) + break; The env->loop is clearly designed in the load_balance function to be reused on the next iteration of "goto more_balance". The fact that we clear it to zero in our code is really strange... Maybe we can have a separate env.vzloop variable for our part of the code (or we can properly integrate with env.loop without zeroing it?)? Do we also need the same crazy LBF_ALL_PINNED change for move_task_groups's "if (env->loop > env->loop_max) break;" check? Similar to what b0defa7ae03ec ("sched/fair: Make sure to try to detach at least one movable task") does in detach_tasks to allow it to handle the case when the lru tail is pinned? /* take a breather every nr_migrate tasks */ if (env->loop > env->loop_break) { -- Best regards, Tikhomirov Pavel Senior Software Developer, Virtuozzo. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH7] x86/kprobes: Fix kprobes instruction boundary check with CONFIG_RETHUNK
From: "Masami Hiramatsu (Google)" Since CONFIG_RETHUNK and CONFIG_SLS will use INT3 for stopping speculative execution after a RET instruction, kprobes always fails to check the probed instruction boundary by decoding the function body if the probed address is after such a sequence. (Note that some conditional code blocks will be placed after the function return, if the compiler decides they are not on the hot path.) This is because kprobes expects that kgdb put the INT3 as a software breakpoint and that it replaces the original instruction. But these INT3s serve a different purpose; there is no need to recover the original instruction. To avoid this issue, kprobes checks whether the INT3 is owned by kgdb or not, and if so, stops decoding and makes it fail. The other INT3s come from CONFIG_RETHUNK/CONFIG_SLS and those can be treated as one-byte instructions. Fixes: e463a09af2f0 ("x86: Add straight-line-speculation mitigation") Suggested-by: Peter Zijlstra Signed-off-by: Masami Hiramatsu (Google) Signed-off-by: Peter Zijlstra (Intel) Cc: sta...@vger.kernel.org Link: https://lore.kernel.org/r/167146051026.1374301.392728975473572291.stgit@devnote3 Changes: - s/INT3_INSN_OPCODE/BREAKPOINT_INSTRUCTION/ - include kgdb_has_hit_break implementation If a function return (INT3 instruction) gets compiled into the middle of the function, we fail to insert a kprobe at any point after it. To be able to insert kprobes everywhere in the function code we need this patch.
(cherry picked from commit 1993bf97992df2d560287f3c4120eda57426843d) Signed-off-by: Pavel Tikhomirov --- arch/x86/kernel/kprobes/core.c | 10 +++--- include/linux/kgdb.h | 1 + kernel/debug/debug_core.c | 12 3 files changed, 20 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c index 2be0dde130f14..9dc7f48e84ea9 100644 --- a/arch/x86/kernel/kprobes/core.c +++ b/arch/x86/kernel/kprobes/core.c @@ -48,6 +48,7 @@ #include #include #include +#include #include #include #include @@ -300,12 +301,15 @@ static int __kprobes can_probe(unsigned long paddr) kernel_insn_init(, (void *)__addr, MAX_INSN_SIZE); insn_get_length(); +#ifdef CONFIG_KGDB /* -* Another debugging subsystem might insert this breakpoint. -* In that case, we can't recover it. +* If there is a dynamically installed kgdb sw breakpoint, +* this function should not be probed. */ - if (insn.opcode.bytes[0] == BREAKPOINT_INSTRUCTION) + if (insn.opcode.bytes[0] == BREAKPOINT_INSTRUCTION && + kgdb_has_hit_break(addr)) return 0; +#endif addr += insn.length; } diff --git a/include/linux/kgdb.h b/include/linux/kgdb.h index 6b06d378f3dfe..47a8fd917b3b5 100644 --- a/include/linux/kgdb.h +++ b/include/linux/kgdb.h @@ -305,6 +305,7 @@ extern int kgdb_hex2mem(char *buf, char *mem, int count); extern int kgdb_isremovedbreak(unsigned long addr); extern void kgdb_schedule_breakpoint(void); +extern int kgdb_has_hit_break(unsigned long addr); extern int kgdb_handle_exception(int ex_vector, int signo, int err_code, diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c index 8865caec45fbc..e5872e38fae51 100644 --- a/kernel/debug/debug_core.c +++ b/kernel/debug/debug_core.c @@ -356,6 +356,18 @@ int kgdb_isremovedbreak(unsigned long addr) return 0; } +int kgdb_has_hit_break(unsigned long addr) +{ + int i; + + for (i = 0; i < KGDB_MAX_BREAKPOINTS; i++) { + if (kgdb_break[i].state == BP_ACTIVE && + kgdb_break[i].bpt_addr == addr) + return 1; + } + return 0; +} + int 
dbg_remove_all_break(void) { int error; -- 2.43.0 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH7] mm/memcontrol: prohibit writing to memory.numa_migrate from container
We might want to put containers on designated numa nodes for optimal performance, and it will all be ruined if a container can force its memory pages to move to any node it wants. This memory.numa_migrate file was originally made for vcmmd which works from ve0, so we should be fine with this additional restriction. Fixes: dfc0b63bfd50c ("mm: memcontrol: add memory.numa_migrate file") https://virtuozzo.atlassian.net/browse/PSBM-152372 Signed-off-by: Pavel Tikhomirov Feature: mm: interface to migrate memory between NUMA nodes upon userspace request --- mm/memcontrol.c | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index a2fe48d93d02a..5d65b523a0ec8 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5812,8 +5812,8 @@ static int memcg_numa_migrate_pages(struct mem_cgroup *memcg, * * The call may be interrupted by a signal, in which case -EINTR is returned. */ -static int memcg_numa_migrate_write(struct cgroup *cont, - struct cftype *cft, const char *buf) +static int __memcg_numa_migrate_write(struct cgroup *cont, struct cftype *cft, + const char *buf) { struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); NODEMASK_ALLOC(nodemask_t, target_nodes, GFP_KERNEL); @@ -5851,6 +5851,15 @@ static int memcg_numa_migrate_write(struct cgroup *cont, return ret; } +static int memcg_numa_migrate_write(struct cgroup *cont, struct cftype *cft, + const char *buf) +{ + if (!ve_is_super(get_exec_env())) + return -EPERM; + + return __memcg_numa_migrate_write(cont, cft, buf); +} + #endif /* CONFIG_NUMA */ static inline void mem_cgroup_lru_names_not_uptodate(void) -- 2.43.0 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH9] mm/memcontrol: prohibit writing to memory.numa_migrate from container
We might want to put containers on designated numa nodes for optimal performance, and it will all be ruined if a container can force its memory pages to move to any node it wants. This memory.numa_migrate file was originally made for vcmmd which works from ve0, so we should be fine with this additional restriction. Fixes: dfc0b63bfd50c ("mm: memcontrol: add memory.numa_migrate file") https://virtuozzo.atlassian.net/browse/PSBM-152372 Signed-off-by: Pavel Tikhomirov Feature: mm: interface to migrate memory between NUMA nodes upon userspace request --- mm/memcontrol.c | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index fd706aa1bbbd1..b7c6ee09ab9fc 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4700,8 +4700,8 @@ static int memcg_numa_migrate_pages(struct mem_cgroup *memcg, * * The call may be interrupted by a signal, in which case -EINTR is returned. */ -static ssize_t memcg_numa_migrate_write(struct kernfs_open_file *of, char *buf, - size_t nbytes, loff_t off) +static ssize_t __memcg_numa_migrate_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) { struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); NODEMASK_ALLOC(nodemask_t, target_nodes, GFP_KERNEL); @@ -4739,6 +4739,15 @@ static ssize_t memcg_numa_migrate_write(struct kernfs_open_file *of, char *buf, return ret ?: nbytes; } +static ssize_t memcg_numa_migrate_write(struct kernfs_open_file *of, char *buf, + size_t nbytes, loff_t off) +{ + if (!ve_is_super(get_exec_env())) + return -EPERM; + + return __memcg_numa_migrate_write(of, buf, nbytes, off); +} + #endif /* CONFIG_NUMA */ static const unsigned int memcg1_stats[] = { -- 2.43.0 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH7 1/3] netfilter: nf_tables: use call_rcu in netlink dumps
From: Florian Westphal We can make all dumps and lookups lockless. Dumps currently only hold the nfnl mutex on the dump request itself. Dumps can span multiple syscalls, dump continuation doesn't acquire the nfnl mutex anywhere, i.e. the dump callbacks in nf_tables already use rcu and never rely on the nfnl mutex being held. So, just switch all dumpers to rcu. This requires taking a module reference before dropping the rcu lock so rmmod is blocked; we also need to hold the module reference over the entire dump operation sequence. netlink already supports this via the .module member in the netlink_dump_control struct. For the non-dump case (i.e. lookup of specific tables, chains, etc.), we need to switch to the _rcu list iteration primitives and make sure we use GFP_ATOMIC. This patch also adds the new nft_netlink_dump_start_rcu() helper that takes care of the get-ref, drop-rcu-lock, start-dump, get-rcu-lock, put-ref sequence. The helper will be reused for all dumps. The rationale in all dump requests is: - use the nft_netlink_dump_start_rcu helper added in the first patch - use GFP_ATOMIC and rcu list iteration - switch to .call_rcu ... thus making all dumps in nf_tables not depend on the nfnl mutex anymore. In nf_tables_getgen: this callback just fetches the current base sequence, there is no need to serialize this with the nfnl nft mutex.
Signed-off-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso (cherry picked from commit d9adf22a2918832ac94335fddf1950b12543d0b5) Changes: - s/>nft.tables/>tables/ - drop most of "elements" related hunks - drop "obj" related hunks - drop "flowtable" related hunks - employ rcu in nf_tables_afinfo_lookup when needed https://virtuozzo.atlassian.net/browse/PSBM-150147 Signed-off-by: Pavel Tikhomirov --- net/netfilter/nf_tables_api.c | 135 ++ 1 file changed, 87 insertions(+), 48 deletions(-) diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index 8beed7bd2970..2a3680da9cd2 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -63,15 +63,17 @@ static struct nft_af_info *nft_afinfo_lookup(struct net *net, int family) { struct nft_af_info *afi; - list_for_each_entry(afi, >nft.af_info, list) { + /* called either with rcu_read_lock or nfnl_lock held */ + list_for_each_entry_rcu(afi, >nft.af_info, list) { if (afi->family == family) return afi; } + return NULL; } static struct nft_af_info * -nf_tables_afinfo_lookup(struct net *net, int family, bool autoload) +nf_tables_afinfo_lookup(struct net *net, int family, bool autoload, bool rcu) { struct nft_af_info *afi; @@ -80,9 +82,17 @@ nf_tables_afinfo_lookup(struct net *net, int family, bool autoload) return afi; #ifdef CONFIG_MODULES if (autoload) { - nfnl_unlock(NFNL_SUBSYS_NFTABLES); + if (rcu) { + rcu_read_unlock(); + } else { + nfnl_unlock(NFNL_SUBSYS_NFTABLES); + } request_module("nft-afinfo-%u", family); - nfnl_lock(NFNL_SUBSYS_NFTABLES); + if (rcu) { + rcu_read_lock(); + } else { + nfnl_lock(NFNL_SUBSYS_NFTABLES); + } afi = nft_afinfo_lookup(net, family); if (afi != NULL) return ERR_PTR(-EAGAIN); @@ -361,7 +371,7 @@ static struct nft_table *nft_table_lookup(const struct nft_af_info *afi, { struct nft_table *table; - list_for_each_entry(table, >tables, list) { + list_for_each_entry_rcu(table, >tables, list) { if (!nla_strcmp(nla, table->name) && 
nft_active_genmask(table, genmask)) return table; @@ -540,6 +550,24 @@ static int nf_tables_dump_tables(struct sk_buff *skb, return skb->len; } +static int nft_netlink_dump_start_rcu(struct sock *nlsk, struct sk_buff *skb, + const struct nlmsghdr *nlh, + struct netlink_dump_control *c) +{ + int err; + + if (!try_module_get(THIS_MODULE)) + return -EINVAL; + + rcu_read_unlock(); + err = netlink_dump_start(nlsk, skb, nlh, c); + rcu_read_lock(); + module_put(THIS_MODULE); + + return err; +} + +/* called with rcu_read_lock held */ static int nf_tables_gettable(struct sock *nlsk, struct sk_buff *skb, const struct nlmsghdr *nlh, const struct nlattr * const nla[]) @@ -556,11 +584,13 @@ static int nf_tables_gettable(struct sock *nlsk, struct sk_buff *skb, if (nlh->nlmsg_flags & NLM_F_DUMP) { struct netlink_dump_control c = {
[Devel] [PATCH RH7 2/3] netfilter: nf_tables: fix oops during rule dump
From: Florian Westphal We can oops in nf_tables_fill_rule_info(). Its not possible to fetch previous element in rcu-protected lists when deletions are not prevented somehow: list_del_rcu poisons the ->prev pointer value. Before rcu-conversion this was safe as dump operations did hold nfnetlink mutex. Pass previous rule as argument, obtained by keeping a pointer to the previous rule during traversal. Fixes: d9adf22a291883 ("netfilter: nf_tables: use call_rcu in netlink dumps") Signed-off-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso (cherry picked from commit 2c82c7e724ff51cab78e1afd5c2aaa31994fe41e) Changes: - move hunks from __nf_tables_dump_rules to nf_tables_dump_rules https://virtuozzo.atlassian.net/browse/PSBM-150147 Signed-off-by: Pavel Tikhomirov --- net/netfilter/nf_tables_api.c | 19 ++- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index 2a3680da9cd2..21b7b0f81f8f 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -1890,13 +1890,13 @@ static int nf_tables_fill_rule_info(struct sk_buff *skb, struct net *net, u32 flags, int family, const struct nft_table *table, const struct nft_chain *chain, - const struct nft_rule *rule) + const struct nft_rule *rule, + const struct nft_rule *prule) { struct nlmsghdr *nlh; struct nfgenmsg *nfmsg; const struct nft_expr *expr, *next; struct nlattr *list; - const struct nft_rule *prule; int type = event | NFNL_SUBSYS_NFTABLES << 8; nlh = nlmsg_put(skb, portid, seq, type, sizeof(struct nfgenmsg), @@ -1917,8 +1917,7 @@ static int nf_tables_fill_rule_info(struct sk_buff *skb, struct net *net, NFTA_RULE_PAD)) goto nla_put_failure; - if ((event != NFT_MSG_DELRULE) && (rule->list.prev != >rules)) { - prule = list_entry(rule->list.prev, struct nft_rule, list); + if (event != NFT_MSG_DELRULE && prule) { if (nla_put_be64(skb, NFTA_RULE_POSITION, cpu_to_be64(prule->handle), NFTA_RULE_PAD)) @@ -1967,7 +1966,7 @@ static 
int nf_tables_rule_notify(const struct nft_ctx *ctx, err = nf_tables_fill_rule_info(skb, ctx->net, ctx->portid, ctx->seq, event, 0, ctx->afi->family, ctx->table, - ctx->chain, rule); + ctx->chain, rule, NULL); if (err < 0) { kfree_skb(skb); goto err; @@ -1996,7 +1995,7 @@ static int nf_tables_dump_rules(struct sk_buff *skb, const struct nft_af_info *afi; const struct nft_table *table; const struct nft_chain *chain; - const struct nft_rule *rule; + const struct nft_rule *rule, *prule = NULL; unsigned int idx = 0, s_idx = cb->args[0]; struct net *net = sock_net(skb->sk); int family = nfmsg->nfgen_family; @@ -2020,7 +2019,7 @@ static int nf_tables_dump_rules(struct sk_buff *skb, list_for_each_entry_rcu(rule, >rules, list) { if (!nft_is_active(net, rule)) - goto cont; + goto cont_skip; if (idx < s_idx) goto cont; if (idx > s_idx) @@ -2030,11 +2029,13 @@ static int nf_tables_dump_rules(struct sk_buff *skb, cb->nlh->nlmsg_seq, NFT_MSG_NEWRULE, NLM_F_MULTI | NLM_F_APPEND, - afi->family, table, chain, rule) < 0) + afi->family, table, chain, rule, prule) < 0) goto done; nl_dump_check_consistent(cb, nlmsg_hdr(skb)); cont: + prule = rule; +cont_skip: idx++; } } @@ -2134,7 +2135,7 @@ static int nf_tables_getrule(struct sock *nlsk, struct sk_buff *skb, err = nf_tables_fill_rule_inf
[Devel] [PATCH RH7 3/3] netfilter: nf_tables: use list_entry_rcu in nft_do_chain
We already use list_for_each_entry_continue_rcu two lines below, it is thus logical to also use list_entry_rcu there. https://virtuozzo.atlassian.net/browse/PSBM-150147 Signed-off-by: Pavel Tikhomirov --- net/netfilter/nf_tables_core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/netfilter/nf_tables_core.c b/net/netfilter/nf_tables_core.c index 81ccbca32fa8..2fc814445af1 100644 --- a/net/netfilter/nf_tables_core.c +++ b/net/netfilter/nf_tables_core.c @@ -141,7 +141,7 @@ nft_do_chain(struct nft_pktinfo *pkt, const struct nf_hook_ops *ops) nft_trace_init(&info, pkt, &regs.verdict, basechain); do_chain: rulenum = 0; - rule = list_entry(&chain->rules, struct nft_rule, list); + rule = list_entry_rcu(&chain->rules, struct nft_rule, list); next_rule: regs.verdict.code = NFT_CONTINUE; list_for_each_entry_continue_rcu(rule, &chain->rules, list) { -- 2.41.0 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH7 0/3] netfilter: nf_tables: switch read path to rcu
We have a customer claiming that iptables-nft takes too long to list rules from a container on big systems. So we remove the global nfnl_lock from the read code paths and replace it with rcu to improve performance for that case. https://virtuozzo.atlassian.net/browse/PSBM-150147 Signed-off-by: Pavel Tikhomirov Florian Westphal (2): netfilter: nf_tables: use call_rcu in netlink dumps netfilter: nf_tables: fix oops during rule dump Pavel Tikhomirov (1): netfilter: nf_tables: use list_entry_rcu in nft_do_chain net/netfilter/nf_tables_api.c | 154 + net/netfilter/nf_tables_core.c | 2 +- 2 files changed, 98 insertions(+), 58 deletions(-) -- 2.41.0 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH9 v2 3/5] mm/memcontrol: fix mem_cgroup_enough_memory on cgroup-v2
In cgroup-v2 memsw is not used anymore; instead we have swap and memory separated from one another. So let's replace memsw with mem+swap. https://jira.vzint.dev/browse/PSBM-149975 Signed-off-by: Pavel Tikhomirov --- v2: should use mem+swap instead of just mem here --- mm/memcontrol.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 3a2099920756..ac327776144d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4283,7 +4283,12 @@ static int mem_cgroup_enough_memory(struct mem_cgroup *memcg, long pages) long free; /* unused memory */ - free = memcg->memsw.max - page_counter_read(&memcg->memsw); + if (memcg->css.cgroup->root == &cgrp_dfl_root) { + free = memcg->memory.max - page_counter_read(&memcg->memory); + free += memcg->swap.max - page_counter_read(&memcg->swap); + } else { + free = memcg->memsw.max - page_counter_read(&memcg->memsw); + } /* reclaimable slabs */ free += memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) >> PAGE_SHIFT; -- 2.41.0 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH9 v2 2/5] mm/memcontrol: fix available memory calculation in mem_cgroup_enough_memory
The idea of the mem_cgroup_enough_memory check is to return ENOMEM on allocations when the container memory cgroup is close to OOM. In vz7 we have: free = memcg->memsw.limit - page_counter_read(&memcg->memsw); In vz9: free = memcg->memsw.max - page_counter_read(&memcg->memory); The error here is the memsw -> memory change in the rebase to vz8; after it we do not care about swap usage at all. This leads to allowing allocations even when all memory and swap are exhausted for the container, which is not what we want. https://jira.vzint.dev/browse/PSBM-149975 Fixes: 9417436f33d1 ("ve/mm: add heuristic check for memory overcommit") Signed-off-by: Pavel Tikhomirov --- mm/memcontrol.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 9287eed944d5..3a2099920756 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4283,7 +4283,7 @@ static int mem_cgroup_enough_memory(struct mem_cgroup *memcg, long pages) long free; /* unused memory */ - free = memcg->memsw.max - page_counter_read(&memcg->memory); + free = memcg->memsw.max - page_counter_read(&memcg->memsw); /* reclaimable slabs */ free += memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) >> PAGE_SHIFT; -- 2.41.0 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH9 v2 5/5] cgroup: show and allow mounting of cgroup-v2 in containers
The change in is_virtualized_cgroot makes cgroup-v2 "virtualized" only when it has enabled subsystems in it. And "virtualized" means that it will be: a) shown in different cgroup related proc files in the container, b) allowed to be mounted in the container and c) required to have a separate root directory for the container. Note: we don't expect the cgroup-v2 root to change enabled controllers, it either has no controllers or has at least one all the way. Note: cgroup-v2 has one hierarchy for all controllers, so skipping misc and debug for vzctl is not required anymore. Add FS_VIRTUALIZED to cgroup2 to allow mounting in the container. https://jira.vzint.dev/browse/PSBM-149975 Signed-off-by: Pavel Tikhomirov --- kernel/cgroup/cgroup.c | 16 +++- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 945130057a30..d902e675598d 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2060,18 +2060,16 @@ struct ve_struct *get_curr_ve(void) */ static inline bool is_virtualized_cgroot(struct cgroup_root *cgroot) { - /* Cgroup v2 */ - if (cgroot == &cgrp_dfl_root) - return false; - + if (cgroot != &cgrp_dfl_root) { #if IS_ENABLED(CONFIG_CGROUP_DEBUG) - if (cgroot->subsys_mask & (1 << debug_cgrp_id)) - return false; + if (cgroot->subsys_mask & (1 << debug_cgrp_id)) + return false; #endif #if IS_ENABLED(CONFIG_CGROUP_MISC) - if (cgroot->subsys_mask & (1 << misc_cgrp_id)) - return false; + if (cgroot->subsys_mask & (1 << misc_cgrp_id)) + return false; #endif + } if (cgroot->subsys_mask) return true; @@ -2628,7 +2626,7 @@ static struct file_system_type cgroup2_fs_type = { .init_fs_context= cgroup_init_fs_context, .parameters = cgroup2_fs_parameters, .kill_sb= cgroup_kill_sb, - .fs_flags = FS_USERNS_MOUNT, + .fs_flags = FS_USERNS_MOUNT | FS_VIRTUALIZED, }; #ifdef CONFIG_CPUSETS -- 2.41.0 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH9 v2 1/5] cgroup: fix uninitialized use of ctx->root
In vfs_get_tree we explicitly check that fc->root is NULL before calling the ->get_tree helper. So when mounting the cgroup2 filesystem, at the beginning of cgroup_get_tree the fc->root is uninitialized. We were lucky that ve_hide_cgroups never dereferenced it on this code path, as mounting cgroup2 from a container was prohibited and from the host ve_hide_cgroups does not dereference root. But if we allow mounting the cgroup2 filesystem in a container, this use of ctx->root in cgroup_get_tree will lead to a crash, so let's fix it. https://jira.vzint.dev/browse/PSBM-149975 Fixes: e8e4834b833c ("ve/cgroup: hide non-virtualized cgroups in container") Signed-off-by: Pavel Tikhomirov --- kernel/cgroup/cgroup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 3f685035076a..b0cf5cf66d20 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2543,7 +2543,7 @@ static int cgroup_get_tree(struct fs_context *fc) struct cgroup_fs_context *ctx = cgroup_fc2context(fc); int ret; - if (ve_hide_cgroups(ctx->root)) + if (ve_hide_cgroups(&cgrp_dfl_root)) return -EPERM; cgrp_dfl_visible = true; -- 2.41.0 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH9 v2 4/5] cgroup: allow to write to cgroup.subtree_control in ct root cgroup
This way systemd in container on cgroup-v2 can enable the controllers it requires. https://jira.vzint.dev/browse/PSBM-149975 Signed-off-by: Pavel Tikhomirov --- kernel/cgroup/cgroup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index b0cf5cf66d20..945130057a30 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -5482,7 +5482,7 @@ static struct cftype cgroup_base_files[] = { }, { .name = "cgroup.subtree_control", - .flags = CFTYPE_NS_DELEGATABLE, + .flags = CFTYPE_NS_DELEGATABLE | CFTYPE_VE_WRITABLE, .seq_show = cgroup_subtree_control_show, .write = cgroup_subtree_control_write, }, -- 2.41.0 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH9 v2 0/5] enable cgroup-v2 in containers
To test this one needs to boot with the systemd.unified_cgroup_hierarchy=1 option. https://jira.vzint.dev/browse/PSBM-149975 Signed-off-by: Pavel Tikhomirov Pavel Tikhomirov (5): cgroup: fix uninitialized use of ctx->root mm/memcontrol: fix available memory calculation in mem_cgroup_enough_memory mm/memcontrol: fix mem_cgroup_enough_memory on cgroup-v2 cgroup: allow to write to cgroup.subtree_control in ct root cgroup cgroup: show and allow mounting of cgroup-v2 in containers kernel/cgroup/cgroup.c | 20 +--- mm/memcontrol.c| 7 ++- 2 files changed, 15 insertions(+), 12 deletions(-) -- 2.41.0 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH vz7] cgroup_freezer: fix memory allocation context when checking freeze timeouts
Thanks for fixing this! On 01.09.2023 20:44, Alexander Atanasov wrote: check_freezer_timeout(...) is called under spin_lock_irq(...) from update_if_frozen(...). The issue is that it does allocations in GFP_KERNEL context - fix that by using GFP_ATOMIC. Fixes: aab9c2fafb6b ("cgroup_freezer: print information about unfreezable process") https://jira.vzint.dev/browse/PSBM-150293 Reviewed-by: Pavel Tikhomirov Signed-off-by: Alexander Atanasov --- kernel/cgroup_freezer.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c index d4747ff98090..e5da9a63813d 100644 --- a/kernel/cgroup_freezer.c +++ b/kernel/cgroup_freezer.c @@ -269,7 +269,7 @@ static void check_freezer_timeout(struct cgroup *cgroup, if (!__ratelimit(_timeout_rs)) return; - freezer_cg_name = kmalloc(PATH_MAX, GFP_KERNEL); + freezer_cg_name = kmalloc(PATH_MAX, GFP_ATOMIC); if (!freezer_cg_name) return; @@ -283,7 +283,7 @@ static void check_freezer_timeout(struct cgroup *cgroup, freezer_cg_name, __freeze_timeout/HZ, tgid, task->comm); entries = kmalloc(MAX_STACK_TRACE_DEPTH * sizeof(*entries), - GFP_KERNEL); + GFP_ATOMIC); if (!entries) goto free_cg_name; -- Best regards, Tikhomirov Pavel Senior Software Developer, Virtuozzo. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v2 rh7] vznetstat: fix statistics output when getting stats for same device in parallel
On 29.08.2023 18:01, Konstantin Khorenko wrote: Before this patch .get_stats() for venetdev was not protected against races: Documentation/networking/netdevices.txt states ndo_get_stats() is to be synchronized by the dev_base_lock rwlock, but we only READ stats, so reads can go in parallel. At the same time the algorithm of gathering per-cpu stats into a single value used a per-device structure, so in case .get_stats() is called in parallel, stats could be corrupted. Example, 2 cpu node: cpu A: .get_stats(), memset(dev->...->stats, 0), stats.rx_bytes += cpuA.rx_bytes; cpu B: .get_stats(), memset(dev->...->stats, 0), stats.rx_bytes += cpuB.rx_bytes; cpu A: return stats; cpu B: stats.rx_bytes += cpuA.rx_bytes, stats.rx_bytes += cpuB.rx_bytes, return stats => the process running on cpu A will print rx_bytes taken only from cpu B (so lower than expected), and the process running on cpu B will print too high numbers because cpu B values will be added twice. Let's make venetdev provide the ndo_get_stats64() callback which gets the storage as an argument and thus does not suffer from a race of filling a per-netdevice struct in parallel.
https://jira.vzint.dev/browse/PSBM-150027
Reviewed-by: Pavel Tikhomirov
Signed-off-by: Konstantin Khorenko
---
v2:
 * get rid of extra function venet_stats_one()
 * get rid of extra struct venet_device_stats and filling it

 drivers/net/venetdev.c | 20 ++++++++------------
 include/linux/venet.h  |  1 -
 2 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/drivers/net/venetdev.c b/drivers/net/venetdev.c
index 3ca4f8a49a5c..9785b9ba7247 100644
--- a/drivers/net/venetdev.c
+++ b/drivers/net/venetdev.c
@@ -568,25 +568,21 @@ static int venet_xmit(struct sk_buff *skb, struct net_device *dev)
 	return 0;
 }
 
-static struct net_device_stats *get_stats(struct net_device *dev)
+static void venet_get_stats64(struct net_device *dev,
+			      struct rtnl_link_stats64 *total)
 {
 	int i;
-	struct venet_stats *stats;
 
-	stats = (struct venet_stats *)dev->ml_priv;
-	memset(&stats->stats, 0, sizeof(struct net_device_stats));
 	for_each_possible_cpu(i) {
 		struct net_device_stats *dev_stats;
 
 		dev_stats = venet_stats(dev, i);
-		stats->stats.rx_bytes += dev_stats->rx_bytes;
-		stats->stats.tx_bytes += dev_stats->tx_bytes;
-		stats->stats.rx_packets += dev_stats->rx_packets;
-		stats->stats.tx_packets += dev_stats->tx_packets;
-		stats->stats.tx_dropped += dev_stats->tx_dropped;
+		total->rx_bytes += dev_stats->rx_bytes;
+		total->tx_bytes += dev_stats->tx_bytes;
+		total->rx_packets += dev_stats->rx_packets;
+		total->tx_packets += dev_stats->tx_packets;
+		total->tx_dropped += dev_stats->tx_dropped;
 	}
-
-	return &stats->stats;
 }
 
 /* Initialize the rest of the LOOPBACK device. */
@@ -712,7 +708,7 @@ static const struct ethtool_ops venet_ethtool_ops = {
 static const struct net_device_ops venet_netdev_ops = {
 	.ndo_start_xmit = venet_xmit,
-	.ndo_get_stats = get_stats,
+	.ndo_get_stats64 = venet_get_stats64,
 	.ndo_open = venet_open,
 	.ndo_stop = venet_close,
 	.ndo_init = venet_init_dev,
diff --git a/include/linux/venet.h b/include/linux/venet.h
index 51e0abeb03d7..f0a00832f6de 100644
--- a/include/linux/venet.h
+++ b/include/linux/venet.h
@@ -21,7 +21,6 @@ struct ve_struct;
 struct venet_stat;
 
 struct venet_stats {
-	struct net_device_stats stats;
 	struct net_device_stats *real_stats;
 };

--
Best regards,
Tikhomirov Pavel
Senior Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
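The interleaving described in the commit message can be replayed deterministically in userspace. The sketch below is an illustrative model, not the kernel code: two "CPUs" hold per-cpu byte counters, the old get_stats() style zeroes and sums into one shared per-device struct (so two interleaved readers corrupt each other), while the ndo_get_stats64() style sums into caller-provided storage. All names here are invented for the model.

```c
#include <assert.h>
#include <string.h>

#define NCPU 2

struct stats { unsigned long rx_bytes; };

static struct stats percpu[NCPU];   /* per-cpu counters */
static struct stats shared;         /* single per-device scratch struct: the bug */

/* one step of the old algorithm: fold one cpu's counter into the shared struct */
static void old_add(int cpu) { shared.rx_bytes += percpu[cpu].rx_bytes; }

/* fixed style: aggregate into caller-local storage, no shared scratch */
static unsigned long get_stats64(void)
{
    struct stats total;
    memset(&total, 0, sizeof(total));
    for (int cpu = 0; cpu < NCPU; cpu++)
        total.rx_bytes += percpu[cpu].rx_bytes;
    return total.rx_bytes;
}

/* Replay the exact interleaving from the commit message with two readers
 * A and B sharing the scratch struct. */
static void race_demo(unsigned long *a_result, unsigned long *b_result)
{
    memset(&shared, 0, sizeof(shared)); /* A: memset                         */
    old_add(0);                         /* A: += cpuA                        */
    memset(&shared, 0, sizeof(shared)); /* B: memset wipes A's partial sum   */
    old_add(1);                         /* A: += cpuB (into B's scratch!)    */
    *a_result = shared.rx_bytes;        /* A: return stats, only cpuB        */
    old_add(0);                         /* B: += cpuA                        */
    old_add(1);                         /* B: += cpuB, now counted twice     */
    *b_result = shared.rx_bytes;        /* B: return stats, cpuA + 2*cpuB    */
}
```

Reader A under-reports and reader B double-counts, exactly as the commit message derives; the caller-local variant is immune because nothing is shared between readers.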
Re: [Devel] [PATCH rh7] vznetstat: fix statistics output when getting stats for same device in parallel
On 28.08.2023 23:54, Konstantin Khorenko wrote:
Before this patch .get_stats() for venetdev was not protected against racing: Documentation/networking/netdevices.txt states that ndo_get_stats() is to be synchronized by the dev_base_lock rwlock, but we only READ stats, so readers can run in parallel. At the same time the algorithm gathering per-cpu stats into a single value used a per-device structure, so if .get_stats() is called in parallel, the stats can get corrupted. Example, on a 2-cpu node:

    cpu A                               cpu B
    .get_stats()
      memset(dev->...->stats, 0)
      stats.rx_bytes += cpuA.rx_bytes
                                        .get_stats()
                                          memset(dev->...->stats, 0)
      stats.rx_bytes += cpuB.rx_bytes
    return stats
                                          stats.rx_bytes += cpuA.rx_bytes
                                          stats.rx_bytes += cpuB.rx_bytes
                                        return stats

=> the process running on cpu A will print rx_bytes taken only from cpu B (so lower than expected), and the process running on cpu B will print too high numbers because the cpu B values will be added twice.

Let's rework the code to sum up the per-cpu values into a local (on-stack) structure. An extra 5 unsigned long variables seem fine to put on the stack.
https://jira.vzint.dev/browse/PSBM-150027
Signed-off-by: Konstantin Khorenko
---
 drivers/net/venetdev.c | 38 ++++++++++++++++++++++++--------------
 include/linux/venet.h  | 10 +++++++++-
 2 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/drivers/net/venetdev.c b/drivers/net/venetdev.c
index 3ca4f8a49a5c..3ad67ee2f2b0 100644
--- a/drivers/net/venetdev.c
+++ b/drivers/net/venetdev.c
@@ -568,25 +568,35 @@ static int venet_xmit(struct sk_buff *skb, struct net_device *dev)
 	return 0;
 }
 
-static struct net_device_stats *get_stats(struct net_device *dev)
+static void venet_stats_one(struct venet_device_stats *result, struct net_device *dev)
 {
-	int i;
-	struct venet_stats *stats;
+	int cpu;
+
+	memset(result, 0, sizeof(struct venet_device_stats));
 
-	stats = (struct venet_stats *)dev->ml_priv;
-	memset(&stats->stats, 0, sizeof(struct net_device_stats));
-	for_each_possible_cpu(i) {
+	for_each_possible_cpu(cpu) {
 		struct net_device_stats *dev_stats;
 
-		dev_stats = venet_stats(dev, i);
-		stats->stats.rx_bytes += dev_stats->rx_bytes;
-		stats->stats.tx_bytes += dev_stats->tx_bytes;
-		stats->stats.rx_packets += dev_stats->rx_packets;
-		stats->stats.tx_packets += dev_stats->tx_packets;
-		stats->stats.tx_dropped += dev_stats->tx_dropped;
+		dev_stats = venet_stats(dev, cpu);
+		result->rx_bytes += dev_stats->rx_bytes;
+		result->tx_bytes += dev_stats->tx_bytes;
+		result->rx_packets += dev_stats->rx_packets;
+		result->tx_packets += dev_stats->tx_packets;
+		result->tx_dropped += dev_stats->tx_dropped;
 	}
+}
 
-	return &stats->stats;
+static void venet_get_stats64(struct net_device *dev,
+			      struct rtnl_link_stats64 *total)
+{
+	struct venet_device_stats one;
+
+	venet_stats_one(&one, dev);

Having both venet_get_stats64 and venet_stats_one and adding struct venet_device_stats seems excess to me. Why can't we just fill rtnl_link_stats64 directly?

+	total->rx_bytes = one.rx_bytes;
+	total->tx_bytes = one.tx_bytes;
+	total->rx_packets = one.rx_packets;
+	total->tx_packets = one.tx_packets;
+	total->tx_dropped = one.tx_dropped;
 }
 
 /* Initialize the rest of the LOOPBACK device. */
@@ -712,7 +722,7 @@ static const struct ethtool_ops venet_ethtool_ops = {
 static const struct net_device_ops venet_netdev_ops = {
 	.ndo_start_xmit = venet_xmit,
-	.ndo_get_stats = get_stats,
+	.ndo_get_stats64 = venet_get_stats64,
 	.ndo_open = venet_open,
 	.ndo_stop = venet_close,
 	.ndo_init = venet_init_dev,
diff --git a/include/linux/venet.h b/include/linux/venet.h
index 51e0abeb03d7..b54b0342a38a 100644
--- a/include/linux/venet.h
+++ b/include/linux/venet.h
@@ -21,10 +21,18 @@ struct ve_struct;
 struct venet_stat;
 
 struct venet_stats {
-	struct net_device_stats stats;
 	struct net_device_stats *real_stats;
 };
 
+struct venet_device_stats {
+	unsigned long rx_packets;
+	unsigned long tx_packets;
+	unsigned long rx_bytes;
+	unsigned long tx_bytes;
+
+	unsigned long tx_dropped;
+};
+
 struct ip_entry_struct {
 	struct ve_addr_struct addr;

--
Best regards,
Tikhomirov Pavel
Senior Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH9 4/4] cgroup: show and allow mounting of cgroup-v2 in containers
The change in is_virtualized_cgroot makes cgroup-v2 "virtualized" only when it has enabled subsystems in it. And "virtualized" means that it will be: a) shown in different cgroup related proc files in container, b) allowed to be mounted in container and c) required to have separate root directory for container.

Note: we don't expect cgroup-v2 root to change enabled controllers, it either has no controllers or has at least one all the way.

Note: cgroup-v2 has one hierarchy for all controllers, so skipping misc and debug for vzctl is not required anymore.

Add FS_VIRTUALIZED to cgroup2 to allow mounting in container.

https://jira.vzint.dev/browse/PSBM-149975
Signed-off-by: Pavel Tikhomirov
---
 kernel/cgroup/cgroup.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 945130057a30..d902e675598d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2060,18 +2060,16 @@ struct ve_struct *get_curr_ve(void)
  */
 static inline bool is_virtualized_cgroot(struct cgroup_root *cgroot)
 {
-	/* Cgroup v2 */
-	if (cgroot == &cgrp_dfl_root)
-		return false;
-
+	if (cgroot != &cgrp_dfl_root) {
 #if IS_ENABLED(CONFIG_CGROUP_DEBUG)
-	if (cgroot->subsys_mask & (1 << debug_cgrp_id))
-		return false;
+		if (cgroot->subsys_mask & (1 << debug_cgrp_id))
+			return false;
 #endif
#if IS_ENABLED(CONFIG_CGROUP_MISC)
-	if (cgroot->subsys_mask & (1 << misc_cgrp_id))
-		return false;
+		if (cgroot->subsys_mask & (1 << misc_cgrp_id))
+			return false;
 #endif
+	}
 
 	if (cgroot->subsys_mask)
 		return true;
@@ -2628,7 +2626,7 @@ static struct file_system_type cgroup2_fs_type = {
 	.init_fs_context	= cgroup_init_fs_context,
 	.parameters		= cgroup2_fs_parameters,
 	.kill_sb		= cgroup_kill_sb,
-	.fs_flags		= FS_USERNS_MOUNT,
+	.fs_flags		= FS_USERNS_MOUNT | FS_VIRTUALIZED,
 };

 #ifdef CONFIG_CPUSETS
--
2.41.0
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
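The new decision logic can be sketched as a standalone userspace predicate. This is an illustrative model: the bit positions and the assumption that the function simply ends with "virtualized iff any subsystem is enabled" are mine, not the kernel's exact definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* illustrative controller bits, not the real cgrp_id values */
#define BIT_DEBUG (1u << 0)
#define BIT_MISC  (1u << 1)

/* Model of is_virtualized_cgroot() after the patch: on v1 roots the
 * debug and misc controllers still disable virtualization; the v2
 * (default) root is virtualized whenever any subsystem is enabled. */
static bool is_virtualized_cgroot(bool is_dfl_root, unsigned subsys_mask)
{
    if (!is_dfl_root) {
        if (subsys_mask & BIT_DEBUG)
            return false;
        if (subsys_mask & BIT_MISC)
            return false;
    }
    /* virtualized iff at least one subsystem is enabled */
    return subsys_mask != 0;
}
```

The interesting change is the first branch: before the patch the default (v2) root returned false unconditionally; now it falls through to the subsys_mask test, so a v2 root with enabled controllers becomes visible and mountable in containers.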
[Devel] [PATCH RH9 3/4] cgroup: allow to write to cgroup.subtree_control in ct root cgroup
This way systemd in container on cgroup-v2 can enable the controllers it requires. https://jira.vzint.dev/browse/PSBM-149975 Signed-off-by: Pavel Tikhomirov --- kernel/cgroup/cgroup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index b0cf5cf66d20..945130057a30 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -5482,7 +5482,7 @@ static struct cftype cgroup_base_files[] = { }, { .name = "cgroup.subtree_control", - .flags = CFTYPE_NS_DELEGATABLE, + .flags = CFTYPE_NS_DELEGATABLE | CFTYPE_VE_WRITABLE, .seq_show = cgroup_subtree_control_show, .write = cgroup_subtree_control_write, }, -- 2.41.0 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH9 2/4] mm/memcontrol: fix mem_cgroup_enough_memory on cgroup-v2
In cgroup-v2 memsw is not used anymore, instead we have swap and memory separated from one another. So the memory limit should be read from memory.max, not from memsw.max.

https://jira.vzint.dev/browse/PSBM-149975
Fixes: 9417436f33d1 ("ve/mm: add heuristic check for memory overcommit")
Signed-off-by: Pavel Tikhomirov
---
 mm/memcontrol.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9287eed944d5..cd11065cbe0d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4283,7 +4283,10 @@ static int mem_cgroup_enough_memory(struct mem_cgroup *memcg, long pages)
 	long free;
 
 	/* unused memory */
-	free = memcg->memsw.max - page_counter_read(&memcg->memory);
+	if (memcg->css.cgroup->root == &cgrp_dfl_root)
+		free = memcg->memory.max - page_counter_read(&memcg->memory);
+	else
+		free = memcg->memsw.max - page_counter_read(&memcg->memory);
 
 	/* reclaimable slabs */
 	free += memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) >> PAGE_SHIFT;
--
2.41.0
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
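The corrected overcommit check can be modeled in a few lines of userspace C. This is a sketch, not the kernel function: field names mirror the patch, and the reclaimable-slab term is folded into a single "reclaimable" value for illustration.

```c
#include <assert.h>

/* illustrative stand-in for struct mem_cgroup */
struct memcg {
    long memory_max;   /* memory.max (v2 limit)              */
    long memsw_max;    /* memsw.max (v1 mem+swap limit)      */
    long memory_usage; /* page_counter_read(&memcg->memory)  */
    long reclaimable;  /* reclaimable slabs etc.             */
    int  on_dfl_root;  /* attached to the cgroup-v2 root?    */
};

/* Model of mem_cgroup_enough_memory() after the fix: on the unified
 * (v2) hierarchy the limit comes from memory.max, on v1 from memsw.max. */
static int enough_memory(const struct memcg *m, long pages)
{
    long free;

    if (m->on_dfl_root)
        free = m->memory_max - m->memory_usage;
    else
        free = m->memsw_max - m->memory_usage;

    free += m->reclaimable;
    return free >= pages ? 0 : -1; /* -ENOMEM in the kernel */
}
```

Before the fix, a v2 cgroup (where memsw is never charged and its limit stays at the default) would compute "free" against the wrong, effectively unset limit.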
[Devel] [PATCH RH9 1/4] cgroup: fix uninitialized use of ctx->root
In vfs_get_tree we explicitly check that fc->root is NULL before calling the ->get_tree helper. So when mounting the cgroup2 filesystem, at the beginning of cgroup_get_tree fc->root is uninitialized. We were lucky that ve_hide_cgroups never dereferenced it on this code path: mounting cgroup2 from a container was prohibited, and from the host ve_hide_cgroups does not dereference the root. But if we allow mounting the cgroup2 filesystem in a container, this use of ctx->root in cgroup_get_tree will lead to a crash; let's fix it.

https://jira.vzint.dev/browse/PSBM-149975
Fixes: e8e4834b833c ("ve/cgroup: hide non-virtualized cgroups in container")
Signed-off-by: Pavel Tikhomirov
---
 kernel/cgroup/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 3f685035076a..b0cf5cf66d20 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2543,7 +2543,7 @@ static int cgroup_get_tree(struct fs_context *fc)
 	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
 	int ret;
 
-	if (ve_hide_cgroups(ctx->root))
+	if (ve_hide_cgroups(&cgrp_dfl_root))
 		return -EPERM;
 
 	cgrp_dfl_visible = true;
--
2.41.0
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH9 0/4] enable cgroup-v2 in containers
To test it one needs to boot with the systemd.unified_cgroup_hierarchy=1 option.

https://jira.vzint.dev/browse/PSBM-149975
Signed-off-by: Pavel Tikhomirov

Pavel Tikhomirov (4):
  cgroup: fix uninitialized use of ctx->root
  mm/memcontrol: fix mem_cgroup_enough_memory on cgroup-v2
  cgroup: allow to write to cgroup.subtree_control in ct root cgroup
  cgroup: show and allow mounting of cgroup-v2 in containers

 kernel/cgroup/cgroup.c | 20 +++++++++++---------
 mm/memcontrol.c        |  5 ++++-
 2 files changed, 13 insertions(+), 12 deletions(-)

--
2.41.0
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH RH7 v2 0/2] fixes for memory cgroup id refcounting
On 11.07.2023 12:32, Pavel Tikhomirov wrote:
Complete memcg id refcount design:
- on cgroup online get css refcount and set memcg id refcount to 1
- on cgroup offline put memcg id refcount
- on swapout:
  - if cgroup memcg id refcount is 0 (not online), use parent cgroup
  - if cgroup memcg id refcount is >0, get it (previously in rh7 we got css refcount instead)
- on swapin put memcg id refcount (previously in rh7 we put css refcount instead)
- on charge move from one to another cgroup put memcg id refcount on "from" cgroup and get it on "to" cgroup (previously in rh7 we put css refcount on "from" and get on "to")
- on last memcg id refcount put also put css refcount

Charge move is a bit tricky as we can move swap -> non-swap, thus we get "to" memcg id refcount in mem_cgroup_move_charge_pte_range, and put "from" memcg id refcount later in __mem_cgroup_clear_mc.

The part about "swap -> non-swap" is wrong, rewrite of this paragraph:

Charge move is a bit tricky: we get "to" memcg id refcount in mem_cgroup_move_charge_pte_range, and put "from" memcg id refcount later in __mem_cgroup_clear_mc.

So, in the first patch of this series we remove a leftover memcg id refcount put on the memcg free path, and in the second patch we remove a leftover "to" css refcount get in the task move code for moved swap.

https://jira.vzint.dev/browse/PSBM-148818
Signed-off-by: Pavel Tikhomirov

Pavel Tikhomirov (2):
  mm/memcontrol: remove excess put of memory cgroup id refcount on css free
  mm/memcontrol: remove excess css_get on task migration with swap

 mm/memcontrol.c | 14 --------------
 1 file changed, 14 deletions(-)

--
Best regards,
Tikhomirov Pavel
Senior Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH RH7] mm/memcontrol: remove excess put of memory cgroup id refcount on css free
please ignore, see v2 On 11.07.2023 11:47, Pavel Tikhomirov wrote: The removed hunk comes from rhel partial porting of [1], there they added mem_cgroup_id_put on memcg free instead of offline. But they also didn't do refcounting in mem_cgroup_id_put, only removed id from idr. Now we do actual refcounting in mem_cgroup_id_put, and atomic_set(1) and css_get in mem_cgroup_css_online are paired with mem_cgroup_id_put in mem_cgroup_css_offline. The call in free path is excess and we should remove it. This is not a critical excess refcount put, as it happened on late cgroup freeing stage, which can only happen after this id refcount reaches zero, which leads to css_put, which leads to freeing. So nobody cares about id refcount at this point. https://jira.vzint.dev/browse/PSBM-148818 Fixes: 9dcef96ce3e2 ("ms/mm: memcontrol: fix cgroup creation failure after many small jobs") [1] Signed-off-by: Pavel Tikhomirov --- mm/memcontrol.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 3a5e89b913bd..c7c15c6dd6fb 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6818,8 +6818,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) mem_cgroup_remove_from_trees(memcg); - mem_cgroup_id_put(memcg); - for_each_node(node) free_mem_cgroup_per_zone_info(memcg, node); -- Best regards, Tikhomirov Pavel Senior Software Developer, Virtuozzo. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH7 v2 0/2] fixes for memory cgroup id refcounting
Complete memcg id refcount design:
- on cgroup online get css refcount and set memcg id refcount to 1
- on cgroup offline put memcg id refcount
- on swapout:
  - if cgroup memcg id refcount is 0 (not online), use parent cgroup
  - if cgroup memcg id refcount is >0, get it (previously in rh7 we got css refcount instead)
- on swapin put memcg id refcount (previously in rh7 we put css refcount instead)
- on charge move from one to another cgroup put memcg id refcount on "from" cgroup and get it on "to" cgroup (previously in rh7 we put css refcount on "from" and get on "to")
- on last memcg id refcount put also put css refcount

Charge move is a bit tricky as we can move swap -> non-swap, thus we get "to" memcg id refcount in mem_cgroup_move_charge_pte_range, and put "from" memcg id refcount later in __mem_cgroup_clear_mc.

So, in the first patch of this series we remove a leftover memcg id refcount put on the memcg free path, and in the second patch we remove a leftover "to" css refcount get in the task move code for moved swap.

https://jira.vzint.dev/browse/PSBM-148818
Signed-off-by: Pavel Tikhomirov

Pavel Tikhomirov (2):
  mm/memcontrol: remove excess put of memory cgroup id refcount on css free
  mm/memcontrol: remove excess css_get on task migration with swap

 mm/memcontrol.c | 14 --------------
 1 file changed, 14 deletions(-)

--
2.41.0
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
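The refcount design listed above can be modeled as a tiny two-level refcount in userspace C. This is an illustrative sketch of the invariant only (all names and the int counters are mine, not the kernel's atomics): the id refcount pins the css refcount, and only the last id put releases that pin.

```c
#include <assert.h>

/* illustrative model of struct mem_cgroup's two refcounts */
struct memcg {
    int css_ref;  /* css refcount */
    int id_ref;   /* memcg id refcount */
};

static void css_get(struct memcg *m) { m->css_ref++; }
static void css_put(struct memcg *m) { m->css_ref--; }

/* on cgroup online: get css refcount and set id refcount to 1 */
static void memcg_online(struct memcg *m)
{
    css_get(m);
    m->id_ref = 1;
}

/* swapout etc. take id references, not css references */
static void id_get_many(struct memcg *m, int n) { m->id_ref += n; }

static void id_put_many(struct memcg *m, int n)
{
    m->id_ref -= n;
    if (m->id_ref == 0)
        css_put(m);  /* last id put also puts the css refcount */
}

/* on cgroup offline: put the online id reference */
static void memcg_offline(struct memcg *m) { id_put_many(m, 1); }
```

The point of the model: swapped-out pages keep the css alive through the id refcount even after the cgroup goes offline, and the css pin is dropped exactly once, by whichever put brings the id refcount to zero.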
[Devel] [PATCH RH7 v2 1/2] mm/memcontrol: remove excess put of memory cgroup id refcount on css free
The removed hunk comes from rhel partial porting of [1], there they added mem_cgroup_id_put on memcg free instead of offline. But they also didn't do refcounting in mem_cgroup_id_put, only removed id from idr. Now we do actual refcounting in mem_cgroup_id_put, and atomic_set(1) and css_get in mem_cgroup_css_online are paired with mem_cgroup_id_put in mem_cgroup_css_offline. The call in free path is excess and we should remove it. This is not a critical excess refcount put, as it happened on late cgroup freeing stage, which can only happen after this id refcount reaches zero, which leads to css_put, which leads to freeing. So nobody cares about id refcount at this point. https://jira.vzint.dev/browse/PSBM-148818 Fixes: 9dcef96ce3e2 ("ms/mm: memcontrol: fix cgroup creation failure after many small jobs") [1] Signed-off-by: Pavel Tikhomirov --- mm/memcontrol.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 3a5e89b913bd..c7c15c6dd6fb 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6818,8 +6818,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) mem_cgroup_remove_from_trees(memcg); - mem_cgroup_id_put(memcg); - for_each_node(node) free_mem_cgroup_per_zone_info(memcg, node); -- 2.41.0 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH7 v2 2/2] mm/memcontrol: remove excess css_get on task migration with swap
Before [1] we took a css refcount for each swap page; after it we take a memory cgroup id refcount instead, that's why we don't need to get a css refcount on the destination cgroup when moving a swap page to it.

https://jira.vzint.dev/browse/PSBM-148818
Fixes: 9dcef96ce3e2 ("ms/mm: memcontrol: fix cgroup creation failure after many small jobs") [1]
Fixes: 11870ae0f7b6 ("ms/mm: memcontrol: fix memcg id ref counter on swap charge move")
Signed-off-by: Pavel Tikhomirov
---
 mm/memcontrol.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c7c15c6dd6fb..a2fe48d93d02 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4356,18 +4356,6 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
 	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
 		mem_cgroup_swap_statistics(from, false);
 		mem_cgroup_swap_statistics(to, true);
-		/*
-		 * This function is only called from task migration context now.
-		 * It postpones page_counter and refcount handling till the end
-		 * of task migration(mem_cgroup_clear_mc()) for performance
-		 * improvement. But we cannot postpone css_get(to) because if
-		 * the process that has been moved to @to does swap-in, the
-		 * refcount of @to might be decreased to 0.
-		 *
-		 * We are in attach() phase, so the cgroup is guaranteed to be
-		 * alive, so we can just call css_get().
-		 */
-		css_get(&to->css);
 		return 0;
 	}
 	return -EINVAL;
--
2.41.0
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
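The swap-charge move around this hunk can be sketched as a userspace model. Everything here is illustrative (the real swap_cgroup_cmpxchg operates on a per-swap-entry record, and the id get/put actually happen at different stages of the task migration, as the cover letter explains); the model only shows that after the fix the ownership handoff moves id refcounts and touches no css refcount.

```c
#include <assert.h>

static unsigned short swap_owner;  /* model: owner id recorded for one swap entry */

/* compare-and-exchange on the swap entry's owner id */
static unsigned short swap_cgroup_cmpxchg(unsigned short old_id, unsigned short new_id)
{
    unsigned short cur = swap_owner;
    if (cur == old_id)
        swap_owner = new_id;
    return cur;
}

struct memcg { unsigned short id; int id_ref; };

/* Model of the move after the fix: on success the entry's owner id is
 * swapped and only id refcounts move.  In the kernel the "to" get is
 * taken earlier (mem_cgroup_move_charge_pte_range) and the "from" put
 * happens later (__mem_cgroup_clear_mc); here they are inlined. */
static int move_swap_account(struct memcg *from, struct memcg *to)
{
    if (swap_cgroup_cmpxchg(from->id, to->id) == from->id) {
        from->id_ref--;
        to->id_ref++;
        return 0;
    }
    return -1; /* -EINVAL: entry is not owned by 'from' */
}
```

A second move attempt on the same entry fails, because the cmpxchg no longer matches the old owner id, and no refcounts are touched on that path.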
[Devel] [PATCH RH7] mm/memcontrol: remove excess put of memory cgroup id refcount on css free
The removed hunk comes from rhel partial porting of [1], there they added mem_cgroup_id_put on memcg free instead of offline. But they also didn't do refcounting in mem_cgroup_id_put, only removed id from idr. Now we do actual refcounting in mem_cgroup_id_put, and atomic_set(1) and css_get in mem_cgroup_css_online are paired with mem_cgroup_id_put in mem_cgroup_css_offline. The call in free path is excess and we should remove it. This is not a critical excess refcount put, as it happened on late cgroup freeing stage, which can only happen after this id refcount reaches zero, which leads to css_put, which leads to freeing. So nobody cares about id refcount at this point. https://jira.vzint.dev/browse/PSBM-148818 Fixes: 9dcef96ce3e2 ("ms/mm: memcontrol: fix cgroup creation failure after many small jobs") [1] Signed-off-by: Pavel Tikhomirov --- mm/memcontrol.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 3a5e89b913bd..c7c15c6dd6fb 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6818,8 +6818,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) mem_cgroup_remove_from_trees(memcg); - mem_cgroup_id_put(memcg); - for_each_node(node) free_mem_cgroup_per_zone_info(memcg, node); -- 2.41.0 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH7 v2] mm/memcontrol: do not put css when putting page to swap or moving swap page
When commit [1] was ported we should not have removed css_get in mem_cgroup_swapout, as when we put a page from memory to swap we need to hold a refcount on the css in case we want to put it back. In vz7 we don't have commit [2], which in mainstream precedes [1]; that's why in mainstream [1] removed the css_get, as there is no need to take a second reference for the same page. So the fix for [1] is bringing back css_get.

When commit [3] was ported we should have removed this css_get, as now we take mem_cgroup_id_get instead. But we should not have added css_put there, as it would be an excess put (removing a get + adding a put is equivalent to a double put). In mainstream [3] is preceded by [2], so there they need to put the css reference of a page put to swap and replace it with an id reference. So the fix for [1] + [2] is removal of css_put.

When commit [4] was ported, in __mem_cgroup_clear_mc we should have removed the css_put-s, as after [3] we don't hold a css reference for swapped pages, and there is nothing to put when moving them to another cgroup. And now I understand why they do the css_put(from)->css_put(to) replacement in the original mainstream patch: they also don't have a refcount from a page which is already in swap after [3], but obviously when they charge pages to the destination cgroup they take a css reference for them, and as the pages are in swap they need to put this reference.
https://jira.vzint.dev/browse/PSBM-148702
Fixes: d0f735e5df66 ("ms/mm: memcontrol: uncharge pages on swapout") [1]
e8ea14cc6ead ("mm: memcontrol: take a css reference for each charged page") [2]
Fixes: 9dcef96ce3e2 ("ms/mm: memcontrol: fix cgroup creation failure after many small jobs") [3]
Fixes: 11870ae0f7b6 ("ms/mm: memcontrol: fix memcg id ref counter on swap charge move") [4]
Signed-off-by: Pavel Tikhomirov
---
 mm/memcontrol.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b2911329852f..3a5e89b913bd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7413,7 +7413,6 @@ static void __mem_cgroup_clear_mc(void)
 {
 	struct mem_cgroup *from = mc.from;
 	struct mem_cgroup *to = mc.to;
-	int i;
 
 	/* we must uncharge all the leftover precharges from mc.to */
 	if (mc.precharge) {
@@ -7436,9 +7435,6 @@ static void __mem_cgroup_clear_mc(void)
 		mem_cgroup_id_put_many(mc.from, mc.moved_swap);
 
-		for (i = 0; i < mc.moved_swap; i++)
-			css_put(&mc.to->css);
-
 		if (!mem_cgroup_is_root(mc.to)) {
 			/*
 			 * we charged both to->memory and to->memsw, so we
@@ -7837,9 +7833,6 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	mem_cgroup_charge_statistics(memcg, page, -1);
 	memcg_check_events(memcg, page);
-
-	if (!mem_cgroup_is_root(memcg))
-		css_put(&memcg->css);
 }
 
 /**
--
2.41.0
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH7] mm/memcontrol: do not put css when putting page to swap or moving swap page
When commit [1] was ported we should not have removed css_get in mem_cgroup_swapout, as when we put a page from memory to swap we need to hold a refcount on the css in case we want to put it back. In vz7 we don't have commit [2], which in mainstream precedes [1]; that's why in mainstream [1] removed the css_get, as there is no need to take a second reference for the same page. So the fix for [1] is bringing back css_get.

When commit [3] was ported we should have removed this css_get, as now we take mem_cgroup_id_get instead. But we should not have added css_put there, as it would be an excess put (removing a get + adding a put is equivalent to a double put). In mainstream [3] is preceded by [2], so there they need to put the css reference of a page put to swap and replace it with an id reference. So the fix for [1] + [2] is removal of css_put.

When commit [4] was ported, in __mem_cgroup_clear_mc we should have removed the css_put-s, as after [3] we don't hold a css reference for swapped pages, and there is nothing to put when moving them to another cgroup. And now I understand why they do the css_put(from)->css_put(to) replacement in the original mainstream patch: they also don't have a refcount from a page which is already in swap after [3], but obviously when they charge pages to the destination cgroup they take a css reference for them, and as the pages are in swap they need to put this reference.
https://jira.vzint.dev/browse/PSBM-148702
Fixes: d0f735e5df66 ("ms/mm: memcontrol: uncharge pages on swapout") [1]
e8ea14cc6ead ("mm: memcontrol: take a css reference for each charged page") [2]
Fixes: 9dcef96ce3e2 ("ms/mm: memcontrol: fix cgroup creation failure after many small jobs") [3]
Fixes: 11870ae0f7b6 ("ms/mm: memcontrol: fix memcg id ref counter on swap charge move") [4]
Signed-off-by: Pavel Tikhomirov
---
 mm/memcontrol.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b2911329852f..8f74de3ebace 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7436,9 +7436,6 @@ static void __mem_cgroup_clear_mc(void)
 		mem_cgroup_id_put_many(mc.from, mc.moved_swap);
 
-		for (i = 0; i < mc.moved_swap; i++)
-			css_put(&mc.to->css);
-
 		if (!mem_cgroup_is_root(mc.to)) {
 			/*
 			 * we charged both to->memory and to->memsw, so we
@@ -7837,9 +7834,6 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	mem_cgroup_charge_statistics(memcg, page, -1);
 	memcg_check_events(memcg, page);
-
-	if (!mem_cgroup_is_root(memcg))
-		css_put(&memcg->css);
 }
 
 /**
--
2.41.0
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH7 9/9] mm: memcontrol: avoid unused function warning
From: Arnd Bergmann

A bugfix in v4.8-rc2 introduced a harmless warning when CONFIG_MEMCG_SWAP is disabled but CONFIG_MEMCG is enabled:

mm/memcontrol.c:4085:27: error: 'mem_cgroup_id_get_online' defined but not used [-Werror=unused-function]
 static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)

This moves the function inside of the #ifdef block that hides the calling function, to avoid the warning.

Fixes: 1f47b61fb407 ("mm: memcontrol: fix swap counter leak on swapout from offline cgroup")
Link: http://lkml.kernel.org/r/20160824113733.2776701-1-a...@arndb.de
Signed-off-by: Arnd Bergmann
Acked-by: Michal Hocko
Acked-by: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

https://jira.vzint.dev/browse/PSBM-147036
(cherry picked from commit 358c07fcc3b60ab08d77f1684de8bd81bcf49a1a)
Signed-off-by: Pavel Tikhomirov
---
 mm/memcontrol.c | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 098c18f4981e..5ea9dfcae678 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6688,24 +6688,6 @@ static void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n)
 	atomic_add(n, &memcg->id.ref);
 }
 
-static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
-{
-	while (!atomic_inc_not_zero(&memcg->id.ref)) {
-		/*
-		 * The root cgroup cannot be destroyed, so it's refcount must
-		 * always be >= 1.
-		 */
-		if (WARN_ON_ONCE(memcg == root_mem_cgroup)) {
-			VM_BUG_ON(1);
-			break;
-		}
-		memcg = parent_mem_cgroup(memcg);
-		if (!memcg)
-			memcg = root_mem_cgroup;
-	}
-	return memcg;
-}
-
 static void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n)
 {
 	VM_BUG_ON(atomic_read(&memcg->id.ref) < n);
@@ -7785,6 +7767,24 @@ static void __init enable_swap_cgroup(void)
 #endif
 
 #ifdef CONFIG_MEMCG_SWAP
+static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
+{
+	while (!atomic_inc_not_zero(&memcg->id.ref)) {
+		/*
+		 * The root cgroup cannot be destroyed, so it's refcount must
+		 * always be >= 1.
+		 */
+		if (WARN_ON_ONCE(memcg == root_mem_cgroup)) {
+			VM_BUG_ON(1);
+			break;
+		}
+		memcg = parent_mem_cgroup(memcg);
+		if (!memcg)
+			memcg = root_mem_cgroup;
+	}
+	return memcg;
+}
+
 /**
  * mem_cgroup_swapout - transfer a memsw charge to swap
  * @page: page whose memsw charge to transfer
--
2.40.1
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
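The parent-fallback logic of the mem_cgroup_id_get_online() function moved by this patch can be modeled in userspace. This is an illustrative sketch (a plain int stands in for the kernel's atomic_t, and the WARN_ON_ONCE path is elided): if a cgroup is already offline, i.e. its id refcount dropped to zero, the charge is bounced to the nearest online ancestor, ending at the root.

```c
#include <assert.h>
#include <stddef.h>

struct memcg {
    int id_ref;            /* memcg id refcount; 0 means offline */
    struct memcg *parent;  /* NULL above the root */
};

/* model of atomic_inc_not_zero(): take a reference only if one exists */
static int inc_not_zero(int *ref)
{
    if (*ref == 0)
        return 0;
    (*ref)++;
    return 1;
}

/* Model of mem_cgroup_id_get_online(): walk up until an online
 * ancestor accepts the reference; the root never reaches zero. */
static struct memcg *id_get_online(struct memcg *memcg, struct memcg *root)
{
    while (!inc_not_zero(&memcg->id_ref)) {
        memcg = memcg->parent;
        if (!memcg)
            memcg = root;
    }
    return memcg;
}
```

This is the swapout-time rule from the series description: "if cgroup memcg id refcount is 0 (not online), use parent cgroup".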
[Devel] [PATCH RH7 6/9] memcg: remove memcg_cgroup::id from IDR on mem_cgroup_css_alloc() failure
From: Kirill Tkhai

In case of memcg_online_kmem() failure, memcg_cgroup::id remains hashed in mem_cgroup_idr even after memcg memory is freed. This leads to leak of ID in mem_cgroup_idr.

This patch adds removal into mem_cgroup_css_alloc(), which fixes the problem. For better readability, it adds a generic helper which is used in mem_cgroup_alloc() and mem_cgroup_id_put_many() as well.

Link: http://lkml.kernel.org/r/152354470916.22460.14397070748001974638.stgit@localhost.localdomain
Fixes 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
Signed-off-by: Kirill Tkhai
Acked-by: Johannes Weiner
Acked-by: Vladimir Davydov
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Changes:
- skip hunk in mem_cgroup_css_alloc as there is no such error path yet; the patch is not strictly required, but better to have it as a small cleanup

https://jira.vzint.dev/browse/PSBM-147036
(cherry picked from commit 7e97de0b033bcac4fa9a35cef72e0c06e6a22c67)
Signed-off-by: Pavel Tikhomirov
---
 mm/memcontrol.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8c14585f5e86..af99281292e5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6674,6 +6674,14 @@ unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 	return memcg->id.id;
 }
 
+static void mem_cgroup_id_remove(struct mem_cgroup *memcg)
+{
+	if (memcg->id.id > 0) {
+		idr_remove(&mem_cgroup_idr, memcg->id.id);
+		memcg->id.id = 0;
+	}
+}
+
 static void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n)
 {
 	atomic_add(n, &memcg->id.ref);
@@ -6700,8 +6708,7 @@ static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
 static void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n)
 {
 	if (atomic_sub_and_test(n, &memcg->id.ref)) {
-		idr_remove(&mem_cgroup_idr, memcg->id.id);
-		memcg->id.id = 0;
+		mem_cgroup_id_remove(memcg);
 
 		/* Memcg ID pins CSS */
 		css_put(&memcg->css);
@@ -6811,8 +6818,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	for_each_node(node)
 		free_mem_cgroup_per_zone_info(memcg, node);
 
-	if (memcg->id.id > 0)
-		idr_remove(&mem_cgroup_idr, memcg->id.id);
+	mem_cgroup_id_remove(memcg);
 fail:
 	kfree(memcg);
 	return NULL;
--
2.40.1
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH7 4/9] mm: memcontrol: fix memcg id ref counter on swap charge move
From: Vladimir Davydov

Since commit 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs") swap entries do not pin memcg->css.refcnt directly. Instead, they pin memcg->id.ref. So we should adjust the reference counters accordingly when moving swap charges between cgroups.

Fixes: 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
Link: http://lkml.kernel.org/r/9ce297c64954a42dc90b543bc76106c4a94f07e8.1470219853.git.vdavy...@virtuozzo.com
Signed-off-by: Vladimir Davydov
Acked-by: Michal Hocko
Acked-by: Johannes Weiner
Cc: [3.19+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Changes: the original patch also did:

-		css_put_many(&mc.from->css, mc.moved_swap);
+		css_put_many(&mc.to->css, mc.moved_swap);

but this change seems strange and unrelated to me, let's skip it.

https://jira.vzint.dev/browse/PSBM-147036

(cherry picked from commit 615d66c37c755c49ce022c9e5ac0875d27d2603d)
Signed-off-by: Pavel Tikhomirov
---
 mm/memcontrol.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a9a1fb354d33..8b78c7c8d3e3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6674,9 +6674,9 @@ unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 	return memcg->id.id;
 }
 
-static void mem_cgroup_id_get(struct mem_cgroup *memcg)
+static void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n)
 {
-	atomic_inc(&memcg->id.ref);
+	atomic_add(n, &memcg->id.ref);
 }
 
 static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
@@ -6697,9 +6697,9 @@ static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
 	return memcg;
 }
 
-static void mem_cgroup_id_put(struct mem_cgroup *memcg)
+static void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n)
 {
-	if (atomic_dec_and_test(&memcg->id.ref)) {
+	if (atomic_sub_and_test(n, &memcg->id.ref)) {
 		idr_remove(&mem_cgroup_idr, memcg->id.id);
 		memcg->id.id = 0;
@@ -6708,6 +6708,16 @@ static void mem_cgroup_id_put(struct mem_cgroup *memcg)
 	}
 }
 
+static inline void mem_cgroup_id_get(struct mem_cgroup *memcg)
+{
+	mem_cgroup_id_get_many(memcg, 1);
+}
+
+static inline void mem_cgroup_id_put(struct mem_cgroup *memcg)
+{
+	mem_cgroup_id_put_many(memcg, 1);
+}
+
 /**
  * mem_cgroup_from_id - look up a memcg from a memcg id
  * @id: the memcg id to look up
@@ -7442,6 +7452,8 @@ static void __mem_cgroup_clear_mc(void)
 		if (!mem_cgroup_is_root(mc.from))
 			page_counter_uncharge(&mc.from->memsw, mc.moved_swap);
 
+		mem_cgroup_id_put_many(mc.from, mc.moved_swap);
+
 		for (i = 0; i < mc.moved_swap; i++)
 			css_put(&mc.from->css);
 
@@ -7452,7 +7464,9 @@ static void __mem_cgroup_clear_mc(void)
 		 */
 		page_counter_uncharge(&mc.to->memory, mc.moved_swap);
 	}
-	/* we've already done css_get(mc.to) */
+
+	mem_cgroup_id_get_many(mc.to, mc.moved_swap);
+
 	mc.moved_swap = 0;
 }
 
 	if (do_swap_account) {
-- 
2.40.1
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
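The `_get_many`/`_put_many` pair introduced here batches mc.moved_swap reference operations into a single atomic add or sub. A small userspace model of that pattern, with illustrative names (not kernel code):

```c
#include <assert.h>
#include <stdatomic.h>

/* Sketch of the get_many/put_many batching: take or drop n references
 * in one atomic operation instead of looping n times, freeing the
 * object when the count reaches zero. */
struct toy_ref {
	atomic_int ref;
	int freed;
};

static void toy_get_many(struct toy_ref *r, int n)
{
	atomic_fetch_add(&r->ref, n);
}

static void toy_put_many(struct toy_ref *r, int n)
{
	/* atomic_sub_and_test() analogue: previous value == n means the
	 * count just hit zero */
	if (atomic_fetch_sub(&r->ref, n) == n)
		r->freed = 1;
}
```

The batching matters for correctness as well as speed: dropping mc.moved_swap references one by one would let the count transiently hit zero mid-loop if the base reference were not held.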
[Devel] [PATCH RH7 8/9] mm/memcontrol.c: fix a -Wunused-function warning
From: Qian Cai

mem_cgroup_id_get() was introduced in commit 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs"). Later, it no longer has any user since the commits,

1f47b61fb407 ("mm: memcontrol: fix swap counter leak on swapout from offline cgroup")
58fa2a5512d9 ("mm: memcontrol: add sanity checks for memcg->id.ref on get/put")

so it is safe to remove it.

Link: http://lkml.kernel.org/r/1568648453-5482-1-git-send-email-...@lca.pw
Signed-off-by: Qian Cai
Acked-by: Michal Hocko
Cc: Johannes Weiner
Cc: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

https://jira.vzint.dev/browse/PSBM-147036

(cherry picked from commit 4d0e3230a56a75bf26d74b3f4b6cb05a0e9fae60)
Signed-off-by: Pavel Tikhomirov
---
 mm/memcontrol.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 42ca14e637ef..098c18f4981e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6717,11 +6717,6 @@ static void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n)
 	}
 }
 
-static inline void mem_cgroup_id_get(struct mem_cgroup *memcg)
-{
-	mem_cgroup_id_get_many(memcg, 1);
-}
-
 static inline void mem_cgroup_id_put(struct mem_cgroup *memcg)
 {
 	mem_cgroup_id_put_many(memcg, 1);
-- 
2.40.1
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH7 2/9] mm: memcontrol: fix cgroup creation failure after many small jobs
From: Johannes Weiner

The memory controller has quite a bit of state that usually outlives the cgroup and pins its CSS until said state disappears. At the same time it imposes a 16-bit limit on the CSS ID space to economically store IDs in the wild. Consequently, when we use cgroups to contain frequent but small and short-lived jobs that leave behind some page cache, we quickly run into the 64k limitations of outstanding CSSs. Creating a new cgroup fails with -ENOSPC while there are only a few, or even no user-visible cgroups in existence.

Although pinning CSSs past cgroup removal is common, there are only two instances that actually need an ID after a cgroup is deleted: cache shadow entries and swapout records.

Cache shadow entries reference the ID weakly and can deal with the CSS having disappeared when it's looked up later. They pose no hurdle.

Swap-out records do need to pin the css to hierarchically attribute swapins after the cgroup has been deleted; though the only pages that remain swapped out after offlining are tmpfs/shmem pages. And those references are under the user's control, so they are manageable.

This patch introduces a private 16-bit memcg ID and switches swap and cache shadow entries over to using that. This ID can then be recycled after offlining when the CSS remains pinned only by objects that don't specifically need it.

This script demonstrates the problem by faulting one cache page in a new cgroup and deleting it again:

  set -e
  mkdir -p pages
  for x in `seq 128000`; do
    [ $((x % 1000)) -eq 0 ] && echo $x
    mkdir /cgroup/foo
    echo $$ >/cgroup/foo/cgroup.procs
    echo trex >pages/$x
    echo $$ >/cgroup/cgroup.procs
    rmdir /cgroup/foo
  done

When run on an unpatched kernel, we eventually run out of possible IDs even though there are no visible cgroups:

  [root@ham ~]# ./cssidstress.sh
  [...]
  65000
  mkdir: cannot create directory '/cgroup/foo': No space left on device

After this patch, the IDs get released upon cgroup destruction and the cache and css objects get released once memory reclaim kicks in.

[han...@cmpxchg.org: init the IDR]
Link: http://lkml.kernel.org/r/20160621154601.ga22...@cmpxchg.org
Fixes: b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups")
Link: http://lkml.kernel.org/r/20160617162516.gd19...@cmpxchg.org
Signed-off-by: Johannes Weiner
Reported-by: John Garcia
Reviewed-by: Vladimir Davydov
Acked-by: Tejun Heo
Cc: Nikolay Borisov
Cc: [3.19+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Changes:
- rhel almost ported it, so some things are already there
- remove synchronize_rcu() from rhel version
- skip mem_cgroup_try_charge_swap hunk
- in original patch we didn't have the "if (!cont->parent) return 0;" thing in mem_cgroup_css_online; we should handle memcg->id.id before this, so as not to break refcounting on cgroups without a parent (e.g. root_mem_cgroup)

https://jira.vzint.dev/browse/PSBM-147036

(cherry picked from commit 73f576c04b9410ed19660f74f97521bee6e1c546)
Signed-off-by: Pavel Tikhomirov
---
 mm/memcontrol.c | 62 ++++++++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 42 insertions(+), 20 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a334e9a1a311..6356b6532163 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -176,6 +176,11 @@ enum mem_cgroup_events_target {
 #define SOFTLIMIT_EVENTS_TARGET 1024
 #define NUMAINFO_EVENTS_TARGET 1024
 
+struct mem_cgroup_id {
+	int id;
+	atomic_t ref;
+};
+
 static void mem_cgroup_id_put(struct mem_cgroup *memcg);
 
 struct mem_cgroup_stat_cpu {
@@ -302,7 +307,7 @@ struct mem_cgroup {
 	struct cgroup_subsys_state css;
 
 	/* Private memcg ID. Used to ID objects that outlive the cgroup */
-	unsigned short id;
+	struct mem_cgroup_id id;
 
 	/*
 	 * the counter to account for memory usage
@@ -6665,14 +6670,24 @@ unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	if (mem_cgroup_disabled())
 		return 0;
-	return memcg->id;
+
+	return memcg->id.id;
+}
+
+static void mem_cgroup_id_get(struct mem_cgroup *memcg)
+{
+	atomic_inc(&memcg->id.ref);
 }
 
 static void mem_cgroup_id_put(struct mem_cgroup *memcg)
 {
-	idr_remove(&mem_cgroup_idr, memcg->id);
-	memcg->id = 0;
-	synchronize_rcu();
+	if (atomic_dec_and_test(&memcg->id.ref)) {
+		idr_remove(&mem_cgroup_idr, memcg->id.id);
+		memcg->id.id = 0;
+
+		/* Memcg ID pins CSS */
+		css_put(&memcg->css);
+	}
 }
 
 /**
@@ -6726,7 +6741,6 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 {
 	struct mem_cgroup *memcg;
 	size_t size;
-	int id;
 	int i, ret;
 	int node;
 
@@ -6737,14 +6751,12 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	if (!memcg)
 		return NULL;
 
-	id = idr_alloc(&mem_cgroup_i
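The core of the fix above is decoupling the 16-bit id's lifetime from the css's lifetime: the id goes back to the pool at offline, even while the larger css object lingers until reclaim. A toy model of why that restores cgroup creation (a deliberately tiny id space of 4, all names mine):

```c
#include <assert.h>

/* Model of a fixed-size id allocator: before the fix, slots were only
 * freed when the whole css died, so lingering csses exhausted the 64k
 * space; after the fix, offline frees the slot immediately. */
#define TOY_ID_MAX 4
static unsigned char toy_ids[TOY_ID_MAX + 1]; /* 1-based; 1 = in use */

static int toy_id_alloc(void)
{
	for (int i = 1; i <= TOY_ID_MAX; i++) {
		if (!toy_ids[i]) {
			toy_ids[i] = 1;
			return i;
		}
	}
	return -1; /* -ENOSPC: "No space left on device" */
}

static void toy_id_free(int id)
{
	toy_ids[id] = 0;
}
```

Freeing at offline (rather than at final css release) is exactly what lets `mkdir /cgroup/foo` succeed again in the reproducer, even while old css objects wait for reclaim.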
[Devel] [PATCH RH7 3/9] mm: memcontrol: fix swap counter leak on swapout from offline cgroup
From: Vladimir Davydov

An offline memory cgroup might have anonymous memory or shmem left charged to it and no swap. Since only swap entries pin the id of an offline cgroup, such a cgroup will have no id and so an attempt to swapout its anon/shmem will not store memory cgroup info in the swap cgroup map. As a result, memcg->swap or memcg->memsw will never get uncharged from it and any of its ascendants.

Fix this by always charging swapout to the first ancestor cgroup that hasn't released its id yet.

[han...@cmpxchg.org: add comment to mem_cgroup_swapout]
[vdavy...@virtuozzo.com: use WARN_ON_ONCE() in mem_cgroup_id_get_online()]
Link: http://lkml.kernel.org/r/20160803123445.GJ13263@esperanza
Fixes: 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
Link: http://lkml.kernel.org/r/5336daa5c9a32e776067773d9da655d2dc126491.1470219853.git.vdavy...@virtuozzo.com
Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: [3.19+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

https://jira.vzint.dev/browse/PSBM-147036

(cherry picked from commit 1f47b61fb4077936465dcde872a4e5cc4fe708da)
Signed-off-by: Pavel Tikhomirov
---
 mm/memcontrol.c | 39 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 34 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6356b6532163..a9a1fb354d33 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6679,6 +6679,24 @@ static void mem_cgroup_id_get(struct mem_cgroup *memcg)
 	atomic_inc(&memcg->id.ref);
 }
 
+static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
+{
+	while (!atomic_inc_not_zero(&memcg->id.ref)) {
+		/*
+		 * The root cgroup cannot be destroyed, so its refcount must
+		 * always be >= 1.
+		 */
+		if (WARN_ON_ONCE(memcg == root_mem_cgroup)) {
+			VM_BUG_ON(1);
+			break;
+		}
+		memcg = parent_mem_cgroup(memcg);
+		if (!memcg)
+			memcg = root_mem_cgroup;
+	}
+	return memcg;
+}
+
 static void mem_cgroup_id_put(struct mem_cgroup *memcg)
 {
 	if (atomic_dec_and_test(&memcg->id.ref)) {
@@ -7759,7 +,7 @@ static void __init enable_swap_cgroup(void)
  */
 void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 {
-	struct mem_cgroup *memcg;
+	struct mem_cgroup *memcg, *swap_memcg;
 	struct page_cgroup *pc;
 	unsigned short oldid;
@@ -7778,17 +7796,28 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), page);
 	memcg = pc->mem_cgroup;
 
-	mem_cgroup_id_get(memcg);
-	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));
+	/*
+	 * In case the memcg owning these pages has been offlined and doesn't
+	 * have an ID allocated to it anymore, charge the closest online
+	 * ancestor for the swap instead and transfer the memory+swap charge.
+	 */
+	swap_memcg = mem_cgroup_id_get_online(memcg);
+	oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg));
 	VM_BUG_ON_PAGE(oldid, page);
-	mem_cgroup_swap_statistics(memcg, true);
-	this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PSWPOUT]);
+	mem_cgroup_swap_statistics(swap_memcg, true);
+	this_cpu_inc(swap_memcg->stat->events[MEM_CGROUP_EVENTS_PSWPOUT]);
 
 	pc->flags = 0;
 
 	if (!mem_cgroup_is_root(memcg))
 		page_counter_uncharge(&memcg->memory, 1);
 
+	if (memcg != swap_memcg) {
+		if (!mem_cgroup_is_root(swap_memcg))
+			page_counter_charge(&swap_memcg->memsw, 1);
+		page_counter_uncharge(&memcg->memsw, 1);
+	}
+
 	/* XXX: caller holds IRQ-safe mapping->tree_lock */
 	VM_BUG_ON(!irqs_disabled());
-- 
2.40.1
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
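The heart of this patch is the ancestor walk: if `atomic_inc_not_zero()` fails because the cgroup's id refcount already dropped to zero (the cgroup is offline with no id), charging falls through to the nearest ancestor whose id is still live, with the root guaranteed live. A userspace sketch of that walk (illustrative names, plain ints instead of kernel atomics):

```c
#include <assert.h>
#include <stddef.h>

struct toy_cg {
	int id_ref;            /* 0 models atomic_inc_not_zero() failing */
	struct toy_cg *parent;
};

/* Sketch of mem_cgroup_id_get_online(): walk up until a cgroup whose
 * id refcount is still nonzero, falling back to the root. */
static struct toy_cg *toy_id_get_online(struct toy_cg *cg, struct toy_cg *root)
{
	while (cg->id_ref == 0) {
		if (cg == root)
			break; /* the kernel WARNs here: root must stay live */
		cg = cg->parent ? cg->parent : root;
	}
	cg->id_ref++; /* the successful "inc" half of inc_not_zero */
	return cg;
}
```

Note the real code does the check and the increment in a single `atomic_inc_not_zero()`; the split here is only to keep the model readable.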
[Devel] [PATCH RH7 7/9] mm: memcontrol: add sanity checks for memcg->id.ref on get/put
From: Vladimir Davydov

Link: http://lkml.kernel.org/r/1c5ddb1c171dbdfc3262252769d6138a29b35b70.1470219853.git.vdavy...@virtuozzo.com
Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

https://jira.vzint.dev/browse/PSBM-147036

(cherry picked from commit 58fa2a5512d9f224775fb01433f195e639953c5f)
Signed-off-by: Pavel Tikhomirov
---
 mm/memcontrol.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index af99281292e5..42ca14e637ef 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6684,6 +6684,7 @@ static void mem_cgroup_id_remove(struct mem_cgroup *memcg)
 
 static void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n)
 {
+	VM_BUG_ON(atomic_read(&memcg->id.ref) <= 0);
 	atomic_add(n, &memcg->id.ref);
 }
 
@@ -6707,6 +6708,7 @@ static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
 static void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n)
 {
+	VM_BUG_ON(atomic_read(&memcg->id.ref) < n);
 	if (atomic_sub_and_test(n, &memcg->id.ref)) {
 		mem_cgroup_id_remove(memcg);
 
@@ -6965,7 +6967,7 @@ mem_cgroup_css_online(struct cgroup *cont)
 	memcg = mem_cgroup_from_cont(cont);
 
 	/* Online state pins memcg ID, memcg ID pins CSS */
-	mem_cgroup_id_get(memcg);
+	atomic_set(&memcg->id.ref, 1);
 	css_get(&memcg->css);
 
 	if (!cont->parent) {
-- 
2.40.1
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH7 5/9] mm/memcg: fix refcount error while moving and swapping
From: Hugh Dickins

It was hard to keep a test running, moving tasks between memcgs with move_charge_at_immigrate, while swapping: mem_cgroup_id_get_many()'s refcount is discovered to be 0 (supposedly impossible), so it is then forced to REFCOUNT_SATURATED, and after thousands of warnings in quick succession, the test is at last put out of misery by being OOM killed.

This is because of the way moved_swap accounting was saved up until the task move gets completed in __mem_cgroup_clear_mc(), deferred from when mem_cgroup_move_swap_account() actually exchanged old and new ids. Concurrent activity can free up swap quicker than the task is scanned, bringing the id refcount down to 0 (which should only be possible when offlining).

Just skip that optimization: do that part of the accounting immediately.

Fixes: 615d66c37c75 ("mm: memcontrol: fix memcg id ref counter on swap charge move")
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Reviewed-by: Alex Shi
Cc: Johannes Weiner
Cc: Alex Shi
Cc: Shakeel Butt
Cc: Michal Hocko
Cc: 
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2007071431050.4726@eggly.anvils
Signed-off-by: Linus Torvalds

https://jira.vzint.dev/browse/PSBM-147036

(cherry picked from commit 8d22a9351035ef2ff12ef163a1091b8b8cf1e49c)
Signed-off-by: Pavel Tikhomirov
---
 mm/memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8b78c7c8d3e3..8c14585f5e86 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7465,7 +7465,6 @@ static void __mem_cgroup_clear_mc(void)
 			page_counter_uncharge(&mc.to->memory, mc.moved_swap);
 		}
 
-		mem_cgroup_id_get_many(mc.to, mc.moved_swap);
 		mc.moved_swap = 0;
 	}
@@ -7625,7 +7624,8 @@ put:			/* get_mctgt_type() gets the page */
 			ent = target.ent;
 			if (!mem_cgroup_move_swap_account(ent, mc.from, mc.to)) {
 				mc.precharge--;
-				/* we fixup refcnts and charges later. */
+				mem_cgroup_id_get_many(mc.to, 1);
+				/* we fixup other refcnts and charges later. */
 				mc.moved_swap++;
 			}
 			break;
-- 
2.40.1
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
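The fix above moves the target memcg's id reference from a deferred batch (`mem_cgroup_id_get_many(mc.to, mc.moved_swap)` at the end of the move) to one reference taken per entry at the moment the swap record is exchanged. A toy model of the fixed ordering (illustrative names, single-threaded stand-in for the race):

```c
#include <assert.h>

struct toy_id {
	int ref;
};

/* Per-entry move with the fix applied: the reference is taken at the
 * same time the swap entry starts pointing at mc.to. */
static void toy_move_entry(struct toy_id *to, int *moved_swap)
{
	to->ref++; /* mem_cgroup_id_get_many(mc.to, 1) */
	(*moved_swap)++;
}

/* Concurrent swap-in frees an entry and drops one reference; returning
 * -1 models the "refcount found to be 0" warning the patch eliminates. */
static int toy_swapin_free(struct toy_id *cg)
{
	if (cg->ref == 0)
		return -1; /* would underflow: the bug's symptom */
	cg->ref--;
	return 0;
}
```

With the deferred scheme, a swap-in between the exchange and `__mem_cgroup_clear_mc()` dropped a reference that had never been taken; with the immediate get, every outstanding entry is always backed by a reference.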
[Devel] [PATCH RH7 0/9] memcg: release id when offlining cgroup
We see that a container user can deplete memory cgroup IDs on the system (there are only 64k of them) and thus prevent further memory cgroup creation. In a crash dump collected by our customer in such a situation, we see that mem_cgroup_idr is full of cgroups from one container with the same exact path (the cgroup of the docker service). These cgroups are not released because they hold kmem charges; the kmem charge is for a tmpfs dentry allocated from the cgroup. (And on the vz7 kernel it seems that such a dentry is only released after umounting the tmpfs or removing the corresponding file from tmpfs.) So there is a valid way to hold a kmem cgroup for a long time.

A similar problem was mentioned in mainstream, with page cache holding a kmem cgroup for a long time, and the proposed way to deal with it is the same: release the cgroup id early, so that new cgroups can be allocated immediately.

Reproduce: https://git.vzint.dev/users/ptikhomirov/repos/helpers/browse/memcg-related/test-mycg-tmpfs.sh

After this fix the number of memory cgroups in /proc/cgroups can show > 64k, as we now allow memory cgroups to hang around after their ids have been released.

Note: Maybe it's a bad idea to allow a container to eat kernel memory with such hanging cgroups, but I don't have better ideas yet.
https://jira.vzint.dev/browse/PSBM-147473 https://jira.vzint.dev/browse/PSBM-147036 Signed-off-by: Pavel Tikhomirov Arnd Bergmann (1): mm: memcontrol: avoid unused function warning Hugh Dickins (1): mm/memcg: fix refcount error while moving and swapping Johannes Weiner (2): mm: memcontrol: uncharge pages on swapout mm: memcontrol: fix cgroup creation failure after many small jobs Kirill Tkhai (1): memcg: remove memcg_cgroup::id from IDR on mem_cgroup_css_alloc() failure Qian Cai (1): mm/memcontrol.c: fix a -Wunused-function warning Vladimir Davydov (3): mm: memcontrol: fix swap counter leak on swapout from offline cgroup mm: memcontrol: fix memcg id ref counter on swap charge move mm: memcontrol: add sanity checks for memcg->id.ref on get/put mm/memcontrol.c | 134 ++-- 1 file changed, 106 insertions(+), 28 deletions(-) -- 2.40.1 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH7 1/9] mm: memcontrol: uncharge pages on swapout
From: Johannes Weiner

This series gets rid of the remaining page_cgroup flags, thus cutting the memcg per-page overhead down to one pointer.

This patch (of 4):

mem_cgroup_swapout() is called with exclusive access to the page at the end of the page's lifetime. Instead of clearing the PCG_MEMSW flag and deferring the uncharge, just do it right away. This allows follow-up patches to simplify the uncharge code.

Signed-off-by: Johannes Weiner
Cc: Hugh Dickins
Acked-by: Michal Hocko
Acked-by: Vladimir Davydov
Reviewed-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

https://jira.vzint.dev/browse/PSBM-147036

(cherry picked from commit 7bdd143c37e591c254d0991ac398a53f3f9ef1af)
Signed-off-by: Pavel Tikhomirov
---
 mm/memcontrol.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f8a530e29b7a..a334e9a1a311 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7741,6 +7741,7 @@ static void __init enable_swap_cgroup(void)
  */
 void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 {
+	struct mem_cgroup *memcg;
 	struct page_cgroup *pc;
 	unsigned short oldid;
@@ -7757,14 +7758,23 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 		return;
 
 	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), page);
+	memcg = pc->mem_cgroup;
 
-	oldid = swap_cgroup_record(entry, mem_cgroup_id(pc->mem_cgroup));
+	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));
 	VM_BUG_ON_PAGE(oldid, page);
+	mem_cgroup_swap_statistics(memcg, true);
+	this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PSWPOUT]);
+
+	pc->flags = 0;
 
-	pc->flags &= ~PCG_MEMSW;
-	css_get(&pc->mem_cgroup->css);
-	mem_cgroup_swap_statistics(pc->mem_cgroup, true);
-	this_cpu_inc(pc->mem_cgroup->stat->events[MEM_CGROUP_EVENTS_PSWPOUT]);
+	if (!mem_cgroup_is_root(memcg))
+		page_counter_uncharge(&memcg->memory, 1);
+
+	/* XXX: caller holds IRQ-safe mapping->tree_lock */
+	VM_BUG_ON(!irqs_disabled());
+
+	mem_cgroup_charge_statistics(memcg, page, -1);
+	memcg_check_events(memcg, page);
 }
 
 /**
-- 
2.40.1
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
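The change above replaces "set a flag now, uncharge later" with an immediate uncharge, which is safe only because swapout holds exclusive access to the page. A minimal model of that simplification (illustrative names, a plain long standing in for the page counter):

```c
#include <assert.h>

/* Toy page counter: usage in pages. */
struct toy_counter {
	long usage;
};

/* Sketch of the patch's idea: with exclusive access at end of page
 * life, uncharge memcg->memory right away instead of deferring via a
 * PCG_MEMSW-style flag. The root memcg is never charged, so it is
 * never uncharged either. */
static void toy_swapout(struct toy_counter *memory, int is_root)
{
	/* record the swap entry, bump swap statistics ... then: */
	if (!is_root)
		memory->usage -= 1; /* page_counter_uncharge(&memcg->memory, 1) */
}
```

Dropping the deferred path is what lets the later patches in the series delete the flag handling from the generic uncharge code.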
Re: [Devel] [PATCH vz7 v3] ploop: properly restore old delta kobject on replace error
On 15.06.2023 04:23, Alexander Atanasov wrote:

The current code removes the old_delta kobject before the new delta is completely ready to be used. In case there is an error while reading the BAT (the "too short BAT" error we have seen), the new delta and its objects are destroyed properly, but the original delta kobject is not restored, so the pdelta sysfs dir stays empty and userspace tools get confused, since they consult these files before performing operations on the ploop device.

To fix this, instead of deleting the old delta's kobject early, move the deletion to the end and on error restore the original kobject. Extract the code that renames a kobject into a function and clean up its users. Since kobject_add can fail, get an extra reference on error so a later kobject_del/kobject_put would not destroy it unexpectedly; the object stays intact except that it is not visible in sysfs.

Fixes: 5f3ee110e6f4 (ploop: Repopulate holes_bitmap on changing delta, 2019-04-17)
https://jira.vzint.dev/browse/PSBM-146797
Reviewed-by: Pavel Tikhomirov
Signed-off-by: Alexander Atanasov
---
 drivers/block/ploop/dev.c | 54 ++++++++++++++++++++++++++++--------------------------
 1 file changed, 28 insertions(+), 26 deletions(-)

v1->v2: Addressing review comments. Removed the rename to minimize possibility of errors; added getting extra refs after a failed add.
v2->v3: dropped the idea of extra refs since they are not necessary; double kobject_put is not an issue since parent is cleared and kset objects are not used.
diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 6eb22168b5fe..0d6272b39863 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -3594,27 +3594,24 @@ static int ploop_replace_delta(struct ploop_device * plo, unsigned long arg)
 	if (err)
 		goto out_destroy;
 
-	kobject_del(&old_delta->kobj);
-
-	err = KOBJECT_ADD(&delta->kobj, kobject_get(&plo->kobj),
-			  "%d", delta->level);
-	/* _put below is a pair for _get for OLD delta */
-	kobject_put(&plo->kobj);
-
-	if (err < 0) {
-		kobject_put(&delta->kobj);
-		goto out_close;
-	}
-
 	ploop_quiesce(plo);
 	if (delta->ops->replace_delta) {
 		err = delta->ops->replace_delta(delta);
 		if (err) {
 			ploop_relax(plo);
-			goto out_kobj_del;
+			goto out_close;
 		}
 	}
+	/* Remove old delta kobj to avoid name collision with the new one */
+	kobject_del(&old_delta->kobj);
+	err = KOBJECT_ADD(&delta->kobj, kobject_get(&plo->kobj),
+			  "%d", delta->level);
+	if (err < 0) {
+		/* _put for failed _ADD */
+		kobject_put(&delta->kobj);
+		goto out_kobj_restore;
+	}
 	ploop_map_destroy(&plo->map);
 	list_replace_init(&old_delta->list, &delta->list);
 	ploop_delta_list_changed(plo);
@@ -3632,11 +3629,14 @@ static int ploop_replace_delta(struct ploop_device * plo, unsigned long arg)
 	old_delta->ops->stop(old_delta);
 	old_delta->ops->destroy(old_delta);
 	kobject_put(&old_delta->kobj);
-	return 0;
-
-out_kobj_del:
-	kobject_del(&delta->kobj);
 	kobject_put(&delta->kobj);
+	return 0;
+out_kobj_restore:
+	/* we haven't dropped our plo->kobj ref, just add back */
+	err = KOBJECT_ADD(&old_delta->kobj, &plo->kobj, "%d", old_delta->level);
+	if (err < 0)
+		/* Nothing we can do unfortunately */
+		PL_ERR(plo, "Failed to restore old delta kobject\n");
 out_close:
 	delta->ops->stop(delta);
 out_destroy:
@@ -3965,6 +3965,16 @@ static void renumber_deltas(struct ploop_device * plo)
 	}
 }
 
+/*
+ * Have to implement own version of kobject_rename since it is GPL only symbol
+ */
+static int ploop_rename_delta(struct ploop_delta *delta, int level, char *pref)
+{
+	kobject_del(&delta->kobj);
+	return KOBJECT_ADD(&delta->kobj, &delta->plo->kobj,
+			   "%s%d", pref ? : "", level);
+}
+
 static void rename_deltas(struct ploop_device * plo, int level)
 {
 	struct ploop_delta * delta;
@@ -3974,15 +3984,7 @@ static void rename_deltas(struct ploop_device * plo, int level)
 		if (delta->level < level)
 			continue;
-#if 0
-		/* Oops, kobject_rename() is not exported! */
-		sprintf(nname, "%d", delta->level);
-		err = kobject_rename(&delta->kobj, nname);
-#else
-		kobject_del(&delta->kobj);
-		err = KOBJECT_ADD(&delta->kobj, &plo->kobj,
-				  "%d", delta->level);
-#endif
+		err = ploop_rename_delta(delta, delta->level, NULL);
 		if (err)
 			PL_WARN(plo, "rename_deltas:
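The essence of the reordering in this patch is a transactional rule for the sysfs entry: never remove the old delta's entry until the new delta is fully prepared, and if registering the new entry fails, put the old one back. A userspace model of that rule (names and the char-array stand-in for the pdelta directory are mine):

```c
#include <assert.h>
#include <string.h>

/* Stands in for the single name slot in the pdelta sysfs dir. */
struct toy_dir {
	char name[16];
};

/* Sketch of the fixed ordering in ploop_replace_delta():
 * - preparation failure (BAT read, replace_delta) leaves the old entry
 *   untouched, since it has not been deleted yet;
 * - a failed add of the new entry restores the old one. */
static int toy_replace(struct toy_dir *dir, int new_ready, int add_fails)
{
	if (!new_ready)
		return -1;              /* old entry untouched: nothing to undo */
	dir->name[0] = '\0';            /* kobject_del(&old_delta->kobj) only now */
	if (add_fails) {                /* KOBJECT_ADD() of the new delta failed */
		strcpy(dir->name, "old"); /* restore the original kobject */
		return -1;
	}
	strcpy(dir->name, "new");
	return 0;
}
```

The invariant the test below checks is exactly what the userspace tools rely on: after any failed replace, the pdelta dir still shows the old delta.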
Re: [Devel] [PATCH vz7 v2] ploop: properly restore old delta kobject on replace error
On 09.06.2023 00:19, Alexander Atanasov wrote: Current code removes the old_delta kobject before the new delta is ready to be used in case there is an error while reading BAT, the "too short BAT" error we have seen. The above sentence seems incomplete, as in an opposite case where there was no error, old_delta kobject would have been removed before new delta is ready anyway =) The new delta and its objects are destroyed properly but the original delta kobject is never added back, so pdelta sysfs dir stay empty and userspace tools get confused since they consult this files before performing operations on the ploop device. At later point when ploop device is destroyed a second kobject_del will be called and it can lead to a crash or corrupted memory. To fix this instead of deleting the delta early move it at the end, on error restore back original. Extract code to rename kobject into a function and cleanup its users. Since kobject_add can fail, get an extra reference on error so later kobject_del/kobject_put would not destroy it unexpectedly. Object should be intact except that it would not be visible in sysfs. Fixes: 5f3ee110e6f4 (ploop: Repopulate holes_bitmap on changing delta, 2019-04-17) https://jira.vzint.dev/browse/PSBM-146797 Signed-off-by: Alexander Atanasov --- drivers/block/ploop/dev.c | 63 +++ 1 file changed, 37 insertions(+), 26 deletions(-) v1->v2: Addressing review comments. Removed the rename to minimize possibility of errors. added getting extra refs after failed add. 
diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c index 6eb22168b5fe..2833865b087f 100644 --- a/drivers/block/ploop/dev.c +++ b/drivers/block/ploop/dev.c @@ -3557,6 +3557,16 @@ static int ploop_add_delta(struct ploop_device * plo, unsigned long arg) return err; } +/* + * Have to implement own version of kobject_rename since it is GPL only symbol + */ +static int ploop_rename_delta(struct ploop_delta *delta, int level, char *pref) +{ + kobject_del(>kobj); + return KOBJECT_ADD(>kobj, >plo->kobj, + "%s%d", pref ? : "", level); +} + It would probably be cleaner to put this function just before the function using it (rename_deltas). static int ploop_replace_delta(struct ploop_device * plo, unsigned long arg) { int err; @@ -3594,27 +3604,25 @@ static int ploop_replace_delta(struct ploop_device * plo, unsigned long arg) if (err) goto out_destroy; - kobject_del(_delta->kobj); - - err = KOBJECT_ADD(>kobj, kobject_get(>kobj), - "%d", delta->level); - /* _put below is a pair for _get for OLD delta */ - kobject_put(>kobj); - - if (err < 0) { - kobject_put(>kobj); - goto out_close; - } ^ double newline after code removal ploop_quiesce(plo); if (delta->ops->replace_delta) { err = delta->ops->replace_delta(delta); if (err) { ploop_relax(plo); - goto out_kobj_del; + goto out_close; } } + /* Remove old delta kobj to avoid name collision with the new one */ + kobject_del(_delta->kobj); + err = KOBJECT_ADD(>kobj, kobject_get(>kobj), + "%d", delta->level); + if (err < 0) { + /* _put for failed _ADD */ + kobject_put(>kobj); + goto out_kobj_restore; + } ploop_map_destroy(>map); list_replace_init(_delta->list, >list); ploop_delta_list_changed(plo); @@ -3632,11 +3640,19 @@ static int ploop_replace_delta(struct ploop_device * plo, unsigned long arg) old_delta->ops->stop(old_delta); old_delta->ops->destroy(old_delta); kobject_put(_delta->kobj); - return 0; - -out_kobj_del: - kobject_del(>kobj); kobject_put(>kobj); + return 0; +out_kobj_restore: + /* we haven't dropped our 
plo->kobj ref just add back */ + err = KOBJECT_ADD(_delta->kobj, >kobj, "%d", old_delta->level); + if (err < 0) { + PL_ERR(plo, "Failed to restore old delta kobject\n"); + /* +* Get an extra ref to parent object as kobject_add does, so it +* later kobject_del doesn't destory it +*/ + kobject_get(>kobj); Later kobject_del would not see parent object as first kobject_del did "kobj->parent = NULL;", so in this aspect you should not be getting reference here. note: Problem I was talking in my previous message was about kobj->kset put. + } out_close: delta->ops->stop(delta); out_destroy: @@ -3974,17 +3990,12 @@ static void rename_deltas(struct ploop_device * plo, int level) if (delta->level < level) continue; -#if 0 - /* Oops, kobject_rename() is not
Re: [Devel] [PATCH vz7 v2] ploop: increase logging on errors when opening new deltas
On 09.06.2023 00:16, Alexander Atanasov wrote: Ocassionally we got EBUSY but it is a bit over used, so it is not clear what it means. Add more logging to catch the source of the error. https://jira.vzint.dev/browse/PSBM-146836 Reviewed-by: Pavel Tikhomirov Signed-off-by: Alexander Atanasov --- drivers/block/ploop/dev.c | 9 - drivers/block/ploop/fmt_ploop1.c | 4 +++- drivers/block/ploop/io_kaio.c | 13 ++--- drivers/block/ploop/io_kaio_map.c | 7 +-- 4 files changed, 26 insertions(+), 7 deletions(-) v1->v2: addressing review comments - print the error value and fix coding style diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c index 2833865b087f..8e238eafd9f2 100644 --- a/drivers/block/ploop/dev.c +++ b/drivers/block/ploop/dev.c @@ -3582,8 +3582,11 @@ static int ploop_replace_delta(struct ploop_device * plo, unsigned long arg) sizeof(struct ploop_ctl_chunk))) return -EFAULT; - if (plo->maintenance_type != PLOOP_MNTN_OFF) + if (plo->maintenance_type != PLOOP_MNTN_OFF) { + if (printk_ratelimit()) + PL_WARN(plo, "Attempt to replace while in maintenance mode\n"); return -EBUSY; + } old_delta = find_delta(plo, ctl.pctl_level); if (old_delta == NULL) @@ -3596,6 +3599,10 @@ static int ploop_replace_delta(struct ploop_device * plo, unsigned long arg) if (IS_ERR(delta)) return PTR_ERR(delta); + WARN_ONCE(delta->ops != old_delta->ops, + "New delta uses different io %p vs %p\n", + delta->ops, old_delta->ops); + err = delta->ops->compose(delta, 1, ); if (err) goto out_destroy; diff --git a/drivers/block/ploop/fmt_ploop1.c b/drivers/block/ploop/fmt_ploop1.c index e59a9eb50ac2..a0369db35c83 100644 --- a/drivers/block/ploop/fmt_ploop1.c +++ b/drivers/block/ploop/fmt_ploop1.c @@ -314,8 +314,10 @@ ploop1_open(struct ploop_delta * delta) if (!(delta->flags & PLOOP_FMT_RDONLY)) { pvd_header_set_disk_in_use(vh); err = delta->io.ops->sync_write(>io, ph->dyn_page, 4096, 0, 0); - if (err) + if (err) { + PL_ERR(delta->plo, "failed to update in use err=%d\n", err); goto 
out_err; + } } delta->io.alloc_head = ph->alloc_head; diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c index 35c0fad43baf..098f6c7b5f2d 100644 --- a/drivers/block/ploop/io_kaio.c +++ b/drivers/block/ploop/io_kaio.c @@ -966,17 +966,23 @@ static int kaio_open(struct ploop_io * io) io->files.bdev = io->files.inode->i_sb->s_bdev; err = io->ops->sync(io); - if (err) + if (err) { + PL_WARN(delta->plo, "open failed to sync err=%d\n", err); return err; + } mutex_lock(>files.inode->i_mutex); err = kaio_invalidate_cache(io); - if (!err) + if (err) + PL_WARN(delta->plo, "invaldiate_cache failed err=%d\n", err); + else err = ploop_kaio_open(file, delta->flags & PLOOP_FMT_RDONLY); mutex_unlock(>files.inode->i_mutex); - if (err) + if (err) { + PL_WARN(delta->plo, "open failed err=%d\n", err); return err; + } io->files.em_tree = _em_tree; @@ -988,6 +994,7 @@ static int kaio_open(struct ploop_io * io) err = PTR_ERR(io->fsync_thread); io->fsync_thread = NULL; ploop_kaio_close(io->files.mapping, 0); + PL_WARN(delta->plo, "fsync thread start failed=%d\n", err); return err; } diff --git a/drivers/block/ploop/io_kaio_map.c b/drivers/block/ploop/io_kaio_map.c index d4ff39d95e74..59e35c562ef9 100644 --- a/drivers/block/ploop/io_kaio_map.c +++ b/drivers/block/ploop/io_kaio_map.c @@ -35,10 +35,13 @@ int ploop_kaio_open(struct file * file, int rdonly) else m->readers++; } else { - if (m->readers) + if (m->readers) { + pr_warn("File is already active:%d\n", + m->readers); err = -EBUSY; - else + } else { m->readers = -1; + } } goto kaio_open_done;
Re: [Devel] [PATCH vz7] ploop: properly restore old delta kobject on replace error
On 22.05.2023 23:19, Alexander Atanasov wrote: The current code removes the old_delta kobject before the new delta is ready to be used, in case there is an error while reading the BAT (the "too short BAT" error we have seen). The new delta and its objects are destroyed properly, but the original delta kobject is never added back, so the pdelta sysfs dir stays empty and userspace tools get confused, since they consult these files before performing operations on the ploop device. At a later point, when the ploop device is destroyed, a second kobject_del will be called, and it can lead to a crash or corrupted memory. To fix this, instead of deleting the delta early, rename it with a prefix and move the deletion to the end; on error, rename it back to the original name. This way, if something goes wrong, it will be visible in sysfs as old_N, where N is the delta level. Extract the code to rename a kobject into a function and clean up its users. Fixes: commit 5f3ee110e6f4 ("ploop: Repopulate holes_bitmap on changing delta") https://jira.vzint.dev/browse/PSBM-146797 Signed-off-by: Alexander Atanasov --- drivers/block/ploop/dev.c | 47 ++- 1 file changed, 26 insertions(+), 21 deletions(-) diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c index 6eb22168b5fe..0ed63f9da1cb 100644 --- a/drivers/block/ploop/dev.c +++ b/drivers/block/ploop/dev.c @@ -3557,6 +3557,16 @@ static int ploop_add_delta(struct ploop_device * plo, unsigned long arg) return err; } +/* + * Have to implement our own version of kobject_rename since it is a GPL-only symbol + */ +static int ploop_rename_delta(struct ploop_delta *delta, int level, char *pref) +{ + kobject_del(&delta->kobj); + return KOBJECT_ADD(&delta->kobj, &delta->plo->kobj, + "%s%d", pref ? pref : "", level); +} + static int ploop_replace_delta(struct ploop_device * plo, unsigned long arg) { int err; @@ -3594,15 +3604,9 @@ static int ploop_replace_delta(struct ploop_device * plo, unsigned long arg) if (err) goto out_destroy; - kobject_del(&old_delta->kobj); - - err = KOBJECT_ADD(&delta->kobj, kobject_get(&plo->kobj), - "%d", delta->level); - /* _put below is a pair for _get for OLD delta */ - kobject_put(&plo->kobj); - + err = ploop_rename_delta(old_delta, old_delta->level, "old_"); if (err < 0) { - kobject_put(&delta->kobj); + PL_WARN(plo, "Failed to rename old delta kobj\n"); goto out_close; I don't like this error path: if ploop_rename_delta failed, it means that we did kobject_del(&old_delta->kobj) but didn't finish the corresponding KOBJECT_ADD, and in out_close we would do ploop_rename_delta again, where we would do a second kobject_del(&old_delta->kobj) and the corresponding KOBJECT_ADD. This would do a double kobject_del->kobj_kset_leave->kset_put; not sure whether it would break something or not, though. I mean, maybe it's better to preserve the original out_close and add a new label, out_restore_rename.
} @@ -3611,10 +3615,17 @@ static int ploop_replace_delta(struct ploop_device * plo, unsigned long arg) err = delta->ops->replace_delta(delta); if (err) { ploop_relax(plo); - goto out_kobj_del; + goto out_close; } } + err = KOBJECT_ADD(>kobj, kobject_get(>kobj), + "%d", delta->level); + if (err < 0) { + /* _put for failed _ADD */ + kobject_put(>kobj); + goto out_close; + } ploop_map_destroy(>map); list_replace_init(_delta->list, >list); ploop_delta_list_changed(plo); @@ -3631,13 +3642,15 @@ static int ploop_replace_delta(struct ploop_device * plo, unsigned long arg) old_delta->ops->stop(old_delta); old_delta->ops->destroy(old_delta); + kobject_del(_delta->kobj); kobject_put(_delta->kobj); + kobject_put(>kobj); return 0; -out_kobj_del: - kobject_del(>kobj); - kobject_put(>kobj); out_close: + err = ploop_rename_delta(old_delta, old_delta->level, NULL); + if (err < 0) + PL_ERR(plo, "Failed to restore old delta kobj name\n"); delta->ops->stop(delta); out_destroy: delta->ops->destroy(delta); @@ -3974,15 +3987,7 @@ static void rename_deltas(struct ploop_device * plo, int level) if (delta->level < level) continue; -#if 0 - /* Oops, kobject_rename() is not exported! */ - sprintf(nname, "%d", delta->level); - err = kobject_rename(>kobj, nname); -#else - kobject_del(>kobj); - err = KOBJECT_ADD(>kobj, >kobj, - "%d", delta->level); -#endif + err = ploop_rename_delta(delta, delta->level, NULL); if (err) PL_WARN(plo,
Re: [Devel] [PATCH vz7] ploop: increase logging on errors when opening new deltas
Looks good, see minor nits in inline comments. On 22.05.2023 23:39, Alexander Atanasov wrote: Occasionally we get EBUSY, but it is a bit overused, so it is not clear what it means. Add more logging to catch the source of the error. https://jira.vzint.dev/browse/PSBM-146836 Signed-off-by: Alexander Atanasov --- drivers/block/ploop/dev.c | 9 - drivers/block/ploop/fmt_ploop1.c | 4 +++- drivers/block/ploop/io_kaio.c | 13 ++--- drivers/block/ploop/io_kaio_map.c | 6 -- 4 files changed, 25 insertions(+), 7 deletions(-) diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c index 6eb22168b5fe..75e427927713 100644 --- a/drivers/block/ploop/dev.c +++ b/drivers/block/ploop/dev.c @@ -3572,8 +3572,11 @@ static int ploop_replace_delta(struct ploop_device * plo, unsigned long arg) sizeof(struct ploop_ctl_chunk))) return -EFAULT; - if (plo->maintenance_type != PLOOP_MNTN_OFF) + if (plo->maintenance_type != PLOOP_MNTN_OFF) { + if (printk_ratelimit()) + PL_WARN(plo, "Attempt to replace while in maintenance mode\n"); return -EBUSY; + } old_delta = find_delta(plo, ctl.pctl_level); if (old_delta == NULL) @@ -3586,6 +3589,10 @@ static int ploop_replace_delta(struct ploop_device * plo, unsigned long arg) if (IS_ERR(delta)) return PTR_ERR(delta); + WARN_ONCE(delta->ops != old_delta->ops, + "New delta uses different io %p vs %p\n", + delta->ops, old_delta->ops); + err = delta->ops->compose(delta, 1, ); if (err) goto out_destroy; diff --git a/drivers/block/ploop/fmt_ploop1.c b/drivers/block/ploop/fmt_ploop1.c index e59a9eb50ac2..a89804561e57 100644 --- a/drivers/block/ploop/fmt_ploop1.c +++ b/drivers/block/ploop/fmt_ploop1.c @@ -314,8 +314,10 @@ ploop1_open(struct ploop_delta * delta) if (!(delta->flags & PLOOP_FMT_RDONLY)) { pvd_header_set_disk_in_use(vh); err = delta->io.ops->sync_write(&delta->io, ph->dyn_page, 4096, 0, 0); - if (err) + if (err) { + PL_ERR(delta->plo, "write failed updating in use\n"); Don't we need err printed here? (like in other places below where we have err printed) goto out_err; + } } delta->io.alloc_head = ph->alloc_head; diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c index ab93d2c70bc5..b7258252deb2 100644 --- a/drivers/block/ploop/io_kaio.c +++ b/drivers/block/ploop/io_kaio.c @@ -997,17 +997,23 @@ static int kaio_open(struct ploop_io * io) io->files.bdev = io->files.inode->i_sb->s_bdev; err = io->ops->sync(io); - if (err) + if (err) { + PL_WARN(delta->plo, "open failed to sync err=%d\n", err); return err; + } mutex_lock(&io->files.inode->i_mutex); err = kaio_invalidate_cache(io); - if (!err) + if (err) + PL_WARN(delta->plo, "invalidate_cache failed err=%d\n", err); + else err = ploop_kaio_open(file, delta->flags & PLOOP_FMT_RDONLY); mutex_unlock(&io->files.inode->i_mutex); - if (err) + if (err) { + PL_WARN(delta->plo, "open failed err=%d\n", err); return err; + } io->files.em_tree = _em_tree; @@ -1019,6 +1025,7 @@ static int kaio_open(struct ploop_io * io) err = PTR_ERR(io->fsync_thread); io->fsync_thread = NULL; ploop_kaio_close(io->files.mapping, 0); + PL_WARN(delta->plo, "fsync thread start failed=%d\n", err); return err; } diff --git a/drivers/block/ploop/io_kaio_map.c b/drivers/block/ploop/io_kaio_map.c index d4ff39d95e74..5a3d11e0f4a9 100644 --- a/drivers/block/ploop/io_kaio_map.c +++ b/drivers/block/ploop/io_kaio_map.c @@ -35,9 +35,11 @@ int ploop_kaio_open(struct file * file, int rdonly) else m->readers++; } else { - if (m->readers) + if (m->readers) { + pr_warn("File is already active:%d\n", + m->readers); err = -EBUSY; - else + } else m->readers = -1; I'd rather you follow the coding style here and add {} to both branches. https://www.kernel.org/doc/html/v4.10/process/coding-style.html#placing-braces-and-spaces } goto kaio_open_done; -- Best regards, Tikhomirov Pavel Senior Software Developer,
[Devel] [PATCH RH7] ve/cgroups: fix subgroups_limit error path handling
We do ida_simple_get on root->cgroup_ida just before checking subgroups_limit, and in case the subgroups_limit is reached we don't do the corresponding ida_simple_remove to free the id. Fix it by jumping to the proper goto label, err_free_id. This may or may not be related to [1], found while investigating it. https://jira.vzint.dev/browse/PSBM-147036 [1] Fixes: 92faf0fad3e3 ("ve/cgroups: Introduce subgroups_limit control") Signed-off-by: Pavel Tikhomirov --- kernel/cgroup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index f2952d7c18dc..3f8c49b9ebe0 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -4888,7 +4888,7 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry, if (ve_root && ve_root->subgroups_limit > 0 && subgroups_count(ve_root) >= ve_root->subgroups_limit) { err = -EACCES; - goto err_free_name; + goto err_free_id; } /* -- 2.39.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH RH9 0/4] net/openvswitch: pull ms API
Ported cleanly. Would be nice to have jira issue tracking this. Reviewed-by: Pavel Tikhomirov On 25.04.2023 18:11, Andrey Zhadchenko wrote: Revert our patch for openvswitch, apply the ones that got accepted into mainstream: https://lore.kernel.org/all/20220825020450.664147-1-andrey.zhadche...@virtuozzo.com/ Andrey Zhadchenko (4): Revert "net: openvswitch: add capability to specify ifindex of new links" openvswitch: fix memory leak at failed datapath creation openvswitch: allow specifying ifindex of new interfaces openvswitch: add OVS_DP_ATTR_PER_CPU_PIDS to get requests include/uapi/linux/openvswitch.h | 3 +++ net/openvswitch/datapath.c | 29 net/openvswitch/vport-internal_dev.c | 2 +- net/openvswitch/vport.h | 4 ++-- 4 files changed, 23 insertions(+), 15 deletions(-) -- Best regards, Tikhomirov Pavel Senior Software Developer, Virtuozzo. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH RH9 2/4] openvswitch: fix memory leak at failed datapath creation
On 25.04.2023 18:11, Andrey Zhadchenko wrote: ovs_dp_cmd_new()->ovs_dp_change()->ovs_dp_set_upcall_portids() allocates array via kmalloc. If for some reason new_vport() fails during ovs_dp_cmd_new() dp->upcall_portids must be freed. Add missing kfree. Kmemleak example: unreferenced object 0x88800c382500 (size 64): comm "dump_state", pid 323, jiffies 4294955418 (age 104.347s) hex dump (first 32 bytes): 5e c2 79 e4 1f 7a 38 c7 09 21 38 0c 80 88 ff ff ^.y..z8..!8. 03 00 00 00 0a 00 00 00 14 00 00 00 28 00 00 00 (... backtrace: [<71bebc9f>] ovs_dp_set_upcall_portids+0x38/0xa0 [<0187d8bd>] ovs_dp_change+0x63/0xe0 [<2397e446>] ovs_dp_cmd_new+0x1f0/0x380 [] genl_family_rcv_msg_doit+0xea/0x150 [<8f583bc4>] genl_rcv_msg+0xdc/0x1e0 [ ] netlink_rcv_skb+0x50/0x100 [<4959cece>] genl_rcv+0x24/0x40 [<4699ac7f>] netlink_unicast+0x23e/0x360 [ ] netlink_sendmsg+0x24e/0x4b0 [<6f4aa380>] sock_sendmsg+0x62/0x70 [ ] sys_sendmsg+0x230/0x270 [<12dacf7d>] ___sys_sendmsg+0x88/0xd0 [<11776020>] __sys_sendmsg+0x59/0xa0 [<2e8f2dc1>] do_syscall_64+0x3b/0x90 [<3243e7cb>] entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: b83d23a2a38b ("openvswitch: Introduce per-cpu upcall dispatch") Acked-by: Aaron Conole Signed-off-by: Andrey Zhadchenko Link: https://lore.kernel.org/r/20220825020326.664073-1-andrey.zhadche...@virtuozzo.com Signed-off-by: Jakub Kicinski (cherry picked from ms commit a87406f4adee9c53b311d8a1ba2849c69e29a6d0) Do we have jira issue for it? 
Signed-off-by: Andrey Zhadchenko --- net/openvswitch/datapath.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c index 7e8a39a35627..6c9d153afbee 100644 --- a/net/openvswitch/datapath.c +++ b/net/openvswitch/datapath.c @@ -1802,7 +1802,7 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info) ovs_dp_reset_user_features(skb, info); } - goto err_unlock_and_destroy_meters; + goto err_destroy_portids; } err = ovs_dp_cmd_fill_info(dp, reply, info->snd_portid, @@ -1817,6 +1817,8 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info) ovs_notify(_datapath_genl_family, reply, info); return 0; +err_destroy_portids: + kfree(rcu_dereference_raw(dp->upcall_portids)); err_unlock_and_destroy_meters: ovs_unlock(); ovs_meters_exit(dp); -- Best regards, Tikhomirov Pavel Senior Software Developer, Virtuozzo. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH RH9 1/4] Revert "net: openvswitch: add capability to specify ifindex of new links"
Just a note: old criu will get EOPNOTSUPP on restore in create_one_vport on new kernel. Probably it's OK and we can afford it, as we only released beta version yet. And probably nobody will just update kernel and reboot without updating criu. On 25.04.2023 18:11, Andrey Zhadchenko wrote: This reverts commit 757ebade1eec8c6a3d1a150c8bd6f564c939c058. We should use the version upstream accepted https://jira.vzint.dev/browse/PSBM-105844 Signed-off-by: Andrey Zhadchenko --- net/openvswitch/datapath.c | 16 ++-- net/openvswitch/vport-internal_dev.c | 1 - net/openvswitch/vport.h | 2 -- 3 files changed, 2 insertions(+), 17 deletions(-) diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c index 8033c97a8d65..7e8a39a35627 100644 --- a/net/openvswitch/datapath.c +++ b/net/openvswitch/datapath.c @@ -1739,7 +1739,6 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info) struct vport *vport; struct ovs_net *ovs_net; int err; - struct ovs_header *ovs_header = info->userhdr; err = -EINVAL; if (!a[OVS_DP_ATTR_NAME] || !a[OVS_DP_ATTR_UPCALL_PID]) @@ -1780,7 +1779,6 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info) parms.dp = dp; parms.port_no = OVSP_LOCAL; parms.upcall_portids = a[OVS_DP_ATTR_UPCALL_PID]; - parms.desired_ifindex = ovs_header->dp_ifindex; /* So far only local changes have been made, now need the lock. 
*/ ovs_lock(); @@ -2201,10 +2199,7 @@ static int ovs_vport_cmd_new(struct sk_buff *skb, struct genl_info *info) if (!a[OVS_VPORT_ATTR_NAME] || !a[OVS_VPORT_ATTR_TYPE] || !a[OVS_VPORT_ATTR_UPCALL_PID]) return -EINVAL; - - parms.type = nla_get_u32(a[OVS_VPORT_ATTR_TYPE]); - - if (a[OVS_VPORT_ATTR_IFINDEX] && parms.type != OVS_VPORT_TYPE_INTERNAL) + if (a[OVS_VPORT_ATTR_IFINDEX]) return -EOPNOTSUPP; port_no = a[OVS_VPORT_ATTR_PORT_NO] @@ -2241,19 +2236,12 @@ static int ovs_vport_cmd_new(struct sk_buff *skb, struct genl_info *info) } parms.name = nla_data(a[OVS_VPORT_ATTR_NAME]); + parms.type = nla_get_u32(a[OVS_VPORT_ATTR_TYPE]); parms.options = a[OVS_VPORT_ATTR_OPTIONS]; parms.dp = dp; parms.port_no = port_no; parms.upcall_portids = a[OVS_VPORT_ATTR_UPCALL_PID]; - if (parms.type == OVS_VPORT_TYPE_INTERNAL) { - if (a[OVS_VPORT_ATTR_IFINDEX]) - parms.desired_ifindex = - nla_get_u32(a[OVS_VPORT_ATTR_IFINDEX]); - else - parms.desired_ifindex = 0; - } - vport = new_vport(); err = PTR_ERR(vport); if (IS_ERR(vport)) { diff --git a/net/openvswitch/vport-internal_dev.c b/net/openvswitch/vport-internal_dev.c index 1c25158fbdf2..1e5468137c88 100644 --- a/net/openvswitch/vport-internal_dev.c +++ b/net/openvswitch/vport-internal_dev.c @@ -157,7 +157,6 @@ static struct vport *internal_dev_create(const struct vport_parms *parms) if (vport->port_no == OVSP_LOCAL) vport->dev->features |= NETIF_F_NETNS_LOCAL; - dev->ifindex = parms->desired_ifindex; rtnl_lock(); err = register_netdevice(vport->dev); if (err) diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h index 24e1cba2f1ac..9de5030d9801 100644 --- a/net/openvswitch/vport.h +++ b/net/openvswitch/vport.h @@ -98,8 +98,6 @@ struct vport_parms { enum ovs_vport_type type; struct nlattr *options; - int desired_ifindex; - /* For ovs_vport_alloc(). */ struct datapath *dp; u16 port_no; -- Best regards, Tikhomirov Pavel Senior Software Developer, Virtuozzo. 
[Devel] [PATCH RH7 00/14] port move_mount_set_group and mount_setattr
We need this because Virtuozzo criu, after the rebase to mainstream criu in u20, will switch to this new API for sharing group settings across mounts. https://jira.sw.ru/browse/PSBM-144416 Signed-off-by: Pavel Tikhomirov Aleksa Sarai (1): lib: introduce copy_struct_from_user() helper Christian Brauner (6): namespace: take lock_mount_hash() directly when changing flags mount: make {lock,unlock}_mount_hash() static namespace: only take read lock in do_reconfigure_mnt() fs: split out functions to hold writers fs: add mount_setattr() fs: drop peer group ids under namespace lock David Howells (3): VFS: Differentiate mount flags (MS_*) from internal superblock flags vfs: Undo an overly zealous MS_RDONLY -> SB_RDONLY conversion vfs: Separate changing mount flags full remount Markus Trippelsdorf (1): VFS: Handle lazytime in do_mount() Mimi Zohar (1): vfs: fix mounting a filesystem with i_version Pavel Tikhomirov (2): mount: rename do_set_group to do_set_group_old move_mount: allow to add a mount into an existing group Documentation/filesystems/porting | 2 +- arch/x86/syscalls/syscall_32.tbl | 1 + arch/x86/syscalls/syscall_64.tbl | 1 + fs/mount.h| 10 - fs/namespace.c| 646 +- fs/super.c| 68 ++-- include/linux/fs.h| 45 ++- include/linux/mount.h | 2 +- include/linux/uaccess.h | 70 include/linux/vz_bitops.h | 11 + include/uapi/linux/fs.h | 3 +- init/do_mounts.c | 2 +- lib/strnlen_user.c| 8 +- lib/usercopy.c| 76 14 files changed, 780 insertions(+), 165 deletions(-) create mode 100644 include/linux/vz_bitops.h -- 2.39.2
[Devel] [PATCH RH7 03/14] lib: introduce copy_struct_from_user() helper
From: Aleksa Sarai A common pattern for syscall extensions is increasing the size of a struct passed from userspace, such that the zero-value of the new fields result in the old kernel behaviour (allowing for a mix of userspace and kernel vintages to operate on one another in most cases). While this interface exists for communication in both directions, only one interface is straightforward to have reasonable semantics for (userspace passing a struct to the kernel). For kernel returns to userspace, what the correct semantics are (whether there should be an error if userspace is unaware of a new extension) is very syscall-dependent and thus probably cannot be unified between syscalls (a good example of this problem is [1]). Previously there was no common lib/ function that implemented the necessary extension-checking semantics (and different syscalls implemented them slightly differently or incompletely[2]). Future patches replace common uses of this pattern to make use of copy_struct_from_user(). Some in-kernel selftests ensure that the handling of alignment and various byte patterns are all handled identically to memchr_inv() usage. [1]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and robustify sched_read_attr() ABI logic and code") [2]: For instance {sched_setattr,perf_event_open,clone3}(2) all do similar checks to copy_struct_from_user() while rt_sigprocmask(2) always rejects differently-sized struct arguments. Suggested-by: Rasmus Villemoes Signed-off-by: Aleksa Sarai Reviewed-by: Kees Cook Reviewed-by: Christian Brauner Link: https://lore.kernel.org/r/20191001011055.19283-2-cyp...@cyphar.com Signed-off-by: Christian Brauner - Need this for the mount_setattr syscall. - Dropped lib/test_user_copy.c hunks when rebasing. - Move aligned_byte_mask to vz_bitops.h, since something breaks on boot after moving it to bitops.h. - Add unsafe_get_user, user_access_begin and user_access_end helpers.
https://jira.vzint.dev/browse/PSBM-144416 (cherry picked from commit f5a1a536fa14895ccff4e94e6a5af90901ce86aa) Signed-off-by: Pavel Tikhomirov --- include/linux/uaccess.h | 70 include/linux/vz_bitops.h | 11 ++ lib/strnlen_user.c| 8 + lib/usercopy.c| 76 +++ 4 files changed, 158 insertions(+), 7 deletions(-) create mode 100644 include/linux/vz_bitops.h diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index b28cc68c1896..9f8da4aa31a1 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -112,6 +112,76 @@ static inline unsigned long __copy_from_user_nocache(void *to, ret;\ }) +extern __must_check int check_zeroed_user(const void __user *from, size_t size); + +/** + * copy_struct_from_user: copy a struct from userspace + * @dst: Destination address, in kernel space. This buffer must be @ksize + * bytes long. + * @ksize: Size of @dst struct. + * @src: Source address, in userspace. + * @usize: (Alleged) size of @src struct. + * + * Copies a struct from userspace to kernel space, in a way that guarantees + * backwards-compatibility for struct syscall arguments (as long as future + * struct extensions are made such that all new fields are *appended* to the + * old struct, and zeroed-out new fields have the same meaning as the old + * struct). + * + * @ksize is just sizeof(*dst), and @usize should've been passed by userspace. + * The recommended usage is something like the following: + * + * SYSCALL_DEFINE2(foobar, const struct foo __user *, uarg, size_t, usize) + * { + * int err; + * struct foo karg = {}; + * + * if (usize > PAGE_SIZE) + *return -E2BIG; + * if (usize < FOO_SIZE_VER0) + *return -EINVAL; + * + * err = copy_struct_from_user(, sizeof(karg), uarg, usize); + * if (err) + *return err; + * + * // ... + * } + * + * There are three cases to consider: + * * If @usize == @ksize, then it's copied verbatim. + * * If @usize < @ksize, then the userspace has passed an old struct to a + *newer kernel. 
The rest of the trailing bytes in @dst (@ksize - @usize) + *are to be zero-filled. + * * If @usize > @ksize, then the userspace has passed a new struct to an + *older kernel. The trailing bytes unknown to the kernel (@usize - @ksize) + *are checked to ensure they are zeroed, otherwise -E2BIG is returned. + * + * Returns (in all cases, some data may have been copied): + * * -E2BIG: (@usize > @ksize) and there are non-zero trailing bytes in @src. + * * -EFAULT: access to userspace failed. + */ +static __always_inline __must_check int +copy_struct_from_user(void *dst, size_t ksize, const void __user *src, + size_t usize) +{ + size_t size = min(ksize, usize); + size_t rest = max(ksize, usize) - size; + + /*
[Devel] [PATCH RH7 13/14] fs: add mount_setattr()
From: Christian Brauner This implements the missing mount_setattr() syscall. While the new mount api allows to change the properties of a superblock there is currently no way to change the properties of a mount or a mount tree using file descriptors which the new mount api is based on. In addition the old mount api has the restriction that mount options cannot be applied recursively. This hasn't changed since changing mount options on a per-mount basis was implemented in [1] and has been a frequent request not just for convenience but also for security reasons. The legacy mount syscall is unable to accommodate this behavior without introducing a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND | MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost mount. Changing MS_REC to apply to the whole mount tree would mean introducing a significant uapi change and would likely cause significant regressions. The new mount_setattr() syscall allows to recursively clear and set mount options in one shot. Multiple calls to change mount options requesting the same changes are idempotent: int mount_setattr(int dfd, const char *path, unsigned flags, struct mount_attr *uattr, size_t usize); Flags to modify path resolution behavior are specified in the @flags argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW, and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to restrict path resolution as introduced with openat2() might be supported in the future. The mount_setattr() syscall can be expected to grow over time and is designed with extensibility in mind. It follows the extensible syscall pattern we have used with other syscalls such as openat2(), clone3(), sched_{set,get}attr(), and others. 
The set of mount options is passed in the uapi struct mount_attr which currently has the following layout: struct mount_attr { __u64 attr_set; __u64 attr_clr; __u64 propagation; __u64 userns_fd; }; The @attr_set and @attr_clr members are used to clear and set mount options. This way a user can e.g. request that a set of flags is to be raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in @attr_set while at the same time requesting that another set of flags is to be lowered such as removing noexec from a mount tree by specifying MOUNT_ATTR_NOEXEC in @attr_clr. Note, since the MOUNT_ATTR_ values are an enum starting from 0, not a bitmap, users wanting to transition to a different atime setting cannot simply specify the atime setting in @attr_set, but must also specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in @attr_clr. The @propagation field lets callers specify the propagation type of a mount tree. Propagation is a single property that has four different settings and as such is not really a flag argument but an enum. Specifically, it would be unclear what setting and clearing propagation settings in combination would amount to. The legacy mount() syscall thus forbids the combination of multiple propagation settings too. The goal is to keep the semantics of mount propagation somewhat simple as they are overly complex as it is. The @userns_fd field lets user specify a user namespace whose idmapping becomes the idmapping of the mount. This is implemented and explained in detail in the next patch. 
[1]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts at remount") Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brau...@ubuntu.com Cc: David Howells Cc: Aleksa Sarai Cc: Al Viro Cc: linux-fsde...@vger.kernel.org Cc: linux-...@vger.kernel.org Reviewed-by: Christoph Hellwig Signed-off-by: Christian Brauner Changes: port syscall for x86 only, drop uapi hunks and ignore MNT_NOSYMFOLLOW as it is not yet supported (cherry picked from commit 2a1867219c7b27f928e2545782b86daaf9ad50bd) https://jira.sw.ru/browse/PSBM-144416 Signed-off-by: Pavel Tikhomirov --- arch/x86/syscalls/syscall_32.tbl | 1 + arch/x86/syscalls/syscall_64.tbl | 1 + fs/namespace.c | 312 +++ 3 files changed, 314 insertions(+) diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl index 978f07cb0ea1..7978137648ed 100644 --- a/arch/x86/syscalls/syscall_32.tbl +++ b/arch/x86/syscalls/syscall_32.tbl @@ -373,6 +373,7 @@ 428i386open_tree sys_open_tree 429i386move_mount sys_move_mount +442i386mount_setattr sys_mount_setattr 510i386getluid sys_getluid 511i386setluid sys_setluid diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl index 3c86abac9a50..7175e94309fd 100644 --- a/arch/x86/syscalls/syscall_64.tbl +++
[Devel] [PATCH RH7 05/14] vfs: fix mounting a filesystem with i_version
From: Mimi Zohar The mount i_version flag is not enabled in the new sb_flags. This patch adds the missing SB_I_VERSION flag. Fixes: e462ec5 "VFS: Differentiate mount flags (MS_*) from internal superblock flags" Signed-off-by: Mimi Zohar Signed-off-by: Al Viro (cherry picked from commit 917086ff231f614e6705927d8fe3eb6aa74b21bf) https://jira.sw.ru/browse/PSBM-144416 Signed-off-by: Pavel Tikhomirov --- fs/namespace.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/namespace.c b/fs/namespace.c index 1a62ab03692e..b60e2eab0d85 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3586,7 +3586,8 @@ long do_mount(const char *dev_name, const char __user *dir_name, SB_MANDLOCK | SB_DIRSYNC | SB_SILENT | - SB_POSIXACL); + SB_POSIXACL | + SB_I_VERSION); if (flags & MS_REMOUNT) retval = do_remount(, flags, sb_flags, mnt_flags, -- 2.39.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH7 08/14] vfs: Separate changing mount flags full remount
From: David Howells Separate just the changing of mount flags (MS_REMOUNT|MS_BIND) from full remount because the mount data will get parsed with the new fs_context stuff prior to doing a remount - and this causes the syscall to fail under some circumstances. To quote Eric's explanation: [...] mount(..., MS_REMOUNT|MS_BIND, ...) now validates the mount options string, which breaks systemd unit files with ProtectControlGroups=yes (e.g. systemd-networkd.service) when systemd does the following to change a cgroup (v1) mount to read-only: mount(NULL, "/run/systemd/unit-root/sys/fs/cgroup/systemd", NULL, MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REMOUNT|MS_BIND, NULL) ... when the kernel has CONFIG_CGROUPS=y but no cgroup subsystems enabled, since in that case the error "cgroup1: Need name or subsystem set" is hit when the mount options string is empty. Probably it doesn't make sense to validate the mount options string at all in the MS_REMOUNT|MS_BIND case, though maybe you had something else in mind. This is also worthwhile doing because we will need to add a mount_setattr() syscall to take over the remount-bind function. Reported-by: Eric Biggers Signed-off-by: David Howells Signed-off-by: Al Viro Reviewed-by: David Howells (cherry picked from commit 43f5e655eff7e124d4e484515689cba374ab698e) https://jira.sw.ru/browse/PSBM-144416 Signed-off-by: Pavel Tikhomirov --- fs/namespace.c| 146 ++ include/linux/mount.h | 2 +- 2 files changed, 93 insertions(+), 55 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 10e294cb8cda..1b7b71c46a22 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -285,13 +285,9 @@ static struct mount *alloc_vfsmnt(const char *name) * mnt_want/drop_write() will _keep_ the filesystem * r/w. 
*/ -int __mnt_is_readonly(struct vfsmount *mnt) +bool __mnt_is_readonly(struct vfsmount *mnt) { - if (mnt->mnt_flags & MNT_READONLY) - return 1; - if (mnt->mnt_sb->s_flags & MS_RDONLY) - return 1; - return 0; + return (mnt->mnt_flags & MNT_READONLY) || (mnt->mnt_sb->s_flags & MS_RDONLY); } EXPORT_SYMBOL_GPL(__mnt_is_readonly); @@ -606,11 +602,12 @@ static int mnt_make_readonly(struct mount *mnt) return ret; } -static void __mnt_unmake_readonly(struct mount *mnt) +static int __mnt_unmake_readonly(struct mount *mnt) { lock_mount_hash(); mnt->mnt.mnt_flags &= ~MNT_READONLY; unlock_mount_hash(); + return 0; } int sb_prepare_remount_readonly(struct super_block *sb) @@ -2657,21 +2654,91 @@ SYSCALL_DEFINE3(open_tree, int, dfd, const char __user *, filename, unsigned, fl return fd; } -static int change_mount_flags(struct vfsmount *mnt, int ms_flags) +/* + * Don't allow locked mount flags to be cleared. + * + * No locks need to be held here while testing the various MNT_LOCK + * flags because those flags can never be cleared once they are set. 
+ */ +static bool can_change_locked_flags(struct mount *mnt, unsigned int mnt_flags) +{ + unsigned int fl = mnt->mnt.mnt_flags; + + if ((fl & MNT_LOCK_READONLY) && + !(mnt_flags & MNT_READONLY)) + return false; + + if ((fl & MNT_LOCK_NODEV) && + !(mnt_flags & MNT_NODEV)) + return false; + + if ((fl & MNT_LOCK_NOSUID) && + !(mnt_flags & MNT_NOSUID)) + return false; + + if ((fl & MNT_LOCK_NOEXEC) && + !(mnt_flags & MNT_NOEXEC)) + return false; + + if ((fl & MNT_LOCK_ATIME) && + ((fl & MNT_ATIME_MASK) != (mnt_flags & MNT_ATIME_MASK))) + return false; + + return true; +} + +static int change_mount_ro_state(struct mount *mnt, unsigned int mnt_flags) { - int error = 0; - int readonly_request = 0; + bool readonly_request = (mnt_flags & MNT_READONLY); - if (ms_flags & MS_RDONLY) - readonly_request = 1; - if (readonly_request == __mnt_is_readonly(mnt)) + if (readonly_request == __mnt_is_readonly(>mnt)) return 0; if (readonly_request) - error = mnt_make_readonly(real_mount(mnt)); - else - __mnt_unmake_readonly(real_mount(mnt)); - return error; + return mnt_make_readonly(mnt); + + return __mnt_unmake_readonly(mnt); +} + +/* + * Update the user-settable attributes on a mount. The caller must hold + * sb->s_umount for writing. + */ +static void set_mount_attributes(struct mount *mnt, unsigned int mnt_flags) +{ + lock_mount_hash(); + mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK; + mnt->mnt.mnt_flags = mnt_flags; + touch_mnt_namespace(mnt->mnt_ns);
[Devel] [PATCH RH7 04/14] VFS: Differentiate mount flags (MS_*) from internal superblock flags
From: David Howells

Differentiate the MS_* flags passed to mount(2) from the internal flags set
in the super_block's s_flags.  s_flags are now called SB_*, with the names
and the values for the moment mirroring the MS_* flags that they're
equivalent to.

In this patch, just the headers are altered and some kernel code where
blind automated conversion isn't necessarily correct.

Note that this shows up some interesting issues:

 (1) Some MS_* flags get translated to MNT_* flags (such as MS_NODEV ->
     MNT_NODEV) without passing this on to the filesystem, but some
     filesystems set such flags anyway.

 (2) The ->remount_fs() methods of some filesystems adjust the *flags
     argument by setting MS_* flags in it, such as MS_NOATIME - but these
     flags are then scrubbed by do_remount_sb() (only the occupants of
     MS_RMT_MASK are permitted: MS_RDONLY, MS_SYNCHRONOUS, MS_MANDLOCK,
     MS_I_VERSION and MS_LAZYTIME)

I'm not sure what's the best way to solve all these cases.

Suggested-by: Al Viro
Signed-off-by: David Howells

Change: Get rid of our cmd variable in do_mount

(cherry picked from commit e462ec50cb5fad19f6003a3d8087f4a0945dd2b1)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov
---
 Documentation/filesystems/porting |  2 +-
 fs/namespace.c                    | 71 ---
 fs/super.c                        | 68 ++---
 include/linux/fs.h                | 45
 init/do_mounts.c                  |  2 +-
 5 files changed, 108 insertions(+), 80 deletions(-)

diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index fbfc7338f2bd..680e3401f7d8 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -228,7 +228,7 @@ anything from oops to silent memory corruption.
 ---
 [mandatory]
-	FS_NOMOUNT is gone.  If you use it - just set MS_NOUSER in flags
+	FS_NOMOUNT is gone.  If you use it - just set SB_NOUSER in flags
 	(see rootfs for one kind of solution and bdev/socket/pipe for another).
---
diff --git a/fs/namespace.c b/fs/namespace.c
index d10138869c91..1a62ab03692e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1065,7 +1065,7 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
 	if (!mnt)
 		return ERR_PTR(-ENOMEM);
 
-	if (flags & MS_KERNMOUNT)
+	if (flags & SB_KERNMOUNT)
 		mnt->mnt.mnt_flags = MNT_INTERNAL;
 
 	root = mount_fs(type, flags, name, data);
@@ -1121,7 +1121,7 @@ vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
 		return ERR_PTR(-EPERM);
 #endif
 
-	return vfs_kern_mount(type, MS_SUBMOUNT, name, data);
+	return vfs_kern_mount(type, SB_SUBMOUNT, name, data);
 }
 EXPORT_SYMBOL_GPL(vfs_submount);
 
@@ -1828,7 +1828,7 @@ static int do_umount(struct mount *mnt, int flags)
 			return -EPERM;
 		down_write(&sb->s_umount);
 		if (!(sb->s_flags & MS_RDONLY))
-			retval = do_remount_sb(sb, MS_RDONLY, NULL, 0);
+			retval = do_remount_sb(sb, SB_RDONLY, NULL, 0);
 		up_write(&sb->s_umount);
 		return retval;
 	}
@@ -2419,7 +2419,7 @@ static void unlock_mount(struct mountpoint *where)
 
 static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp)
 {
-	if (mnt->mnt.mnt_sb->s_flags & MS_NOUSER)
+	if (mnt->mnt.mnt_sb->s_flags & SB_NOUSER)
 		return -EINVAL;
 
 	if (S_ISDIR(mp->m_dentry->d_inode->i_mode) !=
@@ -2433,9 +2433,9 @@ static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp)
 /*
  * Sanity check the flags to change_mnt_propagation.
  */
-static int flags_to_propagation_type(int flags)
+static int flags_to_propagation_type(int ms_flags)
 {
-	int type = flags & ~(MS_REC | MS_SILENT);
+	int type = ms_flags & ~(MS_REC | MS_SILENT);
 
 	/* Fail if any non-propagation flags are set */
 	if (type & ~(MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
@@ -2449,18 +2449,18 @@ static int flags_to_propagation_type(int flags)
 /*
  * recursively change the type of the mountpoint.
 */
-static int do_change_type(struct path *path, int flag)
+static int do_change_type(struct path *path, int ms_flags)
 {
 	struct mount *m;
 	struct mount *mnt = real_mount(path->mnt);
-	int recurse = flag & MS_REC;
+	int recurse = ms_flags & MS_REC;
 	int type;
 	int err = 0;
 
 	if (path->dentry != path->mnt->mnt_root)
 		return -EINVAL;
 
-	type = flags_to_propagation_type(flag);
+	type = flags_to_propagation_type(ms_flags);
 	if (!type)
 		return -EINVAL;
 
@@ -2867,8 +2867,8 @@ static int do_check_and_remount_sb(struct s
[Devel] [PATCH RH7 10/14] mount: make {lock,unlock}_mount_hash() static
From: Christian Brauner

The lock_mount_hash() and unlock_mount_hash() helpers are never called
outside a single file.  Remove them from the header and make them static
to reflect this fact.  There's no need to have them callable from other
places right now, as Christoph observed.

Link: https://lore.kernel.org/r/20210121131959.646623-31-christian.brau...@ubuntu.com
Cc: David Howells
Cc: Al Viro
Cc: linux-fsde...@vger.kernel.org
Suggested-by: Christoph Hellwig
Reviewed-by: Christoph Hellwig
Signed-off-by: Christian Brauner

(cherry picked from commit d033cb6784c4f3a19a593cfe11f850e476197388)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov
---
 fs/mount.h     | 10 --
 fs/namespace.c | 10 ++
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index b07b3934b746..f968bb548685 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -132,16 +132,6 @@ static inline void get_mnt_ns(struct mnt_namespace *ns)
 
 extern seqlock_t mount_lock;
 
-static inline void lock_mount_hash(void)
-{
-	write_seqlock(&mount_lock);
-}
-
-static inline void unlock_mount_hash(void)
-{
-	write_sequnlock(&mount_lock);
-}
-
 struct proc_mounts {
 	struct seq_file m;
 	struct mnt_namespace *ns;
diff --git a/fs/namespace.c b/fs/namespace.c
index 52a879628aba..9e58a2ff16ea 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -84,6 +84,16 @@ EXPORT_SYMBOL_GPL(fs_kobj);
  */
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(mount_lock);
 
+static inline void lock_mount_hash(void)
+{
+	write_seqlock(&mount_lock);
+}
+
+static inline void unlock_mount_hash(void)
+{
+	write_sequnlock(&mount_lock);
+}
+
 static inline struct hlist_head *m_hash(struct vfsmount *mnt, struct dentry *dentry)
 {
 	unsigned long tmp = ((unsigned long)mnt / L1_CACHE_BYTES);
--
2.39.2
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH7 09/14] namespace: take lock_mount_hash() directly when changing flags
From: Christian Brauner

Changing mount options always ends up taking lock_mount_hash() but when
MNT_READONLY is requested and neither the mount nor the superblock are
MNT_READONLY we end up taking the lock, dropping it, and retaking it to
change the other mount attributes.  Instead, let's acquire the lock once
when changing the mount attributes.  This simplifies the locking in these
codepaths, makes them easier to reason about and avoids having to
reacquire the lock right after dropping it.

Link: https://lore.kernel.org/r/20210121131959.646623-30-christian.brau...@ubuntu.com
Cc: David Howells
Cc: Al Viro
Cc: linux-fsde...@vger.kernel.org
Reviewed-by: Christoph Hellwig
Signed-off-by: Christian Brauner

(cherry picked from commit 68847c941700475575ced191108971d26e82ae29)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov
---
 fs/namespace.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 1b7b71c46a22..52a879628aba 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -564,7 +564,6 @@ static int mnt_make_readonly(struct mount *mnt)
 {
 	int ret = 0;
 
-	lock_mount_hash();
 	mnt->mnt.mnt_flags |= MNT_WRITE_HOLD;
 	/*
 	 * After storing MNT_WRITE_HOLD, we'll read the counters.
	 * This store
@@ -598,18 +597,9 @@ static int mnt_make_readonly(struct mount *mnt)
 	 */
 	smp_wmb();
 	mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
-	unlock_mount_hash();
 	return ret;
 }
 
-static int __mnt_unmake_readonly(struct mount *mnt)
-{
-	lock_mount_hash();
-	mnt->mnt.mnt_flags &= ~MNT_READONLY;
-	unlock_mount_hash();
-	return 0;
-}
-
 int sb_prepare_remount_readonly(struct super_block *sb)
 {
 	struct mount *mnt;
@@ -2697,7 +2687,8 @@ static int change_mount_ro_state(struct mount *mnt, unsigned int mnt_flags)
 	if (readonly_request)
 		return mnt_make_readonly(mnt);
 
-	return __mnt_unmake_readonly(mnt);
+	mnt->mnt.mnt_flags &= ~MNT_READONLY;
+	return 0;
 }
 
 /*
@@ -2706,11 +2697,9 @@ static int change_mount_ro_state(struct mount *mnt, unsigned int mnt_flags)
  */
 static void set_mount_attributes(struct mount *mnt, unsigned int mnt_flags)
 {
-	lock_mount_hash();
 	mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
 	mnt->mnt.mnt_flags = mnt_flags;
 	touch_mnt_namespace(mnt->mnt_ns);
-	unlock_mount_hash();
 }
 
 /*
@@ -2734,9 +2723,11 @@ static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags)
 		return -EPERM;
 
 	down_write(&sb->s_umount);
+	lock_mount_hash();
 	ret = change_mount_ro_state(mnt, mnt_flags);
 	if (ret == 0)
 		set_mount_attributes(mnt, mnt_flags);
+	unlock_mount_hash();
 	up_write(&sb->s_umount);
 	return ret;
 }
@@ -2958,8 +2949,11 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
 	err = -EPERM;
 	if (ns_capable(sb->s_user_ns, CAP_SYS_ADMIN)) {
 		err = do_check_and_remount_sb(sb, sb_flags, data);
-		if (!err)
+		if (!err) {
+			lock_mount_hash();
 			set_mount_attributes(mnt, mnt_flags);
+			unlock_mount_hash();
+		}
 	}
 	up_write(&sb->s_umount);
 	return err;
[Devel] [PATCH RH7 07/14] vfs: Undo an overly zealous MS_RDONLY -> SB_RDONLY conversion
From: David Howells

In do_mount() when the MS_* flags are being converted to MNT_* flags,
MS_RDONLY got accidentally converted to SB_RDONLY.  Undo this change.

Fixes: e462ec50cb5f ("VFS: Differentiate mount flags (MS_*) from internal superblock flags")
Signed-off-by: David Howells
Signed-off-by: Linus Torvalds

(cherry picked from commit a9e5b73288cf1595ac2e05cf1acd1924ceea05fa)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov
---
 fs/namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 33ccc11cf327..10e294cb8cda 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3570,7 +3570,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
 		mnt_flags |= MNT_NODIRATIME;
 	if (flags & MS_STRICTATIME)
 		mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
-	if (flags & SB_RDONLY)
+	if (flags & MS_RDONLY)
 		mnt_flags |= MNT_READONLY;
 
 	/* The default atime for remount is preservation */
[Devel] [PATCH RH7 01/14] mount: rename do_set_group to do_set_group_old
We have a VZ-only feature that copies mount sharing between mounts via the
mount syscall, added so that CRIU u15-u19 can restore mounts efficiently.
Mainstream now provides a similar feature through the move_mount syscall.
To support both old CRIU and new CRIU (which uses the mainstream API) at
the same time, let's resolve the name collision and keep both variants for
now; several updates later we can drop the old mount-syscall-based API.

https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov
---
 fs/namespace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index fcfe15ed28f2..94f1e308b354 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2939,7 +2939,7 @@ static inline int tree_contains_unbindable(struct mount *mnt)
 	return 0;
 }
 
-static int do_set_group(struct path *path, const char *sibling_name)
+static int do_set_group_old(struct path *path, const char *sibling_name)
 {
 	struct ve_struct *ve = get_exec_env();
 	struct mount *sibling, *mnt;
@@ -3525,7 +3525,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
 	else if (cmd & MS_MOVE)
 		retval = do_move_mount_old(&path, dev_name);
 	else if (cmd & MS_SET_GROUP)
-		retval = do_set_group(&path, dev_name);
+		retval = do_set_group_old(&path, dev_name);
 	else
 		retval = do_new_mount(&path, type_page, flags, mnt_flags,
 				      dev_name, data_page);
[Devel] [PATCH RH7 12/14] fs: split out functions to hold writers
From: Christian Brauner

When a mount is marked read-only we set MNT_WRITE_HOLD on it if there
aren't currently any active writers.  Split this logic out into simple
helpers that we can use in follow-up patches.

Link: https://lore.kernel.org/r/20210121131959.646623-33-christian.brau...@ubuntu.com
Cc: David Howells
Cc: Al Viro
Cc: linux-fsde...@vger.kernel.org
Suggested-by: Christoph Hellwig
Reviewed-by: Christoph Hellwig
Signed-off-by: Christian Brauner

(cherry picked from commit fbdc2f6c40f6528fa0db79c73e844451234f3e26)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov
---
 fs/namespace.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c03e21575d08..a40a217f9871 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -570,10 +570,8 @@ void mnt_drop_write_file(struct file *file)
 }
 EXPORT_SYMBOL(mnt_drop_write_file);
 
-static int mnt_make_readonly(struct mount *mnt)
+static inline int mnt_hold_writers(struct mount *mnt)
 {
-	int ret = 0;
-
 	mnt->mnt.mnt_flags |= MNT_WRITE_HOLD;
 	/*
 	 * After storing MNT_WRITE_HOLD, we'll read the counters. This store
@@ -598,15 +596,29 @@ static int mnt_make_readonly(struct mount *mnt)
 	 * we're counting up here.
 	 */
 	if (mnt_get_writers(mnt) > 0)
-		ret = -EBUSY;
-	else
-		mnt->mnt.mnt_flags |= MNT_READONLY;
+		return -EBUSY;
+
+	return 0;
+}
+
+static inline void mnt_unhold_writers(struct mount *mnt)
+{
 	/*
 	 * MNT_READONLY must become visible before ~MNT_WRITE_HOLD, so writers
 	 * that become unheld will see MNT_READONLY.
 	 */
 	smp_wmb();
 	mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
+}
+
+static int mnt_make_readonly(struct mount *mnt)
+{
+	int ret;
+
+	ret = mnt_hold_writers(mnt);
+	if (!ret)
+		mnt->mnt.mnt_flags |= MNT_READONLY;
+	mnt_unhold_writers(mnt);
+	return ret;
+}