On 2025-11-21 13:03, Andrew Morton wrote:
On Thu, 20 Nov 2025 16:03:53 -0500 Mathieu Desnoyers 
<[email protected]> wrote:
...
It aims at fixing the per-mm RSS tracking which has become too
inaccurate for OOM killer purposes on large many-core systems [1].

Presentation nit: the info at [1] is rather well hidden until one reads
the [2/2] changelog.  You might want to move that material into the
[0/N] - after all, it's the entire point of the patchset.

Will do.

The hierarchical per-CPU counters propagate a sum approximation through
an N-way tree. When reaching the batch size, the carry is propagated
through a binary tree which consists of logN(nr_cpu_ids) levels. The
batch size for each level is twice the batch size of the prior level.
...

Looks very neat.

I'm glad you like the approach :) When I looked at this problem, a
hierarchical tree seemed like a good candidate to solve the general
case.
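
For instance (made-up numbers, not the exact constants from the patch):
with a per-CPU batch size of 32, a level 1 counter only sees one update
per 32 units counted on a CPU below it, a level 2 counter one update per
64 units accumulated in a level 1 counter, and so on. Contention drops
off quickly as we move up the tree, while the un-propagated residue held
in each counter (and hence the inaccuracy of the root sum) stays
bounded.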


Have you identified other parts of the kernel which could use this?

I have not surveyed other possible users yet. At first glance, users
of percpu_counter_read_positive() (from linux/percpu_counter.h) may
be good candidates for the hierarchical tree. Here is a list:

lib/flex_proportions.c
157:            num = percpu_counter_read_positive(&pl->events);
158:            den = percpu_counter_read_positive(&p->events);

fs/file_table.c
88:     return percpu_counter_read_positive(&nr_files);

fs/ext4/balloc.c
627:    free_clusters  = percpu_counter_read_positive(fcc);
628:    dirty_clusters = percpu_counter_read_positive(dcc);

fs/ext4/ialloc.c
448:    freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
450:    freec = percpu_counter_read_positive(&sbi->s_freeclusters_counter);
453:    ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);

fs/ext4/extents_status.c
1682:   nr = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
1694:   ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
1699:   ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);

fs/ext4/inode.c
3091:           percpu_counter_read_positive(&sbi->s_freeclusters_counter);
3093:           percpu_counter_read_positive(&sbi->s_dirtyclusters_counter);

fs/btrfs/space-info.c
1028:   ordered = percpu_counter_read_positive(&fs_info->ordered_bytes) >> 1;
1029:   delalloc = percpu_counter_read_positive(&fs_info->delalloc_bytes);

fs/quota/dquot.c
808:    percpu_counter_read_positive(&dqstats.counter[DQST_FREE_DQUOTS]));

fs/ext2/balloc.c
1162:   free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);

fs/ext2/ialloc.c
268:    freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
270:    free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
272:    ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);

fs/jbd2/journal.c
1263:   count = percpu_counter_read_positive(&journal->j_checkpoint_jh_count);
1268:   count = percpu_counter_read_positive(&journal->j_checkpoint_jh_count);
1287:   count = percpu_counter_read_positive(&journal->j_checkpoint_jh_count);

fs/xfs/xfs_ioctl.c
1149:           .allocino = percpu_counter_read_positive(&mp->m_icount),
1150:           .freeino  = percpu_counter_read_positive(&mp->m_ifree),

fs/xfs/xfs_mount.h
736:    return percpu_counter_read_positive(&mp->m_free[ctr].count);

fs/xfs/libxfs/xfs_ialloc.c
732:        percpu_counter_read_positive(&args.mp->m_icount) + newlen >
1913:    * Read rough value of mp->m_icount by percpu_counter_read_positive,
1917:       percpu_counter_read_positive(&mp->m_icount) + igeo->ialloc_inos

mm/util.c
965:    if (percpu_counter_read_positive(&vm_committed_as) < allowed)

drivers/md/dm-crypt.c
1734:                   if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) +
2750:    * Note, percpu_counter_read_positive() may over (and under) estimate
2754:   if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) >= dm_crypt_pages_per_client) &&

io_uring/io_uring.c
2502:   return percpu_counter_read_positive(&tctx->inflight);

include/linux/backing-dev.h
71:     return percpu_counter_read_positive(&wb->stat[item]);

include/net/sock.h
1435:   return percpu_counter_read_positive(sk->sk_prot->sockets_allocated);

include/net/dst_ops.h
48:     return percpu_counter_read_positive(&dst->pcpuc_entries);

There are probably other use-cases lurking out there. Anything that
needs to sort or compare resources consumed by objects used from many
CPUs concurrently in order to make decisions is a good candidate.

The fact that the maximum inaccuracy is bounded means that we can
perform approximate comparisons in a fast path (which is enough when
the values being compared differ widely), and only fall back to the
more expensive "precise" comparison when the values are close to each
other and we actually care about their relative order.

I'm thinking memory usage tracking, runtime usage (scheduler?), cgroup
resource accounting.
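
As a sketch of that fast-path/slow-path comparison pattern, something
like this could work (percpu_counter_tree_precise_sum() and
percpu_counter_tree_inaccuracy() are placeholder names here for the
precise read and the worst-case error bound, not necessarily the final
API):

static int resource_cmp(struct percpu_counter_tree *a,
                        struct percpu_counter_tree *b)
{
        int va = percpu_counter_tree_approximate_sum(a);
        int vb = percpu_counter_tree_approximate_sum(b);
        unsigned int slack = percpu_counter_tree_inaccuracy(a) +
                             percpu_counter_tree_inaccuracy(b);

        /* Fast path: approximations differ by more than the error bound. */
        if (va > vb + (int)slack)
                return 1;
        if (vb > va + (int)slack)
                return -1;

        /* Slow path: values are close, pay for the precise sums. */
        va = percpu_counter_tree_precise_sum(a);
        vb = percpu_counter_tree_precise_sum(b);
        if (va < vb)
                return -1;
        return va != vb;
}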


  include/linux/percpu_counter_tree.h | 239 +++++++++++++++
  init/main.c                         |   2 +
  lib/Makefile                        |   1 +
  lib/percpu_counter_tree.c           | 443 ++++++++++++++++++++++++++++
  4 files changed, 685 insertions(+)
  create mode 100644 include/linux/percpu_counter_tree.h
  create mode 100644 lib/percpu_counter_tree.c

An in-kernel test suite would be great.  Like lib/*test*.c or
tools/testing/.

I'll keep a note to port the tests from my userspace librseq percpu
counters feature branch to the kernel. I did not do it initially
because I first wanted to see whether the overall approach would be
deemed interesting for the kernel.
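
For the record, a rough sketch of what a KUnit version of such a test
could look like (the init/destroy signatures and
percpu_counter_tree_inaccuracy() below are illustrative guesses, not
the API from this series):

#include <kunit/test.h>
#include <linux/percpu_counter_tree.h>

static void test_basic_sum(struct kunit *test)
{
        struct percpu_counter_tree counter;
        int i;

        /* Guessed init signature: power-of-2 batch size + GFP flags. */
        KUNIT_ASSERT_EQ(test,
                        percpu_counter_tree_init(&counter, 32, GFP_KERNEL), 0);

        for (i = 0; i < 1000; i++)
                percpu_counter_tree_add(&counter, 1);

        /* The approximate sum must stay within the error bound of 1000. */
        KUNIT_EXPECT_LE(test,
                        abs(percpu_counter_tree_approximate_sum(&counter) - 1000),
                        (int)percpu_counter_tree_inaccuracy(&counter));

        percpu_counter_tree_destroy(&counter);
}

static struct kunit_case percpu_counter_tree_cases[] = {
        KUNIT_CASE(test_basic_sum),
        {}
};

static struct kunit_suite percpu_counter_tree_suite = {
        .name = "percpu_counter_tree",
        .test_cases = percpu_counter_tree_cases,
};
kunit_test_suite(percpu_counter_tree_suite);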

+struct percpu_counter_tree {
...
+};

I find that understanding the data structure leads to understanding the
code, so additional documentation for the various fields would be
helpful.

Will do.


+
+static inline
+int percpu_counter_tree_carry(int orig, int res, int inc, unsigned int bit_mask)
+{
...
+}
+
+static inline
+void percpu_counter_tree_add(struct percpu_counter_tree *counter, int inc)
+{
...
+}

(x86-64 asm below when expanding the inline functions into a real function)

00000000000006a0 <disasm_percpu_counter_tree_add>:
[...]
 6a4:   8b 4f 08                mov    0x8(%rdi),%ecx
 6a7:   48 8b 17                mov    (%rdi),%rdx
 6aa:   89 f0                   mov    %esi,%eax
 6ac:   65 0f c1 02             xadd   %eax,%gs:(%rdx)
 6b0:   41 89 c8                mov    %ecx,%r8d
 6b3:   8d 14 06                lea    (%rsi,%rax,1),%edx
 6b6:   41 f7 d8                neg    %r8d
 6b9:   85 f6                   test   %esi,%esi
 6bb:   78 18                   js     6d5 <disasm_percpu_counter_tree_add+0x35>
 6bd:   44 21 c6                and    %r8d,%esi
 6c0:   31 f0                   xor    %esi,%eax
 6c2:   31 d0                   xor    %edx,%eax
 6c4:   8d 14 31                lea    (%rcx,%rsi,1),%edx
 6c7:   85 c8                   test   %ecx,%eax
 6c9:   0f 45 f2                cmovne %edx,%esi
 6cc:   85 f6                   test   %esi,%esi
 6ce:   75 1d                   jne    6ed <disasm_percpu_counter_tree_add+0x4d>
 6d0:   e9 00 00 00 00          jmp    6d5 <disasm_percpu_counter_tree_add+0x35>
 6d5:   f7 de                   neg    %esi
 6d7:   44 21 c6                and    %r8d,%esi
 6da:   f7 de                   neg    %esi
 6dc:   31 f0                   xor    %esi,%eax
 6de:   31 d0                   xor    %edx,%eax
 6e0:   89 f2                   mov    %esi,%edx
 6e2:   29 ca                   sub    %ecx,%edx
 6e4:   85 c8                   test   %ecx,%eax
 6e6:   0f 45 f2                cmovne %edx,%esi
 6e9:   85 f6                   test   %esi,%esi
 6eb:   74 e3                   je     6d0 <disasm_percpu_counter_tree_add+0x30>
 6ed:   e9 ae fb ff ff          jmp    2a0 <percpu_counter_tree_add_slowpath>

+
+static inline
+int percpu_counter_tree_approximate_sum(struct percpu_counter_tree *counter)
+{
...
+}

0000000000000710 <disasm_percpu_counter_tree_approximate_sum>:
[...]
 714:   48 8b 47 10             mov    0x10(%rdi),%rax
 718:   8b 10                   mov    (%rax),%edx
 71a:   8b 47 18                mov    0x18(%rdi),%eax
 71d:   01 d0                   add    %edx,%eax
 71f:   e9 00 00 00 00          jmp    724 <disasm_percpu_counter_tree_approximate_sum+0x14>


These are pretty large after all the nested inlining is expanded.  Are
you sure that inlining them is the correct call?

percpu_counter_tree_add: 30 insn, 79 bytes
percpu_counter_tree_approximate_sum: 5 insn, 16 bytes

So I agree that inlining "add" may not be the right call: the bitwise
operations used to compute the carry have a relatively large footprint,
and the function already contains a xadd, which is ~9 cycles on recent
CPUs, so the added overhead of a function call should not be
significant.

Inlining "approximate_sum" seems to be the right call, because the
function is really small (16 bytes), and it only contains loads and
a "add", which are faster than a function call.



+#else  /* !CONFIG_SMP */
+

...

+#include <linux/percpu_counter_tree.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/atomic.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/math.h>
+
+#define MAX_NR_LEVELS 5
+
+struct counter_config {
+       unsigned int nr_items;
+       unsigned char nr_levels;
+       unsigned char n_arity_order[MAX_NR_LEVELS];
+};
+
+/*
+ * nr_items is the number of items in the tree for levels 1 to and
+ * including the final level (approximate sum). It excludes the level 0
+ * per-cpu counters.
+ */

That's referring to counter_config.nr_items?  Comment appears to be
misplaced.

Good point, I'll move it beside the "nr_items" field in counter_config.
I'll also document the other fields.
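
Something like this (the comments on the last two fields are my best
guesses and will be double-checked against the code):

struct counter_config {
        /*
         * Number of items in the tree for levels 1 to and including the
         * final level (approximate sum). Excludes the level 0 per-cpu
         * counters.
         */
        unsigned int nr_items;
        /* Number of tree levels above the per-cpu counters. */
        unsigned char nr_levels;
        /* log2 of the fan-out (N-arity) at each level. */
        unsigned char n_arity_order[MAX_NR_LEVELS];
};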

...
+       /* Batch size must be power of 2 */
+       if (!batch_size || (batch_size & (batch_size - 1)))
+               return -EINVAL;

It's a bug, yes?  Worth a WARN?

I'll add a WARN_ON().
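
i.e. something like:

        /* Batch size must be a power of 2; anything else is a caller bug. */
        if (WARN_ON(!batch_size || (batch_size & (batch_size - 1))))
                return -EINVAL;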

...
It would be nice to kerneldocify the exported API.

OK
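
For instance, for the add path (wording illustrative):

/**
 * percpu_counter_tree_add - add a value to a hierarchical per-cpu counter
 * @counter: counter tree to update
 * @inc: signed value to add
 *
 * The common case is a single per-cpu update. When a level's batch size
 * is reached, the carry is propagated to the next level of the tree.
 */
void percpu_counter_tree_add(struct percpu_counter_tree *counter, int inc);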


Some fairly detailed words explaining the pros and cons of precise vs
approximate would be helpful to people who are using this API.

Good point.

+int __init percpu_counter_tree_subsystem_init(void)

I'm not sure that the "subsystem_" adds any value.

The symbol "percpu_counter_tree_init()" is already taken for
initialization of individual counter trees, so I added "subsystem"
here to distinguish between the two symbols. Looking at the kernel
bootup functions, "subsystem_init" is already used at least once:

1108:   acpi_subsystem_init();


+{
+
+       nr_cpus_order = get_count_order(nr_cpu_ids);

Stray newline.

OK

Thanks for the review!

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
