On 2026-01-16 16:51, Michal Hocko wrote:
> On Wed 14-01-26 14:19:38, Mathieu Desnoyers wrote:
>> On 2026-01-14 11:41, Michal Hocko wrote:
>>>
>>> One thing you should probably mention here is the memory consumption
>>> of the structure.
>>
>> Good point.
>>
>> The most important parts are the per-cpu counters and the tree items
>> which propagate the carry.

>> In the proposed implementation, the per-cpu counters are allocated
>> within per-cpu data structures, so they end up using:
>>
>>    nr_possible_cpus * sizeof(unsigned long)
>>
>> In addition, the tree items are appended at the end of the mm_struct.
>> The size of those items is defined by the "nr_items" field of the
>> per_nr_cpu_order_config table.
>>
>> Each item is aligned on the cacheline size (typically 64 bytes) to
>> minimize false sharing.

>> Here is the footprint for a few values of nr_cpus on a 64-bit arch:
>>
>> nr_cpus   percpu counters (bytes)   nr_items   items size (bytes)   total (bytes)
>>       2                        16          1                   64              80
>>       4                        32          3                  192             224
>>       8                        64          7                  448             512
>>      64                       512         21                 1344            1856
>>     128                      1024         21                 1344            2368
>>     256                      2048         37                 2368            4416
>>     512                      4096         73                 4672            8768

> I assume this is nr_possible_cpus not NR_CPUS, right?

More precisely, this is nr_cpu_ids, at least for the nr_items.

The percpu counters are effectively allocated for nr_possible_cpus, but
the internal items need to be allocated for nr_cpu_ids (based on the
maximum limits a cpumask would need). To keep the table easy to
understand, I will use nr_cpu_ids for the first column.

I'll update the commit message.


>> There are of course various trade-offs we can make here. We can:
>>
>> * Increase the n-arity of the intermediate items to shrink the
>>   nr_items required for a given nr_cpus. This increases contention,
>>   since each item then propagates carries for more cores.
>>
>> * Remove the cacheline alignment of intermediate tree items. This
>>   shrinks the memory needed for tree items, but increases false
>>   sharing.
>>
>> * Represent intermediate tree items on a byte rather than a long.
>>   This further reduces the memory required for intermediate tree
>>   items, but further increases false sharing.
>>
>> * Represent per-cpu counters on bytes rather than longs. This makes
>>   the "sum" operation trickier, because it needs to iterate over the
>>   intermediate carry propagation nodes as well and synchronize with
>>   ongoing "tree add" operations. It further reduces memory use.
>>
>> * Implement a custom strided allocator for the carry propagation
>>   bytes of intermediate items. This shares cachelines across
>>   different tree instances while keeping good locality: all accesses
>>   from a given location in the machine topology touch the same
>>   cacheline for the various tree instances. This adds complexity,
>>   but provides compactness as well as minimal false sharing.

>> Compared to this, the upstream percpu counters use a 32-bit integer
>> per CPU (4 bytes) and accumulate into a 64-bit global value.

>> So yes, there is an extra memory footprint added by the current hpcc
>> implementation, but if it's an issue we have various options to
>> consider to reduce it.
>>
>> Is it OK if I add this discussion to the commit message, or should it
>> also be added to the high-level design doc within
>> Documentation/core-api/percpu-counter-tree.rst ?

> I would mention them in both changelog and the documentation.


OK, will do for v17.

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
