On 2026-01-14 11:41, Michal Hocko wrote:
> One thing you should probably mention here is the memory consumption of the structure.
Good point. The most important parts are the per-cpu counters and the tree items that propagate the carry.

In the proposed implementation, the per-cpu counters are allocated within per-cpu data structures, so they end up using:

  nr_possible_cpus * sizeof(unsigned long)

In addition, the tree items are appended at the end of the mm_struct. The number of items is defined by the "nr_items" field of the per_nr_cpu_order_config table, and each item is aligned on the cacheline size (typically 64 bytes) to minimize false sharing.

Here is the footprint for a few nr_cpus values on a 64-bit arch:

nr_cpus  percpu counters (bytes)  nr_items  items size (bytes)  total (bytes)
      2                       16         1                  64             80
      4                       32         3                 192            224
      8                       64         7                 448            512
     64                      512        21                1344           1856
    128                     1024        21                1344           2368
    256                     2048        37                2368           4416
    512                     4096        73                4672           8768

There are of course various trade-offs we can make here. We can:

* Increase the n-arity of the intermediate items to shrink the nr_items required for a given nr_cpus. This increases carry-propagation contention, because more cores share each intermediate item.

* Remove the cacheline alignment of the intermediate tree items. This shrinks the memory needed for the tree items, but increases false sharing.

* Represent the intermediate tree items as a byte rather than a long. This further reduces the memory required for the intermediate tree items, but further increases false sharing.

* Represent the per-cpu counters as bytes rather than longs. This makes the "sum" operation trickier, because it needs to iterate over the intermediate carry-propagation nodes as well and synchronize with concurrent "tree add" operations. It further reduces memory use.

* Implement a custom strided allocator for the intermediate-item carry-propagation bytes. This shares cachelines across different tree instances while keeping good locality: all accesses from a given location in the machine topology touch the same cacheline for the various tree instances. This adds complexity, but provides compactness as well as minimal false sharing.

Compared to this, the upstream percpu counters use a 32-bit integer per cpu (4 bytes) and accumulate into a 64-bit global value.

So yes, the current hpcc implementation adds an extra memory footprint, but if that becomes an issue we have various options to reduce it.

Is it OK if I add this discussion to the commit message, or should it also go into the high-level design doc within Documentation/core-api/percpu-counter-tree.rst ?

I have appended two rough sketches below my signature: one reproducing the footprint arithmetic from the table above, and one illustrating the item-alignment trade-off.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
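
First, a small userspace C sketch (my own illustration, not kernel code) that reproduces the footprint arithmetic behind the table above. The nr_items values are simply copied from the table; in the actual implementation they come from the per_nr_cpu_order_config table, and the 64-byte cacheline size is the "typical" value assumed above:

/*
 * Rough footprint sketch: per-cpu counters cost one unsigned long per
 * possible CPU, and each intermediate tree item is padded to a full
 * cacheline. Hypothetical names; nr_items values copied from the table.
 */
#include <stdio.h>

#define CACHELINE_SIZE 64 /* typical cacheline size assumed above */

static const struct { unsigned int nr_cpus, nr_items; } configs[] = {
	{ 2, 1 }, { 4, 3 }, { 8, 7 }, { 64, 21 },
	{ 128, 21 }, { 256, 37 }, { 512, 73 },
};

int main(void)
{
	for (size_t i = 0; i < sizeof(configs) / sizeof(configs[0]); i++) {
		unsigned int nr_cpus = configs[i].nr_cpus;
		unsigned int nr_items = configs[i].nr_items;
		/* One unsigned long counter per possible CPU. */
		unsigned long percpu = nr_cpus * sizeof(unsigned long);
		/* Each intermediate item is padded to a full cacheline. */
		unsigned long items = nr_items * CACHELINE_SIZE;

		printf("%4u cpus: %5lu (percpu) + %5lu (items) = %5lu bytes\n",
		       nr_cpus, percpu, items, percpu + items);
	}
	return 0;
}

Built and run on a 64-bit Linux machine (where sizeof(unsigned long) is 8), it prints the same totals as the table.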

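Second, a purely hypothetical layout sketch (plain C, not the actual hpcc structures) contrasting a cacheline-padded intermediate item with a packed byte-sized item, to make the memory vs. false-sharing trade-off listed above concrete:

/*
 * Hypothetical item layouts illustrating the trade-off; field and struct
 * names are made up for the example.
 */
#include <stdio.h>
#include <stdalign.h>

/*
 * Current scheme: one long per item, padded to a full 64-byte cacheline,
 * so concurrent carry propagation on neighbouring items never shares a
 * cacheline.
 */
struct item_padded {
	alignas(64) unsigned long carry;
};

/*
 * Compact alternative: byte-sized items packed back to back. With 64-byte
 * cachelines, 64 items share one line, so carry updates on neighbouring
 * items from different CPUs hit the same cacheline (false sharing).
 */
struct item_packed {
	unsigned char carry;
};

int main(void)
{
	printf("padded item: %zu bytes, packed item: %zu byte\n",
	       sizeof(struct item_padded), sizeof(struct item_packed));
	return 0;
}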