On Thu, Mar 19, 2026 at 01:37:46PM -0400, Waiman Long wrote:
> The vmstats flush threshold currently increases linearly with the
> number of online CPUs. As the number of CPUs increases over time, it
> will become increasingly difficult to meet the threshold and update the
> vmstats data in a timely manner. These days, systems with hundreds of
> CPUs or even thousands of them are becoming more common.
>
> For example, the test_memcg_sock test of test_memcontrol always fails
> when running on an arm64 system with 128 CPUs. It is because the
> threshold is now 64*128 = 8192. With 4k page size, it needs changes in
> 32 MB of memory. It will be even worse with a larger page size like 64k.
>
> To make the output of memory.stat more accurate, it is better to
> scale up the threshold logarithmically instead of linearly with the
> number of CPUs. With the log2 scale, we can use the possibly larger
> num_possible_cpus() instead of num_online_cpus(), which may change at
> run time.
>
> Although there is supposed to be a periodic and asynchronous flush of
> vmstats every 2 seconds, the actual time lag between successive runs
> can vary quite a bit. In fact, I have seen time lags of up to tens of
> seconds in some cases. So we cannot rely too much on the hope that
> there will be an asynchronous vmstats flush every 2 seconds. This
> may be something we need to look into.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
>  mm/memcontrol.c | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 772bac21d155..8d4ede72f05c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -548,20 +548,20 @@ struct memcg_vmstats {
>   * rstat update tree grow unbounded.
>   *
>   * 2) Flush the stats synchronously on reader side only when there are more than
> - *    (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
> - *    will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but
> - *    only for 2 seconds due to (1).
> + *    (MEMCG_CHARGE_BATCH * (ilog2(nr_cpus) + 1)) update events. Though this
> + *    optimization will let stats be out of sync by up to that amount but only
> + *    for 2 seconds due to (1).
>   */
>  static void flush_memcg_stats_dwork(struct work_struct *w);
>  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
>  static u64 flush_last_time;
> +static int vmstats_flush_threshold __ro_after_init;
>
>  #define FLUSH_TIME (2UL*HZ)
>
>  static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
>  {
> -	return atomic_read(&vmstats->stats_updates) >
> -		MEMCG_CHARGE_BATCH * num_online_cpus();
> +	return atomic_read(&vmstats->stats_updates) > vmstats_flush_threshold;
>  }
>
>  static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val,
> @@ -5191,6 +5191,13 @@ int __init mem_cgroup_init(void)
>
>  	memcg_pn_cachep = KMEM_CACHE(mem_cgroup_per_node,
>  				     SLAB_PANIC | SLAB_HWCACHE_ALIGN);
> +	/*
> +	 * Logarithmically scale up vmstats flush threshold with the number
> +	 * of CPUs.
> +	 * N.B. ilog2(1) = 0.
> +	 */
> +	vmstats_flush_threshold = MEMCG_CHARGE_BATCH *
> +				  (ilog2(num_possible_cpus()) + 1);
Changing the threshold from linear to logarithmic scaling looks smarter, but
my concern is that, on large systems (hundreds or thousands of CPUs), the
threshold drops dramatically. For example, with 1024 CPUs it goes from 65536
update events (256 MB with 4k pages) to only 704 (~2.75 MB), almost a 100x
reduction. Could this potentially raise a performance issue when memory.stat
is read frequently on a heavily loaded system?

Maybe go with MEMCG_CHARGE_BATCH * int_sqrt(num_possible_cpus()), which sits
between linear and log2?

-- 
Regards,
Li Wang

