On Wed, Mar 25, 2026 at 9:47 AM Waiman Long <[email protected]> wrote:
>
> On 3/23/26 8:15 PM, Yosry Ahmed wrote:
> > On Mon, Mar 23, 2026 at 5:46 AM Li Wang <[email protected]> wrote:
> >> On Fri, Mar 20, 2026 at 04:42:35PM -0400, Waiman Long wrote:
> >>> The vmstats flush threshold currently increases linearly with the
> >>> number of online CPUs. As the number of CPUs increases over time, it
> >>> will become increasingly difficult to meet the threshold and update the
> >>> vmstats data in a timely manner. These days, systems with hundreds of
> >>> CPUs or even thousands of them are becoming more common.
> >>>
> >>> For example, the test_memcg_sock test of test_memcontrol always fails
> >>> when running on an arm64 system with 128 CPUs. This is because the
> >>> threshold is now 64 * 128 = 8192 pages. With a 4k page size, that
> >>> means 32 MB worth of memory changes are needed to trigger a flush.
> >>> It will be even worse with a larger page size like 64k.
> >>>
> >>> To make the output of memory.stat more accurate, it is better to
> >>> scale the threshold up more slowly than linearly with the number of
> >>> CPUs. The int_sqrt() function is a good compromise, as suggested by
> >>> Li Wang [1]. An extra 2 is added to make sure that the threshold
> >>> doubles for a 2-core system; growth is slower after that.
> >>>
> >>> With the int_sqrt() scaling, we can use the possibly larger
> >>> num_possible_cpus() instead of num_online_cpus(), which may change
> >>> at run time.
> >>>
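[Interjecting for anyone following along: the difference in growth rate can be sketched in a few lines of userspace Python. This assumes the new threshold is MEMCG_CHARGE_BATCH * int_sqrt(ncpus + 2), which is one plausible reading of the "extra 2" above; the exact form is in the patch itself.]

```python
from math import isqrt  # equivalent of the kernel's int_sqrt()

MEMCG_CHARGE_BATCH = 64  # per-CPU batch size assumed from the 64*128 example

def linear_threshold(ncpus):
    # Current behavior: threshold grows linearly with CPU count.
    return MEMCG_CHARGE_BATCH * ncpus

def sqrt_threshold(ncpus):
    # Proposed behavior (assumed form): int_sqrt() scaling, with 2 added
    # so that a 2-CPU system gets double the 64-page baseline.
    return MEMCG_CHARGE_BATCH * isqrt(ncpus + 2)

for n in (1, 2, 128, 1024):
    print(n, linear_threshold(n), sqrt_threshold(n))
```

For the 128-CPU arm64 case above, this drops the threshold from 8192 pages (32 MB at 4k pages) to 704 pages, which is why the test stops failing.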
> >>> Although there is supposed to be a periodic asynchronous flush of
> >>> vmstats every 2 seconds, the actual time lag between successive
> >>> runs can vary quite a bit. In fact, I have seen lags of tens of
> >>> seconds in some cases, so we can't rely too heavily on an
> >>> asynchronous vmstats flush happening every 2 seconds. This may be
> >>> something we need to look into.
> >>>
> >>> [1] https://lore.kernel.org/lkml/[email protected]/
> >>>
> >>> Suggested-by: Li Wang <[email protected]>
> >>> Signed-off-by: Waiman Long <[email protected]>
> > What's the motivation for this fix? Is it purely to make tests more
> > reliable on systems with larger page sizes?
> >
> > We need some performance tests to make sure we're not flushing too
> > eagerly with the sqrt scale imo. We need to make sure that when we
> > have a lot of cgroups and a lot of flushers we don't end up performing
> > worse.
>
> I will include some performance data in the next version. Do you have
> any suggestions on which readily available tests I can use for this
> performance testing?

I am not sure what readily available tests can stress this. In the
past, I wrote a synthetic workload that spawns a lot of userspace
readers of memory.stat, as well as reclaimers, to trigger flushing
from both the kernel and userspace, with a large number of cgroups.
Unfortunately I don't have that lying around anymore.
