On 08/05/2026 20.07, Dmitry Ilvokhin wrote:
On Fri, May 08, 2026 at 07:40:51PM +0200, Vlastimil Babka (SUSE) wrote:
On 5/8/26 7:38 PM, Vlastimil Babka (SUSE) wrote:
On 5/8/26 7:29 PM, Andrew Morton wrote:
On Fri, 8 May 2026 18:22:06 +0200 [email protected] wrote:
Add tracepoints to the page allocator fast paths that acquire
zone->lock, allowing diagnosis of lock contention in production.
Thanks, I'm surprised we haven't done this yet.
There was a recent attempt [1]. Not being a generic solution wasn't welcome.
[1] https://lore.kernel.org/all/[email protected]/
And this is the generic solution I think?
https://lore.kernel.org/all/[email protected]/
Thanks for cc'ing me, Vlastimil.
Yes, this is an attempt at a generic solution for tracing contended
locks, including spinlocks, so it should also cover the use case
proposed in this patchset.
I'm aware of the generic solution and often use `perf lock contention`,
as well as the libbpf-tools/klockstat tool. My experience is
unfortunately that enabling these tracepoints is prohibitively expensive
on production servers, and production suffers when I run these tools.
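For context, this is roughly how I drive the generic solution today (a
sketch; the tracefs paths match recent kernels, adjust for where tracefs
is mounted on your system):

```
# Enable the generic lock contention tracepoints added in v5.19
echo 1 > /sys/kernel/tracing/events/lock/contention_begin/enable
echo 1 > /sys/kernel/tracing/events/lock/contention_end/enable
cat /sys/kernel/tracing/trace_pipe

# Or sample system-wide for 10 seconds via BPF (-b)
perf lock contention -a -b sleep 10
```

It is these always-on costs, across all locks in the system, that hurt
on loaded production machines.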
I'm very happy to see a patchset adding a contended case. But I worry
that tracing all contended locks in the system is also too much overhead
to have enabled continuously in production.
This patch is carefully constructed to minimize overhead, such that I
can enable it continuously in production to catch issues. If I identify
an issue, I will use the generic tracepoints for further debugging.
In fact, zone->lock contention was one of the primary motivations for
this work.
With the generic solution I'm losing the "zone" and pages "count"
information. I need this information to get the answers I'm looking for.
Specifically, I'm looking at reducing CONFIG_PCP_BATCH_SCALE_MAX, but I
want this to be a data-driven decision (my first principle: if you
cannot measure it, you cannot improve it).
I'm likely going to apply this patch to our production systems, so that
I can make that data-driven decision. I need to deploy it widely enough
to get enough servers experiencing direct reclaim. I'll report back if
people are interested in these learnings.
--Jesper