On Tue, Jul 16, 2024 at 1:42 AM Dhananjay Ugwekar
<[email protected]> wrote:
>
> Hello Ian,
>
> On 7/15/2024 8:52 PM, Ian Rogers wrote:
> > On Mon, Jul 15, 2024 at 2:36 AM Dhananjay Ugwekar
> > <[email protected]> wrote:
> >>
> >> Hello Ian,
> >>
> >> On 7/12/2024 3:53 AM, Ian Rogers wrote:
> >>> On Thu, Jul 11, 2024 at 3:25 AM Dhananjay Ugwekar
> >>> <[email protected]> wrote:
> >>>>
> >>>> Currently the energy-cores event in the power PMU aggregates energy
> >>>> consumption data at a package level. On the other hand the core energy
> >>>> RAPL counter in AMD CPUs has a core scope (which means the energy
> >>>> consumption is recorded separately for each core). Earlier efforts to add
> >>>> the core event in the power PMU had failed [1], due to the difference in
> >>>> the scope of these two events. Hence, there is a need for a new
> >>>> core-scope PMU.
> >>>>
> >>>> This patchset adds a new "power_per_core" PMU alongside the existing
> >>>> "power" PMU, which will be responsible for collecting the new
> >>>> "energy-per-core" event.
> >>>
> >>> Sorry for being naive, is the only reason for adding the new PMU for
> >>> the sake of the cpumask? Perhaps we can add per event cpumasks like
> >>> say `/sys/devices/power/events/energy-per-core.cpumask` which contains
> >>> the CPUs of the different cores in this case. There's supporting
> >>> hotplugging CPUs as an issue. Adding the tool support for this
> >>> wouldn't be hard and it may be less messy (although old perf tools on
> >>> new kernels wouldn't know about these files).
> >>
> >> I went over the two approaches and below are my thoughts,
> >>
> >> New PMU approach:
> >> Pros
> >> * It will work with older perf tools, hence these patches can be
> >> backported to an older kernel and the new per-core event will work
> >> there as well.
> >> Cons
> >> * More code changes in rapl.c
> >>
> >> Event specific cpumask approach:
> >> Pros
> >> * It might be easier to add different-scope events within the same PMU
> >> in the future (although currently I'm not able to find such a use case,
> >> apart from the RAPL pkg and core energy counters)
> >> Cons
> >> * Both a new kernel and a new perf tool will be required to use the new
> >> per-core event.
> >>
> >> I feel that while the event-specific cpumask is a viable alternative to
> >> the new PMU addition approach, I don't see any clear pros to selecting it
> >> over the current approach. Please let me know if you have any
> >> design-related concerns about the addition of a new PMU, or if your
> >> concern is mostly about the amount of code changes in this approach.
> >
> > Thanks Dhananjay, and thanks for taking the time for an objective
> > discussion on the mailing list. I'm very supportive of seeing the work
> > you are enabling land.
> >
> > My concern comes from the tool side. If every PMU starts to have
> > variants for the sake of the cpumask, what does this mean for
> > aggregation in the perf tool? There is another issue around event
> > grouping, you can't group events across PMUs, but my feeling is that
> > event grouping needs to be rethought. By default the power_per_core
> > events are going to be aggregated together by the perf tool, which
> > then loses their per_core-ness.
>
> Yea right, maybe we need to fix this behavior.
>
> >
> > I was trying to think of the same problem but in other PMUs. One
> > thought I had was the difference between hyperthread and core events.
> > At least on Intel, some events can only count for the whole core not
> > per hyperthread. The events don't have a cpu_per_core PMU, they just
> > use the regular cpu one, and so the cpumask is set to all online
> > hyperthreads. When a per-core event is programmed it will get
> > programmed on every hyperthread and so counted twice for the core.
> > This at the least wastes a counter, but it probably also yields twice
> > the expected count as every event is counted twice then aggregated. So
> > this is just wrong and the user is expected to be smart and fix it
> > (checking the x86 events, there is a convention to use a ".ALL" or
> > ".ANY" suffix for core-wide events, iirc). If we had a cpumask for
> > these events then we could avoid the double setting, free up a counter
> > and avoid double counting. Were we to fix things the way it is done in
> > this patch series we'd add another PMU.
>
> Yes, this seems like a valid use case for event-specific cpumasks.
>
> >
> > My feeling is that in the longer term a per event cpumask looks
> > cleaner. I think either way you need a new kernel for the new RAPL
> > events. The problem of an old perf tool with a new kernel doesn't
> > normally arise with distributions, as they match the perf tool to the
> > kernel version, needlessly losing features and fixes along the way. If
> > the new PMU is going to be backported through fixes.. then we can do
> > similar for reading the per-event cpumask. I'd be tempted not
> > to do this and focus on the next LTS kernel, getting the kernel and
> > tool fixes in as necessary.
>
> Makes sense. Even though this approach will require more effort, it seems
> worthwhile, as it would help things down the line (making it easier to have
> heterogeneous-scope events within a PMU). I'll need to go through the perf
> tool to see how we can design this. I'll get back with an RFC series once
> I have an initial design in mind.

Hi Dhananjay,

I can have a go at the perf tool side of this - I probably know my way
around the code best. Basically we need to do something similar to
how other "<event>.<setting>" values are parsed:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.c?h=perf-tools-next#n335
The cpumask handling in the perf tool is a little weird as there is a
summary value in the evlist. Anyway, I can do that if you want to spin
the RAPL/power PMU side of things.

Thanks,
Ian

> Thanks,
> Dhananjay
>
> >
> > Thanks,
> > Ian
> >
> >
> >> Thanks,
> >> Dhananjay
> >>
> >>>
> >>> Thanks,
> >>> Ian
> >>>
> >>>
> >>>> Tested the package level and core level PMU counters with workloads
> >>>> pinned to different CPUs.
> >>>>
> >>>> Results with workload pinned to CPU 1 in Core 1 on an AMD Zen4 Genoa
> >>>> machine:
> >>>>
> >>>> $ perf stat -a --per-core -e power_per_core/energy-per-core/ -- sleep 1
> >>>>
> >>>>  Performance counter stats for 'system wide':
> >>>>
> >>>> S0-D0-C0         1          0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C1         1          5.72 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C2         1          0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C3         1          0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C4         1          0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C5         1          0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C6         1          0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C7         1          0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C8         1          0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C9         1          0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C10        1          0.02 Joules power_per_core/energy-per-core/
> >>>>
> >>>> [1]: 
> >>>> https://lore.kernel.org/lkml/[email protected]/
> >>>>
> >>>> This patchset applies cleanly on top of v6.10-rc7 as well as latest
> >>>> tip/master.
> >>>>
> >>>> v4 changes:
> >>>> * Add patch 11 which removes the unused function cpu_to_rapl_pmu()
> >>>> * Add Rui's rb tag for patch 1
> >>>> * Invert the pmu scope check logic in patch 2 (Peter)
> >>>> * Add comments explaining the scope check in patch 2 (Peter)
> >>>> * Use cpumask_var_t instead of cpumask_t in patch 5 (Peter)
> >>>> * Move renaming code to patch 8 (Rui)
> >>>> * Reorder the cleanup order of per-core and per-pkg PMU in patch 10 (Rui)
> >>>> * Add rapl_core_hw_unit variable to store the per-core PMU unit in patch
> >>>>   10 (Rui)
> >>>>
> >>>> PS: Scope check logic is still kept the same (i.e., all Intel systems 
> >>>> being
> >>>> considered as die scope), Rui will be modifying it to limit the die-scope
> >>>> only to Cascadelake-AP in a future patch on top of this patchset.
> >>>>
> >>>> v3 changes:
> >>>> * Patch 1 added to introduce the logical_core_id which is unique across
> >>>>   the system (Prateek)
> >>>> * Use the unique topology_logical_core_id() instead of
> >>>>   topology_core_id() (which is only unique within a package on tested
> >>>>   AMD and Intel systems) in Patch 10
> >>>>
> >>>> v2 changes:
> >>>> * Patches 6,7,8 added to split some changes out of the last patch
> >>>> * Use container_of to get the rapl_pmus from event variable (Rui)
> >>>> * Set PERF_EV_CAP_READ_ACTIVE_PKG flag only for pkg scope PMU (Rui)
> >>>> * Use event id 0x1 for energy-per-core event (Rui)
> >>>> * Use PERF_RAPL_PER_CORE bit instead of adding a new flag to check for
> >>>>   per-core counter hw support (Rui)
> >>>>
> >>>> Dhananjay Ugwekar (10):
> >>>>   perf/x86/rapl: Fix the energy-pkg event for AMD CPUs
> >>>>   perf/x86/rapl: Rename rapl_pmu variables
> >>>>   perf/x86/rapl: Make rapl_model struct global
> >>>>   perf/x86/rapl: Move cpumask variable to rapl_pmus struct
> >>>>   perf/x86/rapl: Add wrapper for online/offline functions
> >>>>   perf/x86/rapl: Add an argument to the cleanup and init functions
> >>>>   perf/x86/rapl: Modify the generic variable names to *_pkg*
> >>>>   perf/x86/rapl: Remove the global variable rapl_msrs
> >>>>   perf/x86/rapl: Add per-core energy counter support for AMD CPUs
> >>>>   perf/x86/rapl: Remove the unused function cpu_to_rapl_pmu
> >>>>
> >>>> K Prateek Nayak (1):
> >>>>   x86/topology: Introduce topology_logical_core_id()
> >>>>
> >>>>  Documentation/arch/x86/topology.rst   |   4 +
> >>>>  arch/x86/events/rapl.c                | 454 ++++++++++++++++++--------
> >>>>  arch/x86/include/asm/processor.h      |   1 +
> >>>>  arch/x86/include/asm/topology.h       |   1 +
> >>>>  arch/x86/kernel/cpu/debugfs.c         |   1 +
> >>>>  arch/x86/kernel/cpu/topology_common.c |   1 +
> >>>>  6 files changed, 328 insertions(+), 134 deletions(-)
> >>>>
> >>>> --
> >>>> 2.34.1
> >>>>
