On Tue, Jul 16, 2024 at 1:42 AM Dhananjay Ugwekar <[email protected]> wrote:
>
> Hello Ian,
>
> On 7/15/2024 8:52 PM, Ian Rogers wrote:
> > On Mon, Jul 15, 2024 at 2:36 AM Dhananjay Ugwekar
> > <[email protected]> wrote:
> >>
> >> Hello Ian,
> >>
> >> On 7/12/2024 3:53 AM, Ian Rogers wrote:
> >>> On Thu, Jul 11, 2024 at 3:25 AM Dhananjay Ugwekar
> >>> <[email protected]> wrote:
> >>>>
> >>>> Currently the energy-cores event in the power PMU aggregates energy
> >>>> consumption data at a package level. On the other hand, the core
> >>>> energy RAPL counter in AMD CPUs has core scope (which means the
> >>>> energy consumption is recorded separately for each core). Earlier
> >>>> efforts to add the core event in the power PMU failed [1] due to
> >>>> the difference in the scope of these two events. Hence, there is a
> >>>> need for a new core-scope PMU.
> >>>>
> >>>> This patchset adds a new "power_per_core" PMU alongside the
> >>>> existing "power" PMU, which will be responsible for collecting the
> >>>> new "energy-per-core" event.
> >>>
> >>> Sorry for being naive, but is the cpumask the only reason for adding
> >>> the new PMU? Perhaps we could add per-event cpumasks, say
> >>> `/sys/devices/power/events/energy-per-core.cpumask`, which in this
> >>> case would contain the CPUs of the different cores. Supporting CPU
> >>> hotplug is an issue there. Adding the tool support for this wouldn't
> >>> be hard and it may be less messy (although old perf tools on new
> >>> kernels wouldn't know about these files).
> >>
> >> I went over the two approaches and below are my thoughts.
> >>
> >> New PMU approach:
> >> Pros
> >> * It will work with older perf tools, so these patches can be
> >>   backported to an older kernel and the new per-core event will work
> >>   there as well.
> >> Cons
> >> * More code changes in rapl.c
> >>
> >> Event-specific cpumask approach:
> >> Pros
> >> * It might be easier to add different-scope events within the same
> >>   PMU in the future (although currently I'm not able to find such a
> >>   use case, apart from the RAPL pkg and core energy counters).
> >> Cons
> >> * Both a new kernel and a new perf tool will be required to use the
> >>   new per-core event.
> >>
> >> I feel that while the event-specific cpumask is a viable alternative
> >> to adding a new PMU, I don't see any clear pros to selecting it over
> >> the current approach. Please let me know whether you have any design
> >> concerns with the addition of a new PMU, or whether your concern is
> >> mostly about the amount of code change in this approach.
> >
> > Thanks Dhananjay, and thanks for taking the time for an objective
> > discussion on the mailing list. I'm very supportive of seeing the
> > work you are enabling land.
> >
> > My concern comes from the tool side. If every PMU starts to have
> > variants for the sake of the cpumask, what does this mean for
> > aggregation in the perf tool? There is another issue around event
> > grouping: you can't group events across PMUs, but my feeling is that
> > event grouping needs to be rethought. By default the power_per_core
> > events are going to be aggregated together by the perf tool, which
> > then loses their per-core-ness.
>
> Yeah right, maybe we need to fix this behavior.
>
> >
> > I was trying to think of the same problem in other PMUs. One thought
> > I had was the difference between hyperthread and core events. At
> > least on Intel, some events can only count for the whole core, not
> > per hyperthread.
> > The events don't have a cpu_per_core PMU, they just use the regular
> > cpu one, and so the cpumask is set to all online hyperthreads. When
> > a per-core event is programmed it will get programmed on every
> > hyperthread and so counted twice for the core. This at the least
> > wastes a counter, but it probably also yields twice the expected
> > count, as every event is counted twice and then aggregated. So this
> > is just wrong, and the user is expected to be smart and fix it
> > (checking the x86 events, there is a convention to use a ".ALL" or
> > ".ANY" suffix for core-wide events, iirc). If we had a cpumask for
> > these events then we could avoid the double setting, free up a
> > counter and avoid double counting. Were we to fix things the way it
> > is done in this patch series, we'd add another PMU.
>
> Yes, this seems like a valid use case for event-specific cpumasks.
>
> >
> > My feeling is that in the longer term a per-event cpumask looks
> > cleaner. I think either way you need a new kernel for the new RAPL
> > events. The problem of an old perf tool with a new kernel doesn't
> > normally happen with distributions, as they match the perf tool to
> > the kernel version, needlessly losing features and fixes along the
> > way. If the new PMU is going to get backported through fixes.. then
> > we can do similar for reading the per-event cpumask. I'd be tempted
> > not to do this and to focus on the next LTS kernel, getting the
> > kernel and tool fixes in as necessary.
>
> Makes sense. Even though this approach will require more effort, it
> seems worthwhile, as it would help things down the line (make it
> easier to have heterogeneous-scope events within a PMU). I'll need to
> go through the perf tool to see how we can design this. I'll get back
> with an RFC series once I have an initial design in mind.
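For concreteness, a minimal sketch of the kernel side being discussed
here: exposing a per-event cpumask as an attribute in the PMU's sysfs
events directory. Everything below (the mask variable, the attribute
names, the assumption that hotplug callbacks keep the mask current) is
illustrative, not code from this patchset.

/*
 * Minimal sketch: one way a PMU driver could expose an event-specific
 * cpumask as events/energy-per-core.cpumask.  Names are hypothetical,
 * not the actual arch/x86/events/rapl.c code.
 */
#include <linux/cpumask.h>
#include <linux/device.h>
#include <linux/sysfs.h>

/* One CPU per core; the driver's hotplug callbacks keep this current. */
static cpumask_var_t rapl_per_core_cpumask;

static ssize_t energy_per_core_cpumask_show(struct device *dev,
					    struct device_attribute *attr,
					    char *buf)
{
	/* Emits the usual cpulist format, e.g. "0-7" or "0,2,4-6". */
	return cpumap_print_to_pagebuf(true, buf, rapl_per_core_cpumask);
}

/*
 * Added to the PMU's "events" attribute group, so it would appear as
 * /sys/devices/power/events/energy-per-core.cpumask.
 */
static struct device_attribute energy_per_core_cpumask_attr = {
	.attr = { .name = "energy-per-core.cpumask", .mode = 0444 },
	.show = energy_per_core_cpumask_show,
};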
Hi Dhananjay,

I can have a go at the perf tool side of this - I probably know my way
around that code best. Basically we need to do something similar to how
other "<event>.<setting>" values are parsed:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.c?h=perf-tools-next#n335
The cpumask handling in the perf tool is a little weird, as there is a
summary value in the evlist. Anyway, I can do that if you want to spin
the RAPL/power PMU side of things. (A rough sketch of the parsing is
appended below the quoted mail.)

Thanks,
Ian

> Thanks,
> Dhananjay
>
> >
> > Thanks,
> > Ian
> >
>
> >> Thanks,
> >> Dhananjay
> >>
> >>>
> >>> Thanks,
> >>> Ian
> >>>
> >>>
> >>>> Tested the package-level and core-level PMU counters with
> >>>> workloads pinned to different CPUs.
> >>>>
> >>>> Results with a workload pinned to CPU 1 in Core 1 on an AMD Zen4
> >>>> Genoa machine:
> >>>>
> >>>> $ perf stat -a --per-core -e power_per_core/energy-per-core/ -- sleep 1
> >>>>
> >>>>  Performance counter stats for 'system wide':
> >>>>
> >>>> S0-D0-C0     1     0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C1     1     5.72 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C2     1     0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C3     1     0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C4     1     0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C5     1     0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C6     1     0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C7     1     0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C8     1     0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C9     1     0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C10    1     0.02 Joules power_per_core/energy-per-core/
> >>>>
> >>>> [1]:
> >>>> https://lore.kernel.org/lkml/[email protected]/
> >>>>
> >>>> This patchset applies cleanly on top of v6.10-rc7 as well as the
> >>>> latest tip/master.
> >>>>
> >>>> v4 changes:
> >>>> * Add patch 11, which removes the unused function cpu_to_rapl_pmu()
> >>>> * Add Rui's rb tag for patch 1
> >>>> * Invert the PMU scope check logic in patch 2 (Peter)
> >>>> * Add comments explaining the scope check in patch 2 (Peter)
> >>>> * Use cpumask_var_t instead of cpumask_t in patch 5 (Peter)
> >>>> * Move the renaming code to patch 8 (Rui)
> >>>> * Reorder the cleanup order of the per-core and per-pkg PMUs in
> >>>>   patch 10 (Rui)
> >>>> * Add a rapl_core_hw_unit variable to store the per-core PMU unit
> >>>>   in patch 10 (Rui)
> >>>>
> >>>> PS: The scope check logic is kept the same (i.e., all Intel systems
> >>>> are considered die scope); Rui will be modifying it to limit the
> >>>> die scope to Cascadelake-AP in a future patch on top of this
> >>>> patchset.
> >>>>
> >>>> v3 changes:
> >>>> * Add patch 1 to introduce logical_core_id, which is unique across
> >>>>   the system (Prateek)
> >>>> * Use the unique topology_logical_core_id() instead of
> >>>>   topology_core_id() (which is only unique within a package on the
> >>>>   tested AMD and Intel systems) in patch 10
> >>>>
> >>>> v2 changes:
> >>>> * Add patches 6, 7, 8 to split some changes out of the last patch
> >>>> * Use container_of to get the rapl_pmus from the event variable (Rui)
> >>>> * Set the PERF_EV_CAP_READ_ACTIVE_PKG flag only for the pkg-scope
> >>>>   PMU (Rui)
> >>>> * Use event id 0x1 for the energy-per-core event (Rui)
> >>>> * Use the PERF_RAPL_PER_CORE bit instead of adding a new flag to
> >>>>   check for per-core counter hw support (Rui)
> >>>>
> >>>> Dhananjay Ugwekar (10):
> >>>>   perf/x86/rapl: Fix the energy-pkg event for AMD CPUs
> >>>>   perf/x86/rapl: Rename rapl_pmu variables
> >>>>   perf/x86/rapl: Make rapl_model struct global
> >>>>   perf/x86/rapl: Move cpumask variable to rapl_pmus struct
> >>>>   perf/x86/rapl: Add wrapper for online/offline functions
> >>>>   perf/x86/rapl: Add an argument to the cleanup and init functions
> >>>>   perf/x86/rapl: Modify the generic variable names to *_pkg*
> >>>>   perf/x86/rapl: Remove the global variable rapl_msrs
> >>>>   perf/x86/rapl: Add per-core energy counter support for AMD CPUs
> >>>>   perf/x86/rapl: Remove the unused function cpu_to_rapl_pmu
> >>>>
> >>>> K Prateek Nayak (1):
> >>>>   x86/topology: Introduce topology_logical_core_id()
> >>>>
> >>>>  Documentation/arch/x86/topology.rst   |   4 +
> >>>>  arch/x86/events/rapl.c                | 454 ++++++++++++++++++--------
> >>>>  arch/x86/include/asm/processor.h      |   1 +
> >>>>  arch/x86/include/asm/topology.h       |   1 +
> >>>>  arch/x86/kernel/cpu/debugfs.c         |   1 +
> >>>>  arch/x86/kernel/cpu/topology_common.c |   1 +
> >>>>  6 files changed, 328 insertions(+), 134 deletions(-)
> >>>>
> >>>> --
> >>>> 2.34.1
> >>>>
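A self-contained sketch of the tool-side parsing referred to above:
read a hypothetical events/<event>.cpumask sysfs file and parse its
cpulist format ("0-7,16,18"). In perf itself this would hang off the
existing "<event>.<setting>" alias handling in tools/perf/util/pmu.c
linked above rather than being standalone code; the file name, paths,
and helpers below are assumptions about a possible interface, and most
error handling is elided.

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse a cpulist such as "0-7,16,18" into a malloc'ed array of ints. */
static int *parse_cpulist(const char *list, int *nr_out)
{
	int *cpus = NULL, nr = 0;
	char *copy = strdup(list), *tok, *save = NULL;

	for (tok = strtok_r(copy, ",\n", &save); tok;
	     tok = strtok_r(NULL, ",\n", &save)) {
		int lo, hi;

		if (sscanf(tok, "%d-%d", &lo, &hi) < 2)
			hi = lo = atoi(tok);	/* single CPU, no range */
		for (; lo <= hi; lo++) {
			cpus = realloc(cpus, (nr + 1) * sizeof(*cpus));
			cpus[nr++] = lo;
		}
	}
	free(copy);
	*nr_out = nr;
	return cpus;
}

/*
 * Returns NULL if the event has no cpumask file (caller falls back to
 * the PMU-wide cpumask), else the parsed CPUs.
 */
static int *read_event_cpumask(const char *pmu, const char *event, int *nr)
{
	char path[PATH_MAX], buf[4096];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/bus/event_source/devices/%s/events/%s.cpumask",
		 pmu, event);
	f = fopen(path, "r");
	if (!f || !fgets(buf, sizeof(buf), f)) {
		if (f)
			fclose(f);
		return NULL;
	}
	fclose(f);
	return parse_cpulist(buf, nr);
}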
