On 7/17/2024 4:17 AM, Ian Rogers wrote:
> On Tue, Jul 16, 2024 at 1:42 AM Dhananjay Ugwekar
> <[email protected]> wrote:
>>
>> Hello Ian,
>>
>> On 7/15/2024 8:52 PM, Ian Rogers wrote:
>>> On Mon, Jul 15, 2024 at 2:36 AM Dhananjay Ugwekar
>>> <[email protected]> wrote:
>>>>
>>>> Hello Ian,
>>>>
>>>> On 7/12/2024 3:53 AM, Ian Rogers wrote:
>>>>> On Thu, Jul 11, 2024 at 3:25 AM Dhananjay Ugwekar
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Currently the energy-cores event in the power PMU aggregates energy
>>>>>> consumption data at a package level. On the other hand, the core
>>>>>> energy RAPL counter in AMD CPUs has a core scope (which means the
>>>>>> energy consumption is recorded separately for each core). Earlier
>>>>>> efforts to add the core event in the power PMU had failed [1], due
>>>>>> to the difference in the scope of these two events. Hence, there is
>>>>>> a need for a new core-scope PMU.
>>>>>>
>>>>>> This patchset adds a new "power_per_core" PMU alongside the existing
>>>>>> "power" PMU, which will be responsible for collecting the new
>>>>>> "energy-per-core" event.
>>>>>
>>>>> Sorry for being naive, but is the cpumask the only reason for adding
>>>>> the new PMU? Perhaps we could add per-event cpumasks, say
>>>>> `/sys/devices/power/events/energy-per-core.cpumask`, which in this
>>>>> case would contain the CPUs of the different cores. There's the
>>>>> issue of supporting hotplugging CPUs. Adding the tool support for
>>>>> this wouldn't be hard, and it may be less messy (although old perf
>>>>> tools on new kernels wouldn't know about these files).
>>>>
>>>> I went over the two approaches and below are my thoughts.
>>>>
>>>> New PMU approach:
>>>> Pros
>>>> * It will work with older perf tools, hence these patches can be
>>>>   backported to an older kernel and the new per-core event will work
>>>>   there as well.
>>>> Cons
>>>> * More code changes in rapl.c
>>>>
>>>> Event-specific cpumask approach:
>>>> Pros
>>>> * It might be easier to add different-scope events within the same
>>>>   PMU in the future (although currently I'm not able to find such a
>>>>   use case, apart from the RAPL pkg and core energy counters)
>>>> Cons
>>>> * Both a new kernel and a new perf tool will be required to use the
>>>>   new per-core event.
>>>>
>>>> I feel that while the event-specific cpumask is a viable alternative
>>>> to the new-PMU approach, I don't see any clear pros to selecting it
>>>> over the current approach. Please let me know if you have any
>>>> design-related concerns about the addition of a new PMU, or whether
>>>> your concern is mostly about the amount of code changes in this
>>>> approach.
>>>
>>> Thanks Dhananjay, and thanks for taking the time for an objective
>>> discussion on the mailing list. I'm very supportive of seeing the
>>> work you are enabling land.
>>>
>>> My concern comes from the tool side. If every PMU starts to have
>>> variants for the sake of the cpumask, what does this mean for
>>> aggregation in the perf tool? There is another issue around event
>>> grouping: you can't group events across PMUs, but my feeling is that
>>> event grouping needs to be rethought. By default the power_per_core
>>> events are going to be aggregated together by the perf tool, which
>>> then loses their per-core-ness.
>>
>> Yea right, maybe we need to fix this behavior.
>>
>>>
>>> I was trying to think of the same problem in other PMUs. One thought
>>> I had was the difference between hyperthread and core events. At
>>> least on Intel, some events can only count for the whole core, not
>>> per hyperthread. Those events don't have a cpu_per_core PMU; they
>>> just use the regular cpu one, and so the cpumask is set to all online
>>> hyperthreads. When a per-core event is programmed it will get
>>> programmed on every hyperthread and so counted twice for the core.
>>> This at the least wastes a counter, but it probably also yields twice
>>> the expected count, as every event is counted twice and then
>>> aggregated. So this is just wrong, and the user is expected to be
>>> smart and fix it (checking the x86 events, there is a convention to
>>> use a ".ALL" or ".ANY" suffix for core-wide events, iirc). If we had
>>> a cpumask for these events then we could avoid the double setting,
>>> free up a counter, and avoid double counting. Were we to fix things
>>> the way it is done in this patch series, we'd add another PMU.
>>
>> Yes, this seems like a valid use case for event-specific cpumasks.
>>
>>>
>>> My feeling is that in the longer term a per-event cpumask looks
>>> cleaner. I think either way you need a new kernel for the new RAPL
>>> events. As for the problem of an old perf tool and a new kernel, this
>>> doesn't normally happen with distributions, as they match the perf
>>> tool to the kernel version, needlessly losing features and fixes
>>> along the way. If the new PMU is going to get backported through
>>> fixes, then we can do similar for reading the per-event cpumask. I'd
>>> be tempted not to do this and to focus on the next LTS kernel,
>>> getting the kernel and tool fixes in as necessary.
>>
>> Makes sense. Even though this approach will require more effort, it
>> seems worthwhile, as it would help things down the line (make it
>> easier to have heterogeneous-scope events within a PMU). I'll need to
>> go through the perf tool to see how we can design this. I'll get back
>> with an RFC series once I have an initial design in mind.
>
> Hi Dhananjay,
>
> I can have a go at the perf tool side of this - I probably know my way
> around the code best. Basically we need to do something similar to how
> other "<event>.<setting>" values are parsed:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.c?h=perf-tools-next#n335

Yea, right.
> The cpumask handling in the perf tool is a little weird as there is a
> summary value in the evlist. Anyway, I can do that if you want to spin
> the RAPL/power PMU side of things.

Sounds great! I'll be happy to refactor the RAPL code to use the
event.cpumask feature to add the per-core energy counter. Also, please
let me know if you need any help from me on the perf tool side as well.

Regards,
Dhananjay

>
> Thanks,
> Ian
>
>> Thanks,
>> Dhananjay
>>
>>>
>>> Thanks,
>>> Ian
>>>
>>>
>>>> Thanks,
>>>> Dhananjay
>>>>
>>>>>
>>>>> Thanks,
>>>>> Ian
>>>>>
>>>>>
>>>>>> Tested the package-level and core-level PMU counters with workloads
>>>>>> pinned to different CPUs.
>>>>>>
>>>>>> Results with workload pinned to CPU 1 in Core 1 on an AMD Zen4
>>>>>> Genoa machine:
>>>>>>
>>>>>> $ perf stat -a --per-core -e power_per_core/energy-per-core/ -- sleep 1
>>>>>>
>>>>>>  Performance counter stats for 'system wide':
>>>>>>
>>>>>> S0-D0-C0      1     0.02 Joules power_per_core/energy-per-core/
>>>>>> S0-D0-C1      1     5.72 Joules power_per_core/energy-per-core/
>>>>>> S0-D0-C2      1     0.02 Joules power_per_core/energy-per-core/
>>>>>> S0-D0-C3      1     0.02 Joules power_per_core/energy-per-core/
>>>>>> S0-D0-C4      1     0.02 Joules power_per_core/energy-per-core/
>>>>>> S0-D0-C5      1     0.02 Joules power_per_core/energy-per-core/
>>>>>> S0-D0-C6      1     0.02 Joules power_per_core/energy-per-core/
>>>>>> S0-D0-C7      1     0.02 Joules power_per_core/energy-per-core/
>>>>>> S0-D0-C8      1     0.02 Joules power_per_core/energy-per-core/
>>>>>> S0-D0-C9      1     0.02 Joules power_per_core/energy-per-core/
>>>>>> S0-D0-C10     1     0.02 Joules power_per_core/energy-per-core/
>>>>>>
>>>>>> [1]:
>>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>>
>>>>>> This patchset applies cleanly on top of v6.10-rc7 as well as latest
>>>>>> tip/master.
>>>>>>
>>>>>> v4 changes:
>>>>>> * Add patch 11 which removes the unused function cpu_to_rapl_pmu()
>>>>>> * Add Rui's rb tag for patch 1
>>>>>> * Invert the PMU scope check logic in patch 2 (Peter)
>>>>>> * Add comments explaining the scope check in patch 2 (Peter)
>>>>>> * Use cpumask_var_t instead of cpumask_t in patch 5 (Peter)
>>>>>> * Move renaming code to patch 8 (Rui)
>>>>>> * Reorder the cleanup order of per-core and per-pkg PMU in patch 10
>>>>>>   (Rui)
>>>>>> * Add rapl_core_hw_unit variable to store the per-core PMU unit in
>>>>>>   patch 10 (Rui)
>>>>>>
>>>>>> PS: The scope check logic is still kept the same (i.e., all Intel
>>>>>> systems being considered as die scope); Rui will be modifying it to
>>>>>> limit the die scope only to Cascadelake-AP in a future patch on top
>>>>>> of this patchset.
>>>>>>
>>>>>> v3 changes:
>>>>>> * Patch 1 added to introduce the logical_core_id which is unique
>>>>>>   across the system (Prateek)
>>>>>> * Use the unique topology_logical_core_id() instead of
>>>>>>   topology_core_id() (which is only unique within a package on
>>>>>>   tested AMD and Intel systems) in patch 10
>>>>>>
>>>>>> v2 changes:
>>>>>> * Patches 6, 7, 8 added to split some changes out of the last patch
>>>>>> * Use container_of to get the rapl_pmus from the event variable (Rui)
>>>>>> * Set the PERF_EV_CAP_READ_ACTIVE_PKG flag only for the pkg-scope
>>>>>>   PMU (Rui)
>>>>>> * Use event id 0x1 for the energy-per-core event (Rui)
>>>>>> * Use the PERF_RAPL_PER_CORE bit instead of adding a new flag to
>>>>>>   check for per-core counter hw support (Rui)
>>>>>>
>>>>>> Dhananjay Ugwekar (10):
>>>>>>   perf/x86/rapl: Fix the energy-pkg event for AMD CPUs
>>>>>>   perf/x86/rapl: Rename rapl_pmu variables
>>>>>>   perf/x86/rapl: Make rapl_model struct global
>>>>>>   perf/x86/rapl: Move cpumask variable to rapl_pmus struct
>>>>>>   perf/x86/rapl: Add wrapper for online/offline functions
>>>>>>   perf/x86/rapl: Add an argument to the cleanup and init functions
>>>>>>   perf/x86/rapl: Modify the generic variable names to *_pkg*
>>>>>>   perf/x86/rapl: Remove the global variable rapl_msrs
>>>>>>   perf/x86/rapl: Add per-core energy counter support for AMD CPUs
>>>>>>   perf/x86/rapl: Remove the unused function cpu_to_rapl_pmu
>>>>>>
>>>>>> K Prateek Nayak (1):
>>>>>>   x86/topology: Introduce topology_logical_core_id()
>>>>>>
>>>>>>  Documentation/arch/x86/topology.rst   |   4 +
>>>>>>  arch/x86/events/rapl.c                | 454 ++++++++++++++++++--------
>>>>>>  arch/x86/include/asm/processor.h      |   1 +
>>>>>>  arch/x86/include/asm/topology.h       |   1 +
>>>>>>  arch/x86/kernel/cpu/debugfs.c         |   1 +
>>>>>>  arch/x86/kernel/cpu/topology_common.c |   1 +
>>>>>>  6 files changed, 328 insertions(+), 134 deletions(-)
>>>>>>
>>>>>> --
>>>>>> 2.34.1
>>>>>>
