On 7/17/2024 4:17 AM, Ian Rogers wrote:
> On Tue, Jul 16, 2024 at 1:42 AM Dhananjay Ugwekar
> <[email protected]> wrote:
>>
>> Hello Ian,
>>
>> On 7/15/2024 8:52 PM, Ian Rogers wrote:
>>> On Mon, Jul 15, 2024 at 2:36 AM Dhananjay Ugwekar
>>> <[email protected]> wrote:
>>>>
>>>> Hello Ian,
>>>>
>>>> On 7/12/2024 3:53 AM, Ian Rogers wrote:
>>>>> On Thu, Jul 11, 2024 at 3:25 AM Dhananjay Ugwekar
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Currently the energy-cores event in the power PMU aggregates energy
>>>>>> consumption data at a package level. On the other hand the core energy
>>>>>> RAPL counter in AMD CPUs has a core scope (which means the energy
>>>>>> consumption is recorded separately for each core). Earlier efforts to add
>>>>>> the core event in the power PMU had failed [1], due to the difference in
>>>>>> the scope of these two events. Hence, there is a need for a new core 
>>>>>> scope
>>>>>> PMU.
>>>>>>
>>>>>> This patchset adds a new "power_per_core" PMU alongside the existing
>>>>>> "power" PMU, which will be responsible for collecting the new
>>>>>> "energy-per-core" event.
>>>>>
>>>>> Sorry for being naive, is the only reason for adding the new PMU for
>>>>> the sake of the cpumask? Perhaps we can add per event cpumasks like
>>>>> say `/sys/devices/power/events/energy-per-core.cpumask` which contains
>>>>> the CPUs of the different cores in this case. There's supporting
>>>>> hotplugging CPUs as an issue. Adding the tool support for this
>>>>> wouldn't be hard and it may be less messy (although old perf tools on
>>>>> new kernels wouldn't know about these files).
>>>>
>>>> I went over the two approaches and below are my thoughts,
>>>>
>>>> New PMU approach:
>>>> Pros
>>>> * It will work with older perf tools, hence these patches can be 
>>>> backported to an older kernel and the new per-core event will work there 
>>>> as well.
>>>> Cons
>>>> * More code changes in rapl.c
>>>>
>>>> Event specific cpumask approach:
>>>> Pros
>>>> * It might be easier to add diff scope events within the same PMU in 
>>>> future(although currently I'm not able to find such a usecase, apart from 
>>>> the RAPL pkg and core energy counters)
>>>> Cons
>>>> * Both new kernel and perf tool will be required to use the new per-core 
>>>> event.
>>>>
>>>> I feel that while the event-specific cpumask is a viable alternative to 
>>>> the new PMU addition approach, I dont see any clear pros to select that 
>>>> over the current approach. Please let me know if you have any design 
>>>> related concerns to the addition of new PMU or your concern is mostly 
>>>> about the amount of code changes in this approach.
>>>
>>> Thanks Dhananjay, and thanks for taking the time for an objective
>>> discussion on the mailing list. I'm very supportive of seeing the work
>>> you are enabling land.
>>>
>>> My concern comes from the tool side. If every PMU starts to have
>>> variants for the sake of the cpumask what does this mean for
>>> aggregation in the perf tool? There is another issue around event
>>> grouping, you can't group events across PMUs, but my feeling is that
>>> event grouping needs to be rethought. By default the power_per_core
>>> events are going to be aggregated together by the perf tool, which
>>> then loses their per_core-ness.
>>
>> Yea right, maybe we need to fix this behavior.
>>
>>>
>>> I was trying to think of the same problem but in other PMUs. One
>>> thought I had was the difference between hyperthread and core events.
>>> At least on Intel, some events can only count for the whole core not
>>> per hyperthread. The events don't have a cpu_per_core PMU, they just
>>> use the regular cpu one, and so the cpumask is set to all online
>>> hyperthreads. When a per-core event is programmed it will get
>>> programmed on every hyperthread and so counted twice for the core.
>>> This at the least wastes a counter, but it probably also yields twice
>>> the expected count as every event is counted twice then aggregated. So
>>> this is just wrong and the user is expected to be smart and fix it
>>> (checking the x86 events there is a convention to use a ".ALL" or
>>> ".ANY" suffix for core wide events iirc). If we had a cpumask for
>>> these events then we could avoid the double setting, free up a counter
>>> and avoid double counting. Were we to fix things the way it is done in
>>> this patch series we'd add another PMU.
>>
>> Yes, this seems like a valid usecase for event-specific cpumasks.
>>
>>>
>>> My feeling is that in the longer term a per event cpumask looks
>>> cleaner. I think either way you need a new kernel for the new RAPL
>>> events. The problem with an old perf tool and a new kernel, this
>>> doesn't normally happen with distributions as they match the perf tool
>>> to the kernel version needlessly losing features and fixes along the
>>> way. If the new PMU is going to get backported through fixes.. then we
>>> can do similar for reading the per event cpumask. I'd be tempted not
>>> to do this and focus on the next LTS kernel, getting the kernel and
>>> tool fixes in as necessary.
>>
>> Makes sense, even though this approach will require more effort but it seems
>> to be worthwhile as it would help things down the line (make it easier to 
>> have
>> heterogenous-scope events within a PMU). I'll need to go through the perf 
>> tool
>> to see how we can design this. I'll get back with an RFC series probably once
>> I have an initial design in mind.
> 
> Hi Dhananjay,
> 
> I can have a go at the perf tool side of this - I probably know the
> way around the code best. Basically we need to do something similar to
> how other "<event>.<setting>" values are parsed:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.c?h=perf-tools-next#n335
Yea, right.

> The cpumask handling in the perf tool is a little weird as there is a
> summary value in the evlist. Anyway, I can do that if you want to spin
> the RAPL/power PMU side of things.

Sounds great!, I'll be happy to refactor the RAPL code to use the event.cpumask 
feature to add the per-core energy counter. Also, please let me know if you 
need 
any help from me on the perf tool side as well.

Regards,
Dhananjay

> 
> Thanks,
> Ian
> 
>> Thanks,
>> Dhananjay
>>
>>>
>>> Thanks,
>>> Ian
>>>
>>>
>>>> Thanks,
>>>> Dhananjay
>>>>
>>>>>
>>>>> Thanks,
>>>>> Ian
>>>>>
>>>>>
>>>>>> Tested the package level and core level PMU counters with workloads
>>>>>> pinned to different CPUs.
>>>>>>
>>>>>> Results with workload pinned to CPU 1 in Core 1 on an AMD Zen4 Genoa
>>>>>> machine:
>>>>>>
>>>>>> $ perf stat -a --per-core -e power_per_core/energy-per-core/ -- sleep 1
>>>>>>
>>>>>>  Performance counter stats for 'system wide':
>>>>>>
>>>>>> S0-D0-C0         1          0.02 Joules power_per_core/energy-per-core/
>>>>>> S0-D0-C1         1          5.72 Joules power_per_core/energy-per-core/
>>>>>> S0-D0-C2         1          0.02 Joules power_per_core/energy-per-core/
>>>>>> S0-D0-C3         1          0.02 Joules power_per_core/energy-per-core/
>>>>>> S0-D0-C4         1          0.02 Joules power_per_core/energy-per-core/
>>>>>> S0-D0-C5         1          0.02 Joules power_per_core/energy-per-core/
>>>>>> S0-D0-C6         1          0.02 Joules power_per_core/energy-per-core/
>>>>>> S0-D0-C7         1          0.02 Joules power_per_core/energy-per-core/
>>>>>> S0-D0-C8         1          0.02 Joules power_per_core/energy-per-core/
>>>>>> S0-D0-C9         1          0.02 Joules power_per_core/energy-per-core/
>>>>>> S0-D0-C10        1          0.02 Joules power_per_core/energy-per-core/
>>>>>>
>>>>>> [1]: 
>>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>>
>>>>>> This patchset applies cleanly on top of v6.10-rc7 as well as latest
>>>>>> tip/master.
>>>>>>
>>>>>> v4 changes:
>>>>>> * Add patch 11 which removes the unused function cpu_to_rapl_pmu()
>>>>>> * Add Rui's rb tag for patch 1
>>>>>> * Invert the pmu scope check logic in patch 2 (Peter)
>>>>>> * Add comments explaining the scope check in patch 2 (Peter)
>>>>>> * Use cpumask_var_t instead of cpumask_t in patch 5 (Peter)
>>>>>> * Move renaming code to patch 8 (Rui)
>>>>>> * Reorder the cleanup order of per-core and per-pkg PMU in patch 10 (Rui)
>>>>>> * Add rapl_core_hw_unit variable to store the per-core PMU unit in patch
>>>>>>   10 (Rui)
>>>>>>
>>>>>> PS: Scope check logic is still kept the same (i.e., all Intel systems 
>>>>>> being
>>>>>> considered as die scope), Rui will be modifying it to limit the die-scope
>>>>>> only to Cascadelake-AP in a future patch on top of this patchset.
>>>>>>
>>>>>> v3 changes:
>>>>>> * Patch 1 added to introduce the logical_core_id which is unique across
>>>>>>   the system (Prateek)
>>>>>> * Use the unique topology_logical_core_id() instead of
>>>>>>   topology_core_id() (which is only unique within a package on tested
>>>>>>   AMD and Intel systems) in Patch 10
>>>>>>
>>>>>> v2 changes:
>>>>>> * Patches 6,7,8 added to split some changes out of the last patch
>>>>>> * Use container_of to get the rapl_pmus from event variable (Rui)
>>>>>> * Set PERF_EV_CAP_READ_ACTIVE_PKG flag only for pkg scope PMU (Rui)
>>>>>> * Use event id 0x1 for energy-per-core event (Rui)
>>>>>> * Use PERF_RAPL_PER_CORE bit instead of adding a new flag to check for
>>>>>>   per-core counter hw support (Rui)
>>>>>>
>>>>>> Dhananjay Ugwekar (10):
>>>>>>   perf/x86/rapl: Fix the energy-pkg event for AMD CPUs
>>>>>>   perf/x86/rapl: Rename rapl_pmu variables
>>>>>>   perf/x86/rapl: Make rapl_model struct global
>>>>>>   perf/x86/rapl: Move cpumask variable to rapl_pmus struct
>>>>>>   perf/x86/rapl: Add wrapper for online/offline functions
>>>>>>   perf/x86/rapl: Add an argument to the cleanup and init functions
>>>>>>   perf/x86/rapl: Modify the generic variable names to *_pkg*
>>>>>>   perf/x86/rapl: Remove the global variable rapl_msrs
>>>>>>   perf/x86/rapl: Add per-core energy counter support for AMD CPUs
>>>>>>   perf/x86/rapl: Remove the unused function cpu_to_rapl_pmu
>>>>>>
>>>>>> K Prateek Nayak (1):
>>>>>>   x86/topology: Introduce topology_logical_core_id()
>>>>>>
>>>>>>  Documentation/arch/x86/topology.rst   |   4 +
>>>>>>  arch/x86/events/rapl.c                | 454 ++++++++++++++++++--------
>>>>>>  arch/x86/include/asm/processor.h      |   1 +
>>>>>>  arch/x86/include/asm/topology.h       |   1 +
>>>>>>  arch/x86/kernel/cpu/debugfs.c         |   1 +
>>>>>>  arch/x86/kernel/cpu/topology_common.c |   1 +
>>>>>>  6 files changed, 328 insertions(+), 134 deletions(-)
>>>>>>
>>>>>> --
>>>>>> 2.34.1
>>>>>>

Reply via email to