On Wed, Aug 27, 2025 at 1:18 AM Mark Rutland <mark.rutl...@arm.com> wrote: > > On Tue, Aug 26, 2025 at 11:48:48AM -0700, Ian Rogers wrote: > > On Tue, Aug 26, 2025 at 8:32 AM Robin Murphy <robin.mur...@arm.com> wrote: > > > > > > On 2025-08-26 2:03 pm, Peter Zijlstra wrote: > > > > On Wed, Aug 13, 2025 at 06:01:04PM +0100, Robin Murphy wrote: > > > >> It may have been different long ago, but today it seems wrong for these > > > >> drivers to skip counting disabled sibling events in group validation, > > > >> given that perf_event_enable() could make them schedulable again, and > > > >> thus increase the effective size of the group later. Conversely, if a > > > >> sibling event is truly dead then it stands to reason that the whole > > > >> group is dead, so it's not worth going to any special effort to try to > > > >> squeeze in a new event that's never going to run anyway. Thus, we can > > > >> simply remove all these checks. > > > > > > > > So currently you can do sort of a manual event rotation inside an > > > > over-sized group and have it work. > > > > > > > > I'm not sure if anybody actually does this, but its possible. > > > > > > > > Eg. on a PMU that supports only 4 counters, create a group of 5 and > > > > periodically cycle which of the 5 events is off. > > > > I'm not sure this is true, I thought this would fail in the > > perf_event_open when adding the 5th event and there being insufficient > > counters for the group. > > We're talking specifically about cases where the logic in a pmu's > pmu::event_init() callback doesn't count events in specific states, and > hence the 5th even doesn't get rejected when it is initialised. > > For example, in arch/x86/events/core.c, validate_group() uses > collect_events(), which has: > > for_each_sibling_event(event, leader) { > if (!is_x86_event(event) || event->state <= > PERF_EVENT_STATE_OFF) > continue; > > if (collect_event(cpuc, event, max_count, n)) > return -EINVAL; > > n++; > } > > ... and so where an event's state is <= PERF_EVENT_STATE_OFF at init > time, that event is not counted to see if it fits into HW counters.
Hmm.. Thinking out loud. So it looked like perf with weak groups could be broken then: ``` $ sudo perf stat -vv -e '{instructions,cycles}:W' true ... perf_event_attr: type 0 (PERF_TYPE_HARDWARE) size 136 config 0x400000001 (cpu_core/PERF_COUNT_HW_INSTRUCTIONS/) sample_type IDENTIFIER read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID|GROUP disabled 1 inherit 1 enable_on_exec 1 ------------------------------------------------------------ sys_perf_event_open: pid 3337764 cpu -1 group_fd -1 flags 0x8 = 5 ------------------------------------------------------------ perf_event_attr: type 0 (PERF_TYPE_HARDWARE) size 136 config 0x400000000 (cpu_core/PERF_COUNT_HW_CPU_CYCLES/) sample_type IDENTIFIER read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID|GROUP inherit 1 ------------------------------------------------------------ sys_perf_event_open: pid 3337764 cpu -1 group_fd 5 flags 0x8 = 7 ... ``` Note, the group leader (instructions) is disabled because of: https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/stat.c?h=perf-tools-next#n761 ``` /* * Disabling all counters initially, they will be enabled * either manually by us or by kernel via enable_on_exec * set later. */ if (evsel__is_group_leader(evsel)) { attr->disabled = 1; ``` but the checking of being disabled (PERF_EVENT_STATE_OFF) is only done on siblings in the code you show above. So yes, you can disable the group events to allow the perf_event_open to succeed but not on the leader which is always checked (no PERF_EVENT_STATE_OFF check): https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/arch/x86/events/core.c?h=perf-tools-next#n1204 ``` if (is_x86_event(leader)) { if (collect_event(cpuc, leader, max_count, n)) return -EINVAL; ``` Thanks, Ian