Excerpts from Athira Rajeev's message of April 9, 2021 10:53 pm:
>
>
>> On 09-Apr-2021, at 6:38 AM, Nicholas Piggin <npig...@gmail.com> wrote:
>>
> Hi Nick,
>
> Thanks for checking the patch and sharing review comments.
>
>> I was going to nitpick "overflown" here as something birds do, but some
>> sources say overflown is okay for past tense.
>>
>> You could use "overflowed" for that, but I understand the issue with the
>> word: you are talking about counters that are currently in an "overflow"
>> state, but the overflow occurred in the past and is not still happening,
>> so "overflowing" doesn't exactly fit either.
>>
>> overflown kind of works; for some reason you can kind of use it for
>> present tense!
>
> Ok sure, yes, the counter is currently in an "overflow" state.
>
>>
>> Excerpts from Athira Rajeev's message of April 7, 2021 12:47 am:
>>> Running perf fuzzer showed below in dmesg logs:
>>> "Can't find PMC that caused IRQ"
>>>
>>> This means a PMU exception happened, but none of the PMCs (Performance
>>> Monitor Counters) were found to be overflown. There are some corner cases
>>> that clear the PMCs after the PMI gets masked. In such cases, the perf
>>> interrupt handler will not find the active PMC values that had caused
>>> the overflow, and this leads to the above message while replaying.
>>>
>>> Case 1: PMU interrupt happens during replay of other interrupts and
>>> counter values get cleared by PMU callbacks before replay:
>>>
>>> During replay of interrupts like timer, __do_irq and doorbell exception, we
>>> conditionally enable interrupts via may_hard_irq_enable(). This could
>>> potentially create a window to generate a PMI. Since the irq soft mask is
>>> set to ALL_DISABLED, the PMI will get masked here.
>>
>> I wonder if may_hard_irq_enable shouldn't enable if PMI is soft
>> disabled. And also maybe replay should not set ALL_DISABLED if
>> there are no PMI interrupts pending.
>>
>> Still, I think those are a bit more tricky and might take a while
>> to get right or just not be worth while, so I think your patch is
>> fine.
>
> Ok Nick.
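Just to illustrate the first idea above, a rough sketch of the check I had
in mind (the function name is made up and this is untested; it assumes the
existing irq_soft_mask_return()/IRQS_PMI_DISABLED helpers can be used at
that point):

	/*
	 * Only hard-enable MSR[EE] during interrupt handling/replay when
	 * PMIs are not soft-masked, so a PMC overflow in that window is
	 * taken immediately rather than being latched as a masked PMI
	 * that later finds its PMC already cleared.
	 */
	static inline void hard_irq_enable_if_pmi_unmasked(void)
	{
		if (!(irq_soft_mask_return() & IRQS_PMI_DISABLED))
			__hard_irq_enable();
	}

But as I said, that is trickier to get right, so your approach is fine.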
>>
>>> We could get IPIs run before
>>> perf interrupt is replayed and the PMU events could get deleted or
>>> stopped. This will change the PMU SPR values and reset the counters.
>>> Snippet of ftrace log showing PMU callbacks invoked in "__do_irq":
>>>
>>> <idle>-0     [051] dns.  132025441306354: __do_irq <-call_do_irq
>>> <idle>-0     [051] dns.  132025441306430: irq_enter <-__do_irq
>>> <idle>-0     [051] dns.  132025441306503: irq_enter_rcu <-__do_irq
>>> <idle>-0     [051] dnH.  132025441306599: xive_get_irq <-__do_irq
>>> <<>>
>>> <idle>-0     [051] dnH.  132025441307770: generic_smp_call_function_single_interrupt <-smp_ipi_demux_relaxed
>>> <idle>-0     [051] dnH.  132025441307839: flush_smp_call_function_queue <-smp_ipi_demux_relaxed
>>> <idle>-0     [051] dnH.  132025441308057: _raw_spin_lock <-event_function
>>> <idle>-0     [051] dnH.  132025441308206: power_pmu_disable <-perf_pmu_disable
>>> <idle>-0     [051] dnH.  132025441308337: power_pmu_del <-event_sched_out
>>> <idle>-0     [051] dnH.  132025441308407: power_pmu_read <-power_pmu_del
>>> <idle>-0     [051] dnH.  132025441308477: read_pmc <-power_pmu_read
>>> <idle>-0     [051] dnH.  132025441308590: isa207_disable_pmc <-power_pmu_del
>>> <idle>-0     [051] dnH.  132025441308663: write_pmc <-power_pmu_del
>>> <idle>-0     [051] dnH.  132025441308787: power_pmu_event_idx <-perf_event_update_userpage
>>> <idle>-0     [051] dnH.  132025441308859: rcu_read_unlock_strict <-perf_event_update_userpage
>>> <idle>-0     [051] dnH.  132025441308975: power_pmu_enable <-perf_pmu_enable
>>> <<>>
>>> <idle>-0     [051] dnH.  132025441311108: irq_exit <-__do_irq
>>> <idle>-0     [051] dns.  132025441311319: performance_monitor_exception <-replay_soft_interrupts
>>>
>>> Case 2: PMIs masked during local_* operations, for example local_add.
>>> If the local_add operation happens within a local_irq_save, replay of
>>> the PMI will be during local_irq_restore. Similar to case 1, this could
>>> also create a window before replay where PMU events get deleted or
>>> stopped.
>>
>> Here as well perhaps PMIs should be replayed if they are unmasked
>> even if other interrupts are still masked. Again that might be more
>> complexity than it's worth.
>
> Ok..
>
>>
>>>
>>> Patch adds a fix to update the PMU callback functions (del, stop, enable)
>>> to check for a pending perf interrupt. If there is an overflown PMC and a
>>> pending perf interrupt indicated in the paca, clear the PMI bit in the
>>> paca to drop that sample. In case of power_pmu_del, also clear the MMCR0
>>> PMAO bit, which otherwise could lead to spurious interrupts in some corner
>>> cases. Example: a timer interrupt after power_pmu_del will re-enable
>>> interrupts since the PMI is cleared, and triggers a PMI again since the
>>> PMAO bit is still set.
>>>
>>> We can't just replay the PMI any time. Hence this approach is preferred
>>> rather than replaying the PMI before resetting the overflown PMC. Patch
>>> also documents core-book3s on a race condition which can trigger these
>>> PMC messages during the idle path in PowerNV.
>>>
>>> Fixes: f442d004806e ("powerpc/64s: Add support to mask perf interrupts and replay them")
>>> Reported-by: Nageswara R Sastry <nasas...@in.ibm.com>
>>> Suggested-by: Nicholas Piggin <npig...@gmail.com>
>>> Suggested-by: Madhavan Srinivasan <ma...@linux.ibm.com>
>>> Signed-off-by: Athira Rajeev <atraj...@linux.vnet.ibm.com>
>>> ---
>>>  arch/powerpc/include/asm/pmc.h  | 11 +++++++++
>>>  arch/powerpc/perf/core-book3s.c | 55 +++++++++++++++++++++++++++++++++++++++++
>>>  2 files changed, 66 insertions(+)
>>>
>>> diff --git a/arch/powerpc/include/asm/pmc.h b/arch/powerpc/include/asm/pmc.h
>>> index c6bbe9778d3c..97b4bd8de25b 100644
>>> --- a/arch/powerpc/include/asm/pmc.h
>>> +++ b/arch/powerpc/include/asm/pmc.h
>>> @@ -34,11 +34,22 @@ static inline void ppc_set_pmu_inuse(int inuse)
>>>  #endif
>>>  }
>>>
>>> +static inline int clear_paca_irq_pmi(void)
>>> +{
>>> +	if (get_paca()->irq_happened & PACA_IRQ_PMI) {
>>> +		WARN_ON_ONCE(mfmsr() & MSR_EE);
>>> +		get_paca()->irq_happened &= ~PACA_IRQ_PMI;
>>> +		return 1;
>>> +	}
>>> +	return 0;
>>> +}
>>
>> Could you put this in arch/powerpc/include/asm/hw_irq.h and,
>> rather than paca_irq, call it irq_pending perhaps:
>>
>> clear_pmi_irq_pending()
>>
>> get_clear_pmi_irq_pending() if you're also testing it.
>
> Sure, I will use "get_clear_pmi_irq_pending()" and try moving this to
> arch/powerpc/include/asm/hw_irq.h
>
>> Could you add a little comment about the corner cases above it too?
>> The root cause seems to be that interrupt replay while a masked PMI is
>> pending can result in other interrupts arriving which clear the PMU
>> overflow, so the pending PMI must be cleared.
>
> Ok, I will add the comment and fix this in the next version.
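To make sure we're talking about the same thing, roughly what I had in mind
for the hw_irq.h helper plus comment, based on your existing code (just a
sketch; the comment wording and final form are of course up to you):

	/*
	 * If a PMI is pending in the PACA while it is soft-masked, interrupt
	 * replay can run other handlers (IPIs, timers) first, and those can
	 * delete or stop perf events and reset the overflown PMC. The
	 * recorded PMI is then stale and must be discarded, otherwise the
	 * replayed perf interrupt finds no overflown PMC ("Can't find PMC
	 * that caused IRQ").
	 */
	static inline int get_clear_pmi_irq_pending(void)
	{
		if (get_paca()->irq_happened & PACA_IRQ_PMI) {
			WARN_ON_ONCE(mfmsr() & MSR_EE);
			get_paca()->irq_happened &= ~PACA_IRQ_PMI;
			return 1;
		}
		return 0;
	}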
>
>>
>>> +
>>>  extern void power4_enable_pmcs(void);
>>>
>>>  #else /* CONFIG_PPC64 */
>>>
>>>  static inline void ppc_set_pmu_inuse(int inuse) { }
>>> +static inline int clear_paca_irq_pmi(void) { return 0; }
>>>
>>>  #endif
>>>
>>> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
>>> index 766f064f00fb..18ca3c90f866 100644
>>> --- a/arch/powerpc/perf/core-book3s.c
>>> +++ b/arch/powerpc/perf/core-book3s.c
>>> @@ -847,6 +847,20 @@ static void write_pmc(int idx, unsigned long val)
>>>  	}
>>>  }
>>>
>>> +static int pmc_overflown(int idx)
>>> +{
>>> +	unsigned long val[8];
>>> +	int i;
>>> +
>>> +	for (i = 0; i < ppmu->n_counter; i++)
>>> +		val[i] = read_pmc(i + 1);
>>> +
>>> +	if ((int)val[idx-1] < 0)
>>> +		return 1;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>>  /* Called from sysrq_handle_showregs() */
>>>  void perf_event_print_debug(void)
>>>  {
>>> @@ -1438,6 +1452,15 @@ static void power_pmu_enable(struct pmu *pmu)
>>>  		event = cpuhw->event[i];
>>>  		if (event->hw.idx && event->hw.idx != hwc_index[i] + 1) {
>>>  			power_pmu_read(event);
>>> +			/*
>>> +			 * if the PMC corresponding to event->hw.idx is
>>> +			 * overflown, check if there is any pending perf
>>> +			 * interrupt set in paca. If so, disable the interrupt
>>> +			 * by clearing the paca bit for PMI since we are going
>>> +			 * to reset the PMC.
>>> +			 */
>>> +			if (pmc_overflown(event->hw.idx))
>>> +				clear_paca_irq_pmi();
>>
>> If the pmc is not overflown, could there still be a PMI pending?
>
> I didn't hit that scenario where a PMI is pending without an overflown PMC.
> Also I believe if such a case happens, we will need an investigation there.
> It could be a different case to be handled.

Okay, so a PMI will not occur without an overflown PMC, and the overflown
PMC will only be cleared in places where you also clear a possible pending
PMI?

>
> I actually considered the below two points for adding this PMC check
> instead of just clearing the PMI.
>
> 1. Make sure we are not masking any bug here by just clearing PACA_IRQ_PMI.
>    Ideally, if PMI is set in irq_happened, it means there was a counter
>    overflow.
> 2. If there is more than one PMU event, say two events, make sure we are
>    clearing the PMI only for the event whose counter has overflown.

Those are good points. Would you consider also adding a warning for the
case of no PMCs overflown but PMI is pending? That way you might have more
information about such a problem if it ever happens.

We try to add a good deal of warnings around the soft-mask code because
it's very tricky to change without causing more bugs, so even for future
changes to the code this would probably be useful.
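For example, something along these lines (purely a sketch; the helper name
and where it would be called from are made up, and it reuses read_pmc() and
ppmu->n_counter the same way pmc_overflown() does):

	/*
	 * Warn if a PMI is recorded as pending although none of the PMCs
	 * is currently in an overflow state. That combination would point
	 * at a bug in the soft-mask/replay handling rather than the corner
	 * cases this patch addresses.
	 */
	static void warn_on_orphaned_pmi(void)
	{
		int i;

		if (!(get_paca()->irq_happened & PACA_IRQ_PMI))
			return;

		for (i = 1; i <= ppmu->n_counter; i++) {
			if ((int)read_pmc(i) < 0)
				return;	/* some PMC did overflow, PMI is expected */
		}

		WARN_ONCE(1, "PMI pending but no PMC is overflown\n");
	}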
>>> @@ -1636,6 +1664,22 @@ static void power_pmu_del(struct perf_event *event, int ef_flags)
>>>  			--cpuhw->n_events;
>>>  			ppmu->disable_pmc(event->hw.idx - 1, &cpuhw->mmcr);
>>>  			if (event->hw.idx) {
>>> +				/*
>>> +				 * if the PMC corresponding to event->hw.idx is
>>> +				 * overflown, check if there is any pending perf
>>> +				 * interrupt set in paca. If so, disable the interrupt
>>> +				 * and clear the MMCR0 PMAO bit since we are going
>>> +				 * to reset the PMC and delete the event.
>>> +				 */
>>> +				if (pmc_overflown(event->hw.idx)) {
>>> +					if (clear_paca_irq_pmi()) {
>>> +						val_mmcr0 = mfspr(SPRN_MMCR0);
>>> +						val_mmcr0 &= ~MMCR0_PMAO;
>>> +						write_mmcr0(cpuhw, val_mmcr0);
>>> +						mb();
>>> +						isync();
>>
>> I don't know the perf subsystem, but just out of curiosity why does
>> MMCR0 need to be cleared only in this case?
>
> I got a corner case in power_pmu_del, with only clearing PACA_IRQ_PMI and
> without resetting the MMCR0 PMAO bit.
> Here is the flow:
>
> 1. We clear the PMI bit in the paca, but MMCR0 still has the PMAO bit set.
>    The PMAO bit indicates that a PMI has occurred.
> 2. A timer interrupt is replayed after power_pmu_del, which does a
>    "may_hard_irq_enable". This re-enables interrupts and triggers a PMI
>    again since the PMAO bit is still set.
>
> So clear the PMAO bit to avoid such spurious interrupts.
> Ftrace logs showing the same with some debug trace_printks:
>
> <idle>-0   [134] d.h.  327287888478: power_pmu_del <-event_sched_out.isra.126
> <<>> Here we cleared the PMI
> <idle>-0   [134] d.h.  327287889272: write_pmc <-power_pmu_del
> <idle>-0   [134] d.h.  327287889346: rcu_read_unlock_strict <-perf_event_update_userpage
> <idle>-0   [134] d.h.  327287889711: power_pmu_del: In power_pmu_del MMCR0 is 82004090, local_paca->irq_happened is 9
> <idle>-0   [134] d.h.  327287889811: power_pmu_enable <-perf_pmu_enable
> <idle>-0   [134] d.h.  327287889982: irq_exit <-doorbell_exception
> <idle>-0   [134] d...  327287890053: idle_cpu <-irq_exit
> <idle>-0   [134] d...  327287890158: tick_nohz_irq_exit <-irq_exit
> <idle>-0   [134] d...  327287890219: ktime_get <-tick_nohz_irq_exit
> <idle>-0   [134] d...  327287890328: replay_soft_interrupts <-interrupt_exit_kernel_prepare
> <idle>-0   [134] d...  327287890399: irq_enter <-timer_interrupt
> <<>>
> <idle>-0   [134] d.h.  327287891163: timer_interrupt: Before may_hard_irq_enable MMCR0 is 82004090, local_paca->irq_happened is 1
> <<>>
> <idle>-0   [134] d.h.  327287894310: timer_interrupt: After may_hard_irq_enable MMCR0 is 82004090, local_paca->irq_happened is 21
>
> In case of the other callbacks like pmu enable, we are programming MMCR0.
> But in case of the event getting deleted, there is no way we clear PMAO
> unless an event gets scheduled again on that cpu. Hence I added this check
> only in the pmu_del callback.
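Just to restate the MMCR0 handling from the hunk above as a small helper,
to check my reading (the helper and its name are only illustrative, not a
suggestion to change the patch):

	/*
	 * Once the stale pending PMI has been discarded in power_pmu_del(),
	 * also clear MMCR0[PMAO] so that a later hard-enable of MSR[EE]
	 * (e.g. from a replayed timer interrupt) does not raise a spurious
	 * PMI for a counter that has already been reset.
	 */
	static void clear_pmao(struct cpu_hw_events *cpuhw)
	{
		unsigned long val_mmcr0 = mfspr(SPRN_MMCR0);

		val_mmcr0 &= ~MMCR0_PMAO;
		write_mmcr0(cpuhw, val_mmcr0);
		mb();
		isync();
	}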
>
>
>> What if we disabled MSR[EE]
>> right before a perf interrupt came in, so we don't get a pending PMI
>> but the condition is still close to the same.
>
> Nick, I didn't get this question exactly. Can you please help explain a bit?
> From my understanding, consider that we disabled MSR[EE] before the perf
> interrupt came in.
> So once the interrupts are re-enabled:
>
> 1. If the soft mask is set to IRQS_DISABLED, the perf interrupt will be
>    triggered as an NMI.
> 2. In case of ALL_DISABLED, it will be masked for replay. If PMU callbacks
>    are invoked before replay, our present patch will take care of clearing
>    the PMI in the corner cases.

Well I'm wondering about the same PMAO bug. Above you said:

1. We clear the PMI bit in the paca, but MMCR0 has the PMAO bit still set.
   The PMAO bit indicates a PMI has occurred.
2. A timer interrupt is replayed after power_pmu_del which does a
   "may_hard_irq_enable". This will re-enable interrupts and triggers a PMI
   again since the PMAO bit is still set.

So in this situation, what if we had disabled interrupts and that had caused
MSR[EE] to be cleared (let's say due to a PCI interrupt arriving), and then
a PMC overflows and causes PMAO to be set. Then you run this code:

+				/*
+				 * if the PMC corresponding to event->hw.idx is
+				 * overflown, check if there is any pending perf
+				 * interrupt set in paca. If so, disable the interrupt
+				 * and clear the MMCR0 PMAO bit since we are going
+				 * to reset the PMC and delete the event.
+				 */
+				if (pmc_overflown(event->hw.idx)) {
+					if (clear_paca_irq_pmi()) {
+						val_mmcr0 = mfspr(SPRN_MMCR0);
+						val_mmcr0 &= ~MMCR0_PMAO;
+						write_mmcr0(cpuhw, val_mmcr0);
+						mb();
+						isync();

And this does not clear PMAO, because we had no pending PMI, but we still
have the pending PMAO exception. The only difference was that MSR[EE]
happened to be disabled when the PMC overflowed, so no pending PMI was
recorded; otherwise everything is the same, so I wonder why it's not
subject to the same problem?

Thanks,
Nick