On Mon, Apr 08, 2019 at 11:50:31AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 08, 2019 at 10:22:29AM +0200, Peter Zijlstra wrote:
> > On Mon, Apr 08, 2019 at 09:12:28AM +0200, Thomas-Mich Richter wrote:
> > > very good news, your fix ran over the weekend without any hit!!!
> > >
> > > Thanks very much for your help. Do you submit this patch to the kernel
> > > mailing list?
> >
> > Most excellent, let me go write a Changelog.
>
> Hi Thomas, find below.
>
> Sadly, while writing the Changelog I ended up with a 'completely'
> different patch again, could I bother you to test this one too?
>
> ---
> Subject: perf: Fix perf_event_disable_inatomic()
> From: Peter Zijlstra <pet...@infradead.org>
> Date: Thu, 4 Apr 2019 15:03:00 +0200
>
> Thomas-Mich Richter reported he triggered a WARN from event_function_local()
> on his s390. The problem boils down to:
>
>         CPU-A                           CPU-B
>
>         perf_event_overflow()
>           perf_event_disable_inatomic()
>             @pending_disable = 1
>             irq_work_queue();
>
>         sched-out
>           event_sched_out()
>             @pending_disable = 0
>
>                                         sched-in
>                                         perf_event_overflow()
>                                           perf_event_disable_inatomic()
>                                             @pending_disable = 1;
>                                             irq_work_queue(); // FAILS
>
>         irq_work_run()
>           perf_pending_event()
>             if (@pending_disable)
>               perf_event_disable_local(); // WHOOPS
>
> The problem exists in generic code, but s390 is particularly sensitive
> because it doesn't implement arch_irq_work_raise(), nor does it call
> irq_work_run() from its PMU interrupt handler (nor would that be
> sufficient in this case, because s390 also generates
> perf_event_overflow() from pmu::stop). Add to that the fact that s390
> is a virtual architecture and (virtual) CPU-A can stall long enough
> for the above race to happen, even if it would self-IPI.
>
> Adding an irq_work_syn() to event_sched_in() would work for all hardware
> PMUs that properly use irq_work_run() but fails for software PMUs.
Typo: s/syn/sync/

> Instead encode the CPU number in @pending_disable, such that we can
> tell which CPU requested the disable. This then allows us to detect
> the above scenario and even redirect the IPI to make up for the failed
> queue.
>
> Cc: Heiko Carstens <heiko.carst...@de.ibm.com>
> Cc: Kees Cook <keesc...@chromium.org>
> Cc: Hendrik Brueckner <brueck...@linux.ibm.com>
> Cc: a...@redhat.com
> Cc: Martin Schwidefsky <schwidef...@de.ibm.com>
> Cc: Mark Rutland <mark.rutl...@arm.com>
> Cc: Jiri Olsa <jo...@redhat.com>
> Reported-by: Thomas-Mich Richter <tmri...@linux.ibm.com>
> Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>

I can't think of a nicer way of handling this, so FWIW:

Acked-by: Mark Rutland <mark.rutl...@arm.com>

Mark.

> ---
>  kernel/events/core.c | 52 ++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 43 insertions(+), 9 deletions(-)
>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2009,8 +2009,8 @@ event_sched_out(struct perf_event *event
>  	event->pmu->del(event, 0);
>  	event->oncpu = -1;
>
> -	if (event->pending_disable) {
> -		event->pending_disable = 0;
> +	if (READ_ONCE(event->pending_disable) >= 0) {
> +		WRITE_ONCE(event->pending_disable, -1);
>  		state = PERF_EVENT_STATE_OFF;
>  	}
>  	perf_event_set_state(event, state);
> @@ -2198,7 +2198,8 @@ EXPORT_SYMBOL_GPL(perf_event_disable);
>
>  void perf_event_disable_inatomic(struct perf_event *event)
>  {
> -	event->pending_disable = 1;
> +	WRITE_ONCE(event->pending_disable, smp_processor_id());
> +	/* can fail, see perf_pending_event_disable() */
>  	irq_work_queue(&event->pending);
>  }
>
> @@ -5810,10 +5811,45 @@ void perf_event_wakeup(struct perf_event
>  	}
>  }
>
> +static void perf_pending_event_disable(struct perf_event *event)
> +{
> +	int cpu = READ_ONCE(event->pending_disable);
> +
> +	if (cpu < 0)
> +		return;
> +
> +	if (cpu == smp_processor_id()) {
> +		WRITE_ONCE(event->pending_disable, -1);
> +		perf_event_disable_local(event);
> +		return;
> +	}
> +
> +	/*
> +	 *  CPU-A				CPU-B
> +	 *
> +	 *  perf_event_disable_inatomic()
> +	 *    @pending_disable = CPU-A;
> +	 *    irq_work_queue();
> +	 *
> +	 *  sched-out
> +	 *    @pending_disable = -1;
> +	 *
> +	 *				sched-in
> +	 *				perf_event_disable_inatomic()
> +	 *				  @pending_disable = CPU-B;
> +	 *				  irq_work_queue(); // FAILS
> +	 *
> +	 *  irq_work_run()
> +	 *    perf_pending_event()
> +	 *
> +	 * But the event runs on CPU-B and wants disabling there.
> +	 */
> +	irq_work_queue_on(&event->pending, cpu);
> +}
> +
>  static void perf_pending_event(struct irq_work *entry)
>  {
> -	struct perf_event *event = container_of(entry,
> -						struct perf_event, pending);
> +	struct perf_event *event = container_of(entry, struct perf_event, pending);
>  	int rctx;
>
>  	rctx = perf_swevent_get_recursion_context();
> @@ -5822,10 +5858,7 @@ static void perf_pending_event(struct ir
>  	 * and we won't recurse 'further'.
>  	 */
>
> -	if (event->pending_disable) {
> -		event->pending_disable = 0;
> -		perf_event_disable_local(event);
> -	}
> +	perf_pending_event_disable(event);
>
>  	if (event->pending_wakeup) {
>  		event->pending_wakeup = 0;
> @@ -10236,6 +10269,7 @@ perf_event_alloc(struct perf_event_attr
>
>
>  	init_waitqueue_head(&event->waitq);
> +	event->pending_disable = -1;
>  	init_irq_work(&event->pending, perf_pending_event);
>
>  	mutex_init(&event->mmap_mutex);