On Wed 2015-06-10 16:31:55, Petr Mladek wrote:
> On Wed 2015-06-10 14:55:09, Peter Zijlstra wrote:
> > Hi,
> > 
> > I just stumbled upon:
> > 
> >     a9edc8809328 ("x86/nmi: Perform a safe NMI stack trace on all CPUs")
> > 
> > Which is not dissimilar to what I've proposed in the past. Except it's
> > squirreled away in some far and dark corner of one arch.
> > 
> > Let's fix that.
> 
> It is a very interesting approach, but I see two possible races; see below.
> 
> > diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> > index dfd59d6bc6f0..a477e0766d2e 100644
> > --- a/include/linux/hardirq.h
> > +++ b/include/linux/hardirq.h
> > @@ -67,10 +67,12 @@ extern void irq_exit(void);
> >             preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET); \
> >             rcu_nmi_enter();                                \
> >             trace_hardirq_enter();                          \
> > +           printk_nmi_enter();                             \
> >     } while (0)
> >  
> >  #define nmi_exit()                                         \
> >     do {                                                    \
> > +           printk_nmi_exit();                              \
> >             trace_hardirq_exit();                           \
> >             rcu_nmi_exit();                                 \
> >             BUG_ON(!in_nmi());                              \
> > diff --git a/include/linux/percpu.h b/include/linux/percpu.h
> > index caebf2a758dc..04c68b9f56f8 100644
> > --- a/include/linux/percpu.h
> > +++ b/include/linux/percpu.h
> > @@ -135,7 +135,4 @@ extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
> >     (typeof(type) __percpu *)__alloc_percpu(sizeof(type),           \
> >                                             __alignof__(type))
> >  
> > -/* To avoid include hell, as printk can not declare this, we declare it here */
> > -DECLARE_PER_CPU(printk_func_t, printk_func);
> > -
> >  #endif /* __LINUX_PERCPU_H */
> > diff --git a/init/main.c b/init/main.c
> > index 2115055faeac..4a28accaaa98 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -586,6 +586,7 @@ asmlinkage __visible void __init start_kernel(void)
> >     timekeeping_init();
> >     time_init();
> >     sched_clock_postinit();
> > +   printk_init();
> >     perf_event_init();
> >     profile_init();
> >     call_function_init();
> > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > index c099b082cd02..0149c41be95b 100644
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > @@ -1821,13 +1821,125 @@ int vprintk_default(const char *fmt, va_list args)
> >  }
> >  EXPORT_SYMBOL_GPL(vprintk_default);
> >  
> > +#ifdef CONFIG_PRINTK_NMI
> > +
> > +typedef int(*printk_func_t)(const char *fmt, va_list args);
> >  /*
> >   * This allows printk to be diverted to another function per cpu.
> >   * This is useful for calling printk functions from within NMI
> >   * without worrying about race conditions that can lock up the
> >   * box.
> >   */
> > -DEFINE_PER_CPU(printk_func_t, printk_func) = vprintk_default;
> > +static DEFINE_PER_CPU(printk_func_t, printk_func) = vprintk_default;
> > +
> > +#include <linux/seq_buf.h>
> > +
> > +struct nmi_seq_buf {
> > +   struct seq_buf          seq;
> > +   struct irq_work         work;
> > +   unsigned char           buffer[PAGE_SIZE -
> > +                                   sizeof(struct seq_buf) -
> > +                                   sizeof(struct irq_work)];
> > +};
> > +
> > +/* Safe printing in NMI context */
> > +static DEFINE_PER_CPU(struct nmi_seq_buf, nmi_print_seq);
> > +
> > +static void print_seq_line(struct nmi_seq_buf *s, int start, int end)
> > +{
> > +   const char *buf = s->buffer + start;
> > +
> > +   printk("%.*s", (end - start) + 1, buf);
> > +}
> > +
> > +static void __printk_nmi_flush(struct irq_work *work)
> > +{
> > +   struct nmi_seq_buf *s = container_of(work, struct nmi_seq_buf, work);
> > +   int len, last_i = 0, i = 0;
> > +
> > +again:
> > +   len = seq_buf_used(&s->seq);
> > +   if (!len)
> > +           return;
> > +
> > +   /* Print line by line. */
> > +   for (; i < len; i++) {
> > +           if (s->buffer[i] == '\n') {
> > +                   print_seq_line(s, last_i, i);
> > +                   last_i = i + 1;
> > +           }
> > +   }
> > +   /* Check if there was a partial line. */
> > +   if (last_i < len) {
> > +           print_seq_line(s, last_i, len - 1);
> > +           pr_cont("\n");
> > +   }
> > +
> > +   /*
> > +    * Another NMI could have come in while we were printing
> > +    * check if nothing has been added to the buffer.
> > +    */
> > +   if (cmpxchg_local(&s->seq.len, len, 0) != len)
> > +           goto again;
> 
> If another NMI comes in at this point, it will start filling the buffer
> from the beginning. If it is fast enough, it might overwrite the text
> that we are printing above.
> 
> Note that it is easy for it to be faster because the normal printk() is
> more complex and might wait for the console.

Also note that show_regs() calls many separate printk()s and the irq_work
is scheduled by the first one => it is quite likely that some
backtraces will get messed up.


Another problem is that __printk_nmi_flush() is per-CPU => more
instances might be running in parallel. I haven't tested this, but
I guess that it will mix backtraces from different CPUs in
the main ring buffer.

Note that the backtraces used to be serialized via
static arch_spinlock_t lock = __ARCH_SPIN_LOCK_UNLOCKED
in the past. See the commit
a9edc8809328 ("x86/nmi: Perform a safe NMI stack trace on all CPUs")


A solution would be to block __printk_nmi_flush() until
printk_nmi_exit() is called, using some bit and cmpxchg.

Also, we could add one more bit so that vprintk_nmi() drops messages
until the buffer is printed, to avoid any mess.
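
Something like this is what I mean (only a sketch, not tested; the flag
names are made up and the atomic bitops stand in for an open-coded
cmpxchg):

enum {
        PRINTK_NMI_IN_NMI,      /* set in printk_nmi_enter(), cleared in printk_nmi_exit() */
        PRINTK_NMI_FLUSHING,    /* set while __printk_nmi_flush() drains the buffer */
};

static DEFINE_PER_CPU(unsigned long, printk_nmi_flags);

static int vprintk_nmi(const char *fmt, va_list args)
{
        struct nmi_seq_buf *s = this_cpu_ptr(&nmi_print_seq);
        unsigned int len = seq_buf_used(&s->seq);

        /* Drop messages while the buffer is being printed. */
        if (test_bit(PRINTK_NMI_FLUSHING, this_cpu_ptr(&printk_nmi_flags)))
                return 0;

        irq_work_queue(&s->work);
        seq_buf_vprintf(&s->seq, fmt, args);
        return seq_buf_used(&s->seq) - len;
}

static void __printk_nmi_flush(struct irq_work *work)
{
        struct nmi_seq_buf *s = container_of(work, struct nmi_seq_buf, work);
        unsigned long *flags = this_cpu_ptr(&printk_nmi_flags);

        /* Do not touch the buffer until printk_nmi_exit() has been called. */
        if (test_bit(PRINTK_NMI_IN_NMI, flags)) {
                irq_work_queue(&s->work);       /* retry later */
                return;
        }

        set_bit(PRINTK_NMI_FLUSHING, flags);
        /* ... print line by line and reset s->seq.len as in the patch ... */
        clear_bit(PRINTK_NMI_FLUSHING, flags);
}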

Huh, it is getting more and more complicated.


Best Regards,
Petr

> > +}
> > +
> > +void printk_init(void)
> > +{
> > +   int cpu;
> > +
> > +   for_each_possible_cpu(cpu) {
> > +           struct nmi_seq_buf *s = &per_cpu(nmi_print_seq, cpu);
> > +
> > +           init_irq_work(&s->work, __printk_nmi_flush);
> > +           seq_buf_init(&s->seq, s->buffer, sizeof(s->buffer));
> > +   }
> > +}
> > +
> > +/*
> > + * It is not safe to call printk() directly from NMI handlers.
> > + * It may be fine if the NMI detected a lock up and we have no choice
> > + * but to do so, but doing a NMI on all other CPUs to get a back trace
> > + * can be done with a sysrq-l. We don't want that to lock up, which
> > + * can happen if the NMI interrupts a printk in progress.
> > + *
> > + * Instead, we redirect the vprintk() to this nmi_vprintk() that writes
> > + * the content into a per cpu seq_buf buffer. Then when the NMIs are
> > + * all done, we can safely dump the contents of the seq_buf to a printk()
> > + * from a non NMI context.
> > + */
> > +static int vprintk_nmi(const char *fmt, va_list args)
> > +{
> > +   struct nmi_seq_buf *s = this_cpu_ptr(&nmi_print_seq);
> > +   unsigned int len = seq_buf_used(&s->seq);
> > +
> > +   irq_work_queue(&s->work);
> > +   seq_buf_vprintf(&s->seq, fmt, args);
> 
> This is more critical. seq_buf_vprintf() has the following code:
> 
>       len = vsnprintf(s->buffer + s->len, s->size - s->len, fmt, args);
> 
> The race might look like:
> 
> CPU0                                  CPU1
> 
>                                       __printk_nmi_flush()
>                                       ...
> 
> seq_buf_vprintf()
> 
>    s->buffer + s->len
> 
>                                       cmpxchg_local(&s->seq.len, len, 0)
> 
>    s->size - s->len
> 
> => that seq_buf_vprintf() will start writing at the end of the seq_buf
>   but will think that the entire buffer is available, which might
>   cause a buffer overflow
> 
> 
> Well, it is rather theoretical. s->len will most likely be read only
> once (compiler optimization). Anyway, we must make sure that
> all the used seq_buf operations are safe against such races.
> It is not obvious and might be prone to regressions.
> 
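For illustration, this is the kind of hardening I mean on the seq_buf
side (only a sketch; nmi_seq_buf_vprintf() is a made-up name, not a real
seq_buf API, and READ_ONCE()/WRITE_ONCE() make sure the length is read
and updated exactly once):

static int nmi_seq_buf_vprintf(struct seq_buf *s, const char *fmt,
                               va_list args)
{
        size_t len = READ_ONCE(s->len); /* single snapshot of the length */
        int ret;

        if (len >= s->size)
                return -1;

        ret = vsnprintf(s->buffer + len, s->size - len, fmt, args);
        if (len + ret >= s->size)
                return -1;      /* does not fit: drop it, keep the old length */

        WRITE_ONCE(s->len, len + ret);
        return 0;
}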
> To be honest, I am not familiar with cmpxchg_local(), but I think that
> it can't make much of a difference. The value has to be synced to the
> other CPUs eventually.
> 
> 
> Note that Steven's solution is safe because the prints are serialized via
> @backtrace_flag in arch_trigger_all_cpu_backtrace().
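
(Roughly, heavily simplified and from memory; the point is that the
whole trigger/collect/flush sequence is guarded by a single bit, so only
one dump can be in flight at a time. See the commit for the real code.)

static unsigned long backtrace_flag;

void arch_trigger_all_cpu_backtrace(bool include_self)
{
        int cpu;

        if (test_and_set_bit(0, &backtrace_flag))
                return;         /* a dump is already in progress */

        /* ... send the NMI IPIs and wait for the handlers to finish ... */

        /* Flush the per-CPU seq_bufs one by one, from this context. */
        for_each_possible_cpu(cpu) {
                /* print_seq_line() over per_cpu(nmi_print_seq, cpu), then reset it */
        }

        clear_bit(0, &backtrace_flag);
        smp_mb__after_atomic();
}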
> 
> 
> Best Regards,
> Petr