Re: [PATCH trivial-v2] RCU: Call touch_nmi_watchdog() while printing stall warnings

Paul E. McKenney Tue, 09 Jan 2018 12:06:12 -0800

On Tue, Jan 09, 2018 at 10:52:22AM -0800, Tejun Heo wrote:
> >From c1d1ae75ba4bde7c833560248b3f561085fa850e Mon Sep 17 00:00:00 2001
> From: Tejun Heo <[email protected]>
> Date: Tue, 9 Jan 2018 10:38:17 -0800
> Subject: [PATCH] RCU: Call touch_nmi_watchdog() while printing stall warnings
> 
> When RCU stall warning triggers, it can print out a lot of messages
> while holding spinlocks.  If the console device is slow (e.g. an
> actual or IPMI serial console), it may end up triggering NMI hard
> lockup watchdog like the following.


Queued for testing and review, thank you!

                                                        Thanx, Paul

> *** CPU printking while holding RCU spinlock
> 
>   PID: 4149739  TASK: ffff881a46baa880  CPU: 13  COMMAND: "CPUThreadPool8"
>    #0 [ffff881fff945e48] crash_nmi_callback at ffffffff8103f7d0
>    #1 [ffff881fff945e58] nmi_handle at ffffffff81020653
>    #2 [ffff881fff945eb0] default_do_nmi at ffffffff81020c36
>    #3 [ffff881fff945ed0] do_nmi at ffffffff81020d32
>    #4 [ffff881fff945ef0] end_repeat_nmi at ffffffff81956a7e
>       [exception RIP: io_serial_in+21]
>       RIP: ffffffff81630e55  RSP: ffff881fff943b88  RFLAGS: 00000002
>       RAX: 000000000000ca00  RBX: ffffffff8230e188  RCX: 0000000000000000
>       RDX: 00000000000002fd  RSI: 0000000000000005  RDI: ffffffff8230e188
>       RBP: ffff881fff943bb0   R8: 0000000000000000   R9: ffffffff820cb3c4
>       R10: 0000000000000019  R11: 0000000000002000  R12: 00000000000026e1
>       R13: 0000000000000020  R14: ffffffff820cd398  R15: 0000000000000035
>       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
>   --- <NMI exception stack> ---
>    #5 [ffff881fff943b88] io_serial_in at ffffffff81630e55
>    #6 [ffff881fff943b90] wait_for_xmitr at ffffffff8163175c
>    #7 [ffff881fff943bb8] serial8250_console_putchar at ffffffff816317dc
>    #8 [ffff881fff943bd8] uart_console_write at ffffffff8162ac00
>    #9 [ffff881fff943c08] serial8250_console_write at ffffffff81634691
>   #10 [ffff881fff943c80] univ8250_console_write at ffffffff8162f7c2
>   #11 [ffff881fff943c90] console_unlock at ffffffff810dfc55
>   #12 [ffff881fff943cf0] vprintk_emit at ffffffff810dffb5
>   #13 [ffff881fff943d50] vprintk_default at ffffffff810e01bf
>   #14 [ffff881fff943d60] vprintk_func at ffffffff810e1127
>   #15 [ffff881fff943d70] printk at ffffffff8119a8a4
>   #16 [ffff881fff943dd0] print_cpu_stall_info at ffffffff810eb78c
>   #17 [ffff881fff943e88] rcu_check_callbacks at ffffffff810ef133
>   #18 [ffff881fff943ee8] update_process_times at ffffffff810f3497
>   #19 [ffff881fff943f10] tick_sched_timer at ffffffff81103037
>   #20 [ffff881fff943f38] __hrtimer_run_queues at ffffffff810f3f38
>   #21 [ffff881fff943f88] hrtimer_interrupt at ffffffff810f442b
> 
> *** CPU triggering the hardlockup watchdog
> 
>   PID: 4149709  TASK: ffff88010f88c380  CPU: 26  COMMAND: "CPUThreadPool35"
>    #0 [ffff883fff1059d0] machine_kexec at ffffffff8104a874
>    #1 [ffff883fff105a30] __crash_kexec at ffffffff811116cc
>    #2 [ffff883fff105af0] __crash_kexec at ffffffff81111795
>    #3 [ffff883fff105b08] panic at ffffffff8119a6ae
>    #4 [ffff883fff105b98] watchdog_overflow_callback at ffffffff81135dbd
>    #5 [ffff883fff105bb0] __perf_event_overflow at ffffffff81186866
>    #6 [ffff883fff105be8] perf_event_overflow at ffffffff81192bc4
>    #7 [ffff883fff105bf8] intel_pmu_handle_irq at ffffffff8100b265
>    #8 [ffff883fff105df8] perf_event_nmi_handler at ffffffff8100489f
>    #9 [ffff883fff105e58] nmi_handle at ffffffff81020653
>   #10 [ffff883fff105eb0] default_do_nmi at ffffffff81020b94
>   #11 [ffff883fff105ed0] do_nmi at ffffffff81020d32
>   #12 [ffff883fff105ef0] end_repeat_nmi at ffffffff81956a7e
>       [exception RIP: queued_spin_lock_slowpath+248]
>       RIP: ffffffff810da958  RSP: ffff883fff103e68  RFLAGS: 00000046
>       RAX: 0000000000000000  RBX: 0000000000000046  RCX: 00000000006d0000
>       RDX: ffff883fff49a950  RSI: 0000000000d10101  RDI: ffffffff81e54300
>       RBP: ffff883fff103e80   R8: ffff883fff11a950   R9: 0000000000000000
>       R10: 000000000e5873ba  R11: 000000000000010f  R12: ffffffff81e54300
>       R13: 0000000000000000  R14: ffff88010f88c380  R15: ffffffff81e54300
>       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>   --- <NMI exception stack> ---
>   #13 [ffff883fff103e68] queued_spin_lock_slowpath at ffffffff810da958
>   #14 [ffff883fff103e70] _raw_spin_lock_irqsave at ffffffff8195550b
>   #15 [ffff883fff103e88] rcu_check_callbacks at ffffffff810eed18
>   #16 [ffff883fff103ee8] update_process_times at ffffffff810f3497
>   #17 [ffff883fff103f10] tick_sched_timer at ffffffff81103037
>   #18 [ffff883fff103f38] __hrtimer_run_queues at ffffffff810f3f38
>   #19 [ffff883fff103f88] hrtimer_interrupt at ffffffff810f442b
>   --- <IRQ stack> ---
> 
> Avoid spuriously triggering NMI hardlockup watchdog by touching it
> from the print functions.  show_state_filter() shares the same problem
> and solution.
> 
> v2: Relocate the comment to where it belongs.
> 
> Signed-off-by: Tejun Heo <[email protected]>
> ---
>  kernel/rcu/tree_plugin.h | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index db85ca3..bf05bf9 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -561,8 +561,14 @@ static void rcu_print_detail_task_stall_rnp(struct 
> rcu_node *rnp)
>       }
>       t = list_entry(rnp->gp_tasks->prev,
>                      struct task_struct, rcu_node_entry);
> -     list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry)
> +     list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry) {
> +             /*
> +              * We could be printing a lot while holding a spinlock.
> +              * Avoid triggering hard lockup.
> +              */
> +             touch_nmi_watchdog();
>               sched_show_task(t);
> +     }
>       raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>  }
> 
> @@ -1678,6 +1684,12 @@ static void print_cpu_stall_info(struct rcu_state 
> *rsp, int cpu)
>       char *ticks_title;
>       unsigned long ticks_value;
> 
> +     /*
> +      * We could be printing a lot while holding a spinlock.  Avoid
> +      * triggering hard lockup.
> +      */
> +     touch_nmi_watchdog();
> +
>       if (rsp->gpnum == rdp->gpnum) {
>               ticks_title = "ticks this GP";
>               ticks_value = rdp->ticks_this_gp;
> -- 
> 2.9.5
>

Re: [PATCH trivial-v2] RCU: Call touch_nmi_watchdog() while printing stall warnings

Reply via email to