Gautham R Shenoy wrote on April 4, 2019 9:19 pm:
> Hello Nicholas,
>
> On Tue, Apr 2, 2019 at 4:57 PM Nicholas Piggin <npig...@gmail.com> wrote:
>>
>> Using a jiffies timer creates a dependency on the tick_do_timer_cpu
>> incrementing jiffies. If that CPU has locked up and jiffies is not
>> incrementing, the watchdog heartbeat timer for all CPUs stops and
>> creates false positives and confusing warnings on local CPUs, and
>> also causes the SMP detector to stop, so the root cause is never
>> detected.
>>
>> Fix this by using hrtimer based timers for the watchdog heartbeat,
>> like the generic kernel hardlockup detector.
>>
>> Reported-by: Ravikumar Bangoria <ravi.bango...@in.ibm.com>
>> Signed-off-by: Nicholas Piggin <npig...@gmail.com>
>
> [..snip..]
>
>> @@ -325,19 +325,21 @@ EXPORT_SYMBOL(arch_touch_nmi_watchdog);
>>
>>  static void start_watchdog_timer_on(unsigned int cpu)
>>  {
>> -	struct timer_list *t = per_cpu_ptr(&wd_timer, cpu);
>> +	struct hrtimer *hrtimer = this_cpu_ptr(&wd_hrtimer);
>
> This function can be called during the initialization via
>
> watchdog_nmi_start -->
>    for_each_online_cpu(cpu)
>       start_wd_on_cpu(cpu) -->
>          start_watchdog_timer_on(cpu)
>
> Thus, it is not guaranteed that we are always calling
> start_watchdog_timer_on() from the CPU where
> we want to start the watchdog timer.
>
> Thus, should we be calling this function from start_wd_on_cpu() via an
> smp_call_function_single() ?
Good catch, yes, I think we need that change (like kernel/watchdog.c).
I'll resend.

Thanks,
Nick