Excerpts from Laurent Dufour's message of October 27, 2021 2:27 am:
> The wd_smp_cpus_pending CPU mask should be accessed under the protection of
> the __wd_smp_lock.
> 
> This prevents false alarm to be raised when the system is under an heavy
> stress. This has been seen while doing LPM on large system with a big
> workload.
> 
> Signed-off-by: Laurent Dufour <lduf...@linux.ibm.com>
> ---
>  arch/powerpc/kernel/watchdog.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
> index bc7411327066..8d7a1a86187e 100644
> --- a/arch/powerpc/kernel/watchdog.c
> +++ b/arch/powerpc/kernel/watchdog.c
> @@ -203,12 +203,13 @@ static void watchdog_smp_panic(int cpu, u64 tb)
>  
>  static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
>  {
> +     unsigned long flags;
> +
> +     wd_smp_lock(&flags);
>       if (!cpumask_test_cpu(cpu, &wd_smp_cpus_pending)) {
>               if (unlikely(cpumask_test_cpu(cpu, &wd_smp_cpus_stuck))) {
>                       struct pt_regs *regs = get_irq_regs();
> -                     unsigned long flags;
>  
> -                     wd_smp_lock(&flags);
>                       cpumask_clear_cpu(cpu, &wd_smp_cpus_stuck);
>                       wd_smp_unlock(&flags);
>  
> @@ -219,22 +220,23 @@ static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
>                               show_regs(regs);
>                       else
>                               dump_stack();
> +                     return;
>               }
> +
> +             wd_smp_unlock(&flags);
>               return;
>       }
> +
>       cpumask_clear_cpu(cpu, &wd_smp_cpus_pending);
>       if (cpumask_empty(&wd_smp_cpus_pending)) {
> -             unsigned long flags;
> -
> -             wd_smp_lock(&flags);
>               if (cpumask_empty(&wd_smp_cpus_pending)) {
>                       wd_smp_last_reset_tb = tb;
>                       cpumask_andnot(&wd_smp_cpus_pending,
>                                       &wd_cpus_enabled,
>                                       &wd_smp_cpus_stuck);
>               }
> -             wd_smp_unlock(&flags);
>       }
> +     wd_smp_unlock(&flags);

Hmm. I wanted to avoid the lock here because all CPUs will do it on 
every heartbeat timer.

Although maybe when you look at the cacheline transfers required it 
doesn't matter much (and the timer is only once every few seconds).

I guess it's always better to aovid lock free craziness unless it's
required... so where is the race coming from? I guess 2 CPUs can
clear wd_smp_cpus_pending but not see one another's update so they
both miss cpumask_empty == true! Good catch.

We shouldn't strictly need the wd_smp_lock for the first test, but
that should be an uncommon case, so okay.

Can we clear wd_smp_cpus_pending with a non-atomic operation now?

Thanks,
Nick

Reply via email to