On Wed, Mar 01 2023 at 15:47, Ricardo Neri wrote:
> + * An HPET channel is reserved for the detector. The channel issues an NMI to
> + * one of the CPUs in @watchdog_allowed_mask. This CPU monitors itself for
> + * hardlockups and sends an NMI IPI to the rest of the CPUs in the system.
> + *
> + * The detector uses IPI shorthands. Thus, all CPUs in the system get the NMI
> + * (offline CPUs also get the NMI but they "ignore" it). A cpumask is used to
> + * specify whether a CPU must check for hardlockups.
> + *
> + * The NMI also disturbs isolated CPUs. The detector fails to initialize if
> + * tick_nohz_full is enabled.

This can be completely avoided. See below.

> +/**
> + * hardlockup_detector_nmi_handler() - NMI Interrupt handler
> + * @type:    Type of NMI handler; not used.
> + * @regs:    Register values as seen when the NMI was asserted
> + *
> + * Check if our HPET channel caused the NMI. If yes, inspect for lockups by
> + * issuing an IPI to the rest of the CPUs. Also, kick the timer if it is
> + * non-periodic.
> + *
> + * Returns:
> + * NMI_DONE if the HPET timer did not cause the interrupt. NMI_HANDLED
> + * otherwise.
> + */
> +static int hardlockup_detector_nmi_handler(unsigned int type,
> +                                        struct pt_regs *regs)
> +{
> +     struct hpet_hld_data *hdata = hld_data;
> +     int cpu;
> +
> +     /*
> +      * The CPU handling the HPET NMI will land here and trigger the
> +      * inspection of hardlockups in the rest of the monitored
> +      * CPUs.
> +      */
> +     if (is_hpet_hld_interrupt(hdata)) {
> +             /*
> +              * Kick the timer first. If the HPET channel is periodic, it
> +              * helps to reduce the delta between the expected TSC value and
> +              * its actual value the next time the HPET channel fires.
> +              */
> +             kick_timer(hdata, !(hdata->has_periodic));
> +
> +             if (cpumask_weight(hld_data->monitored_cpumask) > 1) {
> +                     /*
> +                      * Since we cannot know the source of an NMI, the best
> +                      * we can do is to use a flag to indicate to all online
> +                      * CPUs that they will get an NMI and that the source of
> +                      * that NMI is the hardlockup detector. Offline CPUs
> +                      * also receive the NMI but they ignore it.
> +                      */
> +                     cpumask_copy(hld_data->inspect_cpumask,
> +                                  cpu_online_mask);
> +
> +                     /* If we are here, IPI shorthands are enabled. */
> +                     apic->send_IPI_allbutself(NMI_VECTOR);

This should be done differently so that the isolation stuff is
respected.

The HPET NMI is directed to one of the CPUs in the 'to be monitored'
mask. That CPU is known when this is [re-]configured. So this can be
changed to:

        seq = atomic_read(&hld.nmi_seq);
        if (seq == __this_cpu_read(hpet_wd_seq))
                return;

        cpu = raw_smp_processor_id();
        if (cpu == hld.nmi_target) {
                if (!hpet_hld_interrupt(...))
                        return;
        }
                
        next_cpu = cpumask_next_wrap(cpu, monitored_mask);
        if (next_cpu < nr_cpu_ids) {
                if (cpu == hld.nmi_target)
                      atomic_inc(&hld.nmi_seq);
                apic->send_IPI(next_cpu, NMI_VECTOR);
        }
        __this_cpu_write(hpet_wd_seq, seq);

        watchdog_hardlockup_check();

which has the following downside:

      It requires x2APIC because xAPIC has this two stage ICR dance,
      which is not NMI safe and only to be used if there is no other
      choice, i.e. backtrace trigger from NMI context.

but the following upsides:

    1) The fast path quick check is a trivial memory read/cmp

       It is always dropping through for the NMI target CPUn as that has
       to check for the timing window. 

       For the other CPUs it only becomes true when the NMI target CPU
       incremented the sequence counter and sent a NMI to the next one.

    2) Propagating the NMI through the monitored mask avoids the whole
       limitation vs. the shorthand setup, which means the watchdog can
       be enabled early on.

       This also allows to use it in isolation scenarios.

Restricting this to x2APIC is in my opinion reasonable as the people who
crave for the lost PMC are surely not those who use 8 CPU machines from
two decades ago.

Thanks,

        tglx

Reply via email to