Re: [PATCH v6 15/29] x86/hpet: Add helper function hpet_set_comparator_periodic()

2022-05-17 Thread Ricardo Neri
On Sat, May 14, 2022 at 10:17:38AM +0200, Thomas Gleixner wrote:
> On Fri, May 13 2022 at 14:19, Ricardo Neri wrote:
> > On Fri, May 06, 2022 at 11:41:13PM +0200, Thomas Gleixner wrote:
> >> The argument about not bloating the code
> >> with an "obvious???" function which is quite small is slightly beyond my
> >> comprehension level.
> >
> > That obvious function would look like this:
> >
> > void hpet_set_comparator_one_shot(int channel, u32 delta)
> > {
> > 	u32 count;
> >
> > 	count = hpet_readl(HPET_COUNTER);
> > 	count += delta;
> > 	hpet_writel(count, HPET_Tn_CMP(channel));
> > }
> 
> This function only works reliably when the delta is large. See
> hpet_clkevt_set_next_event().

That is a good point. One more reason to not have a
hpet_set_comparator_one_shot(), IMO.
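
For reference, the clockevent code you point to guards against exactly this
race. A minimal sketch of the pattern (modeled on
hpet_clkevt_set_next_event(); the helper name here is hypothetical):

	static int hpet_set_next_event_checked(int channel, u32 delta)
	{
		u32 cnt;
		s32 res;

		cnt = hpet_readl(HPET_COUNTER);
		cnt += delta;
		hpet_writel(cnt, HPET_Tn_CMP(channel));

		/*
		 * If the main counter has already raced past the new
		 * comparator value, the interrupt is lost and the caller
		 * must retry. This check is what a trivial one-shot helper
		 * would be missing for small deltas.
		 */
		res = (s32)(cnt - hpet_readl(HPET_COUNTER));
		return res < HPET_MIN_CYCLES ? -ETIME : 0;
	}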

Thanks and BR,
Ricardo


Re: [PATCH v6 28/29] x86/tsc: Restart NMI watchdog after refining tsc_khz

2022-05-17 Thread Ricardo Neri
On Tue, May 10, 2022 at 01:44:05PM +0200, Thomas Gleixner wrote:
> On Tue, May 10 2022 at 21:16, Nicholas Piggin wrote:
> > Excerpts from Ricardo Neri's message of May 6, 2022 10:00 am:
> >> +  /*
> >> +   * If in use, the HPET hardlockup detector relies on tsc_khz.
> >> +   * Reconfigure it to make use of the refined tsc_khz.
> >> +   */
> >> +  lockup_detector_reconfigure();
> >
> > I don't know if the API is conceptually good.
> >
> > You change something that the lockup detector is currently using, 
> > *while* the detector is running asynchronously, and then reconfigure
> > it. What happens in the window? If this code is only used for small
> > adjustments maybe it does not really matter but in principle it's
> > a bad API to export.
> >
> > lockup_detector_reconfigure as an internal API is okay because it
> > reconfigures things while the watchdog is stopped [actually that
> > looks untrue for soft dog which uses watchdog_thresh in
> > is_softlockup(), but that should be fixed].
> >
> > You're the arch so you're allowed to stop the watchdog and configure
> > it, e.g., hardlockup_detector_perf_stop() is called in arch/.
> >
> > So you want to disable HPET watchdog if it was enabled, then update
> > wherever you're using tsc_khz, then re-enable.
> 
> The real question is whether making this refined tsc_khz value
> immediately effective matters at all. IMO, it does not because up to
> that point the watchdog was happily using the coarse calibrated value
> and the whole use TSC to assess whether the HPET fired mechanism is just
> a guestimate anyway. So what's the point of trying to guess 'more
> correct'.

On some of my test systems, I observed that the TSC value does not fall
within the expected error window the first time the HPET channel expires.

I inferred that the error window computed using the coarser tsc_khz was wrong,
and that recalculating it with the refined tsc_khz would correct it.

However, restarting the timer has the side effect of kicking the timer and,
therefore, pushing the first HPET NMI further into the future.

Perhaps kicking the HPET channel, rather than recomputing the error window,
is what corrected (or masked?) the problem.
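
For concreteness (my own back-of-the-envelope arithmetic, not a measurement):

	expected_tsc = tsc_at_nmi + watchdog_thresh * tsc_khz * 1000
	relative_err = |tsc_khz_coarse - tsc_khz_refined| / tsc_khz_refined

If relative_err exceeds the fraction of tsc_delta used as the error window
(0.4% in patch 23), the first expiration falls outside the window and is
misclassified.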

I will investigate further and rework or drop this patch as needed.

Thanks and BR,
Ricardo


Re: [PATCH v6 28/29] x86/tsc: Restart NMI watchdog after refining tsc_khz

2022-05-17 Thread Ricardo Neri
On Tue, May 10, 2022 at 09:16:21PM +1000, Nicholas Piggin wrote:
> Excerpts from Ricardo Neri's message of May 6, 2022 10:00 am:
> > The HPET hardlockup detector relies on tsc_khz to estimate the value
> > that the TSC will have when its HPET channel fires. A refined tsc_khz
> > helps to estimate the expected TSC value more accurately.
> > 
> > Using the early value of tsc_khz may lead to a large error in the expected
> > TSC value. Restarting the NMI watchdog detector has the effect of kicking
> > its HPET channel and making use of the refined tsc_khz.
> > 
> > When the HPET hardlockup detector is not in use, restarting the NMI
> > watchdog is a noop.
> > 
> > Cc: Andi Kleen 
> > Cc: Stephane Eranian 
> > Cc: "Ravi V. Shankar" 
> > Cc: iommu@lists.linux-foundation.org
> > Cc: linuxppc-...@lists.ozlabs.org
> > Cc: x...@kernel.org
> > Signed-off-by: Ricardo Neri 
> > ---
> > Changes since v5:
> >  * Introduced this patch
> > 
> > Changes since v4
> >  * N/A
> > 
> > Changes since v3
> >  * N/A
> > 
> > Changes since v2:
> >  * N/A
> > 
> > Changes since v1:
> >  * N/A
> > ---
> >  arch/x86/kernel/tsc.c | 6 ++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index cafacb2e58cc..cc1843044d88 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -1386,6 +1386,12 @@ static void tsc_refine_calibration_work(struct 
> > work_struct *work)
> > /* Inform the TSC deadline clockevent devices about the recalibration */
> > lapic_update_tsc_freq();
> >  
> > +   /*
> > +* If in use, the HPET hardlockup detector relies on tsc_khz.
> > +* Reconfigure it to make use of the refined tsc_khz.
> > +*/
> > +   lockup_detector_reconfigure();
> 
> I don't know if the API is conceptually good.
> 
> You change something that the lockup detector is currently using, 
> *while* the detector is running asynchronously, and then reconfigure
> it. 

Yes, this is what happens.

> What happens in the window? If this code is only used for small
> adjustments maybe it does not really matter

Please see my comment

> but in principle it's a bad API to export.
> 
> lockup_detector_reconfigure as an internal API is okay because it
> reconfigures things while the watchdog is stopped

I see.

> [actually that  looks untrue for soft dog which uses watchdog_thresh in
> is_softlockup(), but that should be fixed].

Perhaps there should be a watchdog_thresh_user. When the user updates it,
the detector is stopped, watchdog_thresh is updated, and then the detector
is resumed.
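
Something along these lines (a rough sketch of the idea only; the function
name is hypothetical):

	static void watchdog_update_thresh(unsigned int thresh)
	{
		cpus_read_lock();
		watchdog_nmi_stop();		/* detectors quiesced */
		watchdog_thresh = thresh;	/* safe: nothing samples it now */
		watchdog_nmi_start();
		cpus_read_unlock();
	}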

> 
> You're the arch so you're allowed to stop the watchdog and configure
> it, e.g., hardlockup_detector_perf_stop() is called in arch/.

I had it like this, but it did not look right to me. You are right, however:
I can stop/restart the watchdog when needed.

Thanks and BR,
Ricardo


Re: [PATCH v6 21/29] x86/nmi: Add an NMI_WATCHDOG NMI handler category

2022-05-17 Thread Ricardo Neri
On Mon, May 09, 2022 at 03:59:40PM +0200, Thomas Gleixner wrote:
> On Thu, May 05 2022 at 17:00, Ricardo Neri wrote:
> > Add an NMI_WATCHDOG as a new category of NMI handler. This new category
> > is to be used with the HPET-based hardlockup detector. This detector
> > does not have a direct way of checking if the HPET timer is the source of
> > the NMI. Instead, it indirectly estimates it using the time-stamp counter.
> >
> > Therefore, we may have false positives in case another NMI occurs within
> > the estimated time window. For this reason, we want the handler of the
> > detector to be called after all the NMI_LOCAL handlers. A simple way
> > of achieving this is with a new NMI handler category.
> >
> > @@ -379,6 +385,10 @@ static noinstr void default_do_nmi(struct pt_regs 
> > *regs)
> > }
> > raw_spin_unlock(&nmi_reason_lock);
> >  
> > +   handled = nmi_handle(NMI_WATCHDOG, regs);
> > +   if (handled == NMI_HANDLED)
> > +   goto out;
> > +
> 
> How is this supposed to work reliably?
> 
> If perf is active and the HPET NMI and the perf NMI come in around the
> same time, then nmi_handle(LOCAL) can swallow the NMI and the watchdog
> won't be checked. Because MSI is strictly edge and the message is only
> sent once, this can result in a stale watchdog, no?

This is true. Instead, at the end of each NMI I should _also_ check whether
the TSC is within the expected window of the HPET NMI watchdog. In this way,
unrelated NMIs (e.g., a perf NMI) are handled and we don't miss the NMI from
the HPET channel.
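
Roughly like this (a sketch of the idea only, not code from the series; both
helpers are hypothetical):

	/* Tail of the NMI handling path: */
	handled = nmi_handle(NMI_LOCAL, regs);

	/*
	 * Always give the HPET watchdog a look, even when another handler
	 * (e.g., perf) claimed the NMI, so that a swallowed HPET NMI cannot
	 * leave the watchdog stale.
	 */
	if (tsc_within_hpet_hld_window())
		inspect_for_hardlockup(regs);

	if (handled)
		goto out;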

Thanks and BR,
Ricardo


Re: [PATCH v6 29/29] x86/tsc: Switch to perf-based hardlockup detector if TSC become unstable

2022-05-16 Thread Ricardo Neri
On Tue, May 10, 2022 at 10:14:00PM +1000, Nicholas Piggin wrote:
> Excerpts from Ricardo Neri's message of May 6, 2022 10:00 am:
> > The HPET-based hardlockup detector relies on the TSC to determine whether
> > an observed NMI interrupt originated from the HPET timer. Hence, this
> > detector can no longer be used with an unstable TSC.
> > 
> > In such a case, permanently stop the HPET-based hardlockup detector and
> > start the perf-based detector.
> > 
> > Cc: Andi Kleen 
> > Cc: Stephane Eranian 
> > Cc: "Ravi V. Shankar" 
> > Cc: iommu@lists.linux-foundation.org
> > Cc: linuxppc-...@lists.ozlabs.org
> > Cc: x...@kernel.org
> > Suggested-by: Thomas Gleixner 
> > Reviewed-by: Tony Luck 
> > Signed-off-by: Ricardo Neri 
> > ---
> > Changes since v5:
> >  * Relocated the declaration of hardlockup_detector_switch_to_perf() to
> >x86/nmi.h. It does not depend on HPET.
> >  * Removed function stub. The shim hardlockup detector is always for x86.
> > 
> > Changes since v4:
> >  * Added a stub version of hardlockup_detector_switch_to_perf() for
> >!CONFIG_HPET_TIMER. (lkp)
> >  * Reconfigure the whole lockup detector instead of unconditionally
> >starting the perf-based hardlockup detector.
> > 
> > Changes since v3:
> >  * None
> > 
> > Changes since v2:
> >  * Introduced this patch.
> > 
> > Changes since v1:
> >  * N/A
> > ---
> >  arch/x86/include/asm/nmi.h | 6 ++
> >  arch/x86/kernel/tsc.c  | 2 ++
> >  arch/x86/kernel/watchdog_hld.c | 6 ++
> >  3 files changed, 14 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
> > index 4a0d5b562c91..47752ff67d8b 100644
> > --- a/arch/x86/include/asm/nmi.h
> > +++ b/arch/x86/include/asm/nmi.h
> > @@ -63,4 +63,10 @@ void stop_nmi(void);
> >  void restart_nmi(void);
> >  void local_touch_nmi(void);
> >  
> > +#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR
> > +void hardlockup_detector_switch_to_perf(void);
> > +#else
> > +static inline void hardlockup_detector_switch_to_perf(void) { }
> > +#endif
> > +
> >  #endif /* _ASM_X86_NMI_H */
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index cc1843044d88..74772ffc79d1 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -1176,6 +1176,8 @@ void mark_tsc_unstable(char *reason)
> >  
> > clocksource_mark_unstable(&clocksource_tsc_early);
> > clocksource_mark_unstable(&clocksource_tsc);
> > +
> > +   hardlockup_detector_switch_to_perf();
> >  }
> >  
> >  EXPORT_SYMBOL_GPL(mark_tsc_unstable);
> > diff --git a/arch/x86/kernel/watchdog_hld.c b/arch/x86/kernel/watchdog_hld.c
> > index ef11f0af4ef5..7940977c6312 100644
> > --- a/arch/x86/kernel/watchdog_hld.c
> > +++ b/arch/x86/kernel/watchdog_hld.c
> > @@ -83,3 +83,9 @@ void watchdog_nmi_start(void)
> > if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET)
> > hardlockup_detector_hpet_start();
> >  }
> > +
> > +void hardlockup_detector_switch_to_perf(void)
> > +{
> > +   detector_type = X86_HARDLOCKUP_DETECTOR_PERF;
> 
> Another possible problem along the same lines here,
> isn't your watchdog still running at this point? And
> it uses detector_type in the switch.
> 
> > +   lockup_detector_reconfigure();
> 
> Actually the detector_type switch is used in some
> functions called by lockup_detector_reconfigure()
> e.g., watchdog_nmi_stop, so this seems buggy even
> without concurrent watchdog.

Yes, this is true. I missed this race.

> 
> Is this switching a good idea in general? The admin
> has asked for a non-standard option because they want
> more PMU counters available and now it eats a
> counter potentially causing a problem rather than
> detecting one.

Agreed. A very valid point.
> 
> I would rather just disable with a warning if it were
> up to me. If you *really* wanted to be fancy then
> allow admin to re-enable via proc maybe.

I think that in either case, /proc/sys/kernel/nmi_watchdog
needs to be updated to reflect that the NMI watchdog has
been disabled. That would require exposing other interfaces
of the watchdog.

Thanks and BR,
Ricardo


Re: [PATCH v6 05/29] x86/apic/vector: Do not allocate vectors for NMIs

2022-05-13 Thread Ricardo Neri
On Fri, May 13, 2022 at 10:50:09PM +0200, Thomas Gleixner wrote:
> On Fri, May 13 2022 at 11:03, Ricardo Neri wrote:
> > On Fri, May 06, 2022 at 11:12:20PM +0200, Thomas Gleixner wrote:
> >> Why would a NMI ever end up in this code? There is no vector management
> >> required and this find cpu exercise is pointless.
> >
> > But even if the NMI has a fixed vector, it is still necessary to determine
> > which CPU will get the NMI. It is still necessary to determine what to
> > write in the Destination ID field of the MSI message.
> >
> > irq_matrix_find_best_cpu() would find the CPU with the lowest number of
> > managed vectors so that the NMI is directed to that CPU. 
> 
> What's the point to send it to the CPU with the lowest number of
> interrupts. It's not that this NMI happens every 50 microseconds.
> We pick one online CPU and are done.

Indeed, that is sensible.

> 
> > In today's code, an NMI would end up here because we rely on the existing
> > interrupt management infrastructure... Unless the check is done at the
> > entry points, as you propose.
> 
> Correct. We don't want to call into functions which are not designed for
> NMIs.

Agreed.

>  
> >> > +
> >> > +if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI) {
> >> > +cpu = irq_matrix_find_best_cpu_managed(vector_matrix, 
> >> > dest);
> >> > +apicd->cpu = cpu;
> >> > +vector = 0;
> >> > +goto no_vector;
> >> > +}
> >> 
> >> This code can never be reached for a NMI delivery. If so, then it's a
> >> bug.
> >> 
> >> This all is special purpose for that particular HPET NMI watchdog use
> >> case and we are not exposing this to anything else at all.
> >> 
> >> So why are you sprinkling this NMI nonsense all over the place? Just
> >> because? There are well defined entry points to all of this where this
> >> can be fenced off.
> >
> > I put the NMI checks at these points because assign_vector_locked() and
> > assign_managed_vector() are reached through multiple paths and these are
> > the two places where the allocation of the vector is requested and the
> > destination CPU is determined.
> >
> > I do observe this code being reached for an NMI, but that is because this
> > code still does not know about NMIs... Unless the checks for NMI are put
> > at the entry points, as you pointed out.
> >
> > The intent was to refactor the code in a generic manner and not to focus
> > only on the NMI watchdog. That would have looked hacky, IMO.
> 
> We don't want to have more of this really. Supporting NMIs on x86 in a
> broader way is simply not reasonable because there is only one NMI
> vector and we have no sensible way to get to the cause of the NMI
> without a massive overhead.
> 
> Even if we get multiple NMI vectors some shiny day, this will be
> fundamentally different than regular interrupts and certainly not
> exposed broadly. There will be 99.99% fixed vectors for simplicity sake.

Understood.

> 
> >> +  if (info->flags & X86_IRQ_ALLOC_AS_NMI) {
> >> +  /*
> >> +   * NMIs have a fixed vector and need their own
> >> +   * interrupt chip so nothing can end up in the
> >> +   * regular local APIC management code except the
> >> +   * MSI message composing callback.
> >> +   */
> >> +  irqd->chip = &lapic_nmi_controller;
> >> +  /*
> >> +   * Don't allow affinity setting attempts for NMIs.
> >> +   * This cannot work with the regular affinity
> >> +   * mechanisms and for the intended HPET NMI
> >> +   * watchdog use case it's not required.
> >
> > But we do need the ability to set affinity, right? As stated above, we need
> > to know what Destination ID to write in the MSI message or in the interrupt
> > remapping table entry.
> >
> > It cannot be any CPU because only one specific CPU is supposed to handle the
> > NMI from the HPET channel.
> >
> > We cannot hard-code a CPU for that because it may go offline (and ignore 
> > NMIs)
> > or not be part of the monitored CPUs.
> >
> > Also, if lapic_nmi_controller.irq_set_affinity() is NULL, then irq_chips
> > INTEL-IR, AMD-IR, those using msi_domain_set_affinity() need to c

Re: [PATCH v6 24/29] watchdog/hardlockup: Use parse_option_str() to handle "nmi_watchdog"

2022-05-13 Thread Ricardo Neri
On Tue, May 10, 2022 at 08:46:41PM +1000, Nicholas Piggin wrote:
> Excerpts from Ricardo Neri's message of May 6, 2022 10:00 am:
> > Prepare hardlockup_panic_setup() to handle a comma-separated list of
> > options. Thus, it can continue parsing its own command-line options while
> > ignoring parameters that are relevant only to specific implementations of
> > the hardlockup detector. Such implementations may use an early_param to
> > parse their own options.
> 
> It can't really handle a comma-separated list though, until the next
> patch. nmi_watchdog=panic,0 does not make sense, so you lost error
> handling of that.

Yes, that is true. All possible combinations need to be checked.

> 
> And is it kosher to double handle options like this? I'm sure it
> happens but it's ugly.
> 
> Would you consider just adding a new option for x86 and avoiding changes
> to this? Less code and patches.

Sure, I can leave this code unmodified and add an x86-specific command-line
option.

Thanks and BR,
Ricardo


Re: [PATCH v6 20/29] init/main: Delay initialization of the lockup detector after smp_init()

2022-05-13 Thread Ricardo Neri
On Tue, May 10, 2022 at 08:38:22PM +1000, Nicholas Piggin wrote:
> Excerpts from Ricardo Neri's message of May 6, 2022 9:59 am:
> > Certain implementations of the hardlockup detector require support for
> > Inter-Processor Interrupt shorthands. On x86, support for these can only
> > be determined after all the possible CPUs have booted once (in
> > smp_init()). Other architectures may not need such a check.
> > 
> > lockup_detector_init() only performs the initialization of the data
> > structures of the lockup detector. Hence, there are no dependencies on
> > smp_init().
> 

Thank you for your feedback, Nicholas!

> I think this is the only real thing which affects other watchdog types?

Also patches 18 and 19 that decouple the NMI watchdog functionality from
perf.

> 
> Not sure if it's a big problem, the secondary CPUs coming up won't
> have their watchdog active until quite late, and the primary could
> implement its own timeout in __cpu_up for secondary coming up, and
> IPI it to get traces if necessary which is probably more robust.

Indeed that could work. Another alternative I have been pondering is to boot
the system with the perf-based NMI watchdog enabled. Once all CPUs are up
and running, switch to the HPET-based NMI watchdog and free the PMU counters.

> 
> Acked-by: Nicholas Piggin 

Thank you!

BR,
Ricardo


Re: [PATCH v6 22/29] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector

2022-05-13 Thread Ricardo Neri
On Mon, May 09, 2022 at 04:03:39PM +0200, Thomas Gleixner wrote:
> On Thu, May 05 2022 at 17:00, Ricardo Neri wrote:
> > +   if (is_hpet_hld_interrupt(hdata)) {
> > +   /*
> > +* Kick the timer first. If the HPET channel is periodic, it
> > +* helps to reduce the delta between the expected TSC value and
> > +* its actual value the next time the HPET channel fires.
> > +*/
> > +   kick_timer(hdata, !(hdata->has_periodic));
> > +
> > +   if (cpumask_weight(hld_data->monitored_cpumask) > 1) {
> > +   /*
> > +* Since we cannot know the source of an NMI, the best
> > +* we can do is to use a flag to indicate to all online
> > +* CPUs that they will get an NMI and that the source of
> > +* that NMI is the hardlockup detector. Offline CPUs
> > +* also receive the NMI but they ignore it.
> > +*
> > +* Even though we are in NMI context, we have concluded
> > +* that the NMI came from the HPET channel assigned to
> > +* the detector, an event that is infrequent and only
> > +* occurs in the handling CPU. There should not be races
> > +* with other NMIs.
> > +*/
> > +   cpumask_copy(hld_data->inspect_cpumask,
> > +cpu_online_mask);
> > +
> > +   /* If we are here, IPI shorthands are enabled. */
> > +   apic->send_IPI_allbutself(NMI_VECTOR);
> 
> So if the monitored cpumask is a subset of online CPUs, which is the
> case when isolation features are enabled, then you still send NMIs to
> those isolated CPUs. I'm sure the isolation folks will be enthused.

Yes, I acknowledged this limitation in the cover letter. I should also update
Documentation/admin-guide/lockup-watchdogs.rst.

This patchset proposes the HPET NMI watchdog as an opt-in feature.

The limitation might be mitigated by adding a check for non-housekeeping
and non-monitored CPUs in exc_nmi(). However, that would not eliminate the
problem of isolated CPUs also getting the NMI.
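
A sketch of such a check (hypothetical; whether housekeeping_cpu() fits this
use would need to be confirmed):

	/* Early in exc_nmi(), before any watchdog work: */
	unsigned int cpu = smp_processor_id();

	if (!cpumask_test_cpu(cpu, hld_data->monitored_cpumask) &&
	    !housekeeping_cpu(cpu, HK_TYPE_TIMER))
		return;	/* isolated CPU: bail, though the NMI still arrived */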

Thanks and BR,
Ricardo


Re: [PATCH v6 15/29] x86/hpet: Add helper function hpet_set_comparator_periodic()

2022-05-13 Thread Ricardo Neri
On Fri, May 06, 2022 at 11:51:52PM +0200, Thomas Gleixner wrote:
> On Fri, May 06 2022 at 23:41, Thomas Gleixner wrote:
> > On Thu, May 05 2022 at 16:59, Ricardo Neri wrote:
> >> Programming an HPET channel as periodic requires setting the
> >> HPET_TN_SETVAL bit in the channel configuration. Plus, the comparator
> >> register must be written twice (once for the comparator value and once for
> >> the periodic value). Since this programming might be needed in several
> >> places (e.g., the HPET clocksource and the HPET-based hardlockup detector),
> >> add a helper function for this purpose.
> >>
> >> A helper function hpet_set_comparator_oneshot() could also be implemented.
> >> However, such function would only program the comparator register and the
> >> function would be quite small. Hence, it is better to not bloat the code
> >> with such an obvious function.
> >
> > This word salad above does not provide a single reason why the periodic
> > programming function is required and better suited for the NMI watchdog
> > case and then goes on and blurbs about why a function which is not
> > required is not implemented. The argument about not bloating the code
> > with an "obvious???" function which is quite small is slightly beyond my
> > comprehension level.
> 
> What's even more incomprehensible is that the patch which actually sets
> up that NMI watchdog cruft has:
> 
> > +   if (hc->boot_cfg & HPET_TN_PERIODIC_CAP)
> > +   hld_data->has_periodic = true;
> 
> So how the heck does that work with a HPET which does not support
> periodic mode?

If the HPET channel does not support periodic mode (as indicated by the flag
above), the detector will read the HPET counter into a local variable, increment
that local variable, and write the result to the comparator of the HPET channel.

If the HPET channel does support periodic mode, the detector will not kick it
again. It will only kick a periodic HPET channel if needed (e.g., if the NMI
watchdog was idle or watchdog_thresh changed its value).
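
Putting that description into code, the kick path looks roughly like this
(a sketch only; the struct fields other than has_periodic are assumptions,
not the actual layout in the series):

	static void kick_timer(struct hpet_hld_data *hdata, bool force)
	{
		u32 count, period;

		/* A periodic channel reloads itself; reprogram only if asked. */
		if (hdata->has_periodic && !force)
			return;

		period = watchdog_thresh * hdata->ticks_per_second;
		count = hpet_readl(HPET_COUNTER) + period;

		if (hdata->has_periodic)
			hpet_set_comparator_periodic(hdata->channel, count, period);
		else
			hpet_writel(count, HPET_Tn_CMP(hdata->channel));
	}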

> 
> That watchdog muck will still happily invoke that set periodic function
> in the hope that it works by chance?

It will not. It will check hld_data->has_periodic and act accordingly.

FWIW, I have tested this NMI watchdog with periodic and non-periodic HPET
channels.

Thanks and BR,
Ricardo


Re: [PATCH v6 15/29] x86/hpet: Add helper function hpet_set_comparator_periodic()

2022-05-13 Thread Ricardo Neri
On Fri, May 06, 2022 at 11:41:13PM +0200, Thomas Gleixner wrote:
> On Thu, May 05 2022 at 16:59, Ricardo Neri wrote:
> > Programming an HPET channel as periodic requires setting the
> > HPET_TN_SETVAL bit in the channel configuration. Plus, the comparator
> > register must be written twice (once for the comparator value and once for
> > the periodic value). Since this programming might be needed in several
> > places (e.g., the HPET clocksource and the HPET-based hardlockup detector),
> > add a helper function for this purpose.
> >
> > A helper function hpet_set_comparator_oneshot() could also be implemented.
> > However, such function would only program the comparator register and the
> > function would be quite small. Hence, it is better to not bloat the code
> > with such an obvious function.
> 
> This word salad above does not provide a single reason why the periodic
> programming function is required and better suited for the NMI watchdog
> case

The goal of hpet_set_comparator_periodic() is to avoid code duplication: both
hpet_clkevt_set_state_periodic() and the HPET NMI watchdog's kick_timer()
need to program a periodic HPET channel when supported.


> and then goes on and blurbs about why a function which is not
> required is not implemented.

I can remove this.

> The argument about not bloating the code
> with an "obvious???" function which is quite small is slightly beyond my
> comprehension level.

That obvious function would look like this:

void hpet_set_comparator_one_shot(int channel, u32 delta)
{
	u32 count;

	count = hpet_readl(HPET_COUNTER);
	count += delta;
	hpet_writel(count, HPET_Tn_CMP(channel));
}

It involves one register read, one addition and one register write. IMO, this
code is sufficiently simple and small to allow duplication.

Programming a periodic HPET channel is not as straightforward, IMO. It involves
handling two different values (period and comparator) written in a specific
sequence, one configuration bit, and one delay. It also involves three register
writes and one register read.
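
For contrast, the periodic sequence being described looks roughly like this
(a sketch consistent with the description above; the actual helper may
differ in details):

	void hpet_set_comparator_periodic(int channel, u32 cmp, u32 period)
	{
		u32 v = hpet_readl(HPET_Tn_CFG(channel));

		/* Arm the channel to accept a new period value. */
		hpet_writel(v | HPET_TN_SETVAL, HPET_Tn_CFG(channel));

		/* The first write sets the comparator... */
		hpet_writel(cmp, HPET_Tn_CMP(channel));

		/*
		 * ...and the second write, after a small delay required by
		 * the hardware, sets the period.
		 */
		udelay(1);
		hpet_writel(period, HPET_Tn_CMP(channel));
	}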

Thanks and BR,
Ricardo


Re: [PATCH v6 13/29] iommu/amd: Compose MSI messages for NMI irqs in non-IR format

2022-05-13 Thread Ricardo Neri
On Fri, May 06, 2022 at 11:31:56PM +0200, Thomas Gleixner wrote:
> On Thu, May 05 2022 at 16:59, Ricardo Neri wrote:
> > +*
> > +* Also, NMIs do not have an associated vector. No need for cleanup.
> 
> They have a vector and what the heck is this cleanup comment for here?
> There is nothing cleanup alike anywhere near.
> 
> Adding confusing comments is worse than adding no comments at all.

I will remove the comment regarding cleanup. I will clarify that the NMI has
a fixed vector.

Thanks and BR,
Ricardo


Re: [PATCH v6 12/29] iommu/amd: Enable NMIPass when allocating an NMI irq

2022-05-13 Thread Ricardo Neri
On Fri, May 06, 2022 at 11:26:22PM +0200, Thomas Gleixner wrote:
> On Thu, May 05 2022 at 16:59, Ricardo Neri wrote:
> >  
> > +   if (info->flags & X86_IRQ_ALLOC_AS_NMI) {
> > +   /* Only one IRQ per NMI */
> > +   if (nr_irqs != 1)
> > +   return -EINVAL;
> 
> See previous reply.

I will remove this check.

Thanks and BR,
Ricardo
> 


Re: [PATCH v6 10/29] iommu/vt-d: Implement minor tweaks for NMI irqs

2022-05-13 Thread Ricardo Neri
On Fri, May 06, 2022 at 11:23:23PM +0200, Thomas Gleixner wrote:
> On Thu, May 05 2022 at 16:59, Ricardo Neri wrote:
> > The Intel IOMMU interrupt remapping driver already programs correctly the
> > delivery mode of individual irqs as per their irq_data. Improve handling
> > of NMIs. Allow only one irq per NMI. Also, it is not necessary to cleanup
> > irq vectors after updating affinity.
> 
> Structuring a changelog in paragraphs might make it readable. New lines
> exist for a reason.

Sure, I can structure this in paragraphs.
> 
> > NMIs do not have associated vectors.
> 
> Again. NMI has an vector associated, but it is not subject to dynamic
> vector management.

Indeed, it is clear to me now.

> 
> > diff --git a/drivers/iommu/intel/irq_remapping.c 
> > b/drivers/iommu/intel/irq_remapping.c
> > index fb2d71bea98d..791a9331e257 100644
> > --- a/drivers/iommu/intel/irq_remapping.c
> > +++ b/drivers/iommu/intel/irq_remapping.c
> > @@ -1198,8 +1198,12 @@ intel_ir_set_affinity(struct irq_data *data, const 
> > struct cpumask *mask,
> >  * After this point, all the interrupts will start arriving
> >  * at the new destination. So, time to cleanup the previous
> >  * vector allocation.
> > +*
> > +* Do it only for non-NMI irqs. NMIs don't have associated
> > +* vectors.
> 
> See above.

Sure.

> 
> >  */
> > -   send_cleanup_vector(cfg);
> > +   if (cfg->delivery_mode != APIC_DELIVERY_MODE_NMI)
> > +   send_cleanup_vector(cfg);
> 
> So this needs to be replicated for all invocations of
> send_cleanup_vector(), right? Why can't you put it into that function?

Certainly, it can be done inside the function.

>   
> > return IRQ_SET_MASK_OK_DONE;
> >  }
> > @@ -1352,6 +1356,9 @@ static int intel_irq_remapping_alloc(struct 
> > irq_domain *domain,
> > if (info->type == X86_IRQ_ALLOC_TYPE_PCI_MSI)
> > info->flags &= ~X86_IRQ_ALLOC_CONTIGUOUS_VECTORS;
> >  
> > +   if ((info->flags & X86_IRQ_ALLOC_AS_NMI) && nr_irqs != 1)
> > +   return -EINVAL;
> 
> This cannot be reached when the vector allocation domain already
> rejected it, but copy & pasta is wonderful and increases the line count.

Yes, this is not needed.

Thanks and BR,
Ricardo
> 
> Thanks,
> 
> tglx
> 
> 


Re: [PATCH v6 05/29] x86/apic/vector: Do not allocate vectors for NMIs

2022-05-13 Thread Ricardo Neri
On Fri, May 06, 2022 at 11:12:20PM +0200, Thomas Gleixner wrote:
> On Thu, May 05 2022 at 16:59, Ricardo Neri wrote:
> > Vectors are meaningless when allocating IRQs with NMI as the delivery
> > mode.
> 
> Vectors are not meaningless. NMI has a fixed vector.
> 
> The point is that for a fixed vector there is no vector management
> required.
> 
> Can you spot the difference?

Yes, I see your point now. Thank you for the explanation.

> 
> > In such case, skip the reservation of IRQ vectors. Do it in the lowest-
> > level functions where the actual IRQ reservation takes place.
> >
> > Since NMIs target specific CPUs, keep the functionality to find the best
> > CPU.
> 
> Again. What for?
>   
> > +   if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI) {
> > +   cpu = irq_matrix_find_best_cpu(vector_matrix, dest);
> > +   apicd->cpu = cpu;
> > +   vector = 0;
> > +   goto no_vector;
> > +   }
> 
> Why would a NMI ever end up in this code? There is no vector management
> required and this find cpu exercise is pointless.

But even if the NMI has a fixed vector, it is still necessary to determine
which CPU will get the NMI. It is still necessary to determine what to
write in the Destination ID field of the MSI message.

irq_matrix_find_best_cpu() would find the CPU with the lowest number of
managed vectors so that the NMI is directed to that CPU. 

In today's code, an NMI would end up here because we rely on the existing
interrupt management infrastructure... Unless the check is done at the entry
points, as you propose.

> 
> > vector = irq_matrix_alloc(vector_matrix, dest, resvd, &cpu);
> > trace_vector_alloc(irqd->irq, vector, resvd, vector);
> > if (vector < 0)
> > return vector;
> > apic_update_vector(irqd, vector, cpu);
> > +
> > +no_vector:
> > apic_update_irq_cfg(irqd, vector, cpu);
> >  
> > return 0;
> > @@ -321,12 +330,22 @@ assign_managed_vector(struct irq_data *irqd, const 
> > struct cpumask *dest)
> > /* set_affinity might call here for nothing */
> > if (apicd->vector && cpumask_test_cpu(apicd->cpu, vector_searchmask))
> > return 0;
> > +
> > +   if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI) {
> > +   cpu = irq_matrix_find_best_cpu_managed(vector_matrix, dest);
> > +   apicd->cpu = cpu;
> > +   vector = 0;
> > +   goto no_vector;
> > +   }
> 
> This code can never be reached for a NMI delivery. If so, then it's a
> bug.
> 
> This all is special purpose for that particular HPET NMI watchdog use
> case and we are not exposing this to anything else at all.
> 
> So why are you sprinkling this NMI nonsense all over the place? Just
> because? There are well defined entry points to all of this where this
> can be fenced off.

I put the NMI checks at these points because assign_vector_locked() and
assign_managed_vector() are reached through multiple paths and these are
the two places where the allocation of the vector is requested and the
destination CPU is determined.

I do observe this code being reached for an NMI, but that is because this
code still does not know about NMIs... Unless the checks for NMI are put
at the entry points, as you pointed out.

The intent was to refactor the code in a generic manner and not to focus
only on the NMI watchdog. That would have looked hacky, IMO.

> 
> If at some day the hardware people get their act together and provide a
> proper vector space for NMIs then we have to revisit that, but then
> there will be a separate NMI vector management too.
> 
> What you want is the below which also covers the next patch. Please keep
> an eye on the comments I added/modified.

Thank you for the code and the clarifying comments.
> 
> Thanks,
> 
> tglx
> ---
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -42,6 +42,7 @@ EXPORT_SYMBOL_GPL(x86_vector_domain);
>  static DEFINE_RAW_SPINLOCK(vector_lock);
>  static cpumask_var_t vector_searchmask;
>  static struct irq_chip lapic_controller;
> +static struct irq_chip lapic_nmi_controller;
>  static struct irq_matrix *vector_matrix;
>  #ifdef CONFIG_SMP
>  static DEFINE_PER_CPU(struct hlist_head, cleanup_list);
> @@ -451,6 +452,10 @@ static int x86_vector_activate(struct ir
>   trace_vector_activate(irqd->irq, apicd->is_managed,
> apicd->can_reserve, reserve);
>  
> + /* NMI has a fixed vector. No vector management required */
> + if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE

Re: [PATCH v6 03/29] x86/apic/msi: Set the delivery mode individually for each IRQ

2022-05-11 Thread Ricardo Neri
On Fri, May 06, 2022 at 10:05:34PM +0200, Thomas Gleixner wrote:
> On Thu, May 05 2022 at 16:59, Ricardo Neri wrote:
> > There are no restrictions in hardware to set  MSI messages with its
> > own delivery mode.
> 
> "messages with its own" ? Plural/singular confusion.

Yes, this is not correct. It should have read "messages with their own..."

> 
> > Use the mode specified in the provided IRQ hardware
> > configuration data. Since most of the IRQs are configured to use the
> > delivery mode of the APIC driver in use (set in all of them to
> > APIC_DELIVERY_MODE_FIXED), the only functional changes are where
> > IRQs are configured to use a specific delivery mode.
> 
> This does not parse. There are no functional changes due to this patch
> and there is no point talking about functional changes in subsequent
> patches here.

I will remove this.

> 
> > Changing the utility function __irq_msi_compose_msg() takes care of
> > implementing the change in the in the local APIC, PCI-MSI, and DMAR-MSI
> 
> in the in the

Sorry! This is not correct.

> 
> > irq_chips.
> >
> > The IO-APIC irq_chip configures the entries in the interrupt redirection
> > table using the delivery mode specified in the corresponding MSI message.
> > Since the MSI message is composed by a higher irq_chip in the hierarchy,
> > it does not need to be updated.
> 
> The point is that updating __irq_msi_compose_msg() covers _all_ MSI
> consumers including IO-APIC.
> 
> I had to read that changelog 3 times to make sense of it. Something like
> this perhaps:
> 
>   "x86/apic/msi: Use the delivery mode from irq_cfg for message composition
> 
>irq_cfg provides a delivery mode for each interrupt. Use it instead
>of the hardcoded APIC_DELIVERY_MODE_FIXED. This allows to compose
>messages for NMI delivery mode which is required to implement a HPET
>based NMI watchdog.
> 
>No functional change as the default delivery mode is set to
>APIC_DELIVERY_MODE_FIXED."

Thank you for your help on the changelog! I will take your suggestion.

BR,
Ricardo
> 
> Thanks,
> 
> tglx


Re: [PATCH v6 02/29] x86/apic: Add irq_cfg::delivery_mode

2022-05-11 Thread Ricardo Neri
On Fri, May 06, 2022 at 09:53:54PM +0200, Thomas Gleixner wrote:
> On Thu, May 05 2022 at 16:59, Ricardo Neri wrote:
> > Currently, the delivery mode of all interrupts is set to the mode of the
> > APIC driver in use. There are no restrictions in hardware to configure the
> > delivery mode of each interrupt individually. Also, certain IRQs need
> > to be
> 
> s/IRQ/interrupt/ Changelogs can do without acronyms.

Sure. I will sanitize all the changelogs to remove acronyms.

> 
> > configured with a specific delivery mode (e.g., NMI).
> >
> > Add a new member, delivery_mode, to struct irq_cfg. Subsequent changesets
> > will update every irq_domain to set the delivery mode of each IRQ to that
> > specified in its irq_cfg data.
> >
> > To keep the current behavior, when allocating an IRQ in the root
> > domain
> 
> The root domain does not allocate an interrupt. The root domain
> allocates a vector for an interrupt. There is a very clear and technical
> destinction. Can you please be more careful about the wording?

I will review the wording in the changelogs.

> 
> > --- a/arch/x86/kernel/apic/vector.c
> > +++ b/arch/x86/kernel/apic/vector.c
> > @@ -567,6 +567,7 @@ static int x86_vector_alloc_irqs(struct irq_domain 
> > *domain, unsigned int virq,
> > irqd->chip_data = apicd;
> > irqd->hwirq = virq + i;
> > irqd_set_single_target(irqd);
> > +
> 
> Stray newline.

Sorry! I will remove it.
> 
> > /*
> >  * Prevent that any of these interrupts is invoked in
> >  * non interrupt context via e.g. generic_handle_irq()
> > @@ -577,6 +578,14 @@ static int x86_vector_alloc_irqs(struct irq_domain 
> > *domain, unsigned int virq,
> > /* Don't invoke affinity setter on deactivated interrupts */
> > irqd_set_affinity_on_activate(irqd);
> >  
> > +   /*
> > +* Initialize the delivery mode of this irq to match the
> 
> s/irq/interrupt/

I will make this change.

Thanks and BR,
Ricardo

> 
> > +* default delivery mode of the APIC. Children irq domains
> > +* may take the delivery mode from the individual irq
> > +* configuration rather than from the APIC driver.
> > +*/
> > +   apicd->hw_irq_cfg.delivery_mode = apic->delivery_mode;
> > +
> > /*
> >  * Legacy vectors are already assigned when the IOAPIC
> >  * takes them over. They stay on the same vector. This is


Re: [PATCH v6 01/29] irq/matrix: Expose functions to allocate the best CPU for new vectors

2022-05-11 Thread Ricardo Neri
On Fri, May 06, 2022 at 09:48:28PM +0200, Thomas Gleixner wrote:
> Ricardo,

Thank you very much for your feedback, Thomas! I am sorry for my late reply;
I had been out of the office.

> 
> On Thu, May 05 2022 at 16:59, Ricardo Neri wrote:
> > Certain types of interrupts, such as NMI, do not have an associated vector.
> > They, however, target specific CPUs. Thus, when assigning the destination
> > CPU, it is beneficial to select the one with the lowest number of
> > vectors.
> 
> Why is that beneficial especially in the context of a NMI watchdog which
> then broadcasts the NMI to all other CPUs?

My intent was not the NMI watchdog specifically but potential use cases that do
not involve NMI broadcasts. If the NMI targets a single CPU, it is best to
select the CPU with the lowest vector allocation count.

> 
> That's wishful thinking perhaps, but I don't see any benefit at all.
> 
> > Prepend the functions matrix_find_best_cpu_managed() and
> > matrix_find_best_cpu_managed()
> 
> The same function prepended twice becomes two functions :)
> 

Sorry, I missed this.

> > with the irq_ prefix and expose them for
> > IRQ controllers to use when allocating and activating vector-less IRQs.
> 
> There is no such thing like a vectorless IRQ. NMIs have a vector. Can we
> please describe facts and not pulled out of thin air concepts which do
> not exist?

Thank you for the clarification. I see your point. I wrote this patch because
maskable interrupts and NMIs have different entry points. As you state,
however, they also have a vector.

I can drop this patch.

BR,
Ricardo

> 
> Thanks,
> 
> tglx


[PATCH v6 29/29] x86/tsc: Switch to perf-based hardlockup detector if TSC become unstable

2022-05-05 Thread Ricardo Neri
The HPET-based hardlockup detector relies on the TSC to determine whether an
observed NMI interrupt originated from the HPET timer. Hence, this detector
can no longer be used with an unstable TSC.

In such a case, permanently stop the HPET-based hardlockup detector and
start the perf-based detector.

Cc: Andi Kleen 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Suggested-by: Thomas Gleixner 
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Relocated the declaration of hardlockup_detector_switch_to_perf() to
   x86/nmi.h. It does not depend on HPET.
 * Removed function stub. The shim hardlockup detector is always for x86.

Changes since v4:
 * Added a stub version of hardlockup_detector_switch_to_perf() for
   !CONFIG_HPET_TIMER. (lkp)
 * Reconfigure the whole lockup detector instead of unconditionally
   starting the perf-based hardlockup detector.

Changes since v3:
 * None

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 arch/x86/include/asm/nmi.h | 6 ++
 arch/x86/kernel/tsc.c  | 2 ++
 arch/x86/kernel/watchdog_hld.c | 6 ++
 3 files changed, 14 insertions(+)

diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
index 4a0d5b562c91..47752ff67d8b 100644
--- a/arch/x86/include/asm/nmi.h
+++ b/arch/x86/include/asm/nmi.h
@@ -63,4 +63,10 @@ void stop_nmi(void);
 void restart_nmi(void);
 void local_touch_nmi(void);
 
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR
+void hardlockup_detector_switch_to_perf(void);
+#else
+static inline void hardlockup_detector_switch_to_perf(void) { }
+#endif
+
 #endif /* _ASM_X86_NMI_H */
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index cc1843044d88..74772ffc79d1 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1176,6 +1176,8 @@ void mark_tsc_unstable(char *reason)
 
clocksource_mark_unstable(&clocksource_tsc_early);
clocksource_mark_unstable(&clocksource_tsc);
+
+   hardlockup_detector_switch_to_perf();
 }
 
 EXPORT_SYMBOL_GPL(mark_tsc_unstable);
diff --git a/arch/x86/kernel/watchdog_hld.c b/arch/x86/kernel/watchdog_hld.c
index ef11f0af4ef5..7940977c6312 100644
--- a/arch/x86/kernel/watchdog_hld.c
+++ b/arch/x86/kernel/watchdog_hld.c
@@ -83,3 +83,9 @@ void watchdog_nmi_start(void)
if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET)
hardlockup_detector_hpet_start();
 }
+
+void hardlockup_detector_switch_to_perf(void)
+{
+   detector_type = X86_HARDLOCKUP_DETECTOR_PERF;
+   lockup_detector_reconfigure();
+}
-- 
2.17.1



[PATCH v6 26/29] x86/watchdog: Add a shim hardlockup detector

2022-05-05 Thread Ricardo Neri
The generic hardlockup detector is based on perf. It also provides a set
of weak functions that CPU architectures can override. Add a shim
hardlockup detector for x86 that overrides such functions and can
select between perf and HPET implementations of the detector.

For clarity, add the intermediate Kconfig symbol X86_HARDLOCKUP_DETECTOR
that is selected whenever the core of the hardlockup detector is
selected.

Cc: Andi Kleen 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Suggested-by: Nicholas Piggin 
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Added watchdog_nmi_start() to be used when tsc_khz is recalibrated.
 * Always build the x86-specific hardlockup detector shim; not only
   when the HPET-based detector is selected.
 * Corrected a typo in comment in watchdog_nmi_probe() (Ani)
 * Removed useless local ret variable in watchdog_nmi_enable(). (Ani)

Changes since v4:
 * Use a switch to enable and disable the various available detectors.
   (Andi)

Changes since v3:
 * Fixed style in multi-line comment. (Randy Dunlap)

Changes since v2:
 * Pass cpu number as argument to hardlockup_detector_[enable|disable].
   (Thomas Gleixner)

Changes since v1:
 * Introduced this patch: Added an x86-specific shim hardlockup
   detector. (Nicholas Piggin)
---
 arch/x86/Kconfig.debug |  3 ++
 arch/x86/kernel/Makefile   |  2 +
 arch/x86/kernel/watchdog_hld.c | 85 ++
 3 files changed, 90 insertions(+)
 create mode 100644 arch/x86/kernel/watchdog_hld.c

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index bc34239589db..599001157847 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -6,6 +6,9 @@ config TRACE_IRQFLAGS_NMI_SUPPORT
 config EARLY_PRINTK_USB
bool
 
+config X86_HARDLOCKUP_DETECTOR
+   def_bool y if HARDLOCKUP_DETECTOR_CORE
+
 config X86_VERBOSE_BOOTUP
bool "Enable verbose x86 bootup info messages"
default y
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index c700b00a2d86..af3d54e4c836 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -114,6 +114,8 @@ obj-$(CONFIG_KGDB)  += kgdb.o
 obj-$(CONFIG_VM86) += vm86_32.o
 obj-$(CONFIG_EARLY_PRINTK) += early_printk.o
 
+obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR) += watchdog_hld.o
+
 obj-$(CONFIG_HPET_TIMER)   += hpet.o
 obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET) += watchdog_hld_hpet.o
 
diff --git a/arch/x86/kernel/watchdog_hld.c b/arch/x86/kernel/watchdog_hld.c
new file mode 100644
index ..ef11f0af4ef5
--- /dev/null
+++ b/arch/x86/kernel/watchdog_hld.c
@@ -0,0 +1,85 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A shim hardlockup detector. It overrides the weak stubs of the generic
+ * implementation to select between the perf- or the hpet-based implementation.
+ *
+ * Copyright (C) Intel Corporation 2022
+ */
+
+#include 
+#include 
+
+enum x86_hardlockup_detector {
+   X86_HARDLOCKUP_DETECTOR_PERF,
+   X86_HARDLOCKUP_DETECTOR_HPET,
+};
+
+static enum x86_hardlockup_detector detector_type __read_mostly;
+
+int watchdog_nmi_enable(unsigned int cpu)
+{
+   switch (detector_type) {
+   case X86_HARDLOCKUP_DETECTOR_PERF:
+   hardlockup_detector_perf_enable();
+   break;
+   case X86_HARDLOCKUP_DETECTOR_HPET:
+   hardlockup_detector_hpet_enable(cpu);
+   break;
+   default:
+   return -ENODEV;
+   }
+
+   return 0;
+}
+
+void watchdog_nmi_disable(unsigned int cpu)
+{
+   switch (detector_type) {
+   case X86_HARDLOCKUP_DETECTOR_PERF:
+   hardlockup_detector_perf_disable();
+   break;
+   case X86_HARDLOCKUP_DETECTOR_HPET:
+   hardlockup_detector_hpet_disable(cpu);
+   break;
+   }
+}
+
+int __init watchdog_nmi_probe(void)
+{
+   int ret;
+
+   /*
+* Try first with the HPET hardlockup detector. It will only
+* succeed if selected at build time and requested in the
+* nmi_watchdog command-line parameter. This ensures that the
+* perf-based detector is used by default, if selected at
+* build time.
+*/
+   ret = hardlockup_detector_hpet_init();
+   if (!ret) {
+   detector_type = X86_HARDLOCKUP_DETECTOR_HPET;
+   return ret;
+   }
+
+   ret = hardlockup_detector_perf_init();
+   if (!ret) {
+   detector_type = X86_HARDLOCKUP_DETECTOR_PERF;
+   return ret;
+   }
+
+   return 0;
+}
+
+void watchdog_nmi_stop(void)
+{
+   /* Only the HPET lockup detector defines a stop function. */
+   if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET)
+   hardlockup_detector_hpet_stop();
+}
+
+void watchdog_nmi_start(void)
+{
+   /* Only the HPET lockup detector defines a start function. */
+   if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET)
+   hardlockup_detector_hpet_start();
+}

[PATCH v6 23/29] x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI

2022-05-05 Thread Ricardo Neri
It is not possible to determine the source of a non-maskable interrupt
(NMI) on x86. When dealing with an HPET channel, the only direct method to
determine whether it caused an NMI would be to read the Interrupt Status
register.

However, reading HPET registers is slow and, therefore, not to be done
while in NMI context. Furthermore, the status is not available if the HPET
channel is programmed to deliver an MSI interrupt.

An indirect way to infer whether an incoming NMI was caused by the HPET
channel of the detector is to use the time-stamp counter (TSC). Compute the
value that the TSC is expected to have at the next interrupt of the HPET
channel and compare it with the value it has when the interrupt does
happen. If the actual value falls within a small error window, assume
that the HPET channel of the detector is the source of the NMI.

Let tsc_delta be the difference between the value the TSC has now and the
value it will have when the next HPET channel interrupt happens. Define the
error window as a percentage of tsc_delta.
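
In simplified form, the check reduces to the following (an illustrative
sketch; the field names are placeholders, not necessarily those used in the
patch):

	static bool is_hpet_hld_interrupt(struct hpet_hld_data *hdata)
	{
		u64 tsc_curr = rdtsc();
		/* Error window: a fixed fraction (0.4%) of tsc_delta. */
		u64 error_window = hdata->tsc_delta * 4 / 1000;

		return time_in_range64(tsc_curr,
				       hdata->tsc_next - error_window,
				       hdata->tsc_next + error_window);
	}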

Below is a table that characterizes the error in the expected TSC value
when the HPET channel fires on a variety of systems. It presents the error
as a percentage of tsc_delta and in microseconds.

The table summarizes the error of 4096 interrupts of the HPET channel
collected after the system has been up for 5 minutes as well as since boot.

The maximum observed error on any system is 0.045%. When the error since
boot is considered, the maximum observed error is 0.198%.

To find the most common error value, the collected data is grouped into
buckets of 0.01 percentage points of the error and 10ns, respectively.
The most common error on any system is 0.01317%.

Allow a maximum error that is twice as big as the maximum error observed in
these experiments: 0.4%.

watchdog_thresh          1s                  10s                 60s
Error wrt
expected
TSC value        %        us         %        us         %        us

AMD EPYC 7742 64-Core Processor
Abs max
since boot    0.04517   451.74   0.00171   171.04   0.00034   201.89
Abs max       0.04517   451.74   0.00171   171.04   0.00034   201.89
Mode          0.00002     0.18   0.00002     2.07  -0.00003   -19.20

Intel(R) Xeon(R) CPU E7-8890 - INTEL_FAM6_HASWELL_X
Abs max
since boot    0.00811    81.15   0.00462   462.40   0.00014    81.65
Abs max       0.00811    81.15   0.00084    84.31   0.00014    81.65
Mode         -0.00422   -42.16  -0.00043   -42.50  -0.00007   -40.40

Intel(R) Xeon(R) Platinum 8170M - INTEL_FAM6_SKYLAKE_X
Abs max
since boot    0.10530  1053.04   0.01324  1324.27   0.00407  2443.25
Abs max       0.01166   116.59   0.00114   114.11   0.00024   143.47
Mode         -0.01023  -102.32  -0.00103  -102.44  -0.00022  -132.38

Intel(R) Xeon(R) CPU E5-2699A v4 - INTEL_FAM6_BROADWELL_X
Abs max
since boot    0.00010    99.34   0.00099    98.83   0.00016    97.50
Abs max       0.00010    99.34   0.00099    98.83   0.00016    97.50
Mode         -0.00007   -74.29  -0.00074   -73.99  -0.00012   -73.12

Intel(R) Xeon(R) Gold 5318H - INTEL_FAM6_COOPERLAKE_X
Abs max
since boot    0.11262  1126.17   0.01109  1109.17   0.00409  2455.73
Abs max       0.01073   107.31   0.00109   109.02   0.00019   115.34
Mode         -0.00953   -95.26  -0.00094   -93.63  -0.00015   -90.42

Intel(R) Xeon(R) Platinum 8360Y - INTEL_FAM6_ICELAKE_X
Abs max
since boot    0.19853  1985.30   0.00784   783.53  -0.00017  -104.77
Abs max       0.01550   155.02   0.00158   157.56   0.00020   117.74
Mode         -0.01317  -131.65  -0.00136  -136.42  -0.00018  -105.06

Cc: Andi Kleen 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Suggested-by: Andi Kleen 
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
NOTE: The error characterization data is repeated here from the cover
letter.
---
Changes since v5:
 * Reworked is_hpet_hld_interrupt() to reduce indentation.
 * Use time_in_range64() to compare the actual TSC value vs the expected
   value. This makes it more readable. (Tony)
 * Reduced the error window of the expected TSC value at the time of the
   HPET channel expiration.
 * Described better the heuristics used to determine if the HPET channel
   caused the NMI. (Tony)
 * Added a table to characterize the error in the expected TSC value when
   the HPET channel fires.
 * Removed references to groups of monitored CPUs. Instead, use tsc_khz
   directly.

Changes since v4:
 * Compute the TSC expected value at the next HPET interrupt based on the
   number of monitored packages and not the number of monitored CPUs.

Changes since v3:
 * None

Changes since v2:
 * Reworked condition to check if the expected TSC value is within the
   error margin to avoid an unnecessary conditional. (Peter Zijlstra)
 * Removed TSC error margin from struct hld_data; use a global variable
   instead. (Peter Zijlstra)

Changes since v1:
 * Introduced this patch.
---
 arch/x86/include/asm/hpet.h |  3 ++

[PATCH v6 27/29] watchdog: Expose lockup_detector_reconfigure()

2022-05-05 Thread Ricardo Neri
When there are multiple implementations of the NMI watchdog, there may be
situations in which switching from one to another is needed. If the time-
stamp counter becomes unstable, the HPET-based NMI watchdog can no longer
be used. Similarly, the HPET-based NMI watchdog relies on tsc_khz and
needs to be informed when it is refined.

Reloading the NMI watchdog or switching to another hardlockup detector can
be done cleanly by updating the arch-specific stub and then reconfiguring
the whole lockup detector.

Expose lockup_detector_reconfigure() to achieve this goal.

Cc: Andi Kleen 
Cc: Nicholas Piggin 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * None

Changes since v4:
 * Switching to the perf-based lockup detector under the hood is hacky.
   Instead, reconfigure the whole lockup detector.

Changes since v3:
 * None

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 include/linux/nmi.h | 2 ++
 kernel/watchdog.c   | 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index cf12380e51b3..73827a477288 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -16,6 +16,7 @@ void lockup_detector_init(void);
 void lockup_detector_soft_poweroff(void);
 void lockup_detector_cleanup(void);
 bool is_hardlockup(void);
+void lockup_detector_reconfigure(void);
 
 extern int watchdog_user_enabled;
 extern int nmi_watchdog_user_enabled;
@@ -37,6 +38,7 @@ extern int sysctl_hardlockup_all_cpu_backtrace;
 static inline void lockup_detector_init(void) { }
 static inline void lockup_detector_soft_poweroff(void) { }
 static inline void lockup_detector_cleanup(void) { }
+static inline void lockup_detector_reconfigure(void) { }
 #endif /* !CONFIG_LOCKUP_DETECTOR */
 
 #ifdef CONFIG_SOFTLOCKUP_DETECTOR
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 6443841a755f..e5b67544f8c8 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -537,7 +537,7 @@ int lockup_detector_offline_cpu(unsigned int cpu)
return 0;
 }
 
-static void lockup_detector_reconfigure(void)
+void lockup_detector_reconfigure(void)
 {
cpus_read_lock();
watchdog_nmi_stop();
@@ -579,7 +579,7 @@ static __init void lockup_detector_setup(void)
 }
 
 #else /* CONFIG_SOFTLOCKUP_DETECTOR */
-static void lockup_detector_reconfigure(void)
+void lockup_detector_reconfigure(void)
 {
cpus_read_lock();
watchdog_nmi_stop();
-- 
2.17.1



[PATCH v6 28/29] x86/tsc: Restart NMI watchdog after refining tsc_khz

2022-05-05 Thread Ricardo Neri
The HPET hardlockup detector relies on tsc_khz to estimate the value
that the TSC will have when its HPET channel fires. A refined tsc_khz
helps to better estimate the expected TSC value.

Using the early value of tsc_khz may lead to a large error in the expected
TSC value. Restarting the NMI watchdog has the effect of kicking the
detector's HPET channel and making use of the refined tsc_khz.

When the HPET hardlockup detector is not in use, restarting the NMI
watchdog is a no-op.
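
To see why the refined value matters, consider how the expected TSC ticks
between HPET NMIs would be computed (a sketch assuming watchdog_thresh is
in seconds; the function name is illustrative):

        /*
         * tsc_khz is in kHz, so the TSC advances by roughly
         * watchdog_thresh * tsc_khz * 1000 ticks per watchdog period.
         * A coarse early tsc_khz skews this estimate and hence the
         * error window used to decide whether the HPET caused the NMI.
         */
        static u64 expected_tsc_delta(unsigned int watchdog_thresh)
        {
                return (u64)watchdog_thresh * tsc_khz * 1000;
        }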

Cc: Andi Kleen 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch

Changes since v4
 * N/A

Changes since v3
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 arch/x86/kernel/tsc.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index cafacb2e58cc..cc1843044d88 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1386,6 +1386,12 @@ static void tsc_refine_calibration_work(struct work_struct *work)
/* Inform the TSC deadline clockevent devices about the recalibration */
lapic_update_tsc_freq();
 
+   /*
+* If in use, the HPET hardlockup detector relies on tsc_khz.
+* Reconfigure it to make use of the refined tsc_khz.
+*/
+   lockup_detector_reconfigure();
+
/* Update the sched_clock() rate to match the clocksource one */
for_each_possible_cpu(cpu)
set_cyc2ns_scale(tsc_khz, cpu, tsc_stop);
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v6 25/29] watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot parameter

2022-05-05 Thread Ricardo Neri
Keep the HPET-based hardlockup detector disabled unless explicitly enabled
via a command-line argument. If such an argument is not given, the
initialization of the HPET-based hardlockup detector fails and the NMI
watchdog falls back to the perf-based implementation.

Implement the command-line parsing using an early_param, as
__setup("nmi_watchdog=") only parses generic options.
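
For instance, booting with the following (illustrative) command line
enables the NMI watchdog and selects the HPET implementation; the generic
handler still consumes the "1" token:

        nmi_watchdog=1,hpet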

Cc: Andi Kleen 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
--
Changes since v5:
 * None

Changes since v4:
 * None

Changes since v3:
 * None

Changes since v2:
 * Do not imply that using nmi_watchdog=hpet means the detector is
   enabled. Instead, print a warning in such case.

Changes since v1:
 * Added documentation to the function handing the nmi_watchdog
   kernel command-line argument.
---
 .../admin-guide/kernel-parameters.txt |  8 ++-
 arch/x86/kernel/watchdog_hld_hpet.c   | 22 +++
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 269be339d738..89eae950fdb8 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3370,7 +3370,7 @@
Format: [state][,regs][,debounce][,die]
 
nmi_watchdog=   [KNL,BUGS=X86] Debugging features for SMP kernels
-   Format: [panic,][nopanic,][num]
+   Format: [panic,][nopanic,][num,][hpet]
Valid num: 0 or 1
0 - turn hardlockup detector in nmi_watchdog off
1 - turn hardlockup detector in nmi_watchdog on
@@ -3381,6 +3381,12 @@
please see 'nowatchdog'.
This is useful when you use a panic=... timeout and
need the box quickly up again.
+   When hpet is specified, the NMI watchdog will be driven
+   by an HPET timer, if available in the system. Otherwise,
+   it falls back to the default implementation (perf or
+   architecture-specific). Specifying hpet has no effect
+   if the NMI watchdog is not enabled (either at build time
+   or via the command line).
 
These settings can be accessed at runtime via
the nmi_watchdog and hardlockup_panic sysctls.
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index 3effdbf29095..4413d5fb94f4 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -379,6 +379,28 @@ void hardlockup_detector_hpet_start(void)
enable_timer(hld_data);
 }
 
+/**
+ * hardlockup_detector_hpet_setup() - Parse command-line parameters
+ * @str:   A string containing the kernel command line
+ *
+ * Parse the nmi_watchdog parameter from the kernel command line. If
+ * selected by the user, use this implementation to detect hardlockups.
+ */
+static int __init hardlockup_detector_hpet_setup(char *str)
+{
+   if (!str)
+   return -EINVAL;
+
+   if (parse_option_str(str, "hpet"))
+   hardlockup_use_hpet = true;
+
+   if (!nmi_watchdog_user_enabled && hardlockup_use_hpet)
+   pr_err("Selecting HPET NMI watchdog has no effect with NMI watchdog disabled\n");
+
+   return 0;
+}
+early_param("nmi_watchdog", hardlockup_detector_hpet_setup);
+
 /**
  * hardlockup_detector_hpet_init() - Initialize the hardlockup detector
  *
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v6 22/29] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector

2022-05-05 Thread Ricardo Neri
Implement a hardlockup detector that uses an HPET channel as the source
of the non-maskable interrupt. Implement the basic functionality to
start, stop, and configure the timer.

Designate one of the monitored CPUs as the handling CPU and use it to
service the NMI from the HPET channel. When servicing the HPET NMI, issue
an inter-processor interrupt to the rest of the monitored CPUs.
Only enable the detector if IPI shorthands are enabled in the system.

During operation, the HPET registers are only accessed to kick the timer.
This operation can be avoided if the HPET channel assigned to the
detector supports periodic mode.

To configure the HPET channel interrupt, the detector relies on the
interrupt subsystem to configure the delivery mode as NMI (as requested
in hpet_hld_get_timer()) throughout the IRQ hierarchy. This covers
systems with and without interrupt remapping enabled.

The detector is not functional at this stage. A subsequent changeset will
invoke the interfaces implemented in this changeset to start, stop, and
reconfigure the detector. Another subsequent changeset implements the
logic to determine whether the HPET timer caused the NMI. For now,
implement a stub function.
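
The NMI-handling flow described above can be sketched as follows (the
helper names are illustrative, not the exact ones in this changeset):

        /*
         * Sketch: the handling CPU inspects itself, prods the other
         * monitored CPUs with an NMI IPI shorthand and re-arms the
         * HPET channel.
         */
        static int hpet_hld_nmi_handler(unsigned int type, struct pt_regs *regs)
        {
                if (is_handling_cpu()) {                /* hypothetical */
                        inspect_for_hardlockups(regs);
                        if (other_cpus_monitored())     /* hypothetical */
                                apic->send_IPI_allbutself(NMI_VECTOR);
                        kick_timer();                   /* hypothetical */
                } else {
                        /* CPUs reached via the IPI inspect themselves. */
                        inspect_for_hardlockups(regs);
                }

                return NMI_HANDLED;
        }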

Cc: Andi Kleen 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Squashed a previously separate patch to support interrupt remapping into
   this patch. There is no need to handle interrupt remapping separately.
   All the necessary plumbing is done in the interrupt subsystem. Now it
   uses request_irq().
 * Use IPI shorthands to send an NMI to the CPUs being monitored. (Thomas)
 * Added extra check to only use the HPET hardlockup detector if the IPI
   shorthands are enabled. (Thomas)
 * Relocated flushing of outstanding interrupts from enable_timer() to
   disable_timer(). On some systems, making any change in the
   configuration of the HPET channel causes it to issue an interrupt.
 * Added a new cpumask to function as a per-cpu test bit to determine if
   a CPU should check for hardlockups.
 * Dropped pointless X86_64 || X86_32 check in Kconfig. (Tony)
 * Dropped pointless dependency on CONFIG_HPET.
 * Added dependency on CONFIG_GENERIC_MSI_IRQ, needed to build the [|IR]-
   HPET-MSI irq_chip.
 * Added hardlockup_detector_hpet_start() to be used when tsc_khz is
   recalibrated.
 * Reworked the periodic setting the HPET channel. Rather than changing it
   every time the channel is disabled or enabled, do it only once. While
   at here, wrap the code in an initial setup function.
 * Implemented hardlockup_detector_hpet_start() to be called when tsc_khz
   is refined.
 * Enhanced inline comments for clarity.
 * Added missing #include files.
 * Relocated function declarations to not depend on CONFIG_HPET_TIMER.

Changes since v4:
 * Dropped hpet_hld_data.enabled_cpus and instead use cpumask_weight().
 * Renamed hpet_hld_data.cpu_monitored_mask to
   hld_data_data.cpu_monitored_mask and converted it to cpumask_var_t.
 * Flushed out any outstanding interrupt before enabling the HPET channel.
 * Removed unnecessary MSI_DATA_LEVEL_ASSERT from the MSI message.
 * Added comments in hardlockup_detector_nmi_handler() to explain how
   CPUs are targeted for an IPI.
 * Updated code to only issue an IPI when needed (i.e., there are monitored
   CPUs to be inspected via an IPI).
 * Reworked hardlockup_detector_hpet_init() for readability.
 * Now reserve the cpumasks in the hardlockup detector code and not in the
   generic HPET code.
 * Handled the case of watchdog_thresh = 0 when disabling the detector.
 * Made this detector available to i386.
 * Reworked logic to kick the timer to remove a local variable. (Andi)
 * Added a comment on what type of timer channel will be assigned to the
   detector. (Andi)
 * Reworded prompt comment in Kconfig. (Andi)
 * Removed unneeded switch to level interrupt mode when disabling the
   timer. (Andi)
 * Disabled the HPET timer to avoid a race between an incoming interrupt
   and an update of the MSI destination ID. (Ashok)
 * Corrected a typo in an inline comment. (Tony)
 * Made the HPET hardlockup detector depend on HARDLOCKUP_DETECTOR instead
   of selecting it.

Changes since v3:
 * Fixed typo in Kconfig.debug. (Randy Dunlap)
 * Added missing slab.h to include the definition of kfree to fix a build
   break.

Changes since v2:
 * Removed use of struct cpumask in favor of a variable length array in
   conjunction with kzalloc. (Peter Zijlstra)
 * Removed redundant documentation of functions. (Thomas Gleixner)
 * Added CPU as argument hardlockup_detector_hpet_enable()/disable().
   (Thomas Gleixner).

Changes since v1:
 * Do not target CPUs in a round-robin manner. Instead, the HPET timer
   always targets the same CPU; other CPUs are monitored via an
   interprocessor interrupt.
 * Dropped support for IO APIC interrupts and instead use only MSI
   interrupts.
 * Removed

[PATCH v6 24/29] watchdog/hardlockup: Use parse_option_str() to handle "nmi_watchdog"

2022-05-05 Thread Ricardo Neri
Prepare hardlockup_panic_setup() to handle a comma-separated list of
options. Thus, it can continue parsing its own command-line options while
ignoring parameters that are relevant only to specific implementations of
the hardlockup detector. Such implementations may use an early_param to
parse their own options.
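
With parse_option_str(), each handler looks only for its own token in the
comma-separated string and ignores the rest. Illustratively:

        /*
         * For "nmi_watchdog=panic,hpet", the generic handler matches
         * "panic" and the HPET early_param matches "hpet"; neither
         * trips on the token meant for the other.
         */
        parse_option_str("panic,hpet", "panic");        /* true */
        parse_option_str("panic,hpet", "hpet");         /* true */
        parse_option_str("panic,hpet", "1");            /* false */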

Cc: Andi Kleen 
Cc: Nicholas Piggin 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Corrected typo in commit message. (Tony)

Changes since v4:
 * None

Changes since v3:
 * None

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * None
---
 kernel/watchdog.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 9166220457bc..6443841a755f 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -73,13 +73,13 @@ void __init hardlockup_detector_disable(void)
 
 static int __init hardlockup_panic_setup(char *str)
 {
-   if (!strncmp(str, "panic", 5))
+   if (parse_option_str(str, "panic"))
hardlockup_panic = 1;
-   else if (!strncmp(str, "nopanic", 7))
+   else if (parse_option_str(str, "nopanic"))
hardlockup_panic = 0;
-   else if (!strncmp(str, "0", 1))
+   else if (parse_option_str(str, "0"))
nmi_watchdog_user_enabled = 0;
-   else if (!strncmp(str, "1", 1))
+   else if (parse_option_str(str, "1"))
nmi_watchdog_user_enabled = 1;
return 1;
 }
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v6 20/29] init/main: Delay initialization of the lockup detector after smp_init()

2022-05-05 Thread Ricardo Neri
Certain implementations of the hardlockup detector require support for
Inter-Processor Interrupt shorthands. On x86, support for these can only
be determined after all the possible CPUs have booted once (in
smp_init()). Other architectures may not need such a check.

lockup_detector_init() only performs the initialization of the lockup
detector's data structures; it does not depend on anything done by
smp_init(). Hence, it is safe to call it afterwards.

Cc: Andi Kleen 
Cc: Nicholas Piggin 
Cc: Andrew Morton 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 init/main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/init/main.c b/init/main.c
index 98182c3c2c4b..62c52c9e4c2b 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1600,9 +1600,11 @@ static noinline void __init kernel_init_freeable(void)
 
rcu_init_tasks_generic();
do_pre_smp_initcalls();
-   lockup_detector_init();
 
smp_init();
+
+   lockup_detector_init();
+
sched_init_smp();
 
padata_init();
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v6 21/29] x86/nmi: Add an NMI_WATCHDOG NMI handler category

2022-05-05 Thread Ricardo Neri
Add NMI_WATCHDOG as a new category of NMI handler. This new category
is to be used with the HPET-based hardlockup detector. This detector
does not have a direct way of checking if the HPET timer is the source of
the NMI. Instead, it indirectly estimates it using the time-stamp counter.

Therefore, there may be false positives when another NMI occurs within
the estimated time window. For this reason, the handler of the detector
should be called after all the NMI_LOCAL handlers. A simple way of
achieving this is with a new NMI handler category.
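
A detector would then register in the new category roughly as follows (a
sketch; the handler name and its internal check are illustrative):

        static int hld_hpet_nmi(unsigned int type, struct pt_regs *regs)
        {
                if (!nmi_came_from_hpet())      /* hypothetical TSC check */
                        return NMI_DONE;        /* let others claim it */

                inspect_for_hardlockups(regs);
                return NMI_HANDLED;
        }

        /* From the detector init path; runs after NMI_LOCAL handlers. */
        register_nmi_handler(NMI_WATCHDOG, hld_hpet_nmi, 0, "hld_hpet");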

Cc: Andi Kleen 
Cc: Andrew Morton 
Cc: "Ravi V. Shankar" 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Updated to call instrumentation_end() as per f051f6979550 ("x86/nmi:
   Protect NMI entry against instrumentation")

Changes since v4:
 * None

Changes since v3:
 * None

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 arch/x86/include/asm/nmi.h |  1 +
 arch/x86/kernel/nmi.c  | 10 ++
 2 files changed, 11 insertions(+)

diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
index 1cb9c17a4cb4..4a0d5b562c91 100644
--- a/arch/x86/include/asm/nmi.h
+++ b/arch/x86/include/asm/nmi.h
@@ -28,6 +28,7 @@ enum {
NMI_UNKNOWN,
NMI_SERR,
NMI_IO_CHECK,
+   NMI_WATCHDOG,
NMI_MAX
 };
 
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index e73f7df362f5..fde387e0812a 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -61,6 +61,10 @@ static struct nmi_desc nmi_desc[NMI_MAX] =
.lock = __RAW_SPIN_LOCK_UNLOCKED(&nmi_desc[3].lock),
.head = LIST_HEAD_INIT(nmi_desc[3].head),
},
+   {
+   .lock = __RAW_SPIN_LOCK_UNLOCKED(&nmi_desc[4].lock),
+   .head = LIST_HEAD_INIT(nmi_desc[4].head),
+   },
 
 };
 
@@ -168,6 +172,8 @@ int __register_nmi_handler(unsigned int type, struct nmiaction *action)
 */
WARN_ON_ONCE(type == NMI_SERR && !list_empty(&desc->head));
WARN_ON_ONCE(type == NMI_IO_CHECK && !list_empty(&desc->head));
+   WARN_ON_ONCE(type == NMI_WATCHDOG && !list_empty(&desc->head));
+
 
/*
 * some handlers need to be executed first otherwise a fake
@@ -379,6 +385,10 @@ static noinstr void default_do_nmi(struct pt_regs *regs)
}
raw_spin_unlock(&nmi_reason_lock);
 
+   handled = nmi_handle(NMI_WATCHDOG, regs);
+   if (handled == NMI_HANDLED)
+   goto out;
+
/*
 * Only one NMI can be latched at a time.  To handle
 * this we may process multiple nmi handlers at once to
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v6 18/29] watchdog/hardlockup: Define a generic function to detect hardlockups

2022-05-05 Thread Ricardo Neri
The procedure to detect hardlockups is independent of the underlying
mechanism that generates the non-maskable interrupt used to drive the
detector. Thus, it can be put in a separate, generic function. In this
manner, it can be invoked by various implementations of the NMI watchdog.

For this purpose, move the bulk of watchdog_overflow_callback() to the
new function inspect_for_hardlockups(). This function can then be called
from the applicable NMI handlers. No functional changes.

Cc: Andi Kleen 
Cc: Nicholas Piggin 
Cc: Andrew Morton 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * None

Changes since v4:
 * None

Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * None
---
 include/linux/nmi.h   |  1 +
 kernel/watchdog_hld.c | 18 +++---
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 750c7f395ca9..1b68f48ad440 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
int proc_nmi_watchdog(struct ctl_table *, int , void *, size_t *, loff_t *);
 int proc_soft_watchdog(struct ctl_table *, int , void *, size_t *, loff_t *);
 int proc_watchdog_thresh(struct ctl_table *, int , void *, size_t *, loff_t *);
 int proc_watchdog_cpumask(struct ctl_table *, int, void *, size_t *, loff_t *);
+void inspect_for_hardlockups(struct pt_regs *regs);
 
 #ifdef CONFIG_HAVE_ACPI_APEI_NMI
 #include 
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index 247bf0b1582c..b352e507b17f 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -106,14 +106,8 @@ static struct perf_event_attr wd_hw_attr = {
.disabled   = 1,
 };
 
-/* Callback function for perf event subsystem */
-static void watchdog_overflow_callback(struct perf_event *event,
-  struct perf_sample_data *data,
-  struct pt_regs *regs)
+void inspect_for_hardlockups(struct pt_regs *regs)
 {
-   /* Ensure the watchdog never gets throttled */
-   event->hw.interrupts = 0;
-
if (__this_cpu_read(watchdog_nmi_touch) == true) {
__this_cpu_write(watchdog_nmi_touch, false);
return;
@@ -163,6 +157,16 @@ static void watchdog_overflow_callback(struct perf_event *event,
return;
 }
 
+/* Callback function for perf event subsystem */
+static void watchdog_overflow_callback(struct perf_event *event,
+  struct perf_sample_data *data,
+  struct pt_regs *regs)
+{
+   /* Ensure the watchdog never gets throttled */
+   event->hw.interrupts = 0;
+   inspect_for_hardlockups(regs);
+}
+
 static int hardlockup_detector_event_create(void)
 {
unsigned int cpu = smp_processor_id();
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v6 19/29] watchdog/hardlockup: Decouple the hardlockup detector from perf

2022-05-05 Thread Ricardo Neri
The current default implementation of the hardlockup detector assumes that
it is implemented using perf events. However, the hardlockup detector can
be driven by other sources of non-maskable interrupts (e.g., a properly
configured timer).

Group and wrap in #ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF all the code
specific to perf: create and manage perf events, stop and start the perf-
based detector.

The generic portion of the detector (monitoring the watchdog thresholds,
checking timestamps, detecting hardlockups, as well as the implementation
of arch_touch_nmi_watchdog()) is now selected with the new intermediate
config symbol CONFIG_HARDLOCKUP_DETECTOR_CORE.

The perf-based implementation of the detector selects the new intermediate
symbol. Other implementations should do the same.

Cc: Andi Kleen 
Cc: Nicholas Piggin 
Cc: Andrew Morton 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * None

Changes since v4:
 * None

Changes since v3:
 * Squashed into this patch a previous patch to make
   arch_touch_nmi_watchdog() part of the core detector code.

Changes since v2:
 * Undid split of the generic hardlockup detector into a separate file.
   (Thomas Gleixner)
 * Added a new intermediate symbol CONFIG_HARDLOCKUP_DETECTOR_CORE to
   select generic parts of the detector (Paul E. McKenney,
   Thomas Gleixner).

Changes since v1:
 * Make the generic detector code with CONFIG_HARDLOCKUP_DETECTOR.
---
 include/linux/nmi.h   |  5 -
 kernel/Makefile   |  2 +-
 kernel/watchdog_hld.c | 32 
 lib/Kconfig.debug |  4 
 4 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 1b68f48ad440..cf12380e51b3 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -94,8 +94,11 @@ static inline void hardlockup_detector_disable(void) {}
 # define NMI_WATCHDOG_SYSCTL_PERM  0444
 #endif
 
-#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_CORE)
 extern void arch_touch_nmi_watchdog(void);
+#endif
+
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
 extern void hardlockup_detector_perf_stop(void);
 extern void hardlockup_detector_perf_restart(void);
 extern void hardlockup_detector_perf_disable(void);
diff --git a/kernel/Makefile b/kernel/Makefile
index 847a82bfe0e3..27e75b735ef7 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -95,7 +95,7 @@ obj-$(CONFIG_FAIL_FUNCTION) += fail_function.o
 obj-$(CONFIG_KGDB) += debug/
 obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
 obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
-obj-$(CONFIG_HARDLOCKUP_DETECTOR_PERF) += watchdog_hld.o
+obj-$(CONFIG_HARDLOCKUP_DETECTOR_CORE) += watchdog_hld.o
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index b352e507b17f..bb6435978c46 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -22,12 +22,8 @@
 
 static DEFINE_PER_CPU(bool, hard_watchdog_warn);
 static DEFINE_PER_CPU(bool, watchdog_nmi_touch);
-static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
-static DEFINE_PER_CPU(struct perf_event *, dead_event);
-static struct cpumask dead_events_mask;
 
 static unsigned long hardlockup_allcpu_dumped;
-static atomic_t watchdog_cpus = ATOMIC_INIT(0);
 
 notrace void arch_touch_nmi_watchdog(void)
 {
@@ -98,14 +94,6 @@ static inline bool watchdog_check_timestamp(void)
 }
 #endif
 
-static struct perf_event_attr wd_hw_attr = {
-   .type   = PERF_TYPE_HARDWARE,
-   .config = PERF_COUNT_HW_CPU_CYCLES,
-   .size   = sizeof(struct perf_event_attr),
-   .pinned = 1,
-   .disabled   = 1,
-};
-
 void inspect_for_hardlockups(struct pt_regs *regs)
 {
if (__this_cpu_read(watchdog_nmi_touch) == true) {
@@ -157,6 +145,24 @@ void inspect_for_hardlockups(struct pt_regs *regs)
return;
 }
 
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF
+#undef pr_fmt
+#define pr_fmt(fmt) "NMI perf watchdog: " fmt
+
+static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
+static DEFINE_PER_CPU(struct perf_event *, dead_event);
+static struct cpumask dead_events_mask;
+
+static atomic_t watchdog_cpus = ATOMIC_INIT(0);
+
+static struct perf_event_attr wd_hw_attr = {
+   .type   = PERF_TYPE_HARDWARE,
+   .config = PERF_COUNT_HW_CPU_CYCLES,
+   .size   = sizeof(struct perf_event_attr),
+   .pinned = 1,
+   .disabled   = 1,
+};
+
 /* Callback function for perf event subsystem */
 static void watchdog_overflow_callback(struct perf_event *event,
   struct perf_sample_data *data,
@@ -298,3 +304,5 @@ int __init hardlockup_detector_perf_init(void)
}
   

[PATCH v6 17/29] x86/hpet: Reserve an HPET channel for the hardlockup detector

2022-05-05 Thread Ricardo Neri
The HPET hardlockup detector needs a dedicated HPET channel. Hence, create
a new HPET_MODE_NMI_WATCHDOG mode category to indicate that the channel
cannot be used for other purposes. Using MSI interrupts greatly simplifies the
implementation of the detector. Specifically, it helps to avoid the
complexities of routing the interrupt via the IO-APIC (e.g., potential
race conditions that arise from re-programming the IO-APIC while also
servicing an NMI). Therefore, only reserve the timer if it supports Front
Side Bus interrupt delivery.

HPET channels are reserved at various stages. First, from
x86_late_time_init(), hpet_time_init() checks if the HPET timer supports
Legacy Replacement Routing. If this is the case, channels 0 and 1 are
reserved as HPET_MODE_LEGACY.

At a later stage, from lockup_detector_init(), reserve the HPET channel
for the hardlockup detector. Then, the HPET clocksource reserves the
channels it needs, and the remaining channels are given to the HPET
char driver via hpet_alloc().

Hence, the channel assigned to the HPET hardlockup detector depends on
whether the first two channels are reserved for legacy mode.

Lastly, only reserve the channel for the hardlockup detector if enabled
in the kernel command line.
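
From the detector side, the reservation would be used roughly like this (a
sketch; the follow-up setup step is hypothetical):

        struct hpet_hld_data *hdata = hpet_hld_get_timer();

        if (!hdata)
                return -ENODEV;         /* fall back to the perf detector */

        if (setup_hld_channel(hdata)) { /* hypothetical */
                hpet_hld_free_timer(hdata);
                return -ENODEV;
        }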

Cc: Andi Kleen 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Added a check for the allowed maximum frequency of the HPET.
 * Added hpet_hld_free_timer() to properly free the reserved HPET channel
   if the initialization is not completed.
 * Call hpet_assign_irq() with as_nmi = true.
 * Relocated declarations of functions and data structures of the detector
   to not depend on CONFIG_HPET_TIMER.

Changes since v4:
 * Reworked timer reservation to use Thomas' rework on HPET channel
   management.
 * Removed hard-coded channel number for the hardlockup detector.
 * Provided more details on the sequence of HPET channel reservations.
   (Thomas Gleixner)
 * Only reserve a channel for the hardlockup detector if enabled via
   kernel command line. The function reserving the channel is called from
   hardlockup detector. (Thomas Gleixner)
 * Shorten the name of hpet_hardlockup_detector_get_timer() to
   hpet_hld_get_timer(). (Andi)
 * Simplify error handling when a channel is not found. (Tony)

Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * None
---
 arch/x86/include/asm/hpet.h |  22 
 arch/x86/kernel/hpet.c  | 105 
 2 files changed, 127 insertions(+)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 486e001413c7..5762bd0169a1 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -103,4 +103,26 @@ static inline int is_hpet_enabled(void) { return 0; }
 #define default_setup_hpet_msi NULL
 
 #endif
+
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+/**
+ * struct hpet_hld_data - Data needed to operate the detector
+ * @has_periodic:  The HPET channel supports periodic mode
+ * @channel:   HPET channel assigned to the detector
+ * @channel_priv:  Private data of the assigned channel
+ * @ticks_per_second:  Frequency of the HPET timer
+ * @irq:   IRQ number assigned to the HPET channel
+ */
+struct hpet_hld_data {
+   boolhas_periodic;
+   u32 channel;
+   struct hpet_channel *channel_priv;
+   u64 ticks_per_second;
+   int irq;
+};
+
+extern struct hpet_hld_data *hpet_hld_get_timer(void);
+extern void hpet_hld_free_timer(struct hpet_hld_data *hdata);
+#endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
+
 #endif /* _ASM_X86_HPET_H */
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 02d25e00e93f..ee9275c013f5 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -20,6 +20,7 @@ enum hpet_mode {
HPET_MODE_LEGACY,
HPET_MODE_CLOCKEVT,
HPET_MODE_DEVICE,
+   HPET_MODE_NMI_WATCHDOG,
 };
 
 struct hpet_channel {
@@ -216,6 +217,7 @@ static void __init hpet_reserve_platform_timers(void)
break;
case HPET_MODE_CLOCKEVT:
case HPET_MODE_LEGACY:
+   case HPET_MODE_NMI_WATCHDOG:
hpet_reserve_timer(&hd, hc->num);
break;
}
@@ -1496,3 +1498,106 @@ irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id)
 }
 EXPORT_SYMBOL_GPL(hpet_rtc_interrupt);
 #endif
+
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+
+/*
+ * We program the timer in 32-bit mode to reduce the number of register
+ * accesses. The maximum value of watch_thresh is 60 seconds. The HPET counter
+ * should not wrap around more frequently than that. Thus, the frequency of the
+ * HPET timer must be l

[PATCH v6 08/29] iommu/vt-d: Rework prepare_irte() to support per-IRQ delivery mode

2022-05-05 Thread Ricardo Neri
struct irq_cfg::delivery_mode specifies the delivery mode of each IRQ
separately. Configuring the delivery mode of an IRTE would require adding
a third argument to prepare_irte(). Instead, simply take a pointer to the
irq_cfg for which an IRTE is being configured. No functional changes.

Cc: Andi Kleen 
Cc: David Woodhouse 
Cc: "Ravi V. Shankar" 
Cc: Lu Baolu 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Ashok Raj 
Reviewed-by: Tony Luck 
Reviewed-by: Lu Baolu 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Only change the signature of prepare_irte(). A separate patch changes
   the setting of the delivery_mode.

Changes since v4:
 * None

Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * Introduced this patch.
---
 drivers/iommu/intel/irq_remapping.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index d2764a71f91a..66d37186ec28 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -,7 +,7 @@ void intel_irq_remap_add_device(struct dmar_pci_notify_info *info)
dev_set_msi_domain(&info->dev->dev, map_dev_to_ir(info->dev));
 }
 
-static void prepare_irte(struct irte *irte, int vector, unsigned int dest)
+static void prepare_irte(struct irte *irte, struct irq_cfg *irq_cfg)
 {
memset(irte, 0, sizeof(*irte));
 
@@ -1126,8 +1126,8 @@ static void prepare_irte(struct irte *irte, int vector, unsigned int dest)
*/
irte->trigger_mode = 0;
irte->dlvry_mode = apic->delivery_mode;
-   irte->vector = vector;
-   irte->dest_id = IRTE_DEST(dest);
+   irte->vector = irq_cfg->vector;
+   irte->dest_id = IRTE_DEST(irq_cfg->dest_apicid);
 
/*
 * When using the destination mode of physical APICID, only the
@@ -1278,8 +1278,7 @@ static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data,
 {
struct irte *irte = &data->irte_entry;
 
-   prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid);
-
+   prepare_irte(irte, irq_cfg);
switch (info->type) {
case X86_IRQ_ALLOC_TYPE_IOAPIC:
/* Set source-id of interrupt request */
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v6 01/29] irq/matrix: Expose functions to allocate the best CPU for new vectors

2022-05-05 Thread Ricardo Neri
Certain types of interrupts, such as NMI, do not have an associated vector.
They, however, target specific CPUs. Thus, when assigning the destination
CPU, it is beneficial to select the one with the lowest number of vectors.
Prepend the functions matrix_find_best_cpu() and
matrix_find_best_cpu_managed() with the irq_ prefix and expose them for
IRQ controllers to use when allocating and activating vector-less IRQs.
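
An IRQ controller would then pick the destination CPU of a vector-less NMI
roughly as follows (a sketch; vector_matrix refers to x86's global matrix
and apicd stands for illustrative per-IRQ data):

        /* Choose the least-loaded CPU without consuming a vector. */
        cpu = irq_matrix_find_best_cpu(vector_matrix, dest_mask);
        if (cpu == UINT_MAX)
                return -ENOSPC;

        apicd->cpu = cpu;       /* target the NMI at this CPU */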

Cc: Andi Kleen 
Cc: "Ravi V. Shankar" 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch.

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 include/linux/irq.h |  4 
 kernel/irq/matrix.c | 32 +++-
 2 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/include/linux/irq.h b/include/linux/irq.h
index f92788ccdba2..9e674e73d295 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
struct irq_matrix *irq_alloc_matrix(unsigned int matrix_bits,
 void irq_matrix_online(struct irq_matrix *m);
 void irq_matrix_offline(struct irq_matrix *m);
void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit, bool replace);
+unsigned int irq_matrix_find_best_cpu(struct irq_matrix *m,
+ const struct cpumask *msk);
+unsigned int irq_matrix_find_best_cpu_managed(struct irq_matrix *m,
+ const struct cpumask *msk);
int irq_matrix_reserve_managed(struct irq_matrix *m, const struct cpumask *msk);
void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk);
 int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk,
diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c
index 1698e77645ac..810479f608f4 100644
--- a/kernel/irq/matrix.c
+++ b/kernel/irq/matrix.c
@@ -125,9 +125,16 @@ static unsigned int matrix_alloc_area(struct irq_matrix *m, struct cpumap *cm,
return area;
 }
 
-/* Find the best CPU which has the lowest vector allocation count */
-static unsigned int matrix_find_best_cpu(struct irq_matrix *m,
-   const struct cpumask *msk)
+/**
+ * irq_matrix_find_best_cpu() - Find the best CPU for an IRQ
+ * @m: Matrix pointer
+ * @msk:   On which CPUs the search will be performed
+ *
+ * Find the best CPU which has the lowest vector allocation count
+ * Returns: The best CPU to use
+ */
+unsigned int irq_matrix_find_best_cpu(struct irq_matrix *m,
+ const struct cpumask *msk)
 {
unsigned int cpu, best_cpu, maxavl = 0;
struct cpumap *cm;
@@ -146,9 +153,16 @@ static unsigned int matrix_find_best_cpu(struct irq_matrix *m,
return best_cpu;
 }
 
-/* Find the best CPU which has the lowest number of managed IRQs allocated */
-static unsigned int matrix_find_best_cpu_managed(struct irq_matrix *m,
-   const struct cpumask *msk)
+/**
+ * irq_matrix_find_best_cpu_managed() - Find the best CPU for a managed IRQ
+ * @m: Matrix pointer
+ * @msk:   On which CPUs the search will be performed
+ *
+ * Find the best CPU which has the lowest number of managed IRQs allocated
+ * Returns: The best CPU to use
+ */
+unsigned int irq_matrix_find_best_cpu_managed(struct irq_matrix *m,
+ const struct cpumask *msk)
 {
unsigned int cpu, best_cpu, allocated = UINT_MAX;
struct cpumap *cm;
@@ -292,7 +306,7 @@ int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk,
if (cpumask_empty(msk))
return -EINVAL;
 
-   cpu = matrix_find_best_cpu_managed(m, msk);
+   cpu = irq_matrix_find_best_cpu_managed(m, msk);
if (cpu == UINT_MAX)
return -ENOSPC;
 
@@ -381,13 +395,13 @@ int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk,
struct cpumap *cm;
 
/*
-* Not required in theory, but matrix_find_best_cpu() uses
+* Not required in theory, but irq_matrix_find_best_cpu() uses
 * for_each_cpu() which ignores the cpumask on UP .
 */
if (cpumask_empty(msk))
return -EINVAL;
 
-   cpu = matrix_find_best_cpu(m, msk);
+   cpu = irq_matrix_find_best_cpu(m, msk);
if (cpu == UINT_MAX)
return -ENOSPC;
 
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v6 16/29] x86/hpet: Prepare IRQ assignments to use the X86_ALLOC_AS_NMI flag

2022-05-05 Thread Ricardo Neri
The flag X86_ALLOC_AS_NMI indicates that the IRQs to be allocated in an
IRQ domain need to be configured as NMIs.  Add an as_nmi argument to
hpet_assign_irq(). Even though the HPET clock events do not need NMI
IRQs, the HPET hardlockup detector does. A subsequent changeset will
implement the reservation of a channel for it.
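
The detector-side reservation can then request an NMI explicitly (a sketch
based on the new signature):

        /* Ask the IRQ hierarchy to configure the channel's IRQ as NMI. */
        irq = hpet_assign_irq(hpet_domain, hc, hc->num, true);
        if (irq <= 0)
                return -ENODEV;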

Cc: Andi Kleen 
Cc: "Ravi V. Shankar" 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Suggested-by: Thomas Gleixner 
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch.

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 arch/x86/kernel/hpet.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 2c6713b40921..02d25e00e93f 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -618,7 +618,7 @@ static inline int hpet_dev_id(struct irq_domain *domain)
 }
 
 static int hpet_assign_irq(struct irq_domain *domain, struct hpet_channel *hc,
-  int dev_num)
+  int dev_num, bool as_nmi)
 {
struct irq_alloc_info info;
 
@@ -627,6 +627,8 @@ static int hpet_assign_irq(struct irq_domain *domain, struct hpet_channel *hc,
info.data = hc;
info.devid = hpet_dev_id(domain);
info.hwirq = dev_num;
+   if (as_nmi)
+   info.flags |= X86_IRQ_ALLOC_AS_NMI;
 
return irq_domain_alloc_irqs(domain, 1, NUMA_NO_NODE, &info);
 }
@@ -755,7 +757,7 @@ static void __init hpet_select_clockevents(void)
 
sprintf(hc->name, "hpet%d", i);
 
-   irq = hpet_assign_irq(hpet_domain, hc, hc->num);
+   irq = hpet_assign_irq(hpet_domain, hc, hc->num, false);
if (irq <= 0)
continue;
 
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v6 15/29] x86/hpet: Add helper function hpet_set_comparator_periodic()

2022-05-05 Thread Ricardo Neri
Programming an HPET channel as periodic requires setting the
HPET_TN_SETVAL bit in the channel configuration. Plus, the comparator
register must be written twice (once for the comparator value and once for
the periodic value). Since this programming might be needed in several
places (e.g., the HPET clocksource and the HPET-based hardlockup detector),
add a helper function for this purpose.

A helper function hpet_set_comparator_oneshot() could also be implemented.
However, such a function would only program the comparator register, and
it would be quite small. Hence, it is better not to bloat the code with
such an obvious function.
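
A caller can then program a periodic channel in a single step, e.g. (a
sketch; delta is the period in HPET ticks):

        /* Fire every 'delta' ticks, starting 'delta' ticks from now. */
        now = hpet_readl(HPET_COUNTER);
        hpet_set_comparator_periodic(channel, now + delta, delta);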

Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Originally-by: Suravee Suthikulpanit 
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
When programming the HPET channel in periodic mode, a udelay(1) between
the two successive writes to HPET_Tn_CMP was introduced in commit
e9e2cdb41241 ("[PATCH] clockevents: i386 drivers"). The commit message
does not give any reason for such delay. The hardware specification does
not seem to require it. The refactoring in this patch simply carries such
delay.
---
Changes since v5:
 * None

Changes since v4:
 * Implement function only for periodic mode. This removed extra logic to
   to use a non-zero period value as a proxy for periodic mode
   programming. (Thomas)
 * Added a comment on the history of the udelay() when programming the
   channel in periodic mode. (Ashok)

Changes since v3:
 * Added back a missing hpet_writel() for time configuration.

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 arch/x86/include/asm/hpet.h |  2 ++
 arch/x86/kernel/hpet.c  | 49 -
 2 files changed, 39 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index be9848f0883f..486e001413c7 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -74,6 +74,8 @@ extern void hpet_disable(void);
 extern unsigned int hpet_readl(unsigned int a);
 extern void hpet_writel(unsigned int d, unsigned int a);
 extern void force_hpet_resume(void);
+extern void hpet_set_comparator_periodic(int channel, unsigned int cmp,
+unsigned int period);
 
 #ifdef CONFIG_HPET_EMULATE_RTC
 
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 47678e7927ff..2c6713b40921 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -294,6 +294,39 @@ static void hpet_enable_legacy_int(void)
hpet_legacy_int_enabled = true;
 }
 
+/**
+ * hpet_set_comparator_periodic() - Helper function to set periodic channel
+ * @channel:   The HPET channel
+ * @cmp:   The value to be written to the comparator/accumulator
+ * @period:Number of ticks per period
+ *
+ * Helper function for updating comparator, accumulator and period values.
+ *
+ * In periodic mode, HPET needs HPET_TN_SETVAL to be set before writing
+ * to the Tn_CMP to update the accumulator. Then, HPET needs a second
+ * write (with HPET_TN_SETVAL cleared) to Tn_CMP to set the period.
+ * The HPET_TN_SETVAL bit is automatically cleared after the first write.
+ *
+ * This function takes a 1 microsecond delay. However, this function is supposed
+ * to be called only once (or when reprogramming the timer) as it deals with a
+ * periodic timer channel.
+ *
+ * See the following documents:
+ *   - Intel IA-PC HPET (High Precision Event Timers) Specification
+ *   - AMD-8111 HyperTransport I/O Hub Data Sheet, Publication # 24674
+ */
+void hpet_set_comparator_periodic(int channel, unsigned int cmp, unsigned int period)
+{
+   unsigned int v = hpet_readl(HPET_Tn_CFG(channel));
+
+   hpet_writel(v | HPET_TN_SETVAL, HPET_Tn_CFG(channel));
+
+   hpet_writel(cmp, HPET_Tn_CMP(channel));
+
+   udelay(1);
+   hpet_writel(period, HPET_Tn_CMP(channel));
+}
+
 static int hpet_clkevt_set_state_periodic(struct clock_event_device *evt)
 {
unsigned int channel = clockevent_to_channel(evt)->num;
@@ -306,19 +339,11 @@ static int hpet_clkevt_set_state_periodic(struct clock_event_device *evt)
now = hpet_readl(HPET_COUNTER);
cmp = now + (unsigned int)delta;
cfg = hpet_readl(HPET_Tn_CFG(channel));
-   cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_SETVAL |
-  HPET_TN_32BIT;
+   cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_32BIT;
hpet_writel(cfg, HPET_Tn_CFG(channel));
-   hpet_writel(cmp, HPET_Tn_CMP(channel));
-   udelay(1);
-   /*
-* HPET on AMD 81xx needs a second write (with HPET_TN_SETVAL
-* cleared) to T0_CMP to set the period. The HPET_TN_SETVAL
-* bit is automatically cleared after the first write.
-* (See AMD-8111 HyperTransport I/O Hub Data Sheet,
-  

[PATCH v6 03/29] x86/apic/msi: Set the delivery mode individually for each IRQ

2022-05-05 Thread Ricardo Neri
There are no restrictions in hardware that prevent each MSI message from
having its own delivery mode. Use the mode specified in the provided IRQ
hardware configuration data. Since most IRQs are configured to use the
delivery mode of the APIC driver in use (set in all of them to
APIC_DELIVERY_MODE_FIXED), the only functional changes are where IRQs
are configured to use a specific delivery mode.

Changing the utility function __irq_msi_compose_msg() takes care of
implementing the change in the local APIC, PCI-MSI, and DMAR-MSI
irq_chips.

The IO-APIC irq_chip configures the entries in the interrupt redirection
table using the delivery mode specified in the corresponding MSI message.
Since the MSI message is composed by a higher irq_chip in the hierarchy,
it does not need to be updated.
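
An irq_chip that needs a specific mode now only has to set it in its
irq_cfg before the message is composed (a sketch; irqd stands for the
IRQ's irq_data):

        struct irq_cfg *cfg = irqd_cfg(irqd);

        cfg->delivery_mode = APIC_DELIVERY_MODE_NMI;    /* per-IRQ */
        /* __irq_msi_compose_msg() now encodes NMI in the MSI message. */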

Cc: Andi Kleen 
Cc: "Ravi V. Shankar" 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 arch/x86/kernel/apic/apic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 189d3a5e471a..d1e12da1e9af 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2528,7 +2528,7 @@ void __irq_msi_compose_msg(struct irq_cfg *cfg, struct msi_msg *msg,
msg->arch_addr_lo.dest_mode_logical = apic->dest_mode_logical;
msg->arch_addr_lo.destid_0_7 = cfg->dest_apicid & 0xFF;
 
-   msg->arch_data.delivery_mode = APIC_DELIVERY_MODE_FIXED;
+   msg->arch_data.delivery_mode = cfg->delivery_mode;
msg->arch_data.vector = cfg->vector;
 
msg->address_hi = X86_MSI_BASE_ADDRESS_HIGH;
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v6 11/29] iommu/amd: Expose [set|get]_dev_entry_bit()

2022-05-05 Thread Ricardo Neri
These functions are used to check and set specific bits in a Device Table
Entry. For instance, they can be used to modify the setting of the NMIPass
field.

Currently, these functions are used only for ACPI-specified devices.
However, when an interrupt is allocated with NMI as its delivery mode, the
Device Table Entry needs to be modified accordingly in
irq_remapping_alloc().

As a first step expose these two functions. No functional changes.

Cc: Andi Kleen 
Cc: "Ravi V. Shankar" 
Cc: Joerg Roedel 
Cc: Suravee Suthikulpanit 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 drivers/iommu/amd/amd_iommu.h | 3 +++
 drivers/iommu/amd/init.c  | 4 ++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h
index 1ab31074f5b3..9f3d1564c84e 100644
--- a/drivers/iommu/amd/amd_iommu.h
+++ b/drivers/iommu/amd/amd_iommu.h
@@ -128,4 +128,7 @@ static inline void amd_iommu_apply_ivrs_quirks(void) { }
 
 extern void amd_iommu_domain_set_pgtable(struct protection_domain *domain,
 u64 *root, int mode);
+
+extern void set_dev_entry_bit(u16 devid, u8 bit);
+extern int get_dev_entry_bit(u16 devid, u8 bit);
 #endif
diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index b4a798c7b347..823e76b284f1 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -914,7 +914,7 @@ static void iommu_enable_gt(struct amd_iommu *iommu)
 }
 
 /* sets a specific bit in the device table entry. */
-static void set_dev_entry_bit(u16 devid, u8 bit)
+void set_dev_entry_bit(u16 devid, u8 bit)
 {
int i = (bit >> 6) & 0x03;
int _bit = bit & 0x3f;
@@ -922,7 +922,7 @@ static void set_dev_entry_bit(u16 devid, u8 bit)
amd_iommu_dev_table[devid].data[i] |= (1UL << _bit);
 }
 
-static int get_dev_entry_bit(u16 devid, u8 bit)
+int get_dev_entry_bit(u16 devid, u8 bit)
 {
int i = (bit >> 6) & 0x03;
int _bit = bit & 0x3f;
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v6 14/29] x86/hpet: Expose hpet_writel() in header

2022-05-05 Thread Ricardo Neri
In order to allow hpet_writel() to be used by other components (e.g.,
the HPET-based hardlockup detector), expose it in the HPET header file.

Cc: Andi Kleen 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * None

Changes since v4:
 * Dropped exposing hpet_readq() as it is not needed.

Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * None
---
 arch/x86/include/asm/hpet.h | 1 +
 arch/x86/kernel/hpet.c  | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index ab9f3dd87c80..be9848f0883f 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -72,6 +72,7 @@ extern int is_hpet_enabled(void);
 extern int hpet_enable(void);
 extern void hpet_disable(void);
 extern unsigned int hpet_readl(unsigned int a);
+extern void hpet_writel(unsigned int d, unsigned int a);
 extern void force_hpet_resume(void);
 
 #ifdef CONFIG_HPET_EMULATE_RTC
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 71f336425e58..47678e7927ff 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -79,7 +79,7 @@ inline unsigned int hpet_readl(unsigned int a)
return readl(hpet_virt_address + a);
 }
 
-static inline void hpet_writel(unsigned int d, unsigned int a)
+inline void hpet_writel(unsigned int d, unsigned int a)
 {
writel(d, hpet_virt_address + a);
 }
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v6 13/29] iommu/amd: Compose MSI messages for NMI irqs in non-IR format

2022-05-05 Thread Ricardo Neri
If NMIPass is enabled in a device's DTE, the IOMMU lets NMI interrupt
messages pass through unmapped. Therefore, the contents of the MSI
message, not an IRTE, determine how and where the NMI is delivered.

Since the IOMMU driver owns the MSI message of the NMI irq, compose
it using the non-interrupt-remapping format. Also, let descendant
irqchips write the MSI as appropriate for the device.

Cc: Andi Kleen 
Cc: "Ravi V. Shankar" 
Cc: Joerg Roedel 
Cc: Suravee Suthikulpanit 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 drivers/iommu/amd/iommu.c | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 4d7421b6858d..6e07949b3e2a 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3111,7 +3111,16 @@ static void irq_remapping_prepare_irte(struct amd_ir_data *data,
case X86_IRQ_ALLOC_TYPE_HPET:
case X86_IRQ_ALLOC_TYPE_PCI_MSI:
case X86_IRQ_ALLOC_TYPE_PCI_MSIX:
-   fill_msi_msg(&data->msi_entry, irte_info->index);
+   if (irq_cfg->delivery_mode == APIC_DELIVERY_MODE_NMI)
+   /*
+* The IOMMU lets NMIs pass through unmapped. Thus, the
+* MSI message, not the IRTE, determines the irq
+* configuration. Since we own the MSI message,
+* compose it. Descendant irqchips will write it.
+*/
+   __irq_msi_compose_msg(irq_cfg, &data->msi_entry, true);
+   else
+   fill_msi_msg(&data->msi_entry, irte_info->index);
break;
 
default:
@@ -3509,6 +3518,18 @@ static int amd_ir_set_affinity(struct irq_data *data,
 */
send_cleanup_vector(cfg);
 
+   /*
+* When the delivery mode of an irq is NMI, the IOMMU lets the NMI
+* interrupt messages pass through unmapped. Hence, changes in the
+* destination are to be reflected in the NMI message itself, not the
+* IRTE. Thus, descendant irqchips must set the affinity as well as
+* compose and write the MSI message.
+*
+* Also, NMIs do not have an associated vector. No need for cleanup.
+*/
+   if (cfg->delivery_mode == APIC_DELIVERY_MODE_NMI)
+   return IRQ_SET_MASK_OK;
+
return IRQ_SET_MASK_OK_DONE;
 }
 
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v6 12/29] iommu/amd: Enable NMIPass when allocating an NMI irq

2022-05-05 Thread Ricardo Neri
As per the AMD I/O Virtualization Technology (IOMMU) Specification, the
AMD IOMMU only remaps fixed and arbitrated MSIs. NMIs are controlled
by the NMIPass bit of a Device Table Entry. When set, the IOMMU passes
through NMI interrupt messages unmapped. Otherwise, they are aborted.

Furthermore, Section 2.2.5 Table 19 states that the IOMMU will also
abort NMIs when the destination mode is logical.

Update the NMIPass setting of a device's DTE when an NMI irq is being
allocated. Only do so when the destination mode of the APIC is not
logical.

Cc: Andi Kleen 
Cc: "Ravi V. Shankar" 
Cc: Joerg Roedel 
Cc: Suravee Suthikulpanit 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 drivers/iommu/amd/iommu.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index a1ada7bff44e..4d7421b6858d 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3156,6 +3156,15 @@ static int irq_remapping_alloc(struct irq_domain *domain, unsigned int virq,
info->type != X86_IRQ_ALLOC_TYPE_PCI_MSIX)
return -EINVAL;
 
+   if (info->flags & X86_IRQ_ALLOC_AS_NMI) {
+   /* Only one IRQ per NMI */
+   if (nr_irqs != 1)
+   return -EINVAL;
+
+   /* NMIs are aborted when the destination mode is logical. */
+   if (apic->dest_mode_logical)
+   return -EPERM;
+   }
/*
 * With IRQ remapping enabled, don't need contiguous CPU vectors
 * to support multiple MSI interrupts.
@@ -3208,6 +3217,15 @@ static int irq_remapping_alloc(struct irq_domain *domain, unsigned int virq,
goto out_free_parent;
}
 
+   if (info->flags & X86_IRQ_ALLOC_AS_NMI) {
+   struct amd_iommu *iommu = amd_iommu_rlookup_table[devid];
+
+   if (!get_dev_entry_bit(devid, DEV_ENTRY_NMI_PASS)) {
+   set_dev_entry_bit(devid, DEV_ENTRY_NMI_PASS);
+   iommu_flush_dte(iommu, devid);
+   }
+   }
+
for (i = 0; i < nr_irqs; i++) {
irq_data = irq_domain_get_irq_data(domain, virq + i);
cfg = irq_data ? irqd_cfg(irq_data) : NULL;
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v6 00/29] x86: Implement an HPET-based hardlockup detector

2022-05-05 Thread Ricardo Neri
dling CPU).
 * Reworked hardlockup_detector_hpet_init() for readability.
 * Now reserve the cpumasks in the hardlockup detector code and not in the
   generic HPET code.
 * Handle the case of watchdog_thresh = 0 when disabling the detector.

Change since v3:
 * Fixed yet another bug in periodic programming of the HPET timer that
   prevented the system from booting.
 * Fixed computation of HPET frequency to use hpet_readl() only.
 * Added a missing #include in the watchdog_hld_hpet.c
 * Fixed various typos and grammar errors (Randy Dunlap)

Changes since v2:
 * Added functionality to switch to the perf-based hardlockup
   detector if the TSC becomes unstable (Thomas Gleixner).
 * Brought back the round-robin mechanism proposed in v1 (this time not
   using the interrupt subsystem). This also requires computing
   expiration times as in v1 (Andi Kleen, Stephane Eranian).
 * Fixed a bug in which using a periodic timer was not working (thanks
   to Suravee Suthikulpanit!).
 * In this version, I incorporate support for interrupt remapping in the
   last 4 patches so that they can be reviewed separately if needed.
 * Removed redundant documentation of functions (Thomas Gleixner).
 * Added a new category of NMI handler, NMI_WATCHDOG, which executes after
   NMI_LOCAL handlers (Andi Kleen).
 * Updated handling of "nmi_watchdog" to support comma-separated
   arguments.
 * Undid split of the generic hardlockup detector into a separate file
   (Thomas Gleixner).
 * Added a new intermediate symbol CONFIG_HARDLOCKUP_DETECTOR_CORE to
   select generic parts of the detector (Paul E. McKenney,
   Thomas Gleixner).
 * Removed use of struct cpumask in favor of a variable length array in
   conjunction with kzalloc (Peter Zijlstra).
 * Added CPU as argument hardlockup_detector_hpet_enable()/disable()
   (Thomas Gleixner).
 * Remove unnecessary export of function declarations, flags, and bit
   fields (Thomas Gleixner).
 * Removed unnecessary check for FSB support when reserving the timer for
   the detector (Thomas Gleixner).
 * Separated TSC code from HPET code in kick_timer() (Thomas Gleixner).
 * Reworked the condition that checks whether the expected TSC value is
   within the error margin to avoid an unnecessary conditional
   (Peter Zijlstra).
 * Removed the TSC error margin from struct hld_data; use a global
   variable instead (Peter Zijlstra).
 * Removed previously introduced watchdog_get_allowed_cpumask*() and
   reworked hardlockup_detector_hpet_enable()/disable() to not need
   access to watchdog_allowed_mask (Thomas Gleixner).

Changes since v1:
 * Removed reads to HPET registers at every NMI. Instead use the time-stamp
   counter to infer the interrupt source (Thomas Gleixner, Andi Kleen).
 * Do not target CPUs in a round-robin manner. Instead, the HPET timer
   always targets the same CPU; other CPUs are monitored via an
   interprocessor interrupt.
 * Removed use of generic irq code to set interrupt affinity and NMI
   delivery. Instead, configure the interrupt directly in HPET registers
   (Thomas Gleixner).
 * Removed the proposed ops structure for NMI watchdogs. Instead, split
   the existing implementation into a generic library and perf-specific
   infrastructure (Thomas Gleixner, Nicholas Piggin).
 * Added an x86-specific shim hardlockup detector that selects between
   HPET and perf infrastructures as needed (Nicholas Piggin).
 * Removed locks taken in NMI and !NMI context. This was wrong and is no
   longer needed (Thomas Gleixner).
 * Fixed unconditional return NMI_HANDLED when the HPET timer is programmed
   for FSB/MSI delivery (Peter Zijlstra).

[1]. 
https://lore.kernel.org/lkml/1528851463-21140-1-git-send-email-ricardo.neri-calde...@linux.intel.com/
[2]. 
https://lore.kernel.org/lkml/1551283518-18922-1-git-send-email-ricardo.neri-calde...@linux.intel.com/
[3]. 
https://lore.kernel.org/lkml/1557842534-4266-1-git-send-email-ricardo.neri-calde...@linux.intel.com/
[4]. 
https://lore.kernel.org/lkml/1558660583-28561-1-git-send-email-ricardo.neri-calde...@linux.intel.com/
[5]. 
https://lore.kernel.org/lkml/20210504190526.22347-1-ricardo.neri-calde...@linux.intel.com/T/
[6]. 
https://lore.kernel.org/linux-iommu/87lf8uhzk9@nanos.tec.linutronix.de/T/
[7]. 
https://lore.kernel.org/lkml/20200117091341.gx2...@hirez.programming.kicks-ass.net/
[8]. 
https://lore.kernel.org/lkml/1582581564-184429-1-git-send-email-kan.li...@linux.intel.com/
[9]. 
https://lore.kernel.org/lkml/20210504190526.22347-1-ricardo.neri-calde...@linux.intel.com/T/#me7de1b4ff4a91166c1610a2883b1f77ffe8b6ddf
[10]. 
https://lore.kernel.org/all/tip-dea978632e8400b84888bad20df0cd91c18f0...@git.kernel.org/t/
[11]. 
https://lore.kernel.org/linux-iommu/87lf8uhzk9@nanos.tec.linutronix.de/T/#mde9be6aca9119602e90e9293df9995aa056dafce

[12]. https://lore.kernel.org/r/20201024213535.443185-6-dw...@infradead.org
[13]. https://lore.kernel.org/lkml/20190623132340.463097...@linutronix.de/

Ricardo Neri (29):

[PATCH v6 07/29] iommu/vt-d: Clear the redirection hint when the destination mode is physical

2022-05-05 Thread Ricardo Neri
When the destination mode of an interrupt is physical APICID, the interrupt
is delivered only to the single CPU whose physical APICID is
specified in the destination ID field. Therefore, the redirection hint is
meaningless.

Furthermore, on certain processors, the IOMMU does not deliver the
interrupt when the delivery mode is NMI, the redirection hint is set, and
the destination mode is physical. Clearing the redirection hint ensures
that the NMI is delivered.

Cc: Andi Kleen 
Cc: David Woodhouse 
Cc: "Ravi V. Shankar" 
Cc: Lu Baolu 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Suggested-by: Ashok Raj 
Reviewed-by: Lu Baolu 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch.

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 drivers/iommu/intel/irq_remapping.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index a67319597884..d2764a71f91a 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1128,7 +1128,17 @@ static void prepare_irte(struct irte *irte, int vector, unsigned int dest)
irte->dlvry_mode = apic->delivery_mode;
irte->vector = vector;
irte->dest_id = IRTE_DEST(dest);
-   irte->redir_hint = 1;
+
+   /*
+* When using the destination mode of physical APICID, only the
+* processor specified in @dest receives the interrupt. Thus, the
+* redirection hint is meaningless.
+*
+* Furthermore, on some processors, NMIs with physical delivery mode
+* and the redirection hint set are delivered as regular interrupts
+* or not delivered at all.
+*/
+   irte->redir_hint = apic->dest_mode_logical;
 }
 
 struct irq_remap_ops intel_irq_remap_ops = {
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v6 10/29] iommu/vt-d: Implement minor tweaks for NMI irqs

2022-05-05 Thread Ricardo Neri
The Intel IOMMU interrupt remapping driver already correctly programs the
delivery mode of individual irqs as per their irq_data. Improve the
handling of NMIs: allow only one irq per NMI. Also, it is not necessary to
clean up irq vectors after updating affinity, since NMIs do not have
associated vectors.

Cc: Andi Kleen 
Cc: David Woodhouse 
Cc: "Ravi V. Shankar" 
Cc: Lu Baolu 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Lu Baolu 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch.

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 drivers/iommu/intel/irq_remapping.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index fb2d71bea98d..791a9331e257 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1198,8 +1198,12 @@ intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask,
 * After this point, all the interrupts will start arriving
 * at the new destination. So, time to cleanup the previous
 * vector allocation.
+*
+* Do it only for non-NMI irqs. NMIs don't have associated
+* vectors.
 */
-   send_cleanup_vector(cfg);
+   if (cfg->delivery_mode != APIC_DELIVERY_MODE_NMI)
+   send_cleanup_vector(cfg);
 
return IRQ_SET_MASK_OK_DONE;
 }
@@ -1352,6 +1356,9 @@ static int intel_irq_remapping_alloc(struct irq_domain *domain,
if (info->type == X86_IRQ_ALLOC_TYPE_PCI_MSI)
info->flags &= ~X86_IRQ_ALLOC_CONTIGUOUS_VECTORS;
 
+   if ((info->flags & X86_IRQ_ALLOC_AS_NMI) && nr_irqs != 1)
+   return -EINVAL;
+
ret = irq_domain_alloc_irqs_parent(domain, virq, nr_irqs, arg);
if (ret < 0)
return ret;
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v6 02/29] x86/apic: Add irq_cfg::delivery_mode

2022-05-05 Thread Ricardo Neri
Currently, the delivery mode of all interrupts is set to the mode of the
APIC driver in use. There are no restrictions in hardware to configure the
delivery mode of each interrupt individually. Also, certain IRQs need to be
configured with a specific delivery mode (e.g., NMI).

Add a new member, delivery_mode, to struct irq_cfg. Subsequent changesets
will update every irq_domain to set the delivery mode of each IRQ to that
specified in its irq_cfg data.

To keep the current behavior, when allocating an IRQ in the root domain
(i.e., the x86_vector_domain), set the delivery mode of the IRQ as that of
the APIC driver.

Cc: Andi Kleen 
Cc: "Ravi V. Shankar" 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Ashok Raj 
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Updated indentation of the existing members of struct irq_cfg.
 * Reworded the commit message.

Changes since v4:
 * Rebased to use new enumeration apic_delivery_modes.

Changes since v3:
 * None

Changes since v2:
 * Reduced scope to only add the interrupt delivery mode in
   struct irq_alloc_info.

Changes since v1:
 * Introduced this patch.
---
 arch/x86/include/asm/hw_irq.h | 5 +++--
 arch/x86/kernel/apic/vector.c | 9 +
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index d465ece58151..5ac5e6c603ee 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -88,8 +88,9 @@ struct irq_alloc_info {
 };
 
 struct irq_cfg {
-   unsigned intdest_apicid;
-   unsigned intvector;
+   unsigned intdest_apicid;
+   unsigned intvector;
+   enum apic_delivery_modesdelivery_mode;
 };
 
 extern struct irq_cfg *irq_cfg(unsigned int irq);
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 3e6f6b448f6a..838e220e8860 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -567,6 +567,7 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
irqd->chip_data = apicd;
irqd->hwirq = virq + i;
irqd_set_single_target(irqd);
+
/*
 * Prevent that any of these interrupts is invoked in
 * non interrupt context via e.g. generic_handle_irq()
@@ -577,6 +578,14 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
/* Don't invoke affinity setter on deactivated interrupts */
irqd_set_affinity_on_activate(irqd);
 
+   /*
+* Initialize the delivery mode of this irq to match the
+* default delivery mode of the APIC. Children irq domains
+* may take the delivery mode from the individual irq
+* configuration rather than from the APIC driver.
+*/
+   apicd->hw_irq_cfg.delivery_mode = apic->delivery_mode;
+
/*
 * Legacy vectors are already assigned when the IOAPIC
 * takes them over. They stay on the same vector. This is
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v6 09/29] iommu/vt-d: Set the IRTE delivery mode individually for each IRQ

2022-05-05 Thread Ricardo Neri
There are no hardware requirements to use the same delivery mode for all
interrupts. Use the mode specified in the provided IRQ hardware
configuration data. Since all IRQs are configured to use the delivery mode
of the APIC driver, the only functional changes are where IRQs are
configured to use a specific delivery mode.

Cc: Andi Kleen 
Cc: David Woodhouse 
Cc: "Ravi V. Shankar" 
Cc: Lu Baolu 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Reviewed-by: Lu Baolu 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch.

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 drivers/iommu/intel/irq_remapping.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index 66d37186ec28..fb2d71bea98d 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1125,7 +1125,7 @@ static void prepare_irte(struct irte *irte, struct irq_cfg *irq_cfg)
 * irq migration in the presence of interrupt-remapping.
*/
irte->trigger_mode = 0;
-   irte->dlvry_mode = apic->delivery_mode;
+   irte->dlvry_mode = irq_cfg->delivery_mode;
irte->vector = irq_cfg->vector;
irte->dest_id = IRTE_DEST(irq_cfg->dest_apicid);
 
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v6 04/29] x86/apic: Add the X86_IRQ_ALLOC_AS_NMI irq allocation flag

2022-05-05 Thread Ricardo Neri
There are cases in which it is necessary to set the delivery mode of an
interrupt as NMI. Add a new flag that callers can specify when allocating
an IRQ.

Cc: Andi Kleen 
Cc: "Ravi V. Shankar" 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Suggested-by: Thomas Gleixner 
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch.

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 arch/x86/include/asm/irqdomain.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/irqdomain.h b/arch/x86/include/asm/irqdomain.h
index 125c23b7bad3..de1cf2e80443 100644
--- a/arch/x86/include/asm/irqdomain.h
+++ b/arch/x86/include/asm/irqdomain.h
@@ -10,6 +10,7 @@ enum {
/* Allocate contiguous CPU vectors */
X86_IRQ_ALLOC_CONTIGUOUS_VECTORS= 0x1,
X86_IRQ_ALLOC_LEGACY= 0x2,
+   X86_IRQ_ALLOC_AS_NMI= 0x4,
 };
 
 extern int x86_fwspec_is_ioapic(struct irq_fwspec *fwspec);
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v6 05/29] x86/apic/vector: Do not allocate vectors for NMIs

2022-05-05 Thread Ricardo Neri
Vectors are meaningless when allocating IRQs with NMI as the delivery mode.
In such a case, skip the reservation of IRQ vectors. Do it in the
lowest-level functions, where the actual IRQ reservation takes place.

Since NMIs target specific CPUs, keep the functionality to find the best
CPU.

Cc: Andi Kleen 
Cc: "Ravi V. Shankar" 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch.

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 arch/x86/kernel/apic/vector.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 838e220e8860..11f881f45cec 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -245,11 +245,20 @@ assign_vector_locked(struct irq_data *irqd, const struct cpumask *dest)
if (apicd->move_in_progress || !hlist_unhashed(&apicd->clist))
return -EBUSY;
 
+   if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI) {
+   cpu = irq_matrix_find_best_cpu(vector_matrix, dest);
+   apicd->cpu = cpu;
+   vector = 0;
+   goto no_vector;
+   }
+
vector = irq_matrix_alloc(vector_matrix, dest, resvd, &cpu);
trace_vector_alloc(irqd->irq, vector, resvd, vector);
if (vector < 0)
return vector;
apic_update_vector(irqd, vector, cpu);
+
+no_vector:
apic_update_irq_cfg(irqd, vector, cpu);
 
return 0;
@@ -321,12 +330,22 @@ assign_managed_vector(struct irq_data *irqd, const struct cpumask *dest)
/* set_affinity might call here for nothing */
if (apicd->vector && cpumask_test_cpu(apicd->cpu, vector_searchmask))
return 0;
+
+   if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI) {
+   cpu = irq_matrix_find_best_cpu_managed(vector_matrix, dest);
+   apicd->cpu = cpu;
+   vector = 0;
+   goto no_vector;
+   }
+
vector = irq_matrix_alloc_managed(vector_matrix, vector_searchmask,
  &cpu);
trace_vector_alloc_managed(irqd->irq, vector, vector);
if (vector < 0)
return vector;
apic_update_vector(irqd, vector, cpu);
+
+no_vector:
apic_update_irq_cfg(irqd, vector, cpu);
return 0;
 }
@@ -376,6 +395,10 @@ static void x86_vector_deactivate(struct irq_domain *dom, struct irq_data *irqd)
if (apicd->has_reserved)
return;
 
+   /* NMI IRQs do not have associated vectors; nothing to do. */
+   if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI)
+   return;
+
raw_spin_lock_irqsave(&vector_lock, flags);
clear_irq_vector(irqd);
if (apicd->can_reserve)
@@ -472,6 +495,10 @@ static void vector_free_reserved_and_managed(struct irq_data *irqd)
trace_vector_teardown(irqd->irq, apicd->is_managed,
  apicd->has_reserved);
 
+   /* NMI IRQs do not have associated vectors; nothing to do. */
+   if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI)
+   return;
+
if (apicd->has_reserved)
irq_matrix_remove_reserved(vector_matrix);
if (apicd->is_managed)
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v6 06/29] x86/apic/vector: Implement support for NMI delivery mode

2022-05-05 Thread Ricardo Neri
The flag X86_IRQ_ALLOC_AS_NMI indicates to the interrupt controller that
it should configure the delivery mode of an IRQ as NMI. Implement such a
request. This causes irq_domain children in the hierarchy to configure
their irq_chips accordingly. When no specific delivery mode is requested,
continue using the delivery mode of the APIC driver in use.

Cc: Andi Kleen 
Cc: "Ravi V. Shankar" 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Suggested-by: Thomas Gleixner 
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch.

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 arch/x86/kernel/apic/vector.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 11f881f45cec..df4d7b9f6e27 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -570,6 +570,10 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
if ((info->flags & X86_IRQ_ALLOC_CONTIGUOUS_VECTORS) && nr_irqs > 1)
return -ENOSYS;
 
+   /* Only one IRQ per NMI */
+   if ((info->flags & X86_IRQ_ALLOC_AS_NMI) && nr_irqs != 1)
+   return -EINVAL;
+
/*
 * Catch any attempt to touch the cascade interrupt on a PIC
 * equipped system.
@@ -610,7 +614,15 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
 * default delivery mode of the APIC. Children irq domains
 * may take the delivery mode from the individual irq
 * configuration rather than from the APIC driver.
+*
+* Vectors are meaningless if the delivery mode is NMI. Since
+* nr_irqs is 1, we can return.
 */
+   if (info->flags & X86_IRQ_ALLOC_AS_NMI) {
+   apicd->hw_irq_cfg.delivery_mode = APIC_DELIVERY_MODE_NMI;
+   return 0;
+   }
+
apicd->hw_irq_cfg.delivery_mode = apic->delivery_mode;
 
/*
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v5 5/7] iommu/vt-d: Fixup delivery mode of the HPET hardlockup interrupt

2021-05-13 Thread Ricardo Neri
On Wed, May 05, 2021 at 01:03:18AM +0200, Thomas Gleixner wrote:
> On Tue, May 04 2021 at 12:10, Ricardo Neri wrote:

Thank you very much for your feedback, Thomas. I am sorry it took me a
while to reply to your email. I needed to digest and research your
comments.

> > In x86 there is not an IRQF_NMI flag that can be used to indicate the
> 
> There exists no IRQF_NMI flag at all. No architecture provides that.

Thank you for the clarification. I think I meant to say that there is a
request_nmi() function but AFAIK it is only used in the ARM PMU and
would not work on x86.

> 
> > delivery mode when requesting an interrupt (via request_irq()). Thus,
> > there is no way for the interrupt remapping driver to know and set
> > the delivery mode.
> 
> There is no support for this today. So what?

Using request_irq() plus an HPET quirk looked to me like a reasonable
way to use the irqdomain hierarchy to allocate an interrupt with NMI as
the delivery mode.

> 
> > Hence, when allocating an interrupt, check if such interrupt belongs to
> > the HPET hardlockup detector and fixup the delivery mode accordingly.
> 
> What?
> 
> > +   /*
> > +* If we find the HPET hardlockup detector irq, fixup the
> > +* delivery mode.
> > +*/
> > +   if (is_hpet_irq_hardlockup_detector(info))
> > +   irq_cfg->delivery_mode = APIC_DELIVERY_MODE_NMI;
> 
> Again. We are not sticking some random device checks into that
> code. It's wrong and I explained it to you before.
> 
>   
> https://lore.kernel.org/lkml/alpine.deb.2.21.1906161042080.1...@nanos.tec.linutronix.de/
> 
> But I'm happy to repeat it again:
> 
>   "No. This is horrible hackery violating all the layering which we carefully
>put into place to avoid exactly this kind of sprinkling conditionals into
>all code pathes.
> 
>With some thought the existing irqdomain hierarchy can be used to achieve
>the same thing without tons of extra functions and conditionals."
> 
> So the outcome of thought and using the irqdomain hierarchy is:
> 
>Replacing an hpet specific conditional in one place with an hpet
>specific conditional in a different place.
> 
> Impressive.

I am sorry, Thomas. I did try to make the quirk less hacky, but I did not
think of the solution you provided below.

> 
> hpet_assign_irq(, bool nmi)
>   init_info(info)
> ...
> if (nmi)
> info.flags |= X86_IRQ_ALLOC_AS_NMI;
>   
>irq_domain_alloc_irqs(domain, 1, NUMA_NO_NODE, &info)
>  intel_irq_remapping_alloc(..., info)
>irq_domain_alloc_irq_parents(..., info)
>  x86_vector_alloc_irqs(..., info)
>  {   
>if (info->flags & X86_IRQ_ALLOC_AS_NMI && nr_irqs != 1)
>   return -EINVAL;
> 
>for (i = 0; i < nr_irqs; i++) {
>  
>  if (info->flags & X86_IRQ_ALLOC_AS_NMI) {
>  irq_cfg_setup_nmi(apicd);
>  continue;
>  }
>  ...
>  }
> 
> irq_cfg_setup_nmi() sets irq_cfg->delivery_mode and whatever is required
> and everything else just works. Of course this needs a few other minor
> tweaks but none of those introduces random hpet quirks all over the
> place. Not convoluted enough, right?

Thanks for the detailed demonstration! It does seem cleaner than what I
implemented.
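
For concreteness, a compact rendering of the flow outlined above, assuming
the X86_IRQ_ALLOC_AS_NMI flag that later revisions of this series add (a
sketch in the context of arch/x86/kernel/hpet.c, not the literal code):

static int hpet_assign_irq_as_nmi(struct irq_domain *domain,
				  struct hpet_channel *hc)
{
	struct irq_alloc_info info;

	init_irq_alloc_info(&info, NULL);
	info.type = X86_IRQ_ALLOC_TYPE_HPET;
	info.flags |= X86_IRQ_ALLOC_AS_NMI;	/* request NMI delivery */
	info.data = hc;

	/* The vector domain rejects nr_irqs != 1 when NMI is requested. */
	return irq_domain_alloc_irqs(domain, 1, NUMA_NO_NODE, &info);
}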

> 
> But that solves none of other problems. Let me summarize again which
> options or non-options we have:
> 
> 1) Selective IPIs from NMI context cannot work
> 
>As explained in the other thread.
> 
> 2) Shorthand IPI allbutself from NMI
> 
>This should work, but that obviously does not take the watchdog
>cpumask into account.
> 
>Also this only works when IPI shorthand mode is enabled. See
>apic_smt_update() for details.
> 
> 3) Sending the IPIs from irq_work
> 
>This would solve the problem, but if the CPU which is the NMI
>target is really stuck in an interrupt disabled region then the
>IPIs won't be sent.
> 
>OTOH, if that's the case then the CPU which was processing the
>NMI will continue to be stuck until the next NMI hits which
>will detect that the CPU is stuck which is a good enough
>reason to send a shorthand IPI to all CPUs ignoring the
>watchdog cpumask.
> 
>Same limitation vs. shorthand mode as #2
> 
> 4) Changing affinity of the HPET NMI from NMI
> 
>As we established two years ago that cannot work with interrupt
>remapping
> 

[RFC PATCH v5 3/7] iommu/vt-d: Rework prepare_irte() to support per-irq delivery mode

2021-05-04 Thread Ricardo Neri
A previous changeset introduced a new member to struct irq_cfg to specify
the delivery mode of an interrupt. Supporting the configuration of the
delivery mode would require adding a third argument to prepare_irte().
Instead, simply take a pointer to an irq_cfg data structure as the only
argument.

Always configure the delivery mode of the Interrupt Remapping Table
Entry using the values specified in the irq_cfg data structure.

This change does not alter the existing behavior, as the delivery mode
of the APIC is used to configure the irq_cfg of each irq.

Cc: Andi Kleen 
Cc: Borislav Petkov 
Cc: David Woodhouse  (supporter:INTEL IOMMU (VT-d))
Cc: "Ravi V. Shankar" 
Cc: Ingo Molnar 
Cc: Jacob Pan 
Cc: Lu Baolu  (supporter:INTEL IOMMU (VT-d))
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: iommu@lists.linux-foundation.org (open list:INTEL IOMMU (VT-d))
Cc: x...@kernel.org
Reviewed-by: Ashok Raj 
Signed-off-by: Ricardo Neri 
---
Changes since v4:
 * None

Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * Introduced this patch.
---
 drivers/iommu/intel/irq_remapping.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index 611ef5243cb6..daa5df53db59 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1104,7 +1104,7 @@ void intel_irq_remap_add_device(struct dmar_pci_notify_info *info)
dev_set_msi_domain(&info->dev->dev, map_dev_to_ir(info->dev));
 }
 
-static void prepare_irte(struct irte *irte, int vector, unsigned int dest)
+static void prepare_irte(struct irte *irte, struct irq_cfg *irq_cfg)
 {
memset(irte, 0, sizeof(*irte));
 
@@ -1118,9 +1118,9 @@ static void prepare_irte(struct irte *irte, int vector, unsigned int dest)
 * irq migration in the presence of interrupt-remapping.
*/
irte->trigger_mode = 0;
-   irte->dlvry_mode = apic->delivery_mode;
-   irte->vector = vector;
-   irte->dest_id = IRTE_DEST(dest);
+   irte->dlvry_mode = irq_cfg->delivery_mode;
+   irte->vector = irq_cfg->vector;
+   irte->dest_id = IRTE_DEST(irq_cfg->dest_apicid);
irte->redir_hint = 1;
 }
 
@@ -1261,8 +1261,7 @@ static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data,
 {
struct irte *irte = &data->irte_entry;
 
-   prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid);
-
+   prepare_irte(irte, irq_cfg);
switch (info->type) {
case X86_IRQ_ALLOC_TYPE_IOAPIC:
/* Set source-id of interrupt request */
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[RFC PATCH v5 5/7] iommu/vt-d: Fixup delivery mode of the HPET hardlockup interrupt

2021-05-04 Thread Ricardo Neri
The HPET hardlockup detector requires that the HPET timer delivers the
interrupt as NMI. When interrupt remapping is disabled, this can be
done by programming the HPET MSI registers directly. With interrupt
remapping, it is necessary to populate an entry in the interrupt
remapping table.

In x86 there is no IRQF_NMI flag that can be used to indicate the
delivery mode when requesting an interrupt (via request_irq()). Thus,
there is no way for the interrupt remapping driver to know and set
the delivery mode.

Hence, when allocating an interrupt, check if such interrupt belongs to
the HPET hardlockup detector and fixup the delivery mode accordingly.

Cc: Andi Kleen 
Cc: Borislav Petkov 
Cc: David Woodhouse  (supporter:INTEL IOMMU (VT-d))
Cc: "Ravi V. Shankar" 
Cc: Ingo Molnar 
Cc: Jacob Pan 
Cc: Lu Baolu  (supporter:INTEL IOMMU (VT-d))
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: iommu@lists.linux-foundation.org (open list:INTEL IOMMU (VT-d))
Cc: x...@kernel.org
Reviewed-by: Ashok Raj 
Signed-off-by: Ricardo Neri 
---
Changes since v4:
 * Introduced this patch.

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 drivers/iommu/intel/irq_remapping.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index daa5df53db59..b07c68ecac01 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -1376,6 +1377,14 @@ static int intel_irq_remapping_alloc(struct irq_domain *domain,
irq_data->hwirq = (index << 16) + i;
irq_data->chip_data = ird;
irq_data->chip = &intel_ir_chip;
+
+   /*
+* If we find the HPET hardlockup detector irq, fixup the
+* delivery mode.
+*/
+   if (is_hpet_irq_hardlockup_detector(info))
+   irq_cfg->delivery_mode = APIC_DELIVERY_MODE_NMI;
+
intel_irq_remapping_prepare_irte(ird, irq_cfg, info, index, i);
irq_set_status_flags(virq + i, IRQ_MOVE_PCNTXT);
}
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[RFC PATCH v5 1/7] x86/apic: Add irq_cfg::delivery_mode

2021-05-04 Thread Ricardo Neri
Until now, the delivery mode of APIC interrupts has been set to the default
mode of the APIC driver. However, there are no restrictions in hardware
to configure each interrupt with a different delivery mode. Specifying the
delivery mode per interrupt is useful when one is interested in changing
the delivery mode of a particular interrupt. For instance, this can be used
to deliver an interrupt as non-maskable.

Add a new member, delivery_mode, to struct irq_cfg. This new member can
be used to update the configuration of the delivery mode in each interrupt
domain.

Currently, all interrupt domains set the delivery mode of interrupts using
the APIC setting. Interrupt domains use an irq_cfg data structure to
configure their own data structures and hardware resources. Thus, in order
to keep the current behavior, set the delivery mode of the irq
configuration to that of the APIC setting. In this manner, irq domains can
obtain the delivery mode from the irq configuration data instead of the
APIC setting, if needed.

Cc: Andi Kleen 
Cc: Borislav Petkov 
Cc: David Woodhouse  (supporter:INTEL IOMMU (VT-d))
Cc: "Ravi V. Shankar" 
Cc: Ingo Molnar 
Cc: Jacob Pan 
Cc: Lu Baolu  (supporter:INTEL IOMMU (VT-d))
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: iommu@lists.linux-foundation.org (open list:INTEL IOMMU (VT-d))
Cc: x86@kernel.org
Reviewed-by: Ashok Raj 
Signed-off-by: Ricardo Neri 
---
Changes since v4:
 * Rebased to use new enumeration apic_delivery_modes.

Changes since v3:
 * None

Changes since v2:
 * Reduced scope to only add the interrupt delivery mode in
   struct irq_alloc_info.

Changes since v1:
 * Introduced this patch.
---
 arch/x86/include/asm/hw_irq.h |  1 +
 arch/x86/kernel/apic/vector.c | 10 ++
 2 files changed, 11 insertions(+)

diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index d465ece58151..370f4db0372b 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -90,6 +90,7 @@ struct irq_alloc_info {
 struct irq_cfg {
unsigned intdest_apicid;
unsigned intvector;
+   enum apic_delivery_modesdelivery_mode;
 };
 
 extern struct irq_cfg *irq_cfg(unsigned int irq);
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 6dbdc7c22bb7..d47ed07a56a4 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -567,6 +567,16 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
irqd->chip_data = apicd;
irqd->hwirq = virq + i;
irqd_set_single_target(irqd);
+
+   /*
+* Initialize the delivery mode of this irq to match the
+* default delivery mode of the APIC. This is useful for
+* children irq domains which want to take the delivery
+* mode from the individual irq configuration rather
+* than from the APIC.
+*/
+   apicd->hw_irq_cfg.delivery_mode = apic->delivery_mode;
+
/*
 * Prevent that any of these interrupts is invoked in
 * non interrupt context via e.g. generic_handle_irq()
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[RFC PATCH v5 6/7] iommu/amd: Fixup delivery mode of the HPET hardlockup interrupt

2021-05-04 Thread Ricardo Neri
The HPET hardlockup detector requires that the HPET timer delivers the
interrupt as NMI. When interrupt remapping is disabled, this can be
done by programming the HPET MSI registers directly. With interrupt
remapping, it is necessary to populate an entry in the interrupt
remapping table.

In x86 there is no IRQF_NMI flag that can be used to indicate the
delivery mode when requesting an interrupt (via request_irq()). Thus,
there is no way for the interrupt remapping driver to know and set
the delivery mode.

Hence, when allocating an interrupt, check if such interrupt belongs to
the HPET hardlockup detector and fixup the delivery mode accordingly.

Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Borislav Petkov 
Cc: David Woodhouse  (supporter:INTEL IOMMU (VT-d))
Cc: "Ravi V. Shankar" 
Cc: Ingo Molnar 
Cc: Jacob Pan 
Cc: Lu Baolu  (supporter:INTEL IOMMU (VT-d))
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: iommu@lists.linux-foundation.org (open list:INTEL IOMMU (VT-d))
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
Changes since v4:
 * Introduced this patch.

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 drivers/iommu/amd/iommu.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index e8d9fae0c766..758e08ba42e6 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -3254,6 +3255,14 @@ static int irq_remapping_alloc(struct irq_domain *domain, unsigned int virq,
irq_data->hwirq = (devid << 16) + i;
irq_data->chip_data = data;
irq_data->chip = &amd_ir_chip;
+
+   /*
+* If we find the HPET hardlockup detector irq, fixup the
+* delivery mode.
+*/
+   if (is_hpet_irq_hardlockup_detector(info))
+   cfg->delivery_mode = APIC_DELIVERY_MODE_NMI;
+
irq_remapping_prepare_irte(data, cfg, info, devid, index, i);
irq_set_status_flags(virq + i, IRQ_MOVE_PCNTXT);
}
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[RFC PATCH v5 2/7] x86/hpet: Introduce function to identify HPET hardlockup detector irq

2021-05-04 Thread Ricardo Neri
The HPET hardlockup detector needs to deliver its interrupt as NMI.
In x86 there is no IRQF_NMI flag that can be used in the irq plumbing
code to tell interrupt remapping drivers to set the interrupt delivery
mode accordingly. Hence, they must fixup the delivery mode internally.

Implement a method to determine if the interrupt being allocated belongs
to the HPET hardlockup detector.

Cc: Andi Kleen 
Cc: Borislav Petkov 
Cc: David Woodhouse  (supporter:INTEL IOMMU (VT-d))
Cc: "Ravi V. Shankar" 
Cc: Ingo Molnar 
Cc: Jacob Pan 
Cc: Lu Baolu  (supporter:INTEL IOMMU (VT-d))
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: iommu@lists.linux-foundation.org (open list:INTEL IOMMU (VT-d))
Cc: x...@kernel.org
Reviewed-by: Ashok Raj 
Signed-off-by: Ricardo Neri 
---
Changes since v4:
 * Introduced this patch. Previous versions had special functions to
   allocate and set the affinity of a remapped NMI interrupt.

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 arch/x86/include/asm/hpet.h |  3 +++
 arch/x86/kernel/hpet.c  | 33 +
 2 files changed, 36 insertions(+)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index df11c7d4af44..5bf675970d4b 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -149,6 +149,7 @@ extern void hardlockup_detector_hpet_stop(void);
 extern void hardlockup_detector_hpet_enable(unsigned int cpu);
 extern void hardlockup_detector_hpet_disable(unsigned int cpu);
 extern void hardlockup_detector_switch_to_perf(void);
+extern bool is_hpet_irq_hardlockup_detector(struct irq_alloc_info *info);
 #else
 static inline int hardlockup_detector_hpet_init(void)
 { return -ENODEV; }
@@ -156,6 +157,8 @@ static inline void hardlockup_detector_hpet_stop(void) {}
 static inline void hardlockup_detector_hpet_enable(unsigned int cpu) {}
 static inline void hardlockup_detector_hpet_disable(unsigned int cpu) {}
 static inline void hardlockup_detector_switch_to_perf(void) {}
+static inline bool is_hpet_irq_hardlockup_detector(struct irq_alloc_info *info)
+{ return false; }
 #endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
 
 #else /* CONFIG_HPET_TIMER */
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 5012590dc1b8..3e43e0f348b8 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -1479,6 +1479,39 @@ struct hpet_hld_data *hpet_hld_get_timer(void)
hld_data = NULL;
return NULL;
 }
+
+/**
+ * is_hpet_irq_hardlockup_detector() - Identify the HPET hld interrupt info
+ * @info:  Interrupt allocation info, with private HPET channel data
+ *
+ * The HPET hardlockup detector is special as it needs its interrupts delivered
+ * as NMI. However, for interrupt remapping we use the existing irq subsystem
 * to configure and route the HPET interrupt. Unfortunately, there is no
 * IRQF_NMI flag for x86. Instead, identify whether the interrupt being
+ * allocated for the HPET channel belongs to the hardlockup detector.
+ *
+ * Returns: True if @info indicates that it belongs to the HPET hardlockup
+ * detector. False otherwise.
+ */
+bool is_hpet_irq_hardlockup_detector(struct irq_alloc_info *info)
+{
+   struct hpet_channel *hc;
+
+   if (!info)
+   return false;
+
+   if (info->type != X86_IRQ_ALLOC_TYPE_HPET)
+   return false;
+
+   hc = info->data;
+   if (!hc)
+   return false;
+
+   if (hc->mode == HPET_MODE_NMI_WATCHDOG)
+   return true;
+
+   return false;
+}
 #endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
 
 #endif
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[RFC PATCH v5 7/7] x86/watchdog/hardlockup/hpet: Support interrupt remapping

2021-05-04 Thread Ricardo Neri
When interrupt remapping is enabled in the system, the MSI interrupt
address and data fields must follow a special format that the IOMMU
defines.

However, the HPET hardlockup detector must rely on the interrupt
subsystem to have the interrupt remapping drivers allocate, activate,
and set the affinity of the HPET timer interrupt. Hence, it must use
request_irq() to access such functionality.

In x86 there is no IRQF_NMI flag to indicate to the interrupt
subsystem the delivery mode of the interrupt. A previous changeset added
functionality to detect the interrupt of the HPET hardlockup detector
and fix up the delivery mode accordingly.

Also, since request_irq() is used, a non-NMI interrupt handler must be
defined, even though it is not needed.

When interrupt remapping is enabled, use the new facility to ensure the
interrupt is plumbed properly to work with interrupt remapping.

Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Borislav Petkov 
Cc: David Woodhouse  (supporter:INTEL IOMMU (VT-d))
Cc: "Ravi V. Shankar" 
Cc: Ingo Molnar 
Cc: Jacob Pan 
Cc: Lu Baolu  (supporter:INTEL IOMMU (VT-d))
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: iommu@lists.linux-foundation.org (open list:INTEL IOMMU (VT-d))
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
Changes since v4:
 * Use request_irq() to obtain an IRTE for the HPET hardlockup detector
   instead of the custom interfaces previously implemented in the
   interrupt remapping drivers.
 * Simplified detection of interrupt remapping by checking the parent
   of the HPET irq domain.
 * Stopped using the HPET magic fields of struct irq_alloc_info. They
   were removed in commit 2bf1e7bcedb8 ("x86/msi: Consolidate HPET
   allocation")
 * Rephrased commit message for clarity. (Ashok)
 * Clarified error message of non-NMI handler. (Ashok)

Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * Introduced this patch. Added custom functions in the Intel IOMMU driver
   to allocate an IRTE for the HPET hardlockup detector.
---
 arch/x86/include/asm/hpet.h |  2 ++
 arch/x86/kernel/hpet.c  |  3 ++
 arch/x86/kernel/watchdog_hld_hpet.c | 48 +
 3 files changed, 47 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 5bf675970d4b..d130285ddc96 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -109,6 +109,7 @@ extern void hpet_unregister_irq_handler(rtc_irq_handler handler);
  * @tsc_ticks_per_group:   TSC ticks that must elapse for each group of
  * monitored CPUs.
  * @irq:   IRQ number assigned to the HPET channel
+ * @int_remap_enabled: True if interrupt remapping is enabled
  * @handling_cpu:  CPU handling the HPET interrupt
  * @pkgs_per_group:Number of physical packages in a group of CPUs
  * receiving an IPI
@@ -133,6 +134,7 @@ struct hpet_hld_data {
u64 tsc_next;
u64 tsc_ticks_per_group;
int irq;
+   boolintr_remap_enabled;
u32 handling_cpu;
u32 pkgs_per_group;
u32 nr_groups;
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 3e43e0f348b8..ff4abdef5e15 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -1464,6 +1464,9 @@ struct hpet_hld_data *hpet_hld_get_timer(void)
if (!hpet_domain)
goto err;
 
+   if (hpet_domain->parent != x86_vector_domain)
+   hld_data->intr_remap_enabled = true;
+
hc->mode = HPET_MODE_NMI_WATCHDOG;
irq = hpet_assign_irq(hpet_domain, hc, hc->num);
if (irq <= 0)
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index 3fd2405b31fa..265641d001ac 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -176,6 +176,14 @@ static int update_msi_destid(struct hpet_hld_data *hdata)
 {
u32 destid;
 
+   if (hdata->intr_remap_enabled) {
+   int ret;
+
+   ret = irq_set_affinity(hdata->irq,
+  cpumask_of(hdata->handling_cpu));
+   return ret;
+   }
+
destid = apic->calc_dest_apicid(hdata->handling_cpu);
/*
 * HPET only supports a 32-bit MSI address register. Thus, only
@@ -393,26 +401,52 @@ static int hardlockup_detector_nmi_handler(unsigned int type,
return NMI_DONE;
 }
 
+/*
+ * When interrupt remapping is enabled, we request the irq for the detector
+ * using request_irq() and then we fixup the delivery mode to NMI using
+ * is_hpet_irq_hardlockup_detector(). If the latter fails, we will see a non-
+ * NMI interrupt.
+ *
+ */
+static irqreturn_t hardlockup_detector_irq_handler(int irq, void *data)
+{
+   pr_err_once("Received a n

[RFC PATCH v5 4/7] iommu/amd: Set the IRTE delivery mode from irq_cfg

2021-05-04 Thread Ricardo Neri
There is no hardware requirement that all interrupts use the same delivery
mode. Instead of using the delivery mode of the APIC driver, use the
delivery mode of each specific interrupt configuration.

This makes it possible to accommodate interrupts which require a specific
delivery mode, such as the HPET hardlockup detector.

Outside of that case, there are no functional changes, since the delivery
mode of an interrupt is initialized with the delivery mode of the APIC
driver.

Cc: Andi Kleen 
Cc: Borislav Petkov 
Cc: David Woodhouse  (supporter:INTEL IOMMU (VT-d))
Cc: "Ravi V. Shankar" 
Cc: Ingo Molnar 
Cc: Jacob Pan 
Cc: Lu Baolu  (supporter:INTEL IOMMU (VT-d))
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: iommu@lists.linux-foundation.org (open list:INTEL IOMMU (VT-d))
Cc: x...@kernel.org
Reviewed-by: Ashok Raj 
Signed-off-by: Ricardo Neri 
---
Changes since v4:
 * Introduced this patch.

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 drivers/iommu/amd/iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index a69a8b573e40..e8d9fae0c766 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3122,7 +3122,7 @@ static void irq_remapping_prepare_irte(struct amd_ir_data *data,
 
data->irq_2_irte.devid = devid;
data->irq_2_irte.index = index + sub_handle;
-   iommu->irte_ops->prepare(data->entry, apic->delivery_mode,
+   iommu->irte_ops->prepare(data->entry, irq_cfg->delivery_mode,
 apic->dest_mode_logical, irq_cfg->vector,
 irq_cfg->dest_apicid, devid);
 
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[RFC PATCH v5 0/7] x86: watchdog/hardlockup/hpet: Add support for interrupt remapping

2021-05-04 Thread Ricardo Neri
Hi IOMMU experts,

I proposed a hardlockup detector driven by the HPET timer [1]. The
detector is driven by a single timer, which brings the extra complexity
of having to update the affinity of the interrupt periodically, with the
update initiated from NMI context. The proposed design only requires
updating the affinity every watchdog_thresh seconds (the interval is
between [1, 60] seconds). Also, the affinity update is offloaded to
an irq_work. Handling the HPET interrupt affinity is trivial with
!intremap since the detector composes the MSI message and writes it
directly to the HPET registers.
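
To make the irq_work offload concrete, here is a minimal sketch under
assumed names (hld_irq, hld_next_cpu and the helpers are hypothetical,
for illustration only):

#include <linux/irq_work.h>
#include <linux/interrupt.h>

static int hld_irq;			/* hypothetical: HPET channel IRQ */
static unsigned int hld_next_cpu;	/* hypothetical: next CPU to monitor */

static void hld_set_affinity_fn(struct irq_work *work)
{
	/* Runs in hard-IRQ context, where the irq locks may be taken. */
	irq_set_affinity(hld_irq, cpumask_of(hld_next_cpu));
}

static struct irq_work hld_affinity_work = IRQ_WORK_INIT(hld_set_affinity_fn);

/* Called from the NMI handler: only queue the work, take no locks. */
static void hld_queue_affinity_update(unsigned int cpu)
{
	hld_next_cpu = cpu;
	irq_work_queue(&hld_affinity_work);
}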

However, for intremap we must use the existing IOMMU drivers as well as
the kernel's irq plumbing. Thomas Gleixner has imposed two restrictions:
  1) Do not implement an IRQF_NMI flag for x86 as it is not possible to
 determine the source of an NMI [2].
  2) Use the irq subsystem to update the affinity of the HPET
 interrupt [3].

1) implies that the interrupt remapping drivers need to implement a quirk
to identify the HPET interrupt and update its delivery mode to NMI. 2)
means that the hardlockup detector must use request_irq() to allocate the
HPET interrupt.

This patch series attempts to meet the requirements above by
  a) Decoupling the delivery mode of an APIC interrupt from the delivery
 mode of the APIC driver (patch 1)
  b) Implement quirks in the Intel and AMD IOMMU drivers to identify the
 HPET timer and update the delivery mode accordingly (patches 2-5).
  c) Add support for interrupt remapping in the HPET hardlockup detector
 in [1]. This includes the unavoidable eyesore of using request_irq()
 and having a useless regular interrupt handler (patch 6).

I would like to get your feedback on whether the HPET NMI quirk looks
sane to you and whether offloading the affinity setup to an irq_work
could pose issues.

Thanks and BR,
Ricardo

[1]. 
https://lore.kernel.org/lkml/20210504190526.22347-1-ricardo.neri-calde...@linux.intel.com/T/#mf77988cc98f9ca6988831e17f68394577388959d
[2]. 
https://lore.kernel.org/lkml/alpine.deb.2.21.1808021137400.2...@nanos.tec.linutronix.de/
[3]. 
https://lore.kernel.org/lkml/alpine.deb.2.21.1906161042080.1...@nanos.tec.linutronix.de/

Changes since v4:
 * With !CONFIG_IRQ_REMAP, [1] now disables the HPET channel before changing
   the MSI Destination ID field. This should avoid races between a pending
   interrupt and the update of the detector's interrupt affinity. (Ashok)
 * Rebased to use new enumeration apic_delivery_modes.
 * Removed custom functions to allocate an interrupt for the detector
   and instead added support to identify the detector's interrupt and
   change the delivery mode.
 * With interrupt remapping enabled, use request_irq().
 * Added support for AMD IOMMU.

Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * Introduced support for interrupt remapping

Ricardo Neri (7):
  x86/apic: Add irq_cfg::delivery_mode
  x86/hpet: Introduce function to identify HPET hardlockup detector irq
  iommu/vt-d: Rework prepare_irte() to support per-irq delivery mode
  iommu/amd: Set the IRTE delivery mode from irq_cfg
  iommu/vt-d: Fixup delivery mode of the HPET hardlockup interrupt
  iommu/amd: Fixup delivery mode of the HPET hardlockup interrupt
  x86/watchdog/hardlockup/hpet: Support interrupt remapping

 arch/x86/include/asm/hpet.h |  5 +++
 arch/x86/include/asm/hw_irq.h   |  1 +
 arch/x86/kernel/apic/vector.c   | 10 ++
 arch/x86/kernel/hpet.c  | 36 ++
 arch/x86/kernel/watchdog_hld_hpet.c | 48 +
 drivers/iommu/amd/iommu.c   | 11 ++-
 drivers/iommu/intel/irq_remapping.c | 20 
 7 files changed, 118 insertions(+), 13 deletions(-)

-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog

2019-10-17 Thread Ricardo Neri
On Tue, Jun 18, 2019 at 01:08:06AM +0200, Thomas Gleixner wrote:
> Stephane,
> 
> On Mon, 17 Jun 2019, Stephane Eranian wrote:
> > On Mon, Jun 17, 2019 at 1:25 AM Thomas Gleixner  wrote:
> > > Great that there is no trace of any mail from Andi or Stephane about this
> > > on LKML. There is no problem with talking offlist about this stuff, but
> > > then you should at least provide a rationale for those who were not part 
> > > of
> > > the private conversation.
> > >
> > Let me add some context to this whole patch series. The pressure on the
> > core PMU counters is increasing as more people want to use them to
> > measure always more events. When the PMU is overcommitted, i.e., more
> > events than counters for them, there is multiplexing. It comes with an
> > overhead that is too high for certain applications. One way to avoid this
> > is to lower the multiplexing frequency, which is by default 1ms, but that
> > comes with loss of accuracy. Another approach is to measure only a small
> > number of events at a time and use multiple runs, but then you lose
> > consistent event view. Another approach is to push for increasing the
> > number of counters. But getting new hardware counters takes time. Short
> > term, we can investigate what it would take to free one cycle-capable
> > counter which is commandeered by the hard lockup detector on all X86
> > processors today. The functionality of the watchdog, being able to get a
> > crash dump on kernel deadlocks, is important and we cannot simply disable
> > it. At scale, many bugs are exposed and thus machines
> > deadlock. Therefore, we want to investigate what it would take to move
> > the detector to another NMI-capable source, such as the HPET because the
> > detector does not need a high granularity timer and interrupts only
> > every 2s.
> 
> I'm well aware about the reasons for this.
> 
> > Furthermore, recent Intel erratum, e.g., the TSX issue forcing the TFA
> > code in perf_events, have increased the pressure even more with only 3
> > generic counters left. Thus, it is time to look at alternative ways of
> > getting a hard lockup detector (NMI watchdog) from another NMI source
> > than the PMU. To that extent, I have been discussing about alternatives.
> >
> > Intel suggested using the HPET and Ricardo has been working on
> > producing this patch series. It is clear from your review
> > that the patches have issues, but I am hoping that they can be
> > resolved with constructive feedback knowing what the end goal is.
> 
> Well, I gave constructive feedback from the very first version on. But
> essential parts of that feedback have been ignored for whatever reasons.
> 
> > As for the round-robin changes, yes, we discussed this as an alternative
> > to avoid overloading CPU0 with handling all of the work to broadcasting
> > IPI to 100+ other CPUs.
> 
> I can understand the reason why you don't want to do that, but again, I
> said way before this was tried that changing affinity from NMI context with
> the IOMMU cannot work by just calling into the iommu code and it needs some
> deep investigation with the IOMMU wizards whether a preallocated entry can
> be used lockless (including the subsequently required flush).
> 
> The outcome is that the change was implemented by simply calling into
> functions which I told that they cannot be called from NMI context.
> 
> Unless this problem is not solved and I doubt it can be solved after
> talking to IOMMU people and studying manuals, the round robin mechanics in
> the current form are not going to happen. We'd need a SMI based lockup
> detector to debug the resulting livelock wreckage.
> 
> There are two possible options:
> 
>   1) Back to the IPI approach
> 
>  The problem with broadcast is that it sends IPIs one by one to each
>  online CPU, which sums up with a large number of CPUs.
> 
>  The interesting question is why the kernel does not utilize the all
>  excluding self destination shorthand for this. The SDM is not giving
>  any information.
> 
>  But there is a historic commit which is related and gives a hint:
> 
> commit e77deacb7b078156fcadf27b838a4ce1a65eda04
> Author: Keith Owens 
> Date:   Mon Jun 26 13:59:56 2006 +0200
> 
> [PATCH] x86_64: Avoid broadcasting NMI IPIs
> 
> On some i386/x86_64 systems, sending an NMI IPI as a broadcast will
>   reset the system.  This seems to be a BIOS bug which affects
>   machines where one or more cpus are not under OS control.  It
>   occurs on HT systems with a version of the OS that is not compiled
>   without HT support.  It also occurs when a system is booted with
>   max_cpus=n where 2 <= n < cpus known to the BIOS.  The fix is to
>   always send NMI IPI as a mask instead of as a broadcast.
> 
> I can see the issue with max_cpus and that'd be trivial to solve by
> disabling the HPET watchdog when maxcpus < num_present_cpus is on the
> command line (Th

Re: [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog

2019-06-21 Thread Ricardo Neri
On Fri, Jun 21, 2019 at 10:05:01PM +0200, Thomas Gleixner wrote:
> On Fri, 21 Jun 2019, Jacob Pan wrote:
> > On Fri, 21 Jun 2019 10:31:26 -0700
> > Jacob Pan  wrote:
> > 
> > > On Fri, 21 Jun 2019 17:33:28 +0200 (CEST)
> > > Thomas Gleixner  wrote:
> > > 
> > > > On Wed, 19 Jun 2019, Jacob Pan wrote:  
> > > > > On Tue, 18 Jun 2019 01:08:06 +0200 (CEST)
> > > > > Thomas Gleixner  wrote:
> > > > > > 
> > > > > > Unless this problem is not solved and I doubt it can be solved
> > > > > > after talking to IOMMU people and studying manuals,
> > > > >
> > > > > I agree. modify irte might be done with cmpxchg_double() but the
> > > > > queued invalidation interface for IRTE cache flush is shared with
> > > > > DMA and requires holding a spinlock for enque descriptors, QI tail
> > > > > update etc.
> > > > > 
> > > > > Also, reserving & manipulating IRTE slot for hpet via backdoor
> > > > > might not be needed if the HPET PCI BDF (found in ACPI) can be
> > > > > utilized. But it might need more work to add a fake PCI device for
> > > > > HPET.
> > > > 
> > > > What would PCI/BDF solve?  
> > > I was thinking if HPET is a PCI device then it can naturally
> > > gain slots in IOMMU remapping table IRTEs via PCI MSI code. Then
> > > perhaps it can use the IRQ subsystem to set affinity etc. w/o
> > > directly adding additional helper functions in IRQ remapping code. I
> > > have not followed all the discussions, just a thought.
> > > 
> > I looked at the code again, seems the per cpu HPET code already taken
> > care of HPET MSI management. Why can't we use IR-HPET-MSI chip and
> > domain to allocate and set affinity etc.?
> > Most APIC timer has ARAT not enough per cpu HPET, so per cpu HPET is
> > not used mostly.
> 
> Sure, we can use that, but that does not allow to move the affinity from
> NMI context either. Same issue with the IOMMU as with the other hack.

If I understand Thomas' point correctly, the problem is having to take a
lock in NMI context to update the IRTE for the HPET; both as in my hack
and in the generic irq code. The problem is worse when using the generic
irq code, as there are several layers and several locks that need to be
handled.
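
For illustration, a minimal sketch (hypothetical names, not code from the
series) of why taking such a lock in NMI context can deadlock:

static DEFINE_RAW_SPINLOCK(qi_lock);	/* hypothetical IOMMU QI lock */

/* Normal context: hold the lock while queueing an invalidation. */
static void queue_invalidation(void)
{
	raw_spin_lock(&qi_lock);
	/* ... enqueue descriptor, update the QI tail ... */
	raw_spin_unlock(&qi_lock);
}

/*
 * NMI context: if the NMI lands inside the locked region above on the
 * same CPU, this spins forever; the interrupted lock holder cannot run
 * to release the lock while the NMI handler executes.
 */
static void hld_nmi_update_irte(void)
{
	raw_spin_lock(&qi_lock);	/* potential self-deadlock */
	/* ... modify the IRTE and flush the IRTE cache ... */
	raw_spin_unlock(&qi_lock);
}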

Thanks and BR,
Ricardo


Re: [RFC PATCH v4 04/21] x86/hpet: Add hpet_set_comparator() for periodic and one-shot modes

2019-06-18 Thread Ricardo Neri
On Fri, Jun 14, 2019 at 08:17:14PM +0200, Thomas Gleixner wrote:
> On Thu, 23 May 2019, Ricardo Neri wrote:
> > +/**
> > + * hpet_set_comparator() - Helper function for setting comparator register
> > + * @num:   The timer ID
> > + * @cmp:   The value to be written to the comparator/accumulator
> > + * @period:The value to be written to the period (0 = oneshot mode)
> > + *
> > + * Helper function for updating comparator, accumulator and period values.
> > + *
> > + * In periodic mode, HPET needs HPET_TN_SETVAL to be set before writing
> > + * to the Tn_CMP to update the accumulator. Then, HPET needs a second
> > + * write (with HPET_TN_SETVAL cleared) to Tn_CMP to set the period.
> > + * The HPET_TN_SETVAL bit is automatically cleared after the first write.
> > + *
> > + * For one-shot mode, HPET_TN_SETVAL does not need to be set.
> > + *
> > + * See the following documents:
> > + *   - Intel IA-PC HPET (High Precision Event Timers) Specification
> > + *   - AMD-8111 HyperTransport I/O Hub Data Sheet, Publication # 24674
> > + */
> > +void hpet_set_comparator(int num, unsigned int cmp, unsigned int period)
> > +{
> > +   if (period) {
> > +   unsigned int v = hpet_readl(HPET_Tn_CFG(num));
> > +
> > +   hpet_writel(v | HPET_TN_SETVAL, HPET_Tn_CFG(num));
> > +   }
> > +
> > +   hpet_writel(cmp, HPET_Tn_CMP(num));
> > +
> > +   if (!period)
> > +   return;
> 
> TBH, I hate this conditional handling. What's wrong with two functions?

There is probably nothing wrong with two functions. I can split it into
hpet_set_comparator_periodic() and hpet_set_comparator(). Perhaps the
latter is not needed as it would be a one-line function; you have
suggested earlier to avoid such small functions.
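
For reference, a sketch of what the periodic variant could look like,
derived from the combined function quoted above (not the literal
follow-up patch):

static void hpet_set_comparator_periodic(int num, unsigned int cmp,
					 unsigned int period)
{
	unsigned int v = hpet_readl(HPET_Tn_CFG(num));

	/*
	 * HPET_TN_SETVAL must be set before writing Tn_CMP so that the
	 * accumulator is updated; hardware clears the bit after the
	 * first write.
	 */
	hpet_writel(v | HPET_TN_SETVAL, HPET_Tn_CFG(num));
	hpet_writel(cmp, HPET_Tn_CMP(num));

	/* A second Tn_CMP write, after a brief delay, sets the period. */
	udelay(1);
	hpet_writel(period, HPET_Tn_CMP(num));
}
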
> 
> > +
> > +   /*
> > +* This delay is seldom used: never in one-shot mode and in periodic
> > +* only when reprogramming the timer.
> > +*/
> > +   udelay(1);
> > +   hpet_writel(period, HPET_Tn_CMP(num));
> > +}
> > +EXPORT_SYMBOL_GPL(hpet_set_comparator);
> 
> Why is this exported? Which module user needs this?

It is not used anywhere else. I will remove this export.

Thanks and BR,

Ricardo
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v4 03/21] x86/hpet: Calculate ticks-per-second in a separate function

2019-06-18 Thread Ricardo Neri
On Fri, Jun 14, 2019 at 05:54:05PM +0200, Thomas Gleixner wrote:
> On Thu, 23 May 2019, Ricardo Neri wrote:
> >  
> > +u64 hpet_get_ticks_per_sec(u64 hpet_caps)
> > +{
> > +   u64 ticks_per_sec, period;
> > +
> > +   period = (hpet_caps & HPET_COUNTER_CLK_PERIOD_MASK) >>
> > +		 HPET_COUNTER_CLK_PERIOD_SHIFT; /* fs, 10^-15 */
> > +
> > +   /*
> > +* The frequency is the reciprocal of the period. The period is given
> > +* in femtoseconds per second. Thus, prepare a dividend to obtain the
> > +* frequency in ticks per second.
> > +*/
> > +
> > +   /* 10^15 femtoseconds per second */
> > +   ticks_per_sec = 1000000000000000ULL;
> > +   ticks_per_sec += period >> 1; /* round */
> > +
> > +   /* The quotient is put in the dividend. We drop the remainder. */
> > +   do_div(ticks_per_sec, period);
> > +
> > +   return ticks_per_sec;
> > +}
> > +
> >  int hpet_alloc(struct hpet_data *hdp)
> >  {
> > u64 cap, mcfg;
> > @@ -844,7 +867,6 @@ int hpet_alloc(struct hpet_data *hdp)
> > struct hpets *hpetp;
> > struct hpet __iomem *hpet;
> > static struct hpets *last;
> > -   unsigned long period;
> > unsigned long long temp;
> > u32 remainder;
> >  
> > @@ -894,12 +916,7 @@ int hpet_alloc(struct hpet_data *hdp)
> >  
> > last = hpetp;
> >  
> > -   period = (cap & HPET_COUNTER_CLK_PERIOD_MASK) >>
> > -   HPET_COUNTER_CLK_PERIOD_SHIFT; /* fs, 10^-15 */
> > -   temp = 1000000000000000uLL; /* 10^15 femtoseconds per second */
> > -   temp += period >> 1; /* round */
> > -   do_div(temp, period);
> > -   hpetp->hp_tick_freq = temp; /* ticks per second */
> > +   hpetp->hp_tick_freq = hpet_get_ticks_per_sec(cap);
> 
> Why are we actually computing this over and over?
> 
> In hpet_enable() which is the first function invoked we have:
> 
> /*
>  * The period is a femto seconds value. Convert it to a
>  * frequency.
>  */
> freq = FSEC_PER_SEC;
> do_div(freq, hpet_period);
> hpet_freq = freq;
> 
> So we already have ticks per second, aka frequency, right? So why do we
> need yet another function instead of using the value which is computed
> once? The frequency of the HPET channels has to be identical no matter
> what. If it's not HPET is broken beyond repair.

I don't think it needs to be recomputed. I missed the fact that the
frequency was already computed here.

Also, the hpet char driver has its own frequency computation. Perhaps it
could obtain the value from here as well?
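
If so, one hypothetical way to share it (only a sketch; hpet_freq is
currently a local variable in hpet_enable(), so the cached variable and
the getter name below are made up):

static u64 hpet_freq_hz;	/* cached once in hpet_enable() */

u64 hpet_get_freq(void)
{
	return hpet_freq_hz;
}

Then hpet_alloc() in drivers/char/hpet.c could set
hpetp->hp_tick_freq = hpet_get_freq() instead of recomputing it.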

Thanks and BR,
Ricardo
> 
> Thanks,
> 
>   tglx
> 
> 


Re: [RFC PATCH v4 05/21] x86/hpet: Reserve timer for the HPET hardlockup detector

2019-06-18 Thread Ricardo Neri
On Fri, Jun 14, 2019 at 06:10:18PM +0200, Thomas Gleixner wrote:
> On Thu, 13 Jun 2019, Ricardo Neri wrote:
> 
> > On Tue, Jun 11, 2019 at 09:54:25PM +0200, Thomas Gleixner wrote:
> > > On Thu, 23 May 2019, Ricardo Neri wrote:
> > > 
> > > > HPET timer 2 will be used to drive the HPET-based hardlockup detector.
> > > > Reserve such timer to ensure it cannot be used by user space programs or
> > > > for clock events.
> > > > 
> > > > When looking for MSI-capable timers for clock events, skip timer 2 if
> > > > the HPET hardlockup detector is selected.
> > > 
> > > Why? Both the changelog and the code change lack an explanation why this
> > > timer is actually touched after it got reserved for the platform. The
> > > reservation should make it inaccessible for other things.
> > 
> > hpet_reserve_platform_timers() will give the HPET char driver a data
> > structure which specifies which drivers are reserved. In this manner,
> > they cannot be used by applications via file opens. The timer used by
> > the hardlockup detector should be marked as reserved.
> > 
> > Also, hpet_msi_capability_lookup() populates another data structure
> > which is used when obtaining an unused timer for a HPET clock event.
> > The timer used by the hardlockup detector should not be included in such
> > data structure.
> > 
> > Is this the explanation you would like to see? If yes, I will include it
> > in the changelog.
> 
> Yes, the explanation makes sense. The code still sucks. Not really your
> fault, but this is not making it any better.
> 
> What bothers me most is the fact that CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
> removes one HPET timer unconditionally. It neither checks whether the hpet
> watchdog is actually enabled on the command line, nor does it validate
> upfront whether the HPET supports FSB delivery.
> 
> That wastes an HPET timer unconditionally for no value. Not that I
> personally care much about /dev/hpet, but some older laptops depend on HPET
> per cpu timers as the local APIC timer stops in C2/3. So this unconditional
> reservation will cause regressions for no reason.
> 
> The proper approach here is to:
> 
>  1) Evaluate the command line _before_ hpet_enable() is invoked
> 
>  2) Check the availability of FSB delivery in hpet_enable()
> 
> Reserve an HPET channel for the watchdog only when #1 and #2 are true.

Sure. I will add the explanation in the commit message and only reserve
the timer if both of the conditions above are met.
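
Roughly like this (a sketch only; HPET_HLD_TIMER is a placeholder for
the channel number, hardlockup_use_hpet is the command-line flag used
later in this series, and HPET_TN_FSB_CAP is the existing capability
bit):

static bool hpet_hld_timer_usable(void)
{
	u32 cfg = hpet_readl(HPET_Tn_CFG(HPET_HLD_TIMER));

	/* #1: was "nmi_watchdog=...,hpet" given on the command line? */
	if (!hardlockup_use_hpet)
		return false;

	/* #2: does the channel support FSB (MSI) delivery? */
	return !!(cfg & HPET_TN_FSB_CAP);
}

The channel would then be reserved from hpet_enable() only when this
returns true.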

Thanks and BR,
Ricardo


Re: [RFC PATCH v4 18/21] x86/apic: Add a parameter for the APIC delivery mode

2019-06-18 Thread Ricardo Neri
On Sun, Jun 16, 2019 at 11:55:03AM +0200, Thomas Gleixner wrote:
> On Thu, 23 May 2019, Ricardo Neri wrote:
> >  
> >  struct irq_cfg {
> > -   unsigned intdest_apicid;
> > -   unsigned intvector;
> > +   unsigned intdest_apicid;
> > +   unsigned intvector;
> > +   enum ioapic_irq_destination_types   delivery_mode;
> 
> And how is this related to IOAPIC?

In my view, IOAPICs can also be programmed with a delivery mode. The
delivery mode values are the same for MSI interrupts.

> I know this enum exists already, but in
> connection with MSI this does not make any sense at all.

Is the issue here the name of the enumeration?

> 
> > +
> > +   /*
> > +* Initialize the delivery mode of this irq to match the
> > +* default delivery mode of the APIC. This is useful for
> > +* children irq domains which want to take the delivery
> > +* mode from the individual irq configuration rather
> > +* than from the APIC.
> > +*/
> > +apicd->hw_irq_cfg.delivery_mode = apic->irq_delivery_mode;
> 
> And here it's initialized from apic->irq_delivery_mode, which is an
> u32. Intuitive and consistent - NOT!

Yes, this is wrong. Should the member in the structure above then be a
u32 instead of an enum ioapic_irq_destination_types?

Thanks and BR,
Ricardo


Re: [RFC PATCH v4 12/21] watchdog/hardlockup/hpet: Adjust timer expiration on the number of monitored CPUs

2019-06-18 Thread Ricardo Neri
On Tue, Jun 11, 2019 at 10:11:04PM +0200, Thomas Gleixner wrote:
> On Thu, 23 May 2019, Ricardo Neri wrote:
> > @@ -52,10 +59,10 @@ static void kick_timer(struct hpet_hld_data *hdata, bool force)
> > return;
> >  
> > if (hdata->has_periodic)
> > -   period = watchdog_thresh * hdata->ticks_per_second;
> > +   period = watchdog_thresh * hdata->ticks_per_cpu;
> >  
> > count = hpet_readl(HPET_COUNTER);
> > -   new_compare = count + watchdog_thresh * hdata->ticks_per_second;
> > +   new_compare = count + watchdog_thresh * hdata->ticks_per_cpu;
> > hpet_set_comparator(hdata->num, (u32)new_compare, (u32)period);
> 
> So with this you might get close to the point where you trip over the SMI
> induced madness where CPUs vanish for several milliseconds in some value
> add code. You really want to do a read back of the hpet to detect that. See
> the comment in the hpet code. RHEL 7/8 allow up to 768 logical CPUs

Do you mean adding a readback to check if the new compare value is
greater than the current count? Similar to the check at the end of
hpet_next_event():

return res < HPET_MIN_CYCLES ? -ETIME : 0;

In such a case, should it try to set the comparator again? I think it
should, as otherwise the hardlockup detector would stop working.
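
In code, such a readback would look roughly like this (an untested
sketch; hpet_set_comparator() as introduced earlier in this series and
HPET_MIN_CYCLES as defined in arch/x86/kernel/hpet.c):

static int kick_timer_checked(struct hpet_hld_data *hdata, u32 delta)
{
	u32 count, new_compare;

	count = hpet_readl(HPET_COUNTER);
	new_compare = count + delta;
	hpet_set_comparator(hdata->num, new_compare, 0);

	/*
	 * Read the counter back: if an SMI stalled the CPU long enough
	 * for the counter to pass the new comparator, the interrupt was
	 * missed and the comparator must be programmed again.
	 */
	count = hpet_readl(HPET_COUNTER);

	return (s32)(new_compare - count) < HPET_MIN_CYCLES ? -ETIME : 0;
}

The caller would then retry on -ETIME so that the detector keeps
running.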

Thanks and BR,
Ricardo
> 
> Thanks,
> 
>   tglx


Re: [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remapping table entry for watchdog

2019-06-18 Thread Ricardo Neri
On Mon, Jun 17, 2019 at 10:25:35AM +0200, Thomas Gleixner wrote:
> On Sun, 16 Jun 2019, Thomas Gleixner wrote:
> > On Thu, 23 May 2019, Ricardo Neri wrote:
> > > When the hardlockup detector is enabled, the function
> > > hld_hpet_intremap_activate_irq() activates the recently created entry
> > > in the interrupt remapping table via the modify_irte() function. While
> > > doing this, it specifies which CPU the interrupt must target via its APIC
> > > ID. This function can be called every time the destination ID of the
> > > interrupt needs to be updated; there is no need to allocate or remove
> > > entries in the interrupt remapping table.
> > 
> > Brilliant.
> > 
> > > +int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata)
> > > +{
> > > + u32 destid = apic->calc_dest_apicid(hdata->handling_cpu);
> > > + struct intel_ir_data *data;
> > > +
> > > + data = (struct intel_ir_data *)hdata->intremap_data;
> > > + data->irte_entry.dest_id = IRTE_DEST(destid);
> > > + return modify_irte(&data->irq_2_iommu, &data->irte_entry);
> > 
> > This calls modify_irte() which does at the very beginning:
> > 
> >raw_spin_lock_irqsave(&irq_2_ir_lock, flags);
> > 
> > How is that supposed to work from NMI context? Not to talk about the
> > other spinlocks which are taken in the subsequent call chain.
> > 
> > You cannot call in any of that code from NMI context.
> > 
> > The only reason why this never deadlocked in your testing is that nothing
> > else touched that particular iommu where the HPET hangs off concurrently.
> > 
> > But that's just pure luck and not design. 
> 
> And just for the record. I warned you about that problem during the review
> of an earlier version and told you to talk to IOMMU folks whether there is
> a way to update the entry w/o running into that lock problem.

I think I misunderstood your feedback. You did mention issues on locking
between NMI and !NMI contexts. However, that was in the context of using the
generic irq code to do things such as set the affinity of the interrupt and
requesting an irq. I understood that I should instead program things directly.
I extrapolated this to the IOMMU driver in which I also added code directly
instead of using the existing layering.

Also, at the time, the question regarding the IOMMU, as I understood it, was
whether it was possible to reserve an IOMMU remapping entry upfront. I believe
my patches achieve that, even if they are hacky and ugly and have locking
issues. I see now that the locking issues are also part of the IOMMU
discussion. Perhaps that was also implicit.
> 
> Can you tell my why am I actually reviewing patches and spending time on
> this when the result is ignored anyway?

Yes, Thomas, I should have checked with the IOMMU maintainers first on
the issues in the paragraph above. It was not my intention to waste your
time; your feedback has been valuable and has contributed to improving
the code.

> 
> I also tried to figure out why you went away from the IPI broadcast
> design. The only information I found is:
> 
> Changes vs. v1:
> 
>  * Brought back the round-robin mechanism proposed in v1 (this time not
>using the interrupt subsystem). This also requires to compute
>expiration times as in v1 (Andi Kleen, Stephane Eranian).
> 
> Great that there is no trace of any mail from Andi or Stephane about this
> on LKML. There is no problem with talking offlist about this stuff, but
> then you should at least provide a rationale for those who were not part of
> the private conversation.

Stephane has already commented on the rationale.

Thanks and BR,

Ricardo


Re: [RFC PATCH v4 05/21] x86/hpet: Reserve timer for the HPET hardlockup detector

2019-06-13 Thread Ricardo Neri
On Tue, Jun 11, 2019 at 09:54:25PM +0200, Thomas Gleixner wrote:
> On Thu, 23 May 2019, Ricardo Neri wrote:
> 
> > HPET timer 2 will be used to drive the HPET-based hardlockup detector.
> > Reserve such timer to ensure it cannot be used by user space programs or
> > for clock events.
> > 
> > When looking for MSI-capable timers for clock events, skip timer 2 if
> > the HPET hardlockup detector is selected.
> 
> Why? Both the changelog and the code change lack an explanation why this
> timer is actually touched after it got reserved for the platform. The
> reservation should make it inaccessible for other things.

hpet_reserve_platform_timers() will give the HPET char driver a data
structure which specifies which drivers are reserved. In this manner,
they cannot be used by applications via file opens. The timer used by
the hardlockup detector should be marked as reserved.

Also, hpet_msi_capability_lookup() populates another data structure
which is used when obtaining an unused timer for a HPET clock event.
The timer used by the hardlockup detector should not be included in such
data structure.

Is this the explanation you would like to see? If yes, I will include it
in the changelog.

Thanks and BR,
Ricardo



Re: [RFC PATCH v4 17/21] x86/tsc: Switch to perf-based hardlockup detector if TSC become unstable

2019-06-07 Thread Ricardo Neri
On Thu, Jun 06, 2019 at 05:35:51PM -0700, Stephane Eranian wrote:
> Hi Ricardo,

Hi Stephane,

> Thanks for your contribution here. It is very important to move the
> watchdog out of the PMU wherever possible.

Indeed, using the PMU for the hardlockup detector is still the default
option. This patch series proposes a new kernel command-line option to
switch to the HPET-based implementation.

> 
> On Thu, May 23, 2019 at 6:17 PM Ricardo Neri
>  wrote:
> >
> > The HPET-based hardlockup detector relies on the TSC to determine if an
> > observed NMI interrupt was originated by HPET timer. Hence, this detector
> > can no longer be used with an unstable TSC.
> >
> > In such case, permanently stop the HPET-based hardlockup detector and
> > start the perf-based detector.
> >
> > Signed-off-by: Ricardo Neri 
> > ---
> >  arch/x86/include/asm/hpet.h| 2 ++
> >  arch/x86/kernel/tsc.c  | 2 ++
> >  arch/x86/kernel/watchdog_hld.c | 7 +++
> >  3 files changed, 11 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
> > index fd99f2390714..a82cbe17479d 100644
> > --- a/arch/x86/include/asm/hpet.h
> > +++ b/arch/x86/include/asm/hpet.h
> > @@ -128,6 +128,7 @@ extern int hardlockup_detector_hpet_init(void);
> >  extern void hardlockup_detector_hpet_stop(void);
> >  extern void hardlockup_detector_hpet_enable(unsigned int cpu);
> >  extern void hardlockup_detector_hpet_disable(unsigned int cpu);
> > +extern void hardlockup_detector_switch_to_perf(void);
> >  #else
> >  static inline struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
> >  { return NULL; }
> > @@ -136,6 +137,7 @@ static inline int hardlockup_detector_hpet_init(void)
> >  static inline void hardlockup_detector_hpet_stop(void) {}
> >  static inline void hardlockup_detector_hpet_enable(unsigned int cpu) {}
> >  static inline void hardlockup_detector_hpet_disable(unsigned int cpu) {}
> > +static void harrdlockup_detector_switch_to_perf(void) {}
> >  #endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
> >
> This does not compile for me when CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
> is not enabled, because:
>   1- you have a typo on the function name
>   2- you are missing the inline keyword

I am sorry. This was an oversight on my side. I have corrected this in
preparation for a v5.

Thanks and BR,
Ricardo


[RFC PATCH v4 15/21] watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot parameter

2019-05-23 Thread Ricardo Neri
Keep the HPET-based hardlockup detector disabled unless explicitly enabled
via a command-line argument. If such a parameter is not given, the
initialization of the HPET-based hardlockup detector fails and the NMI
watchdog falls back to the perf-based implementation.

Given that __setup("nmi_watchdog=") is already used to control the behavior
of the NMI watchdog (via hardlockup_panic_setup()), it cannot be used to
control the HPET-based implementation. Instead, use a new
early_param("nmi_watchdog").
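
For instance, on a kernel built with CONFIG_X86_HARDLOCKUP_DETECTOR_HPET,
booting with

	nmi_watchdog=1,hpet

enables the NMI watchdog and selects the HPET implementation, while a
plain nmi_watchdog=1 keeps the default (perf or architecture-specific)
implementation.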

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Peter Zijlstra 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Mimi Zohar 
Cc: Jan Kiszka 
Cc: Nick Desaulniers 
Cc: Masahiro Yamada 
Cc: Nayna Jain 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 

--
checkpatch gives the following warning:

CHECK: __setup appears un-documented -- check Documentation/admin-guide/kernel-parameters.rst
+__setup("nmi_watchdog=", hardlockup_detector_hpet_setup);

This is a false-positive as the option nmi_watchdog is already
documented. The option is re-evaluated in this file as well.
---
 .../admin-guide/kernel-parameters.txt |  8 ++-
 arch/x86/kernel/watchdog_hld_hpet.c   | 22 +++
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 138f6664b2e2..17ed3dcda13e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2831,7 +2831,7 @@
Format: [state][,regs][,debounce][,die]
 
nmi_watchdog=   [KNL,BUGS=X86] Debugging features for SMP kernels
-   Format: [panic,][nopanic,][num]
+   Format: [panic,][nopanic,][num,][hpet]
Valid num: 0 or 1
0 - turn hardlockup detector in nmi_watchdog off
1 - turn hardlockup detector in nmi_watchdog on
@@ -2841,6 +2841,12 @@
please see 'nowatchdog'.
This is useful when you use a panic=... timeout and
need the box quickly up again.
+   When hpet is specified, the NMI watchdog will be driven
+   by an HPET timer, if available in the system. Otherwise,
+   it falls back to the default implementation (perf or
+   architecture-specific). Specifying hpet has no effect
+   if the NMI watchdog is not enabled (either at build time
+   or via the command line).
 
These settings can be accessed at runtime via
the nmi_watchdog and hardlockup_panic sysctls.
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index dcc50cd29374..76eed714a1cb 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -351,6 +351,28 @@ void hardlockup_detector_hpet_stop(void)
disable_timer(hld_data);
 }
 
+/**
+ * hardlockup_detector_hpet_setup() - Parse command-line parameters
+ * @str:   A string containing the kernel command line
+ *
+ * Parse the nmi_watchdog parameter from the kernel command line. If
+ * selected by the user, use this implementation to detect hardlockups.
+ */
+static int __init hardlockup_detector_hpet_setup(char *str)
+{
+   if (!str)
+   return -EINVAL;
+
+   if (parse_option_str(str, "hpet"))
+   hardlockup_use_hpet = true;
+
+   if (!nmi_watchdog_user_enabled && hardlockup_use_hpet)
+   pr_warn("Selecting HPET NMI watchdog has no effect with NMI 
watchdog disabled\n");
+
+   return 0;
+}
+early_param("nmi_watchdog", hardlockup_detector_hpet_setup);
+
 /**
  * hardlockup_detector_hpet_init() - Initialize the hardlockup detector
  *
-- 
2.17.1



[RFC PATCH v4 17/21] x86/tsc: Switch to perf-based hardlockup detector if TSC become unstable

2019-05-23 Thread Ricardo Neri
The HPET-based hardlockup detector relies on the TSC to determine if an
observed NMI interrupt was originated by HPET timer. Hence, this detector
can no longer be used with an unstable TSC.

In such case, permanently stop the HPET-based hardlockup detector and
start the perf-based detector.

Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hpet.h| 2 ++
 arch/x86/kernel/tsc.c  | 2 ++
 arch/x86/kernel/watchdog_hld.c | 7 +++
 3 files changed, 11 insertions(+)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index fd99f2390714..a82cbe17479d 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -128,6 +128,7 @@ extern int hardlockup_detector_hpet_init(void);
 extern void hardlockup_detector_hpet_stop(void);
 extern void hardlockup_detector_hpet_enable(unsigned int cpu);
 extern void hardlockup_detector_hpet_disable(unsigned int cpu);
+extern void hardlockup_detector_switch_to_perf(void);
 #else
 static inline struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
 { return NULL; }
@@ -136,6 +137,7 @@ static inline int hardlockup_detector_hpet_init(void)
 static inline void hardlockup_detector_hpet_stop(void) {}
 static inline void hardlockup_detector_hpet_enable(unsigned int cpu) {}
 static inline void hardlockup_detector_hpet_disable(unsigned int cpu) {}
+static void harrdlockup_detector_switch_to_perf(void) {}
 #endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
 
 #else /* CONFIG_HPET_TIMER */
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 59b57605e66c..b2210728ce3d 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1158,6 +1158,8 @@ void mark_tsc_unstable(char *reason)
 
clocksource_mark_unstable(&clocksource_tsc_early);
clocksource_mark_unstable(&clocksource_tsc);
+
+   hardlockup_detector_switch_to_perf();
 }
 
 EXPORT_SYMBOL_GPL(mark_tsc_unstable);
diff --git a/arch/x86/kernel/watchdog_hld.c b/arch/x86/kernel/watchdog_hld.c
index c2512d4c79c5..c8547c227a41 100644
--- a/arch/x86/kernel/watchdog_hld.c
+++ b/arch/x86/kernel/watchdog_hld.c
@@ -76,3 +76,10 @@ void watchdog_nmi_stop(void)
if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET)
hardlockup_detector_hpet_stop();
 }
+
+void hardlockup_detector_switch_to_perf(void)
+{
+   detector_type = X86_HARDLOCKUP_DETECTOR_PERF;
+   hardlockup_detector_hpet_stop();
+   hardlockup_start_all();
+}
-- 
2.17.1



[RFC PATCH v4 19/21] iommu/vt-d: Rework prepare_irte() to support per-irq delivery mode

2019-05-23 Thread Ricardo Neri
A recent change introduced a new member to struct irq_cfg to specify the
delivery mode of an interrupt. Supporting the configuration of the
delivery mode would require adding a third argument to prepare_irte().
Instead, simply take a pointer to a irq_cfg data structure as a the only
argument.

Internally, configure the delivery mode of the Interrupt Remapping Table
Entry as specified in the irq_cfg data structure and not as the APIC
setting.

This change does not change the existing behavior, as the delivery mode
of the APIC is used to configure irq_cfg data structure.

Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Borislav Petkov 
Cc: Jacob Pan 
Cc: Joerg Roedel 
Cc: Juergen Gross 
Cc: Bjorn Helgaas 
Cc: Wincy Van 
Cc: Kate Stewart 
Cc: Philippe Ombredanne 
Cc: "Eric W. Biederman" 
Cc: Baoquan He 
Cc: Jan Kiszka 
Cc: Lu Baolu 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Ricardo Neri 
---
 drivers/iommu/intel_irq_remapping.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 4160aa9f3f80..2e61eaca7d7e 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -1072,7 +1072,7 @@ static int reenable_irq_remapping(int eim)
return -1;
 }
 
-static void prepare_irte(struct irte *irte, int vector, unsigned int dest)
+static void prepare_irte(struct irte *irte, struct irq_cfg *irq_cfg)
 {
memset(irte, 0, sizeof(*irte));
 
@@ -1086,9 +1086,9 @@ static void prepare_irte(struct irte *irte, int vector, unsigned int dest)
 * irq migration in the presence of interrupt-remapping.
*/
irte->trigger_mode = 0;
-   irte->dlvry_mode = apic->irq_delivery_mode;
-   irte->vector = vector;
-   irte->dest_id = IRTE_DEST(dest);
+   irte->dlvry_mode = irq_cfg->delivery_mode;
+   irte->vector = irq_cfg->vector;
+   irte->dest_id = IRTE_DEST(irq_cfg->dest_apicid);
irte->redir_hint = 1;
 }
 
@@ -1265,7 +1265,7 @@ static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data,
struct irte *irte = &data->irte_entry;
struct msi_msg *msg = &data->msi_entry;
 
-   prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid);
+   prepare_irte(irte, irq_cfg);
switch (info->type) {
case X86_IRQ_ALLOC_TYPE_IOAPIC:
/* Set source-id of interrupt request */
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector

2019-05-23 Thread Ricardo Neri
 functions (Thomas Gleixner).
 * Added a new category of NMI handler, NMI_WATCHDOG, which executes after
   NMI_LOCAL handlers (Andi Kleen).
 * Updated handling of "nmi_watchdog" to support comma-separated
   arguments.
 * Undid split of the generic hardlockup detector into a separate file
   (Thomas Gleixner).
 * Added a new intermediate symbol CONFIG_HARDLOCKUP_DETECTOR_CORE to
   select generic parts of the detector (Paul E. McKenney,
   Thomas Gleixner).
 * Removed use of struct cpumask in favor of a variable length array in
   conjunction with kzalloc (Peter Zijlstra).
 * Added CPU as argument hardlockup_detector_hpet_enable()/disable()
   (Thomas Gleixner).
 * Remove unnecessary export of function declarations, flags and bit
   fields (Thomas Gleixner).
 * Removed  unnecessary check for FSB support when reserving timer for the
   detector (Thomas Gleixner).
 * Separated TSC code from HPET code in kick_timer() (Thomas Gleixner).
 * Reworked condition to check if the expected TSC value is within the
   error margin to avoid conditional (Peter Zijlstra).
 * Removed TSC error margin from struct hld_data; use global variable
   instead (Peter Zijlstra).
 * Removed previously introduced watchdog_get_allowed_cpumask*() and
   reworked hardlockup_detector_hpet_enable()/disable() to not need
   access to watchdog_allowed_mask (Thomas Gleixner).

Changes since v1:

 * Removed reads to HPET registers at every NMI. Instead use the time-stamp
   counter to infer the interrupt source (Thomas Gleixner, Andi Kleen).
 * Do not target CPUs in a round-robin manner. Instead, the HPET timer
   always targets the same CPU; other CPUs are monitored via an
   interprocessor interrupt.
 * Removed use of generic irq code to set interrupt affinity and NMI
   delivery. Instead, configure the interrupt directly in HPET registers
   (Thomas Gleixner).
 * Removed the proposed ops structure for NMI watchdogs. Instead, split
   the existing implementation into a generic library and perf-specific
   infrastructure (Thomas Gleixner, Nicholas Piggin).
 * Added an x86-specific shim hardlockup detector that selects between
   HPET and perf infrastructures as needed (Nicholas Piggin).
 * Removed locks taken in NMI and !NMI context. This was wrong and is no
   longer needed (Thomas Gleixner).
 * Fixed unconditional return NMI_HANDLED when the HPET timer is programmed
   for FSB/MSI delivery (Peter Zijlstra).

References:

[1]. https://lkml.org/lkml/2018/6/12/1027
[2]. https://lkml.org/lkml/2019/2/27/402
[3]. https://lkml.org/lkml/2019/5/14/386

Ricardo Neri (21):
  x86/msi: Add definition for NMI delivery mode
  x86/hpet: Expose hpet_writel() in header
  x86/hpet: Calculate ticks-per-second in a separate function
  x86/hpet: Add hpet_set_comparator() for periodic and one-shot modes
  x86/hpet: Reserve timer for the HPET hardlockup detector
  x86/hpet: Configure the timer used by the hardlockup detector
  watchdog/hardlockup: Define a generic function to detect hardlockups
  watchdog/hardlockup: Decouple the hardlockup detector from perf
  x86/nmi: Add a NMI_WATCHDOG NMI handler category
  watchdog/hardlockup: Add function to enable NMI watchdog on all
allowed CPUs at once
  x86/watchdog/hardlockup: Add an HPET-based hardlockup detector
  watchdog/hardlockup/hpet: Adjust timer expiration on the number of
monitored CPUs
  x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI
  watchdog/hardlockup: Use parse_option_str() to handle "nmi_watchdog"
  watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot
parameter
  x86/watchdog: Add a shim hardlockup detector
  x86/tsc: Switch to perf-based hardlockup detector if TSC become
unstable
  x86/apic: Add a parameter for the APIC delivery mode
  iommu/vt-d: Rework prepare_irte() to support per-irq delivery mode
  iommu/vt-d: hpet: Reserve an interrupt remapping table entry for
watchdog
  x86/watchdog/hardlockup/hpet: Support interrupt remapping

 .../admin-guide/kernel-parameters.txt |   8 +-
 arch/x86/Kconfig.debug|  15 +
 arch/x86/include/asm/hpet.h   |  47 ++
 arch/x86/include/asm/hw_irq.h |   5 +-
 arch/x86/include/asm/msidef.h |   4 +
 arch/x86/include/asm/nmi.h|   1 +
 arch/x86/kernel/Makefile  |   2 +
 arch/x86/kernel/apic/vector.c |  10 +
 arch/x86/kernel/hpet.c| 115 -
 arch/x86/kernel/nmi.c |  10 +
 arch/x86/kernel/tsc.c |   2 +
 arch/x86/kernel/watchdog_hld.c|  85 
 arch/x86/kernel/watchdog_hld_hpet.c   | 453 ++
 drivers/char/hpet.c   |  31 +-
 drivers/iommu/intel_irq_remapping.c   |  59 ++-
 include/linux/hpet.h  |   1 +
 include/linux/nmi.h   |   8 +-
 kernel/Makefile   |   2 +

[RFC PATCH v4 16/21] x86/watchdog: Add a shim hardlockup detector

2019-05-23 Thread Ricardo Neri
The generic hardlockup detector is based on perf. It also provides a set
of weak stubs that CPU architectures can override. Add a shim hardlockup
detector for x86 that selects between perf and hpet implementations.

Specifically, this shim implementation is needed for the HPET-based
hardlockup detector; it can also be used for future implementations.

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Peter Zijlstra 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Mimi Zohar 
Cc: Jan Kiszka 
Cc: Nick Desaulniers 
Cc: Masahiro Yamada 
Cc: Nayna Jain 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Suggested-by: Nicholas Piggin 
Signed-off-by: Ricardo Neri 
---
 arch/x86/Kconfig.debug |  4 ++
 arch/x86/kernel/Makefile   |  1 +
 arch/x86/kernel/watchdog_hld.c | 78 ++
 3 files changed, 83 insertions(+)
 create mode 100644 arch/x86/kernel/watchdog_hld.c

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 445bbb188f10..52c77e2145c9 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -169,11 +169,15 @@ config IOMMU_LEAK
 config HAVE_MMIOTRACE_SUPPORT
def_bool y
 
+config X86_HARDLOCKUP_DETECTOR
+   bool
+
 config X86_HARDLOCKUP_DETECTOR_HPET
bool "Use HPET Timer for Hard Lockup Detection"
select SOFTLOCKUP_DETECTOR
select HARDLOCKUP_DETECTOR
select HARDLOCKUP_DETECTOR_CORE
+   select X86_HARDLOCKUP_DETECTOR
depends on HPET_TIMER && HPET && X86_64
help
  Say y to enable a hardlockup detector that is driven by a High-
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 3ad55de67e8b..e60244b8a8ec 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -106,6 +106,7 @@ obj-$(CONFIG_VM86)  += vm86_32.o
 obj-$(CONFIG_EARLY_PRINTK) += early_printk.o
 
 obj-$(CONFIG_HPET_TIMER)   += hpet.o
+obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR) += watchdog_hld.o
 obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET) += watchdog_hld_hpet.o
 obj-$(CONFIG_APB_TIMER)+= apb_timer.o
 
diff --git a/arch/x86/kernel/watchdog_hld.c b/arch/x86/kernel/watchdog_hld.c
new file mode 100644
index ..c2512d4c79c5
--- /dev/null
+++ b/arch/x86/kernel/watchdog_hld.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A shim hardlockup detector. It overrides the weak stubs of the generic
+ * implementation to select between the perf- and the hpet-based implementations.
+ *
+ * Copyright (C) Intel Corporation 2019
+ */
+
+#include 
+#include 
+
+enum x86_hardlockup_detector {
+   X86_HARDLOCKUP_DETECTOR_PERF,
+   X86_HARDLOCKUP_DETECTOR_HPET,
+};
+
+static enum x86_hardlockup_detector detector_type __read_mostly;
+
+int watchdog_nmi_enable(unsigned int cpu)
+{
+   if (detector_type == X86_HARDLOCKUP_DETECTOR_PERF) {
+   hardlockup_detector_perf_enable();
+   return 0;
+   }
+
+   if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET) {
+   hardlockup_detector_hpet_enable(cpu);
+   return 0;
+   }
+
+   return -ENODEV;
+}
+
+void watchdog_nmi_disable(unsigned int cpu)
+{
+   if (detector_type == X86_HARDLOCKUP_DETECTOR_PERF) {
+   hardlockup_detector_perf_disable();
+   return;
+   }
+
+   if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET) {
+   hardlockup_detector_hpet_disable(cpu);
+   return;
+   }
+}
+
+int __init watchdog_nmi_probe(void)
+{
+   int ret;
+
+   /*
+* Try first with the HPET hardlockup detector. It will only
+* succeed if selected at build time and the nmi_watchdog
+* command-line parameter is configured. This ensures that the
+* perf-based detector is used by default, if selected at
+* build time.
+*/
+   ret = hardlockup_detector_hpet_init();
+   if (!ret) {
+   detector_type = X86_HARDLOCKUP_DETECTOR_HPET;
+   return ret;
+   }
+
+   ret = hardlockup_detector_perf_init();
+   if (!ret) {
+   detector_type = X86_HARDLOCKUP_DETECTOR_PERF;
+   return ret;
+   }
+
+   return ret;
+}
+
+void watchdog_nmi_stop(void)
+{
+   /* Only the HPET lockup detector defines a stop function. */
+   if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET)
+   hardlockup_detector_hpet_stop();
+}
-- 
2.17.1



[RFC PATCH v4 11/21] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector

2019-05-23 Thread Ricardo Neri
This is the initial implementation of a hardlockup detector driven by an
HPET timer. This initial implementation includes functions to control the
timer via its registers. It also requests such timer, installs an NMI
interrupt handler and performs the initial configuration of the timer.

The detector is not functional at this stage. A subsequent changeset will
invoke the interfaces provided by this detector, as well as add the
functionality to determine if the HPET timer caused the NMI.

In order to detect hardlockups in all the monitored CPUs, move the
interrupt to the next monitored CPU while handling the NMI interrupt; wrap
around when reaching the highest CPU in the mask. This rotation is
achieved by setting the affinity mask to only contain the next CPU to
monitor. A cpumask keeps track of all the CPUs that need to be monitored.
Such cpumask is updated when the watchdog is enabled or disabled in a
particular CPU.

This detector relies on an HPET timer that is capable of using Front Side
Bus interrupts. In order to avoid using the generic interrupt code,
program the MSI message register of the HPET timer directly.

HPET registers are only accessed to kick the timer after looking for
hardlockups. This happens every watchdog_thresh seconds. A subsequent
changeset will determine whether the HPET timer caused the interrupt based
on the value of the time-stamp counter. For now, just add a stub function.
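
The rotation described above boils down to something like this (a
sketch; update_msi_destid() and the struct members are part of this
detector):

static void pick_next_cpu(struct hpet_hld_data *hdata)
{
	struct cpumask *mask = to_cpumask(hdata->cpu_monitored_mask);

	/* Advance to the next monitored CPU; wrap around at the end. */
	hdata->handling_cpu = cpumask_next(hdata->handling_cpu, mask);
	if (hdata->handling_cpu >= nr_cpu_ids)
		hdata->handling_cpu = cpumask_first(mask);

	/* Retarget the HPET MSI interrupt to the new CPU. */
	update_msi_destid(hdata);
}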

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Peter Zijlstra 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: "Ravi V. Shankar" 
Cc: Mimi Zohar 
Cc: Jan Kiszka 
Cc: Nick Desaulniers 
Cc: Masahiro Yamada 
Cc: Nayna Jain 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/Kconfig.debug  |  11 +
 arch/x86/include/asm/hpet.h |  13 ++
 arch/x86/kernel/Makefile|   1 +
 arch/x86/kernel/hpet.c  |   3 +-
 arch/x86/kernel/watchdog_hld_hpet.c | 335 
 5 files changed, 362 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/watchdog_hld_hpet.c

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index f730680dc818..445bbb188f10 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -169,6 +169,17 @@ config IOMMU_LEAK
 config HAVE_MMIOTRACE_SUPPORT
def_bool y
 
+config X86_HARDLOCKUP_DETECTOR_HPET
+   bool "Use HPET Timer for Hard Lockup Detection"
+   select SOFTLOCKUP_DETECTOR
+   select HARDLOCKUP_DETECTOR
+   select HARDLOCKUP_DETECTOR_CORE
+   depends on HPET_TIMER && HPET && X86_64
+   help
+ Say y to enable a hardlockup detector that is driven by a High-
+ Precision Event Timer. This option is helpful to not use counters
+ from the Performance Monitoring Unit to drive the detector.
+
 config X86_DECODER_SELFTEST
bool "x86 instruction decoder selftest"
depends on DEBUG_KERNEL && KPROBES
diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 20abdaa5372d..31fc27508cf3 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -114,12 +114,25 @@ struct hpet_hld_data {
boolhas_periodic;
u32 num;
u64 ticks_per_second;
+   u32 handling_cpu;
+   u32 enabled_cpus;
+   struct msi_msg  msi_msg;
+   unsigned long   cpu_monitored_mask[0];
 };
 
 extern struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void);
+extern int hardlockup_detector_hpet_init(void);
+extern void hardlockup_detector_hpet_stop(void);
+extern void hardlockup_detector_hpet_enable(unsigned int cpu);
+extern void hardlockup_detector_hpet_disable(unsigned int cpu);
 #else
 static inline struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
 { return NULL; }
+static inline int hardlockup_detector_hpet_init(void)
+{ return -ENODEV; }
+static inline void hardlockup_detector_hpet_stop(void) {}
+static inline void hardlockup_detector_hpet_enable(unsigned int cpu) {}
+static inline void hardlockup_detector_hpet_disable(unsigned int cpu) {}
 #endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
 
 #else /* CONFIG_HPET_TIMER */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 3578ad248bc9..3ad55de67e8b 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -106,6 +106,7 @@ obj-$(CONFIG_VM86)  += vm86_32.o
 obj-$(CONFIG_EARLY_PRINTK) += early_printk.o
 
 obj-$(CONFIG_HPET_TIMER)   += hpet.o
+obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET) += watchdog_hld_hpet.o
 obj-$(CONFIG_APB_TIMER)+= apb_timer.o
 
 obj-$(CONFIG_AMD_NB)   += amd_nb.o
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 5f9209949fc7..dd3bb664a188 100644
--- a/arch

[RFC PATCH v4 13/21] x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI

2019-05-23 Thread Ricardo Neri
The only direct method to determine whether an HPET timer caused an
interrupt is to read the Interrupt Status register. Unfortunately,
reading HPET registers is slow and, therefore, it is not recommended to
read them while in NMI context. Furthermore, status is not available if
the interrupt is generated via the Front Side Bus.

An indirect manner to infer if the non-maskable interrupt we see was
caused by the HPET timer is to use the time-stamp counter. Compute the
value that the time-stamp counter should have at the next interrupt of the
HPET timer. Since the hardlockup detector operates in seconds, high
precision is not needed. This implementation considers that the HPET
caused the NMI if the time-stamp counter reads the expected value +/- 1.5%.
This value is selected as it is equivalent to 1/64 and the division can be
performed using a bit shift operation. Experimentally, the error in the
estimation is consistently less than 1%.

The computation of the expected value of the time-stamp counter must be
performed in relation to watchdog_thresh divided by the number of
monitored CPUs. This quantity is stored in tsc_ticks_per_cpu and must be
updated whenever the number of monitored CPUs changes.
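
As a worked example: with a 3 GHz TSC, a single monitored CPU and
watchdog_thresh = 10, tsc_delta is 3 * 10^10 cycles and the allowed
error is tsc_delta >> 6, about 4.7 * 10^8 cycles (~1.5625% of the
interval). The check in NMI context then reduces to a single unsigned
comparison, as in the hunk below:

	/* True if tsc_curr lies within [tsc_next - err, tsc_next + err). */
	u64 err = tsc_delta >> 6;
	bool from_hpet = (tsc_curr - hdata->tsc_next) + err < 2 * err;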

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Peter Zijlstra 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Mimi Zohar 
Cc: Jan Kiszka 
Cc: Nick Desaulniers 
Cc: Masahiro Yamada 
Cc: Nayna Jain 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Suggested-by: Andi Kleen 
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hpet.h |  2 ++
 arch/x86/kernel/watchdog_hld_hpet.c | 27 ++-
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 64acacce095d..fd99f2390714 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -115,6 +115,8 @@ struct hpet_hld_data {
u32 num;
u64 ticks_per_second;
u64 ticks_per_cpu;
+   u64 tsc_next;
+   u64 tsc_ticks_per_cpu;
u32 handling_cpu;
u32 enabled_cpus;
struct msi_msg  msi_msg;
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index 74aeb0535d08..dcc50cd29374 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -24,6 +24,7 @@
 
 static struct hpet_hld_data *hld_data;
 static bool hardlockup_use_hpet;
+static u64 tsc_next_error;
 
 /**
  * kick_timer() - Reprogram timer to expire in the future
@@ -33,11 +34,22 @@ static bool hardlockup_use_hpet;
  * Reprogram the timer to expire within watchdog_thresh seconds in the future.
  * If the timer supports periodic mode, it is not kicked unless @force is
  * true.
+ *
+ * Also, compute the expected value of the time-stamp counter at the time of
+ * expiration as well as a deviation from the expected value. The maximum
+ * deviation is of ~1.5%. This deviation can be easily computed by shifting
+ * by 6 positions the delta between the current and expected time-stamp values.
  */
 static void kick_timer(struct hpet_hld_data *hdata, bool force)
 {
+   u64 tsc_curr, tsc_delta, new_compare, count, period = 0;
bool kick_needed = force || !(hdata->has_periodic);
-   u64 new_compare, count, period = 0;
+
+   tsc_curr = rdtsc();
+
+   tsc_delta = (unsigned long)watchdog_thresh * hdata->tsc_ticks_per_cpu;
+   hdata->tsc_next = tsc_curr + tsc_delta;
+   tsc_next_error = tsc_delta >> 6;
 
/*
 * Update the comparator in increments of watch_thresh seconds relative
@@ -93,6 +105,15 @@ static void enable_timer(struct hpet_hld_data *hdata)
  */
 static bool is_hpet_wdt_interrupt(struct hpet_hld_data *hdata)
 {
+   if (smp_processor_id() == hdata->handling_cpu) {
+   u64 tsc_curr;
+
+   tsc_curr = rdtsc();
+
+   return (tsc_curr - hdata->tsc_next) + tsc_next_error <
+  2 * tsc_next_error;
+   }
+
return false;
 }
 
@@ -260,6 +281,10 @@ static void update_ticks_per_cpu(struct hpet_hld_data *hdata)
 
do_div(temp, hdata->enabled_cpus);
hdata->ticks_per_cpu = temp;
+
+   temp = (unsigned long)tsc_khz * 1000L;
+   do_div(temp, hdata->enabled_cpus);
+   hdata->tsc_ticks_per_cpu = temp;
 }
 
 /**
-- 
2.17.1



[RFC PATCH v4 14/21] watchdog/hardlockup: Use parse_option_str() to handle "nmi_watchdog"

2019-05-23 Thread Ricardo Neri
Prepare hardlockup_panic_setup() to handle a comma-separated list of
options. This is needed to pass options to specific implementations of the
hardlockup detector.

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Peter Zijlstra 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Mimi Zohar 
Cc: Jan Kiszka 
Cc: Nick Desaulniers 
Cc: Masahiro Yamada 
Cc: Nayna Jain 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 kernel/watchdog.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index be589001200a..fd50049449ec 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -70,13 +70,13 @@ void __init hardlockup_detector_disable(void)
 
 static int __init hardlockup_panic_setup(char *str)
 {
-   if (!strncmp(str, "panic", 5))
+   if (parse_option_str(str, "panic"))
hardlockup_panic = 1;
-   else if (!strncmp(str, "nopanic", 7))
+   else if (parse_option_str(str, "nopanic"))
hardlockup_panic = 0;
-   else if (!strncmp(str, "0", 1))
+   else if (parse_option_str(str, "0"))
nmi_watchdog_user_enabled = 0;
-   else if (!strncmp(str, "1", 1))
+   else if (parse_option_str(str, "1"))
nmi_watchdog_user_enabled = 1;
return 1;
 }
-- 
2.17.1



[RFC PATCH v4 21/21] x86/watchdog/hardlockup/hpet: Support interrupt remapping

2019-05-23 Thread Ricardo Neri
When interrupt remapping is enabled in the system, the MSI interrupt
message must follow a special format the IOMMU can understand. Hence,
utilize the functionality provided by the IOMMU driver for such purpose.

The first step is to determine whether interrupt remapping is enabled
by looking for the existence of an interrupt remapping domain. If it
exists, let the IOMMU driver compose the MSI message for us. The hard-
lockup detector is still responsible for writing the message to the
HPET FSB route register.

Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Borislav Petkov 
Cc: Jacob Pan 
Cc: Joerg Roedel 
Cc: Juergen Gross 
Cc: Bjorn Helgaas 
Cc: Wincy Van 
Cc: Kate Stewart 
Cc: Philippe Ombredanne 
Cc: "Eric W. Biederman" 
Cc: Baoquan He 
Cc: Jan Kiszka 
Cc: Lu Baolu 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/kernel/watchdog_hld_hpet.c | 33 -
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index 76eed714a1cb..a266439fdb9e 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 static struct hpet_hld_data *hld_data;
@@ -117,6 +118,25 @@ static bool is_hpet_wdt_interrupt(struct hpet_hld_data *hdata)
return false;
 }
 
+/** irq_remapping_enabled() - Detect if interrupt remapping is enabled
+ * @hdata: A data structure with the HPET block id
+ *
+ * Determine if the HPET block used by the hardlockup detector is under
+ * a remapped interrupt domain.
+ *
+ * Returns: True if interrupt remapping is enabled. False otherwise.
+ */
+static bool irq_remapping_enabled(struct hpet_hld_data *hdata)
+{
+   struct irq_alloc_info info;
+
+   init_irq_alloc_info(&info, NULL);
+   info.type = X86_IRQ_ALLOC_TYPE_HPET;
+   info.hpet_id = hdata->blockid;
+
+   return !!irq_remapping_get_ir_irq_domain(&info);
+}
+
 /**
  * compose_msi_msg() - Populate address and data fields of an MSI message
 * @hdata: A data structure with the message to populate
@@ -161,6 +181,9 @@ static int update_msi_destid(struct hpet_hld_data *hdata)
 {
u32 destid;
 
+   if (irq_remapping_enabled(hdata))
+   return hld_hpet_intremap_activate_irq(hdata);
+
hdata->msi_msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
destid = apic->calc_dest_apicid(hdata->handling_cpu);
hdata->msi_msg.address_lo |= MSI_ADDR_DEST_ID(destid);
@@ -217,9 +240,17 @@ static int hardlockup_detector_nmi_handler(unsigned int type,
  */
 static int setup_irq_msi_mode(struct hpet_hld_data *hdata)
 {
+   s32 ret;
u32 v;
 
-   compose_msi_msg(hdata);
+   if (irq_remapping_enabled(hdata)) {
+   ret = hld_hpet_intremap_alloc_irq(hdata);
+   if (ret)
+   return ret;
+   } else {
+   compose_msi_msg(hdata);
+   }
+
hpet_writel(hdata->msi_msg.data, HPET_Tn_ROUTE(hdata->num));
hpet_writel(hdata->msi_msg.address_lo, HPET_Tn_ROUTE(hdata->num) + 4);
 
-- 
2.17.1



[RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remapping table entry for watchdog

2019-05-23 Thread Ricardo Neri
When interrupt remapping is enabled, MSI interrupt messages must follow a
special format that the IOMMU can understand. Hence, when the HPET hard
lockup detector is used with interrupt remapping, it must also follow this
special format.

The IOMMU, given the information about a particular interrupt, already
knows how to populate the MSI message with this special format and the
corresponding entry in the interrupt remapping table. Given that this is a
special interrupt case, we want to avoid the interrupt subsystem. Add two
functions to create an entry for the HPET hard lockup detector. Perform
this process in two steps as described below.

When initializing the lockup detector, the function
hld_hpet_intremap_alloc_irq() permanently allocates a new entry in the
interrupt remapping table and populates it with the information the
IOMMU driver needs. In order to populate the table, the IOMMU needs to
know the HPET block ID as described in the ACPI table. Hence, add such
ID to the data of the hardlockup detector.

When the hardlockup detector is enabled, the function
hld_hpet_intremap_activate_irq() activates the recently created entry
in the interrupt remapping table via the modify_irte() function. While
doing this, it specifies which CPU the interrupt must target via its APIC
ID. This function can be called every time the destination ID of the
interrupt needs to be updated; there is no need to allocate or remove
entries in the interrupt remapping table.
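
In terms of call flow, the two steps land roughly as follows (a sketch
based on the hunks in this patch and in the detector):

	/* Once, when the detector sets up its interrupt: */
	ret = hld_hpet_intremap_alloc_irq(hdata);	/* reserve and fill the IRTE */

	/* Every time the NMI must target another CPU: */
	hdata->handling_cpu = next_cpu;
	ret = hld_hpet_intremap_activate_irq(hdata);	/* rewrite dest_id only */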

Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Borislav Petkov 
Cc: Jacob Pan 
Cc: Joerg Roedel 
Cc: Juergen Gross 
Cc: Bjorn Helgaas 
Cc: Wincy Van 
Cc: Kate Stewart 
Cc: Philippe Ombredanne 
Cc: "Eric W. Biederman" 
Cc: Baoquan He 
Cc: Jan Kiszka 
Cc: Lu Baolu 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hpet.h | 11 +++
 arch/x86/kernel/hpet.c  |  1 +
 drivers/iommu/intel_irq_remapping.c | 49 +
 3 files changed, 61 insertions(+)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index a82cbe17479d..811051fa7ade 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -119,6 +119,8 @@ struct hpet_hld_data {
u64 tsc_ticks_per_cpu;
u32 handling_cpu;
u32 enabled_cpus;
+   u8  blockid;
+   void*intremap_data;
struct msi_msg  msi_msg;
unsigned long   cpu_monitored_mask[0];
 };
@@ -129,6 +131,15 @@ extern void hardlockup_detector_hpet_stop(void);
 extern void hardlockup_detector_hpet_enable(unsigned int cpu);
 extern void hardlockup_detector_hpet_disable(unsigned int cpu);
 extern void hardlockup_detector_switch_to_perf(void);
+#ifdef CONFIG_IRQ_REMAP
+extern int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata);
+extern int hld_hpet_intremap_alloc_irq(struct hpet_hld_data *hdata);
+#else
+static inline int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata)
+{ return -ENODEV; }
+static inline int hld_hpet_intremap_alloc_irq(struct hpet_hld_data *hdata)
+{ return -ENODEV; }
+#endif /* CONFIG_IRQ_REMAP */
 #else
 static inline struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
 { return NULL; }
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index dd3bb664a188..ddc9be81a075 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -202,6 +202,7 @@ struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
 */
temp = (u64)cfg << HPET_COUNTER_CLK_PERIOD_SHIFT;
hdata->ticks_per_second = hpet_get_ticks_per_sec(temp);
+   hdata->blockid = hpet_blockid;
 
return hdata;
 }
diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 2e61eaca7d7e..256466dd30cb 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "irq_remapping.h"
 
@@ -1516,3 +1517,51 @@ int dmar_ir_hotplug(struct dmar_drhd_unit *dmaru, bool insert)
 
return ret;
 }
+
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata)
+{
+   u32 destid = apic->calc_dest_apicid(hdata->handling_cpu);
+   struct intel_ir_data *data;
+
+   data = (struct intel_ir_data *)hdata->intremap_data;
+   data->irte_entry.dest_id = IRTE_DEST(destid);
+   return modify_irte(&data->irq_2_iommu, &data->irte_entry);
+}
+
+int hld_hpet_intremap_alloc_irq(struct hpet_hld_data *hdata)
+{
+   struct intel_ir_data *data;
+   struct irq_alloc_info info;
+   struct intel_iommu *iommu;
+   struct irq_cfg irq_cfg;
+   int index;
+
+   iommu = map_hpet_to_ir(hdata->blockid);
+   if (

[RFC PATCH v4 10/21] watchdog/hardlockup: Add function to enable NMI watchdog on all allowed CPUs at once

2019-05-23 Thread Ricardo Neri
When there is more than one implementation of the NMI watchdog, there may
be situations in which switching from one to another is needed (e.g., if
the time-stamp counter becomes unstable, the HPET-based NMI watchdog can
no longer be used).

The perf-based implementation of the hardlockup detector makes use of
various per-CPU variables which are accessed via this_cpu operations.
Hence, each CPU needs to enable its own NMI watchdog if using the perf
implementation.

Add functionality to switch from one NMI watchdog to another and do it
from each allowed CPU.

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: "Rafael J. Wysocki" 
Cc: Don Zickus 
Cc: Nicholas Piggin 
Cc: Michael Ellerman 
Cc: Frederic Weisbecker 
Cc: Alexei Starovoitov 
Cc: Babu Moger 
Cc: "David S. Miller" 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Mathieu Desnoyers 
Cc: Masami Hiramatsu 
Cc: Peter Zijlstra 
Cc: Andrew Morton 
Cc: Philippe Ombredanne 
Cc: Colin Ian King 
Cc: Byungchul Park 
Cc: "Paul E. McKenney" 
Cc: "Luis R. Rodriguez" 
Cc: Waiman Long 
Cc: Josh Poimboeuf 
Cc: Randy Dunlap 
Cc: Davidlohr Bueso 
Cc: Marc Zyngier 
Cc: Kai-Heng Feng 
Cc: Konrad Rzeszutek Wilk 
Cc: David Rientjes 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Signed-off-by: Ricardo Neri 
---
 include/linux/nmi.h |  2 ++
 kernel/watchdog.c   | 15 +++
 2 files changed, 17 insertions(+)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index e5f1a86e20b7..6d828334348b 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -83,9 +83,11 @@ static inline void reset_hung_task_detector(void) { }
 
 #if defined(CONFIG_HARDLOCKUP_DETECTOR)
 extern void hardlockup_detector_disable(void);
+extern void hardlockup_start_all(void);
 extern unsigned int hardlockup_panic;
 #else
 static inline void hardlockup_detector_disable(void) {}
+static inline void hardlockup_start_all(void) {}
 #endif
 
 #if defined(CONFIG_HAVE_NMI_WATCHDOG) || defined(CONFIG_HARDLOCKUP_DETECTOR)
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 7f9e7b9306fe..be589001200a 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -566,6 +566,21 @@ int lockup_detector_offline_cpu(unsigned int cpu)
return 0;
 }
 
+static int hardlockup_start_fn(void *data)
+{
+   watchdog_nmi_enable(smp_processor_id());
+   return 0;
+}
+
+void hardlockup_start_all(void)
+{
+   int cpu;
+
+   cpumask_copy(&watchdog_allowed_mask, &watchdog_cpumask);
+   for_each_cpu(cpu, &watchdog_allowed_mask)
+   smp_call_on_cpu(cpu, hardlockup_start_fn, NULL, false);
+}
+
 static void lockup_detector_reconfigure(void)
 {
cpus_read_lock();
-- 
2.17.1



[RFC PATCH v4 12/21] watchdog/hardlockup/hpet: Adjust timer expiration on the number of monitored CPUs

2019-05-23 Thread Ricardo Neri
Each CPU should be monitored for hardlockups every watchdog_thresh seconds.
Since all the CPUs in the system are monitored by the same timer and the
timer interrupt is rotated among the monitored CPUs, the timer must expire
every watchdog_thresh/N seconds, where N is the number of monitored CPUs.
Use the new member of struct hld_data, ticks_per_cpu, to store the
aforementioned quantity.

The ticks-per-CPU quantity is updated every time the number of monitored
CPUs changes: when the watchdog is enabled or disabled for a specific CPU.
If the timer is used in periodic mode, it needs to be adjusted to reflect
the new expected expiration.
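
For example, with a 24 MHz HPET (ticks_per_second = 24,000,000),
watchdog_thresh = 10 and 8 monitored CPUs, ticks_per_cpu is
24,000,000 / 8 = 3,000,000 ticks, and the channel is programmed to fire
every watchdog_thresh * ticks_per_cpu = 30,000,000 ticks, i.e. every
10 / 8 = 1.25 seconds; each CPU is thus still checked every 10 seconds.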

Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Borislav Petkov 
Cc: Jacob Pan 
Cc: "Rafael J. Wysocki" 
Cc: Don Zickus 
Cc: Nicholas Piggin 
Cc: Michael Ellerman 
Cc: Frederic Weisbecker 
Cc: Alexei Starovoitov 
Cc: Babu Moger 
Cc: Mathieu Desnoyers 
Cc: Masami Hiramatsu 
Cc: Peter Zijlstra 
Cc: Andrew Morton 
Cc: Philippe Ombredanne 
Cc: Colin Ian King 
Cc: Byungchul Park 
Cc: "Paul E. McKenney" 
Cc: "Luis R. Rodriguez" 
Cc: Waiman Long 
Cc: Josh Poimboeuf 
Cc: Randy Dunlap 
Cc: Davidlohr Bueso 
Cc: Marc Zyngier 
Cc: Kai-Heng Feng 
Cc: Konrad Rzeszutek Wilk 
Cc: David Rientjes 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hpet.h |  1 +
 arch/x86/kernel/watchdog_hld_hpet.c | 46 +++--
 2 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 31fc27508cf3..64acacce095d 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -114,6 +114,7 @@ struct hpet_hld_data {
boolhas_periodic;
u32 num;
u64 ticks_per_second;
+   u64 ticks_per_cpu;
u32 handling_cpu;
u32 enabled_cpus;
struct msi_msg  msi_msg;
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index dff4dadabd4c..74aeb0535d08 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -45,6 +45,13 @@ static void kick_timer(struct hpet_hld_data *hdata, bool force)
 * are able to update the comparator before the counter reaches such new
 * value.
 *
+	 * Each CPU must be monitored every watchdog_thresh seconds. Since the
+	 * timer targets one CPU at a time, it must expire every
+	 *
+	 *	watchdog_thresh * ticks_per_cpu
+	 *
+	 * ticks, where ticks_per_cpu = ticks_per_second / enabled_cpus
+	 * as computed in update_ticks_per_cpu().
+*
 * Let it wrap around if needed.
 */
 
@@ -52,10 +59,10 @@ static void kick_timer(struct hpet_hld_data *hdata, bool force)
return;
 
if (hdata->has_periodic)
-   period = watchdog_thresh * hdata->ticks_per_second;
+   period = watchdog_thresh * hdata->ticks_per_cpu;
 
count = hpet_readl(HPET_COUNTER);
-   new_compare = count + watchdog_thresh * hdata->ticks_per_second;
+   new_compare = count + watchdog_thresh * hdata->ticks_per_cpu;
hpet_set_comparator(hdata->num, (u32)new_compare, (u32)period);
 }
 
@@ -234,6 +241,27 @@ static int setup_hpet_irq(struct hpet_hld_data *hdata)
return ret;
 }
 
+/**
+ * update_ticks_per_cpu() - Update the number of HPET ticks per CPU
+ * @hdata: struct with the timer's ticks-per-second and CPU mask
+ *
+ * From the overall ticks-per-second of the timer, compute the number of ticks
+ * after which the timer should expire to monitor each CPU every
+ * watchdog_thresh seconds. The ticks-per-cpu quantity is computed using the
+ * number of CPUs that the watchdog currently monitors.
+ */
+static void update_ticks_per_cpu(struct hpet_hld_data *hdata)
+{
+   u64 temp = hdata->ticks_per_second;
+
+   /* Only update if there are monitored CPUs. */
+   if (!hdata->enabled_cpus)
+   return;
+
+   do_div(temp, hdata->enabled_cpus);
+   hdata->ticks_per_cpu = temp;
+}
+
 /**
  * hardlockup_detector_hpet_enable() - Enable the hardlockup detector
  * @cpu:   CPU Index in which the watchdog will be enabled.
@@ -246,13 +274,23 @@ void hardlockup_detector_hpet_enable(unsigned int cpu)
 {
cpumask_set_cpu(cpu, to_cpumask(hld_data->cpu_monitored_mask));
 
-   if (!hld_data->enabled_cpus++) {
+   hld_data->enabled_cpus++;
+   update_ticks_per_cpu(hld_data);
+
+   if (hld_data->enabled_cpus == 1) {
hld_data->handling_cpu = cpu;
update_msi_destid(hld_data);
/* Force timer kick when detector is just enabled */
kick_timer(hld_data, true);
enable_timer(hld_data);
 

[RFC PATCH v4 08/21] watchdog/hardlockup: Decouple the hardlockup detector from perf

2019-05-23 Thread Ricardo Neri
The current default implementation of the hardlockup detector assumes that
it is implemented using perf events. However, the hardlockup detector can
be driven by other sources of non-maskable interrupts (e.g., a properly
configured timer).

Group and wrap in #ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF all the code
specific to perf: create and manage perf events, stop and start the perf-
based detector.

The generic portion of the detector (monitoring of timer thresholds,
timestamp checks and hardlockup detection, as well as the implementation of
arch_touch_nmi_watchdog()) is now selected with the new intermediate config
symbol CONFIG_HARDLOCKUP_DETECTOR_CORE.

The perf-based implementation of the detector selects the new intermediate
symbol. Other implementations should do the same.

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: "Rafael J. Wysocki" 
Cc: Don Zickus 
Cc: Nicholas Piggin 
Cc: Michael Ellerman 
Cc: Frederic Weisbecker 
Cc: Alexei Starovoitov 
Cc: Babu Moger 
Cc: "David S. Miller" 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Mathieu Desnoyers 
Cc: Masami Hiramatsu 
Cc: Peter Zijlstra 
Cc: Andrew Morton 
Cc: Philippe Ombredanne 
Cc: Colin Ian King 
Cc: Byungchul Park 
Cc: "Paul E. McKenney" 
Cc: "Luis R. Rodriguez" 
Cc: Waiman Long 
Cc: Josh Poimboeuf 
Cc: Randy Dunlap 
Cc: Davidlohr Bueso 
Cc: Marc Zyngier 
Cc: Kai-Heng Feng 
Cc: Konrad Rzeszutek Wilk 
Cc: David Rientjes 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Signed-off-by: Ricardo Neri 
---
 include/linux/nmi.h   |  5 -
 kernel/Makefile   |  2 +-
 kernel/watchdog_hld.c | 32 
 lib/Kconfig.debug |  4 
 4 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 5a8b19749769..e5f1a86e20b7 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -94,8 +94,11 @@ static inline void hardlockup_detector_disable(void) {}
 # define NMI_WATCHDOG_SYSCTL_PERM  0444
 #endif
 
-#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_CORE)
 extern void arch_touch_nmi_watchdog(void);
+#endif
+
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
 extern void hardlockup_detector_perf_stop(void);
 extern void hardlockup_detector_perf_restart(void);
 extern void hardlockup_detector_perf_disable(void);
diff --git a/kernel/Makefile b/kernel/Makefile
index 33824f0385b3..d07d52a03cc9 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -83,7 +83,7 @@ obj-$(CONFIG_FAIL_FUNCTION) += fail_function.o
 obj-$(CONFIG_KGDB) += debug/
 obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
 obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
-obj-$(CONFIG_HARDLOCKUP_DETECTOR_PERF) += watchdog_hld.o
+obj-$(CONFIG_HARDLOCKUP_DETECTOR_CORE) += watchdog_hld.o
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index b352e507b17f..bb6435978c46 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -22,12 +22,8 @@
 
 static DEFINE_PER_CPU(bool, hard_watchdog_warn);
 static DEFINE_PER_CPU(bool, watchdog_nmi_touch);
-static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
-static DEFINE_PER_CPU(struct perf_event *, dead_event);
-static struct cpumask dead_events_mask;
 
 static unsigned long hardlockup_allcpu_dumped;
-static atomic_t watchdog_cpus = ATOMIC_INIT(0);
 
 notrace void arch_touch_nmi_watchdog(void)
 {
@@ -98,14 +94,6 @@ static inline bool watchdog_check_timestamp(void)
 }
 #endif
 
-static struct perf_event_attr wd_hw_attr = {
-   .type   = PERF_TYPE_HARDWARE,
-   .config = PERF_COUNT_HW_CPU_CYCLES,
-   .size   = sizeof(struct perf_event_attr),
-   .pinned = 1,
-   .disabled   = 1,
-};
-
 void inspect_for_hardlockups(struct pt_regs *regs)
 {
if (__this_cpu_read(watchdog_nmi_touch) == true) {
@@ -157,6 +145,24 @@ void inspect_for_hardlockups(struct pt_regs *regs)
return;
 }
 
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF
+#undef pr_fmt
+#define pr_fmt(fmt) "NMI perf watchdog: " fmt
+
+static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
+static DEFINE_PER_CPU(struct perf_event *, dead_event);
+static struct cpumask dead_events_mask;
+
+static atomic_t watchdog_cpus = ATOMIC_INIT(0);
+
+static struct perf_event_attr wd_hw_attr = {
+   .type   = PERF_TYPE_HARDWARE,
+   .config = PERF_COUNT_HW_CPU_CYCLES,
+   .size   = sizeof(struct perf_event_attr),
+   .pinned = 1,
+   .disabled   = 1,
+};
+
 /* Callback function for perf event subsystem */
 static void watchdog_overflow_callback(struct perf_event *event,
   struct perf_sample_data *data,
@@ -298,3 +304,5

[RFC PATCH v4 01/21] x86/msi: Add definition for NMI delivery mode

2019-05-23 Thread Ricardo Neri
Until now, the delivery mode of MSI interrupts has been set to the default
mode configured in the APIC driver. However, hardware does not restrict
configuring each interrupt with a different delivery mode. Specifying the
delivery mode per interrupt is useful when one is interested in changing
the delivery mode of a particular interrupt. For instance, this can be used
to deliver an interrupt as non-maskable.
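
For illustration, an MSI data word for NMI delivery could be composed as
follows (a sketch only; the actual users of this define appear in later
patches of this series):

	/*
	 * Compose an MSI data word that delivers the interrupt as NMI.
	 * The vector is ignored for NMI delivery; set it for clarity.
	 */
	unsigned int msi_data = MSI_DATA_TRIGGER_EDGE |
				MSI_DATA_LEVEL_ASSERT |
				MSI_DATA_DELIVERY_NMI |
				MSI_DATA_VECTOR(0);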

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Joerg Roedel 
Cc: Juergen Gross 
Cc: Bjorn Helgaas 
Cc: Wincy Van 
Cc: Kate Stewart 
Cc: Philippe Ombredanne 
Cc: "Eric W. Biederman" 
Cc: Baoquan He 
Cc: Jan Kiszka 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/msidef.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/msidef.h b/arch/x86/include/asm/msidef.h
index ee2f8ccc32d0..38ccfdc2d96e 100644
--- a/arch/x86/include/asm/msidef.h
+++ b/arch/x86/include/asm/msidef.h
@@ -18,6 +18,7 @@
 #define MSI_DATA_DELIVERY_MODE_SHIFT   8
 #define  MSI_DATA_DELIVERY_FIXED   (0 << MSI_DATA_DELIVERY_MODE_SHIFT)
 #define  MSI_DATA_DELIVERY_LOWPRI  (1 << MSI_DATA_DELIVERY_MODE_SHIFT)
+#define  MSI_DATA_DELIVERY_NMI (4 << MSI_DATA_DELIVERY_MODE_SHIFT)
 
 #define MSI_DATA_LEVEL_SHIFT   14
 #define MSI_DATA_LEVEL_DEASSERT(0 << MSI_DATA_LEVEL_SHIFT)
-- 
2.17.1



[RFC PATCH v4 05/21] x86/hpet: Reserve timer for the HPET hardlockup detector

2019-05-23 Thread Ricardo Neri
HPET timer 2 will be used to drive the HPET-based hardlockup detector.
Reserve such timer to ensure it cannot be used by user space programs or
for clock events.

When looking for MSI-capable timers for clock events, skip timer 2 if
the HPET hardlockup detector is selected.

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hpet.h |  3 +++
 arch/x86/kernel/hpet.c  | 19 ---
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index e7098740f5ee..6f099e2781ce 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -61,6 +61,9 @@
  */
 #define HPET_MIN_PERIOD10UL
 
+/* Timer used for the hardlockup detector */
+#define HPET_WD_TIMER_NR 2
+
 /* hpet memory map physical address */
 extern unsigned long hpet_address;
 extern unsigned long force_hpet_address;
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 1723d55219e8..ff0250831786 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -173,7 +173,8 @@ do {	\
 
 /*
  * When the hpet driver (/dev/hpet) is enabled, we need to reserve
- * timer 0 and timer 1 in case of RTC emulation.
+ * timer 0 and timer 1 in case of RTC emulation. Timer 2 is reserved in case
+ * the HPET-based hardlockup detector is used.
  */
 #ifdef CONFIG_HPET
 
@@ -183,7 +184,7 @@ static void hpet_reserve_platform_timers(unsigned int id)
 {
struct hpet __iomem *hpet = hpet_virt_address;
struct hpet_timer __iomem *timer = &hpet->hpet_timers[2];
-   unsigned int nrtimers, i;
+   unsigned int nrtimers, i, start_timer;
struct hpet_data hd;
 
nrtimers = ((id & HPET_ID_NUMBER) >> HPET_ID_NUMBER_SHIFT) + 1;
@@ -198,6 +199,13 @@ static void hpet_reserve_platform_timers(unsigned int id)
hpet_reserve_timer(&hd, 1);
 #endif
 
+   if (IS_ENABLED(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET)) {
+   hpet_reserve_timer(&hd, HPET_WD_TIMER_NR);
+   start_timer = HPET_WD_TIMER_NR + 1;
+   } else {
+   start_timer = HPET_WD_TIMER_NR;
+   }
+
/*
 * NOTE that hd_irq[] reflects IOAPIC input pins (LEGACY_8254
 * is wrong for i8259!) not the output IRQ.  Many BIOS writers
@@ -206,7 +214,7 @@ static void hpet_reserve_platform_timers(unsigned int id)
hd.hd_irq[0] = HPET_LEGACY_8254;
hd.hd_irq[1] = HPET_LEGACY_RTC;
 
-   for (i = 2; i < nrtimers; timer++, i++) {
+   for (i = start_timer; i < nrtimers; timer++, i++) {
hd.hd_irq[i] = (readl(&timer->hpet_config) &
Tn_INT_ROUTE_CNF_MASK) >> Tn_INT_ROUTE_CNF_SHIFT;
}
@@ -651,6 +659,11 @@ static void hpet_msi_capability_lookup(unsigned int start_timer)
struct hpet_dev *hdev = &hpet_devs[num_timers_used];
unsigned int cfg = hpet_readl(HPET_Tn_CFG(i));
 
+   /* Do not use timer reserved for the HPET watchdog. */
+   if (IS_ENABLED(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET) &&
+   i == HPET_WD_TIMER_NR)
+   continue;
+
/* Only consider HPET timer with MSI support */
if (!(cfg & HPET_TN_FSB_CAP))
continue;
-- 
2.17.1



[RFC PATCH v4 07/21] watchdog/hardlockup: Define a generic function to detect hardlockups

2019-05-23 Thread Ricardo Neri
The procedure to detect hardlockups is independent of the underlying
mechanism that generates the non-maskable interrupt used to drive the
detector. Thus, it can be put in a separate, generic function. In this
manner, it can be invoked by various implementations of the NMI watchdog.

For this purpose, move the bulk of watchdog_overflow_callback() to the
new function inspect_for_hardlockups(). This function can then be called
from the applicable NMI handlers.
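
A hypothetical NMI handler driven by another interrupt source could then
reuse the generic check along these lines (a sketch, not part of this
patch; the handler name is illustrative):

static int my_source_nmi_handler(unsigned int type, struct pt_regs *regs)
{
	/* Run the generic hardlockup checks from this NMI. */
	inspect_for_hardlockups(regs);
	return NMI_HANDLED;
}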

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Don Zickus 
Cc: Nicholas Piggin 
Cc: Michael Ellerman 
Cc: Frederic Weisbecker 
Cc: Babu Moger 
Cc: "David S. Miller" 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Mathieu Desnoyers 
Cc: Masami Hiramatsu 
Cc: Peter Zijlstra 
Cc: Andrew Morton 
Cc: Philippe Ombredanne 
Cc: Colin Ian King 
Cc: "Luis R. Rodriguez" 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Signed-off-by: Ricardo Neri 
---
 include/linux/nmi.h   |  1 +
 kernel/watchdog_hld.c | 18 +++---
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 9003e29cde46..5a8b19749769 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -212,6 +212,7 @@ extern int proc_watchdog_thresh(struct ctl_table *, int ,
void __user *, size_t *, loff_t *);
 extern int proc_watchdog_cpumask(struct ctl_table *, int,
 void __user *, size_t *, loff_t *);
+void inspect_for_hardlockups(struct pt_regs *regs);
 
 #ifdef CONFIG_HAVE_ACPI_APEI_NMI
 #include 
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index 247bf0b1582c..b352e507b17f 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -106,14 +106,8 @@ static struct perf_event_attr wd_hw_attr = {
.disabled   = 1,
 };
 
-/* Callback function for perf event subsystem */
-static void watchdog_overflow_callback(struct perf_event *event,
-  struct perf_sample_data *data,
-  struct pt_regs *regs)
+void inspect_for_hardlockups(struct pt_regs *regs)
 {
-   /* Ensure the watchdog never gets throttled */
-   event->hw.interrupts = 0;
-
if (__this_cpu_read(watchdog_nmi_touch) == true) {
__this_cpu_write(watchdog_nmi_touch, false);
return;
@@ -163,6 +157,16 @@ static void watchdog_overflow_callback(struct perf_event *event,
return;
 }
 
+/* Callback function for perf event subsystem */
+static void watchdog_overflow_callback(struct perf_event *event,
+  struct perf_sample_data *data,
+  struct pt_regs *regs)
+{
+   /* Ensure the watchdog never gets throttled */
+   event->hw.interrupts = 0;
+   inspect_for_hardlockups(regs);
+}
+
 static int hardlockup_detector_event_create(void)
 {
unsigned int cpu = smp_processor_id();
-- 
2.17.1



[RFC PATCH v4 04/21] x86/hpet: Add hpet_set_comparator() for periodic and one-shot modes

2019-05-23 Thread Ricardo Neri
Instead of setting the timer period directly in hpet_set_periodic(), add a
new helper function hpet_set_comparator() that only sets the accumulator
and comparator. hpet_set_periodic() will only prepare the timer for
periodic mode and leave the expiration programming to
hpet_set_comparator().

This new function can also be used by other components (e.g., the HPET-
based hardlockup detector) which also need to configure HPET timers. Thus,
add its declaration into the hpet header file.
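
For instance, programming timer 2 to expire 10ms from now could look like
this (a sketch with illustrative values; ticks_per_second would come from
hpet_get_ticks_per_sec()):

	u32 delta = ticks_per_second / 100;	/* 10ms worth of ticks */
	u32 now = hpet_readl(HPET_COUNTER);

	/* Periodic mode: program the first expiration and the period. */
	hpet_set_comparator(2, now + delta, delta);

	/* One-shot mode: a period of zero only updates the comparator. */
	hpet_set_comparator(2, now + delta, 0);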

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Originally-by: Suravee Suthikulpanit 
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hpet.h |  1 +
 arch/x86/kernel/hpet.c  | 57 +
 2 files changed, 46 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index f132fbf984d4..e7098740f5ee 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -102,6 +102,7 @@ extern int hpet_rtc_timer_init(void);
 extern irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id);
 extern int hpet_register_irq_handler(rtc_irq_handler handler);
 extern void hpet_unregister_irq_handler(rtc_irq_handler handler);
+extern void hpet_set_comparator(int num, unsigned int cmp, unsigned int period);
 
 #endif /* CONFIG_HPET_EMULATE_RTC */
 
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 5e86e024c489..1723d55219e8 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -290,6 +290,47 @@ static void hpet_legacy_clockevent_register(void)
printk(KERN_DEBUG "hpet clockevent registered\n");
 }
 
+/**
+ * hpet_set_comparator() - Helper function for setting comparator register
+ * @num:   The timer ID
+ * @cmp:   The value to be written to the comparator/accumulator
+ * @period:The value to be written to the period (0 = oneshot mode)
+ *
+ * Helper function for updating comparator, accumulator and period values.
+ *
+ * In periodic mode, HPET needs HPET_TN_SETVAL to be set before writing
+ * to the Tn_CMP to update the accumulator. Then, HPET needs a second
+ * write (with HPET_TN_SETVAL cleared) to Tn_CMP to set the period.
+ * The HPET_TN_SETVAL bit is automatically cleared after the first write.
+ *
+ * For one-shot mode, HPET_TN_SETVAL does not need to be set.
+ *
+ * See the following documents:
+ *   - Intel IA-PC HPET (High Precision Event Timers) Specification
+ *   - AMD-8111 HyperTransport I/O Hub Data Sheet, Publication # 24674
+ */
+void hpet_set_comparator(int num, unsigned int cmp, unsigned int period)
+{
+   if (period) {
+   unsigned int v = hpet_readl(HPET_Tn_CFG(num));
+
+   hpet_writel(v | HPET_TN_SETVAL, HPET_Tn_CFG(num));
+   }
+
+   hpet_writel(cmp, HPET_Tn_CMP(num));
+
+   if (!period)
+   return;
+
+   /*
+* This delay is seldom used: never in one-shot mode and in periodic
+* only when reprogramming the timer.
+*/
+   udelay(1);
+   hpet_writel(period, HPET_Tn_CMP(num));
+}
+EXPORT_SYMBOL_GPL(hpet_set_comparator);
+
 static int hpet_set_periodic(struct clock_event_device *evt, int timer)
 {
unsigned int cfg, cmp, now;
@@ -301,19 +342,11 @@ static int hpet_set_periodic(struct clock_event_device *evt, int timer)
now = hpet_readl(HPET_COUNTER);
cmp = now + (unsigned int)delta;
cfg = hpet_readl(HPET_Tn_CFG(timer));
-   cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_SETVAL |
-  HPET_TN_32BIT;
+   cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_32BIT;
hpet_writel(cfg, HPET_Tn_CFG(timer));
-   hpet_writel(cmp, HPET_Tn_CMP(timer));
-   udelay(1);
-   /*
-* HPET on AMD 81xx needs a second write (with HPET_TN_SETVAL
-* cleared) to T0_CMP to set the period. The HPET_TN_SETVAL
-* bit is automatically cleared after the first write.
-* (See AMD-8111 HyperTransport I/O Hub Data Sheet,
-* Publication # 24674)
-*/
-   hpet_writel((unsigned int)delta, HPET_Tn_CMP(timer));
+
+   hpet_set_comparator(timer, cmp, (unsigned int)delta);
+
hpet_start_counter();
hpet_print_config();
 
-- 
2.17.1



[RFC PATCH v4 03/21] x86/hpet: Calculate ticks-per-second in a separate function

2019-05-23 Thread Ricardo Neri
It is easier to compute the expiration times of an HPET timer by using
its frequency (i.e., the number of times it ticks in a second) than its
period, as given in the capabilities register.

In addition to the HPET char driver, the HPET-based hardlockup detector
will also need to know the timer's frequency. Thus, create a common
function that both can use.
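
As a worked example with a common hardware value: a period of 69841279 fs
yields 10^15 / 69841279 ~= 14318180 ticks per second, i.e., the familiar
14.318 MHz HPET frequency.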

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 drivers/char/hpet.c  | 31 ---
 include/linux/hpet.h |  1 +
 2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/drivers/char/hpet.c b/drivers/char/hpet.c
index 3a1e6b3ccd10..747255f552a9 100644
--- a/drivers/char/hpet.c
+++ b/drivers/char/hpet.c
@@ -836,6 +836,29 @@ static unsigned long hpet_calibrate(struct hpets *hpetp)
return ret;
 }
 
+u64 hpet_get_ticks_per_sec(u64 hpet_caps)
+{
+   u64 ticks_per_sec, period;
+
+   period = (hpet_caps & HPET_COUNTER_CLK_PERIOD_MASK) >>
+HPET_COUNTER_CLK_PERIOD_SHIFT; /* fs, 10^-15 */
+
+   /*
+* The frequency is the reciprocal of the period. The period is given
+* in femtoseconds. Thus, prepare a dividend of 10^15 femtoseconds per
+* second to obtain the frequency in ticks per second.
+*/
+
+   /* 10^15 femtoseconds per second */
+   ticks_per_sec = 1000000000000000ULL;
+   ticks_per_sec += period >> 1; /* round */
+
+   /* The quotient is put in the dividend. We drop the remainder. */
+   do_div(ticks_per_sec, period);
+
+   return ticks_per_sec;
+}
+
 int hpet_alloc(struct hpet_data *hdp)
 {
u64 cap, mcfg;
@@ -844,7 +867,6 @@ int hpet_alloc(struct hpet_data *hdp)
struct hpets *hpetp;
struct hpet __iomem *hpet;
static struct hpets *last;
-   unsigned long period;
unsigned long long temp;
u32 remainder;
 
@@ -894,12 +916,7 @@ int hpet_alloc(struct hpet_data *hdp)
 
last = hpetp;
 
-   period = (cap & HPET_COUNTER_CLK_PERIOD_MASK) >>
-   HPET_COUNTER_CLK_PERIOD_SHIFT; /* fs, 10^-15 */
-   temp = 1000000000000000uLL; /* 10^15 femtoseconds per second */
-   temp += period >> 1; /* round */
-   do_div(temp, period);
-   hpetp->hp_tick_freq = temp; /* ticks per second */
+   hpetp->hp_tick_freq = hpet_get_ticks_per_sec(cap);
 
printk(KERN_INFO "hpet%d: at MMIO 0x%lx, IRQ%s",
hpetp->hp_which, hdp->hd_phys_address,
diff --git a/include/linux/hpet.h b/include/linux/hpet.h
index 8604564b985d..e7b36bcf4699 100644
--- a/include/linux/hpet.h
+++ b/include/linux/hpet.h
@@ -107,5 +107,6 @@ static inline void hpet_reserve_timer(struct hpet_data *hd, int timer)
 }
 
 int hpet_alloc(struct hpet_data *);
+u64 hpet_get_ticks_per_sec(u64 hpet_caps);
 
 #endif /* !__HPET__ */
-- 
2.17.1



[RFC PATCH v4 18/21] x86/apic: Add a parameter for the APIC delivery mode

2019-05-23 Thread Ricardo Neri
Until now, the delivery mode of APIC interrupts has been set to the default
mode configured in the APIC driver. However, hardware does not restrict
configuring each interrupt with a different delivery mode. Specifying the
delivery mode per interrupt is useful when one is interested in changing
the delivery mode of a particular interrupt. For instance, this can be used
to deliver an interrupt as non-maskable.

Add a new member, delivery_mode, to struct irq_cfg. This new member can
be used to update the configuration of the delivery mode in each interrupt
domain. Likewise, add equivalent macros to populate MSI messages.

Currently, all interrupt domains set the delivery mode of interrupts using
the APIC setting. Interrupt domains use an irq_cfg data structure to
configure their own data structures and hardware resources. Thus, in order
to keep the current behavior, set the delivery mode of the irq
configuration to match the APIC setting. In this manner, irq domains can
obtain the delivery mode from the irq configuration data instead of the
APIC setting, if needed.
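
A child irq domain could then request NMI delivery for a given interrupt
roughly as follows (a sketch, not part of this patch; dest_NMI comes from
enum ioapic_irq_destination_types):

	struct irq_cfg *cfg = irqd_cfg(irq_data);
	struct msi_msg msg = { };

	/* Override the APIC default inherited at allocation time. */
	cfg->delivery_mode = dest_NMI;
	msg.data |= MSI_DATA_DELIVERY_MODE(cfg->delivery_mode);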

Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Borislav Petkov 
Cc: Jacob Pan 
Cc: Joerg Roedel 
Cc: Juergen Gross 
Cc: Bjorn Helgaas 
Cc: Wincy Van 
Cc: Kate Stewart 
Cc: Philippe Ombredanne 
Cc: "Eric W. Biederman" 
Cc: Baoquan He 
Cc: Jan Kiszka 
Cc: Lu Baolu 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hw_irq.h |  5 +++--
 arch/x86/include/asm/msidef.h |  3 +++
 arch/x86/kernel/apic/vector.c | 10 ++
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 32e666e1231e..c024e5976b78 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -117,8 +117,9 @@ struct irq_alloc_info {
 };
 
 struct irq_cfg {
-   unsigned intdest_apicid;
-   unsigned intvector;
+   unsigned intdest_apicid;
+   unsigned intvector;
+   enum ioapic_irq_destination_types   delivery_mode;
 };
 
 extern struct irq_cfg *irq_cfg(unsigned int irq);
diff --git a/arch/x86/include/asm/msidef.h b/arch/x86/include/asm/msidef.h
index 38ccfdc2d96e..6d666c90f057 100644
--- a/arch/x86/include/asm/msidef.h
+++ b/arch/x86/include/asm/msidef.h
@@ -16,6 +16,9 @@
 MSI_DATA_VECTOR_MASK)
 
 #define MSI_DATA_DELIVERY_MODE_SHIFT   8
+#define MSI_DATA_DELIVERY_MODE_MASK0x0700
+#define MSI_DATA_DELIVERY_MODE(dm)	(((dm) << MSI_DATA_DELIVERY_MODE_SHIFT) & \
+MSI_DATA_DELIVERY_MODE_MASK)
 #define  MSI_DATA_DELIVERY_FIXED   (0 << MSI_DATA_DELIVERY_MODE_SHIFT)
 #define  MSI_DATA_DELIVERY_LOWPRI  (1 << MSI_DATA_DELIVERY_MODE_SHIFT)
 #define  MSI_DATA_DELIVERY_NMI (4 << MSI_DATA_DELIVERY_MODE_SHIFT)
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 3173e07d3791..99436fe7e932 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -548,6 +548,16 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
irqd->chip_data = apicd;
irqd->hwirq = virq + i;
irqd_set_single_target(irqd);
+
+   /*
+* Initialize the delivery mode of this irq to match the
+* default delivery mode of the APIC. This is useful for
+* children irq domains which want to take the delivery
+* mode from the individual irq configuration rather
+* than from the APIC.
+*/
+		apicd->hw_irq_cfg.delivery_mode = apic->irq_delivery_mode;
+
/*
 * Legacy vectors are already assigned when the IOAPIC
 * takes them over. They stay on the same vector. This is
-- 
2.17.1



[RFC PATCH v4 02/21] x86/hpet: Expose hpet_writel() in header

2019-05-23 Thread Ricardo Neri
In order to allow hpet_writel() to be used by other components (e.g.,
the HPET-based hardlockup detector) expose it in the HPET header file.

No empty definition is needed if CONFIG_HPET is not selected, as all
existing callers select that config symbol.

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hpet.h | 1 +
 arch/x86/kernel/hpet.c  | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 67385d56d4f4..f132fbf984d4 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -72,6 +72,7 @@ extern int is_hpet_enabled(void);
 extern int hpet_enable(void);
 extern void hpet_disable(void);
 extern unsigned int hpet_readl(unsigned int a);
+extern void hpet_writel(unsigned int d, unsigned int a);
 extern void force_hpet_resume(void);
 
 struct irq_data;
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index a0573f2e7763..5e86e024c489 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -62,7 +62,7 @@ inline unsigned int hpet_readl(unsigned int a)
return readl(hpet_virt_address + a);
 }
 
-static inline void hpet_writel(unsigned int d, unsigned int a)
+inline void hpet_writel(unsigned int d, unsigned int a)
 {
writel(d, hpet_virt_address + a);
 }
-- 
2.17.1



[RFC PATCH v4 06/21] x86/hpet: Configure the timer used by the hardlockup detector

2019-05-23 Thread Ricardo Neri
Implement the initial configuration of the timer to be used by the
hardlockup detector. Return a data structure with a description of the
timer; this information is subsequently used by the hardlockup detector.

Only provide the timer if it supports Front Side Bus interrupt delivery.
This condition greatly simplifies the implementation of the detector.
Specifically, it helps to avoid the complexities of routing the interrupt
via the IO-APIC (e.g., potential race conditions that arise from re-
programming the IO-APIC in NMI context).
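
A consumer of the new interface would then look roughly like this (a
sketch; the detector patches later in the series are the real user):

	struct hpet_hld_data *hdata = hpet_hardlockup_detector_assign_timer();

	if (!hdata) {
		/* No FSB-capable HPET timer; use another NMI source. */
		return -ENODEV;
	}
	pr_info("Using HPET timer %u at %llu ticks/s\n",
		hdata->num, hdata->ticks_per_second);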

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hpet.h | 13 +
 arch/x86/kernel/hpet.c  | 35 +++
 2 files changed, 48 insertions(+)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 6f099e2781ce..20abdaa5372d 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -109,6 +109,19 @@ extern void hpet_set_comparator(int num, unsigned int cmp, unsigned int period);
 
 #endif /* CONFIG_HPET_EMULATE_RTC */
 
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+struct hpet_hld_data {
+   boolhas_periodic;
+   u32 num;
+   u64 ticks_per_second;
+};
+
+extern struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void);
+#else
+static inline struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
+{ return NULL; }
+#endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
+
 #else /* CONFIG_HPET_TIMER */
 
 static inline int hpet_enable(void) { return 0; }
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index ff0250831786..5f9209949fc7 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -171,6 +171,41 @@ do {	\
_hpet_print_config(__func__, __LINE__); \
 } while (0)
 
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
+{
+   struct hpet_hld_data *hdata;
+   u64 temp;
+   u32 cfg;
+
+   cfg = hpet_readl(HPET_Tn_CFG(HPET_WD_TIMER_NR));
+
+   if (!(cfg & HPET_TN_FSB_CAP))
+   return NULL;
+
+   hdata = kzalloc(sizeof(*hdata), GFP_KERNEL);
+   if (!hdata)
+   return NULL;
+
+   if (cfg & HPET_TN_PERIODIC_CAP)
+   hdata->has_periodic = true;
+
+   hdata->num = HPET_WD_TIMER_NR;
+
+   cfg = hpet_readl(HPET_PERIOD);
+
+   /*
+* hpet_get_ticks_per_sec() expects the contents of the general
+* capabilities register. The period is in the 32 most significant
+* bits.
+*/
+   temp = (u64)cfg << HPET_COUNTER_CLK_PERIOD_SHIFT;
+   hdata->ticks_per_second = hpet_get_ticks_per_sec(temp);
+
+   return hdata;
+}
+#endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
+
 /*
  * When the hpet driver (/dev/hpet) is enabled, we need to reserve
  * timer 0 and timer 1 in case of RTC emulation. Timer 2 is reserved in case
-- 
2.17.1



[RFC PATCH v4 09/21] x86/nmi: Add a NMI_WATCHDOG NMI handler category

2019-05-23 Thread Ricardo Neri
Add NMI_WATCHDOG as a new category of NMI handler. This new category
is to be used with the HPET-based hardlockup detector. This detector
does not have a direct way of checking if the HPET timer is the source of
the NMI. Instead, it indirectly estimates it using the time-stamp counter.

Therefore, we may have false positives in case another NMI occurs within
the estimated time window. For this reason, we want the handler of the
detector to be called after all the NMI_LOCAL handlers. A simple way
of achieving this is with a new NMI handler category.
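
The detector (added later in the series) would then register its handler
in the new category along these lines (a sketch; the handler name is
illustrative):

	register_nmi_handler(NMI_WATCHDOG, hpet_hld_nmi_handler, 0,
			     "hpet_hld");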

Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/nmi.h |  1 +
 arch/x86/kernel/nmi.c  | 10 ++
 2 files changed, 11 insertions(+)

diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
index 75ded1d13d98..75aa98313cde 100644
--- a/arch/x86/include/asm/nmi.h
+++ b/arch/x86/include/asm/nmi.h
@@ -29,6 +29,7 @@ enum {
NMI_UNKNOWN,
NMI_SERR,
NMI_IO_CHECK,
+   NMI_WATCHDOG,
NMI_MAX
 };
 
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 4df7705022b9..43e96aedc6fe 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -64,6 +64,10 @@ static struct nmi_desc nmi_desc[NMI_MAX] =
.lock = __RAW_SPIN_LOCK_UNLOCKED(&nmi_desc[3].lock),
.head = LIST_HEAD_INIT(nmi_desc[3].head),
},
+   {
+   .lock = __RAW_SPIN_LOCK_UNLOCKED(&nmi_desc[4].lock),
+   .head = LIST_HEAD_INIT(nmi_desc[4].head),
+   },
 
 };
 
@@ -174,6 +178,8 @@ int __register_nmi_handler(unsigned int type, struct nmiaction *action)
 */
WARN_ON_ONCE(type == NMI_SERR && !list_empty(&desc->head));
WARN_ON_ONCE(type == NMI_IO_CHECK && !list_empty(&desc->head));
+   WARN_ON_ONCE(type == NMI_WATCHDOG && !list_empty(&desc->head));
+
 
/*
 * some handlers need to be executed first otherwise a fake
@@ -384,6 +390,10 @@ static void default_do_nmi(struct pt_regs *regs)
}
raw_spin_unlock(&nmi_reason_lock);
 
+   handled = nmi_handle(NMI_WATCHDOG, regs);
+   if (handled == NMI_HANDLED)
+   return;
+
/*
 * Only one NMI can be latched at a time.  To handle
 * this we may process multiple nmi handlers at once to
-- 
2.17.1



Re: [RFC PATCH v3 11/21] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector

2019-05-15 Thread Ricardo Neri
On Tue, May 14, 2019 at 07:26:58AM -0700, Randy Dunlap wrote:
> On 5/14/19 7:02 AM, Ricardo Neri wrote:
> > diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
> > index 15d0fbe27872..376a5db81aec 100644
> > --- a/arch/x86/Kconfig.debug
> > +++ b/arch/x86/Kconfig.debug
> > @@ -169,6 +169,17 @@ config IOMMU_LEAK
> >  config HAVE_MMIOTRACE_SUPPORT
> > def_bool y
> >  
> > +config X86_HARDLOCKUP_DETECTOR_HPET
> > +   bool "Use HPET Timer for Hard Lockup Detection"
> > +   select SOFTLOCKUP_DETECTOR
> > +   select HARDLOCKUP_DETECTOR
> > +   select HARDLOCKUP_DETECTOR_CORE
> > +   depends on HPET_TIMER && HPET && X86_64
> > +   help
> > + Say y to enable a hardlockup detector that is driven by an High-
> 
>  by a
> 
I'll correct.

Thanks and BR,
Ricardo


Re: [RFC PATCH v3 04/21] x86/hpet: Add hpet_set_comparator() for periodic and one-shot modes

2019-05-15 Thread Ricardo Neri
On Tue, May 14, 2019 at 07:24:38AM -0700, Randy Dunlap wrote:
> On 5/14/19 7:01 AM, Ricardo Neri wrote:
> > Instead of setting the timer period directly in hpet_set_periodic(), add a
> > new helper function hpet_set_comparator() that only sets the accumulator
> > and comparator. hpet_set_periodic() will only prepare the timer for
> > periodic mode and leave the expiration programming to
> > hpet_set_comparator().
> > 
> > This new function can also be used by other components (e.g., the HPET-
> > based hardlockup detector) which also need to configure HPET timers. Thus,
> > add its declaration into the hpet header file.
> > 
> > Cc: "H. Peter Anvin" 
> > Cc: Ashok Raj 
> > Cc: Andi Kleen 
> > Cc: Tony Luck 
> > Cc: Philippe Ombredanne 
> > Cc: Kate Stewart 
> > Cc: "Rafael J. Wysocki" 
> > Cc: Stephane Eranian 
> > Cc: Suravee Suthikulpanit 
> > Cc: "Ravi V. Shankar" 
> > Cc: x...@kernel.org
> > Originally-by: Suravee Suthikulpanit 
> > Signed-off-by: Ricardo Neri 
> > ---
> >  arch/x86/include/asm/hpet.h |  1 +
> >  arch/x86/kernel/hpet.c  | 57 -
> >  2 files changed, 45 insertions(+), 13 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
> > index f132fbf984d4..e7098740f5ee 100644
> > --- a/arch/x86/include/asm/hpet.h
> > +++ b/arch/x86/include/asm/hpet.h
> > @@ -102,6 +102,7 @@ extern int hpet_rtc_timer_init(void);
> >  extern irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id);
> >  extern int hpet_register_irq_handler(rtc_irq_handler handler);
> >  extern void hpet_unregister_irq_handler(rtc_irq_handler handler);
> > +extern void hpet_set_comparator(int num, unsigned int cmp, unsigned int period);
> >  
> >  #endif /* CONFIG_HPET_EMULATE_RTC */
> >  
> > diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
> > index 560fc28e1d13..c5c5fc150193 100644
> > --- a/arch/x86/kernel/hpet.c
> > +++ b/arch/x86/kernel/hpet.c
> > @@ -289,6 +289,46 @@ static void hpet_legacy_clockevent_register(void)
> > printk(KERN_DEBUG "hpet clockevent registered\n");
> >  }
> >  
> > +/**
> > + * hpet_set_comparator() - Helper function for setting comparator register
> > + * @num:   The timer ID
> > + * @cmp:   The value to be written to the comparator/accumulator
> > + * @period:The value to be written to the period (0 = oneshot mode)
> > + *
> > + * Helper function for updating comparator, accumulator and period values.
> > + *
> > + * In periodic mode, HPET needs HPET_TN_SETVAL to be set before writing
> > + * to the Tn_CMP to update the accumulator. Then, HPET needs a second
> > + * write (with HPET_TN_SETVAL cleared) to Tn_CMP to set the period.
> > + * The HPET_TN_SETVAL bit is automatically cleared after the first write.
> > + *
> > + * For one-shot mode, HPET_TN_SETVAL does not need to be set.
> > + *
> > + * See the following documents:
> > + *   - Intel IA-PC HPET (High Precision Event Timers) Specification
> > + *   - AMD-8111 HyperTransport I/O Hub Data Sheet, Publication # 24674
> > + */
> > +void hpet_set_comparator(int num, unsigned int cmp, unsigned int period)
> > +{
> > +   if (period) {
> > +   unsigned int v = hpet_readl(HPET_Tn_CFG(num));
> > +
> > +   hpet_writel(v | HPET_TN_SETVAL, HPET_Tn_CFG(num));
> > +   }
> > +
> > +   hpet_writel(cmp, HPET_Tn_CMP(num));
> > +
> > +   if (!period)
> > +   return;
> > +
> > +   /* This delay is seldom used: never in one-shot mode and in periodic
> > +* only when reprogramming the timer.
> > +*/
> 
> comment style warning ;)
>

Uh! I'll correct this. Strangely, I reran checkpatch and it didn't catch
it.

Thanks and BR,
Ricardo


Re: [RFC PATCH v3 03/21] x86/hpet: Calculate ticks-per-second in a separate function

2019-05-15 Thread Ricardo Neri
On Tue, May 14, 2019 at 07:23:47AM -0700, Randy Dunlap wrote:
> On 5/14/19 7:01 AM, Ricardo Neri wrote:
> > It is easier to compute the expiration times of an HPET timer by using
> > its frequency (i.e., the number of times it ticks in a second) than its
> > period, as given in the capabilities register.
> > 
> > In addition to the HPET char driver, the HPET-based hardlockup detector
> > will also need to know the timer's frequency. Thus, create a common
> > function that both can use.
> > 
> > Cc: "H. Peter Anvin" 
> > Cc: Ashok Raj 
> > Cc: Andi Kleen 
> > Cc: Tony Luck 
> > Cc: Clemens Ladisch 
> > Cc: Arnd Bergmann 
> > Cc: Philippe Ombredanne 
> > Cc: Kate Stewart 
> > Cc: "Rafael J. Wysocki" 
> > Cc: Stephane Eranian 
> > Cc: Suravee Suthikulpanit 
> > Cc: "Ravi V. Shankar" 
> > Cc: x...@kernel.org
> > Signed-off-by: Ricardo Neri 
> > ---
> >  drivers/char/hpet.c  | 31 ---
> >  include/linux/hpet.h |  1 +
> >  2 files changed, 25 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/char/hpet.c b/drivers/char/hpet.c
> > index d0ad85900b79..bdcbecfdb858 100644
> > --- a/drivers/char/hpet.c
> > +++ b/drivers/char/hpet.c
> > @@ -836,6 +836,29 @@ static unsigned long hpet_calibrate(struct hpets *hpetp)
> > return ret;
> >  }
> >  
> > +u64 hpet_get_ticks_per_sec(u64 hpet_caps)
> > +{
> > +   u64 ticks_per_sec, period;
> > +
> > +   period = (hpet_caps & HPET_COUNTER_CLK_PERIOD_MASK) >>
> > +HPET_COUNTER_CLK_PERIOD_SHIFT; /* fs, 10^-15 */
> > +
> > +   /*
> > +* The frequency is the reciprocal of the period. The period is given
> > +* femtoseconds per second. Thus, prepare a dividend to obtain the
> 
>* in femtoseconds per second.
> 

Thanks for your review Randy! I'll fix this grammar issue.
> > +* frequency in ticks per second.
> > +*/
> > +
> > +   /* 10^15 femtoseconds per second */
> > +   ticks_per_sec = 1000000000000000uLL;
> 
>   ULL is overwhelmingly used in the kernel.
> 

Sure, I'll update it.

BR,
Ricardo


[RFC PATCH v3 10/21] watchdog/hardlockup: Add function to enable NMI watchdog on all allowed CPUs at once

2019-05-14 Thread Ricardo Neri
When there is more than one implementation of the NMI watchdog, there may
be situations in which switching from one to another is needed (e.g., if
the time-stamp counter becomes unstable, the HPET-based NMI watchdog can
no longer be used).

The perf-based implementation of the hardlockup detector makes use of
various per-CPU variables which are accessed via this_cpu operations.
Hence, each CPU needs to enable its own NMI watchdog if using the perf
implementation.

Add functionality to switch from one NMI watchdog to another and do it
from each allowed CPU.
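
For instance, when the TSC becomes unstable, the switch could be done
roughly as follows (a sketch; hardlockup_detector_hpet_stop() is a
hypothetical HPET-side hook):

	/* Stop the HPET-based detector... */
	hardlockup_detector_hpet_stop();
	/* ...and let each allowed CPU enable the perf-based detector. */
	hardlockup_start_all();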

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: "Rafael J. Wysocki" 
Cc: Don Zickus 
Cc: Nicholas Piggin 
Cc: Michael Ellerman 
Cc: Frederic Weisbecker 
Cc: Alexei Starovoitov 
Cc: Babu Moger 
Cc: "David S. Miller" 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Mathieu Desnoyers 
Cc: Masami Hiramatsu 
Cc: Peter Zijlstra 
Cc: Andrew Morton 
Cc: Philippe Ombredanne 
Cc: Colin Ian King 
Cc: Byungchul Park 
Cc: "Paul E. McKenney" 
Cc: "Luis R. Rodriguez" 
Cc: Waiman Long 
Cc: Josh Poimboeuf 
Cc: Randy Dunlap 
Cc: Davidlohr Bueso 
Cc: Marc Zyngier 
Cc: Kai-Heng Feng 
Cc: Konrad Rzeszutek Wilk 
Cc: David Rientjes 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Signed-off-by: Ricardo Neri 
---
 include/linux/nmi.h |  2 ++
 kernel/watchdog.c   | 15 +++
 2 files changed, 17 insertions(+)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index e5f1a86e20b7..6d828334348b 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -83,9 +83,11 @@ static inline void reset_hung_task_detector(void) { }
 
 #if defined(CONFIG_HARDLOCKUP_DETECTOR)
 extern void hardlockup_detector_disable(void);
+extern void hardlockup_start_all(void);
 extern unsigned int hardlockup_panic;
 #else
 static inline void hardlockup_detector_disable(void) {}
+static inline void hardlockup_start_all(void) {}
 #endif
 
 #if defined(CONFIG_HAVE_NMI_WATCHDOG) || defined(CONFIG_HARDLOCKUP_DETECTOR)
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 7f9e7b9306fe..be589001200a 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -566,6 +566,21 @@ int lockup_detector_offline_cpu(unsigned int cpu)
return 0;
 }
 
+static int hardlockup_start_fn(void *data)
+{
+   watchdog_nmi_enable(smp_processor_id());
+   return 0;
+}
+
+void hardlockup_start_all(void)
+{
+   int cpu;
+
+   cpumask_copy(&watchdog_allowed_mask, &watchdog_cpumask);
+   for_each_cpu(cpu, &watchdog_allowed_mask)
+   smp_call_on_cpu(cpu, hardlockup_start_fn, NULL, false);
+}
+
 static void lockup_detector_reconfigure(void)
 {
cpus_read_lock();
-- 
2.17.1



[RFC PATCH v3 08/21] watchdog/hardlockup: Decouple the hardlockup detector from perf

2019-05-14 Thread Ricardo Neri
The current default implementation of the hardlockup detector assumes that
it is implemented using perf events. However, the hardlockup detector can
be driven by other sources of non-maskable interrupts (e.g., a properly
configured timer).

Group and wrap in #ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF all the code
specific to perf: create and manage perf events, stop and start the perf-
based detector.

The generic portion of the detector (monitoring of timer thresholds,
timestamp checks and hardlockup detection, as well as the implementation of
arch_touch_nmi_watchdog()) is now selected with the new intermediate config
symbol CONFIG_HARDLOCKUP_DETECTOR_CORE.

The perf-based implementation of the detector selects the new intermediate
symbol. Other implementations should do the same.

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: "Rafael J. Wysocki" 
Cc: Don Zickus 
Cc: Nicholas Piggin 
Cc: Michael Ellerman 
Cc: Frederic Weisbecker 
Cc: Alexei Starovoitov 
Cc: Babu Moger 
Cc: "David S. Miller" 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Mathieu Desnoyers 
Cc: Masami Hiramatsu 
Cc: Peter Zijlstra 
Cc: Andrew Morton 
Cc: Philippe Ombredanne 
Cc: Colin Ian King 
Cc: Byungchul Park 
Cc: "Paul E. McKenney" 
Cc: "Luis R. Rodriguez" 
Cc: Waiman Long 
Cc: Josh Poimboeuf 
Cc: Randy Dunlap 
Cc: Davidlohr Bueso 
Cc: Marc Zyngier 
Cc: Kai-Heng Feng 
Cc: Konrad Rzeszutek Wilk 
Cc: David Rientjes 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Signed-off-by: Ricardo Neri 
---
 include/linux/nmi.h   |  5 -
 kernel/Makefile   |  2 +-
 kernel/watchdog_hld.c | 32 
 lib/Kconfig.debug |  4 
 4 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 5a8b19749769..e5f1a86e20b7 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -94,8 +94,11 @@ static inline void hardlockup_detector_disable(void) {}
 # define NMI_WATCHDOG_SYSCTL_PERM  0444
 #endif
 
-#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_CORE)
 extern void arch_touch_nmi_watchdog(void);
+#endif
+
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
 extern void hardlockup_detector_perf_stop(void);
 extern void hardlockup_detector_perf_restart(void);
 extern void hardlockup_detector_perf_disable(void);
diff --git a/kernel/Makefile b/kernel/Makefile
index 62471e75a2b0..e9bdbaa1ed50 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -82,7 +82,7 @@ obj-$(CONFIG_FAIL_FUNCTION) += fail_function.o
 obj-$(CONFIG_KGDB) += debug/
 obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
 obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
-obj-$(CONFIG_HARDLOCKUP_DETECTOR_PERF) += watchdog_hld.o
+obj-$(CONFIG_HARDLOCKUP_DETECTOR_CORE) += watchdog_hld.o
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index b352e507b17f..bb6435978c46 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -22,12 +22,8 @@
 
 static DEFINE_PER_CPU(bool, hard_watchdog_warn);
 static DEFINE_PER_CPU(bool, watchdog_nmi_touch);
-static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
-static DEFINE_PER_CPU(struct perf_event *, dead_event);
-static struct cpumask dead_events_mask;
 
 static unsigned long hardlockup_allcpu_dumped;
-static atomic_t watchdog_cpus = ATOMIC_INIT(0);
 
 notrace void arch_touch_nmi_watchdog(void)
 {
@@ -98,14 +94,6 @@ static inline bool watchdog_check_timestamp(void)
 }
 #endif
 
-static struct perf_event_attr wd_hw_attr = {
-   .type   = PERF_TYPE_HARDWARE,
-   .config = PERF_COUNT_HW_CPU_CYCLES,
-   .size   = sizeof(struct perf_event_attr),
-   .pinned = 1,
-   .disabled   = 1,
-};
-
 void inspect_for_hardlockups(struct pt_regs *regs)
 {
if (__this_cpu_read(watchdog_nmi_touch) == true) {
@@ -157,6 +145,24 @@ void inspect_for_hardlockups(struct pt_regs *regs)
return;
 }
 
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF
+#undef pr_fmt
+#define pr_fmt(fmt) "NMI perf watchdog: " fmt
+
+static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
+static DEFINE_PER_CPU(struct perf_event *, dead_event);
+static struct cpumask dead_events_mask;
+
+static atomic_t watchdog_cpus = ATOMIC_INIT(0);
+
+static struct perf_event_attr wd_hw_attr = {
+   .type   = PERF_TYPE_HARDWARE,
+   .config = PERF_COUNT_HW_CPU_CYCLES,
+   .size   = sizeof(struct perf_event_attr),
+   .pinned = 1,
+   .disabled   = 1,
+};
+
 /* Callback function for perf event subsystem */
 static void watchdog_overflow_callback(struct perf_event *event,
   struct perf_sample_data *data,
@@ -298,3 +304,5

[RFC PATCH v3 09/21] x86/nmi: Add a NMI_WATCHDOG NMI handler category

2019-05-14 Thread Ricardo Neri
Add NMI_WATCHDOG as a new category of NMI handler. This new category
is to be used with the HPET-based hardlockup detector. This detector
does not have a direct way of checking if the HPET timer is the source of
the NMI. Instead, it indirectly estimates it using the time-stamp counter.

Therefore, we may have false positives in case another NMI occurs within
the estimated time window. For this reason, we want the handler of the
detector to be called after all the NMI_LOCAL handlers. A simple way
of achieving this is with a new NMI handler category.

Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/nmi.h |  1 +
 arch/x86/kernel/nmi.c  | 10 ++
 2 files changed, 11 insertions(+)

diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
index 75ded1d13d98..75aa98313cde 100644
--- a/arch/x86/include/asm/nmi.h
+++ b/arch/x86/include/asm/nmi.h
@@ -29,6 +29,7 @@ enum {
NMI_UNKNOWN,
NMI_SERR,
NMI_IO_CHECK,
+   NMI_WATCHDOG,
NMI_MAX
 };
 
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 3755d0310026..a43213f0ab26 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -62,6 +62,10 @@ static struct nmi_desc nmi_desc[NMI_MAX] =
.lock = __RAW_SPIN_LOCK_UNLOCKED(&nmi_desc[3].lock),
.head = LIST_HEAD_INIT(nmi_desc[3].head),
},
+   {
+   .lock = __RAW_SPIN_LOCK_UNLOCKED(&nmi_desc[4].lock),
+   .head = LIST_HEAD_INIT(nmi_desc[4].head),
+   },
 
 };
 
@@ -172,6 +176,8 @@ int __register_nmi_handler(unsigned int type, struct nmiaction *action)
 */
WARN_ON_ONCE(type == NMI_SERR && !list_empty(&desc->head));
WARN_ON_ONCE(type == NMI_IO_CHECK && !list_empty(&desc->head));
+   WARN_ON_ONCE(type == NMI_WATCHDOG && !list_empty(&desc->head));
+
 
/*
 * some handlers need to be executed first otherwise a fake
@@ -382,6 +388,10 @@ static void default_do_nmi(struct pt_regs *regs)
}
raw_spin_unlock(&nmi_reason_lock);
 
+   handled = nmi_handle(NMI_WATCHDOG, regs);
+   if (handled == NMI_HANDLED)
+   return;
+
/*
 * Only one NMI can be latched at a time.  To handle
 * this we may process multiple nmi handlers at once to
-- 
2.17.1



[RFC PATCH v3 03/21] x86/hpet: Calculate ticks-per-second in a separate function

2019-05-14 Thread Ricardo Neri
It is easier to compute the expiration times of an HPET timer by using
its frequency (i.e., the number of times it ticks in a second) than its
period, as given in the capabilities register.

In addition to the HPET char driver, the HPET-based hardlockup detector
will also need to know the timer's frequency. Thus, create a common
function that both can use.

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 drivers/char/hpet.c  | 31 ---
 include/linux/hpet.h |  1 +
 2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/drivers/char/hpet.c b/drivers/char/hpet.c
index d0ad85900b79..bdcbecfdb858 100644
--- a/drivers/char/hpet.c
+++ b/drivers/char/hpet.c
@@ -836,6 +836,29 @@ static unsigned long hpet_calibrate(struct hpets *hpetp)
return ret;
 }
 
+u64 hpet_get_ticks_per_sec(u64 hpet_caps)
+{
+   u64 ticks_per_sec, period;
+
+   period = (hpet_caps & HPET_COUNTER_CLK_PERIOD_MASK) >>
+HPET_COUNTER_CLK_PERIOD_SHIFT; /* fs, 10^-15 */
+
+   /*
+* The frequency is the reciprocal of the period. The period is given
+* in femtoseconds. Thus, prepare a dividend of 10^15 femtoseconds per
+* second to obtain the frequency in ticks per second.
+*/
+
+   /* 10^15 femtoseconds per second */
+   ticks_per_sec = 1000000000000000uLL;
+   ticks_per_sec += period >> 1; /* round */
+
+   /* The quotient is put in the dividend. We drop the remainder. */
+   do_div(ticks_per_sec, period);
+
+   return ticks_per_sec;
+}
+
 int hpet_alloc(struct hpet_data *hdp)
 {
u64 cap, mcfg;
@@ -844,7 +867,6 @@ int hpet_alloc(struct hpet_data *hdp)
struct hpets *hpetp;
struct hpet __iomem *hpet;
static struct hpets *last;
-   unsigned long period;
unsigned long long temp;
u32 remainder;
 
@@ -894,12 +916,7 @@ int hpet_alloc(struct hpet_data *hdp)
 
last = hpetp;
 
-   period = (cap & HPET_COUNTER_CLK_PERIOD_MASK) >>
-   HPET_COUNTER_CLK_PERIOD_SHIFT; /* fs, 10^-15 */
-   temp = 1000000000000000uLL; /* 10^15 femtoseconds per second */
-   temp += period >> 1; /* round */
-   do_div(temp, period);
-   hpetp->hp_tick_freq = temp; /* ticks per second */
+   hpetp->hp_tick_freq = hpet_get_ticks_per_sec(cap);
 
printk(KERN_INFO "hpet%d: at MMIO 0x%lx, IRQ%s",
hpetp->hp_which, hdp->hd_phys_address,
diff --git a/include/linux/hpet.h b/include/linux/hpet.h
index 8604564b985d..e7b36bcf4699 100644
--- a/include/linux/hpet.h
+++ b/include/linux/hpet.h
@@ -107,5 +107,6 @@ static inline void hpet_reserve_timer(struct hpet_data *hd, int timer)
 }
 
 int hpet_alloc(struct hpet_data *);
+u64 hpet_get_ticks_per_sec(u64 hpet_caps);
 
 #endif /* !__HPET__ */
-- 
2.17.1



[RFC PATCH v3 12/21] watchdog/hardlockup/hpet: Adjust timer expiration on the number of monitored CPUs

2019-05-14 Thread Ricardo Neri
Each CPU should be monitored for hardlockups every watchdog_thresh seconds.
Since all the CPUs in the system are monitored by the same timer and the
timer interrupt is rotated among the monitored CPUs, the timer must expire
every watchdog_thresh/N seconds; where N is the number of monitored CPUs.
Use the new member of struct hpet_hld_data, ticks_per_cpu, to store the
aforementioned quantity.

The ticks-per-CPU quantity is updated every time the number of monitored
CPUs changes: when the watchdog is enabled or disabled for a specific CPU.
If the timer is used in periodic mode, it needs to be adjusted to reflect
the new expected expiration.

Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Borislav Petkov 
Cc: Jacob Pan 
Cc: "Rafael J. Wysocki" 
Cc: Don Zickus 
Cc: Nicholas Piggin 
Cc: Michael Ellerman 
Cc: Frederic Weisbecker 
Cc: Alexei Starovoitov 
Cc: Babu Moger 
Cc: Mathieu Desnoyers 
Cc: Masami Hiramatsu 
Cc: Peter Zijlstra 
Cc: Andrew Morton 
Cc: Philippe Ombredanne 
Cc: Colin Ian King 
Cc: Byungchul Park 
Cc: "Paul E. McKenney" 
Cc: "Luis R. Rodriguez" 
Cc: Waiman Long 
Cc: Josh Poimboeuf 
Cc: Randy Dunlap 
Cc: Davidlohr Bueso 
Cc: Marc Zyngier 
Cc: Kai-Heng Feng 
Cc: Konrad Rzeszutek Wilk 
Cc: David Rientjes 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hpet.h |  1 +
 arch/x86/kernel/watchdog_hld_hpet.c | 46 +++--
 2 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 31fc27508cf3..64acacce095d 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -114,6 +114,7 @@ struct hpet_hld_data {
boolhas_periodic;
u32 num;
u64 ticks_per_second;
+   u64 ticks_per_cpu;
u32 handling_cpu;
u32 enabled_cpus;
struct msi_msg  msi_msg;
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index c20b378b8c0c..9a3431a54616 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -44,6 +44,13 @@ static void kick_timer(struct hpet_hld_data *hdata, bool force)
 * are able to update the comparator before the counter reaches such new
 * value.
 *
+* Each CPU must be monitored every watchdog_thresh seconds. Since the
+* timer targets one CPU at a time, it must expire every
+*
+*	watchdog_thresh * ticks_per_cpu
+*
+* ticks, where ticks_per_cpu = ticks_per_second / enabled_cpus, as
+* computed in update_ticks_per_cpu().
+*
 * Let it wrap around if needed.
 */
 
@@ -51,10 +58,10 @@ static void kick_timer(struct hpet_hld_data *hdata, bool force)
return;
 
if (hdata->has_periodic)
-   period = watchdog_thresh * hdata->ticks_per_second;
+   period = watchdog_thresh * hdata->ticks_per_cpu;
 
count = hpet_readl(HPET_COUNTER);
-   new_compare = count + watchdog_thresh * hdata->ticks_per_second;
+   new_compare = count + watchdog_thresh * hdata->ticks_per_cpu;
hpet_set_comparator(hdata->num, (u32)new_compare, (u32)period);
 }
 
@@ -233,6 +240,27 @@ static int setup_hpet_irq(struct hpet_hld_data *hdata)
return ret;
 }
 
+/**
+ * update_ticks_per_cpu() - Update the number of HPET ticks per CPU
+ * @hdata: struct with the timer's ticks-per-second and CPU mask
+ *
+ * From the overall ticks-per-second of the timer, compute the number of ticks
+ * after which the timer should expire to monitor each CPU every
+ * watchdog_thresh seconds. The ticks-per-cpu quantity is computed using the
+ * number of CPUs that the watchdog currently monitors.
+ */
+static void update_ticks_per_cpu(struct hpet_hld_data *hdata)
+{
+   u64 temp = hdata->ticks_per_second;
+
+   /* Only update if there are monitored CPUs. */
+   if (!hdata->enabled_cpus)
+   return;
+
+   do_div(temp, hdata->enabled_cpus);
+   hdata->ticks_per_cpu = temp;
+}
+
 /**
  * hardlockup_detector_hpet_enable() - Enable the hardlockup detector
  * @cpu:   CPU Index in which the watchdog will be enabled.
@@ -245,13 +273,23 @@ void hardlockup_detector_hpet_enable(unsigned int cpu)
 {
cpumask_set_cpu(cpu, to_cpumask(hld_data->cpu_monitored_mask));
 
-   if (!hld_data->enabled_cpus++) {
+   hld_data->enabled_cpus++;
+   update_ticks_per_cpu(hld_data);
+
+   if (hld_data->enabled_cpus == 1) {
hld_data->handling_cpu = cpu;
update_msi_destid(hld_data);
/* Force timer kick when detector is just enabled */
kick_timer(hld_data, true);
enable_timer(hld_data);
 

[RFC PATCH v3 18/21] x86/apic: Add a parameter for the APIC delivery mode

2019-05-14 Thread Ricardo Neri
Until now, the delivery mode of APIC interrupts has been set to the default
mode configured in the APIC driver. However, hardware imposes no restriction
on configuring each interrupt with a different delivery mode. Specifying the
delivery mode per interrupt is useful when the mode of a particular interrupt
must differ from the default. For instance, this can be used to deliver an
interrupt as non-maskable.

Add a new member, delivery_mode, to struct irq_cfg. This new member can
be used to update the configuration of the delivery mode in each interrupt
domain. Likewise, add equivalent macros to populate MSI messages.

Currently, all interrupt domains set the delivery mode of interrupts using
the APIC setting. Interrupt domains use an irq_cfg data structure to
configure their own data structures and hardware resources. Thus, in order
to keep the current behavior, set the delivery mode of the irq
configuration to match the APIC setting. In this manner, irq domains can
obtain the delivery mode from the irq configuration data instead of the
APIC setting, if needed.
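
As an illustration, a child irq domain could then compose the MSI data word
from the per-irq configuration instead of hard-coding the APIC default (a
sketch only, not part of this patch; the delivery-mode macros are the ones
added to asm/msidef.h below, and compose_msi_data() is a hypothetical
helper):

#include <linux/msi.h>
#include <asm/hw_irq.h>
#include <asm/msidef.h>

static void compose_msi_data(struct irq_cfg *cfg, struct msi_msg *msg)
{
	/*
	 * Take the delivery mode from the irq configuration; with the
	 * default initialization added in this patch, this is equivalent
	 * to using apic->irq_delivery_mode.
	 */
	msg->data = MSI_DATA_TRIGGER_EDGE |
		    MSI_DATA_LEVEL_ASSERT |
		    MSI_DATA_DELIVERY_MODE(cfg->delivery_mode) |
		    MSI_DATA_VECTOR(cfg->vector);
}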

Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Borislav Petkov 
Cc: Jacob Pan 
Cc: Joerg Roedel 
Cc: Juergen Gross 
Cc: Bjorn Helgaas 
Cc: Wincy Van 
Cc: Kate Stewart 
Cc: Philippe Ombredanne 
Cc: "Eric W. Biederman" 
Cc: Baoquan He 
Cc: Dou Liyang 
Cc: Jan Kiszka 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hw_irq.h |  5 +++--
 arch/x86/include/asm/msidef.h |  3 +++
 arch/x86/kernel/apic/vector.c | 10 ++
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 32e666e1231e..c024e5976b78 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -117,8 +117,9 @@ struct irq_alloc_info {
 };
 
 struct irq_cfg {
-   unsigned intdest_apicid;
-   unsigned intvector;
+   unsigned intdest_apicid;
+   unsigned intvector;
+   enum ioapic_irq_destination_types   delivery_mode;
 };
 
 extern struct irq_cfg *irq_cfg(unsigned int irq);
diff --git a/arch/x86/include/asm/msidef.h b/arch/x86/include/asm/msidef.h
index 38ccfdc2d96e..6d666c90f057 100644
--- a/arch/x86/include/asm/msidef.h
+++ b/arch/x86/include/asm/msidef.h
@@ -16,6 +16,9 @@
 MSI_DATA_VECTOR_MASK)
 
 #define MSI_DATA_DELIVERY_MODE_SHIFT   8
+#define MSI_DATA_DELIVERY_MODE_MASK0x0700
+#define MSI_DATA_DELIVERY_MODE(dm)	(((dm) << MSI_DATA_DELIVERY_MODE_SHIFT) & \
+					 MSI_DATA_DELIVERY_MODE_MASK)
 #define  MSI_DATA_DELIVERY_FIXED   (0 << MSI_DATA_DELIVERY_MODE_SHIFT)
 #define  MSI_DATA_DELIVERY_LOWPRI  (1 << MSI_DATA_DELIVERY_MODE_SHIFT)
 #define  MSI_DATA_DELIVERY_NMI (4 << MSI_DATA_DELIVERY_MODE_SHIFT)
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 3173e07d3791..99436fe7e932 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -548,6 +548,16 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
irqd->chip_data = apicd;
irqd->hwirq = virq + i;
irqd_set_single_target(irqd);
+
+   /*
+    * Initialize the delivery mode of this irq to match the
+    * default delivery mode of the APIC. This is useful for
+    * child irq domains which want to take the delivery
+    * mode from the individual irq configuration rather
+    * than from the APIC.
+    */
+   apicd->hw_irq_cfg.delivery_mode = apic->irq_delivery_mode;
+
/*
 * Legacy vectors are already assigned when the IOAPIC
 * takes them over. They stay on the same vector. This is
-- 
2.17.1



[RFC PATCH v3 13/21] x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI

2019-05-14 Thread Ricardo Neri
The only direct method to determine whether an HPET timer caused an
interrupt is to read the Interrupt Status register. Unfortunately,
reading HPET registers is slow and, therefore, it is not recommended to
read them while in NMI context. Furthermore, status is not available if
the interrupt is generated via the Front Side Bus.

An indirect manner to infer whether the observed non-maskable interrupt was
caused by the HPET timer is to use the time-stamp counter. Compute the
value that the time-stamp counter should have at the next interrupt of the
HPET timer. Since the hardlockup detector operates in seconds, high
precision is not needed. This implementation considers that the HPET
caused the NMI if the time-stamp counter reads the expected value +/- 1.5%.
This margin is selected because it is roughly equivalent to 1/64, so the
division can be performed with a bit shift operation. Experimentally, the
error in the estimation is consistently less than 1%.

The computation of the expected value of the time-stamp counter must be
performed in relation to watchdog_thresh divided by the number of
monitored CPUs. This quantity is stored in tsc_ticks_per_cpu and must be
updated whenever the number of monitored CPUs changes.
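
Concretely, the window check can be done with a single unsigned comparison.
A user-space sketch of the arithmetic (illustrative only; rdtsc() is omitted
and stdint types stand in for the kernel's u64):

#include <stdbool.h>
#include <stdint.h>

static bool tsc_within_window(uint64_t tsc_curr, uint64_t tsc_next,
			      uint64_t tsc_next_error)
{
	/*
	 * Equivalent to |tsc_curr - tsc_next| < tsc_next_error, written as
	 * one unsigned comparison so that values slightly below tsc_next
	 * wrap into the [0, 2 * error) range instead of going negative.
	 */
	return (tsc_curr - tsc_next) + tsc_next_error < 2 * tsc_next_error;
}

For instance, with a 2 GHz TSC, watchdog_thresh = 10 and 8 monitored CPUs,
tsc_ticks_per_cpu = 2000000000 / 8 = 250000000, the expected delta is
10 * 250000000 = 2500000000 ticks, and the error window is
2500000000 >> 6 = 39062500 ticks, i.e. about 20 ms on either side.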

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Peter Zijlstra 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Mimi Zohar 
Cc: Jan Kiszka 
Cc: Nick Desaulniers 
Cc: Masahiro Yamada 
Cc: Nayna Jain 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Suggested-by: Andi Kleen 
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hpet.h |  2 ++
 arch/x86/kernel/watchdog_hld_hpet.c | 27 ++-
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 64acacce095d..fd99f2390714 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -115,6 +115,8 @@ struct hpet_hld_data {
u32 num;
u64 ticks_per_second;
u64 ticks_per_cpu;
+   u64 tsc_next;
+   u64 tsc_ticks_per_cpu;
u32 handling_cpu;
u32 enabled_cpus;
struct msi_msg  msi_msg;
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index 9a3431a54616..6f1f540cfee9 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -23,6 +23,7 @@
 
 static struct hpet_hld_data *hld_data;
 static bool hardlockup_use_hpet;
+static u64 tsc_next_error;
 
 /**
  * kick_timer() - Reprogram timer to expire in the future
@@ -32,11 +33,22 @@ static bool hardlockup_use_hpet;
  * Reprogram the timer to expire within watchdog_thresh seconds in the future.
  * If the timer supports periodic mode, it is not kicked unless @force is
  * true.
+ *
+ * Also, compute the expected value of the time-stamp counter at the time of
+ * expiration as well as the permitted deviation from the expected value. The
+ * maximum deviation is ~1.5%. This deviation can be computed by shifting
+ * right by 6 positions the delta between the current and expected time-stamp
+ * values.
  */
 static void kick_timer(struct hpet_hld_data *hdata, bool force)
 {
+   u64 tsc_curr, tsc_delta, new_compare, count, period = 0;
bool kick_needed = force || !(hdata->has_periodic);
-   u64 new_compare, count, period = 0;
+
+   tsc_curr = rdtsc();
+
+   tsc_delta = (unsigned long)watchdog_thresh * hdata->tsc_ticks_per_cpu;
+   hdata->tsc_next = tsc_curr + tsc_delta;
+   tsc_next_error = tsc_delta >> 6;
 
/*
 * Update the comparator in increments of watchdog_thresh seconds relative
@@ -92,6 +104,15 @@ static void enable_timer(struct hpet_hld_data *hdata)
  */
 static bool is_hpet_wdt_interrupt(struct hpet_hld_data *hdata)
 {
+   if (smp_processor_id() == hdata->handling_cpu) {
+   u64 tsc_curr;
+
+   tsc_curr = rdtsc();
+
+   return (tsc_curr - hdata->tsc_next) + tsc_next_error <
+  2 * tsc_next_error;
+   }
+
return false;
 }
 
@@ -259,6 +280,10 @@ static void update_ticks_per_cpu(struct hpet_hld_data *hdata)
 
do_div(temp, hdata->enabled_cpus);
hdata->ticks_per_cpu = temp;
+
+   temp = (unsigned long)tsc_khz * 1000L;
+   do_div(temp, hdata->enabled_cpus);
+   hdata->tsc_ticks_per_cpu = temp;
 }
 
 /**
-- 
2.17.1



[RFC PATCH v3 14/21] watchdog/hardlockup: Use parse_option_str() to handle "nmi_watchdog"

2019-05-14 Thread Ricardo Neri
Prepare hardlockup_panic_setup() to handle a comma-separated list of
options. This is needed to pass options to specific implementations of the
hardlockup detector.
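
parse_option_str() treats its input as a comma-separated list and reports
whether a given option is present. A user-space sketch of its semantics
(illustrative only; the real implementation lives in lib/cmdline.c):

#include <stdbool.h>
#include <string.h>

static bool has_option(const char *str, const char *option)
{
	size_t optlen = strlen(option);

	while (*str) {
		/* Match only a complete comma-separated token. */
		if (!strncmp(str, option, optlen) &&
		    (str[optlen] == ',' || str[optlen] == '\0'))
			return true;

		/* Advance past the next comma, if any. */
		str = strchr(str, ',');
		if (!str)
			break;
		str++;
	}

	return false;
}

With this change, "nmi_watchdog=panic" keeps its current meaning, while a
(hypothetical) "nmi_watchdog=panic,hpet" would still set hardlockup_panic
and leave the extra token for an implementation-specific parser.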

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Peter Zijlstra 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Mimi Zohar 
Cc: Jan Kiszka 
Cc: Nick Desaulniers 
Cc: Masahiro Yamada 
Cc: Nayna Jain 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 kernel/watchdog.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index be589001200a..fd50049449ec 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -70,13 +70,13 @@ void __init hardlockup_detector_disable(void)
 
 static int __init hardlockup_panic_setup(char *str)
 {
-   if (!strncmp(str, "panic", 5))
+   if (parse_option_str(str, "panic"))
hardlockup_panic = 1;
-   else if (!strncmp(str, "nopanic", 7))
+   else if (parse_option_str(str, "nopanic"))
hardlockup_panic = 0;
-   else if (!strncmp(str, "0", 1))
+   else if (parse_option_str(str, "0"))
nmi_watchdog_user_enabled = 0;
-   else if (!strncmp(str, "1", 1))
+   else if (parse_option_str(str, "1"))
nmi_watchdog_user_enabled = 1;
return 1;
 }
-- 
2.17.1


