AMD IOMMU BCM 5762 watchdog timeout under heavy traffic

2019-05-14 Thread Elmar Melcher
A kernel bisect identifies
[256e4621c21aa1bf704e1a12e643923fdb732d04] iommu/amd: Make use of the
generic IOVA allocator
as the cause.
More details: https://bugzilla.kernel.org/show_bug.cgi?id=203607


Re: [PATCH v3 02/16] iommu: Introduce cache_invalidate API

2019-05-14 Thread Jacob Pan
On Tue, 14 May 2019 10:44:01 -0700
Jacob Pan  wrote:

> Hi, thank you both for the explanation.
> 
> On Tue, 14 May 2019 11:41:24 +0100
> Jean-Philippe Brucker  wrote:
> 
> > On 14/05/2019 08:36, Auger Eric wrote:  
> > > Hi Jacob,
> > > 
> > > On 5/14/19 12:16 AM, Jacob Pan wrote:
> > >> On Mon, 13 May 2019 18:09:48 +0100
> > >> Jean-Philippe Brucker  wrote:
> > >>
> > >>> On 13/05/2019 17:50, Auger Eric wrote:
> > > struct iommu_inv_pasid_info {
> > > #define IOMMU_INV_PASID_FLAGS_PASID   (1 << 0)
> > > #define IOMMU_INV_PASID_FLAGS_ARCHID  (1 << 1)
> > >   __u32   flags;
> > >   __u32   archid;
> > >   __u64   pasid;
> > > };  
> >  I agree it does the job now. However it looks a bit strange to
> >  do a PASID based invalidation in my case - SMMUv3 nested stage
> >  - where I don't have any PASID involved.
> > 
> >  Couldn't we call it context based invalidation then? A context
> >  can be tagged by a PASID and/or an ARCHID.
> > >>>
> > >>> I think calling it "context" would be confusing as well (I
> > >>> shouldn't have used it earlier), since VT-d uses that name for
> > >>> device table entries (=STE on Arm SMMU). Maybe "addr_space"?
> > >>>
> > >> I am still struggling to understand what ARCHID is after scanning
> > >> through the SMMUv3.1 spec. It seems to be a constant for a given
> > >> SMMU. Why do you need to pass it down every time? Could you
> > >> point me to the document or explain a little more about ARCHID use
> > >> cases? We have three fields called pasid under this struct
> > >> iommu_cache_invalidate_info{}.
> > >> It gets confusing :)
> > > archid is a generic term. That's why you did not find it in the
> > > spec ;-)
> > > 
> > > On ARM SMMU the archid is called the ASID (Address Space ID), up to
> > > 16 bits. The ASID is stored in the Context Descriptor Entry (your
> > > PASID entry) and thus characterizes a given stage 1 translation
> > > "context"/"address space".
> > 
> > Yes, another way to look at it is, for a given address space:
> > * PASID tags device-IOTLB (ATC) entries.
> > * ASID (here called archid) tags IOTLB entries.
> > 
> > They could have the same value, but it depends on the guest's
> > allocation policy which isn't in our control. With my PASID patches
> > for SMMUv3, they have different values. So we need both fields if we
> > intend to invalidate both ATC and IOTLB with a single call.
> >   
> For ASID invalidation, there is also page/address-selective invalidation
> within an ASID, right? I guess it is CMD_TLBI_NH_VA?
> So the single call to invalidate both ATC & IOTLB should share the
> same address information, i.e.
> struct iommu_inv_addr_info {}
> 
Never mind this question; the archid field is already in the addr_info.
Sorry.
> Just out of curiosity, what is the advantage of having the guest tag its
> ATC with its own PASID? I thought you were planning to use a custom
> ioasid allocator to get PASIDs from the host.
> 
> Also, ASID is 16 bits as Eric said and PASID (SubstreamID?) is 20 bits,
> right?
> 
> > Thanks,
> > Jean
> >   
> > > 
> > > At the moment the ASID is allocated per iommu domain. With aux
> > > domains we should have one ASID per aux domain, Jean-Philippe
> > > said.
> > > 
> > > ASID tags IOTLB S1 entries. As the ASID is part of the "context
> > > descriptor" which is owned by the guest, the API must pass it
> > > somehow.
> > > 
> > > The 4.4.1.2 CMD_TLBI_NH_ASID(VMID, ASID) invalidation command allows
> > > invalidating all IOTLB S1 entries for a given VMID/ASID, and this
> > > is the functionality which is currently missing in the API. This
> > > is not an address based invalidation or a "pure" PASID based
> > > invalidation. At the moment we don't support PASIDs on ARM and I
> > > need this capability.
> > >   
> Got it.
> > > Thanks
> > > 
> > > Eric
> > > 
> > > 
> > > 
> > >>> Thanks,
> > >>> Jean
> > >>>
> > 
> >  Domain invalidation would invalidate all the contexts belonging
> >  to that domain.
> > 
> >  Thanks
> > 
> >  Eric  
> > >>
> > >> [Jacob Pan]
> > >>
> >   
> 
> [Jacob Pan]

[Jacob Pan]


Re: [PATCH v3 02/16] iommu: Introduce cache_invalidate API

2019-05-14 Thread Jacob Pan
On Tue, 14 May 2019 13:02:47 +0200
Auger Eric  wrote:

> Hi Jean,
> 
> On 5/14/19 12:42 PM, Jean-Philippe Brucker wrote:
> > On 14/05/2019 08:46, Auger Eric wrote:  
> >> Hi Jean,
> >>
> >> On 5/13/19 7:09 PM, Jean-Philippe Brucker wrote:  
> >>> On 13/05/2019 17:50, Auger Eric wrote:  
> > struct iommu_inv_pasid_info {
> > #define IOMMU_INV_PASID_FLAGS_PASID (1 << 0)
> > #define IOMMU_INV_PASID_FLAGS_ARCHID  (1 << 1)
> > __u32   flags;
> > __u32   archid;
> > __u64   pasid;
> > };  
>  I agree it does the job now. However it looks a bit strange to
>  do a PASID based invalidation in my case - SMMUv3 nested stage -
>  where I don't have any PASID involved.
> 
>  Couldn't we call it context based invalidation then? A context
>  can be tagged by a PASID and/or an ARCHID.
> >>>
> >>> I think calling it "context" would be confusing as well (I
> >>> shouldn't have used it earlier), since VT-d uses that name for
> >>> device table entries (=STE on Arm SMMU). Maybe "addr_space"?  
> >> yes, you're right. Well, we already use pasid table terminology, so
> >> we can use it here as well - as long as we understand what purpose
> >> it serves ;-) - so OK for iommu_inv_pasid_info.
> >>
> >> I think Jean understood we would keep a standalone pasid field in
> I meant Jacob here.
> >> iommu_cache_invalidate_info's union. I understand the struct
> >> iommu_inv_pasid_info now would replace it, correct?  
> 
> Thank you for the confirmation.
> 
Yes, I agree to replace the standalone __u64 pasid with this struct.
It looks more in line with the address selective info. Just to double
confirm the new struct.

Jean, will you put this in your sva/api repo?

struct iommu_cache_invalidate_info {
#define IOMMU_CACHE_INVALIDATE_INFO_VERSION_1 1
        __u32   version;
/* IOMMU paging structure cache */
#define IOMMU_CACHE_INV_TYPE_IOTLB      (1 << 0) /* IOMMU IOTLB */
#define IOMMU_CACHE_INV_TYPE_DEV_IOTLB  (1 << 1) /* Device IOTLB */
#define IOMMU_CACHE_INV_TYPE_PASID      (1 << 2) /* PASID cache */
#define IOMMU_CACHE_TYPE_NR     (3)
        __u8    cache;
        __u8    granularity;
        __u8    padding[2];
        union {
                struct iommu_inv_pasid_info pasid_info;
                struct iommu_inv_addr_info addr_info;
        };
};
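
For illustration, the CMD_TLBI_NH_ASID case Eric describes (an IOTLB
invalidation tagged only by the guest's ASID) might be encoded against the
struct above roughly as follows. This is a sketch: the granularity constant
name IOMMU_INV_GRANU_PASID, the guest_asid variable, and the exact field
usage are assumptions, not taken from the posted patches.

struct iommu_cache_invalidate_info inv = {
        .version     = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1,
        .cache       = IOMMU_CACHE_INV_TYPE_IOTLB,
        .granularity = IOMMU_INV_GRANU_PASID,  /* assumed constant name */
};

/* Tag by ASID (archid) only; no PASID is involved in the nested-stage case */
inv.pasid_info.flags  = IOMMU_INV_PASID_FLAGS_ARCHID;
inv.pasid_info.archid = guest_asid;            /* ASID owned by the guest */

To invalidate both ATC and IOTLB in one call, both flags would be set and
both pasid and archid filled in, per Jean-Philippe's explanation above.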





Re: [PATCH v3 02/16] iommu: Introduce cache_invalidate API

2019-05-14 Thread Jacob Pan
Hi, thank you both for the explanation.

On Tue, 14 May 2019 11:41:24 +0100
Jean-Philippe Brucker  wrote:

> On 14/05/2019 08:36, Auger Eric wrote:
> > Hi Jacob,
> > 
> > On 5/14/19 12:16 AM, Jacob Pan wrote:  
> >> On Mon, 13 May 2019 18:09:48 +0100
> >> Jean-Philippe Brucker  wrote:
> >>  
> >>> On 13/05/2019 17:50, Auger Eric wrote:  
> > struct iommu_inv_pasid_info {
> > #define IOMMU_INV_PASID_FLAGS_PASID (1 << 0)
> > #define IOMMU_INV_PASID_FLAGS_ARCHID  (1 << 1)
> > __u32   flags;
> > __u32   archid;
> > __u64   pasid;
> > };
>  I agree it does the job now. However it looks a bit strange to
>  do a PASID based invalidation in my case - SMMUv3 nested stage -
>  where I don't have any PASID involved.
> 
>  Couldn't we call it context based invalidation then? A context
>  can be tagged by a PASID and/or an ARCHID.
> >>>
> >>> I think calling it "context" would be confusing as well (I
> >>> shouldn't have used it earlier), since VT-d uses that name for
> >>> device table entries (=STE on Arm SMMU). Maybe "addr_space"?
> >>>  
> >> I am still struggling to understand what ARCHID is after scanning
> >> through the SMMUv3.1 spec. It seems to be a constant for a given SMMU.
> >> Why do you need to pass it down every time? Could you point me to
> >> the document or explain a little more about ARCHID use cases?
> >> We have three fields called pasid under this struct
> >> iommu_cache_invalidate_info{}.
> >> It gets confusing :)
> > archid is a generic term. That's why you did not find it in the
> > spec ;-)
> > 
> > On ARM SMMU the archid is called the ASID (Address Space ID), up to
> > 16 bits. The ASID is stored in the Context Descriptor Entry (your
> > PASID entry) and thus characterizes a given stage 1 translation
> > "context"/"address space".
> 
> Yes, another way to look at it is, for a given address space:
> * PASID tags device-IOTLB (ATC) entries.
> * ASID (here called archid) tags IOTLB entries.
> 
> They could have the same value, but it depends on the guest's
> allocation policy which isn't in our control. With my PASID patches
> for SMMUv3, they have different values. So we need both fields if we
> intend to invalidate both ATC and IOTLB with a single call.
> 
For ASID invalidation, there is also page/address-selective invalidation
within an ASID, right? I guess it is CMD_TLBI_NH_VA?
So the single call to invalidate both ATC & IOTLB should share the same
address information, i.e.
struct iommu_inv_addr_info {}

Just out of curiosity, what is the advantage of having the guest tag its
ATC with its own PASID? I thought you were planning to use a custom
ioasid allocator to get PASIDs from the host.

Also, ASID is 16 bits as Eric said and PASID (SubstreamID?) is 20 bits,
right?

> Thanks,
> Jean
> 
> > 
> > At the moment the ASID is allocated per iommu domain. With aux
> > domains we should have one ASID per aux domain, Jean-Philippe said.
> > 
> > ASID tags IOTLB S1 entries. As the ASID is part of the "context
> > descriptor" which is owned by the guest, the API must pass it
> > somehow.
> > 
> > The 4.4.1.2 CMD_TLBI_NH_ASID(VMID, ASID) invalidation command allows
> > invalidating all IOTLB S1 entries for a given VMID/ASID, and this is
> > the functionality which is currently missing in the API. This is
> > not an address based invalidation or a "pure" PASID based
> > invalidation. At the moment we don't support PASIDs on ARM and I
> > need this capability.
> > 
Got it.
> > Thanks
> > 
> > Eric
> > 
> > 
> >   
> >>> Thanks,
> >>> Jean
> >>>  
> 
>  Domain invalidation would invalidate all the contexts belonging
>  to that domain.
> 
>  Thanks
> 
>  Eric
> >>
> >> [Jacob Pan]
> >>  
> 

[Jacob Pan]


Re: [RFC] iommu: arm-smmu: stall support

2019-05-14 Thread Rob Clark
On Tue, May 14, 2019 at 3:24 AM Robin Murphy  wrote:
>
> On 14/05/2019 02:54, Rob Clark wrote:
> > On Mon, May 13, 2019 at 11:37 AM Jean-Philippe Brucker
> >  wrote:
> >>
> >> Hi Rob,
> >>
> >> On 10/05/2019 19:23, Rob Clark wrote:
> >>> On Fri, Sep 22, 2017 at 2:58 AM Jean-Philippe Brucker
> >>>  wrote:
> 
>  On 22/09/17 10:02, Joerg Roedel wrote:
> > On Tue, Sep 19, 2017 at 10:23:43AM -0400, Rob Clark wrote:
> >> I would like to decide in the IRQ whether or not to queue work,
> >> because when we get a gpu fault, we tend to get 1000's of gpu faults
> >> all at once (and I really only need to handle the first one).  I
> >> suppose that could also be achieved by having a special return value
> >> from the fault handler to say "call me again from a wq"..
> >>
> >> Note that in the drm driver I already have a suitable wq to queue the
> >> work, so it really doesn't buy me anything to have the iommu driver
> >> toss things off to a wq for me.  Might be a different situation for
> >> other drivers (but I guess mostly other drivers are using iommu API
> >> indirectly via dma-mapping?)
> >
> > Okay, so since you are the only user for now, we don't need a
> > work-queue. But I still want the ->resume call-back to be hidden in the
> > iommu code and not be exposed to users.
> >
> > We already have per-domain fault-handlers, so the best solution for now
> > is to call ->resume from report_iommu_fault() when the fault-handler
> > returns a special value.
> 
>  The problem is that report_iommu_fault is called from IRQ context by the
>  SMMU driver, so the device driver callback cannot sleep.
> 
>  So if the device driver needs to be able to sleep between fault report
>  and resume, as I understand Rob needs for writing debugfs, we can either:
> 
>  * call report_iommu_fault from higher up, in a thread or workqueue.
>  * split the fault reporting as this patch proposes. The exact same
>    mechanism is needed for the vSVM work by Intel: in order to inject
>    fault into the guest, they would like to have an atomic notifier
>    registered by VFIO for passing down the Page Request, and a new
>    function in the IOMMU API to resume/complete the fault.
> 
> >>>
> >>> So I was thinking about this topic again.. I would still like to get
> >>> some sort of async resume so that I can wire up GPU cmdstream/state
> >>> logging on iommu fault (without locally resurrecting and rebasing this
> >>> patch and drm/msm side changes each time I need to debug iommu
> >>> faults)..
> >>
> >> We've been working on the new fault reporting API with Jacob and Eric,
> >> and I intend to send it out soon. It is supposed to be used for
> >> reporting faults to guests via VFIO, handling page faults via mm, and
> >> also reporting events directly to device drivers. Please let us know
> >> what works and what doesn't in your case.
> >>
> >> The most recent version of the patches is at
> >> http://www.linux-arm.org/git?p=linux-jpb.git;a=shortlog;h=refs/heads/sva/api
> >> (git://www.linux-arm.org/linux-jpb.git branch sva/api). Hopefully on the
> >> list sometime next week, I'll add you on Cc.
> >>
> >> In particular, see commits
> >>  iommu: Introduce device fault data
> >>  iommu: Introduce device fault report API
> >>  iommu: Add recoverable fault reporting
> >>
> >> The device driver calls iommu_register_device_fault_handler(dev, cb,
> >> data). To report a fault, the SMMU driver calls
> >> iommu_report_device_fault(dev, fault). This calls into the device driver
> >> directly; there isn't any workqueue. If the fault is recoverable (the
> >> SMMU driver set type IOMMU_FAULT_PAGE_REQ rather than
> >> IOMMU_FAULT_DMA_UNRECOV), the device driver calls iommu_page_response()
> >> once it has dealt with the fault (after sleeping if it needs to). This
> >> invokes the SMMU driver's resume callback.
> >
> > Ok, this sounds at a high level similar to my earlier RFC, in that
> > resume is split (and that was the main thing I was interested in).
> > And it does solve one thing I was struggling with, namely that when
> > the domain is created it doesn't know which iommu device it will be
> > attached to (given that at least the original arm-smmu.c driver cannot
> > support stall in all cases)..
> >
> > For GPU translation faults, I also don't really need to know if the
> > faulting translation is stalled until the callback (I mainly want to
> > not bother to snapshot GPU state if it is not stalled, because in that
> > case the data we snapshot is unlikely to be related to the fault).
> >
> >> At the moment we use mutexes, so iommu_report_device_fault() can only be
> >> called from an IRQ thread, which is incompatible with the current SMMUv2
> >> driver. Either we need to switch the SMMUv2 driver to an IRQ 
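
As a concrete reference for the API described above, a device driver
consumer might look roughly like the sketch below. It only illustrates the
flow Jean-Philippe outlines (register a handler, get called back, respond
later from sleepable context); the struct iommu_fault field names, the
gpu_* types, and the error return convention are assumptions, not the
posted code.

#include <linux/iommu.h>
#include <linux/workqueue.h>

/* Called by iommu_report_device_fault(); may run in IRQ context. */
static int gpu_iommu_fault_cb(struct iommu_fault *fault, void *data)
{
        struct gpu_device *gpu = data;  /* hypothetical driver struct */

        if (fault->type != IOMMU_FAULT_PAGE_REQ)
                return -EOPNOTSUPP;     /* unrecoverable fault, not handled */

        /* Defer the sleepable work (state snapshot, debugfs) to our wq. */
        queue_work(gpu->wq, &gpu->fault_work);
        return 0;
}

/* At probe time: */
iommu_register_device_fault_handler(dev, gpu_iommu_fault_cb, gpu);

/* Later, from the worker, once the fault has been dealt with: */
iommu_page_response(dev, &resp);  /* invokes the SMMU resume callback */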

Re: [RFC v3 1/3] vfio_pci: split vfio_pci.c into two source files

2019-05-14 Thread Alex Williamson
On Tue, 23 Apr 2019 20:14:38 +0800
"Liu, Yi L"  wrote:

> This patch splits the non-module-specific code from the original
> drivers/vfio/pci/vfio_pci.c into a common.c under drivers/vfio/pci.
> This is for potential code sharing, e.g. the vfio-mdev-pci driver.
> 
> Cc: Kevin Tian 
> Cc: Lu Baolu 
> Signed-off-by: Liu, Yi L 
> ---
>  drivers/vfio/pci/Makefile   |2 +-
>  drivers/vfio/pci/common.c   | 1511 +++
>  drivers/vfio/pci/vfio_pci.c | 1476 +-
>  drivers/vfio/pci/vfio_pci_private.h |   27 +
>  4 files changed, 1551 insertions(+), 1465 deletions(-)
>  create mode 100644 drivers/vfio/pci/common.c
> 
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 9662c06..813f6b3 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -1,5 +1,5 @@
>  
> -vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
> +vfio-pci-y := vfio_pci.o common.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
>  vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>  
> diff --git a/drivers/vfio/pci/common.c b/drivers/vfio/pci/common.c
> new file mode 100644
> index 000..847e2e4
> --- /dev/null
> +++ b/drivers/vfio/pci/common.c

Nit, I realize that our file naming scheme includes a lot of
redundancy, but better to have redundancy than inconsistency imo.
Perhaps vfio_pci_common.c.

> @@ -0,0 +1,1511 @@
> +/*
> + * Copyright © 2019 Intel Corporation.
> + * Author: Liu, Yi L 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio_pci.c:
> + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> + * Author: Alex Williamson 
> + *
> + * Derived from original vfio:
> + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> + * Author: Tom Lyon, p...@cisco.com
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "vfio_pci_private.h"
> +

[snip faithful code moves]

> +void vfio_pci_vga_probe(struct vfio_pci_device *vdev)
> +{
> + vga_client_register(vdev->pdev, vdev, NULL, vfio_pci_set_vga_decode);
> + vga_set_legacy_decoding(vdev->pdev,
> + vfio_pci_set_vga_decode(vdev, false));
> +}
> +
> +void vfio_pci_vga_remove(struct vfio_pci_device *vdev)
> +{
> + vga_client_register(vdev->pdev, NULL, NULL, NULL);
> + vga_set_legacy_decoding(vdev->pdev,
> + VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
> + VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
> +}

Two new functions, though the names don't really match their purpose.

[snip more faithful code moves]
> +
> +void vfio_pci_probe_idle_d3(struct vfio_pci_device *vdev)
> +{
> +
> + /*
> +  * pci-core sets the device power state to an unknown value at
> +  * bootup and after being removed from a driver.  The only
> +  * transition it allows from this unknown state is to D0, which
> +  * typically happens when a driver calls pci_enable_device().
> +  * We're not ready to enable the device yet, but we do want to
> +  * be able to get to D3.  Therefore first do a D0 transition
> +  * before going to D3.
> +  */
> + vfio_pci_set_power_state(vdev, PCI_D0);
> + vfio_pci_set_power_state(vdev, PCI_D3hot);
> +}

Another new function.  This also doesn't really match function name to
purpose.

[snip more faithful code moves]
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 3fa20e9..6ce1a81 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
[en masse code deletes]
> @@ -1324,6 +147,11 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>   spin_lock_init(&vdev->irqlock);
>   mutex_init(&vdev->ioeventfds_lock);
>   INIT_LIST_HEAD(&vdev->ioeventfds_list);
> + vdev->nointxmask = nointxmask;
> +#ifdef CONFIG_VFIO_PCI_VGA
> + vdev->disable_vga = disable_vga;
> +#endif
> + vdev->disable_idle_d3 = disable_idle_d3;
>  
>   ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
>   if (ret) {
> @@ -1340,27 +168,13 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>   return ret;
>   }
>  
> - if (vfio_pci_is_vga(pdev)) {
> - vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
> - vga_set_legacy_decoding(pdev,
> - vfio_pci_set_vga_decode(vdev, false));
> - }
> + if (vfio_pci_is_vga(pdev))
> + 

Re: [PATCH] iommu/amd: print out "tag" in INVALID_PPR_REQUEST

2019-05-14 Thread Qian Cai
On Tue, 2019-05-07 at 13:47 +, Gary R Hook wrote:
> On 5/5/19 11:11 PM, Qian Cai wrote:
> > [CAUTION: External Email]
> > 
> > The commit e7f63ffc1bf1 ("iommu/amd: Update logging information for new
> > event type") introduced a variable "tag" but had never used it which
> > generates a warning below,
> > 
> > drivers/iommu/amd_iommu.c: In function 'iommu_print_event':
> > drivers/iommu/amd_iommu.c:567:33: warning: variable 'tag' set but not used [-Wunused-but-set-variable]
> >    int type, devid, pasid, flags, tag;
> >   ^~~
> > so just use it during the logging.
> > 
> > Signed-off-by: Qian Cai 
> > ---
> >   drivers/iommu/amd_iommu.c | 4 ++--
> >   1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> > index f7cdd2ab7f11..52f41369c5b3 100644
> > --- a/drivers/iommu/amd_iommu.c
> > +++ b/drivers/iommu/amd_iommu.c
> > @@ -631,9 +631,9 @@ static void iommu_print_event(struct amd_iommu *iommu,
> > void *__evt)
> >  pasid = ((event[0] >> 16) & 0xFFFF)
> >  | ((event[1] << 6) & 0xF0000);
> >  tag = event[1] & 0x03FF;
> > -   dev_err(dev, "Event logged [INVALID_PPR_REQUEST device=%02x:%02x.%x pasid=0x%05x address=0x%llx flags=0x%04x]\n",
> > +   dev_err(dev, "Event logged [INVALID_PPR_REQUEST device=%02x:%02x.%x pasid=0x%05x tag=0x%04x address=0x%llx flags=0x%04x]\n",
> >  PCI_BUS_NUM(devid), PCI_SLOT(devid), PCI_FUNC(devid),
> > -   pasid, address, flags);
> > +   pasid, tag, address, flags);
> >  break;
> >  default:
> >  dev_err(dev, "Event logged [UNKNOWN event[0]=0x%08x event[1]=0x%08x event[2]=0x%08x event[3]=0x%08x\n",
> 
> I did manage to overlook that variable when I posted the original patch. 
> But it looks to me like 41e59a41fc5d1 (iommu tree) already fixed this... 
> I'm not sure why it never got pushed to the main tree.

Jroedel, I am wondering what the plan is for 41e59a41fc5d1 (iommu tree) or
this patch to be pushed to linux-next or mainline...

Re: [RFC PATCH v3 11/21] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector

2019-05-14 Thread Randy Dunlap
On 5/14/19 7:02 AM, Ricardo Neri wrote:
> diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
> index 15d0fbe27872..376a5db81aec 100644
> --- a/arch/x86/Kconfig.debug
> +++ b/arch/x86/Kconfig.debug
> @@ -169,6 +169,17 @@ config IOMMU_LEAK
>  config HAVE_MMIOTRACE_SUPPORT
>   def_bool y
>  
> +config X86_HARDLOCKUP_DETECTOR_HPET
> + bool "Use HPET Timer for Hard Lockup Detection"
> + select SOFTLOCKUP_DETECTOR
> + select HARDLOCKUP_DETECTOR
> + select HARDLOCKUP_DETECTOR_CORE
> + depends on HPET_TIMER && HPET && X86_64
> + help
> +   Say y to enable a hardlockup detector that is driven by an High-

   by a

> +   Precision Event Timer. This option is helpful to not use counters
> +   from the Performance Monitoring Unit to drive the detector.
> +
>  config X86_DECODER_SELFTEST
>   bool "x86 instruction decoder selftest"
>   depends on DEBUG_KERNEL && KPROBES


-- 
~Randy


Re: [RFC PATCH v3 04/21] x86/hpet: Add hpet_set_comparator() for periodic and one-shot modes

2019-05-14 Thread Randy Dunlap
On 5/14/19 7:01 AM, Ricardo Neri wrote:
> Instead of setting the timer period directly in hpet_set_periodic(), add a
> new helper function hpet_set_comparator() that only sets the accumulator
> and comparator. hpet_set_periodic() will only prepare the timer for
> periodic mode and leave the expiration programming to
> hpet_set_comparator().
> 
> This new function can also be used by other components (e.g., the HPET-
> based hardlockup detector) which also need to configure HPET timers. Thus,
> add its declaration into the hpet header file.
> 
> Cc: "H. Peter Anvin" 
> Cc: Ashok Raj 
> Cc: Andi Kleen 
> Cc: Tony Luck 
> Cc: Philippe Ombredanne 
> Cc: Kate Stewart 
> Cc: "Rafael J. Wysocki" 
> Cc: Stephane Eranian 
> Cc: Suravee Suthikulpanit 
> Cc: "Ravi V. Shankar" 
> Cc: x...@kernel.org
> Originally-by: Suravee Suthikulpanit 
> Signed-off-by: Ricardo Neri 
> ---
>  arch/x86/include/asm/hpet.h |  1 +
>  arch/x86/kernel/hpet.c  | 57 -
>  2 files changed, 45 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
> index f132fbf984d4..e7098740f5ee 100644
> --- a/arch/x86/include/asm/hpet.h
> +++ b/arch/x86/include/asm/hpet.h
> @@ -102,6 +102,7 @@ extern int hpet_rtc_timer_init(void);
>  extern irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id);
>  extern int hpet_register_irq_handler(rtc_irq_handler handler);
>  extern void hpet_unregister_irq_handler(rtc_irq_handler handler);
> +extern void hpet_set_comparator(int num, unsigned int cmp, unsigned int period);
>  
>  #endif /* CONFIG_HPET_EMULATE_RTC */
>  
> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
> index 560fc28e1d13..c5c5fc150193 100644
> --- a/arch/x86/kernel/hpet.c
> +++ b/arch/x86/kernel/hpet.c
> @@ -289,6 +289,46 @@ static void hpet_legacy_clockevent_register(void)
>   printk(KERN_DEBUG "hpet clockevent registered\n");
>  }
>  
> +/**
> + * hpet_set_comparator() - Helper function for setting comparator register
> + * @num: The timer ID
> + * @cmp: The value to be written to the comparator/accumulator
> + * @period:  The value to be written to the period (0 = oneshot mode)
> + *
> + * Helper function for updating comparator, accumulator and period values.
> + *
> + * In periodic mode, HPET needs HPET_TN_SETVAL to be set before writing
> + * to the Tn_CMP to update the accumulator. Then, HPET needs a second
> + * write (with HPET_TN_SETVAL cleared) to Tn_CMP to set the period.
> + * The HPET_TN_SETVAL bit is automatically cleared after the first write.
> + *
> + * For one-shot mode, HPET_TN_SETVAL does not need to be set.
> + *
> + * See the following documents:
> + *   - Intel IA-PC HPET (High Precision Event Timers) Specification
> + *   - AMD-8111 HyperTransport I/O Hub Data Sheet, Publication # 24674
> + */
> +void hpet_set_comparator(int num, unsigned int cmp, unsigned int period)
> +{
> + if (period) {
> + unsigned int v = hpet_readl(HPET_Tn_CFG(num));
> +
> + hpet_writel(v | HPET_TN_SETVAL, HPET_Tn_CFG(num));
> + }
> +
> + hpet_writel(cmp, HPET_Tn_CMP(num));
> +
> + if (!period)
> + return;
> +
> + /* This delay is seldom used: never in one-shot mode and in periodic
> +  * only when reprogramming the timer.
> +  */

comment style warning ;)

> + udelay(1);
> + hpet_writel(period, HPET_Tn_CMP(num));
> +}
> +EXPORT_SYMBOL_GPL(hpet_set_comparator);
> +
>  static int hpet_set_periodic(struct clock_event_device *evt, int timer)
>  {
>   unsigned int cfg, cmp, now;
> @@ -300,19 +340,10 @@ static int hpet_set_periodic(struct clock_event_device *evt, int timer)
>   now = hpet_readl(HPET_COUNTER);
>   cmp = now + (unsigned int)delta;
>   cfg = hpet_readl(HPET_Tn_CFG(timer));
> - cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_SETVAL |
> -HPET_TN_32BIT;
> - hpet_writel(cfg, HPET_Tn_CFG(timer));
> - hpet_writel(cmp, HPET_Tn_CMP(timer));
> - udelay(1);
> - /*
> -  * HPET on AMD 81xx needs a second write (with HPET_TN_SETVAL
> -  * cleared) to T0_CMP to set the period. The HPET_TN_SETVAL
> -  * bit is automatically cleared after the first write.
> -  * (See AMD-8111 HyperTransport I/O Hub Data Sheet,
> -  * Publication # 24674)
> -  */
> - hpet_writel((unsigned int)delta, HPET_Tn_CMP(timer));
> + cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_32BIT;
> +
> + hpet_set_comparator(timer, cmp, (unsigned int)delta);
> +
>   hpet_start_counter();
>   hpet_print_config();
>  
> 


-- 
~Randy


Re: [RFC PATCH v3 03/21] x86/hpet: Calculate ticks-per-second in a separate function

2019-05-14 Thread Randy Dunlap
On 5/14/19 7:01 AM, Ricardo Neri wrote:
> It is easier to compute the expiration times of an HPET timer by using
> its frequency (i.e., the number of times it ticks in a second) than its
> period, as given in the capabilities register.
> 
> In addition to the HPET char driver, the HPET-based hardlockup detector
> will also need to know the timer's frequency. Thus, create a common
> function that both can use.
> 
> Cc: "H. Peter Anvin" 
> Cc: Ashok Raj 
> Cc: Andi Kleen 
> Cc: Tony Luck 
> Cc: Clemens Ladisch 
> Cc: Arnd Bergmann 
> Cc: Philippe Ombredanne 
> Cc: Kate Stewart 
> Cc: "Rafael J. Wysocki" 
> Cc: Stephane Eranian 
> Cc: Suravee Suthikulpanit 
> Cc: "Ravi V. Shankar" 
> Cc: x...@kernel.org
> Signed-off-by: Ricardo Neri 
> ---
>  drivers/char/hpet.c  | 31 ---
>  include/linux/hpet.h |  1 +
>  2 files changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/char/hpet.c b/drivers/char/hpet.c
> index d0ad85900b79..bdcbecfdb858 100644
> --- a/drivers/char/hpet.c
> +++ b/drivers/char/hpet.c
> @@ -836,6 +836,29 @@ static unsigned long hpet_calibrate(struct hpets *hpetp)
>   return ret;
>  }
>  
> +u64 hpet_get_ticks_per_sec(u64 hpet_caps)
> +{
> + u64 ticks_per_sec, period;
> +
> + period = (hpet_caps & HPET_COUNTER_CLK_PERIOD_MASK) >>
> +  HPET_COUNTER_CLK_PERIOD_SHIFT; /* fs, 10^-15 */
> +
> + /*
> +  * The frequency is the reciprocal of the period. The period is given
> +  * femtoseconds per second. Thus, prepare a dividend to obtain the

 * in femtoseconds per second.

> +  * frequency in ticks per second.
> +  */
> +
> + /* 10^15 femtoseconds per second */
> + ticks_per_sec = 1000000000000000uLL;

ULL is overwhelmingly used in the kernel.

> + ticks_per_sec += period >> 1; /* round */
> +
> + /* The quotient is put in the dividend. We drop the remainder. */
> + do_div(ticks_per_sec, period);
> +
> + return ticks_per_sec;
> +}
> +
>  int hpet_alloc(struct hpet_data *hdp)
>  {
>   u64 cap, mcfg;
> @@ -844,7 +867,6 @@ int hpet_alloc(struct hpet_data *hdp)
>   struct hpets *hpetp;
>   struct hpet __iomem *hpet;
>   static struct hpets *last;
> - unsigned long period;
>   unsigned long long temp;
>   u32 remainder;
>  
> @@ -894,12 +916,7 @@ int hpet_alloc(struct hpet_data *hdp)
>  
>   last = hpetp;
>  
> - period = (cap & HPET_COUNTER_CLK_PERIOD_MASK) >>
> - HPET_COUNTER_CLK_PERIOD_SHIFT; /* fs, 10^-15 */
> - temp = 1000000000000000uLL; /* 10^15 femtoseconds per second */
> - temp += period >> 1; /* round */
> - do_div(temp, period);
> - hpetp->hp_tick_freq = temp; /* ticks per second */
> + hpetp->hp_tick_freq = hpet_get_ticks_per_sec(cap);
>  
>   printk(KERN_INFO "hpet%d: at MMIO 0x%lx, IRQ%s",
>   hpetp->hp_which, hdp->hd_phys_address,
> diff --git a/include/linux/hpet.h b/include/linux/hpet.h
> index 8604564b985d..e7b36bcf4699 100644
> --- a/include/linux/hpet.h
> +++ b/include/linux/hpet.h
> @@ -107,5 +107,6 @@ static inline void hpet_reserve_timer(struct hpet_data *hd, int timer)
>  }
>  
>  int hpet_alloc(struct hpet_data *);
> +u64 hpet_get_ticks_per_sec(u64 hpet_caps);
>  
>  #endif   /* !__HPET__ */
> 


-- 
~Randy


[RFC PATCH v3 10/21] watchdog/hardlockup: Add function to enable NMI watchdog on all allowed CPUs at once

2019-05-14 Thread Ricardo Neri
When there is more than one implementation of the NMI watchdog, there may
be situations in which switching from one to another is needed (e.g., if
the time-stamp counter becomes unstable, the HPET-based NMI watchdog can
no longer be used).

The perf-based implementation of the hardlockup detector makes use of
various per-CPU variables which are accessed via this_cpu operations.
Hence, each CPU needs to enable its own NMI watchdog if using the perf
implementation.

Add functionality to switch from one NMI watchdog to another and do it
from each allowed CPU.

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: "Rafael J. Wysocki" 
Cc: Don Zickus 
Cc: Nicholas Piggin 
Cc: Michael Ellerman 
Cc: Frederic Weisbecker 
Cc: Alexei Starovoitov 
Cc: Babu Moger 
Cc: "David S. Miller" 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Mathieu Desnoyers 
Cc: Masami Hiramatsu 
Cc: Peter Zijlstra 
Cc: Andrew Morton 
Cc: Philippe Ombredanne 
Cc: Colin Ian King 
Cc: Byungchul Park 
Cc: "Paul E. McKenney" 
Cc: "Luis R. Rodriguez" 
Cc: Waiman Long 
Cc: Josh Poimboeuf 
Cc: Randy Dunlap 
Cc: Davidlohr Bueso 
Cc: Marc Zyngier 
Cc: Kai-Heng Feng 
Cc: Konrad Rzeszutek Wilk 
Cc: David Rientjes 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Signed-off-by: Ricardo Neri 
---
 include/linux/nmi.h |  2 ++
 kernel/watchdog.c   | 15 +++
 2 files changed, 17 insertions(+)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index e5f1a86e20b7..6d828334348b 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -83,9 +83,11 @@ static inline void reset_hung_task_detector(void) { }
 
 #if defined(CONFIG_HARDLOCKUP_DETECTOR)
 extern void hardlockup_detector_disable(void);
+extern void hardlockup_start_all(void);
 extern unsigned int hardlockup_panic;
 #else
 static inline void hardlockup_detector_disable(void) {}
+static inline void hardlockup_start_all(void) {}
 #endif
 
 #if defined(CONFIG_HAVE_NMI_WATCHDOG) || defined(CONFIG_HARDLOCKUP_DETECTOR)
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 7f9e7b9306fe..be589001200a 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -566,6 +566,21 @@ int lockup_detector_offline_cpu(unsigned int cpu)
return 0;
 }
 
+static int hardlockup_start_fn(void *data)
+{
+   watchdog_nmi_enable(smp_processor_id());
+   return 0;
+}
+
+void hardlockup_start_all(void)
+{
+   int cpu;
+
+   cpumask_copy(&watchdog_allowed_mask, &watchdog_cpumask);
+   for_each_cpu(cpu, _allowed_mask)
+   smp_call_on_cpu(cpu, hardlockup_start_fn, NULL, false);
+}
+
 static void lockup_detector_reconfigure(void)
 {
cpus_read_lock();
-- 
2.17.1



[RFC PATCH v3 08/21] watchdog/hardlockup: Decouple the hardlockup detector from perf

2019-05-14 Thread Ricardo Neri
The current default implementation of the hardlockup detector assumes that
it is implemented using perf events. However, the hardlockup detector can
be driven by other sources of non-maskable interrupts (e.g., a properly
configured timer).

Group and wrap in #ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF all the code
specific to perf: create and manage perf events, stop and start the perf-
based detector.

The generic portion of the detector (monitoring the timers' thresholds,
checking timestamps and detecting hardlockups, as well as the
implementation of arch_touch_nmi_watchdog()) is now selected with the new
intermediate config symbol CONFIG_HARDLOCKUP_DETECTOR_CORE.

The perf-based implementation of the detector selects the new intermediate
symbol. Other implementations should do the same.

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: "Rafael J. Wysocki" 
Cc: Don Zickus 
Cc: Nicholas Piggin 
Cc: Michael Ellerman 
Cc: Frederic Weisbecker 
Cc: Alexei Starovoitov 
Cc: Babu Moger 
Cc: "David S. Miller" 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Mathieu Desnoyers 
Cc: Masami Hiramatsu 
Cc: Peter Zijlstra 
Cc: Andrew Morton 
Cc: Philippe Ombredanne 
Cc: Colin Ian King 
Cc: Byungchul Park 
Cc: "Paul E. McKenney" 
Cc: "Luis R. Rodriguez" 
Cc: Waiman Long 
Cc: Josh Poimboeuf 
Cc: Randy Dunlap 
Cc: Davidlohr Bueso 
Cc: Marc Zyngier 
Cc: Kai-Heng Feng 
Cc: Konrad Rzeszutek Wilk 
Cc: David Rientjes 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Signed-off-by: Ricardo Neri 
---
 include/linux/nmi.h   |  5 -
 kernel/Makefile   |  2 +-
 kernel/watchdog_hld.c | 32 
 lib/Kconfig.debug |  4 
 4 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 5a8b19749769..e5f1a86e20b7 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -94,8 +94,11 @@ static inline void hardlockup_detector_disable(void) {}
 # define NMI_WATCHDOG_SYSCTL_PERM  0444
 #endif
 
-#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_CORE)
 extern void arch_touch_nmi_watchdog(void);
+#endif
+
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
 extern void hardlockup_detector_perf_stop(void);
 extern void hardlockup_detector_perf_restart(void);
 extern void hardlockup_detector_perf_disable(void);
diff --git a/kernel/Makefile b/kernel/Makefile
index 62471e75a2b0..e9bdbaa1ed50 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -82,7 +82,7 @@ obj-$(CONFIG_FAIL_FUNCTION) += fail_function.o
 obj-$(CONFIG_KGDB) += debug/
 obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
 obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
-obj-$(CONFIG_HARDLOCKUP_DETECTOR_PERF) += watchdog_hld.o
+obj-$(CONFIG_HARDLOCKUP_DETECTOR_CORE) += watchdog_hld.o
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index b352e507b17f..bb6435978c46 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -22,12 +22,8 @@
 
 static DEFINE_PER_CPU(bool, hard_watchdog_warn);
 static DEFINE_PER_CPU(bool, watchdog_nmi_touch);
-static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
-static DEFINE_PER_CPU(struct perf_event *, dead_event);
-static struct cpumask dead_events_mask;
 
 static unsigned long hardlockup_allcpu_dumped;
-static atomic_t watchdog_cpus = ATOMIC_INIT(0);
 
 notrace void arch_touch_nmi_watchdog(void)
 {
@@ -98,14 +94,6 @@ static inline bool watchdog_check_timestamp(void)
 }
 #endif
 
-static struct perf_event_attr wd_hw_attr = {
-   .type   = PERF_TYPE_HARDWARE,
-   .config = PERF_COUNT_HW_CPU_CYCLES,
-   .size   = sizeof(struct perf_event_attr),
-   .pinned = 1,
-   .disabled   = 1,
-};
-
 void inspect_for_hardlockups(struct pt_regs *regs)
 {
if (__this_cpu_read(watchdog_nmi_touch) == true) {
@@ -157,6 +145,24 @@ void inspect_for_hardlockups(struct pt_regs *regs)
return;
 }
 
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF
+#undef pr_fmt
+#define pr_fmt(fmt) "NMI perf watchdog: " fmt
+
+static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
+static DEFINE_PER_CPU(struct perf_event *, dead_event);
+static struct cpumask dead_events_mask;
+
+static atomic_t watchdog_cpus = ATOMIC_INIT(0);
+
+static struct perf_event_attr wd_hw_attr = {
+   .type   = PERF_TYPE_HARDWARE,
+   .config = PERF_COUNT_HW_CPU_CYCLES,
+   .size   = sizeof(struct perf_event_attr),
+   .pinned = 1,
+   .disabled   = 1,
+};
+
 /* Callback function for perf event subsystem */
 static void watchdog_overflow_callback(struct perf_event *event,
   struct perf_sample_data *data,
@@ -298,3 +304,5 @@ int __init hardlockup_detector_perf_init(void)
}
return 

[RFC PATCH v3 09/21] x86/nmi: Add a NMI_WATCHDOG NMI handler category

2019-05-14 Thread Ricardo Neri
Add NMI_WATCHDOG as a new category of NMI handler. This new category
is to be used with the HPET-based hardlockup detector. This detector
does not have a direct way of checking if the HPET timer is the source of
the NMI. Instead, it indirectly estimates it using the time-stamp counter.

Therefore, we may have false positives if another NMI occurs within
the estimated time window. For this reason, we want the handler of the
detector to be called after all the NMI_LOCAL handlers. A simple way
of achieving this is with a new NMI handler category.

Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/nmi.h |  1 +
 arch/x86/kernel/nmi.c  | 10 ++
 2 files changed, 11 insertions(+)

diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
index 75ded1d13d98..75aa98313cde 100644
--- a/arch/x86/include/asm/nmi.h
+++ b/arch/x86/include/asm/nmi.h
@@ -29,6 +29,7 @@ enum {
NMI_UNKNOWN,
NMI_SERR,
NMI_IO_CHECK,
+   NMI_WATCHDOG,
NMI_MAX
 };
 
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 3755d0310026..a43213f0ab26 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -62,6 +62,10 @@ static struct nmi_desc nmi_desc[NMI_MAX] =
.lock = __RAW_SPIN_LOCK_UNLOCKED(&nmi_desc[3].lock),
.head = LIST_HEAD_INIT(nmi_desc[3].head),
},
+   {
+   .lock = __RAW_SPIN_LOCK_UNLOCKED(&nmi_desc[4].lock),
+   .head = LIST_HEAD_INIT(nmi_desc[4].head),
+   },
 
 };
 
@@ -172,6 +176,8 @@ int __register_nmi_handler(unsigned int type, struct nmiaction *action)
 */
WARN_ON_ONCE(type == NMI_SERR && !list_empty(&desc->head));
WARN_ON_ONCE(type == NMI_IO_CHECK && !list_empty(&desc->head));
+   WARN_ON_ONCE(type == NMI_WATCHDOG && !list_empty(&desc->head));
+
 
/*
 * some handlers need to be executed first otherwise a fake
@@ -382,6 +388,10 @@ static void default_do_nmi(struct pt_regs *regs)
}
raw_spin_unlock(&nmi_reason_lock);
 
+   handled = nmi_handle(NMI_WATCHDOG, regs);
+   if (handled == NMI_HANDLED)
+   return;
+
/*
 * Only one NMI can be latched at a time.  To handle
 * this we may process multiple nmi handlers at once to
-- 
2.17.1
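
For reference, a detector would hook the new category the same way existing
categories are used with register_nmi_handler(). The sketch below only
illustrates the intended call pattern; the handler body and the "hpet_hld"
name are assumptions (the series' actual registration code is not shown in
this digest), though is_hpet_wdt_interrupt() and inspect_for_hardlockups()
do appear elsewhere in the series.

static int hpet_wdt_nmi_handler(unsigned int type, struct pt_regs *regs)
{
        /* Runs after all NMI_LOCAL handlers, per the ordering above. */
        if (!is_hpet_wdt_interrupt(hld_data))
                return NMI_DONE;        /* not ours; let others see it */

        inspect_for_hardlockups(regs);
        return NMI_HANDLED;
}

register_nmi_handler(NMI_WATCHDOG, hpet_wdt_nmi_handler, 0, "hpet_hld");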



[RFC PATCH v3 12/21] watchdog/hardlockup/hpet: Adjust timer expiration on the number of monitored CPUs

2019-05-14 Thread Ricardo Neri
Each CPU should be monitored for hardlockups every watchdog_thresh seconds.
Since all the CPUs in the system are monitored by the same timer and the
timer interrupt is rotated among the monitored CPUs, the timer must expire
every watchdog_thresh/N seconds, where N is the number of monitored CPUs.
Use the new member of struct hpet_hld_data, ticks_per_cpu, to store the
aforementioned quantity.

The ticks-per-CPU quantity is updated every time the number of monitored
CPUs changes: when the watchdog is enabled or disabled for a specific CPU.
If the timer is used in periodic mode, it needs to be adjusted to reflect
the new expected expiration.

Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Borislav Petkov 
Cc: Jacob Pan 
Cc: "Rafael J. Wysocki" 
Cc: Don Zickus 
Cc: Nicholas Piggin 
Cc: Michael Ellerman 
Cc: Frederic Weisbecker 
Cc: Alexei Starovoitov 
Cc: Babu Moger 
Cc: Mathieu Desnoyers 
Cc: Masami Hiramatsu 
Cc: Peter Zijlstra 
Cc: Andrew Morton 
Cc: Philippe Ombredanne 
Cc: Colin Ian King 
Cc: Byungchul Park 
Cc: "Paul E. McKenney" 
Cc: "Luis R. Rodriguez" 
Cc: Waiman Long 
Cc: Josh Poimboeuf 
Cc: Randy Dunlap 
Cc: Davidlohr Bueso 
Cc: Marc Zyngier 
Cc: Kai-Heng Feng 
Cc: Konrad Rzeszutek Wilk 
Cc: David Rientjes 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hpet.h |  1 +
 arch/x86/kernel/watchdog_hld_hpet.c | 46 +++--
 2 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 31fc27508cf3..64acacce095d 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -114,6 +114,7 @@ struct hpet_hld_data {
bool    has_periodic;
u32 num;
u64 ticks_per_second;
+   u64 ticks_per_cpu;
u32 handling_cpu;
u32 enabled_cpus;
struct msi_msg  msi_msg;
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index c20b378b8c0c..9a3431a54616 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -44,6 +44,13 @@ static void kick_timer(struct hpet_hld_data *hdata, bool force)
 * are able to update the comparator before the counter reaches such new
 * value.
 *
+* Each CPU must be monitored every watch_thresh seconds. Since the
+* timer targets one CPU at a time, it must expire every
+*
+*    ticks_per_cpu = watch_thresh * ticks_per_second / enabled_cpus
+*
+* as computed in update_ticks_per_cpu().
+*
 * Let it wrap around if needed.
 */
 
@@ -51,10 +58,10 @@ static void kick_timer(struct hpet_hld_data *hdata, bool force)
return;
 
if (hdata->has_periodic)
-   period = watchdog_thresh * hdata->ticks_per_second;
+   period = watchdog_thresh * hdata->ticks_per_cpu;
 
count = hpet_readl(HPET_COUNTER);
-   new_compare = count + watchdog_thresh * hdata->ticks_per_second;
+   new_compare = count + watchdog_thresh * hdata->ticks_per_cpu;
hpet_set_comparator(hdata->num, (u32)new_compare, (u32)period);
 }
 
@@ -233,6 +240,27 @@ static int setup_hpet_irq(struct hpet_hld_data *hdata)
return ret;
 }
 
+/**
+ * update_ticks_per_cpu() - Update the number of HPET ticks per CPU
+ * @hdata: struct with the timer's ticks-per-second and CPU mask
+ *
+ * From the overall ticks-per-second of the timer, compute the number of ticks
+ * after which the timer should expire to monitor each CPU every watch_thresh
+ * seconds. The ticks-per-cpu quantity is computed using the number of CPUs 
that
+ * the watchdog currently monitors.
+ */
+static void update_ticks_per_cpu(struct hpet_hld_data *hdata)
+{
+   u64 temp = hdata->ticks_per_second;
+
+   /* Only update if there are monitored CPUs. */
+   if (!hdata->enabled_cpus)
+   return;
+
+   do_div(temp, hdata->enabled_cpus);
+   hdata->ticks_per_cpu = temp;
+}
+
 /**
  * hardlockup_detector_hpet_enable() - Enable the hardlockup detector
  * @cpu:   CPU Index in which the watchdog will be enabled.
@@ -245,13 +273,23 @@ void hardlockup_detector_hpet_enable(unsigned int cpu)
 {
cpumask_set_cpu(cpu, to_cpumask(hld_data->cpu_monitored_mask));
 
-   if (!hld_data->enabled_cpus++) {
+   hld_data->enabled_cpus++;
+   update_ticks_per_cpu(hld_data);
+
+   if (hld_data->enabled_cpus == 1) {
hld_data->handling_cpu = cpu;
update_msi_destid(hld_data);
/* Force timer kick when detector is just enabled */
kick_timer(hld_data, true);
enable_timer(hld_data);
}
+
+   /*
+* When in periodic mode, we only kick the timer here. Hence,
+* 

[RFC PATCH v3 03/21] x86/hpet: Calculate ticks-per-second in a separate function

2019-05-14 Thread Ricardo Neri
It is easier to compute the expiration times of an HPET timer by using
its frequency (i.e., the number of times it ticks in a second) than its
period, as given in the capabilities register.

In addition to the HPET char driver, the HPET-based hardlockup detector
will also need to know the timer's frequency. Thus, create a common
function that both can use.

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 drivers/char/hpet.c  | 31 ---
 include/linux/hpet.h |  1 +
 2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/drivers/char/hpet.c b/drivers/char/hpet.c
index d0ad85900b79..bdcbecfdb858 100644
--- a/drivers/char/hpet.c
+++ b/drivers/char/hpet.c
@@ -836,6 +836,29 @@ static unsigned long hpet_calibrate(struct hpets *hpetp)
return ret;
 }
 
+u64 hpet_get_ticks_per_sec(u64 hpet_caps)
+{
+   u64 ticks_per_sec, period;
+
+   period = (hpet_caps & HPET_COUNTER_CLK_PERIOD_MASK) >>
+HPET_COUNTER_CLK_PERIOD_SHIFT; /* fs, 10^-15 */
+
+   /*
+* The frequency is the reciprocal of the period. The period is given
+* femtoseconds per second. Thus, prepare a dividend to obtain the
+* frequency in ticks per second.
+*/
+
+   /* 10^15 femtoseconds per second */
+   ticks_per_sec = 1000000000000000uLL;
+   ticks_per_sec += period >> 1; /* round */
+
+   /* The quotient is put in the dividend. We drop the remainder. */
+   do_div(ticks_per_sec, period);
+
+   return ticks_per_sec;
+}
+
 int hpet_alloc(struct hpet_data *hdp)
 {
u64 cap, mcfg;
@@ -844,7 +867,6 @@ int hpet_alloc(struct hpet_data *hdp)
struct hpets *hpetp;
struct hpet __iomem *hpet;
static struct hpets *last;
-   unsigned long period;
unsigned long long temp;
u32 remainder;
 
@@ -894,12 +916,7 @@ int hpet_alloc(struct hpet_data *hdp)
 
last = hpetp;
 
-   period = (cap & HPET_COUNTER_CLK_PERIOD_MASK) >>
-   HPET_COUNTER_CLK_PERIOD_SHIFT; /* fs, 10^-15 */
-   temp = 1000000000000000uLL; /* 10^15 femtoseconds per second */
-   temp += period >> 1; /* round */
-   do_div(temp, period);
-   hpetp->hp_tick_freq = temp; /* ticks per second */
+   hpetp->hp_tick_freq = hpet_get_ticks_per_sec(cap);
 
printk(KERN_INFO "hpet%d: at MMIO 0x%lx, IRQ%s",
hpetp->hp_which, hdp->hd_phys_address,
diff --git a/include/linux/hpet.h b/include/linux/hpet.h
index 8604564b985d..e7b36bcf4699 100644
--- a/include/linux/hpet.h
+++ b/include/linux/hpet.h
@@ -107,5 +107,6 @@ static inline void hpet_reserve_timer(struct hpet_data *hd, int timer)
 }
 
 int hpet_alloc(struct hpet_data *);
+u64 hpet_get_ticks_per_sec(u64 hpet_caps);
 
 #endif /* !__HPET__ */
-- 
2.17.1



[RFC PATCH v3 18/21] x86/apic: Add a parameter for the APIC delivery mode

2019-05-14 Thread Ricardo Neri
Until now, the delivery mode of APIC interrupts has been set to the default
mode set in the APIC driver. However, there are no restrictions in hardware
to configure each interrupt with a different delivery mode. Specifying the
delivery mode per interrupt is useful when one is interested in changing
the delivery mode of a particular interrupt. For instance, this can be used
to deliver an interrupt as non-maskable.

Add a new member, delivery_mode, to struct irq_cfg. This new member can
be used to update the configuration of the delivery mode in each interrupt
domain. Likewise, add equivalent macros to populate MSI messages.

Currently, all interrupt domains set the delivery mode of interrupts using
the APIC setting. Interrupt domains use an irq_cfg data structure to
configure their own data structures and hardware resources. Thus, in order
to keep the current behavior, set the delivery mode of the irq
configuration to the APIC setting. In this manner, irq domains can
obtain the delivery mode from the irq configuration data instead of the
APIC setting, if needed.

Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Borislav Petkov 
Cc: Jacob Pan 
Cc: Joerg Roedel 
Cc: Juergen Gross 
Cc: Bjorn Helgaas 
Cc: Wincy Van 
Cc: Kate Stewart 
Cc: Philippe Ombredanne 
Cc: "Eric W. Biederman" 
Cc: Baoquan He 
Cc: Dou Liyang 
Cc: Jan Kiszka 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hw_irq.h |  5 +++--
 arch/x86/include/asm/msidef.h |  3 +++
 arch/x86/kernel/apic/vector.c | 10 ++
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 32e666e1231e..c024e5976b78 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -117,8 +117,9 @@ struct irq_alloc_info {
 };
 
 struct irq_cfg {
-   unsigned intdest_apicid;
-   unsigned intvector;
+   unsigned intdest_apicid;
+   unsigned intvector;
+   enum ioapic_irq_destination_types   delivery_mode;
 };
 
 extern struct irq_cfg *irq_cfg(unsigned int irq);
diff --git a/arch/x86/include/asm/msidef.h b/arch/x86/include/asm/msidef.h
index 38ccfdc2d96e..6d666c90f057 100644
--- a/arch/x86/include/asm/msidef.h
+++ b/arch/x86/include/asm/msidef.h
@@ -16,6 +16,9 @@
 MSI_DATA_VECTOR_MASK)
 
 #define MSI_DATA_DELIVERY_MODE_SHIFT   8
+#define MSI_DATA_DELIVERY_MODE_MASK    0x0700
+#define MSI_DATA_DELIVERY_MODE(dm)     (((dm) << MSI_DATA_DELIVERY_MODE_SHIFT) & \
+                                        MSI_DATA_DELIVERY_MODE_MASK)
 #define  MSI_DATA_DELIVERY_FIXED   (0 << MSI_DATA_DELIVERY_MODE_SHIFT)
 #define  MSI_DATA_DELIVERY_LOWPRI  (1 << MSI_DATA_DELIVERY_MODE_SHIFT)
 #define  MSI_DATA_DELIVERY_NMI (4 << MSI_DATA_DELIVERY_MODE_SHIFT)
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 3173e07d3791..99436fe7e932 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -548,6 +548,16 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
irqd->chip_data = apicd;
irqd->hwirq = virq + i;
irqd_set_single_target(irqd);
+
+   /*
+* Initialize the delivery mode of this irq to match the
+* default delivery mode of the APIC. This is useful for
+* children irq domains which want to take the delivery
+* mode from the individual irq configuration rather
+* than from the APIC.
+*/
+apicd->hw_irq_cfg.delivery_mode = apic->irq_delivery_mode;
+
/*
 * Legacy vectors are already assigned when the IOAPIC
 * takes them over. They stay on the same vector. This is
-- 
2.17.1



[RFC PATCH v3 13/21] x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI

2019-05-14 Thread Ricardo Neri
The only direct method to determine whether an HPET timer caused an
interrupt is to read the Interrupt Status register. Unfortunately,
reading HPET registers is slow and, therefore, it is not recommended to
read them while in NMI context. Furthermore, status is not available if
the interrupt is generated via the Front Side Bus.

An indirect way to infer whether the non-maskable interrupt we see was
caused by the HPET timer is to use the time-stamp counter. Compute the
value that the time-stamp counter should have at the next interrupt of the
HPET timer. Since the hardlockup detector operates in seconds, high
precision is not needed. This implementation considers that the HPET
caused the NMI if the time-stamp counter reads the expected value +/- 1.5%.
This value is selected as it is equivalent to 1/64, so the division can be
performed using a bit shift operation. Experimentally, the error in the
estimation is consistently less than 1%.

The computation of the expected value of the time-stamp counter must be
performed in relation to watchdog_thresh divided by the number of
monitored CPUs. This quantity is stored in tsc_ticks_per_cpu and must be
updated whenever the number of monitored CPUs changes.
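
As a worked example with assumed numbers: for watchdog_thresh = 10 seconds
and tsc_ticks_per_cpu = 2 * 10^9 (a 2 GHz TSC with a single monitored CPU),
the expected delta to the next expiration is 2 * 10^10 TSC ticks, and the
allowed error is (2 * 10^10) >> 6, roughly 3.1 * 10^8 ticks, or about 1.56%
of the interval, which matches the ~1.5% bound described above.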

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Peter Zijlstra 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Mimi Zohar 
Cc: Jan Kiszka 
Cc: Nick Desaulniers 
Cc: Masahiro Yamada 
Cc: Nayna Jain 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Suggested-by: Andi Kleen 
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hpet.h |  2 ++
 arch/x86/kernel/watchdog_hld_hpet.c | 27 ++-
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 64acacce095d..fd99f2390714 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -115,6 +115,8 @@ struct hpet_hld_data {
u32 num;
u64 ticks_per_second;
u64 ticks_per_cpu;
+   u64 tsc_next;
+   u64 tsc_ticks_per_cpu;
u32 handling_cpu;
u32 enabled_cpus;
struct msi_msg  msi_msg;
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index 9a3431a54616..6f1f540cfee9 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -23,6 +23,7 @@
 
 static struct hpet_hld_data *hld_data;
 static bool hardlockup_use_hpet;
+static u64 tsc_next_error;
 
 /**
  * kick_timer() - Reprogram timer to expire in the future
@@ -32,11 +33,22 @@ static bool hardlockup_use_hpet;
  * Reprogram the timer to expire within watchdog_thresh seconds in the future.
  * If the timer supports periodic mode, it is not kicked unless @force is
  * true.
+ *
+ * Also, compute the expected value of the time-stamp counter at the time of
+ * expiration as well as a deviation from the expected value. The maximum
+ * deviation is of ~1.5%. This deviation can be easily computed by shifting
+ * by 6 positions the delta between the current and expected time-stamp values.
  */
 static void kick_timer(struct hpet_hld_data *hdata, bool force)
 {
+   u64 tsc_curr, tsc_delta, new_compare, count, period = 0;
bool kick_needed = force || !(hdata->has_periodic);
-   u64 new_compare, count, period = 0;
+
+   tsc_curr = rdtsc();
+
+   tsc_delta = (unsigned long)watchdog_thresh * hdata->tsc_ticks_per_cpu;
+   hdata->tsc_next = tsc_curr + tsc_delta;
+   tsc_next_error = tsc_delta >> 6;
 
/*
 * Update the comparator in increments of watch_thresh seconds relative
@@ -92,6 +104,15 @@ static void enable_timer(struct hpet_hld_data *hdata)
  */
 static bool is_hpet_wdt_interrupt(struct hpet_hld_data *hdata)
 {
+   if (smp_processor_id() == hdata->handling_cpu) {
+   u64 tsc_curr;
+
+   tsc_curr = rdtsc();
+
+   return (tsc_curr - hdata->tsc_next) + tsc_next_error <
+  2 * tsc_next_error;
+   }
+
return false;
 }
 
@@ -259,6 +280,10 @@ static void update_ticks_per_cpu(struct hpet_hld_data *hdata)
 
do_div(temp, hdata->enabled_cpus);
hdata->ticks_per_cpu = temp;
+
+   temp = (unsigned long)tsc_khz * 1000L;
+   do_div(temp, hdata->enabled_cpus);
+   hdata->tsc_ticks_per_cpu = temp;
 }
 
 /**
-- 
2.17.1



[RFC PATCH v3 14/21] watchdog/hardlockup: Use parse_option_str() to handle "nmi_watchdog"

2019-05-14 Thread Ricardo Neri
Prepare hardlockup_panic_setup() to handle a comma-separated list of
options. This is needed to pass options to specific implementations of the
hardlockup detector.
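For reference, parse_option_str() (lib/cmdline.c) matches whole tokens in a
comma-separated list; an illustrative sketch of the semantics this change
relies on:

	parse_option_str("panic,hpet", "panic");	/* -> true  */
	parse_option_str("panic,hpet", "hpet");		/* -> true  */
	parse_option_str("nopanic", "panic");		/* -> false: whole-token match */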

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Peter Zijlstra 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Mimi Zohar 
Cc: Jan Kiszka 
Cc: Nick Desaulniers 
Cc: Masahiro Yamada 
Cc: Nayna Jain 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 kernel/watchdog.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index be589001200a..fd50049449ec 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -70,13 +70,13 @@ void __init hardlockup_detector_disable(void)
 
 static int __init hardlockup_panic_setup(char *str)
 {
-   if (!strncmp(str, "panic", 5))
+   if (parse_option_str(str, "panic"))
hardlockup_panic = 1;
-   else if (!strncmp(str, "nopanic", 7))
+   else if (parse_option_str(str, "nopanic"))
hardlockup_panic = 0;
-   else if (!strncmp(str, "0", 1))
+   else if (parse_option_str(str, "0"))
nmi_watchdog_user_enabled = 0;
-   else if (!strncmp(str, "1", 1))
+   else if (parse_option_str(str, "1"))
nmi_watchdog_user_enabled = 1;
return 1;
 }
-- 
2.17.1



[RFC PATCH v3 16/21] x86/watchdog: Add a shim hardlockup detector

2019-05-14 Thread Ricardo Neri
The generic hardlockup detector is based on perf. It also provides a set
of weak stubs that CPU architectures can override. Add a shim hardlockup
detector for x86 that selects between perf and hpet implementations.

Specifically, this shim implementation is needed for the HPET-based
hardlockup detector; it can also be used for future implementations.
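For context, the generic detector provides default implementations marked
__weak, with signatures along these lines (paraphrased from
kernel/watchdog.c, not part of this patch):

	int  __weak watchdog_nmi_enable(unsigned int cpu);
	void __weak watchdog_nmi_disable(unsigned int cpu);
	int  __weak watchdog_nmi_probe(void);
	void __weak watchdog_nmi_stop(void);

The shim below provides strong definitions of these hooks that dispatch to
either the perf or the HPET implementation.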

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Peter Zijlstra 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Mimi Zohar 
Cc: Jan Kiszka 
Cc: Nick Desaulniers 
Cc: Masahiro Yamada 
Cc: Nayna Jain 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Suggested-by: Nicholas Piggin 
Signed-off-by: Ricardo Neri 
---
 arch/x86/Kconfig.debug |  4 ++
 arch/x86/kernel/Makefile   |  1 +
 arch/x86/kernel/watchdog_hld.c | 78 ++
 3 files changed, 83 insertions(+)
 create mode 100644 arch/x86/kernel/watchdog_hld.c

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 376a5db81aec..0d9e11eb070c 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -169,11 +169,15 @@ config IOMMU_LEAK
 config HAVE_MMIOTRACE_SUPPORT
def_bool y
 
+config X86_HARDLOCKUP_DETECTOR
+   bool
+
 config X86_HARDLOCKUP_DETECTOR_HPET
bool "Use HPET Timer for Hard Lockup Detection"
select SOFTLOCKUP_DETECTOR
select HARDLOCKUP_DETECTOR
select HARDLOCKUP_DETECTOR_CORE
+   select X86_HARDLOCKUP_DETECTOR
depends on HPET_TIMER && HPET && X86_64
help
 Say y to enable a hardlockup detector that is driven by a High-
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index f9222769d84b..f89a259931f7 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -106,6 +106,7 @@ obj-$(CONFIG_VM86)  += vm86_32.o
 obj-$(CONFIG_EARLY_PRINTK) += early_printk.o
 
 obj-$(CONFIG_HPET_TIMER)   += hpet.o
+obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR) += watchdog_hld.o
 obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET) += watchdog_hld_hpet.o
 obj-$(CONFIG_APB_TIMER)+= apb_timer.o
 
diff --git a/arch/x86/kernel/watchdog_hld.c b/arch/x86/kernel/watchdog_hld.c
new file mode 100644
index ..c2512d4c79c5
--- /dev/null
+++ b/arch/x86/kernel/watchdog_hld.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A shim hardlockup detector. It overrides the weak stubs of the generic
+ * implementation to select between the perf- or the hpet-based implementation.
+ *
+ * Copyright (C) Intel Corporation 2019
+ */
+
+#include 
+#include 
+
+enum x86_hardlockup_detector {
+   X86_HARDLOCKUP_DETECTOR_PERF,
+   X86_HARDLOCKUP_DETECTOR_HPET,
+};
+
+static enum x86_hardlockup_detector detector_type __read_mostly;
+
+int watchdog_nmi_enable(unsigned int cpu)
+{
+   if (detector_type == X86_HARDLOCKUP_DETECTOR_PERF) {
+   hardlockup_detector_perf_enable();
+   return 0;
+   }
+
+   if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET) {
+   hardlockup_detector_hpet_enable(cpu);
+   return 0;
+   }
+
+   return -ENODEV;
+}
+
+void watchdog_nmi_disable(unsigned int cpu)
+{
+   if (detector_type == X86_HARDLOCKUP_DETECTOR_PERF) {
+   hardlockup_detector_perf_disable();
+   return;
+   }
+
+   if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET) {
+   hardlockup_detector_hpet_disable(cpu);
+   return;
+   }
+}
+
+int __init watchdog_nmi_probe(void)
+{
+   int ret;
+
+   /*
+* Try first with the HPET hardlockup detector. It will only
+* succeed if selected at build time and the nmi_watchdog
+* command-line parameter is configured. This ensures that the
+* perf-based detector is used by default, if selected at
+* build time.
+*/
+   ret = hardlockup_detector_hpet_init();
+   if (!ret) {
+   detector_type = X86_HARDLOCKUP_DETECTOR_HPET;
+   return ret;
+   }
+
+   ret = hardlockup_detector_perf_init();
+   if (!ret) {
+   detector_type = X86_HARDLOCKUP_DETECTOR_PERF;
+   return ret;
+   }
+
+   return ret;
+}
+
+void watchdog_nmi_stop(void)
+{
+   /* Only the HPET lockup detector defines a stop function. */
+   if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET)
+   hardlockup_detector_hpet_stop();
+}
-- 
2.17.1



[RFC PATCH v3 00/21] Implement an HPET-based hardlockup detector

2019-05-14 Thread Ricardo Neri
Hi,

This is the third attempt to demonstrate the implementation of a
hardlockup detector driven by the High-Precision Event Timer. The two
initial implementations can be found here [1] and here [2].

== Introduction ==

In CPU architectures that do not have an NMI watchdog, one can be
constructed using a counter of the Performance Monitoring Unit (PMU).
Counters in the PMU have high granularity and high visibility of the CPU.
These capabilities and their limited number make these counters precious
resources. Unfortunately, the perf-based hardlockup detector permanently
consumes one of these counters per CPU.

These counters could be freed for profiling purposes if the hardlockup
detector were driven by another timer.

The hardlockup detector runs relatively infrequently and does not require
visibility of the CPU activity (beyond detecting locked-up CPUs). A
timer that is external to the CPU (e.g., in the chipset) can be used to
drive the detector.

A key requirement is that the timer needs to be capable of issuing a
non-maskable interrupt to the CPU. In most cases, this can be achieved
by tweaking the delivery mode of the interrupt. It is especially
straightforward for MSI interrupts.

== Details of this implementation ==

This implementation uses an HPET timer to deliver an NMI interrupt via
an MSI message.

Unlike the perf-based hardlockup detector, this implementation is
driven by a single timer. The timer targets one CPU at a time in a round-
robin manner. This means that if each CPU must be monitored every
watchdog_thresh seconds, in a system with N monitored CPUs the timer must
expire every watchdog_thresh/N seconds. A per-CPU timer expiration attribute
is maintained.
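As an illustration (hypothetical numbers): with watchdog_thresh = 10 seconds
and N = 8 monitored CPUs, the timer must expire every 10/8 = 1.25 seconds,
while each individual CPU is still checked once every 10 seconds.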

The timer expiration time per CPU is updated every time CPUs are put
online or offline (a CPU hotplug thread enables and disables the watchdog
in these events) or the user changes the file
/proc/sys/kernel/watchdog_cpumask.

Also, given that a single timer drives the detector, a cpumask is needed
to keep track of which online CPUs are allowed to be monitored. This mask
is also updated every time a CPU is put online or offline or when the user
modifies the mask in /proc/sys/kernel/watchdog_cpumask. This mask
is needed to keep the current behavior of the lockup detector.

In order to avoid reading HPET registers in every NMI, the time-stamp
counter is used to determine whether the HPET caused the interrupt. At
every timer expiration, we compute the value the time-stamp counter is
expected to have the next time the timer expires. I have found
experimentally that the expected TSC value consistently has an error of
less than 1.5%.
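In pseudo-code, the computation looks roughly like this (a sketch; the names
follow the patches in this series):

	tsc_delta      = watchdog_thresh * tsc_khz * 1000 / nr_monitored_cpus;
	tsc_next       = rdtsc() + tsc_delta;
	tsc_next_error = tsc_delta >> 6;	/* ~1.5% tolerance */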

Furthermore, only one write to HPET registers is done every
watchdog_thresh seconds. This write can be eliminated if the HPET timer
is periodic.

== Parts of this series ==

For clarity, patches are grouped as follows:

 1) New irq definition. Patch 1 adds a definition for NMI delivery mode
in MSI interrupts. No other changes are done to generic irq code.

 2) HPET updates. Patches 2-7 prepare the HPET code to accommodate the
new detector: rework periodic programming, reserve and configure a
timer for the detector and expose a few existing functions.

 3) NMI watchdog. Patches 8-11 update the existing hardlockup detector
to uncouple it from perf, switch back to the perf implementation if
TSC becomes unstable, and introduce a new NMI handler category
intended to run after the NMI_LOCAL handlers.

 4) New HPET-based hardlockup detector. Patches 12-17 include changes to
probe the hardware resources, configure the interrupt and rotate the
destination of the interrupts among all monitored CPUs. Also, it
includes an x86-specific shim hardlockup detector that selects
between HPET and perf implementations.

 5) Interrupt remapping. Patches 18-21 add support to operate this new
detector with interrupt remapping enabled.

Thanks and BR,
Ricardo

Changes since v2:
 * Added functionality to switch to the perf-based hardlockup
   detector if the TSC becomes unstable (Thomas Gleixner).
 * Brought back the round-robin mechanism proposed in v1 (this time not
   using the interrupt subsystem). This also requires to compute
   expiration times as in v1 (Andi Kleen, Stephane Eranian).
 * Fixed a bug in which using a periodic timer was not working (thanks
   to Suravee Suthikulpanit!).
 * In this version, I incorporate support for interrupt remapping in the
   last 4 patches so that they can be reviewed separately if needed.
 * Removed redundant documentation of functions (Thomas Gleixner).
 * Added a new category of NMI handler, NMI_WATCHDOG, which executes after
   NMI_LOCAL handlers (Andi Kleen).
 * Updated handling of "nmi_watchdog" to support comma-separated
   arguments.
 * Undid split of the generic hardlockup detector into a separate file
   (Thomas Gleixner).
 * Added a new intermediate symbol CONFIG_HARDLOCKUP_DETECTOR_CORE to
   select generic parts of the detector (Paul E. 

[RFC PATCH v3 21/21] x86/watchdog/hardlockup/hpet: Support interrupt remapping

2019-05-14 Thread Ricardo Neri
When interrupt remapping is enabled in the system, the MSI interrupt
message must follow a special format the IOMMU can understand. Hence,
utilize the functionality provided by the IOMMU driver for this purpose.

The first step is to determine whether interrupt remapping is enabled
by looking for the existence of an interrupt remapping domain. If it
exists, let the IOMMU driver compose the MSI message for us. The hard-
lockup detector is still responsible for writing the message into the
HPET FSB route register.

Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Borislav Petkov 
Cc: Jacob Pan 
Cc: Joerg Roedel 
Cc: Juergen Gross 
Cc: Bjorn Helgaas 
Cc: Wincy Van 
Cc: Kate Stewart 
Cc: Philippe Ombredanne 
Cc: "Eric W. Biederman" 
Cc: Baoquan He 
Cc: Dou Liyang 
Cc: Jan Kiszka 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/kernel/watchdog_hld_hpet.c | 33 -
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/watchdog_hld_hpet.c 
b/arch/x86/kernel/watchdog_hld_hpet.c
index 90680a8cf9fc..2d59b8f0390e 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 static struct hpet_hld_data *hld_data;
@@ -116,6 +117,25 @@ static bool is_hpet_wdt_interrupt(struct hpet_hld_data 
*hdata)
return false;
 }
 
+/**
+ * irq_remapping_enabled() - Detect if interrupt remapping is enabled
+ * @hdata: A data structure with the HPET block id
+ *
+ * Determine whether the HPET block used by the hardlockup detector is
+ * under a remapped interrupt domain.
+ *
+ * Returns: True if interrupt remapping is enabled. False otherwise.
+ */
+static bool irq_remapping_enabled(struct hpet_hld_data *hdata)
+{
+   struct irq_alloc_info info;
+
+   init_irq_alloc_info(&info, NULL);
+   info.type = X86_IRQ_ALLOC_TYPE_HPET;
+   info.hpet_id = hdata->blockid;
+
+   return !!irq_remapping_get_ir_irq_domain(&info);
+}
+
 /**
  * compose_msi_msg() - Populate address and data fields of an MSI message
 * @hdata: A data structure with the message to populate
@@ -160,6 +180,9 @@ static int update_msi_destid(struct hpet_hld_data *hdata)
 {
u32 destid;
 
+   if (irq_remapping_enabled(hdata))
+   return hld_hpet_intremap_activate_irq(hdata);
+
hdata->msi_msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
destid = apic->calc_dest_apicid(hdata->handling_cpu);
hdata->msi_msg.address_lo |= MSI_ADDR_DEST_ID(destid);
@@ -216,9 +239,17 @@ static int hardlockup_detector_nmi_handler(unsigned int 
type,
  */
 static int setup_irq_msi_mode(struct hpet_hld_data *hdata)
 {
+   s32 ret;
u32 v;
 
-   compose_msi_msg(hdata);
+   if (irq_remapping_enabled(hdata)) {
+   ret = hld_hpet_intremap_alloc_irq(hdata);
+   if (ret)
+   return ret;
+   } else {
+   compose_msi_msg(hdata);
+   }
+
hpet_writel(hdata->msi_msg.data, HPET_Tn_ROUTE(hdata->num));
hpet_writel(hdata->msi_msg.address_lo, HPET_Tn_ROUTE(hdata->num) + 4);
 
-- 
2.17.1



[RFC PATCH v3 20/21] iommu/vt-d: hpet: Reserve an interrupt remapping table entry for watchdog

2019-05-14 Thread Ricardo Neri
When interrupt remapping is enabled, MSI interrupt messages must follow a
special format that the IOMMU can understand. Hence, when the HPET hard
lockup detector is used with interrupt remapping, it must also follow this
special format.

The IOMMU, given the information about a particular interrupt, already
knows how to populate the MSI message with this special format and the
corresponding entry in the interrupt remapping table. Given that this is a
special interrupt case, we want to avoid the interrupt subsystem. Add two
functions to create an entry for the HPET hard lockup detector. Perform
this process in two steps as described below.

When initializing the lockup detector, the function
hld_hpet_intremap_alloc_irq() permanently allocates a new entry in the
interrupt remapping table and populates it with the information the
IOMMU driver needs. In order to populate the table, the IOMMU needs to
know the HPET block ID as described in the ACPI table. Hence, add such
ID to the data of the hardlockup detector.

When the hardlockup detector is enabled, the function
hld_hpet_intremap_activate_irq() activates the recently created entry
in the interrupt remapping table via the modify_irte() functions. While
doing this, it specifies which CPU the interrupt must target via its APIC
ID. This function can be called every time the destination ID of the
interrupt needs to be updated; there is no need to allocate or remove
entries in the interrupt remapping table.
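A sketch of the intended call flow in the detector (names from this patch;
error handling elided):

	/* Once, when the detector is initialized: */
	if (hld_hpet_intremap_alloc_irq(hdata))
		return -ENODEV;

	/* Whenever the interrupt must target a different CPU: */
	hld_hpet_intremap_activate_irq(hdata);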

Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Borislav Petkov 
Cc: Jacob Pan 
Cc: Joerg Roedel 
Cc: Juergen Gross 
Cc: Bjorn Helgaas 
Cc: Wincy Van 
Cc: Kate Stewart 
Cc: Philippe Ombredanne 
Cc: "Eric W. Biederman" 
Cc: Baoquan He 
Cc: Dou Liyang 
Cc: Jan Kiszka 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hpet.h | 11 +++
 arch/x86/kernel/hpet.c  |  1 +
 drivers/iommu/intel_irq_remapping.c | 49 +
 3 files changed, 61 insertions(+)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index a82cbe17479d..811051fa7ade 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -119,6 +119,8 @@ struct hpet_hld_data {
u64 tsc_ticks_per_cpu;
u32 handling_cpu;
u32 enabled_cpus;
+   u8  blockid;
+   void*intremap_data;
struct msi_msg  msi_msg;
unsigned long   cpu_monitored_mask[0];
 };
@@ -129,6 +131,15 @@ extern void hardlockup_detector_hpet_stop(void);
 extern void hardlockup_detector_hpet_enable(unsigned int cpu);
 extern void hardlockup_detector_hpet_disable(unsigned int cpu);
 extern void hardlockup_detector_switch_to_perf(void);
+#ifdef CONFIG_IRQ_REMAP
+extern int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata);
+extern int hld_hpet_intremap_alloc_irq(struct hpet_hld_data *hdata);
+#else
+static inline int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata)
+{ return -ENODEV; }
+static inline int hld_hpet_intremap_alloc_irq(struct hpet_hld_data *hdata)
+{ return -ENODEV; }
+#endif /* CONFIG_IRQ_REMAP */
 #else
 static inline struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
 { return NULL; }
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 44459b36d333..d911a357e98f 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -191,6 +191,7 @@ struct hpet_hld_data 
*hpet_hardlockup_detector_assign_timer(void)
 
hdata->num = HPET_WD_TIMER_NR;
hdata->ticks_per_second = hpet_get_ticks_per_sec(hpet_readq(HPET_ID));
+   hdata->blockid = hpet_blockid;
 
return hdata;
 }
diff --git a/drivers/iommu/intel_irq_remapping.c 
b/drivers/iommu/intel_irq_remapping.c
index 4ebf3af76589..bfa58ef5e85c 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "irq_remapping.h"
 
@@ -1517,3 +1518,51 @@ int dmar_ir_hotplug(struct dmar_drhd_unit *dmaru, bool 
insert)
 
return ret;
 }
+
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata)
+{
+   u32 destid = apic->calc_dest_apicid(hdata->handling_cpu);
+   struct intel_ir_data *data;
+
+   data = (struct intel_ir_data *)hdata->intremap_data;
+   data->irte_entry.dest_id = IRTE_DEST(destid);
+   return modify_irte(&data->irq_2_iommu, &data->irte_entry);
+}
+
+int hld_hpet_intremap_alloc_irq(struct hpet_hld_data *hdata)
+{
+   struct intel_ir_data *data;
+   struct irq_alloc_info info;
+   struct intel_iommu *iommu;
+   struct irq_cfg irq_cfg;
+   int index;
+
+   iommu = map_hpet_to_ir(hdata->blockid);
+   if (!iommu)
+   return -ENODEV;
+
+   data = kzalloc(sizeof(*data), GFP_KERNEL);
+   

[RFC PATCH v3 11/21] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector

2019-05-14 Thread Ricardo Neri
This is the initial implementation of a hardlockup detector driven by an
HPET timer. This initial implementation includes functions to control the
timer via its registers. It also requests such timer, installs an NMI
interrupt handler and performs the initial configuration of the timer.

The detector is not functional at this stage. A subsequent changeset will
invoke the interfaces provided by this detector as well as add the
functionality to determine whether the HPET timer caused the NMI.

In order to detect hardlockups in all the monitored CPUs, move the
interrupt to the next monitored CPU while handling the NMI interrupt; wrap
around when reaching the highest CPU in the mask. This rotation is
achieved by setting the affinity mask to only contain the next CPU to
monitor. A cpumask keeps track of all the CPUs that need to be monitored.
Such cpumask is updated when the watchdog is enabled or disabled in a
particular CPU.

This detector relies on an HPET timer that is capable of using Front Side
Bus interrupts. In order to avoid using the generic interrupt code,
program the MSI message register of the HPET timer directly.

HPET registers are only accessed to kick the timer after looking for
hardlockups. This happens every watchdog_thresh seconds. A subsequent
changeset will determine whether the HPET timer caused the interrupt based
on the value of the time-stamp counter. For now, just add a stub function.
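The rotation described above can be sketched as follows (illustrative only;
'mask' stands for the cpumask of monitored CPUs):

	unsigned int next = cpumask_next(hdata->handling_cpu, mask);

	if (next >= nr_cpu_ids)		/* wrap around to the first CPU */
		next = cpumask_first(mask);
	hdata->handling_cpu = next;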

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Peter Zijlstra 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: "Ravi V. Shankar" 
Cc: Mimi Zohar 
Cc: Jan Kiszka 
Cc: Nick Desaulniers 
Cc: Masahiro Yamada 
Cc: Nayna Jain 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/Kconfig.debug  |  11 +
 arch/x86/include/asm/hpet.h |  13 ++
 arch/x86/kernel/Makefile|   1 +
 arch/x86/kernel/hpet.c  |   3 +-
 arch/x86/kernel/watchdog_hld_hpet.c | 334 
 5 files changed, 361 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/watchdog_hld_hpet.c

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 15d0fbe27872..376a5db81aec 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -169,6 +169,17 @@ config IOMMU_LEAK
 config HAVE_MMIOTRACE_SUPPORT
def_bool y
 
+config X86_HARDLOCKUP_DETECTOR_HPET
+   bool "Use HPET Timer for Hard Lockup Detection"
+   select SOFTLOCKUP_DETECTOR
+   select HARDLOCKUP_DETECTOR
+   select HARDLOCKUP_DETECTOR_CORE
+   depends on HPET_TIMER && HPET && X86_64
+   help
+ Say y to enable a hardlockup detector that is driven by a High-
+ Precision Event Timer. This option avoids consuming counters
+ from the Performance Monitoring Unit to drive the detector.
+
 config X86_DECODER_SELFTEST
bool "x86 instruction decoder selftest"
depends on DEBUG_KERNEL && KPROBES
diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 20abdaa5372d..31fc27508cf3 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -114,12 +114,25 @@ struct hpet_hld_data {
boolhas_periodic;
u32 num;
u64 ticks_per_second;
+   u32 handling_cpu;
+   u32 enabled_cpus;
+   struct msi_msg  msi_msg;
+   unsigned long   cpu_monitored_mask[0];
 };
 
 extern struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void);
+extern int hardlockup_detector_hpet_init(void);
+extern void hardlockup_detector_hpet_stop(void);
+extern void hardlockup_detector_hpet_enable(unsigned int cpu);
+extern void hardlockup_detector_hpet_disable(unsigned int cpu);
 #else
 static inline struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
 { return NULL; }
+static inline int hardlockup_detector_hpet_init(void)
+{ return -ENODEV; }
+static inline void hardlockup_detector_hpet_stop(void) {}
+static inline void hardlockup_detector_hpet_enable(unsigned int cpu) {}
+static inline void hardlockup_detector_hpet_disable(unsigned int cpu) {}
 #endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
 
 #else /* CONFIG_HPET_TIMER */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 62e78a3fd31e..f9222769d84b 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -106,6 +106,7 @@ obj-$(CONFIG_VM86)  += vm86_32.o
 obj-$(CONFIG_EARLY_PRINTK) += early_printk.o
 
 obj-$(CONFIG_HPET_TIMER)   += hpet.o
+obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET) += watchdog_hld_hpet.o
 obj-$(CONFIG_APB_TIMER)+= apb_timer.o
 
 obj-$(CONFIG_AMD_NB)   += amd_nb.o
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 20a16a304f89..44459b36d333 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -181,7 +181,8 @@ struct 

[RFC PATCH v3 17/21] x86/tsc: Switch to perf-based hardlockup detector if TSC becomes unstable

2019-05-14 Thread Ricardo Neri
The HPET-based hardlockup detector relies on the TSC to determine whether an
observed NMI interrupt originated from the HPET timer. Hence, this detector
can no longer be used with an unstable TSC.

In such a case, permanently stop the HPET-based hardlockup detector and
start the perf-based detector.

Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hpet.h| 2 ++
 arch/x86/kernel/tsc.c  | 2 ++
 arch/x86/kernel/watchdog_hld.c | 7 +++
 3 files changed, 11 insertions(+)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index fd99f2390714..a82cbe17479d 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -128,6 +128,7 @@ extern int hardlockup_detector_hpet_init(void);
 extern void hardlockup_detector_hpet_stop(void);
 extern void hardlockup_detector_hpet_enable(unsigned int cpu);
 extern void hardlockup_detector_hpet_disable(unsigned int cpu);
+extern void hardlockup_detector_switch_to_perf(void);
 #else
 static inline struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
 { return NULL; }
@@ -136,6 +137,7 @@ static inline int hardlockup_detector_hpet_init(void)
 static inline void hardlockup_detector_hpet_stop(void) {}
 static inline void hardlockup_detector_hpet_enable(unsigned int cpu) {}
 static inline void hardlockup_detector_hpet_disable(unsigned int cpu) {}
+static inline void hardlockup_detector_switch_to_perf(void) {}
 #endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
 
 #else /* CONFIG_HPET_TIMER */
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 8f47c4862c56..5e4b6d219bec 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1157,6 +1157,8 @@ void mark_tsc_unstable(char *reason)
 
clocksource_mark_unstable(&clocksource_tsc_early);
clocksource_mark_unstable(&clocksource_tsc);
+
+   hardlockup_detector_switch_to_perf();
 }
 
 EXPORT_SYMBOL_GPL(mark_tsc_unstable);
diff --git a/arch/x86/kernel/watchdog_hld.c b/arch/x86/kernel/watchdog_hld.c
index c2512d4c79c5..c8547c227a41 100644
--- a/arch/x86/kernel/watchdog_hld.c
+++ b/arch/x86/kernel/watchdog_hld.c
@@ -76,3 +76,10 @@ void watchdog_nmi_stop(void)
if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET)
hardlockup_detector_hpet_stop();
 }
+
+void hardlockup_detector_switch_to_perf(void)
+{
+   detector_type = X86_HARDLOCKUP_DETECTOR_PERF;
+   hardlockup_detector_hpet_stop();
+   hardlockup_start_all();
+}
-- 
2.17.1



[RFC PATCH v3 06/21] x86/hpet: Configure the timer used by the hardlockup detector

2019-05-14 Thread Ricardo Neri
Implement the initial configuration of the timer to be used by the
hardlockup detector. Return a data structure with a description of the
timer; this information is subsequently used by the hardlockup detector.

Only provide the timer if it supports Front Side Bus interrupt delivery.
This condition greatly simplifies the implementation of the detector.
Specifically, it helps to avoid the complexities of routing the interrupt
via the IO-APIC (e.g., potential race conditions that arise from re-
programming the IO-APIC in NMI context).

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hpet.h | 13 +
 arch/x86/kernel/hpet.c  | 25 +
 2 files changed, 38 insertions(+)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 6f099e2781ce..20abdaa5372d 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -109,6 +109,19 @@ extern void hpet_set_comparator(int num, unsigned int cmp, 
unsigned int period);
 
 #endif /* CONFIG_HPET_EMULATE_RTC */
 
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+struct hpet_hld_data {
+   boolhas_periodic;
+   u32 num;
+   u64 ticks_per_second;
+};
+
+extern struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void);
+#else
+static inline struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
+{ return NULL; }
+#endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
+
 #else /* CONFIG_HPET_TIMER */
 
 static inline int hpet_enable(void) { return 0; }
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index ba0a5cc075d5..20a16a304f89 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -170,6 +170,31 @@ do {   
\
_hpet_print_config(__func__, __LINE__); \
 } while (0)
 
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
+{
+   struct hpet_hld_data *hdata;
+   unsigned int cfg;
+
+   cfg = hpet_readl(HPET_Tn_CFG(HPET_WD_TIMER_NR));
+
+   if (!(cfg & HPET_TN_FSB_CAP))
+   return NULL;
+
+   hdata = kzalloc(sizeof(*hdata), GFP_KERNEL);
+   if (!hdata)
+   return NULL;
+
+   if (cfg & HPET_TN_PERIODIC_CAP)
+   hdata->has_periodic = true;
+
+   hdata->num = HPET_WD_TIMER_NR;
+   hdata->ticks_per_second = hpet_get_ticks_per_sec(hpet_readq(HPET_ID));
+
+   return hdata;
+}
+#endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
+
 /*
  * When the hpet driver (/dev/hpet) is enabled, we need to reserve
  * timer 0 and timer 1 in case of RTC emulation. Timer 2 is reserved in case
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[RFC PATCH v3 07/21] watchdog/hardlockup: Define a generic function to detect hardlockups

2019-05-14 Thread Ricardo Neri
The procedure to detect hardlockups is independent of the underlying
mechanism that generates the non-maskable interrupt used to drive the
detector. Thus, it can be put in a separate, generic function. In this
manner, it can be invoked by various implementations of the NMI watchdog.

For this purpose, move the bulk of watchdog_overflow_callback() to the
new function inspect_for_hardlockups(). This function can then be called
from the applicable NMI handlers.
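With this split, a non-perf NMI handler can reuse the same logic; a minimal
hypothetical sketch (not part of this patch):

	static int my_watchdog_nmi_handler(unsigned int type,
					   struct pt_regs *regs)
	{
		inspect_for_hardlockups(regs);
		return NMI_HANDLED;
	}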

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Don Zickus 
Cc: Nicholas Piggin 
Cc: Michael Ellerman 
Cc: Frederic Weisbecker 
Cc: Babu Moger 
Cc: "David S. Miller" 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Mathieu Desnoyers 
Cc: Masami Hiramatsu 
Cc: Peter Zijlstra 
Cc: Andrew Morton 
Cc: Philippe Ombredanne 
Cc: Colin Ian King 
Cc: "Luis R. Rodriguez" 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Signed-off-by: Ricardo Neri 
---
 include/linux/nmi.h   |  1 +
 kernel/watchdog_hld.c | 18 +++---
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 9003e29cde46..5a8b19749769 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -212,6 +212,7 @@ extern int proc_watchdog_thresh(struct ctl_table *, int ,
void __user *, size_t *, loff_t *);
 extern int proc_watchdog_cpumask(struct ctl_table *, int,
 void __user *, size_t *, loff_t *);
+void inspect_for_hardlockups(struct pt_regs *regs);
 
 #ifdef CONFIG_HAVE_ACPI_APEI_NMI
 #include 
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index 247bf0b1582c..b352e507b17f 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -106,14 +106,8 @@ static struct perf_event_attr wd_hw_attr = {
.disabled   = 1,
 };
 
-/* Callback function for perf event subsystem */
-static void watchdog_overflow_callback(struct perf_event *event,
-  struct perf_sample_data *data,
-  struct pt_regs *regs)
+void inspect_for_hardlockups(struct pt_regs *regs)
 {
-   /* Ensure the watchdog never gets throttled */
-   event->hw.interrupts = 0;
-
if (__this_cpu_read(watchdog_nmi_touch) == true) {
__this_cpu_write(watchdog_nmi_touch, false);
return;
@@ -163,6 +157,16 @@ static void watchdog_overflow_callback(struct perf_event 
*event,
return;
 }
 
+/* Callback function for perf event subsystem */
+static void watchdog_overflow_callback(struct perf_event *event,
+  struct perf_sample_data *data,
+  struct pt_regs *regs)
+{
+   /* Ensure the watchdog never gets throttled */
+   event->hw.interrupts = 0;
+   inspect_for_hardlockups(regs);
+}
+
 static int hardlockup_detector_event_create(void)
 {
unsigned int cpu = smp_processor_id();
-- 
2.17.1



[RFC PATCH v3 04/21] x86/hpet: Add hpet_set_comparator() for periodic and one-shot modes

2019-05-14 Thread Ricardo Neri
Instead of setting the timer period directly in hpet_set_periodic(), add a
new helper function hpet_set_comparator() that only sets the accumulator
and comparator. hpet_set_periodic() will only prepare the timer for
periodic mode and leave the expiration programming to
hpet_set_comparator().

This new function can also be used by other components (e.g., the HPET-
based hardlockup detector) which also need to configure HPET timers. Thus,
add its declaration into the hpet header file.
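A hypothetical one-shot use by such a caller could look like this (a
sketch; 'delta' is the desired expiration in HPET ticks):

	unsigned int count = hpet_readl(HPET_COUNTER);

	hpet_set_comparator(timer_nr, count + delta, 0);	/* 0 = one-shot */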

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Originally-by: Suravee Suthikulpanit 
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hpet.h |  1 +
 arch/x86/kernel/hpet.c  | 57 -
 2 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index f132fbf984d4..e7098740f5ee 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -102,6 +102,7 @@ extern int hpet_rtc_timer_init(void);
 extern irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id);
 extern int hpet_register_irq_handler(rtc_irq_handler handler);
 extern void hpet_unregister_irq_handler(rtc_irq_handler handler);
+extern void hpet_set_comparator(int num, unsigned int cmp, unsigned int 
period);
 
 #endif /* CONFIG_HPET_EMULATE_RTC */
 
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 560fc28e1d13..c5c5fc150193 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -289,6 +289,46 @@ static void hpet_legacy_clockevent_register(void)
printk(KERN_DEBUG "hpet clockevent registered\n");
 }
 
+/**
+ * hpet_set_comparator() - Helper function for setting comparator register
+ * @num:   The timer ID
+ * @cmp:   The value to be written to the comparator/accumulator
+ * @period:The value to be written to the period (0 = oneshot mode)
+ *
+ * Helper function for updating comparator, accumulator and period values.
+ *
+ * In periodic mode, HPET needs HPET_TN_SETVAL to be set before writing
+ * to the Tn_CMP to update the accumulator. Then, HPET needs a second
+ * write (with HPET_TN_SETVAL cleared) to Tn_CMP to set the period.
+ * The HPET_TN_SETVAL bit is automatically cleared after the first write.
+ *
+ * For one-shot mode, HPET_TN_SETVAL does not need to be set.
+ *
+ * See the following documents:
+ *   - Intel IA-PC HPET (High Precision Event Timers) Specification
+ *   - AMD-8111 HyperTransport I/O Hub Data Sheet, Publication # 24674
+ */
+void hpet_set_comparator(int num, unsigned int cmp, unsigned int period)
+{
+   if (period) {
+   unsigned int v = hpet_readl(HPET_Tn_CFG(num));
+
+   hpet_writel(v | HPET_TN_SETVAL, HPET_Tn_CFG(num));
+   }
+
+   hpet_writel(cmp, HPET_Tn_CMP(num));
+
+   if (!period)
+   return;
+
+   /*
+* This delay is seldom used: never in one-shot mode and, in periodic
+* mode, only when reprogramming the timer.
+*/
+   udelay(1);
+   hpet_writel(period, HPET_Tn_CMP(num));
+}
+EXPORT_SYMBOL_GPL(hpet_set_comparator);
+
 static int hpet_set_periodic(struct clock_event_device *evt, int timer)
 {
unsigned int cfg, cmp, now;
@@ -300,19 +340,10 @@ static int hpet_set_periodic(struct clock_event_device 
*evt, int timer)
now = hpet_readl(HPET_COUNTER);
cmp = now + (unsigned int)delta;
cfg = hpet_readl(HPET_Tn_CFG(timer));
-   cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_SETVAL |
-  HPET_TN_32BIT;
-   hpet_writel(cfg, HPET_Tn_CFG(timer));
-   hpet_writel(cmp, HPET_Tn_CMP(timer));
-   udelay(1);
-   /*
-* HPET on AMD 81xx needs a second write (with HPET_TN_SETVAL
-* cleared) to T0_CMP to set the period. The HPET_TN_SETVAL
-* bit is automatically cleared after the first write.
-* (See AMD-8111 HyperTransport I/O Hub Data Sheet,
-* Publication # 24674)
-*/
-   hpet_writel((unsigned int)delta, HPET_Tn_CMP(timer));
+   cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_32BIT;
+
+   hpet_set_comparator(timer, cmp, (unsigned int)delta);
+
hpet_start_counter();
hpet_print_config();
 
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[RFC PATCH v3 05/21] x86/hpet: Reserve timer for the HPET hardlockup detector

2019-05-14 Thread Ricardo Neri
HPET timer 2 will be used to drive the HPET-based hardlockup detector.
Reserve this timer to ensure it cannot be used by user space programs or
for clock events.

When looking for MSI-capable timers for clock events, skip timer 2 if
the HPET hardlockup detector is selected.

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hpet.h |  3 +++
 arch/x86/kernel/hpet.c  | 19 ---
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index e7098740f5ee..6f099e2781ce 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -61,6 +61,9 @@
  */
 #define HPET_MIN_PERIOD10UL
 
+/* Timer used for the hardlockup detector */
+#define HPET_WD_TIMER_NR 2
+
 /* hpet memory map physical address */
 extern unsigned long hpet_address;
 extern unsigned long force_hpet_address;
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index c5c5fc150193..ba0a5cc075d5 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -172,7 +172,8 @@ do {
\
 
 /*
  * When the hpet driver (/dev/hpet) is enabled, we need to reserve
- * timer 0 and timer 1 in case of RTC emulation.
+ * timer 0 and timer 1 in case of RTC emulation. Timer 2 is reserved in case
+ * the HPET-based hardlockup detector is used.
  */
 #ifdef CONFIG_HPET
 
@@ -182,7 +183,7 @@ static void hpet_reserve_platform_timers(unsigned int id)
 {
struct hpet __iomem *hpet = hpet_virt_address;
struct hpet_timer __iomem *timer = &hpet->hpet_timers[2];
-   unsigned int nrtimers, i;
+   unsigned int nrtimers, i, start_timer;
struct hpet_data hd;
 
nrtimers = ((id & HPET_ID_NUMBER) >> HPET_ID_NUMBER_SHIFT) + 1;
@@ -197,6 +198,13 @@ static void hpet_reserve_platform_timers(unsigned int id)
hpet_reserve_timer(&hd, 1);
 #endif
 
+   if (IS_ENABLED(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET)) {
+   hpet_reserve_timer(&hd, HPET_WD_TIMER_NR);
+   start_timer = HPET_WD_TIMER_NR + 1;
+   } else {
+   start_timer = HPET_WD_TIMER_NR;
+   }
+
/*
 * NOTE that hd_irq[] reflects IOAPIC input pins (LEGACY_8254
 * is wrong for i8259!) not the output IRQ.  Many BIOS writers
@@ -205,7 +213,7 @@ static void hpet_reserve_platform_timers(unsigned int id)
hd.hd_irq[0] = HPET_LEGACY_8254;
hd.hd_irq[1] = HPET_LEGACY_RTC;
 
-   for (i = 2; i < nrtimers; timer++, i++) {
+   for (i = start_timer; i < nrtimers; timer++, i++) {
hd.hd_irq[i] = (readl(&timer->hpet_config) &
Tn_INT_ROUTE_CNF_MASK) >> Tn_INT_ROUTE_CNF_SHIFT;
}
@@ -648,6 +656,11 @@ static void hpet_msi_capability_lookup(unsigned int 
start_timer)
struct hpet_dev *hdev = &hpet_devs[num_timers_used];
unsigned int cfg = hpet_readl(HPET_Tn_CFG(i));
 
+   /* Do not use timer reserved for the HPET watchdog. */
+   if (IS_ENABLED(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET) &&
+   i == HPET_WD_TIMER_NR)
+   continue;
+
/* Only consider HPET timer with MSI support */
if (!(cfg & HPET_TN_FSB_CAP))
continue;
-- 
2.17.1



[RFC PATCH v3 15/21] watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot parameter

2019-05-14 Thread Ricardo Neri
Keep the HPET-based hardlockup detector disabled unless explicitly enabled
via a command-line argument. If this parameter is not given, the
initialization of the HPET-based hardlockup detector fails and the NMI
watchdog falls back to the perf-based implementation.

Given that __setup("nmi_watchdog=") is already used to control the behavior
of the NMI watchdog (via hardlockup_panic_setup()), it cannot be used to
control the HPET-based implementation. Instead, use a new
early_param("nmi_watchdog").
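For example, with this patch applied the HPET implementation can be
requested together with the existing options (illustrative command line):

	nmi_watchdog=panic,hpet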

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Peter Zijlstra 
Cc: Clemens Ladisch 
Cc: Arnd Bergmann 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Mimi Zohar 
Cc: Jan Kiszka 
Cc: Nick Desaulniers 
Cc: Masahiro Yamada 
Cc: Nayna Jain 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 

--
checkpatch gives the following warning:

CHECK: __setup appears un-documented -- check 
Documentation/admin-guide/kernel-parameters.rst
+__setup("nmi_watchdog=", hardlockup_detector_hpet_setup);

This is a false-positive as the option nmi_watchdog is already
documented. The option is re-evaluated in this file as well.
---
 .../admin-guide/kernel-parameters.txt |  8 ++-
 arch/x86/kernel/watchdog_hld_hpet.c   | 22 +++
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index fd03e2b629bb..3c42205b469c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2801,7 +2801,7 @@
Format: [state][,regs][,debounce][,die]
 
nmi_watchdog=   [KNL,BUGS=X86] Debugging features for SMP kernels
-   Format: [panic,][nopanic,][num]
+   Format: [panic,][nopanic,][num,][hpet]
Valid num: 0 or 1
0 - turn hardlockup detector in nmi_watchdog off
1 - turn hardlockup detector in nmi_watchdog on
@@ -2811,6 +2811,12 @@
please see 'nowatchdog'.
This is useful when you use a panic=... timeout and
need the box quickly up again.
+   When hpet is specified, the NMI watchdog will be driven
+   by an HPET timer, if available in the system. Otherwise,
+   it falls back to the default implementation (perf or
+   architecture-specific). Specifying hpet has no effect
+   if the NMI watchdog is not enabled (either at build time
+   or via the command line).
 
These settings can be accessed at runtime via
the nmi_watchdog and hardlockup_panic sysctls.
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c 
b/arch/x86/kernel/watchdog_hld_hpet.c
index 6f1f540cfee9..90680a8cf9fc 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -350,6 +350,28 @@ void hardlockup_detector_hpet_stop(void)
disable_timer(hld_data);
 }
 
+/**
+ * hardlockup_detector_hpet_setup() - Parse command-line parameters
+ * @str:   A string containing the kernel command line
+ *
+ * Parse the nmi_watchdog parameter from the kernel command line. If
+ * selected by the user, use this implementation to detect hardlockups.
+ */
+static int __init hardlockup_detector_hpet_setup(char *str)
+{
+   if (!str)
+   return -EINVAL;
+
+   if (parse_option_str(str, "hpet"))
+   hardlockup_use_hpet = true;
+
+   if (!nmi_watchdog_user_enabled && hardlockup_use_hpet)
+   pr_warn("Selecting HPET NMI watchdog has no effect with NMI watchdog disabled\n");
+
+   return 0;
+}
+early_param("nmi_watchdog", hardlockup_detector_hpet_setup);
+
 /**
  * hardlockup_detector_hpet_init() - Initialize the hardlockup detector
  *
-- 
2.17.1



[RFC PATCH v3 19/21] iommu/vt-d: Rework prepare_irte() to support per-irq delivery mode

2019-05-14 Thread Ricardo Neri
A recent change introduced a new member to struct irq_cfg to specify the
delivery mode of an interrupt. Supporting the configuration of the
delivery mode would require adding a third argument to prepare_irte().
Instead, simply take a pointer to an irq_cfg data structure as the only
argument.

Internally, configure the delivery mode of the Interrupt Remapping Table
Entry as specified in the irq_cfg data structure and not as the APIC
setting.

This change does not alter the existing behavior, as the delivery mode of
the APIC is used to configure the irq_cfg data structure.

Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Borislav Petkov 
Cc: Jacob Pan 
Cc: Joerg Roedel 
Cc: Juergen Gross 
Cc: Bjorn Helgaas 
Cc: Wincy Van 
Cc: Kate Stewart 
Cc: Philippe Ombredanne 
Cc: "Eric W. Biederman" 
Cc: Baoquan He 
Cc: Dou Liyang 
Cc: Jan Kiszka 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Ricardo Neri 
---
 drivers/iommu/intel_irq_remapping.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/intel_irq_remapping.c 
b/drivers/iommu/intel_irq_remapping.c
index 2d74641b7f7b..4ebf3af76589 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -1073,7 +1073,7 @@ static int reenable_irq_remapping(int eim)
return -1;
 }
 
-static void prepare_irte(struct irte *irte, int vector, unsigned int dest)
+static void prepare_irte(struct irte *irte, struct irq_cfg *irq_cfg)
 {
memset(irte, 0, sizeof(*irte));
 
@@ -1087,9 +1087,9 @@ static void prepare_irte(struct irte *irte, int vector, 
unsigned int dest)
 * irq migration in the presence of interrupt-remapping.
*/
irte->trigger_mode = 0;
-   irte->dlvry_mode = apic->irq_delivery_mode;
-   irte->vector = vector;
-   irte->dest_id = IRTE_DEST(dest);
+   irte->dlvry_mode = irq_cfg->delivery_mode;
+   irte->vector = irq_cfg->vector;
+   irte->dest_id = IRTE_DEST(irq_cfg->dest_apicid);
irte->redir_hint = 1;
 }
 
@@ -1266,7 +1266,7 @@ static void intel_irq_remapping_prepare_irte(struct 
intel_ir_data *data,
struct irte *irte = &data->irte_entry;
struct msi_msg *msg = &data->msi_entry;
 
-   prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid);
+   prepare_irte(irte, irq_cfg);
switch (info->type) {
case X86_IRQ_ALLOC_TYPE_IOAPIC:
/* Set source-id of interrupt request */
-- 
2.17.1



[RFC PATCH v3 02/21] x86/hpet: Expose hpet_writel() in header

2019-05-14 Thread Ricardo Neri
In order to allow hpet_writel() to be used by other components (e.g.,
the HPET-based hardlockup detector), expose it in the HPET header file.

No empty definition is needed if CONFIG_HPET is not selected as all
existing callers select such config symbol.

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "Rafael J. Wysocki" 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/hpet.h | 1 +
 arch/x86/kernel/hpet.c  | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 67385d56d4f4..f132fbf984d4 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -72,6 +72,7 @@ extern int is_hpet_enabled(void);
 extern int hpet_enable(void);
 extern void hpet_disable(void);
 extern unsigned int hpet_readl(unsigned int a);
+extern void hpet_writel(unsigned int d, unsigned int a);
 extern void force_hpet_resume(void);
 
 struct irq_data;
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index fb32925a2e62..560fc28e1d13 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -61,7 +61,7 @@ inline unsigned int hpet_readl(unsigned int a)
return readl(hpet_virt_address + a);
 }
 
-static inline void hpet_writel(unsigned int d, unsigned int a)
+inline void hpet_writel(unsigned int d, unsigned int a)
 {
writel(d, hpet_virt_address + a);
 }
-- 
2.17.1



[RFC PATCH v3 01/21] x86/msi: Add definition for NMI delivery mode

2019-05-14 Thread Ricardo Neri
Until now, the delivery mode of MSI interrupts has been set to the default
mode configured in the APIC driver. However, there is no hardware restriction
against configuring each interrupt with a different delivery mode. Specifying
the
delivery mode per interrupt is useful when one is interested in changing
the delivery mode of a particular interrupt. For instance, this can be used
to deliver an interrupt as non-maskable.
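A hedged sketch of how the new define could be used to compose the data word
of an NMI-delivering MSI (the other macros already exist in msidef.h; the
vector field is ignored in NMI delivery mode):

	unsigned int msi_data = MSI_DATA_TRIGGER_EDGE |
				MSI_DATA_LEVEL_ASSERT |
				MSI_DATA_DELIVERY_NMI;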

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Joerg Roedel 
Cc: Juergen Gross 
Cc: Bjorn Helgaas 
Cc: Wincy Van 
Cc: Kate Stewart 
Cc: Philippe Ombredanne 
Cc: "Eric W. Biederman" 
Cc: Baoquan He 
Cc: Dou Liyang 
Cc: Jan Kiszka 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/msidef.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/msidef.h b/arch/x86/include/asm/msidef.h
index ee2f8ccc32d0..38ccfdc2d96e 100644
--- a/arch/x86/include/asm/msidef.h
+++ b/arch/x86/include/asm/msidef.h
@@ -18,6 +18,7 @@
 #define MSI_DATA_DELIVERY_MODE_SHIFT   8
 #define  MSI_DATA_DELIVERY_FIXED   (0 << MSI_DATA_DELIVERY_MODE_SHIFT)
 #define  MSI_DATA_DELIVERY_LOWPRI  (1 << MSI_DATA_DELIVERY_MODE_SHIFT)
+#define  MSI_DATA_DELIVERY_NMI (4 << MSI_DATA_DELIVERY_MODE_SHIFT)
 
 #define MSI_DATA_LEVEL_SHIFT   14
 #define MSI_DATA_LEVEL_DEASSERT(0 << MSI_DATA_LEVEL_SHIFT)
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 02/16] iommu: Introduce cache_invalidate API

2019-05-14 Thread Auger Eric
Hi Jean,

On 5/14/19 12:42 PM, Jean-Philippe Brucker wrote:
> On 14/05/2019 08:46, Auger Eric wrote:
>> Hi Jean,
>>
>> On 5/13/19 7:09 PM, Jean-Philippe Brucker wrote:
>>> On 13/05/2019 17:50, Auger Eric wrote:
> struct iommu_inv_pasid_info {
> #define IOMMU_INV_PASID_FLAGS_PASID   (1 << 0)
> #define IOMMU_INV_PASID_FLAGS_ARCHID  (1 << 1)
>   __u32   flags;
>   __u32   archid;
>   __u64   pasid;
> };
 I agree it does the job now. However it looks a bit strange to do a
 PASID based invalidation in my case - SMMUv3 nested stage - where I
 don't have any PASID involved.

 Couldn't we call it context based invalidation then? A context can be
 tagged by a PASID or/and an ARCHID.
>>>
>>> I think calling it "context" would be confusing as well (I shouldn't
>>> have used it earlier), since VT-d uses that name for device table
>>> entries (=STE on Arm SMMU). Maybe "addr_space"?
>> yes you're right. Well, we already use pasid table terminology so we
>> can use it here as well - as long as we understand what purpose it
>> serves ;-) - So OK for iommu_inv_pasid_info.
>>
>> I think Jean understood we would keep pasid standalone field in
I meant Jacob here.
>> iommu_cache_invalidate_info's union. I understand the struct
>> iommu_inv_pasid_info now would replace it, correct?

Thank you for the confirmation.

Eric

> 
> Yes
> 
> Thanks,
> Jean
> 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 02/16] iommu: Introduce cache_invalidate API

2019-05-14 Thread Jean-Philippe Brucker
On 14/05/2019 08:36, Auger Eric wrote:
> Hi Jacob,
> 
> On 5/14/19 12:16 AM, Jacob Pan wrote:
>> On Mon, 13 May 2019 18:09:48 +0100
>> Jean-Philippe Brucker  wrote:
>>
>>> On 13/05/2019 17:50, Auger Eric wrote:
> struct iommu_inv_pasid_info {
> #define IOMMU_INV_PASID_FLAGS_PASID   (1 << 0)
> #define IOMMU_INV_PASID_FLAGS_ARCHID  (1 << 1)
>   __u32   flags;
>   __u32   archid;
>   __u64   pasid;
> };  
 I agree it does the job now. However it looks a bit strange to do a
 PASID based invalidation in my case - SMMUv3 nested stage - where I
 don't have any PASID involved.

 Couldn't we call it context based invalidation then? A context can
 be tagged by a PASID or/and an ARCHID.  
>>>
>>> I think calling it "context" would be confusing as well (I shouldn't
>>> have used it earlier), since VT-d uses that name for device table
>>> entries (=STE on Arm SMMU). Maybe "addr_space"?
>>>
>> I am still struggling to understand what ARCHID is after scanning
>> through SMMUv3.1 spec. It seems to be a constant for a given SMMU. Why
>> do you need to pass it down every time? Could you point to me the
>> document or explain a little more on ARCHID use cases.
>> We have three fields called pasid under this struct
>> iommu_cache_invalidate_info{}
>> Gets confusing :)
> archid is a generic term. That's why you did not find it in the spec ;-)
> 
> On ARM SMMU the archid is called the ASID (Address Space ID, up to 16
> bits. The ASID is stored in the Context Descriptor Entry (your PASID
> entry) and thus characterizes a given stage 1 translation
> "context"/"adress space".

Yes, another way to look at it is, for a given address space:
* PASID tags device-IOTLB (ATC) entries.
* ASID (here called archid) tags IOTLB entries.

They could have the same value, but it depends on the guest's allocation
policy which isn't in our control. With my PASID patches for SMMUv3,
they have different values. So we need both fields if we intend to
invalidate both ATC and IOTLB with a single call.

Thanks,
Jean

> 
> At the moment the ASID is allocated per iommu domain. With aux domains
> we should have one ASID per aux domain, Jean-Philippe said.
> 
> ASID tags IOTLB S1 entries. As the ASID is part of the "context
> descriptor" which is owned by the guest, the API must pass it somehow.
> 
> 4.4.1.2 CMD_TLBI_NH_ASID(VMID, ASID) invalidation command allows to
> invalidate all IOTLB S1 entries for a given VMID/ASID and this is the
> functionality which is currently missing in the API. This is not an
> address based invalidation or a "pure" PASID based invalidation. At the
> moment we don't support PASIDs on ARM and I need this capability.
> 
> Thanks
> 
> Eric
> 
> 
> 
>>> Thanks,
>>> Jean
>>>

 Domain invalidation would invalidate all the contexts belonging to
 that domain.

 Thanks

 Eric  
>>
>> [Jacob Pan]
>>



Re: [RFC] iommu: arm-smmu: stall support

2019-05-14 Thread Robin Murphy

On 14/05/2019 02:54, Rob Clark wrote:

On Mon, May 13, 2019 at 11:37 AM Jean-Philippe Brucker
 wrote:


Hi Rob,

On 10/05/2019 19:23, Rob Clark wrote:

On Fri, Sep 22, 2017 at 2:58 AM Jean-Philippe Brucker
 wrote:


On 22/09/17 10:02, Joerg Roedel wrote:

On Tue, Sep 19, 2017 at 10:23:43AM -0400, Rob Clark wrote:

I would like to decide in the IRQ whether or not to queue work,
because when we get a gpu fault, we tend to get 1000's of gpu faults
all at once (and I really only need to handle the first one).  I
suppose that could also be achieved by having a special return value
from the fault handler to say "call me again from a wq"..

Note that in the drm driver I already have a suitable wq to queue the
work, so it really doesn't buy me anything to have the iommu driver
toss things off to a wq for me.  Might be a different situation for
other drivers (but I guess mostly other drivers are using iommu API
indirectly via dma-mapping?)


Okay, so since you are the only user for now, we don't need a
work-queue. But I still want the ->resume call-back to be hidden in the
iommu code and not be exposed to users.

We already have per-domain fault-handlers, so the best solution for now
is to call ->resume from report_iommu_fault() when the fault-handler
returns a special value.


The problem is that report_iommu_fault is called from IRQ context by the
SMMU driver, so the device driver callback cannot sleep.

So if the device driver needs to be able to sleep between fault report and
resume, as I understand Rob needs for writing debugfs, we can either:

* call report_iommu_fault from higher up, in a thread or workqueue.
* split the fault reporting as this patch proposes. The exact same
   mechanism is needed for the vSVM work by Intel: in order to inject fault
   into the guest, they would like to have an atomic notifier registered by
   VFIO for passing down the Page Request, and a new function in the IOMMU
   API to resume/complete the fault.



So I was thinking about this topic again.. I would still like to get
some sort of async resume so that I can wire up GPU cmdstream/state
logging on iommu fault (without locally resurrecting and rebasing this
patch and drm/msm side changes each time I need to debug iommu
faults)..


We've been working on the new fault reporting API with Jacob and Eric,
and I intend to send it out soon. It is supposed to be used for
reporting faults to guests via VFIO, handling page faults via mm, and
also reporting events directly to device drivers. Please let us know
what works and what doesn't in your case.

The most recent version of the patches is at
http://www.linux-arm.org/git?p=linux-jpb.git;a=shortlog;h=refs/heads/sva/api
(git://www.linux-arm.org/linux-jpb.git branch sva/api). Hopefully on the
list sometimes next week, I'll add you on Cc.

In particular, see commits
 iommu: Introduce device fault data
 iommu: Introduce device fault report API
 iommu: Add recoverable fault reporting

The device driver calls iommu_register_device_fault_handler(dev, cb,
data). To report a fault, the SMMU driver calls
iommu_report_device_fault(dev, fault). This calls into the device driver
directly, there isn't any workqueue. If the fault is recoverable (the
SMMU driver set type IOMMU_FAULT_PAGE_REQ rather than
IOMMU_FAULT_DMA_UNRECOV), the device driver calls iommu_page_response()
once it has dealt with the fault (after sleeping if it needs to). This
invokes the SMMU driver's resume callback.
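
As a minimal sketch, a driver-side consumer of this API might look as
follows. The handler signature and the layout of the fault and response
structures are assumptions based on the description above, not the
actual branch contents:

static int my_fault_handler(struct iommu_fault *fault, void *data)
{
	struct my_device *mydev = data;	/* hypothetical driver state */

	if (fault->type != IOMMU_FAULT_PAGE_REQ)
		return -ENOSYS;		/* unrecoverable, nothing to do */

	/* Defer to the driver's own workqueue; the work item may sleep
	 * before answering the fault with iommu_page_response(). */
	queue_work(mydev->wq, &mydev->fault_work);
	return 0;
}

/* At probe time: */
ret = iommu_register_device_fault_handler(dev, my_fault_handler, mydev);

/* Later, from the work item, once the fault has been dealt with
 * ("resp" being a page response message as defined in the branch): */
ret = iommu_page_response(dev, &resp);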


OK, this sounds at a high level similar to my earlier RFC, in that
resume is split (and that was the main thing I was interested in).
And it does solve one thing I was struggling with, namely that when
the domain is created it doesn't know which iommu device it will be
attached to (given that at least the original arm-smmu.c driver cannot
support stall in all cases).

For GPU translation faults, I also don't really need to know whether
the faulting translation is stalled until the callback (I mainly want
to avoid snapshotting GPU state if the translation is not stalled,
because in that case the snapshot is unlikely to be related to the
fault).


At the moment we use mutexes, so iommu_report_device_fault() can only be
called from an IRQ thread, which is incompatible with the current SMMUv2
driver. Either we need to switch the SMMUv2 driver to an IRQ thread, or
rework the fault handler to be called from an IRQ handler. The reporting
also has to be per device rather than per domain, and I'm not sure if
the SMMUv2 driver can deal with this.
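
For reference, moving the SMMUv2 context fault to a threaded IRQ could
be as small as the following sketch; devm_request_threaded_irq() is the
standard kernel API, while the thread function name is illustrative:

ret = devm_request_threaded_irq(smmu->dev, irq,
				NULL,			/* no hard-IRQ half */
				arm_smmu_context_fault,	/* may sleep now */
				IRQF_ONESHOT,
				"arm-smmu-context-fault", smmu_domain);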


I'll take a closer look at the branch and try to formulate some plan
to add v2 support for this.


What's fun is that we should be able to identify a stream ID for most 
context faults *except* translation faults...


We've considered threaded IRQs before, and IIRC the problem with doing 
it at the architectural level is that in some cases the fault interrupt 
can only be deasserted by actually resuming/terminating the 

Re: [PATCH v3 02/16] iommu: Introduce cache_invalidate API

2019-05-14 Thread Auger Eric
Hi Jean,

On 5/13/19 7:09 PM, Jean-Philippe Brucker wrote:
> On 13/05/2019 17:50, Auger Eric wrote:
>>> struct iommu_inv_pasid_info {
>>> #define IOMMU_INV_PASID_FLAGS_PASID (1 << 0)
>>> #define IOMMU_INV_PASID_FLAGS_ARCHID  (1 << 1)
>>> __u32   flags;
>>> __u32   archid;
>>> __u64   pasid;
>>> };
>> I agree it does the job now. However it looks a bit strange to do a
>> PASID based invalidation in my case - SMMUv3 nested stage - where I
>> don't have any PASID involved.
>>
>> Couldn't we call it context based invalidation then? A context can be
>> tagged by a PASID or/and an ARCHID.
> 
> I think calling it "context" would be confusing as well (I shouldn't
> have used it earlier), since VT-d uses that name for device table
> entries (=STE on Arm SMMU). Maybe "addr_space"?
Yes, you're right. Well, we already use "pasid table" terminology, so we
can use it here as well - as long as we understand what purpose it
serves ;-) - so OK for iommu_inv_pasid_info.

I think Jean understood we would keep a standalone pasid field in
iommu_cache_invalidate_info's union. I understand the struct
iommu_inv_pasid_info would now replace it, correct?
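
For reference, the layout under discussion would then look roughly like
this sketch - field names other than those quoted in this thread are
approximations, not the actual patch contents:

struct iommu_cache_invalidate_info {
	__u32	version;
	__u8	cache;		/* IOTLB, dev-IOTLB, PASID cache... */
	__u8	granularity;
	__u8	padding[2];
	union {
		struct iommu_inv_pasid_info pasid_info;	/* was: __u64 pasid */
		struct iommu_inv_addr_info  addr_info;
	};
};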

Thanks

Eric
> 
> Thanks,
> Jean
> 
>>
>> Domain invalidation would invalidate all the contexts belonging to that
>> domain.
>>
>> Thanks
>>
>> Eric


Re: [PATCH v3 02/16] iommu: Introduce cache_invalidate API

2019-05-14 Thread Auger Eric
Hi Jacob,

On 5/14/19 12:16 AM, Jacob Pan wrote:
> On Mon, 13 May 2019 18:09:48 +0100
> Jean-Philippe Brucker  wrote:
> 
>> On 13/05/2019 17:50, Auger Eric wrote:
 struct iommu_inv_pasid_info {
 #define IOMMU_INV_PASID_FLAGS_PASID   (1 << 0)
 #define IOMMU_INV_PASID_FLAGS_ARCHID   (1 << 1)
__u32   flags;
__u32   archid;
__u64   pasid;
 };  
>>> I agree it does the job now. However it looks a bit strange to do a
>>> PASID based invalidation in my case - SMMUv3 nested stage - where I
>>> don't have any PASID involved.
>>>
>>> Couldn't we call it context based invalidation then? A context can
>>> be tagged by a PASID or/and an ARCHID.  
>>
>> I think calling it "context" would be confusing as well (I shouldn't
>> have used it earlier), since VT-d uses that name for device table
>> entries (=STE on Arm SMMU). Maybe "addr_space"?
>>
> I am still struggling to understand what ARCHID is after scanning
> through the SMMUv3.1 spec. It seems to be a constant for a given SMMU. Why
> do you need to pass it down every time? Could you point me to the
> document or explain a little more on ARCHID use cases.
> We have three fields called pasid under this struct
> iommu_cache_invalidate_info{}
> Gets confusing :)
archid is a generic term. That's why you did not find it in the spec ;-)

On ARM SMMU the archid is called the ASID (Address Space ID, up to 16
bits). The ASID is stored in the Context Descriptor Entry (your PASID
entry) and thus characterizes a given stage 1 translation
"context"/"address space".

At the moment the ASID is allocated per iommu domain. With aux domains
we should have one ASID per aux domain, Jean-Philippe said.

ASID tags IOTLB S1 entries. As the ASID is part of the "context
descriptor" which is owned by the guest, the API must pass it somehow.

The CMD_TLBI_NH_ASID(VMID, ASID) invalidation command (spec section
4.4.1.2) invalidates all IOTLB S1 entries for a given VMID/ASID, and
this is the functionality currently missing in the API. It is neither an
address based invalidation nor a "pure" PASID based invalidation. At the
moment we don't support PASIDs on ARM and I need this capability.
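
In other words, the missing ASID-only invalidation would be expressed
through the proposed structure roughly as follows (a sketch based on the
struct definition quoted above; "asid" is a placeholder variable):

struct iommu_inv_pasid_info info = {
	/* ARCHID set, PASID flag clear: nested stage without PASIDs */
	.flags	= IOMMU_INV_PASID_FLAGS_ARCHID,
	.archid	= asid,		/* guest-owned ASID, up to 16 bits */
};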

Thanks

Eric



>> Thanks,
>> Jean
>>
>>>
>>> Domain invalidation would invalidate all the contexts belonging to
>>> that domain.
>>>
>>> Thanks
>>>
>>> Eric  
> 
> [Jacob Pan]
> 


Re: [PATCH v4 1/1] iommu/io-pgtable-arm: Add support to use system cache

2019-05-14 Thread Vivek Gautam
Hi Robin,


On Mon, May 13, 2019 at 5:02 PM Robin Murphy  wrote:
>
> On 13/05/2019 11:04, Vivek Gautam wrote:
> > A few Qualcomm platforms, such as sdm845, have an additional outer
> > cache called the System cache, aka Last level cache (LLC), that
> > allows non-coherent devices to upgrade to using caching.
> > This cache sits right before the DDR and is tightly coupled
> > with the memory controller. The clients using this cache request
> > their slices from this system cache, make them active, and can then
> > start using it.
> >
> > There is a fundamental assumption that non-coherent devices can't
> > access caches. This change adds an exception where they *can* use
> > some level of cache despite still being non-coherent overall.
> > Coherent devices that use cacheable memory, and the CPU, make use of
> > this system cache by default.
> >
> > Looking at memory types, we have the following -
> > a) Normal uncached :- MAIR 0x44, inner non-cacheable,
> >outer non-cacheable;
> > b) Normal cached :-   MAIR 0xff, inner read write-back non-transient,
> >outer read write-back non-transient;
> >attribute setting for coherent I/O devices.
> > and, for non-coherent I/O devices that can allocate in system cache
> > another type gets added -
> > c) Normal sys-cached :- MAIR 0xf4, inner non-cacheable,
> >  outer read write-back non-transient
> >
> > Coherent I/O devices use system cache by marking the memory as
> > normal cached.
> > Non-coherent I/O devices should mark the memory as normal
> > sys-cached in page tables to use system cache.
> >
> > Signed-off-by: Vivek Gautam 
> > ---
> >
> > V3 version of this patch and related series can be found at [1].
> >
> > This change is a realisation of following changes from downstream msm-4.9:
> > iommu: io-pgtable-arm: Implement IOMMU_USE_UPSTREAM_HINT[2]
> >
> > Changes since v3:
> >   - Dropping support for caching i/o page tables in the system cache;
> > getting support for data buffers is the first step.
> >   - Removed io-pgtable quirk and related change to add domain attribute.
> >
> > Glmark2 numbers on SDM845 based cheza board:
> >
> > S.No. | with LLC support | without LLC support
> >       | for data buffers |
> > ------+------------------+--------------------
> >   1   | 4480; 72.3fps    | 4042; 65.2fps
> >   2   | 4500; 72.6fps    | 4039; 65.1fps
> >   3   | 4523; 72.9fps    | 4106; 66.2fps
> >   4   | 4489; 72.4fps    | 4104; 66.2fps
> >   5   | 4518; 72.9fps    | 4072; 65.7fps
> >
> > [1] https://patchwork.kernel.org/cover/10772629/
> > [2] 
> > https://source.codeaurora.org/quic/la/kernel/msm-4.9/commit/?h=msm-4.9=d4c72c413ea27c43f60825193d4de9cb8ffd9602
> >
> >   drivers/iommu/io-pgtable-arm.c | 9 ++++++++-
> >   include/linux/iommu.h          | 1 +
> >   2 files changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> > index d3700ec15cbd..2dbafe697531 100644
> > --- a/drivers/iommu/io-pgtable-arm.c
> > +++ b/drivers/iommu/io-pgtable-arm.c
> > @@ -167,10 +167,12 @@
> >   #define ARM_LPAE_MAIR_ATTR_MASK               0xff
> >   #define ARM_LPAE_MAIR_ATTR_DEVICE             0x04
> >   #define ARM_LPAE_MAIR_ATTR_NC                 0x44
> > +#define ARM_LPAE_MAIR_ATTR_QCOM_SYS_CACHE      0xf4
> >   #define ARM_LPAE_MAIR_ATTR_WBRWA              0xff
> >   #define ARM_LPAE_MAIR_ATTR_IDX_NC             0
> >   #define ARM_LPAE_MAIR_ATTR_IDX_CACHE          1
> >   #define ARM_LPAE_MAIR_ATTR_IDX_DEV            2
> > +#define ARM_LPAE_MAIR_ATTR_IDX_QCOM_SYS_CACHE  3
>
> Here at the implementation level, I'd rather just call these what they
> are, i.e. s/QCOM_SYS_CACHE/INC_OWBRWA/.
>

Thanks for the review.
Sure, will change this as suggested.
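
For reference, the MAIR encodings quoted in the patch decode as follows
(per the Armv8 Attr<n> field split, bits [7:4] outer / bits [3:0] inner;
given here as background, not as part of the patch):

/*
 * 0x44 = outer NC,    inner NC    -> (a) Normal uncached
 * 0xff = outer WBRWA, inner WBRWA -> (b) Normal cached
 * 0xf4 = outer WBRWA, inner NC    -> (c) Normal sys-cached
 */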

> >
> >   /* IOPTE accessors */
> >   #define iopte_deref(pte,d) __va(iopte_to_paddr(pte, d))
> > @@ -442,6 +444,9 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
> >   else if (prot & IOMMU_CACHE)
> >   pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
> >   << ARM_LPAE_PTE_ATTRINDX_SHIFT);
> > + else if (prot & IOMMU_QCOM_SYS_CACHE)
>
> Where in the call stack is this going to be decided? (I don't recall the
> previous discussions ever really reaching a solid conclusion on how to
> separate responsibilities).
>

Based on the last discussion [1], I understood that we may not want to
expose these cache protections to the DMA APIs, so such control would
lie with the masters that are creating the individual domains. An
example [2] of this is graphics on sdm845. Please ignore the naming
difference: IOMMU_UPSTREAM_HINT in [2] is the same as
IOMMU_QCOM_SYS_CACHE here.
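
A master-driven mapping would then look roughly like this (a sketch,
assuming IOMMU_QCOM_SYS_CACHE lands as an iommu_map() prot flag as in
the patch above; iommu_map() itself is the existing IOMMU API):

/* e.g. the GPU driver mapping a buffer into its own domain */
ret = iommu_map(domain, iova, paddr, size,
		IOMMU_READ | IOMMU_WRITE | IOMMU_QCOM_SYS_CACHE);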

At that point [1] I also pointed out that video, which uses DMA APIs to
handle buffers, also uses the system cache on sdm845. In this case
shouldn't we expose the protection