[PATCH v2] iommu/amd: Add support to indicate whether DMA remap support is enabled

2022-03-18 Thread Mario Limonciello via iommu
Bit 1 of the IVRS IVInfo field indicates that the IOMMU has been used for
pre-boot DMA protection.

Export this capability to allow other places in the kernel to be able to
check for it on AMD systems.
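
As an illustration, a consumer can then gate Kernel DMA Protection style
decisions on this (a minimal sketch assuming dev_iommu_capable() and
IOMMU_CAP_PRE_BOOT_PROTECTION from Robin's series linked below; not part
of this patch):

#include <linux/iommu.h>
#include <linux/pci.h>

static bool pdev_has_preboot_dma_protection(struct pci_dev *pdev)
{
	/* True only if the device is IOMMU-mapped and firmware indicated
	 * pre-boot DMA protection via IVRS IVInfo bit 1. */
	return dev_iommu_capable(&pdev->dev, IOMMU_CAP_PRE_BOOT_PROTECTION);
}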

Cc: Robin Murphy 
Link: https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
Signed-off-by: Mario Limonciello 
---
changes from v1->v2:
 * Rebase on top of Robin Murphy's patch series to add generic interface
   
https://lore.kernel.org/linux-usb/cover.1647624084.git.robin.mur...@arm.com/T/#t
 * Drop changes to Thunderbolt driver

Robin,
If your patch series revs again and this looks good, I suggest rolling it
into your series as a 3rd patch.

 drivers/iommu/amd/amd_iommu_types.h | 4 ++++
 drivers/iommu/amd/init.c            | 3 +++
 drivers/iommu/amd/iommu.c           | 2 ++
 3 files changed, 9 insertions(+)

diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index 47108ed44fbb..72d0f5e2f651 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -407,6 +407,7 @@
 /* IOMMU IVINFO */
 #define IOMMU_IVINFO_OFFSET 36
 #define IOMMU_IVINFO_EFRSUP BIT(0)
+#define IOMMU_IVINFO_DMA_REMAP  BIT(1)
 
 /* IOMMU Feature Reporting Field (for IVHD type 10h */
 #define IOMMU_FEAT_GASUP_SHIFT 6
@@ -449,6 +450,9 @@ extern struct irq_remap_table **irq_lookup_table;
 /* Interrupt remapping feature used? */
 extern bool amd_iommu_irq_remap;
 
+/* IVRS indicates that pre-boot remapping was enabled */
+extern bool amdr_ivrs_remap_support;
+
 /* kmem_cache to get tables with 128 byte alignement */
 extern struct kmem_cache *amd_iommu_irq_cache;
 
diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 7bfe37e52e21..fc12ead49a03 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -182,6 +182,7 @@ u32 amd_iommu_max_pasid __read_mostly = ~0;
 
 bool amd_iommu_v2_present __read_mostly;
 static bool amd_iommu_pc_present __read_mostly;
+bool amdr_ivrs_remap_support __read_mostly;
 
 bool amd_iommu_force_isolation __read_mostly;
 
@@ -326,6 +327,8 @@ static void __init early_iommu_features_init(struct amd_iommu *iommu,
 {
if (amd_iommu_ivinfo & IOMMU_IVINFO_EFRSUP)
iommu->features = h->efr_reg;
+   if (amd_iommu_ivinfo & IOMMU_IVINFO_DMA_REMAP)
+   amdr_ivrs_remap_support = true;
 }
 
 /* Access to l1 and l2 indexed register spaces */
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index a18b549951bb..e4b4dad027f7 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2162,6 +2162,8 @@ static bool amd_iommu_capable(enum iommu_cap cap)
return (irq_remapping_enabled == 1);
case IOMMU_CAP_NOEXEC:
return false;
+   case IOMMU_CAP_PRE_BOOT_PROTECTION:
+   return amdr_ivrs_remap_support;
default:
break;
}
-- 
2.34.1



RE: [PATCH v2 2/2] thunderbolt: Make iommu_dma_protection more accurate

2022-03-18 Thread Limonciello, Mario via iommu

> Between me trying to get rid of iommu_present() and Mario wanting to
> support the AMD equivalent of DMAR_PLATFORM_OPT_IN, scrutiny has
> shown
> that the iommu_dma_protection attribute is being far too optimistic.
> Even if an IOMMU might be present for some PCI segment in the system,
> that doesn't necessarily mean it provides translation for the device(s)
> we care about. Furthermore, all that DMAR_PLATFORM_OPT_IN really does
> is tell us that memory was protected before the kernel was loaded, and
> prevent the user from disabling the intel-iommu driver entirely. While
> that lets us assume kernel integrity, what matters for actual runtime
> DMA protection is whether we trust individual devices, based on the
> "external facing" property that we expect firmware to describe for
> Thunderbolt ports.
> 
> It's proven challenging to determine the appropriate ports accurately
> given the variety of possible topologies, so while still not getting a
> perfect answer, by putting enough faith in firmware we can at least get
> a good bit closer. If we can see that any device near a Thunderbolt NHI
> has all the requisites for Kernel DMA Protection, chances are that it
> *is* a relevant port, but moreover that implies that firmware is playing
> the game overall, so we'll use that to assume that all Thunderbolt ports
> should be correctly marked and thus will end up fully protected.
> 

This approach looks generally good to me.  I do worry a little about older
systems that didn't set ExternalFacingPort in the FW but were previously
reporting iommu_dma_protection; I think those could be handled on a quirk
basis, marking the PCI IDs of those root ports as external facing if/when
they come up.

I'll send a follow-up patch that adds the AMD ACPI table check.
If it looks good you can roll it into your series for v3, or if this series
goes in as-is for v2 it can go on its own.

> CC: Mario Limonciello 
> Signed-off-by: Robin Murphy 
> ---
> 
> v2: Give up trying to look for specific devices, just look for evidence
> that firmware cares at all.

I still think you could know exactly which devices to use if you're in
SW CM mode, but I guess the consensus is not to bifurcate the way this
can be checked.

> 
>  drivers/thunderbolt/domain.c | 12 +++------
>  drivers/thunderbolt/nhi.c    | 41 ++++++++++++++++++++++++++++++++
>  include/linux/thunderbolt.h  |  2 ++
>  3 files changed, 46 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/thunderbolt/domain.c b/drivers/thunderbolt/domain.c
> index 7018d959f775..2889a214dadc 100644
> --- a/drivers/thunderbolt/domain.c
> +++ b/drivers/thunderbolt/domain.c
> @@ -7,9 +7,7 @@
>   */
> 
>  #include 
> -#include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -257,13 +255,9 @@ static ssize_t iommu_dma_protection_show(struct device *dev,
> 					 struct device_attribute *attr,
> 					 char *buf)
>  {
> - /*
> -  * Kernel DMA protection is a feature where Thunderbolt security is
> -  * handled natively using IOMMU. It is enabled when IOMMU is
> -  * enabled and ACPI DMAR table has DMAR_PLATFORM_OPT_IN set.
> -  */
> -	return sprintf(buf, "%d\n",
> -		       iommu_present(&pci_bus_type) && dmar_platform_optin());
> + struct tb *tb = container_of(dev, struct tb, dev);
> +
> + return sysfs_emit(buf, "%d\n", tb->nhi->iommu_dma_protection);
>  }
>  static DEVICE_ATTR_RO(iommu_dma_protection);
> 
> diff --git a/drivers/thunderbolt/nhi.c b/drivers/thunderbolt/nhi.c
> index c73da0532be4..9e396e283792 100644
> --- a/drivers/thunderbolt/nhi.c
> +++ b/drivers/thunderbolt/nhi.c
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1102,6 +1103,45 @@ static void nhi_check_quirks(struct tb_nhi *nhi)
>   nhi->quirks |= QUIRK_AUTO_CLEAR_INT;
>  }
> 
> +static int nhi_check_iommu_pdev(struct pci_dev *pdev, void *data)
> +{
> + if (!pdev->untrusted ||
> +	    !dev_iommu_capable(&pdev->dev, IOMMU_CAP_PRE_BOOT_PROTECTION))
> + return 0;
> + *(bool *)data = true;
> + return 1; /* Stop walking */
> +}
> +
> +static void nhi_check_iommu(struct tb_nhi *nhi)
> +{
> + struct pci_bus *bus = nhi->pdev->bus;
> + bool port_ok = false;
> +
> + /*
> +  * Ideally what we'd do here is grab every PCI device that
> +  * represents a tunnelling adapter for this NHI and check their
> +  * status directly, but unfortunately USB4 seems to make it
> +  * obnoxiously difficult to reliably make any correlation.
> +  *
> +  * So for now we'll have to bodge it... Hoping that the system
> +  * is at least sane enough that an adapter is in the same PCI
> +  * segment as its NHI, if we can find *something* on that segment
> +  * which meets the requirements for Kernel DMA Protection, we'll
> +  * take that to imply that 

RE: [PATCH 2/4 RESEND] dma-mapping: Add wrapper function to set dma_coherent

2022-03-18 Thread Michael Kelley (LINUX) via iommu
From: Robin Murphy  Sent: Friday, March 18, 2022 4:08 AM
> 
> On 2022-03-17 19:13, Michael Kelley (LINUX) wrote:
> > From: Robin Murphy  Sent: Thursday, March 17, 2022
> 10:20 AM
> >>
> >> On 2022-03-17 16:25, Michael Kelley via iommu wrote:
> >>> Add a wrapper function to set dma_coherent, avoiding the need for
> >>> complex #ifdef's when setting it in architecture independent code.
> >>
> >> No. It might happen to work out on the architectures you're looking at,
> >> but if Hyper-V were ever to support, say, AArch32 VMs you might see the
> >> problem. arch_setup_dma_ops() is the tool for this job.
> >>
> >
> > OK.   There's currently no vIOMMU in a Hyper-V guest, so presumably the
> > code would call arch_setup_dma_ops() with the dma_base, size, and iommu
> > parameters set to 0 and NULL.  This call can then be used in Patch 3 instead
> > of acpi_dma_configure(), and in the Patch 4 hv_dma_configure() function
> > as you suggested.  arch_setup_dma_ops() is not exported, so I'd need to
> > wrap it in a Hyper-V specific function in built-in code that is exported.
> >
> > But at some point in the future if there's a vIOMMU in Hyper-V guests,
> > this approach will need some rework.
> >
> > Does that make sense?  Thanks for your input and suggestions ...
> 
> Yes, that's essentially what I had in mind. If you did present a vIOMMU
> to the guest, presumably you'd either have to construct a regular
> IORT/VIOT, and thus end up adding the root complex to the ACPI namespace
> too so it can be referenced, at which point it would all get picked up
> by the standard machinery, or come up with some magic VMBus mechanism
> that would need a whole load of work to wire up in all the relevant
> places anyway.
> 
> (But please lean extremely heavily towards the former option!)

Agreed!

> 
> Thanks,
> Robin.


RE: [PATCH 4/4 RESEND] PCI: hv: Propagate coherence from VMbus device to PCI device

2022-03-18 Thread Michael Kelley (LINUX) via iommu
From: Robin Murphy  Sent: Friday, March 18, 2022 3:58 AM
> 
> On 2022-03-18 05:12, Michael Kelley (LINUX) wrote:
> > From: Robin Murphy  Sent: Thursday, March 17, 2022
> 10:15 AM
> >>
> >> On 2022-03-17 16:25, Michael Kelley via iommu wrote:
> >>> PCI pass-thru devices in a Hyper-V VM are represented as a VMBus
> >>> device and as a PCI device.  The coherence of the VMbus device is
> >>> set based on the VMbus node in ACPI, but the PCI device has no
> >>> ACPI node and defaults to not hardware coherent.  This results
> >>> in extra software coherence management overhead on ARM64 when
> >>> devices are hardware coherent.
> >>>
> >>> Fix this by propagating the coherence of the VMbus device to the
> >>> PCI device.  There's no effect on x86/x64 where devices are
> >>> always hardware coherent.
> >>>
> >>> Signed-off-by: Michael Kelley 
> >>> ---
> >>>drivers/pci/controller/pci-hyperv.c | 17 +
> >>>1 file changed, 13 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/drivers/pci/controller/pci-hyperv.c 
> >>> b/drivers/pci/controller/pci-hyperv.c
> >>> index ae0bc2f..14276f5 100644
> >>> --- a/drivers/pci/controller/pci-hyperv.c
> >>> +++ b/drivers/pci/controller/pci-hyperv.c
> >>> @@ -49,6 +49,7 @@
> >>>#include 
> >>>#include 
> >>>#include 
> >>> +#include 
> >>>#include 
> >>>
> >>>/*
> >>> @@ -2142,9 +2143,9 @@ static void hv_pci_remove_slots(struct hv_pcibus_device *hbus)
> >>> }
> >>>
> >>>/*
> >>> - * Set NUMA node for the devices on the bus
> >>> + * Set NUMA node and DMA coherence for the devices on the bus
> >>> */
> >>> -static void hv_pci_assign_numa_node(struct hv_pcibus_device *hbus)
> >>> +static void hv_pci_assign_properties(struct hv_pcibus_device *hbus)
> >>>{
> >>>   struct pci_dev *dev;
> >>>   struct pci_bus *bus = hbus->bridge->bus;
> >>> @@ -2167,6 +2168,14 @@ static void hv_pci_assign_numa_node(struct hv_pcibus_device *hbus)
> >>> 				numa_map_to_online_node(
> >>> 					hv_dev->desc.virtual_numa_node));
> >>>
> >>> + /*
> >>> +  * On ARM64, propagate the DMA coherence from the VMbus device
> >>> +  * to the corresponding PCI device. On x86/x64, these calls
> >>> +  * have no effect because DMA is always hardware coherent.
> >>> +  */
> >>> +	dev_set_dma_coherent(&dev->dev,
> >>> +			dev_is_dma_coherent(&hbus->hdev->device));
> >>
> >> Eww... if you really have to do this, I'd prefer to see a proper
> >> hv_dma_configure() helper implemented and wired up to
> >> pci_dma_configure(). Although since it's a generic property I guess at
> >> worst pci_dma_configure could perhaps propagate coherency from the host
> >> bridge to its children by itself in the absence of any other firmware
> >> info. And it's built-in so could use arch_setup_dma_ops() like everyone
> >> else.
> >>
> >
> > I'm not seeing an existing mechanism to provide a "helper" or override
> > of pci_dma_configure().   Could you elaborate?  Or is this something
> > that needs to be created?
> 
> I mean something like the diff below (other #includes omitted for
> clarity). Essentially if VMBus has its own way of describing parts of
> the system, then for those parts it's nice if it could fit into the same
> abstractions we use for firmware-based system description.

OK, got it.  Adding the VMbus special case into pci_dma_configure()
would work.  But after investigating further, let me throw out two
other possible solutions as well.

1)  It's easy to make the top-level VMbus node in the DSDT be
the ACPI companion for the pci_host bridge.  The function
pcibios_root_bridge_prepare() will do this if the pci-hyperv.c
driver sets sysdata->parent to that top-level VMbus node.  Then
pci_dma_configure() works as-is.  Also, commit 7d40c0f70, which
special-cases pcibios_root_bridge_prepare() for Hyper-V, becomes
unnecessary.   I've coded this approach and it seems to work, but
I don't know all the implications of making that VMbus node be
the ACPI companion, potentially for multiple pci_host bridges.

2)  Since there's no vIOMMU and we don't know how it will be
described if there should be one in the future, it's a bit premature
to try to set things up "correctly" now to handle it.  A simple
approach is to set dma_default_coherent to true if the top-level
VMbus node in the DSDT indicates coherent.  No other changes are
needed.  If there's a vIOMMU at some later time, then we add the
use of arch_setup_dma_ops() for each device.
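
For reference, the mechanism behind option 1 is roughly what the arm64
ACPI host-bridge code already does in pcibios_root_bridge_prepare(); this
sketch is from memory, so treat the exact field names as assumptions:

int pcibios_root_bridge_prepare(struct pci_host_bridge *bridge)
{
	if (!acpi_disabled) {
		/* sysdata->parent would point at the top-level VMbus node */
		struct pci_config_window *cfg = bridge->bus->sysdata;
		struct acpi_device *adev = to_acpi_device(cfg->parent);

		/* pci_dma_configure() then finds this ACPI companion */
		ACPI_COMPANION_SET(&bridge->dev, adev);
	}
	return 0;
}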

Any thoughts on these two approaches vs. adding the VMbus
special case into pci_dma_configure()?  My preference would be
to avoid adding that special case if one of the other two approaches
is reasonable.

Michael

> 
> Cheers,
> Robin.
> 
> ->8-
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 588588cfda48..7d92ccad1569 100644
> --- 

[PATCH v2 2/2] thunderbolt: Make iommu_dma_protection more accurate

2022-03-18 Thread Robin Murphy
Between me trying to get rid of iommu_present() and Mario wanting to
support the AMD equivalent of DMAR_PLATFORM_OPT_IN, scrutiny has shown
that the iommu_dma_protection attribute is being far too optimistic.
Even if an IOMMU might be present for some PCI segment in the system,
that doesn't necessarily mean it provides translation for the device(s)
we care about. Furthermore, all that DMAR_PLATFORM_OPT_IN really does
is tell us that memory was protected before the kernel was loaded, and
prevent the user from disabling the intel-iommu driver entirely. While
that lets us assume kernel integrity, what matters for actual runtime
DMA protection is whether we trust individual devices, based on the
"external facing" property that we expect firmware to describe for
Thunderbolt ports.

It's proven challenging to determine the appropriate ports accurately
given the variety of possible topologies, so while still not getting a
perfect answer, by putting enough faith in firmware we can at least get
a good bit closer. If we can see that any device near a Thunderbolt NHI
has all the requisites for Kernel DMA Protection, chances are that it
*is* a relevant port, but moreover that implies that firmware is playing
the game overall, so we'll use that to assume that all Thunderbolt ports
should be correctly marked and thus will end up fully protected.

CC: Mario Limonciello 
Signed-off-by: Robin Murphy 
---

v2: Give up trying to look for specific devices, just look for evidence
that firmware cares at all.

 drivers/thunderbolt/domain.c | 12 +++------
 drivers/thunderbolt/nhi.c    | 41 ++++++++++++++++++++++++++++++++
 include/linux/thunderbolt.h  |  2 ++
 3 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/drivers/thunderbolt/domain.c b/drivers/thunderbolt/domain.c
index 7018d959f775..2889a214dadc 100644
--- a/drivers/thunderbolt/domain.c
+++ b/drivers/thunderbolt/domain.c
@@ -7,9 +7,7 @@
  */
 
 #include 
-#include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -257,13 +255,9 @@ static ssize_t iommu_dma_protection_show(struct device *dev,
 struct device_attribute *attr,
 char *buf)
 {
-   /*
-* Kernel DMA protection is a feature where Thunderbolt security is
-* handled natively using IOMMU. It is enabled when IOMMU is
-* enabled and ACPI DMAR table has DMAR_PLATFORM_OPT_IN set.
-*/
-   return sprintf(buf, "%d\n",
-  iommu_present(&pci_bus_type) && dmar_platform_optin());
+   struct tb *tb = container_of(dev, struct tb, dev);
+
+   return sysfs_emit(buf, "%d\n", tb->nhi->iommu_dma_protection);
 }
 static DEVICE_ATTR_RO(iommu_dma_protection);
 
diff --git a/drivers/thunderbolt/nhi.c b/drivers/thunderbolt/nhi.c
index c73da0532be4..9e396e283792 100644
--- a/drivers/thunderbolt/nhi.c
+++ b/drivers/thunderbolt/nhi.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1102,6 +1103,45 @@ static void nhi_check_quirks(struct tb_nhi *nhi)
nhi->quirks |= QUIRK_AUTO_CLEAR_INT;
 }
 
+static int nhi_check_iommu_pdev(struct pci_dev *pdev, void *data)
+{
+   if (!pdev->untrusted ||
+	    !dev_iommu_capable(&pdev->dev, IOMMU_CAP_PRE_BOOT_PROTECTION))
+   return 0;
+   *(bool *)data = true;
+   return 1; /* Stop walking */
+}
+
+static void nhi_check_iommu(struct tb_nhi *nhi)
+{
+   struct pci_bus *bus = nhi->pdev->bus;
+   bool port_ok = false;
+
+   /*
+* Ideally what we'd do here is grab every PCI device that
+* represents a tunnelling adapter for this NHI and check their
+* status directly, but unfortunately USB4 seems to make it
+* obnoxiously difficult to reliably make any correlation.
+*
+* So for now we'll have to bodge it... Hoping that the system
+* is at least sane enough that an adapter is in the same PCI
+* segment as its NHI, if we can find *something* on that segment
+* which meets the requirements for Kernel DMA Protection, we'll
+* take that to imply that firmware is aware and has (hopefully)
+* done the right thing in general. We need to know that the PCI
+* layer has seen the ExternalFacingPort property and propagated
+* it to the "untrusted" flag that the IOMMU layer will then
+* enforce, but also that the IOMMU driver itself can be trusted
+* not to have been subverted by a pre-boot DMA attack.
+*/
+   while (bus->parent)
+   bus = bus->parent;
+
+   pci_walk_bus(bus, nhi_check_iommu_pdev, &port_ok);
+
+   nhi->iommu_dma_protection = port_ok;
+}
+
 static int nhi_init_msi(struct tb_nhi *nhi)
 {
struct pci_dev *pdev = nhi->pdev;
@@ -1219,6 +1259,7 @@ static int nhi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
return -ENOMEM;
 
nhi_check_quirks(nhi);
+   

[PATCH v2 1/2] iommu: Add capability for pre-boot DMA protection

2022-03-18 Thread Robin Murphy
VT-d's dmar_platform_optin() actually represents a combination of
properties fairly well standardised by Microsoft as "Pre-boot DMA
Protection" and "Kernel DMA Protection"[1]. As such, we can provide
interested consumers with an abstracted capability rather than
driver-specific interfaces that won't scale. We name it for the former
aspect since that's what external callers are most likely to be
interested in; the latter is for the IOMMU layer to handle itself.

Also use this as an opportunity to draw a line in the sand and add a
new interface so as not to introduce any more callers of iommu_capable()
which I also want to get rid of. For now it's a quick'n'dirty wrapper
function, but will evolve to subsume the internal interface in future.

[1] 
https://docs.microsoft.com/en-us/windows-hardware/design/device-experiences/oem-kernel-dma-protection

Suggested-by: Christoph Hellwig 
Signed-off-by: Robin Murphy 
---

v2: New patch

 drivers/iommu/intel/iommu.c | 2 ++
 include/linux/iommu.h   | 7 +++
 2 files changed, 9 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 0c7975848972..20d8e1f60068 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4817,6 +4817,8 @@ static bool intel_iommu_capable(enum iommu_cap cap)
return domain_update_iommu_snooping(NULL);
if (cap == IOMMU_CAP_INTR_REMAP)
return irq_remapping_enabled == 1;
+   if (cap == IOMMU_CAP_PRE_BOOT_PROTECTION)
+   return dmar_platform_optin();
 
return false;
 }
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 4a25f8241207..e16d54e15fee 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -107,6 +107,8 @@ enum iommu_cap {
   transactions */
IOMMU_CAP_INTR_REMAP,   /* IOMMU supports interrupt isolation */
IOMMU_CAP_NOEXEC,   /* IOMMU_NOEXEC flag */
+   IOMMU_CAP_PRE_BOOT_PROTECTION,  /* Firmware says it used the IOMMU for
+  DMA protection and we should too */
 };
 
 /* These are the possible reserved region types */
@@ -1042,6 +1044,11 @@ static inline size_t iommu_map_sgtable(struct iommu_domain *domain,
return iommu_map_sg(domain, iova, sgt->sgl, sgt->orig_nents, prot);
 }
 
+static inline bool dev_iommu_capable(struct device *dev, enum iommu_cap cap)
+{
+   return device_iommu_mapped(dev) && iommu_capable(dev->bus, cap);
+}
+
 #ifdef CONFIG_IOMMU_DEBUGFS
 extern struct dentry *iommu_debugfs_dir;
 void iommu_debugfs_setup(void);
-- 
2.28.0.dirty



[PATCH v2 0/2] thunderbolt: Make iommu_dma_protection more accurate

2022-03-18 Thread Robin Murphy
OK, here's chapter 3 in the story of "but I really just want to
remove iommu_present()", which is the second go at this approach
on the Thunderbolt end, but now improving the IOMMU end as well
since the subtlety of why that still matters has been clarified.
Previous thread here:

https://lore.kernel.org/linux-iommu/f887686a-e7e4-f031-97e8-dbeb1c088...@arm.com/T/

Thanks,
Robin.


Robin Murphy (2):
  iommu: Add capability for pre-boot DMA protection
  thunderbolt: Make iommu_dma_protection more accurate

 drivers/iommu/intel/iommu.c  |  2 ++
 drivers/thunderbolt/domain.c | 12 +++
 drivers/thunderbolt/nhi.c| 41 
 include/linux/iommu.h|  7 ++
 include/linux/thunderbolt.h  |  2 ++
 5 files changed, 55 insertions(+), 9 deletions(-)

-- 
2.28.0.dirty



[PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-03-18 Thread Jason Gunthorpe via iommu
iommufd can directly implement the /dev/vfio/vfio container IOCTLs by
mapping them into io_pagetable operations. Doing so allows the use of
iommufd by symlinking /dev/vfio/vfio to /dev/iommufd. Allowing VFIO to
SET_CONTAINER using an iommufd instead of a container fd is a follow-up
series.

Internally the compatibility API uses a normal IOAS object that, like
vfio, is automatically allocated when the first device is
attached.

Userspace can also query or set this IOAS object directly using the
IOMMU_VFIO_IOAS ioctl. This allows mixing and matching new iommufd only
features while still using the VFIO style map/unmap ioctls.
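
As a rough illustration, an unmodified type1 userspace flow keeps working
against the compat layer (a sketch using the real VFIO uapi from
<linux/vfio.h>; group/device attach is elided, and the compat IOAS is
auto-created when the first device attaches):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

int map_one_megabyte(void)
{
	/* /dev/vfio/vfio symlinked to /dev/iommufd */
	int container = open("/dev/vfio/vfio", O_RDWR);
	void *buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (unsigned long)buf,
		.iova = 0x100000,
		.size = 1 << 20,
	};

	/* Only TYPE1v2 semantics are implemented, per the gaps below */
	if (ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU) != 1)
		return -1;
	return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
}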

While this is enough to operate qemu, it is still a bit of a WIP with a
few gaps to be resolved:

 - Only the TYPE1v2 mode is supported where unmap cannot punch holes or
   split areas. The old mode can be implemented with a new operation to
   split an iopt_area into two without disturbing the iopt_pages or the
   domains, then unmapping a whole area as normal.

 - Resource limits rely on memory cgroups to bound what userspace can do
   instead of the module parameter dma_entry_limit.

 - VFIO P2P is not implemented. Avoiding the follow_pfn() mis-design will
   require some additional work to properly expose PFN lifecycle between
   VFIO and iommufd

 - Various components of the mdev API are not completed yet

 - Indefinite suspend of SW access (VFIO_DMA_MAP_FLAG_VADDR) is not
   implemented.

 - The 'dirty tracking' is not implemented

 - A full audit for pedantic compatibility details (eg errnos, etc) has
   not yet been done

 - powerpc SPAPR is left out, as it is not connected to the iommu_domain
   framework. My hope is that SPAPR will be moved into the iommu_domain
   framework as a special HW specific type and would expect power to
   support the generic interface through a normal iommu_domain.

Signed-off-by: Nicolin Chen 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommufd/Makefile  |   3 +-
 drivers/iommu/iommufd/iommufd_private.h |   6 +
 drivers/iommu/iommufd/main.c|  16 +-
 drivers/iommu/iommufd/vfio_compat.c | 401 
 include/uapi/linux/iommufd.h|  36 +++
 5 files changed, 456 insertions(+), 6 deletions(-)
 create mode 100644 drivers/iommu/iommufd/vfio_compat.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index ca28a135b9675f..2fdff04000b326 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -5,6 +5,7 @@ iommufd-y := \
io_pagetable.o \
ioas.o \
main.o \
-   pages.o
+   pages.o \
+   vfio_compat.o
 
 obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index e5c717231f851e..31628591591c17 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -67,6 +67,8 @@ void iopt_remove_reserved_iova(struct io_pagetable *iopt, void *owner);
 struct iommufd_ctx {
struct file *filp;
struct xarray objects;
+
+   struct iommufd_ioas *vfio_ioas;
 };
 
 struct iommufd_ctx *iommufd_fget(int fd);
@@ -78,6 +80,9 @@ struct iommufd_ucmd {
void *cmd;
 };
 
+int iommufd_vfio_ioctl(struct iommufd_ctx *ictx, unsigned int cmd,
+  unsigned long arg);
+
 /* Copy the response in ucmd->cmd back to userspace. */
 static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
   size_t cmd_len)
@@ -186,6 +191,7 @@ int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_map(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_copy(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd);
+int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd);
 
 /*
  * A HW pagetable is called an iommu_domain inside the kernel. This user object
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 6a895489fb5b82..f746fcff8145cb 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -122,6 +122,8 @@ bool iommufd_object_destroy_user(struct iommufd_ctx *ictx,
return false;
}
__xa_erase(&ictx->objects, obj->id);
+   if (ictx->vfio_ioas && &ictx->vfio_ioas->obj == obj)
+   ictx->vfio_ioas = NULL;
xa_unlock(&ictx->objects);
 
iommufd_object_ops[obj->type].destroy(obj);
@@ -219,27 +221,31 @@ static struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 __reserved),
IOCTL_OP(IOMMU_IOAS_UNMAP, iommufd_ioas_unmap, struct iommu_ioas_unmap,
 length),
+   IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas,
+__reserved),
 };
 
 static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
   unsigned long arg)
 {
+   struct iommufd_ctx *ictx = filp->private_data;
struct iommufd_ucmd ucmd = {};
struct iommufd_ioctl_op *op;
  

[PATCH RFC 03/12] iommufd: File descriptor, context, kconfig and makefiles

2022-03-18 Thread Jason Gunthorpe via iommu
This is the basic infrastructure of a new miscdevice to hold the iommufd
IOCTL API.

It provides:
 - A miscdevice to create file descriptors to run the IOCTL interface over

 - A table based ioctl dispatch and centralized extendable pre-validation
   step

 - An xarray mapping user IDs to kernel objects. The design has multiple
   inter-related objects held within a single IOMMUFD fd

 - A simple usage count to build a graph of object relations and protect
   against hostile userspace racing ioctls

The only IOCTL provided in this patch is the generic 'destroy any object
by handle' operation.
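
For illustration, driving that ioctl from userspace looks something like
this (the struct layout and ioctl number are hypothetical local mirrors of
this RFC's uapi, inferred from the ';' 80-FF range registered below; not a
released header):

#include <sys/ioctl.h>
#include <linux/types.h>

struct iommu_destroy {			/* hypothetical mirror of the uapi */
	__u32 size;
	__u32 id;
};
#define IOMMU_DESTROY _IO(';', 0x80)	/* assumed first code in 80-FF */

int destroy_object(int iommufd, __u32 id)
{
	struct iommu_destroy cmd = { .size = sizeof(cmd), .id = id };

	/* Generic teardown of any object handle userspace owns */
	return ioctl(iommufd, IOMMU_DESTROY, &cmd);
}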

Signed-off-by: Yi Liu 
Signed-off-by: Jason Gunthorpe 
---
 .../userspace-api/ioctl/ioctl-number.rst  |   1 +
 MAINTAINERS   |  10 +
 drivers/iommu/Kconfig |   1 +
 drivers/iommu/Makefile|   2 +-
 drivers/iommu/iommufd/Kconfig |  13 +
 drivers/iommu/iommufd/Makefile|   5 +
 drivers/iommu/iommufd/iommufd_private.h   |  95 ++
 drivers/iommu/iommufd/main.c  | 305 ++
 include/uapi/linux/iommufd.h  |  55 
 9 files changed, 486 insertions(+), 1 deletion(-)
 create mode 100644 drivers/iommu/iommufd/Kconfig
 create mode 100644 drivers/iommu/iommufd/Makefile
 create mode 100644 drivers/iommu/iommufd/iommufd_private.h
 create mode 100644 drivers/iommu/iommufd/main.c
 create mode 100644 include/uapi/linux/iommufd.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index e6fce2cbd99ed4..4a041dfc61fe95 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -105,6 +105,7 @@ Code  Seq#    Include File                             Comments
 '8'   all                                    SNP8023 advanced NIC card
 ';'   64-7F  linux/vfio.h
+';'   80-FF  linux/iommufd.h
 '='   00-3f  uapi/linux/ptp_clock.h
 '@'   00-0F  linux/radeonfb.h                conflict!
 '@'   00-0F  drivers/video/aty/aty128fb.c    conflict!
diff --git a/MAINTAINERS b/MAINTAINERS
index 1ba1e4af2cbc80..23a9c631051ee8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10038,6 +10038,16 @@ L: linux-m...@vger.kernel.org
 S: Maintained
 F: drivers/net/ethernet/sgi/ioc3-eth.c
 
+IOMMU FD
+M: Jason Gunthorpe 
+M: Kevin Tian 
+L: iommu@lists.linux-foundation.org
+S: Maintained
+F: Documentation/userspace-api/iommufd.rst
+F: drivers/iommu/iommufd/
+F: include/uapi/linux/iommufd.h
+F: include/linux/iommufd.h
+
 IOMAP FILESYSTEM LIBRARY
 M: Christoph Hellwig 
 M: Darrick J. Wong 
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 3eb68fa1b8cc02..754d2a9ff64623 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -177,6 +177,7 @@ config MSM_IOMMU
 
 source "drivers/iommu/amd/Kconfig"
 source "drivers/iommu/intel/Kconfig"
+source "drivers/iommu/iommufd/Kconfig"
 
 config IRQ_REMAP
bool "Support for Interrupt Remapping"
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index bc7f730edbb0be..6b38d12692b213 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
-obj-y += amd/ intel/ arm/
+obj-y += amd/ intel/ arm/ iommufd/
 obj-$(CONFIG_IOMMU_API) += iommu.o
 obj-$(CONFIG_IOMMU_API) += iommu-traces.o
 obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
diff --git a/drivers/iommu/iommufd/Kconfig b/drivers/iommu/iommufd/Kconfig
new file mode 100644
index 00..fddd453bb0e764
--- /dev/null
+++ b/drivers/iommu/iommufd/Kconfig
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config IOMMUFD
+   tristate "IOMMU Userspace API"
+   select INTERVAL_TREE
+   select IOMMU_API
+   default n
+   help
+ Provides /dev/iommu the user API to control the IOMMU subsystem as
+ it relates to managing IO page tables that point at user space memory.
+
+ This would commonly be used in combination with VFIO.
+
+ If you don't know what to do here, say N.
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
new file mode 100644
index 00..a07a8cffe937c6
--- /dev/null
+++ b/drivers/iommu/iommufd/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0-only
+iommufd-y := \
+   main.o
+
+obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
new file mode 100644
index 00..2d0bba3965be1a
--- /dev/null
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -0,0 +1,95 @@
+/* SPDX-License-Identifier: GPL-2.0-only */

[PATCH RFC 12/12] iommufd: Add a selftest

2022-03-18 Thread Jason Gunthorpe via iommu
Cover the essential functionality of the iommufd with a directed
test. This aims to achieve reasonable functional coverage using the
in-kernel self test framework.

It provides a mock for the iommu_domain that allows it to run without any
HW and the mocking provides a way to directly validate that the PFNs
loaded into the iommu_domain are correct.

The mock also simulates the rare case of PAGE_SIZE > iommu page size as
the mock will typically operate at a 2K iommu page size. This allows
exercising all of the calculations to support this mismatch.

This allows achieving high coverage of the corner cases in the iopt_pages.

However, it is an unusually invasive config option to enable all of
this. The config option should never be enabled in a production kernel.

Signed-off-by: Jason Gunthorpe 
Signed-off-by: Nicolin Chen 
---
 drivers/iommu/iommufd/Kconfig            |    9 +
 drivers/iommu/iommufd/Makefile           |    2 +
 drivers/iommu/iommufd/iommufd_private.h  |    9 +
 drivers/iommu/iommufd/iommufd_test.h     |   65 ++
 drivers/iommu/iommufd/main.c             |   12 +
 drivers/iommu/iommufd/pages.c            |    4 +
 drivers/iommu/iommufd/selftest.c         |  495 +++++
 tools/testing/selftests/Makefile         |    1 +
 tools/testing/selftests/iommu/.gitignore |    2 +
 tools/testing/selftests/iommu/Makefile   |   11 +
 tools/testing/selftests/iommu/config     |    2 +
 tools/testing/selftests/iommu/iommufd.c  | 1225 ++++++++++++++
 12 files changed, 1837 insertions(+)
 create mode 100644 drivers/iommu/iommufd/iommufd_test.h
 create mode 100644 drivers/iommu/iommufd/selftest.c
 create mode 100644 tools/testing/selftests/iommu/.gitignore
 create mode 100644 tools/testing/selftests/iommu/Makefile
 create mode 100644 tools/testing/selftests/iommu/config
 create mode 100644 tools/testing/selftests/iommu/iommufd.c

diff --git a/drivers/iommu/iommufd/Kconfig b/drivers/iommu/iommufd/Kconfig
index fddd453bb0e764..9b41fde7c839c5 100644
--- a/drivers/iommu/iommufd/Kconfig
+++ b/drivers/iommu/iommufd/Kconfig
@@ -11,3 +11,12 @@ config IOMMUFD
  This would commonly be used in combination with VFIO.
 
  If you don't know what to do here, say N.
+
+config IOMMUFD_TEST
+   bool "IOMMU Userspace API Test support"
+   depends on IOMMUFD
+   depends on RUNTIME_TESTING_MENU
+   default n
+   help
+ This is dangerous, do not enable unless running
+ tools/testing/selftests/iommu
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index 2fdff04000b326..8aeba81800c512 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -8,4 +8,6 @@ iommufd-y := \
pages.o \
vfio_compat.o
 
+iommufd-$(CONFIG_IOMMUFD_TEST) += selftest.o
+
 obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 31628591591c17..6f11470c8ea677 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -102,6 +102,9 @@ enum iommufd_object_type {
IOMMUFD_OBJ_NONE,
IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
IOMMUFD_OBJ_DEVICE,
+#ifdef CONFIG_IOMMUFD_TEST
+   IOMMUFD_OBJ_SELFTEST,
+#endif
IOMMUFD_OBJ_HW_PAGETABLE,
IOMMUFD_OBJ_IOAS,
IOMMUFD_OBJ_MAX,
@@ -219,4 +222,10 @@ void iommufd_hw_pagetable_destroy(struct iommufd_object *obj);
 
 void iommufd_device_destroy(struct iommufd_object *obj);
 
+#ifdef CONFIG_IOMMUFD_TEST
+int iommufd_test(struct iommufd_ucmd *ucmd);
+void iommufd_selftest_destroy(struct iommufd_object *obj);
+extern size_t iommufd_test_memory_limit;
+#endif
+
 #endif
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
new file mode 100644
index 00..d22ef484af1a90
--- /dev/null
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -0,0 +1,65 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ */
+#ifndef _UAPI_IOMMUFD_TEST_H
+#define _UAPI_IOMMUFD_TEST_H
+
+#include 
+#include 
+
+enum {
+   IOMMU_TEST_OP_ADD_RESERVED,
+   IOMMU_TEST_OP_MOCK_DOMAIN,
+   IOMMU_TEST_OP_MD_CHECK_MAP,
+   IOMMU_TEST_OP_MD_CHECK_REFS,
+   IOMMU_TEST_OP_ACCESS_PAGES,
+   IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
+};
+
+enum {
+   MOCK_APERTURE_START = 1UL << 24,
+   MOCK_APERTURE_LAST = (1UL << 31) - 1,
+};
+
+enum {
+   MOCK_FLAGS_ACCESS_WRITE = 1 << 0,
+};
+
+struct iommu_test_cmd {
+   __u32 size;
+   __u32 op;
+   __u32 id;
+   union {
+   struct {
+   __u32 device_id;
+   } mock_domain;
+   struct {
+   __aligned_u64 start;
+   __aligned_u64 length;
+   } add_reserved;
+   struct {
+   __aligned_u64 iova;
+   __aligned_u64 length;
+   

[PATCH RFC 00/12] IOMMUFD Generic interface

2022-03-18 Thread Jason Gunthorpe via iommu
iommufd is the user API to control the IOMMU subsystem as it relates to
managing IO page tables that point at user space memory.

It takes over from drivers/vfio/vfio_iommu_type1.c (aka the VFIO
container) which is the VFIO specific interface for a similar idea.

We see a broad need for extended features, some being highly IOMMU device
specific:
 - Binding iommu_domain's to PASID/SSID
 - Userspace page tables, for ARM, x86 and S390
 - Kernel bypass'd invalidation of user page tables
 - Re-use of the KVM page table in the IOMMU
 - Dirty page tracking in the IOMMU
 - Runtime Increase/Decrease of IOPTE size
 - PRI support with faults resolved in userspace

As well as a need to access these features beyond just VFIO, VDPA for
instance, but other classes of accelerator HW are touching on these areas
now too.

The v1 series proposed re-using the VFIO type 1 data structure, however it
was suggested that if we are doing this big update then we should also
come with a data structure that solves the limitations that VFIO type1
has. Notably this addresses:

 - Multiple IOAS/'containers' and multiple domains inside a single FD

 - Single-pin operation no matter how many domains and containers use
   a page

 - A fine grained locking scheme supporting user managed concurrency for
   multi-threaded map/unmap

 - A pre-registration mechanism to optimize vIOMMU use cases by
   pre-pinning pages

 - Extended ioctl API that can manage these new objects and exposes
   domains directly to user space

 - domains are sharable between subsystems, eg VFIO and VDPA

The bulk of this code is a new data structure design to track how the
IOVAs are mapped to PFNs.

iommufd intends to be general and consumable by any driver that wants to
DMA to userspace. From a driver perspective it can largely be dropped in
in place of iommu_attach_device() and provides a uniform full feature set
to all consumers.

As this is a larger project this series is the first step. This series
provides the iommufd "generic interface", which is designed to be suitable
for applications like DPDK and VMM flows that are not optimized to
specific HW scenarios. It is close to being a drop in replacement for the
existing VFIO type 1.

This is part two of three for an initial sequence:
 - Move IOMMU Group security into the iommu layer
   
https://lore.kernel.org/linux-iommu/20220218005521.172832-1-baolu...@linux.intel.com/
 * Generic IOMMUFD implementation
 - VFIO ability to consume IOMMUFD
   An early exploration of this is available here:
https://github.com/luxis1999/iommufd/commits/iommufd-v5.17-rc6

Various parts of the above extended features are in WIP stages currently
to define how their IOCTL interface should work.

At this point, using the draft VFIO series, unmodified qemu has been
tested to operate using iommufd on x86 and ARM systems.

Several people have contributed directly to this work: Eric Auger, Kevin
Tian, Lu Baolu, Nicolin Chen, Yi L Liu. Many more have participated in the
discussions that lead here, and provided ideas. Thanks to all!

This is on github: https://github.com/jgunthorpe/linux/commits/iommufd

# S390 in-kernel page table walker
Cc: Niklas Schnelle 
Cc: Matthew Rosato 
# AMD Dirty page tracking
Cc: Joao Martins 
# ARM SMMU Dirty page tracking
Cc: Keqian Zhu 
Cc: Shameerali Kolothum Thodi 
# ARM SMMU nesting
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
# Map/unmap performance
Cc: Daniel Jordan 
# VDPA
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
# Power
Cc: David Gibson 
# vfio
Cc: Alex Williamson 
Cc: Cornelia Huck 
Cc: k...@vger.kernel.org
# iommu
Cc: iommu@lists.linux-foundation.org
# Collaborators
Cc: "Chaitanya Kulkarni" 
Cc: Nicolin Chen 
Cc: Lu Baolu 
Cc: Kevin Tian 
Cc: Yi Liu 
Signed-off-by: Jason Gunthorpe 

Jason Gunthorpe (11):
  interval-tree: Add a utility to iterate over spans in an interval tree
  iommufd: File descriptor, context, kconfig and makefiles
  kernel/user: Allow user::locked_vm to be usable for iommufd
  iommufd: PFN handling for iopt_pages
  iommufd: Algorithms for PFN storage
  iommufd: Data structure to provide IOVA to PFN mapping
  iommufd: IOCTLs for the io_pagetable
  iommufd: Add a HW pagetable object
  iommufd: Add kAPI toward external drivers
  iommufd: vfio container FD ioctl compatibility
  iommufd: Add a selftest

Kevin Tian (1):
  iommufd: Overview documentation

 Documentation/userspace-api/index.rst |1 +
 .../userspace-api/ioctl/ioctl-number.rst  |1 +
 Documentation/userspace-api/iommufd.rst   |  224 +++
 MAINTAINERS   |   10 +
 drivers/iommu/Kconfig |1 +
 drivers/iommu/Makefile|2 +-
 drivers/iommu/iommufd/Kconfig |   22 +
 drivers/iommu/iommufd/Makefile|   13 +
 drivers/iommu/iommufd/device.c|  274 
 drivers/iommu/iommufd/hw_pagetable.c  |  142 ++
 drivers/iommu/iommufd/io_pagetable.c  |  890 +++
 

[PATCH RFC 10/12] iommufd: Add kAPI toward external drivers

2022-03-18 Thread Jason Gunthorpe via iommu
Add the four functions external drivers need to connect physical DMA to
the IOMMUFD:

iommufd_bind_pci_device() / iommufd_unbind_device()
  Register the device with iommufd and establish security isolation.

iommufd_device_attach() / iommufd_device_detach()
  Connect a bound device to a page table

binding a device creates a device object ID in the uAPI, however the
generic API provides no IOCTLs to manipulate them.

An API to support the VFIO mdevs is a WIP at this point, but likely
involves requesting a struct iommufd_device without providing any struct
device, and then using the pin/unpin/rw operations on that iommufd_device.
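
As a rough sketch of the driver-side flow (illustrative only; everything
about iommufd_device_attach() beyond taking the device and a page table ID
is an assumption here):

#include <linux/err.h>
#include <linux/iommufd.h>
#include <linux/pci.h>

static int example_bind_and_attach(int iommufd, struct pci_dev *pdev,
				   u32 pt_id)
{
	struct iommufd_device *idev;
	u32 device_id;
	int rc;

	/* Takes ownership of the whole RID; the caller must have set
	 * driver_managed_dma and not yet touched the device. */
	idev = iommufd_bind_pci_device(iommufd, pdev, &device_id);
	if (IS_ERR(idev))
		return PTR_ERR(idev);

	/* pt_id may name an IOAS or a hw_pagetable; on success it could be
	 * read back as the auto-selected hw_pagetable ID. */
	rc = iommufd_device_attach(idev, &pt_id);	/* signature assumed */
	if (rc)
		iommufd_unbind_device(idev);
	return rc;
}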

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommufd/Makefile  |   1 +
 drivers/iommu/iommufd/device.c  | 274 
 drivers/iommu/iommufd/iommufd_private.h |   4 +
 drivers/iommu/iommufd/main.c|   3 +
 include/linux/iommufd.h |  50 +
 5 files changed, 332 insertions(+)
 create mode 100644 drivers/iommu/iommufd/device.c
 create mode 100644 include/linux/iommufd.h

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index e13e971aa28c60..ca28a135b9675f 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
+   device.o \
hw_pagetable.o \
io_pagetable.o \
ioas.o \
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
new file mode 100644
index 00..c20bc9eab07e13
--- /dev/null
+++ b/drivers/iommu/iommufd/device.c
@@ -0,0 +1,274 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "iommufd_private.h"
+
+/*
+ * An iommufd_device object represents the binding relationship between a
+ * consuming driver and the iommufd. These objects are created/destroyed by
+ * external drivers, not by userspace.
+ */
+struct iommufd_device {
+   struct iommufd_object obj;
+   struct iommufd_ctx *ictx;
+   struct iommufd_hw_pagetable *hwpt;
+   /* Head at iommufd_hw_pagetable::devices */
+   struct list_head devices_item;
+   /* always the physical device */
+   struct device *dev;
+   struct iommu_group *group;
+};
+
+void iommufd_device_destroy(struct iommufd_object *obj)
+{
+   struct iommufd_device *idev =
+   container_of(obj, struct iommufd_device, obj);
+
+   iommu_group_release_dma_owner(idev->group);
+   iommu_group_put(idev->group);
+   fput(idev->ictx->filp);
+}
+
+/**
+ * iommufd_bind_pci_device - Bind a physical device to an iommu fd
+ * @fd: iommufd file descriptor.
+ * @pdev: Pointer to a physical PCI device struct
+ * @id: Output ID number to return to userspace for this device
+ *
+ * A successful bind establishes an ownership over the device and returns
+ * struct iommufd_device pointer, otherwise returns error pointer.
+ *
+ * A driver using this API must set driver_managed_dma and must not touch
+ * the device until this routine succeeds and establishes ownership.
+ *
+ * Binding a PCI device places the entire RID under iommufd control.
+ *
+ * The caller must undo this with iommufd_unbind_device()
+ */
+struct iommufd_device *iommufd_bind_pci_device(int fd, struct pci_dev *pdev,
+  u32 *id)
+{
+   struct iommufd_device *idev;
+   struct iommufd_ctx *ictx;
+   struct iommu_group *group;
+   int rc;
+
+   ictx = iommufd_fget(fd);
+   if (!ictx)
+   return ERR_PTR(-EINVAL);
+
+   group = iommu_group_get(&pdev->dev);
+   if (!group) {
+   rc = -ENODEV;
+   goto out_file_put;
+   }
+
+   /*
+* FIXME: Use a device-centric iommu api and this won't work with
+* multi-device groups
+*/
+   rc = iommu_group_claim_dma_owner(group, ictx->filp);
+   if (rc)
+   goto out_group_put;
+
+   idev = iommufd_object_alloc(ictx, idev, IOMMUFD_OBJ_DEVICE);
+   if (IS_ERR(idev)) {
+   rc = PTR_ERR(idev);
+   goto out_release_owner;
+   }
+   idev->ictx = ictx;
+   idev->dev = >dev;
+   /* The calling driver is a user until iommufd_unbind_device() */
+   refcount_inc(&idev->obj.users);
+   /* group refcount moves into iommufd_device */
+   idev->group = group;
+
+   /*
+* If the caller fails after this success it must call
+* iommufd_unbind_device() which is safe since we hold this refcount.
+* This also means the device is a leaf in the graph and no other object
+* can take a reference on it.
+*/
+   iommufd_object_finalize(ictx, &idev->obj);
+   *id = idev->obj.id;
+   return idev;
+
+out_release_owner:
+   iommu_group_release_dma_owner(group);
+out_group_put:
+   

[PATCH RFC 01/12] interval-tree: Add a utility to iterate over spans in an interval tree

2022-03-18 Thread Jason Gunthorpe via iommu
The span iterator travels over the indexes of the interval_tree, not the
nodes, and classifies spans of indexes as either 'used' or 'hole'.

'used' spans are fully covered by nodes in the tree and 'hole' spans have
no node intersecting the span.

This is done greedily such that spans are maximally sized and every
iteration step switches between used/hole.

As an example a trivial allocator can be written as:

for (interval_tree_span_iter_first(&span, itree, 0, ULONG_MAX);
     !interval_tree_span_iter_done(&span);
     interval_tree_span_iter_next(&span))
	if (span.is_hole &&
	    span.last_hole - span.start_hole >= allocation_size - 1)
		return span.start_hole;

With all the tricky boundary conditions handled by the library code.

The following iommufd patches have several algorithms for two of its
overlapping node interval trees that are significantly simplified with
this kind of iteration primitive. As it seems generally useful, put it
into lib/.

Signed-off-by: Jason Gunthorpe 
---
 include/linux/interval_tree.h | 41 +++
 lib/interval_tree.c   | 98 +++
 2 files changed, 139 insertions(+)

diff --git a/include/linux/interval_tree.h b/include/linux/interval_tree.h
index 288c26f50732d7..3817af0fa54028 100644
--- a/include/linux/interval_tree.h
+++ b/include/linux/interval_tree.h
@@ -27,4 +27,45 @@ extern struct interval_tree_node *
 interval_tree_iter_next(struct interval_tree_node *node,
unsigned long start, unsigned long last);
 
+/*
+ * This iterator travels over spans in an interval tree. It does not return
+ * nodes but classifies each span as either a hole, where no nodes intersect, or
+ * a used, which is fully covered by nodes. Each iteration step toggles between
+ * hole and used until the entire range is covered. The returned spans always
+ * fully cover the requested range.
+ *
+ * The iterator is greedy, it always returns the largest hole or used possible,
+ * consolidating all consecutive nodes.
+ *
+ * Only is_hole, start_hole/used and last_hole/used are part of the external
+ * interface.
+ */
+struct interval_tree_span_iter {
+   struct interval_tree_node *nodes[2];
+   unsigned long first_index;
+   unsigned long last_index;
+   union {
+   unsigned long start_hole;
+   unsigned long start_used;
+   };
+   union {
+   unsigned long last_hole;
+   unsigned long last_used;
+   };
+   /* 0 == used, 1 == is_hole, -1 == done iteration */
+   int is_hole;
+};
+
+void interval_tree_span_iter_first(struct interval_tree_span_iter *state,
+  struct rb_root_cached *itree,
+  unsigned long first_index,
+  unsigned long last_index);
+void interval_tree_span_iter_next(struct interval_tree_span_iter *state);
+
+static inline bool
+interval_tree_span_iter_done(struct interval_tree_span_iter *state)
+{
+   return state->is_hole == -1;
+}
+
 #endif /* _LINUX_INTERVAL_TREE_H */
diff --git a/lib/interval_tree.c b/lib/interval_tree.c
index 593ce56ece5050..5dff0da020923f 100644
--- a/lib/interval_tree.c
+++ b/lib/interval_tree.c
@@ -15,3 +15,101 @@ EXPORT_SYMBOL_GPL(interval_tree_insert);
 EXPORT_SYMBOL_GPL(interval_tree_remove);
 EXPORT_SYMBOL_GPL(interval_tree_iter_first);
 EXPORT_SYMBOL_GPL(interval_tree_iter_next);
+
+static void
+interval_tree_span_iter_next_gap(struct interval_tree_span_iter *state)
+{
+   struct interval_tree_node *cur = state->nodes[1];
+
+   /*
+* Roll nodes[1] into nodes[0] by advancing nodes[1] to the end of a
+* contiguous span of nodes. This makes nodes[0]->last the end of that
+* contiguous span of valid indexes that started at the original
+* nodes[1]->start. nodes[1] is now the next node and a hole is between
+* nodes[0] and [1].
+*/
+   state->nodes[0] = cur;
+   do {
+   if (cur->last > state->nodes[0]->last)
+   state->nodes[0] = cur;
+   cur = interval_tree_iter_next(cur, state->first_index,
+ state->last_index);
+   } while (cur && (state->nodes[0]->last >= cur->start ||
+state->nodes[0]->last + 1 == cur->start));
+   state->nodes[1] = cur;
+}
+
+void interval_tree_span_iter_first(struct interval_tree_span_iter *iter,
+  struct rb_root_cached *itree,
+  unsigned long first_index,
+  unsigned long last_index)
+{
+   iter->first_index = first_index;
+   iter->last_index = last_index;
+   iter->nodes[0] = NULL;
+   iter->nodes[1] =
+   interval_tree_iter_first(itree, first_index, last_index);
+   if (!iter->nodes[1]) {
+   /* No nodes 

[PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-18 Thread Jason Gunthorpe via iommu
Following the pattern of io_uring, perf, skb, and bpf, iommufd will use
user->locked_vm for accounting pinned pages. Ensure the value is included
in the struct and export free_uid() as iommufd is modular.

user->locked_vm is the correct accounting to use for ulimit because it is
per-user, and the ulimit is not supposed to be per-process. Other
places (vfio, vdpa and infiniband) have used mm->pinned_vm and/or
mm->locked_vm for accounting pinned pages, but this is only per-process
and inconsistent with the majority of the kernel.
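
For reference, the accounting pattern this enables has the same shape as
io_uring's; a minimal sketch, not the actual iommufd code:

#include <linux/sched/user.h>
#include <linux/sched/signal.h>
#include <linux/mm.h>

static int example_account_pinned(struct user_struct *user,
				  unsigned long npages)
{
	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
	unsigned long cur, new;

	do {
		cur = atomic_long_read(&user->locked_vm);
		new = cur + npages;
		if (new > limit)
			return -ENOMEM;	/* per-user ulimit, as argued above */
	} while (atomic_long_cmpxchg(&user->locked_vm, cur, new) != cur);
	return 0;
}

static void example_unaccount_pinned(struct user_struct *user,
				     unsigned long npages)
{
	atomic_long_sub(npages, &user->locked_vm);
}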

Signed-off-by: Jason Gunthorpe 
---
 include/linux/sched/user.h | 2 +-
 kernel/user.c  | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 00ed419dd46413..c47dae71dad3c8 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -24,7 +24,7 @@ struct user_struct {
kuid_t uid;
 
 #if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \
-defined(CONFIG_NET) || defined(CONFIG_IO_URING)
+defined(CONFIG_NET) || defined(CONFIG_IO_URING) || 
IS_ENABLED(CONFIG_IOMMUFD)
atomic_long_t locked_vm;
 #endif
 #ifdef CONFIG_WATCH_QUEUE
diff --git a/kernel/user.c b/kernel/user.c
index e2cf8c22b539a7..d667debeafd609 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -185,6 +185,7 @@ void free_uid(struct user_struct *up)
if (refcount_dec_and_lock_irqsave(&up->__count, &uidhash_lock, &flags))
free_user(up, flags);
 }
+EXPORT_SYMBOL_GPL(free_uid);
 
 struct user_struct *alloc_uid(kuid_t uid)
 {
-- 
2.35.1



[PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping

2022-03-18 Thread Jason Gunthorpe via iommu
This is the remainder of the IOAS data structure. Provide an object called
an io_pagetable that is composed of iopt_areas pointing at iopt_pages,
along with a list of iommu_domains that mirror the IOVA to PFN map.

At the top this is a simple interval tree of iopt_areas indicating the map
of IOVA to iopt_pages. An xarray keeps track of a list of domains. Based
on the attached domains there is a minimum alignment for areas (which may
be smaller than PAGE_SIZE) and an interval tree of reserved IOVA that
can't be mapped.

The concept of a 'user' refers to something like a VFIO mdev that is
accessing the IOVA and using a 'struct page *' for CPU based access.

Externally an API is provided that matches the requirements of the IOCTL
interface for map/unmap and domain attachment.

The API provides a 'copy' primitive to establish a new IOVA map in a
different IOAS from an existing mapping.

This is designed to support a pre-registration flow where userspace would
set up a dummy IOAS with no domains, map in memory and then establish a
user to pin all PFNs into the xarray.

Copy can then be used to create new IOVA mappings in a different IOAS,
with iommu_domains attached. Upon copy the PFNs will be read out of the
xarray and mapped into the iommu_domains, avoiding any pin_user_pages()
overheads.
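
A hedged sketch of that flow from userspace (the ioctl name matches this
series, but the struct layout here is a hypothetical local mirror, not the
actual uapi header):

#include <sys/ioctl.h>
#include <linux/types.h>

struct iommu_ioas_copy {	/* hypothetical mirror of the RFC uapi */
	__u32 size;
	__u32 flags;
	__u32 dst_ioas_id;
	__u32 src_ioas_id;
	__aligned_u64 length;
	__aligned_u64 dst_iova;
	__aligned_u64 src_iova;
};
#define IOMMU_IOAS_COPY _IO(';', 0x85)	/* code within 80-FF; assumed */

/* src_ioas is the pre-registered (pinned, domain-less) IOAS; copying into
 * a domain-backed IOAS reuses PFNs already stored in the xarray, with no
 * new pin_user_pages() on this path. */
int copy_mapping(int iommufd, __u32 src_ioas, __u32 dst_ioas,
		 __u64 src_iova, __u64 dst_iova, __u64 length)
{
	struct iommu_ioas_copy cmd = {
		.size = sizeof(cmd),
		.dst_ioas_id = dst_ioas,
		.src_ioas_id = src_ioas,
		.length = length,
		.dst_iova = dst_iova,
		.src_iova = src_iova,
	};

	return ioctl(iommufd, IOMMU_IOAS_COPY, &cmd);
}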

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommufd/Makefile  |   1 +
 drivers/iommu/iommufd/io_pagetable.c| 890 
 drivers/iommu/iommufd/iommufd_private.h |  35 +
 3 files changed, 926 insertions(+)
 create mode 100644 drivers/iommu/iommufd/io_pagetable.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index 05a0e91e30afad..b66a8c47ff55ec 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
+   io_pagetable.o \
main.o \
pages.o
 
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
new file mode 100644
index 00..f9f3b06946bfb9
--- /dev/null
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -0,0 +1,890 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ *
+ * The io_pagetable is the top of datastructure that maps IOVA's to PFNs. The
+ * PFNs can be placed into an iommu_domain, or returned to the caller as a page
+ * list for access by an in-kernel user.
+ *
+ * The datastructure uses the iopt_pages to optimize the storage of the PFNs
+ * between the domains and xarray.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "io_pagetable.h"
+
+static unsigned long iopt_area_iova_to_index(struct iopt_area *area,
+unsigned long iova)
+{
+   if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+   WARN_ON(iova < iopt_area_iova(area) ||
+   iova > iopt_area_last_iova(area));
+   return (iova - (iopt_area_iova(area) & PAGE_MASK)) / PAGE_SIZE;
+}
+
+static struct iopt_area *iopt_find_exact_area(struct io_pagetable *iopt,
+ unsigned long iova,
+ unsigned long last_iova)
+{
+   struct iopt_area *area;
+
+   area = iopt_area_iter_first(iopt, iova, last_iova);
+   if (!area || !area->pages || iopt_area_iova(area) != iova ||
+   iopt_area_last_iova(area) != last_iova)
+   return NULL;
+   return area;
+}
+
+static bool __alloc_iova_check_hole(struct interval_tree_span_iter *span,
+   unsigned long length,
+   unsigned long iova_alignment,
+   unsigned long page_offset)
+{
+   if (!span->is_hole || span->last_hole - span->start_hole < length - 1)
+   return false;
+
+   span->start_hole =
+   ALIGN(span->start_hole, iova_alignment) | page_offset;
+   if (span->start_hole > span->last_hole ||
+   span->last_hole - span->start_hole < length - 1)
+   return false;
+   return true;
+}
+
+/*
+ * Automatically find a block of IOVA that is not being used and not reserved.
+ * Does not return a 0 IOVA even if it is valid.
+ */
+static int iopt_alloc_iova(struct io_pagetable *iopt, unsigned long *iova,
+  unsigned long uptr, unsigned long length)
+{
+   struct interval_tree_span_iter reserved_span;
+   unsigned long page_offset = uptr % PAGE_SIZE;
+   struct interval_tree_span_iter area_span;
+   unsigned long iova_alignment;
+
+   lockdep_assert_held(&iopt->iova_rwsem);
+
+   /* Protect roundup_pow_of_two() from overflow */
+   if (length == 0 || length >= ULONG_MAX / 2)
+   return -EOVERFLOW;
+
+   /*
+* Keep alignment present in the uptr when building the IOVA, this
+* increases the chance we can map a THP.
+

[PATCH RFC 09/12] iommufd: Add a HW pagetable object

2022-03-18 Thread Jason Gunthorpe via iommu
The hw_pagetable object exposes the internal struct iommu_domain's to
userspace. An iommu_domain is required when any DMA device attaches to an
IOAS to control the io page table through the iommu driver.

For compatibility with VFIO the hw_pagetable is automatically created when
a DMA device is attached to the IOAS. If a compatible iommu_domain already
exists then the hw_pagetable associated with it is used for the
attachment.

In the initial series there is no iommufd uAPI for the hw_pagetable
object. The next patch provides driver facing APIs for IO page table
attachment that allows drivers to accept either an IOAS or a hw_pagetable
ID and for the driver to return the hw_pagetable ID that was auto-selected
from an IOAS. The expectation is the driver will provide uAPI through its
own FD for attaching its device to iommufd. This allows userspace to learn
the mapping of devices to iommu_domains and to override the automatic
attachment.

The future HW specific interface will allow userspace to create
hw_pagetable objects using iommu_domains with IOMMU driver specific
parameters. This infrastructure will allow linking those domains to IOAS's
and devices.
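
As a sketch of the expected driver-facing flow (the names below are
illustrative; the real API only arrives in the next patch, and @idev is
assumed to have been bound through the driver's own uAPI already):

        u32 pt_id = ioas_id;    /* accepts an IOAS or hw_pagetable ID */
        int rc;

        rc = iommufd_device_attach(idev, &pt_id);
        if (rc)
                return rc;
        /*
         * pt_id now holds the auto-selected hw_pagetable ID, which the
         * driver can report back to userspace so later attachments can
         * name the same iommu_domain explicitly.
         */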

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommufd/Makefile  |   1 +
 drivers/iommu/iommufd/hw_pagetable.c| 142 
 drivers/iommu/iommufd/ioas.c|   4 +
 drivers/iommu/iommufd/iommufd_private.h |  35 ++
 drivers/iommu/iommufd/main.c|   3 +
 5 files changed, 185 insertions(+)
 create mode 100644 drivers/iommu/iommufd/hw_pagetable.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index 2b4f36f1b72f9d..e13e971aa28c60 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
+   hw_pagetable.o \
io_pagetable.o \
ioas.o \
main.o \
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
new file mode 100644
index 00..bafd7d07918bfd
--- /dev/null
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#include 
+
+#include "iommufd_private.h"
+
+void iommufd_hw_pagetable_destroy(struct iommufd_object *obj)
+{
+   struct iommufd_hw_pagetable *hwpt =
+   container_of(obj, struct iommufd_hw_pagetable, obj);
+   struct iommufd_ioas *ioas = hwpt->ioas;
+
+   WARN_ON(!list_empty(&hwpt->devices));
+   mutex_lock(&ioas->mutex);
+   list_del(&hwpt->auto_domains_item);
+   mutex_unlock(&ioas->mutex);
+
+   iommu_domain_free(hwpt->domain);
+   refcount_dec(&hwpt->ioas->obj.users);
+   mutex_destroy(&hwpt->devices_lock);
+}
+
+/*
+ * When automatically managing the domains we search for a compatible domain in
+ * the iopt and if one is found use it, otherwise create a new domain.
+ * Automatic domain selection will never pick a manually created domain.
+ */
+static struct iommufd_hw_pagetable *
+iommufd_hw_pagetable_auto_get(struct iommufd_ctx *ictx,
+ struct iommufd_ioas *ioas, struct device *dev)
+{
+   struct iommufd_hw_pagetable *hwpt;
+   int rc;
+
+   /*
+* There is no differentiation when domains are allocated, so any domain
+* from the right ops is interchangeable with any other.
+*/
+   mutex_lock(&ioas->mutex);
+   list_for_each_entry (hwpt, &ioas->auto_domains, auto_domains_item) {
+   /*
+* FIXME: We really need an op from the driver to test if a
+* device is compatible with a domain. This thing from VFIO
+* works sometimes.
+*/
+   if (hwpt->domain->ops == dev_iommu_ops(dev)->default_domain_ops) {
+   if (refcount_inc_not_zero(&hwpt->obj.users)) {
+   mutex_unlock(&ioas->mutex);
+   return hwpt;
+   }
+   }
+   }
+
+   hwpt = iommufd_object_alloc(ictx, hwpt, IOMMUFD_OBJ_HW_PAGETABLE);
+   if (IS_ERR(hwpt)) {
+   rc = PTR_ERR(hwpt);
+   goto out_unlock;
+   }
+
+   hwpt->domain = iommu_domain_alloc(dev->bus);
+   if (!hwpt->domain) {
+   rc = -ENOMEM;
+   goto out_abort;
+   }
+
+   INIT_LIST_HEAD(&hwpt->devices);
+   mutex_init(&hwpt->devices_lock);
+   hwpt->ioas = ioas;
+   /* The calling driver is a user until iommufd_hw_pagetable_put() */
+   refcount_inc(&ioas->obj.users);
+
+   list_add_tail(&hwpt->auto_domains_item, &ioas->auto_domains);
+   /*
+* iommufd_object_finalize() consumes the refcount, get one for the
+* caller. This pairs with the first put in
+* iommufd_object_destroy_user()
+*/
+   refcount_inc(&hwpt->obj.users);
+   iommufd_object_finalize(ictx, &hwpt->obj);
+
+   mutex_unlock(&ioas->mutex);
+   return hwpt;

[PATCH RFC 05/12] iommufd: PFN handling for iopt_pages

2022-03-18 Thread Jason Gunthorpe via iommu
The top of the data structure provides an IO Address Space (IOAS) that is
similar to a VFIO container. The IOAS allows map/unmap of memory into
ranges of IOVA called iopt_areas. Domains and in-kernel users (like VFIO
mdevs) can be attached to the IOAS to access the PFNs that those IOVA
areas cover.

The IO Address Space (IOAS) datastructure is composed of:
 - struct io_pagetable holding the IOVA map
 - struct iopt_areas representing populated portions of IOVA
 - struct iopt_pages representing the storage of PFNs
 - struct iommu_domain representing the IO page table in the system IOMMU
 - struct iopt_pages_user representing in-kernel users of PFNs (ie VFIO
   mdevs)
 - struct xarray pinned_pfns holding a list of pages pinned by in-kernel
   users

This patch introduces the lowest part of the datastructure - the movement
of PFNs in a tiered storage scheme:
 1) iopt_pages::pinned_pfns xarray
 2) An iommu_domain
 3) The origin of the PFNs, i.e. the userspace pointer

PFNs have to be copied between all combinations of tiers, depending on the
configuration.

The interface is an iterator called a 'pfn_reader' which determines in which
tier each PFN is stored and loads it into a list of PFNs held in a struct
pfn_batch.

Each step of the iterator will fill up the pfn_batch, then the caller can
use the pfn_batch to send the PFNs to the required destination. Repeating
this loop will read all the PFNs in an IOVA range.
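
In outline, a caller filling an iommu_domain would loop like this (a
sketch with names paraphrased from this series; @pages, @area and @domain
are assumed to be held by the caller):

        struct pfn_reader pfns;
        int rc;

        rc = pfn_reader_first(&pfns, pages, start_index, last_index);
        while (!rc && !pfn_reader_done(&pfns)) {
                /* pfns.batch holds the next run of contiguous PFNs */
                rc = batch_to_domain(&pfns.batch, domain, area,
                                     pfns.batch_start_index);
                if (rc)
                        break;
                rc = pfn_reader_next(&pfns);
        }
        pfn_reader_destroy(&pfns);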

The pfn_reader and pfn_batch also keep track of the pinned page accounting.

While PFNs are always stored and accessed as full PAGE_SIZE units the
iommu_domain tier can store with a sub-page offset/length to support
IOMMUs with a smaller IOPTE size than PAGE_SIZE.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommufd/Makefile  |   3 +-
 drivers/iommu/iommufd/io_pagetable.h| 101 
 drivers/iommu/iommufd/iommufd_private.h |  20 +
 drivers/iommu/iommufd/pages.c   | 723 
 4 files changed, 846 insertions(+), 1 deletion(-)
 create mode 100644 drivers/iommu/iommufd/io_pagetable.h
 create mode 100644 drivers/iommu/iommufd/pages.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index a07a8cffe937c6..05a0e91e30afad 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
-   main.o
+   main.o \
+   pages.o
 
 obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
new file mode 100644
index 00..94ca8712722d31
--- /dev/null
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -0,0 +1,101 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ *
+ */
+#ifndef __IO_PAGETABLE_H
+#define __IO_PAGETABLE_H
+
+#include 
+#include 
+#include 
+#include 
+
+#include "iommufd_private.h"
+
+struct iommu_domain;
+
+/*
+ * Each io_pagetable is composed of intervals of areas which cover regions of
+ * the iova that are backed by something. iova not covered by areas is not
+ * populated in the page table. Each area is fully populated with pages.
+ *
+ * iovas are in byte units, but must be iopt->iova_alignment aligned.
+ *
+ * pages can be NULL; this means some other thread is still working on setting
+ * up the area. When observed under the write side of the domain_rwsem a NULL
+ * pages must mean no domains are filled.
+ *
+ * storage_domain points at an arbitrary iommu_domain that is holding the PFNs
+ * for this area. It is locked by the pages->mutex. This simplifies the locking
+ * as the pages code can rely on the storage_domain without having to get the
+ * iopt->domains_rwsem.
+ *
+ * The io_pagetable::iova_rwsem protects node
+ * The iopt_pages::mutex protects pages_node
+ * iopt and iommu_prot are immutable
+ */
+struct iopt_area {
+   struct interval_tree_node node;
+   struct interval_tree_node pages_node;
+   /* How many bytes into the first page the area starts */
+   unsigned int page_offset;
+   struct io_pagetable *iopt;
+   struct iopt_pages *pages;
+   struct iommu_domain *storage_domain;
+   /* IOMMU_READ, IOMMU_WRITE, etc */
+   int iommu_prot;
+   atomic_t num_users;
+};
+
+static inline unsigned long iopt_area_index(struct iopt_area *area)
+{
+   return area->pages_node.start;
+}
+
+static inline unsigned long iopt_area_last_index(struct iopt_area *area)
+{
+   return area->pages_node.last;
+}
+
+static inline unsigned long iopt_area_iova(struct iopt_area *area)
+{
+   return area->node.start;
+}
+
+static inline unsigned long iopt_area_last_iova(struct iopt_area *area)
+{
+   return area->node.last;
+}
+
+/*
+ * This holds a pinned page list for multiple areas of IO address space. The
+ * pages always originate from a linear chunk of userspace VA. Multiple
+ * io_pagetable's, through their iopt_area's, can share a single iopt_pages
+ * which 

[PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-18 Thread Jason Gunthorpe via iommu
Connect the IOAS to its IOCTL interface. This exposes most of the
functionality in the io_pagetable to userspace.

This is intended to be the core of the generic interface that IOMMUFD will
provide. Every IOMMU driver should be able to implement an iommu_domain
that is compatible with this generic mechanism.

It is also designed to be easy to use for simple non virtual machine
monitor users, like DPDK:
 - Universal simple support for all IOMMUs (no PPC special path)
 - An IOVA allocator that considers the aperture and the reserved ranges
 - io_pagetable allows any number of iommu_domains to be connected to the
   IOAS

Along with room in the design to add non-generic features to cater to
specific HW functionality.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommufd/Makefile  |   1 +
 drivers/iommu/iommufd/ioas.c| 248 
 drivers/iommu/iommufd/iommufd_private.h |  27 +++
 drivers/iommu/iommufd/main.c|  17 ++
 include/uapi/linux/iommufd.h| 132 +
 5 files changed, 425 insertions(+)
 create mode 100644 drivers/iommu/iommufd/ioas.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index b66a8c47ff55ec..2b4f36f1b72f9d 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
io_pagetable.o \
+   ioas.o \
main.o \
pages.o
 
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
new file mode 100644
index 00..c530b2ba74b06b
--- /dev/null
+++ b/drivers/iommu/iommufd/ioas.c
@@ -0,0 +1,248 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#include 
+#include 
+#include 
+#include 
+
+#include "io_pagetable.h"
+
+void iommufd_ioas_destroy(struct iommufd_object *obj)
+{
+   struct iommufd_ioas *ioas = container_of(obj, struct iommufd_ioas, obj);
+   int rc;
+
+   rc = iopt_unmap_all(&ioas->iopt);
+   WARN_ON(rc);
+   iopt_destroy_table(&ioas->iopt);
+}
+
+struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx)
+{
+   struct iommufd_ioas *ioas;
+   int rc;
+
+   ioas = iommufd_object_alloc(ictx, ioas, IOMMUFD_OBJ_IOAS);
+   if (IS_ERR(ioas))
+   return ioas;
+
+   rc = iopt_init_table(&ioas->iopt);
+   if (rc)
+   goto out_abort;
+   return ioas;
+
+out_abort:
+   iommufd_object_abort(ictx, &ioas->obj);
+   return ERR_PTR(rc);
+}
+
+int iommufd_ioas_alloc_ioctl(struct iommufd_ucmd *ucmd)
+{
+   struct iommu_ioas_alloc *cmd = ucmd->cmd;
+   struct iommufd_ioas *ioas;
+   int rc;
+
+   if (cmd->flags)
+   return -EOPNOTSUPP;
+
+   ioas = iommufd_ioas_alloc(ucmd->ictx);
+   if (IS_ERR(ioas))
+   return PTR_ERR(ioas);
+
+   cmd->out_ioas_id = ioas->obj.id;
+   rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+   if (rc)
+   goto out_table;
+   iommufd_object_finalize(ucmd->ictx, &ioas->obj);
+   return 0;
+
+out_table:
+   iommufd_ioas_destroy(&ioas->obj);
+   return rc;
+}
+
+int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd)
+{
+   struct iommu_ioas_iova_ranges __user *uptr = ucmd->ubuffer;
+   struct iommu_ioas_iova_ranges *cmd = ucmd->cmd;
+   struct iommufd_ioas *ioas;
+   struct interval_tree_span_iter span;
+   u32 max_iovas;
+   int rc;
+
+   if (cmd->__reserved)
+   return -EOPNOTSUPP;
+
+   max_iovas = cmd->size - sizeof(*cmd);
+   if (max_iovas % sizeof(cmd->out_valid_iovas[0]))
+   return -EINVAL;
+   max_iovas /= sizeof(cmd->out_valid_iovas[0]);
+
+   ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+   if (IS_ERR(ioas))
+   return PTR_ERR(ioas);
+
+   down_read(&ioas->iopt.iova_rwsem);
+   cmd->out_num_iovas = 0;
+   for (interval_tree_span_iter_first(
+&span, &ioas->iopt.reserved_iova_itree, 0, ULONG_MAX);
+!interval_tree_span_iter_done(&span);
+interval_tree_span_iter_next(&span)) {
+   if (!span.is_hole)
+   continue;
+   if (cmd->out_num_iovas < max_iovas) {
+   rc = put_user((u64)span.start_hole,
+ &uptr->out_valid_iovas[cmd->out_num_iovas]
+  .start);
+   if (rc)
+   goto out_put;
+   rc = put_user(
+   (u64)span.last_hole,
+   &uptr->out_valid_iovas[cmd->out_num_iovas].last);
+   if (rc)
+   goto out_put;
+   }
+   cmd->out_num_iovas++;
+   }
+   rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+   if (rc)
+   goto out_put;
+   if (cmd->out_num_iovas > max_iovas)
+ 

[PATCH RFC 06/12] iommufd: Algorithms for PFN storage

2022-03-18 Thread Jason Gunthorpe via iommu
The iopt_pages which represents a logical linear list of PFNs held in
different storage tiers. Each area points to a slice of exactly one
iopt_pages, and each iopt_pages can have multiple areas and users.

The three storage tiers are managed to meet these objectives:

 - If no iommu_domain or user exists then minimal memory should be
   consumed by iommufd
 - If a page has been pinned then an iopt_pages will not pin it again
 - If an in-kernel user exists then the xarray must provide the backing
   storage to avoid allocations on domain removals
 - Otherwise any iommu_domain will be used for storage
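
Expressed as a decision rule (a paraphrase of the objectives above, not
code from the patch):

        /*
         * in-kernel user present   -> back with the pinned_pfns xarray
         * else any domain attached -> read PFNs back from a storage domain
         * else                     -> store nothing; re-derive from the
         *                             userspace pointer when next needed
         */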

In a common configuration with only an iommu_domain the iopt_pages does
not allocate significant memory itself.

The external interface for pages has several logical operations:

  iopt_area_fill_domain() will load the PFNs from storage into a single
  domain. This is used when attaching a new domain to an existing IOAS.

  iopt_area_fill_domains() will load the PFNs from storage into multiple
  domains. This is used when creating a new IOVA map in an existing IOAS

  iopt_pages_add_user() creates an iopt_pages_user that tracks an in-kernel
  user of PFNs. This is some external driver that might be accessing the
  IOVA using the CPU, or programming PFNs with the DMA API. ie a VFIO
  mdev.

  iopt_pages_fill_xarray() will load PFNs into the xarray and return a
  'struct page *' array. It is used by iopt_pages_user's to extract PFNs
  for in-kernel use. iopt_pages_fill_from_xarray() is a fast path when it
  is known the xarray is already filled.
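
For example, an in-kernel user such as an mdev driver would follow
roughly this pattern (signatures paraphrased from this series, not
exact):

        struct page *out_pages[NPAGES];
        int rc;

        rc = iopt_pages_add_user(pages, start_index, last_index,
                                 out_pages, true /* writable */);
        if (rc)
                return rc;
        /*
         * out_pages[] is now pinned and backed by the xarray; it stays
         * valid for CPU or DMA API use until the user is removed.
         */
        iopt_pages_remove_user(pages, start_index, last_index);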

As an iopt_pages can be referred to in slices by many areas and users it
uses interval trees to keep track of which storage tiers currently hold
the PFNs. On a page-by-page basis any request for a PFN will be satisfied
from one of the storage tiers and the PFN copied to target domain/array.

Unfill actions are similar, on a page by page basis domains are unmapped,
xarray entries freed or struct pages fully put back.

Significant complexity is required to fully optimize all of these data
motions. The implementation calculates the largest consecutive range of
same-storage indexes and operates in blocks. The accumulation of PFNs
always generates the largest contiguous PFN range possible to optimize and
this gathering can cross storage tier boundaries. For cases like 'fill
domains' care is taken to avoid duplicated work and PFNs are read once and
pushed into all domains.

The map/unmap interaction with the iommu_domain always works in contiguous
PFN blocks. The implementation does not require or benefit from any
split/merge optimization in the iommu_domain driver.

This design suggests several possible improvements in the IOMMU API that
would greatly help performance, particularly a way for the driver to map
and read the pfn lists instead of working with one driver call per page to
read, and one driver call per contiguous range to store.
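
One hypothetical shape for such an op (purely illustrative, not part of
this series or the current IOMMU API):

        int (*map_pfns)(struct iommu_domain *domain, unsigned long iova,
                        unsigned long *pfns, size_t npfns, int prot);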

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommufd/io_pagetable.h |  71 +++-
 drivers/iommu/iommufd/pages.c| 594 +++
 2 files changed, 664 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index 94ca8712722d31..c8b6a60ff24c94 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -47,6 +47,14 @@ struct iopt_area {
atomic_t num_users;
 };
 
+int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages);
+void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages);
+
+int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain);
+void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
+struct iommu_domain *domain);
+void iopt_unmap_domain(struct io_pagetable *iopt, struct iommu_domain *domain);
+
 static inline unsigned long iopt_area_index(struct iopt_area *area)
 {
return area->pages_node.start;
@@ -67,6 +75,37 @@ static inline unsigned long iopt_area_last_iova(struct iopt_area *area)
return area->node.last;
 }
 
+static inline size_t iopt_area_length(struct iopt_area *area)
+{
+   return (area->node.last - area->node.start) + 1;
+}
+
+static inline struct iopt_area *iopt_area_iter_first(struct io_pagetable *iopt,
+unsigned long start,
+unsigned long last)
+{
+   struct interval_tree_node *node;
+
+   lockdep_assert_held(&iopt->iova_rwsem);
+
+   node = interval_tree_iter_first(&iopt->area_itree, start, last);
+   if (!node)
+   return NULL;
+   return container_of(node, struct iopt_area, node);
+}
+
+static inline struct iopt_area *iopt_area_iter_next(struct iopt_area *area,
+   unsigned long start,
+   unsigned long last)
+{
+   struct 

[PATCH RFC 02/12] iommufd: Overview documentation

2022-03-18 Thread Jason Gunthorpe via iommu
From: Kevin Tian 

Add iommufd to the documentation tree.

Signed-off-by: Kevin Tian 
Signed-off-by: Jason Gunthorpe 
---
 Documentation/userspace-api/index.rst   |   1 +
 Documentation/userspace-api/iommufd.rst | 224 
 2 files changed, 225 insertions(+)
 create mode 100644 Documentation/userspace-api/iommufd.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index a61eac0c73f825..3815f013e4aebd 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -25,6 +25,7 @@ place where this information is gathered.
ebpf/index
ioctl/index
iommu
+   iommufd
media/index
sysfs-platform_profile
vduse
diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst
new file mode 100644
index 00..38035b3822fd23
--- /dev/null
+++ b/Documentation/userspace-api/iommufd.rst
@@ -0,0 +1,224 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+===
+IOMMUFD
+===
+
+:Author: Jason Gunthorpe
+:Author: Kevin Tian
+
+Overview
+
+
+IOMMUFD is the user API to control the IOMMU subsystem as it relates to 
managing
+IO page tables that point at user space memory. It intends to be general and
+consumable by any driver that wants to DMA to userspace. Those drivers are
+expected to deprecate any proprietary IOMMU logic, if existing (e.g.
+vfio_iommu_type1.c).
+
+At minimum iommufd provides universal support for managing I/O address spaces
+and I/O page tables for all IOMMUs, with room in the design to add non-generic
+features to cater to specific hardware functionality.
+
+In this context the capital letter (IOMMUFD) refers to the subsystem while the
+small letter (iommufd) refers to the file descriptors created via /dev/iommu to
+run the user API over.
+
+Key Concepts
+
+
+User Visible Objects
+
+
+Following IOMMUFD objects are exposed to userspace:
+
+- IOMMUFD_OBJ_IOAS, representing an I/O address space (IOAS) allowing map/unmap
+  of user space memory into ranges of I/O Virtual Address (IOVA).
+
+  The IOAS is a functional replacement for the VFIO container, and like the 
VFIO
+  container copies its IOVA map to a list of iommu_domains held within it.
+
+- IOMMUFD_OBJ_DEVICE, representing a device that is bound to iommufd by an
+  external driver.
+
+- IOMMUFD_OBJ_HW_PAGETABLE, wrapping an actual hardware I/O page table (i.e. a
+  single struct iommu_domain) managed by the iommu driver.
+
+  The IOAS has a list of HW_PAGETABLES that share the same IOVA mapping and the
+  IOAS will synchronize its mapping with each member HW_PAGETABLE.
+
+All user-visible objects are destroyed via the IOMMU_DESTROY uAPI.
+
+Linkage between user-visible objects and external kernel datastructures is
+reflected by dotted line arrows below, with numbers referring to certain
+operations creating the objects and links::
+
+  _________________________________________________________
+ |                         iommufd                         |
+ |       [1]                                               |
+ |  _________________                                      |
+ | |                 |                                     |
+ | |                 |                                     |
+ | |                 |                                     |
+ | |                 |                                     |
+ | |                 |                                     |
+ | |                 |                                     |
+ | |                 |        [3]                 [2]      |
+ | |                 |    ____________         __________  |
+ | |      IOAS       |<--|            |<------|          | |
+ | |                 |   |HW_PAGETABLE|       |  DEVICE  | |
+ | |                 |   |____________|       |__________| |
+ | |                 |         |                   |       |
+ | |                 |         |                   |       |
+ | |                 |         |                   |       |
+ | |                 |         |                   |       |
+ | |                 |         |                   |       |
+ | |_________________|         |                   |       |
+ |         |                   |                   |       |
+ |_________|___________________|___________________|_______|
+           |                   |                   |
+           |          _________v__________  _______v______
+           | PFN storage |                ||              |
+           |------------>|  iommu_domain  ||struct device |
+                         |________________||______________|
+
+1. IOMMUFD_OBJ_IOAS is created via the IOMMU_IOAS_ALLOC uAPI. One iommufd can
+   hold multiple IOAS objects. IOAS is the most generic object and does not
+   expose interfaces that are specific to single IOMMU drivers. All operations
+   on the IOAS must operate equally on each of the iommu_domains that are 
inside
+   it.
+
+2. 

Re: [PATCH] thunderbolt: Make iommu_dma_protection more accurate

2022-03-18 Thread mika.westerb...@linux.intel.com
Hi Robin,

On Fri, Mar 18, 2022 at 03:15:19PM +, Robin Murphy wrote:
> > IMHO we should just trust the firmware provided information here
> > (otherwise we are screwed anyway as there is no way to tell if the
> > devices connected prior to the OS can still do DMA), and use the external
> > facing port indicator to identify the ports that need DMA protection.
> 
> Indeed that's exactly what I want to do, but it begs the question of how we
> *find* the firmware-provided information in the first place!

Oh, right :) It's the combination of ACPI _DSD "ExternalFacingPort"
(which we already set, dev->external_facing, dev->untrusted for the
devices behind these ports IIRC) and the DMAR opt-in bit. All these are
already read by the kernel.

> I seem to have already started writing the dumb version that will walk the
> whole PCI segment and assume the presence of any external-facing port
> implies that we're good. Let me know if I should stop ;)

That sounds good to me, so don't stop just yet ;-)


Re: [PATCH] thunderbolt: Make iommu_dma_protection more accurate

2022-03-18 Thread Robin Murphy

On 2022-03-18 14:47, mika.westerb...@linux.intel.com wrote:

On Fri, Mar 18, 2022 at 02:08:16PM +, Robin Murphy wrote:

On 2022-03-18 13:25, mika.westerb...@linux.intel.com wrote:

Hi Robin,

On Fri, Mar 18, 2022 at 12:01:42PM +, Robin Murphy wrote:

This adds quite a lot of code and complexity, and honestly I would like to
keep it as simple as possible (and this is not enough because we need to
make sure the DMAR bit is there so that none of the possible connected
devices were able to overwrite our memory already).


Shall we forget the standalone sibling check and just make the
pdev->untrusted check directly in tb_acpi_add_link() then?


I think we should leave tb_acpi_add_link() untouched if possible ;-)
This is because it is used to add the device links from firmware
description that we need for proper power management of the tunneled
devices. It has little to do with the identification of the external
facing DMA-capable PCIe ports.

Furthermore these links only exists in USB4 software connection manager
systems so we do not have those in the existing Thunderbolt 3/4 systems
that use firmware based connection manager (pretty much all out there).


On reflection I guess the DMAR bit makes iommu_dma_protection
functionally dependent on ACPI already, so we don't actually lose
anything (and anyone can come back and revisit firmware-agnostic
methods later if a need appears).


I agree.


OK, so do we have any realistic options for identifying the correct PCI
devices, if USB4 PCIe adapters might be anywhere relative to their
associated NHI? Short of maintaining a list of known IDs, the only thought I
have left is that if we walk the whole PCI segment looking specifically for
hotplug-capable Gen1 ports, any system modern enough to have Thunderbolt is
*probably* not going to have any real PCIe Gen1 hotplug slots, so maybe
false negatives might be tolerable, but it still feels like a bit of a
sketchy heuristic.


Indeed.


I suppose we could just look to see if any device anywhere is marked as
external-facing, and hope that if firmware's done that much then it's done
everything right. That's still at least slightly better than what we have
today, but AFAICS still carries significant risk of a false positive for an
add-in card that firmware didn't recognise.


The port in this case, that is marked as external facing, is the PCIe
root port that the add-in-card is connected to, and that is known to the
firmware in advance.


I'm satisfied that we've come round to the right conclusion on the DMAR
opt-in - I'm in the middle of writing up patches for that now - but even 
Microsoft's spec gives that as a separate requirement from the flagging of
external ports, with both being necessary for Kernel DMA Protection.


Is the problem we are trying to solve here the fact that the user can
disable the IOMMU protection from the command line? Or the fact that the
firmware might not declare all the ports properly, so we may end up in a
situation where some of the ports do not get the full IOMMU protection?


It's about knowing whether or not firmware has declared the ports at 
all. If it hasn't then the system is vulnerable to *some* degree of DMA 
attacks regardless of anything else (the exact degree depending on 
kernel config and user overrides). Complete mitigation is simply too 
expensive to apply by default to every device the IOMMU layer is unsure 
about. The Thunderbolt driver cannot be confident that protection is in 
place unless it can somehow know that the IOMMU layer has seen that 
untrusted property on the relevant ports.



These are Microsoft requirements for the OEMs in order to pass their
firmware test suite so here I would not expect to have issues. Otherwise
they simply cannot ship the thing with Windows installed.

IMHO we should just trust the firmware provided information here
(otherwise we are screwed anyway as there is no way to tell if the
devices connected prior to the OS can still do DMA), and use the external
facing port indicator to identify the ports that need DMA protection.


Indeed that's exactly what I want to do, but it begs the question of how 
we *find* the firmware-provided information in the first place!


I seem to have already started writing the dumb version that will walk 
the whole PCI segment and assume the presence of any external-facing 
port implies that we're good. Let me know if I should stop ;)


Cheers,
Robin.


Re: [PATCH] thunderbolt: Make iommu_dma_protection more accurate

2022-03-18 Thread mika.westerb...@linux.intel.com
Hi Lukas,

On Fri, Mar 18, 2022 at 03:51:21PM +0100, Lukas Wunner wrote:
> On Fri, Mar 18, 2022 at 02:08:16PM +, Robin Murphy wrote:
> > OK, so do we have any realistic options for identifying the correct PCI
> > devices, if USB4 PCIe adapters might be anywhere relative to their
> > associated NHI? Short of maintaining a list of known IDs, the only thought I
> > have left is that if we walk the whole PCI segment looking specifically for
> > hotplug-capable Gen1 ports, any system modern enough to have Thunderbolt is
> > *probably* not going to have any real PCIe Gen1 hotplug slots, so maybe
> > false negatives might be tolerable, but it still feels like a bit of a
> > sketchy heuristic.
> 
> The Thunderbolt Device ROM contains the PCI slot number, so you can
> correlate the Thunderbolt switch ports with PCIe downstream ports
> and know exactly where PCIe tunnels are terminated.
> 
> Code is here:
> * thunderbolt: Obtain PCI slot number from DROM
>   https://github.com/l1k/linux/commit/756f7148bc10
> * thunderbolt: Move upstream_port to struct tb
>   https://github.com/l1k/linux/commit/58f16e7dd431
> * thunderbolt: Correlate PCI devices with Thunderbolt ports
>   https://github.com/l1k/linux/commit/f53ea40a7487
> 
> I implemented that in 2018, so it won't apply cleanly to current
> mainline.  But I kept forward-porting it on my private branch and
> could push that to GitHub if anyone is interested.
> 
> I don't know if this will work out-of-the-box for SoC-integrated
> Thunderbolt controllers.  It was developed with the discrete
> controllers in mind, which was the only thing available back then.

That DROM entry is completely optional and so is the whole DROM for the
host routers (this is the root of the USB4/TBT topology) so
unfortunately we cannot use it here.


Re: [PATCH] thunderbolt: Make iommu_dma_protection more accurate

2022-03-18 Thread Lukas Wunner
On Fri, Mar 18, 2022 at 03:51:21PM +0100, Lukas Wunner wrote:
> On Fri, Mar 18, 2022 at 02:08:16PM +, Robin Murphy wrote:
> > OK, so do we have any realistic options for identifying the correct PCI
> > devices, if USB4 PCIe adapters might be anywhere relative to their
> > associated NHI? Short of maintaining a list of known IDs, the only thought I
> > have left is that if we walk the whole PCI segment looking specifically for
> > hotplug-capable Gen1 ports, any system modern enough to have Thunderbolt is
> > *probably* not going to have any real PCIe Gen1 hotplug slots, so maybe
> > false negatives might be tolerable, but it still feels like a bit of a
> > sketchy heuristic.
> 
> The Thunderbolt Device ROM contains the PCI slot number, so you can
> correlate the Thunderbolt switch ports with PCIe downstream ports
> and know exactly where PCIe tunnels are terminated.
[...]
> I implemented that in 2018, so it won't apply cleanly to current
> mainline.  But I kept forward-porting it on my private branch and
> could push that to GitHub if anyone is interested.

FWIW, here's the most recent forward-port I've done:

https://github.com/l1k/linux/commits/thunderbolt_correlate_5.13


Re: [PATCH] thunderbolt: Make iommu_dma_protection more accurate

2022-03-18 Thread Lukas Wunner
On Fri, Mar 18, 2022 at 02:08:16PM +, Robin Murphy wrote:
> OK, so do we have any realistic options for identifying the correct PCI
> devices, if USB4 PCIe adapters might be anywhere relative to their
> associated NHI? Short of maintaining a list of known IDs, the only thought I
> have left is that if we walk the whole PCI segment looking specifically for
> hotplug-capable Gen1 ports, any system modern enough to have Thunderbolt is
> *probably* not going to have any real PCIe Gen1 hotplug slots, so maybe
> false negatives might be tolerable, but it still feels like a bit of a
> sketchy heuristic.

The Thunderbolt Device ROM contains the PCI slot number, so you can
correlate the Thunderbolt switch ports with PCIe downstream ports
and know exactly where PCIe tunnels are terminated.

Code is here:
* thunderbolt: Obtain PCI slot number from DROM
  https://github.com/l1k/linux/commit/756f7148bc10
* thunderbolt: Move upstream_port to struct tb
  https://github.com/l1k/linux/commit/58f16e7dd431
* thunderbolt: Correlate PCI devices with Thunderbolt ports
  https://github.com/l1k/linux/commit/f53ea40a7487

I implemented that in 2018, so it won't apply cleanly to current
mainline.  But I kept forward-porting it on my private branch and
could push that to GitHub if anyone is interested.

I don't know if this will work out-of-the-box for SoC-integrated
Thunderbolt controllers.  It was developed with the discrete
controllers in mind, which was the only thing available back then.

Thanks,

Lukas


Re: [PATCH] thunderbolt: Make iommu_dma_protection more accurate

2022-03-18 Thread mika.westerb...@linux.intel.com
On Fri, Mar 18, 2022 at 02:08:16PM +, Robin Murphy wrote:
> On 2022-03-18 13:25, mika.westerb...@linux.intel.com wrote:
> > Hi Robin,
> > 
> > On Fri, Mar 18, 2022 at 12:01:42PM +, Robin Murphy wrote:
> > > > This adds quite a lot of code and complexity, and honestly I would like to
> > > > keep it as simple as possible (and this is not enough because we need to
> > > > make sure the DMAR bit is there so that none of the possible connected
> > > > devices were able to overwrite our memory already).
> > > 
> > > Shall we forget the standalone sibling check and just make the
> > > pdev->untrusted check directly in tb_acpi_add_link() then?
> > 
> > I think we should leave tb_acpi_add_link() untouched if possible ;-)
> > This is because it is used to add the device links from firmware
> > description that we need for proper power management of the tunneled
> > devices. It has little to do with the identification of the external
> > facing DMA-capable PCIe ports.
> > 
> > Furthermore these links only exists in USB4 software connection manager
> > systems so we do not have those in the existing Thunderbolt 3/4 systems
> > that use firmware based connection manager (pretty much all out there).
> > 
> > > On reflection I guess the DMAR bit makes iommu_dma_protection
> > > functionally dependent on ACPI already, so we don't actually lose
> > > anything (and anyone can come back and revisit firmware-agnostic
> > > methods later if a need appears).
> > 
> > I agree.
> 
> OK, so do we have any realistic options for identifying the correct PCI
> devices, if USB4 PCIe adapters might be anywhere relative to their
> associated NHI? Short of maintaining a list of known IDs, the only thought I
> have left is that if we walk the whole PCI segment looking specifically for
> hotplug-capable Gen1 ports, any system modern enough to have Thunderbolt is
> *probably* not going to have any real PCIe Gen1 hotplug slots, so maybe
> false negatives might be tolerable, but it still feels like a bit of a
> sketchy heuristic.

Indeed.

> I suppose we could just look to see if any device anywhere is marked as
> external-facing, and hope that if firmware's done that much then it's done
> everything right. That's still at least slightly better than what we have
> today, but AFAICS still carries significant risk of a false positive for an
> add-in card that firmware didn't recognise.

The port in this case, that is marked as external facing, is the PCIe
root port that the add-in-card is connected to, and that is known to the
firmware in advance.

> I'm satisfied that we've come round to the right conclusion on the DMAR
> opt-in - I'm in the middle of writing up patches for that now - but even
> Microsoft's spec gives that as a separate requirement from the flagging of
> external ports, with both being necessary for Kernel DMA Protection.

Is the problem we are trying to solve here the fact that the user can
disable the IOMMU protection from the command line? Or the fact that the
firmware might not declare all the ports properly, so we may end up in a
situation where some of the ports do not get the full IOMMU protection?

These are Microsoft requirements for the OEMs in order to pass their
firmware test suite so here I would not expect to have issues. Otherwise
they simply cannot ship the thing with Windows installed.

IMHO we should just trust the firmware provided information here
(otherwise we are screwed anyway as there is no way to tell if the
devices connected prior to the OS can still do DMA), and use the external
facing port indicator to identify the ports that need DMA protection.


RE: [PATCH] thunderbolt: Make iommu_dma_protection more accurate

2022-03-18 Thread Limonciello, Mario via iommu

> -Original Message-
> From: Robin Murphy 
> Sent: Friday, March 18, 2022 09:08
> To: mika.westerb...@linux.intel.com
> Cc: Limonciello, Mario ;
> andreas.noe...@gmail.com; michael.ja...@intel.com;
> yehezkel...@gmail.com; linux-...@vger.kernel.org; linux-
> ker...@vger.kernel.org; iommu@lists.linux-foundation.org; linux-
> p...@vger.kernel.org
> Subject: Re: [PATCH] thunderbolt: Make iommu_dma_protection more
> accurate
> 
> On 2022-03-18 13:25, mika.westerb...@linux.intel.com wrote:
> > Hi Robin,
> >
> > On Fri, Mar 18, 2022 at 12:01:42PM +, Robin Murphy wrote:
> >>> This adds quite a lot of code and complexity, and honestly I would like to
> >>> keep it as simple as possible (and this is not enough because we need to
> >>> make sure the DMAR bit is there so that none of the possible connected
> >>> devices were able to overwrite our memory already).
> >>
> >> Shall we forget the standalone sibling check and just make the
> >> pdev->untrusted check directly in tb_acpi_add_link() then?
> >
> > I think we should leave tb_acpi_add_link() untouched if possible ;-)
> > This is because it is used to add the device links from firmware
> > description that we need for proper power management of the tunneled
> > devices. It has little to do with the identification of the external
> > facing DMA-capable PCIe ports.
> >
> > Furthermore these links only exists in USB4 software connection manager
> > systems so we do not have those in the existing Thunderbolt 3/4 systems
> > that use firmware based connection manager (pretty much all out there).
> >
> >> On reflection I guess the DMAR bit makes iommu_dma_protection
> >> functionally dependent on ACPI already, so we don't actually lose
> >> anything (and anyone can come back and revisit firmware-agnostic
> >> methods later if a need appears).
> >
> > I agree.
> 
> OK, so do we have any realistic options for identifying the correct PCI
> devices, if USB4 PCIe adapters might be anywhere relative to their
> associated NHI? Short of maintaining a list of known IDs, the only
> thought I have left is that if we walk the whole PCI segment looking
> specifically for hotplug-capable Gen1 ports, any system modern enough to
> have Thunderbolt is *probably* not going to have any real PCIe Gen1
> hotplug slots, so maybe false negatives might be tolerable, but it still
> feels like a bit of a sketchy heuristic.
> 
> I suppose we could just look to see if any device anywhere is marked as
> external-facing, and hope that if firmware's done that much then it's
> done everything right. That's still at least slightly better than what
> we have today, but AFAICS still carries significant risk of a false
> positive for an add-in card that firmware didn't recognise.
> 
> I'm satisfied that we've come round to the right conclusion on the DMAR
> opt-in - I'm in the middle of writing up patches for that now - but even
> Microsoft's spec gives that as a separate requirement from the flagging
> of external ports, with both being necessary for Kernel DMA Protection.
> 
> Robin.

The thunderbolt driver already has a good idea whether it's using software
CM or firmware CM.  How about if we use that to make a decision?

If running firmware CM presumably that is a fixed quantity of machines that will
dwindle over time as OEMs release new HW with SW CM designs. So maybe
leave things "as is" for those - opt in bit is sufficient for this check.

And then if running software CM then make it mandatory that any created links
are set untrusted AND some pre-boot bits are set to get iommu dma protection
set.  We effectively end up with the same requirements as Microsoft has in
their spec for new hardware then, and don't break any old hardware that
doesn't have the links made and relies on firmware for the CM.
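
Roughly, in pseudo-code (the helper names here are illustrative, not
existing functions):

        bool nhi_iommu_dma_protection(struct tb_nhi *nhi)
        {
                if (!tb_is_software_cm(nhi))
                        /* firmware CM: old hardware, opt-in bit only */
                        return dmar_platform_optin();
                /* software CM: untrusted links AND pre-boot protection */
                return dmar_platform_optin() &&
                       all_tunnel_ports_untrusted(nhi);
        }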


Re: [PATCH v4 14/32] iommu: introduce iommu_domain_alloc_type and the KVM type

2022-03-18 Thread Jason Gunthorpe via iommu
On Fri, Mar 18, 2022 at 02:23:57AM +, Tian, Kevin wrote:

> Yes, that is another major part work besides the iommufd work. And
> it is not compatible with KVM features which rely on the dynamic
> manner of EPT. Though it is a bit questionable whether it's worthy of
> doing so just for saving memory footprint while losing other capabilities,
> it is a requirement for some future security extension in Intel trusted
> computing architecture. And KVM has been pinning pages for SEV/TDX/etc.
> today thus some facilities can be reused. But I agree it is not a simple
> task thus we need start discussion early to explore various gaps in
> iommu and kvm.

Yikes. IMHO this might work better going the other way, have KVM
import the iommu_domain and use that as the KVM page table than vice
versa.

The semantics are a heck of a lot clearer, and it is really obvious
that a lot of KVM becomes disabled if you do this.

Jason


Re: [PATCH] thunderbolt: Make iommu_dma_protection more accurate

2022-03-18 Thread Robin Murphy

On 2022-03-18 13:25, mika.westerb...@linux.intel.com wrote:

Hi Robin,

On Fri, Mar 18, 2022 at 12:01:42PM +, Robin Murphy wrote:

This adds quite a lot of code and complexity, and honestly I would like to
keep it as simple as possible (and this is not enough because we need to
make sure the DMAR bit is there so that none of the possible connected
devices were able to overwrite our memory already).


Shall we forget the standalone sibling check and just make the
pdev->untrusted check directly in tb_acpi_add_link() then?


I think we should leave tb_acpi_add_link() untouched if possible ;-)
This is because it is used to add the device links from firmware
description that we need for proper power management of the tunneled
devices. It has little to do with the identification of the external
facing DMA-capable PCIe ports.

Furthermore these links only exists in USB4 software connection manager
systems so we do not have those in the existing Thunderbolt 3/4 systems
that use firmware based connection manager (pretty much all out there).


On reflection I guess the DMAR bit makes iommu_dma_protection
functionally dependent on ACPI already, so we don't actually lose
anything (and anyone can come back and revisit firmware-agnostic
methods later if a need appears).


I agree.


OK, so do we have any realistic options for identifying the correct PCI 
devices, if USB4 PCIe adapters might be anywhere relative to their 
associated NHI? Short of maintaining a list of known IDs, the only 
thought I have left is that if we walk the whole PCI segment looking 
specifically for hotplug-capable Gen1 ports, any system modern enough to 
have Thunderbolt is *probably* not going to have any real PCIe Gen1 
hotplug slots, so maybe false negatives might be tolerable, but it still 
feels like a bit of a sketchy heuristic.
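
A rough sketch of that walk, purely illustrative and untested:

        static bool any_gen1_hotplug_port(void)
        {
                struct pci_dev *pdev = NULL;

                for_each_pci_dev(pdev) {
                        if (pci_pcie_type(pdev) != PCI_EXP_TYPE_DOWNSTREAM)
                                continue;
                        if (pdev->is_hotplug_bridge &&
                            pcie_get_speed_cap(pdev) == PCIE_SPEED_2_5GT)
                                return true;
                }
                return false;
        }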


I suppose we could just look to see if any device anywhere is marked as 
external-facing, and hope that if firmware's done that much then it's 
done everything right. That's still at least slightly better than what 
we have today, but AFAICS still carries significant risk of a false 
positive for an add-in card that firmware didn't recognise.


I'm satisfied that we've come round to the right conclusion on the DMAR 
opt-in - I'm in the middle of writing up patches for that now - but even 
Microsoft's spec gives that as a separate requirement from the flagging 
of external ports, with both being necessary for Kernel DMA Protection.


Robin.


Re: [PATCH v2 2/8] iommu: Add attach/detach_dev_pasid domain ops

2022-03-18 Thread Jason Gunthorpe via iommu
On Fri, Mar 18, 2022 at 08:01:04PM +0800, Lu Baolu wrote:
> On 2022/3/15 19:49, Tian, Kevin wrote:
> > > From: Jean-Philippe Brucker
> > > Sent: Tuesday, March 15, 2022 7:27 PM
> > > 
> > > On Mon, Mar 14, 2022 at 10:07:06PM -0700, Jacob Pan wrote:
> > > > From: Lu Baolu
> > > > 
> > > > An IOMMU domain represents an address space which can be attached by
> > > > devices that perform DMA within a domain. However, for platforms with
> > > > PASID capability the domain attachment needs to be handled at device+PASID
> > > > level. There can be multiple PASIDs within a device and multiple devices
> > > > attached to a given domain.
> > > > This patch introduces a new IOMMU op which supports device, PASID, and
> > > > IOMMU domain attachment. The immediate use case is for PASID capable
> > > > devices to perform DMA under DMA APIs.
> > > > 
> > > > Signed-off-by: Lu Baolu
> > > > Signed-off-by: Jacob Pan
> > > >   include/linux/iommu.h | 6 ++
> > > >   1 file changed, 6 insertions(+)
> > > > 
> > > > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > > > index 369f05c2a4e2..fde5b933dbe3 100644
> > > > +++ b/include/linux/iommu.h
> > > > @@ -227,6 +227,8 @@ struct iommu_iotlb_gather {
> > > >* @aux_get_pasid: get the pasid given an aux-domain
> > > >* @sva_bind: Bind process address space to device
> > > >* @sva_unbind: Unbind process address space from device
> > > > + * @attach_dev_pasid: attach an iommu domain to a pasid of device
> > > > + * @detach_dev_pasid: detach an iommu domain from a pasid of device
> > > Isn't that operation "assign a PASID to a domain" instead?  In patch 5,
> > > the domain is already attached to the device, so set_domain_pasid() might
> > > be clearer and to the point. If the IOMMU driver did the allocation we
> > > could also avoid patch 1.
> > iiuc this API can also work for future SIOV usage where each mdev attached
> > to the domain has its own pasid. "assigning a PASID to a domain" sounds
> > like going back to the previous aux domain approach which has one PASID
> > per domain and that PASID is used on all devices attached to the aux 
> > domain...
> > 
> 
> This also works for SVA as far as I can see. The sva_bind essentially is
> to  attach an SVA domain to the PASID of a device. The sva_bind/unbind
> ops could be removed with these two new callbacks.

As we talked before I would like to see 'sva bind' go away and be
replaced with:

  domain = alloc_sva_iommu_domain(device, current)
  attach_dev_pasid_domain(domain, device, pasid)

Which composes better with everything else. SVA is just a special kind
of iommu_domain

Jason


Re: [PATCH v2 2/8] iommu: Add attach/detach_dev_pasid domain ops

2022-03-18 Thread Jason Gunthorpe via iommu
On Fri, Mar 18, 2022 at 07:52:33PM +0800, Lu Baolu wrote:
> On 2022/3/15 13:07, Jacob Pan wrote:
> > From: Lu Baolu 
> > 
> > An IOMMU domain represents an address space which can be attached by
> > devices that perform DMA within a domain. However, for platforms with
> > PASID capability the domain attachment needs to be handled at device+PASID
> > level. There can be multiple PASIDs within a device and multiple devices
> > attached to a given domain.
> > This patch introduces a new IOMMU op which supports device, PASID, and
> > IOMMU domain attachment. The immediate use case is for PASID capable
> > devices to perform DMA under DMA APIs.
> > 
> > Signed-off-by: Lu Baolu 
> > Signed-off-by: Jacob Pan 
> >   include/linux/iommu.h | 6 ++
> >   1 file changed, 6 insertions(+)
> > 
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index 369f05c2a4e2..fde5b933dbe3 100644
> > +++ b/include/linux/iommu.h
> > @@ -227,6 +227,8 @@ struct iommu_iotlb_gather {
> >* @aux_get_pasid: get the pasid given an aux-domain
> >* @sva_bind: Bind process address space to device
> >* @sva_unbind: Unbind process address space from device
> > + * @attach_dev_pasid: attach an iommu domain to a pasid of device
> > + * @detach_dev_pasid: detach an iommu domain from a pasid of device
> >* @sva_get_pasid: Get PASID associated to a SVA handle
> >* @page_response: handle page request response
> >* @cache_invalidate: invalidate translation caches
> > @@ -296,6 +298,10 @@ struct iommu_ops {
> > struct iommu_sva *(*sva_bind)(struct device *dev, struct mm_struct *mm,
> >   void *drvdata);
> > void (*sva_unbind)(struct iommu_sva *handle);
> > +   int (*attach_dev_pasid)(struct iommu_domain *domain,
> > +   struct device *dev, ioasid_t id);
> > +   void (*detach_dev_pasid)(struct iommu_domain *domain,
> > +struct device *dev, ioasid_t id);
> 
> As we have introduced iommu_domain_ops, these callbacks should be part
> of the domain ops.

+1

Jason


Re: [PATCH v2 3/8] iommu/vt-d: Implement device_pasid domain attach ops

2022-03-18 Thread Jason Gunthorpe via iommu
On Fri, Mar 18, 2022 at 05:47:23AM +, Tian, Kevin wrote:
> may remember more detail here) and we didn't hear strong interest
> from customer on it. So this is just FYI for theoretical possibility and 
> I'm fine with even disallowing it before those issues are resolved.

I didn't mean disallow, I just meant let's not optimize for it.

Jason


Re: [PATCH v4 15/32] vfio: introduce KVM-owned IOMMU type

2022-03-18 Thread Jason Gunthorpe via iommu
On Fri, Mar 18, 2022 at 07:01:19AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Tuesday, March 15, 2022 10:55 PM
> > 
> > The first level iommu_domain has the 'type1' map and unmap and pins
> > the pages. This is the 1:1 map with the GPA and ends up pinning all
> > guest memory because the point is you don't want to take a memory pin
> > on your performance path
> > 
> > The second level iommu_domain points to a single IO page table in GPA
> > and is created/destroyed whenever the guest traps to the hypervisor to
> > manipulate the anchor (ie the GPA of the guest IO page table).
> > 
> 
> Can we use consistent terms as used in iommufd and hardware, i.e.
> with first-level/stage-1 referring to the child (GIOVA->GPA) which is
> further nested on second-level/stage-2 as the parent (GPA->HPA)?

Honestly I don't like injecting terms that only make sense for
virtualization into iommu/vfio land.

That area is intended to be general. If you use what it exposes for
virtualization, then great.

This is why I prefer to use 'user page table' when talking about the
GIOVA->GPA or Stage 1 map because it is a phrase independent of
virtualization or HW and clearly conveys what it is to the kernel and
its inherent order in the translation scheme.

The S1/S2 gets confusing because the HW people chose those names
so that S1 is the first translation a TLP sees and S2 is the second.

But from a software model, the S2 is the first domain created and the
first level of the translation tree, while the S1 is the second domain
created and the second level of the translation tree. ie the natural
SW numbers are backwards.
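
Spelled out for a DMA that targets guest memory:

        device TLP IOVA --(S1: user page table, GIOVA->GPA)-->
                        --(S2: kernel-managed table, GPA->HPA)--> RAM

yet software builds the S2 domain first and nests the S1 on top of it.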

And I know Matthew isn't working on HW that has the S1/S2 HW naming :)

But yes, should try harder to have good names. Maybe it will be
clearer with code.

Jason


Re: [PATCH] thunderbolt: Make iommu_dma_protection more accurate

2022-03-18 Thread mika.westerb...@linux.intel.com
Hi Robin,

On Fri, Mar 18, 2022 at 12:01:42PM +, Robin Murphy wrote:
> > This adds quite a lot of code and complexity, and honestly I would like to
> > keep it as simple as possible (and this is not enough because we need to
> > make sure the DMAR bit is there so that none of the possible connected
> > devices were able to overwrite our memory already).
> 
> Shall we forget the standalone sibling check and just make the
> pdev->untrusted check directly in tb_acpi_add_link() then?

I think we should leave tb_acpi_add_link() untouched if possible ;-)
This is because it is used to add the device links from firmware
description that we need for proper power management of the tunneled
devices. It has little to do with the identification of the external
facing DMA-capable PCIe ports.

Furthermore these links only exists in USB4 software connection manager
systems so we do not have those in the existing Thunderbolt 3/4 systems
that use firmware based connection manager (pretty much all out there).

> On reflection I guess the DMAR bit makes iommu_dma_protection
> functionally dependent on ACPI already, so we don't actually lose
> anything (and anyone can come back and revisit firmware-agnostic
> methods later if a need appears).

I agree.


Re: [PATCH v2 5/8] iommu: Add PASID support for DMA mapping API users

2022-03-18 Thread Lu Baolu

On 2022/3/15 13:07, Jacob Pan wrote:

DMA mapping API is the de facto standard for in-kernel DMA. It operates
on a per device/RID basis which is not PASID-aware.

For some modern devices, such as the Intel Data Streaming Accelerator, a
PASID is required for certain work submissions. To allow such devices to
use the DMA mapping API, we need the following functionalities:
1. Provide the device a way to retrieve a PASID for work submission within
the kernel
2. Enable the kernel PASID on the IOMMU for the device
3. Attach the kernel PASID to the device's default DMA domain, be it
IOVA or physical address in the case of pass-through.

This patch introduces a driver facing API that enables DMA API
PASID usage. Once enabled, device drivers can continue to use DMA APIs as
is. There is no difference in dma_handle between without PASID and with
PASID.
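
A minimal driver-side sketch of the intended usage (error handling
trimmed; @buf and @size are illustrative):

        ioasid_t pasid;
        dma_addr_t dma;
        int rc;

        rc = iommu_enable_pasid_dma(dev, &pasid);
        if (rc)
                return rc;

        dma = dma_map_single(dev, buf, size, DMA_TO_DEVICE);
        /* ... program the device to tag its DMA with @pasid ... */
        dma_unmap_single(dev, dma, size, DMA_TO_DEVICE);

        iommu_disable_pasid_dma(dev, pasid);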

Signed-off-by: Jacob Pan 
---
  drivers/iommu/dma-iommu.c | 65 +++
  include/linux/dma-iommu.h |  7 +
  include/linux/iommu.h |  9 ++
  3 files changed, 81 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index b22034975301..d0ff1a34b1b6 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -39,6 +39,8 @@ enum iommu_dma_cookie_type {
IOMMU_DMA_MSI_COOKIE,
  };
  
+static DECLARE_IOASID_SET(iommu_dma_pasid);

+
  struct iommu_dma_cookie {
enum iommu_dma_cookie_type  type;
union {
@@ -370,6 +372,69 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
domain->iova_cookie = NULL;
  }
  
+/**

+ * iommu_enable_pasid_dma --Enable in-kernel DMA request with PASID
+ * @dev:   Device to be enabled
+ *
+ * DMA request with PASID will be mapped the same way as the legacy DMA.
+ * If the device is in pass-through, PASID will also pass-through. If the
+ * device is in IOVA map, the supervisor PASID will point to the same IOVA
+ * page table.
+ *
+ * @return the kernel PASID to be used for DMA or INVALID_IOASID on failure


The comment on the return value should be rephrased according to the
real code.


+ */
+int iommu_enable_pasid_dma(struct device *dev, ioasid_t *pasid)
+{
+   struct iommu_domain *dom;
+   ioasid_t id, max;
+   int ret;
+
+   dom = iommu_get_domain_for_dev(dev);
+   if (!dom || !dom->ops || !dom->ops->attach_dev_pasid)
+   return -ENODEV;
+   max = iommu_get_dev_pasid_max(dev);
+   if (!max)
+   return -EINVAL;
+
+   id = ioasid_alloc(&iommu_dma_pasid, 1, max, dev);
+   if (id == INVALID_IOASID)
+   return -ENOMEM;
+
+   ret = dom->ops->attach_dev_pasid(dom, dev, id);
+   if (ret) {
+   ioasid_put(id);
+   return ret;
+   }
+   *pasid = id;
+
+   return ret;
+}
+EXPORT_SYMBOL(iommu_enable_pasid_dma);
+
+/**
+ * iommu_disable_pasid_dma - Disable in-kernel DMA request with PASID
+ * @dev:   Device's PASID DMA to be disabled
+ *
+ * It is the device driver's responsibility to ensure no more incoming DMA
+ * requests with the kernel PASID before calling this function. IOMMU driver
+ * ensures PASID cache, IOTLBs related to the kernel PASID are cleared and
+ * drained.
+ *
+ * @return 0 on success or error code on failure


Ditto.


+ */
+void iommu_disable_pasid_dma(struct device *dev, ioasid_t pasid)
+{
+   struct iommu_domain *dom;
+
+   /* TODO: check the given PASID is within the ioasid_set */
+   dom = iommu_get_domain_for_dev(dev);
+   if (!dom->ops->detach_dev_pasid)
+   return;
+   dom->ops->detach_dev_pasid(dom, dev, pasid);
+   ioasid_put(pasid);
+}
+EXPORT_SYMBOL(iommu_disable_pasid_dma);
+
  /**
   * iommu_dma_get_resv_regions - Reserved region driver helper
   * @dev: Device from iommu_get_resv_regions()
diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
index 24607dc3c2ac..e6cb9b52a420 100644
--- a/include/linux/dma-iommu.h
+++ b/include/linux/dma-iommu.h
@@ -18,6 +18,13 @@ int iommu_get_dma_cookie(struct iommu_domain *domain);
  int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base);
  void iommu_put_dma_cookie(struct iommu_domain *domain);
  
+/*

+ * For devices that can do DMA request with PASID, setup a system PASID.
+ * Address modes (IOVA, PA) are selected by the platform code.
+ */


No need for a comment here. Or move it to the function if needed.


+int iommu_enable_pasid_dma(struct device *dev, ioasid_t *pasid);
+void iommu_disable_pasid_dma(struct device *dev, ioasid_t pasid);
+
  /* Setup call for arch DMA mapping code */
  void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 dma_limit);
  int iommu_dma_init_fq(struct iommu_domain *domain);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index fde5b933dbe3..fb011722e4f8 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -395,6 +395,15 @@ static inline void iommu_set_dev_pasid_max(struct device *dev,
  
  	param->pasid_max = max;

  }
+static inline 

Re: [PATCH] thunderbolt: Make iommu_dma_protection more accurate

2022-03-18 Thread Robin Murphy

On 2022-03-18 11:38, mika.westerb...@linux.intel.com wrote:

Hi Mario,

On Thu, Mar 17, 2022 at 08:36:13PM +, Limonciello, Mario wrote:

Here is a proposal on top of what you did for this.
The idea being check the ports right when the links are made if they exist
(all the new USB4 stuff) and then check all siblings on TBT3 stuff.

diff --git a/drivers/thunderbolt/acpi.c b/drivers/thunderbolt/acpi.c
index 79b5abf9d042..89432456dbea 100644
--- a/drivers/thunderbolt/acpi.c
+++ b/drivers/thunderbolt/acpi.c
@@ -14,6 +14,7 @@
  static acpi_status tb_acpi_add_link(acpi_handle handle, u32 level, void *data,
 void **return_value)
  {
+   enum nhi_iommu_status iommu_status = IOMMU_UNKNOWN;
 struct fwnode_reference_args args;
 struct fwnode_handle *fwnode;
 struct tb_nhi *nhi = data;
@@ -91,6 +92,8 @@ static acpi_status tb_acpi_add_link(acpi_handle handle, u32 
level, void *data,
 if (link) {
 		dev_dbg(&nhi->pdev->dev, "created link from %s\n",
 			dev_name(&pdev->dev));
+   if (iommu_status != IOMMU_DISABLED)
+   iommu_status = nhi_check_iommu_for_port(pdev);
 } else {
 		dev_warn(&nhi->pdev->dev, "device link creation from %s failed\n",
 			 dev_name(&pdev->dev));
@@ -101,6 +104,7 @@ static acpi_status tb_acpi_add_link(acpi_handle handle, u32 
level, void *data,

  out_put:
 fwnode_handle_put(args.fwnode);
+   nhi->iommu_dma_protection = (iommu_status == IOMMU_ENABLED);
 return AE_OK;
  }

diff --git a/drivers/thunderbolt/nhi.c b/drivers/thunderbolt/nhi.c
index e12c2e266741..b5eb0cab392f 100644
--- a/drivers/thunderbolt/nhi.c
+++ b/drivers/thunderbolt/nhi.c
@@ -1103,10 +1103,30 @@ static void nhi_check_quirks(struct tb_nhi *nhi)
 nhi->quirks |= QUIRK_AUTO_CLEAR_INT;
  }

+enum nhi_iommu_status nhi_check_iommu_for_port(struct pci_dev *pdev)
+{
+   if (!pci_is_pcie(pdev) ||
+   !(pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT ||
+pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
+   return IOMMU_UNKNOWN;
+   }
+
+   if (!device_iommu_mapped(&pdev->dev)) {
+   return IOMMU_DISABLED;
+   }
+
+   if (!pdev->untrusted) {
+   dev_info(&pdev->dev,
+   "Assuming unreliable Kernel DMA protection\n");
+   return IOMMU_DISABLED;
+   }
+   return IOMMU_ENABLED;
+}
+
  static void nhi_check_iommu(struct tb_nhi *nhi)
  {
-   struct pci_dev *pdev;
-   bool port_ok = false;
+   enum nhi_iommu_status iommu_status = nhi->iommu_dma_protection ?
+   IOMMU_ENABLED : IOMMU_UNKNOWN;

 /*
  * Check for sibling devices that look like they should be our
@@ -1117,23 +1137,13 @@ static void nhi_check_iommu(struct tb_nhi *nhi)
  * otherwise even if translation is enabled for existing devices it
  * may potentially be overridden for a future tunnelled endpoint.
  */
-   for_each_pci_bridge(pdev, nhi->pdev->bus) {
-   if (!pci_is_pcie(pdev) ||
-   !(pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT ||
- pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM))
-   continue;
-
-   if (!device_iommu_mapped(&pdev->dev))
-   return;
-
-   if (!pdev->untrusted) {
-   dev_info(&nhi->pdev->dev,
-"Assuming unreliable Kernel DMA protection\n");
-   return;
-   }
-   port_ok = true;
+   if (iommu_status == IOMMU_UNKNOWN) {
+   struct pci_dev *pdev;
+   for_each_pci_bridge(pdev, nhi->pdev->bus)
+   if (iommu_status != IOMMU_DISABLED)
+   iommu_status = nhi_check_iommu_for_port(pdev);
 }
-   nhi->iommu_dma_protection = port_ok;
+   nhi->iommu_dma_protection = (iommu_status == IOMMU_ENABLED);
  }

  static int nhi_init_msi(struct tb_nhi *nhi)

diff --git a/drivers/thunderbolt/nhi.h b/drivers/thunderbolt/nhi.h
index 69083aab2736..1622d49b1763 100644
--- a/drivers/thunderbolt/nhi.h
+++ b/drivers/thunderbolt/nhi.h
@@ -11,6 +11,13 @@

  #include 

+enum nhi_iommu_status {
+   IOMMU_UNKNOWN,
+   IOMMU_DISABLED,
+   IOMMU_ENABLED,
+};
+enum nhi_iommu_status nhi_check_iommu_for_port(struct pci_dev *pdev);
+


This adds quite a lot code and complexity, and honestly I would like to
keep it as simple as possible (and this is not enough because we need to
make sure the DMAR bit is there so that none of the possible connected
devices were able to overwrite our memory already).


Shall we forget the standalone sibling check and just make the 
pdev->untrusted check directly in tb_acpi_add_link() then? On reflection 
I guess the DMAR 

Re: [PATCH v2 2/8] iommu: Add attach/detach_dev_pasid domain ops

2022-03-18 Thread Lu Baolu

On 2022/3/15 19:49, Tian, Kevin wrote:

From: Jean-Philippe Brucker
Sent: Tuesday, March 15, 2022 7:27 PM

On Mon, Mar 14, 2022 at 10:07:06PM -0700, Jacob Pan wrote:

From: Lu Baolu

An IOMMU domain represents an address space which can be attached by
devices that perform DMA within that domain. However, for platforms with
PASID capability, the domain attachment needs to be handled at the
device+PASID level. There can be multiple PASIDs within a device and
multiple devices attached to a given domain.
This patch introduces a new IOMMU op which supports device, PASID, and
IOMMU domain attachment. The immediate use case is for PASID-capable
devices to perform DMA under DMA APIs.

Signed-off-by: Lu Baolu
Signed-off-by: Jacob Pan
---
  include/linux/iommu.h | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 369f05c2a4e2..fde5b933dbe3 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -227,6 +227,8 @@ struct iommu_iotlb_gather {
   * @aux_get_pasid: get the pasid given an aux-domain
   * @sva_bind: Bind process address space to device
   * @sva_unbind: Unbind process address space from device
+ * @attach_dev_pasid: attach an iommu domain to a pasid of device
+ * @detach_dev_pasid: detach an iommu domain from a pasid of device

Isn't that operation "assign a PASID to a domain" instead?  In patch 5,
the domain is already attached to the device, so set_domain_pasid() might
be clearer and to the point. If the IOMMU driver did the allocation we
could also avoid patch 1.

iiuc this API can also work for future SIOV usage where each mdev attached
to the domain has its own pasid. "assigning a PASID to a domain" sounds
like going back to the previous aux domain approach which has one PASID
per domain and that PASID is used on all devices attached to the aux domain...



This also works for SVA as far as I can see. The sva_bind essentially
attaches an SVA domain to the PASID of a device. The sva_bind/unbind
ops could be removed with these two new callbacks.

Best regards,
baolu


Re: [PATCH] thunderbolt: Make iommu_dma_protection more accurate

2022-03-18 Thread Robin Murphy

On 2022-03-18 06:44, Mika Westerberg wrote:

Hi Robin,

Thanks for working on this!

On Thu, Mar 17, 2022 at 04:17:07PM +, Robin Murphy wrote:

Between me trying to get rid of iommu_present() and Mario wanting to
support the AMD equivalent of DMAR_PLATFORM_OPT_IN, scrutiny has shown
that the iommu_dma_protection attribute is being far too optimistic.
Even if an IOMMU might be present for some PCI segment in the system,
that doesn't necessarily mean it provides translation for the device(s)
we care about. Furthermore, all that DMAR_PLATFORM_OPT_IN really does
is tell us that memory was protected before the kernel was loaded, and
prevent the user from disabling the intel-iommu driver entirely. What
actually matters is whether we trust individual devices, based on the
"external facing" property that we expect firmware to describe for
Thunderbolt ports.


We still want to know that DMAR_PLATFORM_OPT_IN is set by the firmware
because that tells us that none of the devices connected before OS got
control had the ability to perform DMA outside of the RMRR regions. If
they did then all this is pointless because they could have modified the
system memory as they wished.


Ah, right, it does still matter in terms of whether we can trust our 
*own* integrity. I was thinking that it's OK since by this point the 
IOMMU driver should have reset the hardware and allocated new 
configuration tables etc. such that any previous modifications wouldn't 
matter, but of course the theoretical worst case is that the malicious 
device manages to corrupt the kernel code in memory just before that 
point to subvert the IOMMU driver itself. Got it, thanks again!


So yes, I think we *do* want to abstract that through the IOMMU API (and 
I'll follow up with our ACPI folks internally to check whether anyone's 
thought about this for Arm-based systems yet). Gosh, this is becoming 
quite the adventure, but it's all for good in the end :)


Robin.


Avoid false positives by looking as close as possible to the same PCI
topology that the IOMMU layer will consider once a Thunderbolt endpoint
appears. Crucially, we can't assume that IOMMU translation being enabled
for any reason is sufficient on its own; full (expensive) DMA protection
will still only be imposed on untrusted devices.

CC: Mario Limonciello 
Signed-off-by: Robin Murphy 
---

This supersedes my previous attempt just trying to replace
iommu_present() at [1], further to the original discussion at [2].

[1] 
https://lore.kernel.org/linux-iommu/bl1pr12mb515799c0be396377dbbef055e2...@bl1pr12mb5157.namprd12.prod.outlook.com/T/
[2] https://lore.kernel.org/linux-iommu/202203160844.lkviwr1q-...@intel.com/T/

  drivers/thunderbolt/domain.c | 12 +++-
  drivers/thunderbolt/nhi.c| 35 +++
  include/linux/thunderbolt.h  |  2 ++
  3 files changed, 40 insertions(+), 9 deletions(-)

diff --git a/drivers/thunderbolt/domain.c b/drivers/thunderbolt/domain.c
index 7018d959f775..d5c825e84ac8 100644
--- a/drivers/thunderbolt/domain.c
+++ b/drivers/thunderbolt/domain.c
@@ -7,9 +7,7 @@
   */
  
  #include <linux/device.h>
-#include <linux/dmar.h>
  #include <linux/idr.h>
-#include <linux/iommu.h>
  #include <linux/module.h>
  #include <linux/pm_runtime.h>
  #include <linux/slab.h>
@@ -257,13 +255,9 @@ static ssize_t iommu_dma_protection_show(struct device *dev,
 struct device_attribute *attr,
 char *buf)
  {
-   /*
-* Kernel DMA protection is a feature where Thunderbolt security is
-* handled natively using IOMMU. It is enabled when IOMMU is
-* enabled and ACPI DMAR table has DMAR_PLATFORM_OPT_IN set.
-*/
-   return sprintf(buf, "%d\n",
-  iommu_present(&pci_bus_type) && dmar_platform_optin());
+   struct tb *tb = container_of(dev, struct tb, dev);
+
+   return sprintf(buf, "%d\n", tb->nhi->iommu_dma_protection);
  }
  static DEVICE_ATTR_RO(iommu_dma_protection);
  
diff --git a/drivers/thunderbolt/nhi.c b/drivers/thunderbolt/nhi.c

index c73da0532be4..e12c2e266741 100644
--- a/drivers/thunderbolt/nhi.c
+++ b/drivers/thunderbolt/nhi.c
@@ -14,6 +14,7 @@
  #include 
  #include 
  #include 
+#include <linux/iommu.h>
  #include 
  #include 
  #include 
@@ -1102,6 +1103,39 @@ static void nhi_check_quirks(struct tb_nhi *nhi)
nhi->quirks |= QUIRK_AUTO_CLEAR_INT;
  }
  
+static void nhi_check_iommu(struct tb_nhi *nhi)

+{
+   struct pci_dev *pdev;
+   bool port_ok = false;



So for here somewhere we should call dmar_platform_optin() first. I
think this is something we want to move inside the IOMMU API (alongside
with the below checks).


+
+   /*
+* Check for sibling devices that look like they should be our
+* tunnelled ports. We can reasonably assume that if an IOMMU is
+* managing the bridge it will manage any future devices beyond it
+* too. If firmware has described a port as external-facing as
+* expected then we can trust the IOMMU layer to enforce isolation;
+

Re: [PATCH v2 2/8] iommu: Add attach/detach_dev_pasid domain ops

2022-03-18 Thread Lu Baolu

On 2022/3/15 13:07, Jacob Pan wrote:

From: Lu Baolu 

An IOMMU domain represents an address space which can be attached by
devices that perform DMA within that domain. However, for platforms with
PASID capability, the domain attachment needs to be handled at the
device+PASID level. There can be multiple PASIDs within a device and
multiple devices attached to a given domain.
This patch introduces a new IOMMU op which supports device, PASID, and
IOMMU domain attachment. The immediate use case is for PASID-capable
devices to perform DMA under DMA APIs.

Signed-off-by: Lu Baolu 
Signed-off-by: Jacob Pan 
---
  include/linux/iommu.h | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 369f05c2a4e2..fde5b933dbe3 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -227,6 +227,8 @@ struct iommu_iotlb_gather {
   * @aux_get_pasid: get the pasid given an aux-domain
   * @sva_bind: Bind process address space to device
   * @sva_unbind: Unbind process address space from device
+ * @attach_dev_pasid: attach an iommu domain to a pasid of device
+ * @detach_dev_pasid: detach an iommu domain from a pasid of device
   * @sva_get_pasid: Get PASID associated to a SVA handle
   * @page_response: handle page request response
   * @cache_invalidate: invalidate translation caches
@@ -296,6 +298,10 @@ struct iommu_ops {
struct iommu_sva *(*sva_bind)(struct device *dev, struct mm_struct *mm,
  void *drvdata);
void (*sva_unbind)(struct iommu_sva *handle);
+   int (*attach_dev_pasid)(struct iommu_domain *domain,
+   struct device *dev, ioasid_t id);
+   void (*detach_dev_pasid)(struct iommu_domain *domain,
+struct device *dev, ioasid_t id);


As we have introduced iommu_domain_ops, these callbacks should be part
of the domain ops.
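
For example, a rough sketch of that move (other iommu_domain_ops members
elided; untested):

	/* include/linux/iommu.h */
	struct iommu_domain_ops {
		/* ... existing per-domain callbacks ... */
		int (*attach_dev_pasid)(struct iommu_domain *domain,
					struct device *dev, ioasid_t pasid);
		void (*detach_dev_pasid)(struct iommu_domain *domain,
					 struct device *dev, ioasid_t pasid);
	};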


u32 (*sva_get_pasid)(struct iommu_sva *handle);
  
  	int (*page_response)(struct device *dev,


Best regards,
baolu


Re: [PATCH] thunderbolt: Make iommu_dma_protection more accurate

2022-03-18 Thread mika.westerb...@linux.intel.com
Hi Mario,

On Thu, Mar 17, 2022 at 08:36:13PM +, Limonciello, Mario wrote:
> Here is a proposal on top of what you did for this.  
> The idea being check the ports right when the links are made if they exist 
> (all the new USB4 stuff) and then check all siblings on TBT3 stuff.
> 
> diff --git a/drivers/thunderbolt/acpi.c b/drivers/thunderbolt/acpi.c
> index 79b5abf9d042..89432456dbea 100644
> --- a/drivers/thunderbolt/acpi.c
> +++ b/drivers/thunderbolt/acpi.c
> @@ -14,6 +14,7 @@
>  static acpi_status tb_acpi_add_link(acpi_handle handle, u32 level, void 
> *data,
> void **return_value)
>  {
> +   enum nhi_iommu_status iommu_status = IOMMU_UNKNOWN;
> struct fwnode_reference_args args;
> struct fwnode_handle *fwnode;
> struct tb_nhi *nhi = data;
> @@ -91,6 +92,8 @@ static acpi_status tb_acpi_add_link(acpi_handle handle, u32 
> level, void *data,
> if (link) {
> dev_dbg(&nhi->pdev->dev, "created link from %s\n",
> dev_name(&pdev->dev));
> +   if (iommu_status != IOMMU_DISABLED)
> +   iommu_status = nhi_check_iommu_for_port(pdev);
> } else {
> dev_warn(&nhi->pdev->dev, "device link creation from %s failed\n",
>  dev_name(&pdev->dev));
> @@ -101,6 +104,7 @@ static acpi_status tb_acpi_add_link(acpi_handle handle, 
> u32 level, void *data,
> 
>  out_put:
> fwnode_handle_put(args.fwnode);
> +   nhi->iommu_dma_protection = (iommu_status == IOMMU_ENABLED);
> return AE_OK;
>  }
> 
> diff --git a/drivers/thunderbolt/nhi.c b/drivers/thunderbolt/nhi.c
> index e12c2e266741..b5eb0cab392f 100644
> --- a/drivers/thunderbolt/nhi.c
> +++ b/drivers/thunderbolt/nhi.c
> @@ -1103,10 +1103,30 @@ static void nhi_check_quirks(struct tb_nhi *nhi)
> nhi->quirks |= QUIRK_AUTO_CLEAR_INT;
>  }
> 
> +enum nhi_iommu_status nhi_check_iommu_for_port(struct pci_dev *pdev)
> +{
> +   if (!pci_is_pcie(pdev) ||
> +   !(pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT ||
> +pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
> +   return IOMMU_UNKNOWN;
> +   }
> +
> +   if (!device_iommu_mapped(&pdev->dev)) {
> +   return IOMMU_DISABLED;
> +   }
> +
> +   if (!pdev->untrusted) {
> +   dev_info(&pdev->dev,
> +   "Assuming unreliable Kernel DMA protection\n");
> +   return IOMMU_DISABLED;
> +   }
> +   return IOMMU_ENABLED;
> +}
> +
>  static void nhi_check_iommu(struct tb_nhi *nhi)
>  {
> -   struct pci_dev *pdev;
> -   bool port_ok = false;
> +   enum nhi_iommu_status iommu_status = nhi->iommu_dma_protection ?
> +   IOMMU_ENABLED : IOMMU_UNKNOWN;
> 
> /*
>  * Check for sibling devices that look like they should be our
> @@ -1117,23 +1137,13 @@ static void nhi_check_iommu(struct tb_nhi *nhi)
>  * otherwise even if translation is enabled for existing devices it
>  * may potentially be overridden for a future tunnelled endpoint.
>  */
> -   for_each_pci_bridge(pdev, nhi->pdev->bus) {
> -   if (!pci_is_pcie(pdev) ||
> -   !(pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT ||
> - pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM))
> -   continue;
> -
> -   if (!device_iommu_mapped(&pdev->dev))
> -   return;
> -
> -   if (!pdev->untrusted) {
> -   dev_info(&nhi->pdev->dev,
> -"Assuming unreliable Kernel DMA protection\n");
> -   return;
> -   }
> -   port_ok = true;
> +   if (iommu_status == IOMMU_UNKNOWN) {
> +   struct pci_dev *pdev;
> +   for_each_pci_bridge(pdev, nhi->pdev->bus)
> +   if (iommu_status != IOMMU_DISABLED)
> +   iommu_status = nhi_check_iommu_for_port(pdev);
> }
> -   nhi->iommu_dma_protection = port_ok;
> +   nhi->iommu_dma_protection = (iommu_status == IOMMU_ENABLED);
>  }
> 
>  static int nhi_init_msi(struct tb_nhi *nhi)
> 
> diff --git a/drivers/thunderbolt/nhi.h b/drivers/thunderbolt/nhi.h
> index 69083aab2736..1622d49b1763 100644
> --- a/drivers/thunderbolt/nhi.h
> +++ b/drivers/thunderbolt/nhi.h
> @@ -11,6 +11,13 @@
> 
>  #include 
> 
> +enum nhi_iommu_status {
> +   IOMMU_UNKNOWN,
> +   IOMMU_DISABLED,
> +   IOMMU_ENABLED,
> +};
> +enum nhi_iommu_status nhi_check_iommu_for_port(struct pci_dev *pdev);
> +

This adds quite a lot code and complexity, and honestly I would like to
keep it as simple as possible (and this is not enough because we need to
make sure the DMAR bit is there so that none of the possible connected
devices were able to overwrite our memory already).

Re: [PATCH 2/4 RESEND] dma-mapping: Add wrapper function to set dma_coherent

2022-03-18 Thread Robin Murphy

On 2022-03-17 19:13, Michael Kelley (LINUX) wrote:

From: Robin Murphy  Sent: Thursday, March 17, 2022 10:20 AM


On 2022-03-17 16:25, Michael Kelley via iommu wrote:

Add a wrapper function to set dma_coherent, avoiding the need for
complex #ifdef's when setting it in architecture independent code.


No. It might happen to work out on the architectures you're looking at,
but if Hyper-V were ever to support, say, AArch32 VMs you might see the
problem. arch_setup_dma_ops() is the tool for this job.



OK. There's currently no vIOMMU in a Hyper-V guest, so presumably the
code would call arch_setup_dma_ops() with the dma_base, size, and iommu
parameters set to 0 and NULL.  This call can then be used in Patch 3 instead
of acpi_dma_configure(), and in the Patch 4 hv_dma_configure() function
as you suggested.  arch_setup_dma_ops() is not exported, so I'd need to
wrap it in a Hyper-V specific function in built-in code that is exported.

But at some point in the future if there's a vIOMMU in Hyper-V guests,
this approach will need some rework.

Does that make sense?  Thanks for your input and suggestions ...


Yes, that's essentially what I had in mind. If you did present a vIOMMU 
to the guest, presumably you'd either have to construct a regular 
IORT/VIOT, and thus end up adding the root complex to the ACPI namespace 
too so it can be referenced, at which point it would all get picked up 
by the standard machinery, or come up with some magic VMBus mechanism 
that would need a whole load of work to wire up in all the relevant 
places anyway.


(But please lean extremely heavily towards the former option!)

Thanks,
Robin.


Re: [PATCH 4/4 RESEND] PCI: hv: Propagate coherence from VMbus device to PCI device

2022-03-18 Thread Robin Murphy

On 2022-03-18 05:12, Michael Kelley (LINUX) wrote:

From: Robin Murphy  Sent: Thursday, March 17, 2022 10:15 AM


On 2022-03-17 16:25, Michael Kelley via iommu wrote:

PCI pass-thru devices in a Hyper-V VM are represented as a VMBus
device and as a PCI device.  The coherence of the VMbus device is
set based on the VMbus node in ACPI, but the PCI device has no
ACPI node and defaults to not hardware coherent.  This results
in extra software coherence management overhead on ARM64 when
devices are hardware coherent.

Fix this by propagating the coherence of the VMbus device to the
PCI device.  There's no effect on x86/x64 where devices are
always hardware coherent.

Signed-off-by: Michael Kelley 
---
   drivers/pci/controller/pci-hyperv.c | 17 +
   1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/controller/pci-hyperv.c 
b/drivers/pci/controller/pci-hyperv.c
index ae0bc2f..14276f5 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -49,6 +49,7 @@
   #include 
   #include 
   #include 
+#include 
   #include 

   /*
@@ -2142,9 +2143,9 @@ static void hv_pci_remove_slots(struct hv_pcibus_device

*hbus)

   }

   /*
- * Set NUMA node for the devices on the bus
+ * Set NUMA node and DMA coherence for the devices on the bus
*/
-static void hv_pci_assign_numa_node(struct hv_pcibus_device *hbus)
+static void hv_pci_assign_properties(struct hv_pcibus_device *hbus)
   {
struct pci_dev *dev;
struct pci_bus *bus = hbus->bridge->bus;
@@ -2167,6 +2168,14 @@ static void hv_pci_assign_numa_node(struct hv_pcibus_device *hbus)

 numa_map_to_online_node(
 hv_dev->desc.virtual_numa_node));

+   /*
+* On ARM64, propagate the DMA coherence from the VMbus device
+* to the corresponding PCI device. On x86/x64, these calls
+* have no effect because DMA is always hardware coherent.
+*/
+   dev_set_dma_coherent(&dev->dev,
+   dev_is_dma_coherent(&hbus->hdev->device));


Eww... if you really have to do this, I'd prefer to see a proper
hv_dma_configure() helper implemented and wired up to
pci_dma_configure(). Although since it's a generic property I guess at
worst pci_dma_configure could perhaps propagate coherency from the host
bridge to its children by itself in the absence of any other firmware
info. And it's built-in so could use arch_setup_dma_ops() like everyone
else.



I'm not seeing an existing mechanism to provide a "helper" or override
of pci_dma_configure(). Could you elaborate? Or is this something
that needs to be created?


I mean something like the diff below (other #includes omitted for 
clarity). Essentially if VMBus has its own way of describing parts of 
the system, then for those parts it's nice if it could fit into the same 
abstractions we use for firmware-based system description.


Cheers,
Robin.

->8-
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 588588cfda48..7d92ccad1569 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include <linux/hyperv.h>
 #include "pci.h"
 #include "pcie/portdrv.h"

@@ -1602,6 +1603,8 @@ static int pci_dma_configure(struct device *dev)
struct acpi_device *adev = to_acpi_device_node(bridge->fwnode);

ret = acpi_dma_configure(dev, acpi_get_dma_attr(adev));
+   } else if (is_vmbus_dev(bridge)) {
+   ret = hv_dma_configure(dev, device_get_dma_attr(bridge));
}

pci_put_host_bridge_device(bridge);
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index f565a8938836..d1d4dd3d5a3a 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1764,4 +1764,19 @@ static inline unsigned long virt_to_hvpfn(void *addr)
 #define HVPFN_DOWN(x)  ((x) >> HV_HYP_PAGE_SHIFT)
 #define page_to_hvpfn(page)(page_to_pfn(page) * NR_HV_HYP_PAGES_IN_PAGE)

+static inline bool is_vmbus_dev(struct device *dev)
+{
+   /*
+* dev->bus == &hv_bus would break when the caller is built-in
+* and CONFIG_HYPERV=m, so look for it by name instead.
+*/
+   return !strcmp(dev->bus->name, "vmbus");
+}
+
+static inline int hv_dma_configure(struct device *dev, enum dev_dma_attr attr)

+{
+   arch_setup_dma_ops(dev, 0, 0, NULL, attr == DEV_DMA_COHERENT);
+   return 0;
+}
+
 #endif /* _HYPERV_H */


RE: [PATCH] iommu/vt-d: check alignment before using psi

2022-03-18 Thread Tian, Kevin
> From: David Stevens 
> Sent: Wednesday, March 16, 2022 1:07 PM
> 
> From: David Stevens 
> 
> Fall back to domain selective flush if the target address is not aligned
> to the mask being used for invalidation. This is necessary because page

using domain selective flush is a bit conservative. What about trying
mask+1 first?
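
One possible shape of that (an untested sketch reusing pfn, pages, mask,
iommu and did from the function quoted below, with VTD_PAGE_SHIFT assumed
as used elsewhere in the driver): widen the mask until the naturally
aligned block covers the whole range, and fall back to a domain-selective
flush only if no supported mask fits:

	unsigned long last_pfn = pfn + pages - 1;

	/* Grow the mask until the aligned 2^mask block holding pfn also
	 * contains last_pfn; over-invalidation is harmless. */
	while (mask <= cap_max_amask_val(iommu->cap) &&
	       (pfn >> mask) != (last_pfn >> mask))
		mask++;

	if (!cap_pgsel_inv(iommu->cap) || mask > cap_max_amask_val(iommu->cap))
		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
	else
		iommu->flush.flush_iotlb(iommu, did, (u64)pfn << VTD_PAGE_SHIFT,
					 mask, DMA_TLB_PSI_FLUSH);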

> selective invalidation masks out the lower order bits of the target
> address based on the mask value, so if a non-aligned address is targeted
> for psi, then mappings at the end of [pfn, pfn+pages) may not properly
> be flushed from the iotlb.
> 
> This is not normally an issue because iova.c always allocates iovas that
> are aligned to their size. However, iovas which come from other sources
> (e.g. userspace via VFIO) may not be aligned.
> 
> Signed-off-by: David Stevens 
> ---
>  drivers/iommu/intel/iommu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 5b196cfe9ed2..c122686e0a5c 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -1735,7 +1735,8 @@ static void iommu_flush_iotlb_psi(struct
> intel_iommu *iommu,
>* and the base address is naturally aligned to the size.
>*/
>   if (!cap_pgsel_inv(iommu->cap) ||
> - mask > cap_max_amask_val(iommu->cap))
> + mask > cap_max_amask_val(iommu->cap) ||
> + unlikely(((1 << mask) - 1) & pfn))
>   iommu->flush.flush_iotlb(iommu, did, 0, 0,
> 
>   DMA_TLB_DSI_FLUSH);
>   else
> --
> 2.35.1.723.g4982287a31-goog



RE: [PATCH v4 15/32] vfio: introduce KVM-owned IOMMU type

2022-03-18 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, March 15, 2022 10:55 PM
> 
> The first level iommu_domain has the 'type1' map and unmap and pins
> the pages. This is the 1:1 map with the GPA and ends up pinning all
> guest memory because the point is you don't want to take a memory pin
> on your performance path
> 
> The second level iommu_domain points to a single IO page table in GPA
> and is created/destroyed whenever the guest traps to the hypervisor to
> manipulate the anchor (ie the GPA of the guest IO page table).
> 

Can we use consistent terms as used in iommufd and hardware, i.e.
with first-level/stage-1 referring to the child (GIOVA->GPA) which is
further nested on second-level/stage-2 as the parent (GPA->HPA)?

Otherwise all other explanations are agreed.

Thanks
Kevin


Re: [PATCH] thunderbolt: Make iommu_dma_protection more accurate

2022-03-18 Thread Mika Westerberg
Hi Robin,

Thanks for working on this!

On Thu, Mar 17, 2022 at 04:17:07PM +, Robin Murphy wrote:
> Between me trying to get rid of iommu_present() and Mario wanting to
> support the AMD equivalent of DMAR_PLATFORM_OPT_IN, scrutiny has shown
> that the iommu_dma_protection attribute is being far too optimistic.
> Even if an IOMMU might be present for some PCI segment in the system,
> that doesn't necessarily mean it provides translation for the device(s)
> we care about. Furthermore, all that DMAR_PLATFORM_OPT_IN really does
> is tell us that memory was protected before the kernel was loaded, and
> prevent the user from disabling the intel-iommu driver entirely. What
> actually matters is whether we trust individual devices, based on the
> "external facing" property that we expect firmware to describe for
> Thunderbolt ports.

We still want to know that DMAR_PLATFORM_OPT_IN is set by the firmware
because that tells us that none of the devices connected before OS got
control had the ability to perform DMA outside of the RMRR regions. If
they did then all this is pointless because they could have modified the
system memory as they wished.

> Avoid false positives by looking as close as possible to the same PCI
> topology that the IOMMU layer will consider once a Thunderbolt endpoint
> appears. Crucially, we can't assume that IOMMU translation being enabled
> for any reason is sufficient on its own; full (expensive) DMA protection
> will still only be imposed on untrusted devices.
> 
> CC: Mario Limonciello 
> Signed-off-by: Robin Murphy 
> ---
> 
> This supersedes my previous attempt just trying to replace
> iommu_present() at [1], further to the original discussion at [2].
> 
> [1] 
> https://lore.kernel.org/linux-iommu/bl1pr12mb515799c0be396377dbbef055e2...@bl1pr12mb5157.namprd12.prod.outlook.com/T/
> [2] https://lore.kernel.org/linux-iommu/202203160844.lkviwr1q-...@intel.com/T/
> 
>  drivers/thunderbolt/domain.c | 12 +++-
>  drivers/thunderbolt/nhi.c| 35 +++
>  include/linux/thunderbolt.h  |  2 ++
>  3 files changed, 40 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/thunderbolt/domain.c b/drivers/thunderbolt/domain.c
> index 7018d959f775..d5c825e84ac8 100644
> --- a/drivers/thunderbolt/domain.c
> +++ b/drivers/thunderbolt/domain.c
> @@ -7,9 +7,7 @@
>   */
>  
>  #include <linux/device.h>
> -#include <linux/dmar.h>
>  #include <linux/idr.h>
> -#include <linux/iommu.h>
>  #include <linux/module.h>
>  #include <linux/pm_runtime.h>
>  #include <linux/slab.h>
> @@ -257,13 +255,9 @@ static ssize_t iommu_dma_protection_show(struct device *dev,
>struct device_attribute *attr,
>char *buf)
>  {
> - /*
> -  * Kernel DMA protection is a feature where Thunderbolt security is
> -  * handled natively using IOMMU. It is enabled when IOMMU is
> -  * enabled and ACPI DMAR table has DMAR_PLATFORM_OPT_IN set.
> -  */
> - return sprintf(buf, "%d\n",
> -iommu_present(&pci_bus_type) && dmar_platform_optin());
> + struct tb *tb = container_of(dev, struct tb, dev);
> +
> + return sprintf(buf, "%d\n", tb->nhi->iommu_dma_protection);
>  }
>  static DEVICE_ATTR_RO(iommu_dma_protection);
>  
> diff --git a/drivers/thunderbolt/nhi.c b/drivers/thunderbolt/nhi.c
> index c73da0532be4..e12c2e266741 100644
> --- a/drivers/thunderbolt/nhi.c
> +++ b/drivers/thunderbolt/nhi.c
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/iommu.h>
>  #include 
>  #include 
>  #include 
> @@ -1102,6 +1103,39 @@ static void nhi_check_quirks(struct tb_nhi *nhi)
>   nhi->quirks |= QUIRK_AUTO_CLEAR_INT;
>  }
>  
> +static void nhi_check_iommu(struct tb_nhi *nhi)
> +{
> + struct pci_dev *pdev;
> + bool port_ok = false;


So for here somewhere we should call dmar_platform_optin() first. I
think this is something we want to move inside the IOMMU API (alongside
with the below checks).
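
For instance (a rough sketch only; dmar_platform_optin() is
Intel-specific as written, which is exactly why an equivalent check
belongs behind the IOMMU API):

	static void nhi_check_iommu(struct tb_nhi *nhi)
	{
		/* Bail out unless firmware opted in to pre-boot DMA protection,
		 * i.e. devices connected before the OS took over could not DMA
		 * outside the RMRR regions. */
		if (!dmar_platform_optin())
			return;

		/* ... then the per-port checks below ... */
	}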

> +
> + /*
> +  * Check for sibling devices that look like they should be our
> +  * tunnelled ports. We can reasonably assume that if an IOMMU is
> +  * managing the bridge it will manage any future devices beyond it
> +  * too. If firmware has described a port as external-facing as
> +  * expected then we can trust the IOMMU layer to enforce isolation;
> +  * otherwise even if translation is enabled for existing devices it
> +  * may potentially be overridden for a future tunnelled endpoint.
> +  */
> + for_each_pci_bridge(pdev, nhi->pdev->bus) {
> + if (!pci_is_pcie(pdev) ||
> + !(pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT ||
> +   pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM))
> + continue;
> +
> + if (!device_iommu_mapped(&pdev->dev))
> + return;
> +
> + if (!pdev->untrusted) {
> + dev_info(&nhi->pdev->dev,
> +  "Assuming unreliable Kernel DMA protection\n");
> +   

RE: [PATCH v2 9/9] dmaengine: idxd: separate user and kernel pasid enabling

2022-03-18 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Tuesday, March 15, 2022 1:07 PM

The cover letter is [0/8], but here you actually have the 9th patch...

> 
> From: Dave Jiang 
> 
> The idxd driver always gated the pasid enabling under a single knob and
> this assumption is incorrect. The pasid used for kernel operation can be
> independently toggled and has no dependency on the user pasid (and vice
> versa). Split the two so they are independent "enabled" flags.

While you said kernel vs. user SVA can be independently toggled, you
still have only a single 'sva' option gating both in the end.

Thanks
Kevin


RE: [PATCH v2 7/8] iommu/vt-d: Delete supervisor/kernel SVA

2022-03-18 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Tuesday, March 15, 2022 1:07 PM
> 
> In-kernel DMA with PASID should use DMA API now, remove supervisor
> PASID
> SVA support. Remove special cases in bind mm and page request service.
> 
> Signed-off-by: Jacob Pan 

So you removed all the references to SVM_FLAG_SUPERVISOR_MODE,
but the definition is still kept in include/linux/intel-svm.h...

> ---
>  drivers/iommu/intel/svm.c | 42 ---
>  1 file changed, 8 insertions(+), 34 deletions(-)
> 
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index 2c53689da461..37d6218f173b 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -516,11 +516,10 @@ static void intel_svm_free_pasid(struct mm_struct
> *mm)
> 
>  static struct iommu_sva *intel_svm_bind_mm(struct intel_iommu *iommu,
>  struct device *dev,
> -struct mm_struct *mm,
> -unsigned int flags)
> +struct mm_struct *mm)
>  {
>   struct device_domain_info *info = get_domain_info(dev);
> - unsigned long iflags, sflags;
> + unsigned long iflags, sflags = 0;
>   struct intel_svm_dev *sdev;
>   struct intel_svm *svm;
>   int ret = 0;
> @@ -533,16 +532,13 @@ static struct iommu_sva
> *intel_svm_bind_mm(struct intel_iommu *iommu,
> 
>   svm->pasid = mm->pasid;
>   svm->mm = mm;
> - svm->flags = flags;
> - INIT_LIST_HEAD_RCU(&svm->devs);
> 
> - if (!(flags & SVM_FLAG_SUPERVISOR_MODE)) {
> - svm->notifier.ops = &intel_mmuops;
> - ret = mmu_notifier_register(&svm->notifier, mm);
> - if (ret) {
> - kfree(svm);
> - return ERR_PTR(ret);
> - }
> + svm->notifier.ops = &intel_mmuops;
> + ret = mmu_notifier_register(&svm->notifier, mm);
> + if (ret) {
> + kfree(svm);
> + return ERR_PTR(ret);
>   }
> 
>   ret = pasid_private_add(svm->pasid, svm);
> @@ -583,8 +579,6 @@ static struct iommu_sva *intel_svm_bind_mm(struct
> intel_iommu *iommu,
>   }
> 
>   /* Setup the pasid table: */
> - sflags = (flags & SVM_FLAG_SUPERVISOR_MODE) ?
> - PASID_FLAG_SUPERVISOR_MODE : 0;
>   sflags |= cpu_feature_enabled(X86_FEATURE_LA57) ?
> PASID_FLAG_FL5LP : 0;
>   spin_lock_irqsave(&iommu->lock, iflags);
>   ret = intel_pasid_setup_first_level(iommu, dev, mm->pgd, mm->pasid,
> @@ -957,7 +951,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
>* to unbind the mm while any page faults are
> outstanding.
>*/
>   svm = pasid_private_find(req->pasid);
> - if (IS_ERR_OR_NULL(svm) || (svm->flags &
> SVM_FLAG_SUPERVISOR_MODE))
> + if (IS_ERR_OR_NULL(svm))
>   goto bad_req;
>   }
> 
> @@ -1011,29 +1005,9 @@ static irqreturn_t prq_event_thread(int irq, void
> *d)
>  struct iommu_sva *intel_svm_bind(struct device *dev, struct mm_struct
> *mm, void *drvdata)
>  {
>   struct intel_iommu *iommu = device_to_iommu(dev, NULL, NULL);
> - unsigned int flags = 0;
>   struct iommu_sva *sva;
>   int ret;
> 
> - if (drvdata)
> - flags = *(unsigned int *)drvdata;
> -
> - if (flags & SVM_FLAG_SUPERVISOR_MODE) {
> - if (!ecap_srs(iommu->ecap)) {
> - dev_err(dev, "%s: Supervisor PASID not supported\n",
> - iommu->name);
> - return ERR_PTR(-EOPNOTSUPP);
> - }
> -
> - if (mm) {
> - dev_err(dev, "%s: Supervisor PASID with user
> provided mm\n",
> - iommu->name);
> - return ERR_PTR(-EINVAL);
> - }
> -
> - mm = _mm;
> - }
> -
>   mutex_lock(&pasid_mutex);
>   ret = intel_svm_alloc_pasid(dev, mm, flags);
>   if (ret) {
> --
> 2.25.1



RE: [PATCH v2 6/8] dmaengine: idxd: Use DMA API for in-kernel DMA with PASID

2022-03-18 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Tuesday, March 15, 2022 1:07 PM
> 
> The current in-kernel supervisor PASID support is based on the SVM/SVA
> machinery in SVA lib. The binding between a kernel PASID and kernel
> mapping has many flaws. See discussions in the link below.
> 
> This patch enables in-kernel DMA by switching from SVA lib to the
> standard DMA mapping APIs. Since both DMA requests with and without
> PASIDs are mapped identically, there is no change to how DMA APIs are
> used after the kernel PASID is enabled.
> 
> Link: https://lore.kernel.org/linux-
> iommu/20210511194726.gp1002...@nvidia.com/
> Signed-off-by: Jacob Pan 
> ---
>  drivers/dma/idxd/idxd.h  |  1 -
>  drivers/dma/idxd/init.c  | 34 +-
>  drivers/dma/idxd/sysfs.c |  7 ---
>  3 files changed, 9 insertions(+), 33 deletions(-)
> 
> diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
> index da72eb15f610..a09ab4a6e1c1 100644
> --- a/drivers/dma/idxd/idxd.h
> +++ b/drivers/dma/idxd/idxd.h
> @@ -276,7 +276,6 @@ struct idxd_device {
>   struct idxd_wq **wqs;
>   struct idxd_engine **engines;
> 
> - struct iommu_sva *sva;
>   unsigned int pasid;
> 
>   int num_groups;
> diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
> index 08a5f4310188..5d1f8dd4abf6 100644
> --- a/drivers/dma/idxd/init.c
> +++ b/drivers/dma/idxd/init.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include "../dmaengine.h"
> @@ -466,36 +467,22 @@ static struct idxd_device *idxd_alloc(struct pci_dev
> *pdev, struct idxd_driver_d
> 
>  static int idxd_enable_system_pasid(struct idxd_device *idxd)

idxd_enable_pasid_dma() since system pasid is a confusing term now?
Or just remove the idxd-specific wrappers and have the caller call
iommu_enable_pasid_dma() directly, given the simple logic here.
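i.e. roughly (a sketch only):

	ret = iommu_enable_pasid_dma(&idxd->pdev->dev, &idxd->pasid);
	if (ret)
		return ret;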

>  {
> - int flags;
> - unsigned int pasid;
> - struct iommu_sva *sva;
> + u32 pasid;
> + int ret;
> 
> - flags = SVM_FLAG_SUPERVISOR_MODE;
> -
> - sva = iommu_sva_bind_device(&idxd->pdev->dev, NULL, &flags);
> - if (IS_ERR(sva)) {
> - dev_warn(>pdev->dev,
> -  "iommu sva bind failed: %ld\n", PTR_ERR(sva));
> - return PTR_ERR(sva);
> - }
> -
> - pasid = iommu_sva_get_pasid(sva);
> - if (pasid == IOMMU_PASID_INVALID) {
> - iommu_sva_unbind_device(sva);
> - return -ENODEV;
> + ret = iommu_enable_pasid_dma(&idxd->pdev->dev, &pasid);
> + if (ret) {
> + dev_err(>pdev->dev, "No DMA PASID %d\n", ret);
> + return ret;
>   }
> -
> - idxd->sva = sva;
>   idxd->pasid = pasid;
> - dev_dbg(>pdev->dev, "system pasid: %u\n", pasid);
> +
>   return 0;
>  }
> 
>  static void idxd_disable_system_pasid(struct idxd_device *idxd)
>  {
> -
> - iommu_sva_unbind_device(idxd->sva);
> - idxd->sva = NULL;
> + iommu_disable_pasid_dma(&idxd->pdev->dev, idxd->pasid);
>  }
> 
>  static int idxd_probe(struct idxd_device *idxd)
> @@ -524,10 +511,7 @@ static int idxd_probe(struct idxd_device *idxd)
>   } else {
>   dev_warn(dev, "Unable to turn on SVA feature.\n");
>   }
> - } else if (!sva) {
> - dev_warn(dev, "User forced SVA off via module param.\n");

why removing above 2 lines? they are related to a module param thus
not affected by the logic in this series.

>   }
> -
>   idxd_read_caps(idxd);
>   idxd_read_table_offsets(idxd);
> 
> diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
> index 7e19ab92b61a..fde6656695ba 100644
> --- a/drivers/dma/idxd/sysfs.c
> +++ b/drivers/dma/idxd/sysfs.c
> @@ -839,13 +839,6 @@ static ssize_t wq_name_store(struct device *dev,
>   if (strlen(buf) > WQ_NAME_SIZE || strlen(buf) == 0)
>   return -EINVAL;
> 
> - /*
> -  * This is temporarily placed here until we have SVM support for
> -  * dmaengine.
> -  */
> - if (wq->type == IDXD_WQT_KERNEL && device_pasid_enabled(wq->idxd))
> - return -EOPNOTSUPP;
> -
>   memset(wq->name, 0, WQ_NAME_SIZE + 1);
>   strncpy(wq->name, buf, WQ_NAME_SIZE);
>   strreplace(wq->name, '\n', '\0');
> --
> 2.25.1



RE: [PATCH] iommu/vt-d: check alignment before using psi

2022-03-18 Thread Zhang, Tina



> -Original Message-
> From: iommu  On Behalf Of
> David Stevens
> Sent: Wednesday, March 16, 2022 1:07 PM
> To: Lu Baolu 
> Cc: iommu@lists.linux-foundation.org; linux-ker...@vger.kernel.org; David
> Stevens 
> Subject: [PATCH] iommu/vt-d: check alignment before using psi
> 
> From: David Stevens 
> 
> Fall back to domain selective flush if the target address is not aligned to 
> the
> mask being used for invalidation. This is necessary because page selective
> invalidation masks out the lower order bits of the target address based on
> the mask value, so if a non-aligned address is targeted for psi, then mappings
> at the end of [pfn, pfn+pages) may not properly be flushed from the iotlb.
> 
> This is not normally an issue because iova.c always allocates iovas that are
> aligned to their size. However, iovas which come from other sources (e.g.
> userspace via VFIO) may not be aligned.

Indeed, dma_map/unmap always uses alloc_iova(..., true) to allocate the
IOVA, which means it is size-aligned. But an IOVA used by userspace might
not be size-aligned, and it seems we cannot expect userspace to always
use a size-aligned IOVA base address.
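
For a concrete illustration: with mask = 4 (a 16-page invalidation) and an
unaligned pfn = 0x1003, the hardware masks the low bits and flushes the
aligned block [0x1000, 0x100f], so pages 0x1010 through 0x1012 of the
intended range [0x1003, 0x1012] would stay stale in the IOTLB.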

BR,
Tina

> 
> Signed-off-by: David Stevens 
> ---
>  drivers/iommu/intel/iommu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 5b196cfe9ed2..c122686e0a5c 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -1735,7 +1735,8 @@ static void iommu_flush_iotlb_psi(struct
> intel_iommu *iommu,
>* and the base address is naturally aligned to the size.
>*/
>   if (!cap_pgsel_inv(iommu->cap) ||
> - mask > cap_max_amask_val(iommu->cap))
> + mask > cap_max_amask_val(iommu->cap) ||
> + unlikely(((1 << mask) - 1) & pfn))
>   iommu->flush.flush_iotlb(iommu, did, 0, 0,
> 
>   DMA_TLB_DSI_FLUSH);
>   else
> --
> 2.35.1.723.g4982287a31-goog
> 