RE: [RFC PATCH v3 2/2] platform-msi: Add platform check for subdevice irq domain

2021-01-13 Thread Tian, Kevin
> From: Lu Baolu
> Sent: Thursday, January 14, 2021 9:30 AM
> 
> The pci_subdevice_msi_create_irq_domain() should fail if the underlying
> platform is not able to support IMS (Interrupt Message Storage). Otherwise,
> interrupt isolation is not guaranteed.
> 
> For x86, IMS is only supported on bare metal for now. We could enable it
> in virtualization environments in the future if an interrupt HYPERCALL
> domain is supported or the hardware has the capability of interrupt
> isolation for subdevices.
> 
> Cc: David Woodhouse 
> Cc: Leon Romanovsky 
> Cc: Kevin Tian 
> Suggested-by: Thomas Gleixner 
> Link: https://lore.kernel.org/linux-pci/87pn4nk7nn@nanos.tec.linutronix.de/
> Link: https://lore.kernel.org/linux-pci/877dqrnzr3@nanos.tec.linutronix.de/
> Link: https://lore.kernel.org/linux-pci/877dqqmc2h@nanos.tec.linutronix.de/
> Signed-off-by: Lu Baolu 
> ---
>  arch/x86/pci/common.c   | 71 +
>  drivers/base/platform-msi.c |  8 +
>  include/linux/msi.h |  1 +
>  3 files changed, 80 insertions(+)
> 
> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
> index 3507f456fcd0..9deb826fb242 100644
> --- a/arch/x86/pci/common.c
> +++ b/arch/x86/pci/common.c
> @@ -12,6 +12,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #include 
>  #include 
> @@ -724,3 +725,73 @@ struct pci_dev *pci_real_dma_dev(struct pci_dev *dev)
>   return dev;
>  }
>  #endif
> +
> +/*
> + * We want to figure out which context we are running in. But the hardware
> + * does not introduce a reliable way (instruction, CPUID leaf, MSR, whatever)
> + * which can be manipulated by the VMM to let the OS figure out where it runs.
> + * So we go with the below probably on_bare_metal() function as a replacement
> + * for definitely on_bare_metal() to go forward only for the very simple reason
> + * that this is the only option we have.
> + */
> +static const char * const vmm_vendor_name[] = {
> + "QEMU", "Bochs", "KVM", "Xen", "VMware", "VMW", "VMware Inc.",
> + "innotek GmbH", "Oracle Corporation", "Parallels", "BHYVE"
> +};
> +
> +static void read_type0_virtual_machine(const struct dmi_header *dm, void
> *p)
> +{
> + u8 *data = (u8 *)dm + 0x13;
> +
> + /* BIOS Information (Type 0) */
> + if (dm->type != 0 || dm->length < 0x14)
> + return;
> +
> + /* Bit 4 of BIOS Characteristics Extension Byte 2 */
> + if (*data & BIT(4))
> + *((bool *)p) = true;
> +}
> +
> +static bool smbios_virtual_machine(void)
> +{
> + bool bit_present = false;
> +
> + dmi_walk(read_type0_virtual_machine, &bit_present);
> +
> + return bit_present;
> +}
> +
> +static bool on_bare_metal(struct device *dev)
> +{
> + int i;
> +
> + if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
> + return false;
> +
> + if (smbios_virtual_machine())
> + return false;
> +
> + if (iommu_capable(dev->bus, IOMMU_CAP_VIOMMU))
> + return false;
> +
> + for (i = 0; i < ARRAY_SIZE(vmm_vendor_name); i++)
> + if (dmi_match(DMI_SYS_VENDOR, vmm_vendor_name[i]))
> + return false;

Thinking about it more, I wonder whether this check is actually useful here.
As Leon and David commented, the same vendor name can be used for both VM
and bare-metal instances, so this check may misinterpret either case. That is
not quite what we originally wanted: heuristics that indicate a VM environment,
where misinterpreting a VM as bare metal in corner cases is tolerable (but not
the other way around).
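
(For reference, the X86_FEATURE_HYPERVISOR check that on_bare_metal() tries
first simply reflects the CPUID-advertised "hypervisor present" bit; below is
a minimal userspace sketch of the same probe, purely illustrative and not part
of the patch:)

#include <stdbool.h>
#include <stdio.h>
#include <cpuid.h>

/* CPUID leaf 1, ECX bit 31 is the "hypervisor present" bit a VMM sets. */
static bool cpuid_says_hypervisor(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return false;
	return ecx & (1u << 31);
}

int main(void)
{
	printf("hypervisor bit: %s\n", cpuid_says_hypervisor() ? "set" : "clear");
	return 0;
}

The later DMI checks only exist to catch VMMs that hide this bit, and that is
exactly where the ambiguity above comes from.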

Thomas?

Thanks
Kevin



[RFC PATCH v3 2/2] platform-msi: Add platform check for subdevice irq domain

2021-01-13 Thread Lu Baolu
The pci_subdevice_msi_create_irq_domain() should fail if the underlying
platform is not able to support IMS (Interrupt Message Storage). Otherwise,
interrupt isolation is not guaranteed.

For x86, IMS is only supported on bare metal for now. We could enable it
in virtualization environments in the future if an interrupt HYPERCALL
domain is supported or the hardware has the capability of interrupt
isolation for subdevices.

Cc: David Woodhouse 
Cc: Leon Romanovsky 
Cc: Kevin Tian 
Suggested-by: Thomas Gleixner 
Link: https://lore.kernel.org/linux-pci/87pn4nk7nn@nanos.tec.linutronix.de/
Link: https://lore.kernel.org/linux-pci/877dqrnzr3@nanos.tec.linutronix.de/
Link: https://lore.kernel.org/linux-pci/877dqqmc2h@nanos.tec.linutronix.de/
Signed-off-by: Lu Baolu 
---
 arch/x86/pci/common.c   | 71 +
 drivers/base/platform-msi.c |  8 +
 include/linux/msi.h |  1 +
 3 files changed, 80 insertions(+)

diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index 3507f456fcd0..9deb826fb242 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -724,3 +725,73 @@ struct pci_dev *pci_real_dma_dev(struct pci_dev *dev)
return dev;
 }
 #endif
+
+/*
+ * We want to figure out which context we are running in. But the hardware
+ * does not introduce a reliable way (instruction, CPUID leaf, MSR, whatever)
+ * which can be manipulated by the VMM to let the OS figure out where it runs.
+ * So we go with the below probably on_bare_metal() function as a replacement
+ * for definitely on_bare_metal() to go forward only for the very simple reason
+ * that this is the only option we have.
+ */
+static const char * const vmm_vendor_name[] = {
+   "QEMU", "Bochs", "KVM", "Xen", "VMware", "VMW", "VMware Inc.",
+   "innotek GmbH", "Oracle Corporation", "Parallels", "BHYVE"
+};
+
+static void read_type0_virtual_machine(const struct dmi_header *dm, void *p)
+{
+   u8 *data = (u8 *)dm + 0x13;
+
+   /* BIOS Information (Type 0) */
+   if (dm->type != 0 || dm->length < 0x14)
+   return;
+
+   /* Bit 4 of BIOS Characteristics Extension Byte 2 */
+   if (*data & BIT(4))
+   *((bool *)p) = true;
+}
+
+static bool smbios_virtual_machine(void)
+{
+   bool bit_present = false;
+
+   dmi_walk(read_type0_virtual_machine, &bit_present);
+
+   return bit_present;
+}
+
+static bool on_bare_metal(struct device *dev)
+{
+   int i;
+
+   if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+   return false;
+
+   if (smbios_virtual_machine())
+   return false;
+
+   if (iommu_capable(dev->bus, IOMMU_CAP_VIOMMU))
+   return false;
+
+   for (i = 0; i < ARRAY_SIZE(vmm_vendor_name); i++)
+   if (dmi_match(DMI_SYS_VENDOR, vmm_vendor_name[i]))
+   return false;
+
+   pr_info("System running on bare metal, report to bugzilla.kernel.org if 
not the case.");
+
+   return true;
+}
+
+bool arch_support_pci_device_ims(struct pci_dev *pdev)
+{
+   /*
+* When we are running in a VMM context, the device IMS could only be
+* enabled when the underlying hardware supports interrupt isolation
+* of the subdevice, or any mechanism (trap, hypercall) is added so
+* that changes in the interrupt message store could be managed by the
+* VMM. For now, we only support the device IMS when we are running on
+* the bare metal.
+*/
+   return on_bare_metal(&pdev->dev);
+}
diff --git a/drivers/base/platform-msi.c b/drivers/base/platform-msi.c
index 8432a1bf4e28..88e5fe4dae67 100644
--- a/drivers/base/platform-msi.c
+++ b/drivers/base/platform-msi.c
@@ -512,6 +512,11 @@ struct irq_domain *device_msi_create_irq_domain(struct fwnode_handle *fn,
 #ifdef CONFIG_PCI
 #include 
 
+bool __weak arch_support_pci_device_ims(struct pci_dev *pdev)
+{
+   return false;
+}
+
 /**
  * pci_subdevice_msi_create_irq_domain - Create an irq domain for subdevices
  * @pdev:  Pointer to PCI device for which the subdevice domain is created
@@ -523,6 +528,9 @@ struct irq_domain *pci_subdevice_msi_create_irq_domain(struct pci_dev *pdev,
struct irq_domain *domain, *pdev_msi;
struct fwnode_handle *fn;
 
+   if (!arch_support_pci_device_ims(pdev))
+   return NULL;
+
/*
 * Retrieve the MSI domain of the underlying PCI device's MSI
 * domain. The PCI device domain's parent domain is also the parent
diff --git a/include/linux/msi.h b/include/linux/msi.h
index f319d7c6a4ef..6fda81c4b859 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -478,6 +478,7 @@ struct irq_domain *device_msi_create_irq_domain(struct fwnode_handle *fn,
struct irq_domain *parent);
 
 # ifdef CONFIG_PCI
+bool arch_support_pci_device_ims(struct pci_dev *pdev);

[RFC PATCH v3 1/2] iommu: Add capability IOMMU_CAP_VIOMMU

2021-01-13 Thread Lu Baolu
Some vendor IOMMU drivers are able to declare that they are running in a VM
context. This is very valuable for features that should only be supported on
bare metal. Add a capability bit so that it can be used.

Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/iommu.c  | 20 
 drivers/iommu/virtio-iommu.c |  9 +
 include/linux/iommu.h|  1 +
 3 files changed, 30 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index cb205a04fe4c..8eb022d0e8aa 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5738,12 +5738,32 @@ static inline bool nested_mode_support(void)
return ret;
 }
 
+static inline bool caching_mode_enabled(void)
+{
+   struct dmar_drhd_unit *drhd;
+   struct intel_iommu *iommu;
+   bool ret = false;
+
+   rcu_read_lock();
+   for_each_active_iommu(iommu, drhd) {
+   if (cap_caching_mode(iommu->cap)) {
+   ret = true;
+   break;
+   }
+   }
+   rcu_read_unlock();
+
+   return ret;
+}
+
 static bool intel_iommu_capable(enum iommu_cap cap)
 {
if (cap == IOMMU_CAP_CACHE_COHERENCY)
return domain_update_iommu_snooping(NULL) == 1;
if (cap == IOMMU_CAP_INTR_REMAP)
return irq_remapping_enabled == 1;
+   if (cap == IOMMU_CAP_VIOMMU)
+   return caching_mode_enabled();
 
return false;
 }
diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 2bfdd5734844..719793e103db 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -931,7 +931,16 @@ static int viommu_of_xlate(struct device *dev, struct of_phandle_args *args)
return iommu_fwspec_add_ids(dev, args->args, 1);
 }
 
+static bool viommu_capable(enum iommu_cap cap)
+{
+   if (cap == IOMMU_CAP_VIOMMU)
+   return true;
+
+   return false;
+}
+
 static struct iommu_ops viommu_ops = {
+   .capable= viommu_capable,
.domain_alloc   = viommu_domain_alloc,
.domain_free= viommu_domain_free,
.attach_dev = viommu_attach_dev,
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index b95a6f8db6ff..1d24be667a03 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -94,6 +94,7 @@ enum iommu_cap {
	IOMMU_CAP_CACHE_COHERENCY,  /* IOMMU can enforce cache coherent DMA
	   transactions */
IOMMU_CAP_INTR_REMAP,   /* IOMMU supports interrupt isolation */
IOMMU_CAP_NOEXEC,   /* IOMMU_NOEXEC flag */
+   IOMMU_CAP_VIOMMU,   /* IOMMU can declare it is running in a VM */
 };
 
 /*
-- 
2.25.1



[RFC PATCH v3 0/2] Add platform check for subdevice irq domain

2021-01-13 Thread Lu Baolu
Hi,

Learnt from the discussions in this thread:

https://lore.kernel.org/linux-pci/160408357912.912050.17005584526266191420.st...@djiang5-desk3.ch.intel.com/

The device IMS (Interrupt Message Storage) should not be enabled in any
virtualization environment unless there is a HYPERCALL domain that lets the
hypervisor monitor changes in the message store.

As an initial step, we allow IMS to be enabled only when we are running on
bare metal. It will be easy to enable IMS in virtualization environments once
the above preconditions are met in the future.
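
The gating uses the usual weak-default / arch-override pattern: the core code
provides a weak stub that refuses IMS, and the architecture supplies the real
answer. Condensed from patch 2/2, purely for illustration:

/* drivers/base/platform-msi.c: conservative default, IMS stays disabled */
bool __weak arch_support_pci_device_ims(struct pci_dev *pdev)
{
	return false;
}

/* arch/x86/pci/common.c: x86 overrides the weak symbol at link time */
bool arch_support_pci_device_ims(struct pci_dev *pdev)
{
	return on_bare_metal(&pdev->dev);
}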

This series is for comments only. We will include it in the Intel IMS
implementation later once we reach a consensus.

Change log:
v2->v3:
 - v2:
   
https://lore.kernel.org/linux-pci/20210106022749.2769057-1-baolu...@linux.intel.com/
 - Add all identified heuristics so far.

v1->v2:
 - v1:
   
https://lore.kernel.org/linux-pci/20201210004624.345282-1-baolu...@linux.intel.com/
 - Rename probably_on_bare_metal() with on_bare_metal();
 - Some vendors might use the same name for both bare metal and virtual
   environment. Before we add vendor specific code to distinguish
   between them, let's return false in on_bare_metal(). This won't
   introduce any regression. The only impact is that the coming new
   platform msi feature won't be supported until the vendor specific code
   is provided.

Best regards,
baolu

Lu Baolu (2):
  iommu: Add capability IOMMU_CAP_VIOMMU
  platform-msi: Add platform check for subdevice irq domain

 arch/x86/pci/common.c| 71 
 drivers/base/platform-msi.c  |  8 
 drivers/iommu/intel/iommu.c  | 20 ++
 drivers/iommu/virtio-iommu.c |  9 +
 include/linux/iommu.h|  1 +
 include/linux/msi.h  |  1 +
 6 files changed, 110 insertions(+)

-- 
2.25.1



Re: [RFC PATCH v3 4/6] swiotlb: Add restricted DMA alloc/free support.

2021-01-13 Thread Christoph Hellwig
On Wed, Jan 13, 2021 at 06:27:08PM +, Robin Murphy wrote:
>> Can we hook in somewhat lower level in the dma-direct code so that all
>> the remapping in dma-direct can be reused instead of duplicated?  That
>> also becomes important if we want to use non-remapping uncached support,
>> e.g. on mips or x86, or the direct changing of the attributes that Will
>> planned to look into for arm64.
>
> Indeed, AFAICS this ought to boil down to a direct equivalent of 
> __dma_direct_alloc_pages() - other than the address there should be no 
> conceptual difference between pages from the restricted pool and those from 
> the regular page allocator, so this probably deserves to be plumbed in as 
> an alternative to that.

Yes, that's what I mean.  You managed to word it much better, though.


Re: performance regression noted in v5.11-rc after c062db039f40

2021-01-13 Thread Chuck Lever



> On Jan 13, 2021, at 9:07 AM, Chuck Lever  wrote:
> 
> 
> 
>> On Jan 12, 2021, at 9:25 PM, Lu Baolu  wrote:
>> 
>> Hi,
>> 
>> On 1/12/21 10:38 PM, Will Deacon wrote:
>>> [Expanding cc list to include DMA-IOMMU and intel IOMMU folks]
>>> On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
 Hi-
 
 [ Please cc: me on replies, I'm not currently subscribed to
 iommu@lists ].
 
 I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
 at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:
 
 /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
 
 For those not familiar with the way storage protocols use RDMA, The
 initiator/client sets up memory regions and the target/server uses
 RDMA Read and Write to move data out of and into those regions. The
 initiator/client uses only RDMA memory registration and invalidation
 operations, and the target/server uses RDMA Read and Write.
 
 My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
 enabled using the kernel command line options "intel_iommu=on
 iommu=strict".
 
 Recently I've noticed a significant (25-30%) loss in NFS throughput.
 I was able to bisect on my client to the following commits.
 
 Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
 map_sg"). This is about normal for this test.
 
Children see throughput for 12 initial writers  = 4732581.09 kB/sec
Parent sees throughput for 12 initial writers   = 4646810.21 kB/sec
Min throughput per process  =  387764.34 kB/sec
Max throughput per process  =  399655.47 kB/sec
Avg throughput per process  =  394381.76 kB/sec
Min xfer= 1017344.00 kB
CPU Utilization: Wall time    2.671    CPU time    1.974    CPU utilization  73.89 %
Children see throughput for 12 rewriters= 4837741.94 kB/sec
Parent sees throughput for 12 rewriters = 4833509.35 kB/sec
Min throughput per process  =  398983.72 kB/sec
Max throughput per process  =  406199.66 kB/sec
Avg throughput per process  =  403145.16 kB/sec
Min xfer= 1030656.00 kB
CPU utilization: Wall time    2.584    CPU time    1.959    CPU utilization  75.82 %
Children see throughput for 12 readers  = 5921370.94 kB/sec
Parent sees throughput for 12 readers   = 5914106.69 kB/sec
Min throughput per process  =  491812.38 kB/sec
Max throughput per process  =  494777.28 kB/sec
Avg throughput per process  =  493447.58 kB/sec
Min xfer= 1042688.00 kB
CPU utilization: Wall time    2.122    CPU time    1.968    CPU utilization  92.75 %
Children see throughput for 12 re-readers   = 5947985.69 kB/sec
Parent sees throughput for 12 re-readers= 5941348.51 kB/sec
Min throughput per process  =  492805.81 kB/sec
Max throughput per process  =  497280.19 kB/sec
Avg throughput per process  =  495665.47 kB/sec
Min xfer= 1039360.00 kB
CPU utilization: Wall time    2.111    CPU time    1.968    CPU utilization  93.22 %
 
 Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
 iommu_ops.at(de)tach_dev"). It's losing some steam here.
 
Children see throughput for 12 initial writers  = 4342419.12 kB/sec
Parent sees throughput for 12 initial writers   = 4310612.79 kB/sec
Min throughput per process  =  359299.06 kB/sec
Max throughput per process  =  363866.16 kB/sec
Avg throughput per process  =  361868.26 kB/sec
Min xfer= 1035520.00 kB
CPU Utilization: Wall time    2.902    CPU time    1.951    CPU utilization  67.22 %
Children see throughput for 12 rewriters= 4408576.66 kB/sec
Parent sees throughput for 12 rewriters = 4404280.87 kB/sec
Min throughput per process  =  364553.88 kB/sec
Max throughput per process  =  370029.28 kB/sec
Avg throughput per process  =  367381.39 kB/sec
Min xfer= 1033216.00 kB
CPU utilization: Wall time    2.836    CPU time    1.956    CPU utilization  68.97 %
Children see throughput for 12 readers  = 5406879.47 kB/sec
Parent sees throughput for 12 readers   = 

Re: [RFC PATCH v3 4/6] swiotlb: Add restricted DMA alloc/free support.

2021-01-13 Thread Robin Murphy

On 2021-01-13 12:48, Christoph Hellwig wrote:

+#ifdef CONFIG_SWIOTLB
+   if (unlikely(dev->dma_io_tlb_mem))
+   return swiotlb_alloc(dev, size, dma_handle, attrs);
+#endif


Another place where the dma_io_tlb_mem is useful to avoid the ifdef.


-phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t orig_addr,
-   size_t mapping_size, size_t alloc_size,
-   enum dma_data_direction dir, unsigned long attrs)
+static int swiotlb_tbl_find_free_region(struct device *hwdev,
+   dma_addr_t tbl_dma_addr,
+   size_t alloc_size,
+   unsigned long attrs)



+static void swiotlb_tbl_release_region(struct device *hwdev, int index,
+  size_t size)


This refactoring should be another prep patch.



+void *swiotlb_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle,
+   unsigned long attrs)


I'd rather have the names convey there are for the per-device bounce
buffer in some form.


+   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;


While we're at it I wonder if the io_tlb is something we could change
while we're at it.  Maybe replace io_tlb_mem with struct swiotlb
and rename the field in struct device to dev_swiotlb?


+   int index;
+   void *vaddr;
+   phys_addr_t tlb_addr;
+
+   size = PAGE_ALIGN(size);
+   index = swiotlb_tbl_find_free_region(dev, mem->start, size, attrs);
+   if (index < 0)
+   return NULL;
+
+   tlb_addr = mem->start + (index << IO_TLB_SHIFT);
+   *dma_handle = phys_to_dma_unencrypted(dev, tlb_addr);
+
+   if (!dev_is_dma_coherent(dev)) {
+   unsigned long pfn = PFN_DOWN(tlb_addr);
+
+   /* remove any dirty cache lines on the kernel alias */
+   arch_dma_prep_coherent(pfn_to_page(pfn), size);


Can we hook in somewhat lower level in the dma-direct code so that all
the remapping in dma-direct can be reused instead of duplicated?  That
also becomes important if we want to use non-remapping uncached support,
e.g. on mips or x86, or the direct changing of the attributes that Will
planned to look into for arm64.


Indeed, AFAICS this ought to boil down to a direct equivalent of 
__dma_direct_alloc_pages() - other than the address there should be no 
conceptual difference between pages from the restricted pool and those 
from the regular page allocator, so this probably deserves to be plumbed 
in as an alternative to that.
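
Very roughly, and only to illustrate the shape being suggested here - the
signatures are simplified and swiotlb_alloc_pages() is a placeholder name,
not an existing helper:

static struct page *dma_direct_alloc_pages_or_pool(struct device *dev,
						   size_t size, gfp_t gfp)
{
	/* Device has a restricted DMA pool: allocate from it... */
	if (dev->dma_io_tlb_mem)
		return swiotlb_alloc_pages(dev, size);

	/* ...otherwise fall back to the normal dma-direct page allocation. */
	return __dma_direct_alloc_pages(dev, size, gfp);
}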


Robin.


Re: [RFC PATCH v3 2/6] swiotlb: Add restricted DMA pool

2021-01-13 Thread Robin Murphy

On 2021-01-13 17:43, Florian Fainelli wrote:

On 1/13/21 7:27 AM, Robin Murphy wrote:

On 2021-01-13 13:59, Nicolas Saenz Julienne wrote:

Hi All,

On Tue, 2021-01-12 at 16:03 -0800, Florian Fainelli wrote:

On 1/5/21 7:41 PM, Claire Chang wrote:

Add the initialization function to create restricted DMA pools from
matching reserved-memory nodes in the device tree.

Signed-off-by: Claire Chang 
---
   include/linux/device.h  |   4 ++
   include/linux/swiotlb.h |   7 +-
   kernel/dma/Kconfig  |   1 +
   kernel/dma/swiotlb.c    | 144
++--
   4 files changed, 131 insertions(+), 25 deletions(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index 89bb8b84173e..ca6f71ec8871 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -413,6 +413,7 @@ struct dev_links_info {
    * @dma_pools:    Dma pools (if dma'ble device).
    * @dma_mem:    Internal for coherent mem override.
    * @cma_area:    Contiguous memory area for dma allocations
+ * @dma_io_tlb_mem: Internal for swiotlb io_tlb_mem override.
    * @archdata:    For arch-specific additions.
    * @of_node:    Associated device tree node.
    * @fwnode:    Associated device node supplied by platform firmware.
@@ -515,6 +516,9 @@ struct device {
   #ifdef CONFIG_DMA_CMA
   struct cma *cma_area;    /* contiguous memory area for dma
  allocations */
+#endif
+#ifdef CONFIG_SWIOTLB
+    struct io_tlb_mem    *dma_io_tlb_mem;
   #endif
   /* arch specific additions */
   struct dev_archdata    archdata;
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index dd8eb57cbb8f..a1bbd775 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -76,12 +76,13 @@ extern enum swiotlb_force swiotlb_force;
    *
    * @start:    The start address of the swiotlb memory pool. Used
to do a quick
    *    range check to see if the memory was in fact allocated
by this
- *    API.
+ *    API. For restricted DMA pool, this is device tree
adjustable.


Maybe write it as this is "firmware adjustable" such that when/if ACPI
needs something like this, the description does not need updating.


TBH I really don't think this needs calling out at all. Even in the
regular case, the details of exactly how and where the pool is allocated
are beyond the scope of this code - architectures already have several
ways to control that and make their own decisions.



[snip]


+static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
+    struct device *dev)
+{
+    struct io_tlb_mem *mem = rmem->priv;
+    int ret;
+
+    if (dev->dma_io_tlb_mem)
+    return -EBUSY;
+
+    if (!mem) {
+    mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+    if (!mem)
+    return -ENOMEM;
+
+    if (!memremap(rmem->base, rmem->size, MEMREMAP_WB)) {


MEMREMAP_WB sounds appropriate as a default.


As per the binding 'no-map' has to be disabled here. So AFAIU, this
memory will
be part of the linear mapping. Is this really needed then?


More than that, I'd assume that we *have* to use the linear/direct map
address rather than anything that has any possibility of being a vmalloc
remap, otherwise we can no longer safely rely on
phys_to_dma/dma_to_phys, no?


I believe you are right, which means that if we want to make use of the
restricted DMA pool on a 32-bit architecture (and we do, at least, I do)
we should probably add some error checking/warning to ensure the
restricted DMA pool falls within the linear map.


Oh, good point - I'm so used to 64-bit that I instinctively just blanked 
out the !PageHighMem() condition in try_ram_remap(). So maybe the 
original intent here *was* to effectively just implement that check, but 
if so it could still do with being a lot more explicit.


Cheers,
Robin.

Re: [RFC PATCH v3 2/6] swiotlb: Add restricted DMA pool

2021-01-13 Thread Florian Fainelli
On 1/13/21 7:27 AM, Robin Murphy wrote:
> On 2021-01-13 13:59, Nicolas Saenz Julienne wrote:
>> Hi All,
>>
>> On Tue, 2021-01-12 at 16:03 -0800, Florian Fainelli wrote:
>>> On 1/5/21 7:41 PM, Claire Chang wrote:
 Add the initialization function to create restricted DMA pools from
 matching reserved-memory nodes in the device tree.

 Signed-off-by: Claire Chang 
 ---
   include/linux/device.h  |   4 ++
   include/linux/swiotlb.h |   7 +-
   kernel/dma/Kconfig  |   1 +
   kernel/dma/swiotlb.c    | 144
 ++--
   4 files changed, 131 insertions(+), 25 deletions(-)

 diff --git a/include/linux/device.h b/include/linux/device.h
 index 89bb8b84173e..ca6f71ec8871 100644
 --- a/include/linux/device.h
 +++ b/include/linux/device.h
 @@ -413,6 +413,7 @@ struct dev_links_info {
    * @dma_pools:    Dma pools (if dma'ble device).
    * @dma_mem:    Internal for coherent mem override.
    * @cma_area:    Contiguous memory area for dma allocations
 + * @dma_io_tlb_mem: Internal for swiotlb io_tlb_mem override.
    * @archdata:    For arch-specific additions.
    * @of_node:    Associated device tree node.
    * @fwnode:    Associated device node supplied by platform firmware.
 @@ -515,6 +516,9 @@ struct device {
   #ifdef CONFIG_DMA_CMA
   struct cma *cma_area;    /* contiguous memory area for dma
  allocations */
 +#endif
 +#ifdef CONFIG_SWIOTLB
 +    struct io_tlb_mem    *dma_io_tlb_mem;
   #endif
   /* arch specific additions */
   struct dev_archdata    archdata;
 diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
 index dd8eb57cbb8f..a1bbd775 100644
 --- a/include/linux/swiotlb.h
 +++ b/include/linux/swiotlb.h
 @@ -76,12 +76,13 @@ extern enum swiotlb_force swiotlb_force;
    *
    * @start:    The start address of the swiotlb memory pool. Used
 to do a quick
    *    range check to see if the memory was in fact allocated
 by this
 - *    API.
 + *    API. For restricted DMA pool, this is device tree
 adjustable.
>>>
>>> Maybe write it as this is "firmware adjustable" such that when/if ACPI
>>> needs something like this, the description does not need updating.
> 
> TBH I really don't think this needs calling out at all. Even in the
> regular case, the details of exactly how and where the pool is allocated
> are beyond the scope of this code - architectures already have several
> ways to control that and make their own decisions.
> 
>>>
>>> [snip]
>>>
 +static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
 +    struct device *dev)
 +{
 +    struct io_tlb_mem *mem = rmem->priv;
 +    int ret;
 +
 +    if (dev->dma_io_tlb_mem)
 +    return -EBUSY;
 +
 +    if (!mem) {
 +    mem = kzalloc(sizeof(*mem), GFP_KERNEL);
 +    if (!mem)
 +    return -ENOMEM;
 +
 +    if (!memremap(rmem->base, rmem->size, MEMREMAP_WB)) {
>>>
>>> MEMREMAP_WB sounds appropriate as a default.
>>
>> As per the binding 'no-map' has to be disabled here. So AFAIU, this
>> memory will
>> be part of the linear mapping. Is this really needed then?
> 
> More than that, I'd assume that we *have* to use the linear/direct map
> address rather than anything that has any possibility of being a vmalloc
> remap, otherwise we can no longer safely rely on
> phys_to_dma/dma_to_phys, no?

I believe you are right, which means that if we want to make use of the
restricted DMA pool on a 32-bit architecture (and we do, at least, I do)
we should probably add some error checking/warning to ensure the
restricted DMA pool falls within the linear map.
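
Something along these lines in rmem_swiotlb_device_init(), purely as a sketch
of the check I mean (error handling and exact placement to be decided):

	/*
	 * Reject a restricted DMA pool that is not covered by the kernel
	 * linear mapping, since phys_to_virt()/phys_to_dma() are only safe
	 * for lowmem pages.
	 */
	if (PageHighMem(pfn_to_page(PFN_DOWN(rmem->base))) ||
	    PageHighMem(pfn_to_page(PFN_DOWN(rmem->base + rmem->size - 1))))
		return -EINVAL;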
-- 
Florian

Re: [PATCH v13 00/15] SMMUv3 Nested Stage Setup (IOMMU part)

2021-01-13 Thread Auger Eric
Hi Shameer,

On 1/8/21 6:05 PM, Shameerali Kolothum Thodi wrote:
> Hi Eric,
> 
>> -Original Message-
>> From: Eric Auger [mailto:eric.au...@redhat.com]
>> Sent: 18 November 2020 11:22
>> To: eric.auger@gmail.com; eric.au...@redhat.com;
>> iommu@lists.linux-foundation.org; linux-ker...@vger.kernel.org;
>> k...@vger.kernel.org; kvm...@lists.cs.columbia.edu; w...@kernel.org;
>> j...@8bytes.org; m...@kernel.org; robin.mur...@arm.com;
>> alex.william...@redhat.com
>> Cc: jean-phili...@linaro.org; zhangfei@linaro.org;
>> zhangfei@gmail.com; vivek.gau...@arm.com; Shameerali Kolothum
>> Thodi ;
>> jacob.jun@linux.intel.com; yi.l@intel.com; t...@semihalf.com;
>> nicoleots...@gmail.com; yuzenghui 
>> Subject: [PATCH v13 00/15] SMMUv3 Nested Stage Setup (IOMMU part)
>>
>> This series brings the IOMMU part of HW nested paging support
>> in the SMMUv3. The VFIO part is submitted separately.
>>
>> The IOMMU API is extended to support 2 new API functionalities:
>> 1) pass the guest stage 1 configuration
>> 2) pass stage 1 MSI bindings
>>
>> Then those capabilities gets implemented in the SMMUv3 driver.
>>
>> The virtualizer passes information through the VFIO user API
>> which cascades them to the iommu subsystem. This allows the guest
>> to own stage 1 tables and context descriptors (so-called PASID
>> table) while the host owns stage 2 tables and main configuration
>> structures (STE).
> 
> I am seeing an issue with Guest testpmd run with this series.
> I have two different setups and testpmd works fine with the
> first one but not with the second.
> 
> 1). Guest doesn't have kernel driver built-in for pass-through dev.
> 
> root@ubuntu:/# lspci -v
> ...
> 00:02.0 Ethernet controller: Huawei Technologies Co., Ltd. Device a22e (rev 
> 21)
> Subsystem: Huawei Technologies Co., Ltd. Device 
> Flags: fast devsel
> Memory at 800010 (64-bit, prefetchable) [disabled] [size=64K]
> Memory at 80 (64-bit, prefetchable) [disabled] [size=1M]
> Capabilities: [40] Express Root Complex Integrated Endpoint, MSI 00
> Capabilities: [a0] MSI-X: Enable- Count=67 Masked-
> Capabilities: [b0] Power Management version 3
> Capabilities: [100] Access Control Services
> Capabilities: [300] Transaction Processing Hints
> 
> root@ubuntu:/# echo vfio-pci > 
> /sys/bus/pci/devices/:00:02.0/driver_override
> root@ubuntu:/# echo :00:02.0 > /sys/bus/pci/drivers_probe
> 
> root@ubuntu:/mnt/dpdk/build/app# ./testpmd -w :00:02.0 --file-prefix 
> socket0  -l 0-1 -n 2 -- -i
> EAL: Detected 8 lcore(s)
> EAL: Detected 1 NUMA nodes
> EAL: Multi-process socket /var/run/dpdk/socket0/mp_socket
> EAL: Selected IOVA mode 'VA'
> EAL: No available hugepages reported in hugepages-32768kB
> EAL: No available hugepages reported in hugepages-64kB
> EAL: No available hugepages reported in hugepages-1048576kB
> EAL: Probing VFIO support...
> EAL: VFIO support initialized
> EAL:   Invalid NUMA socket, default to 0
> EAL:   using IOMMU type 1 (Type 1)
> EAL: Probe PCI driver: net_hns3_vf (19e5:a22e) device: :00:02.0 (socket 0)
> EAL: No legacy callbacks, legacy socket not created
> Interactive-mode selected
> testpmd: create a new mbuf pool : n=155456, size=2176, 
> socket=0
> testpmd: preferred mempool ops selected: ring_mp_mc
> 
> Warning! port-topology=paired and odd forward ports number, the last port 
> will pair with itself.
> 
> Configuring Port 0 (socket 0)
> Port 0: 8E:A6:8C:43:43:45
> Checking link statuses...
> Done
> testpmd>
> 
> 2). Guest have kernel driver built-in for pass-through dev.
> 
> root@ubuntu:/# lspci -v
> ...
> 00:02.0 Ethernet controller: Huawei Technologies Co., Ltd. Device a22e (rev 
> 21)
> Subsystem: Huawei Technologies Co., Ltd. Device 
> Flags: bus master, fast devsel, latency 0
> Memory at 800010 (64-bit, prefetchable) [size=64K]
> Memory at 80 (64-bit, prefetchable) [size=1M]
> Capabilities: [40] Express Root Complex Integrated Endpoint, MSI 00
> Capabilities: [a0] MSI-X: Enable+ Count=67 Masked-
> Capabilities: [b0] Power Management version 3
> Capabilities: [100] Access Control Services
> Capabilities: [300] Transaction Processing Hints
> Kernel driver in use: hns3
> 
> root@ubuntu:/# echo vfio-pci > 
> /sys/bus/pci/devices/:00:02.0/driver_override
> root@ubuntu:/# echo :00:02.0 > /sys/bus/pci/drivers/hns3/unbind
> root@ubuntu:/# echo :00:02.0 > /sys/bus/pci/drivers_probe
> 
> root@ubuntu:/mnt/dpdk/build/app# ./testpmd -w :00:02.0 --file-prefix 
> socket0 -l 0-1 -n 2 -- -i
> EAL: Detected 8 lcore(s)
> EAL: Detected 1 NUMA nodes
> EAL: Multi-process socket /var/run/dpdk/socket0/mp_socket
> EAL: Selected IOVA mode 'VA'
> EAL: No available hugepages reported in hugepages-32768kB
> EAL: No available hugepages reported in hugepages-64kB
> EAL: No available hugepages reported in hugepages-1048576kB
> EAL: Probing VFIO support...
> EAL: VFIO support initialized
> EAL:   Invalid NUMA socket, default to 0
> EAL:   using IOMMU type 1 (Type 1)
> EAL: Probe 

Re: [RFC PATCH v3 2/6] swiotlb: Add restricted DMA pool

2021-01-13 Thread Robin Murphy

On 2021-01-13 13:59, Nicolas Saenz Julienne wrote:

Hi All,

On Tue, 2021-01-12 at 16:03 -0800, Florian Fainelli wrote:

On 1/5/21 7:41 PM, Claire Chang wrote:

Add the initialization function to create restricted DMA pools from
matching reserved-memory nodes in the device tree.

Signed-off-by: Claire Chang 
---
  include/linux/device.h  |   4 ++
  include/linux/swiotlb.h |   7 +-
  kernel/dma/Kconfig  |   1 +
  kernel/dma/swiotlb.c| 144 ++--
  4 files changed, 131 insertions(+), 25 deletions(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index 89bb8b84173e..ca6f71ec8871 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -413,6 +413,7 @@ struct dev_links_info {
   * @dma_pools:Dma pools (if dma'ble device).
   * @dma_mem:  Internal for coherent mem override.
   * @cma_area: Contiguous memory area for dma allocations
+ * @dma_io_tlb_mem: Internal for swiotlb io_tlb_mem override.
   * @archdata: For arch-specific additions.
   * @of_node:  Associated device tree node.
   * @fwnode:   Associated device node supplied by platform firmware.
@@ -515,6 +516,9 @@ struct device {
  #ifdef CONFIG_DMA_CMA
    struct cma *cma_area;   /* contiguous memory area for dma
   allocations */
+#endif
+#ifdef CONFIG_SWIOTLB
+   struct io_tlb_mem   *dma_io_tlb_mem;
  #endif
    /* arch specific additions */
    struct dev_archdata archdata;
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index dd8eb57cbb8f..a1bbd775 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -76,12 +76,13 @@ extern enum swiotlb_force swiotlb_force;
   *
   * @start:The start address of the swiotlb memory pool. Used to do a quick
   *range check to see if the memory was in fact allocated by this
- * API.
+ * API. For restricted DMA pool, this is device tree adjustable.


Maybe write it as this is "firmware adjustable" such that when/if ACPI
needs something like this, the description does not need updating.


TBH I really don't think this needs calling out at all. Even in the 
regular case, the details of exactly how and where the pool is allocated 
are beyond the scope of this code - architectures already have several 
ways to control that and make their own decisions.




[snip]


+static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
+   struct device *dev)
+{
+   struct io_tlb_mem *mem = rmem->priv;
+   int ret;
+
+   if (dev->dma_io_tlb_mem)
+   return -EBUSY;
+
+   if (!mem) {
+   mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+   if (!mem)
+   return -ENOMEM;
+
+   if (!memremap(rmem->base, rmem->size, MEMREMAP_WB)) {


MEMREMAP_WB sounds appropriate as a default.


As per the binding 'no-map' has to be disabled here. So AFAIU, this memory will
be part of the linear mapping. Is this really needed then?


More than that, I'd assume that we *have* to use the linear/direct map 
address rather than anything that has any possibility of being a vmalloc 
remap, otherwise we can no longer safely rely on 
phys_to_dma/dma_to_phys, no?


That said, given that we're not actually using the returned address, I'm 
not entirely sure what the point of this call is anyway. If we can 
assume it's always going to return the linear map address via 
try_ram_remap() then we can equally just go ahead and use the linear map 
address straight away. I don't really see how we could ever hit the 
"is_ram == REGION_MIXED" case in memremap() that would return NULL, if 
we passed the memblock check earlier in __reserved_mem_alloc_size() such 
that this rmem node ever got to be initialised at all.


Robin.


Documentation/devicetree/bindings/reserved-memory/ramoops.txt does
define an "unbuffered" property which in premise could be applied to the
generic reserved memory binding as well and that we may have to be
honoring here, if we were to make it more generic. Oh well, this does
not need to be addressed right now I guess.






Re: [PATCH 1/5] vfio/iommu_type1: Fixes vfio_dma_populate_bitmap to avoid dirty lose

2021-01-13 Thread Alex Williamson
On Wed, 13 Jan 2021 18:05:43 +0530
Kirti Wankhede  wrote:

> On 1/13/2021 2:50 AM, Alex Williamson wrote:
> > On Thu, 7 Jan 2021 17:28:57 +0800
> > Keqian Zhu  wrote:
> >   
> >> Defer checking whether vfio_dma is of fully-dirty in update_user_bitmap
> >> is easy to lose dirty log. For example, after promoting pinned_scope of
> >> vfio_iommu, vfio_dma is not considered as fully-dirty, then we may lose
> >> dirty log that occurs before vfio_iommu is promoted.
> >>
> >> The key point is that pinned-dirty is not a real dirty tracking way, it
> >> can't continuously track dirty pages, but just restrict dirty scope. It
> >> is essentially the same as fully-dirty. Fully-dirty is of full-scope and
> >> pinned-dirty is of pinned-scope.
> >>
> >> So we must mark pinned-dirty or fully-dirty after we start dirty tracking
> >> or clear dirty bitmap, to ensure that dirty log is marked right away.  
> > 
> > I was initially convinced by these first three patches, but upon
> > further review, I think the premise is wrong.  AIUI, the concern across
> > these patches is that our dirty bitmap is only populated with pages
> > dirtied by pinning and we only take into account the pinned page dirty
> > scope at the time the bitmap is retrieved by the user.  You suppose
> > this presents a gap where if a vendor driver has not yet identified
> > with a page pinning scope that the entire bitmap should be considered
> > dirty regardless of whether that driver later pins pages prior to the
> > user retrieving the dirty bitmap.
> > 
> > I don't think this is how we intended the cooperation between the iommu
> > driver and vendor driver to work.  By pinning pages a vendor driver is
> > not declaring that only their future dirty page scope is limited to
> > pinned pages, instead they're declaring themselves as a participant in
> > dirty page tracking and take responsibility for pinning any necessary
> > pages.  For example we might extend VFIO_IOMMU_DIRTY_PAGES_FLAG_START
> > to trigger a blocking notification to groups to not only begin dirty
> > tracking, but also to synchronously register their current device DMA
> > footprint.  This patch would require a vendor driver to possibly perform
> > a gratuitous page pinning in order to set the scope prior to dirty
> > logging being enabled, or else the initial bitmap will be fully dirty.
> > 
> > Therefore, I don't see that this series is necessary or correct.  Kirti,
> > does this match your thinking?
> >   
> 
> That's correct Alex and I agree with you.
> 
> > Thinking about these semantics, it seems there might still be an issue
> > if a group with non-pinned-page dirty scope is detached with dirty
> > logging enabled.
> 
> Hot-unplug a device while migration process has started - is this 
> scenario supported?

It's not prevented, it would rely on a userspace policy, right?  The
kernel should do the right thing regardless.  Thanks,

Alex

> > It seems this should in fact fully populate the dirty
> > bitmaps at the time it's removed since we don't know the extent of its
> > previous DMA, nor will the group be present to trigger the full bitmap
> > when the user retrieves the dirty bitmap.  Creating fully populated
> > bitmaps at the time tracking is enabled negates our ability to take
> > advantage of later enlightenment though.  Thanks,
> > 
> > Alex
> >   
> >> Fixes: d6a4c185660c ("vfio iommu: Implementation of ioctl for dirty pages 
> >> tracking")
> >> Signed-off-by: Keqian Zhu 
> >> ---
> >>   drivers/vfio/vfio_iommu_type1.c | 33 ++---
> >>   1 file changed, 22 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c 
> >> b/drivers/vfio/vfio_iommu_type1.c
> >> index bceda5e8baaa..b0a26e8e0adf 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -224,7 +224,7 @@ static void vfio_dma_bitmap_free(struct vfio_dma *dma)
> >>dma->bitmap = NULL;
> >>   }
> >>   
> >> -static void vfio_dma_populate_bitmap(struct vfio_dma *dma, size_t pgsize)
> >> +static void vfio_dma_populate_bitmap_pinned(struct vfio_dma *dma, size_t 
> >> pgsize)
> >>   {
> >>struct rb_node *p;
> >>unsigned long pgshift = __ffs(pgsize);
> >> @@ -236,6 +236,25 @@ static void vfio_dma_populate_bitmap(struct vfio_dma 
> >> *dma, size_t pgsize)
> >>}
> >>   }
> >>   
> >> +static void vfio_dma_populate_bitmap_full(struct vfio_dma *dma, size_t 
> >> pgsize)
> >> +{
> >> +  unsigned long pgshift = __ffs(pgsize);
> >> +  unsigned long nbits = dma->size >> pgshift;
> >> +
> >> +  bitmap_set(dma->bitmap, 0, nbits);
> >> +}
> >> +
> >> +static void vfio_dma_populate_bitmap(struct vfio_iommu *iommu,
> >> +   struct vfio_dma *dma)
> >> +{
> >> +  size_t pgsize = (size_t)1 << __ffs(iommu->pgsize_bitmap);
> >> +
> >> +  if (iommu->pinned_page_dirty_scope)
> >> +  vfio_dma_populate_bitmap_pinned(dma, pgsize);
> >> +  else if (dma->iommu_mapped)
> >> +  

Re: performance regression noted in v5.11-rc after c062db039f40

2021-01-13 Thread Chuck Lever



> On Jan 12, 2021, at 9:25 PM, Lu Baolu  wrote:
> 
> Hi,
> 
> On 1/12/21 10:38 PM, Will Deacon wrote:
>> [Expanding cc list to include DMA-IOMMU and intel IOMMU folks]
>> On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
>>> Hi-
>>> 
>>> [ Please cc: me on replies, I'm not currently subscribed to
>>> iommu@lists ].
>>> 
>>> I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
>>> at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:
>>> 
>>> /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
>>> 
>>> For those not familiar with the way storage protocols use RDMA, The
>>> initiator/client sets up memory regions and the target/server uses
>>> RDMA Read and Write to move data out of and into those regions. The
>>> initiator/client uses only RDMA memory registration and invalidation
>>> operations, and the target/server uses RDMA Read and Write.
>>> 
>>> My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
>>> enabled using the kernel command line options "intel_iommu=on
>>> iommu=strict".
>>> 
>>> Recently I've noticed a significant (25-30%) loss in NFS throughput.
>>> I was able to bisect on my client to the following commits.
>>> 
>>> Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
>>> map_sg"). This is about normal for this test.
>>> 
>>> Children see throughput for 12 initial writers  = 4732581.09 kB/sec
>>> Parent sees throughput for 12 initial writers   = 4646810.21 kB/sec
>>> Min throughput per process  =  387764.34 kB/sec
>>> Max throughput per process  =  399655.47 kB/sec
>>> Avg throughput per process  =  394381.76 kB/sec
>>> Min xfer= 1017344.00 kB
>>> CPU Utilization: Wall time    2.671    CPU time    1.974    CPU utilization  73.89 %
>>> Children see throughput for 12 rewriters= 4837741.94 kB/sec
>>> Parent sees throughput for 12 rewriters = 4833509.35 kB/sec
>>> Min throughput per process  =  398983.72 kB/sec
>>> Max throughput per process  =  406199.66 kB/sec
>>> Avg throughput per process  =  403145.16 kB/sec
>>> Min xfer= 1030656.00 kB
>>> CPU utilization: Wall time    2.584    CPU time    1.959    CPU utilization  75.82 %
>>> Children see throughput for 12 readers  = 5921370.94 kB/sec
>>> Parent sees throughput for 12 readers   = 5914106.69 kB/sec
>>> Min throughput per process  =  491812.38 kB/sec
>>> Max throughput per process  =  494777.28 kB/sec
>>> Avg throughput per process  =  493447.58 kB/sec
>>> Min xfer= 1042688.00 kB
>>> CPU utilization: Wall time    2.122    CPU time    1.968    CPU utilization  92.75 %
>>> Children see throughput for 12 re-readers   = 5947985.69 kB/sec
>>> Parent sees throughput for 12 re-readers= 5941348.51 kB/sec
>>> Min throughput per process  =  492805.81 kB/sec
>>> Max throughput per process  =  497280.19 kB/sec
>>> Avg throughput per process  =  495665.47 kB/sec
>>> Min xfer= 1039360.00 kB
>>> CPU utilization: Wall time    2.111    CPU time    1.968    CPU utilization  93.22 %
>>> 
>>> Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
>>> iommu_ops.at(de)tach_dev"). It's losing some steam here.
>>> 
>>> Children see throughput for 12 initial writers  = 4342419.12 kB/sec
>>> Parent sees throughput for 12 initial writers   = 4310612.79 kB/sec
>>> Min throughput per process  =  359299.06 kB/sec
>>> Max throughput per process  =  363866.16 kB/sec
>>> Avg throughput per process  =  361868.26 kB/sec
>>> Min xfer= 1035520.00 kB
>>> CPU Utilization: Wall time    2.902    CPU time    1.951    CPU utilization  67.22 %
>>> Children see throughput for 12 rewriters= 4408576.66 kB/sec
>>> Parent sees throughput for 12 rewriters = 4404280.87 kB/sec
>>> Min throughput per process  =  364553.88 kB/sec
>>> Max throughput per process  =  370029.28 kB/sec
>>> Avg throughput per process  =  367381.39 kB/sec
>>> Min xfer= 1033216.00 kB
>>> CPU utilization: Wall time    2.836    CPU time    1.956    CPU utilization  68.97 %
>>> Children see throughput for 12 readers  = 5406879.47 kB/sec
>>> Parent sees throughput for 12 readers   = 5401862.78 kB/sec
>>> Min throughput per process  =  449583.03 kB/sec
>>> 

Re: [RFC PATCH v3 2/6] swiotlb: Add restricted DMA pool

2021-01-13 Thread Nicolas Saenz Julienne
Hi All,

On Tue, 2021-01-12 at 16:03 -0800, Florian Fainelli wrote:
> On 1/5/21 7:41 PM, Claire Chang wrote:
> > Add the initialization function to create restricted DMA pools from
> > matching reserved-memory nodes in the device tree.
> > 
> > Signed-off-by: Claire Chang 
> > ---
> >  include/linux/device.h  |   4 ++
> >  include/linux/swiotlb.h |   7 +-
> >  kernel/dma/Kconfig  |   1 +
> >  kernel/dma/swiotlb.c| 144 ++--
> >  4 files changed, 131 insertions(+), 25 deletions(-)
> > 
> > diff --git a/include/linux/device.h b/include/linux/device.h
> > index 89bb8b84173e..ca6f71ec8871 100644
> > --- a/include/linux/device.h
> > +++ b/include/linux/device.h
> > @@ -413,6 +413,7 @@ struct dev_links_info {
> >   * @dma_pools: Dma pools (if dma'ble device).
> >   * @dma_mem:   Internal for coherent mem override.
> >   * @cma_area:  Contiguous memory area for dma allocations
> > + * @dma_io_tlb_mem: Internal for swiotlb io_tlb_mem override.
> >   * @archdata:  For arch-specific additions.
> >   * @of_node:   Associated device tree node.
> >   * @fwnode:Associated device node supplied by platform firmware.
> > @@ -515,6 +516,9 @@ struct device {
> >  #ifdef CONFIG_DMA_CMA
> >     struct cma *cma_area;   /* contiguous memory area for dma
> >    allocations */
> > +#endif
> > +#ifdef CONFIG_SWIOTLB
> > +   struct io_tlb_mem   *dma_io_tlb_mem;
> >  #endif
> >     /* arch specific additions */
> >     struct dev_archdata archdata;
> > diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> > index dd8eb57cbb8f..a1bbd775 100644
> > --- a/include/linux/swiotlb.h
> > +++ b/include/linux/swiotlb.h
> > @@ -76,12 +76,13 @@ extern enum swiotlb_force swiotlb_force;
> >   *
> >   * @start: The start address of the swiotlb memory pool. Used to do a quick
> >   * range check to see if the memory was in fact allocated by this
> > - * API.
> > + * API. For restricted DMA pool, this is device tree adjustable.
> 
> Maybe write it as this is "firmware adjustable" such that when/if ACPI
> needs something like this, the description does not need updating.
> 
> [snip]
> 
> > +static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
> > +   struct device *dev)
> > +{
> > +   struct io_tlb_mem *mem = rmem->priv;
> > +   int ret;
> > +
> > +   if (dev->dma_io_tlb_mem)
> > +   return -EBUSY;
> > +
> > +   if (!mem) {
> > +   mem = kzalloc(sizeof(*mem), GFP_KERNEL);
> > +   if (!mem)
> > +   return -ENOMEM;
> > +
> > +   if (!memremap(rmem->base, rmem->size, MEMREMAP_WB)) {
> 
> MEMREMAP_WB sounds appropriate as a default.

As per the binding 'no-map' has to be disabled here. So AFAIU, this memory will
be part of the linear mapping. Is this really needed then?

> Documentation/devicetree/bindings/reserved-memory/ramoops.txt does
> define an "unbuffered" property which in premise could be applied to the
> generic reserved memory binding as well and that we may have to be
> honoring here, if we were to make it more generic. Oh well, this does
> not need to be addressed right now I guess.






Re: [RFC PATCH 1/2] dt-bindings: iommu: add bindings for sprd iommu

2021-01-13 Thread Rob Herring
On Fri, Jan 8, 2021 at 5:34 AM Chunyan Zhang  wrote:
>
> On Fri, 8 Jan 2021 at 10:25, Rob Herring  wrote:
> >
> > On Wed, Dec 23, 2020 at 07:16:32PM +0800, Chunyan Zhang wrote:
> > > From: Chunyan Zhang 
> > >
> > > This patch only adds bindings to support display iommu, support for others
> > > would be added once finished tests with those devices, such as Image
> > > codec(jpeg) processor, a few signal processors, including VSP(video),
> > > GSP(graphic), ISP(image), and camera CPP, etc.
> > >
> > > Signed-off-by: Chunyan Zhang 
> > > ---
> > >  .../devicetree/bindings/iommu/sprd,iommu.yaml | 44 +++
> > >  1 file changed, 44 insertions(+)
> > >  create mode 100644 
> > > Documentation/devicetree/bindings/iommu/sprd,iommu.yaml
> > >
> > > diff --git a/Documentation/devicetree/bindings/iommu/sprd,iommu.yaml 
> > > b/Documentation/devicetree/bindings/iommu/sprd,iommu.yaml
> > > new file mode 100644
> > > index ..4d9a578a7cc9
> > > --- /dev/null
> > > +++ b/Documentation/devicetree/bindings/iommu/sprd,iommu.yaml
> > > @@ -0,0 +1,44 @@
> > > +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> > > +# Copyright 2020 Unisoc Inc.
> > > +%YAML 1.2
> > > +---
> > > +$id: http://devicetree.org/schemas/iommu/sprd,iommu.yaml#
> > > +$schema: http://devicetree.org/meta-schemas/core.yaml#
> > > +
> > > +title: Unisoc IOMMU and Multi-media MMU
> > > +
> > > +maintainers:
> > > +  - Chunyan Zhang 
> > > +
> > > +properties:
> > > +  compatible:
> > > +enum:
> > > +  - sprd,iommu-disp
> >
> > Needs to be Soc specific.
>
> All SoCs so far use the same iommu IP, there's a little different
> among different iommu users.

That's what everyone says. Be warned that you cannot add properties
for any differences that come up whether features or errata.

> > Is this block specific to display subsys or
> > that just happens to be where the instance is?
>
> This iommu driver can serve many subsystem devices, such as Video,
> Camera, Image, etc., but they have their own iommu module which looks
> like a subdevice embedded in the master devices.
> I will add more compatible strings for those devices when needed.
> For now, only this one was listed here because I just tested this
> iommu driver with DPU only.

The iommu binding takes care of what each one is connected to. Is each
instance different in terms of features or programming model? If not,
then you shouldn't have different compatible strings for each
instance.

Rob


Re: [RFC PATCH v3 4/6] swiotlb: Add restricted DMA alloc/free support.

2021-01-13 Thread Christoph Hellwig
> +#ifdef CONFIG_SWIOTLB
> + if (unlikely(dev->dma_io_tlb_mem))
> + return swiotlb_alloc(dev, size, dma_handle, attrs);
> +#endif

Another place where the dma_io_tlb_mem is useful to avoid the ifdef.

> -phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t orig_addr,
> - size_t mapping_size, size_t alloc_size,
> - enum dma_data_direction dir, unsigned long attrs)
> +static int swiotlb_tbl_find_free_region(struct device *hwdev,
> + dma_addr_t tbl_dma_addr,
> + size_t alloc_size,
> + unsigned long attrs)

> +static void swiotlb_tbl_release_region(struct device *hwdev, int index,
> +size_t size)

This refactoring should be another prep patch.


> +void *swiotlb_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle,
> + unsigned long attrs)

I'd rather have the names convey there are for the per-device bounce
buffer in some form.

> + struct io_tlb_mem *mem = dev->dma_io_tlb_mem;

While we're at it I wonder if the io_tlb is something we could change
while we're at it.  Maybe replace io_tlb_mem with struct swiotlb
and rename the field in struct device to dev_swiotlb?

> + int index;
> + void *vaddr;
> + phys_addr_t tlb_addr;
> +
> + size = PAGE_ALIGN(size);
> + index = swiotlb_tbl_find_free_region(dev, mem->start, size, attrs);
> + if (index < 0)
> + return NULL;
> +
> + tlb_addr = mem->start + (index << IO_TLB_SHIFT);
> + *dma_handle = phys_to_dma_unencrypted(dev, tlb_addr);
> +
> + if (!dev_is_dma_coherent(dev)) {
> + unsigned long pfn = PFN_DOWN(tlb_addr);
> +
> + /* remove any dirty cache lines on the kernel alias */
> + arch_dma_prep_coherent(pfn_to_page(pfn), size);

Can we hook in somewhat lower level in the dma-direct code so that all
the remapping in dma-direct can be reused instead of duplicated?  That
also becomes important if we want to use non-remapping uncached support,
e.g. on mips or x86, or the direct changing of the attributes that Will
planned to look into for arm64.


Re: [RFC PATCH v3 3/6] swiotlb: Use restricted DMA pool if available

2021-01-13 Thread Christoph Hellwig
> +#ifdef CONFIG_SWIOTLB
> + if (unlikely(swiotlb_force == SWIOTLB_FORCE) || dev->dma_io_tlb_mem)
>   return swiotlb_map(dev, phys, size, dir, attrs);
> +#endif

Please provide a wrapper for the dev->dma_io_tlb_mem check that
always returns false if the per-device swiotlb support is not enabled.
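
Something like this in swiotlb.h would do; the name is just an example:

static inline bool dev_uses_restricted_dma(struct device *dev)
{
#ifdef CONFIG_SWIOTLB
	return dev->dma_io_tlb_mem != NULL;
#else
	return false;
#endif
}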

> index 7fb2ac087d23..1f05af09e61a 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -222,7 +222,6 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
>   mem->orig_addr[i] = INVALID_PHYS_ADDR;
>   }
>   mem->index = 0;
> - no_iotlb_memory = false;

How does this fit in here?



Re: [RFC PATCH v3 2/6] swiotlb: Add restricted DMA pool

2021-01-13 Thread Christoph Hellwig
> +#ifdef CONFIG_SWIOTLB
> + struct io_tlb_mem   *dma_io_tlb_mem;
>  #endif

Please add a new config option for this code instead of always building
it when swiotlb is enabled.

> +static int swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
> +size_t size)

Can you split the refactoring in swiotlb.c into one or more prep
patches?

> +static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
> + struct device *dev)
> +{
> + struct io_tlb_mem *mem = rmem->priv;
> + int ret;
> +
> + if (dev->dma_io_tlb_mem)
> + return -EBUSY;
> +
> + if (!mem) {
> + mem = kzalloc(sizeof(*mem), GFP_KERNEL);
> + if (!mem)
> + return -ENOMEM;

What is the calling convention here that allows for a NULL and non-NULL
private data?


Re: [RFC PATCH v3 2/6] swiotlb: Add restricted DMA pool

2021-01-13 Thread Christoph Hellwig
On Wed, Jan 13, 2021 at 01:29:05PM +0100, Greg KH wrote:
> > > Why does this have to be added here?  Shouldn't the platform-specific
> > > code handle it instead?
> > 
> > The whole code added here is pretty generic.  What we need to eventually
> > do, though is to add a separate dma_device instead of adding more and more
> > bloat to struct device.
> 
> I have no objections for that happening!

I'm pretty sure you agreed to it before in fact.  Now someone just needs
to find the time to do this heavy lifting, where "someone" probably means
me.


Re: [PATCH 1/5] vfio/iommu_type1: Fixes vfio_dma_populate_bitmap to avoid dirty lose

2021-01-13 Thread Kirti Wankhede




On 1/13/2021 2:50 AM, Alex Williamson wrote:

On Thu, 7 Jan 2021 17:28:57 +0800
Keqian Zhu  wrote:


Deferring the check of whether a vfio_dma is fully dirty until
update_user_bitmap makes it easy to lose the dirty log. For example, after
the pinned_scope of vfio_iommu is promoted, a vfio_dma is no longer
considered fully dirty, so we may lose dirty log that was generated before
vfio_iommu was promoted.

The key point is that pinned-dirty is not a real dirty tracking mechanism:
it cannot continuously track dirty pages, it only restricts the dirty scope.
It is essentially the same as fully-dirty; fully-dirty covers the full scope
and pinned-dirty covers the pinned scope.

So we must mark pinned-dirty or fully-dirty right after we start dirty tracking
or clear the dirty bitmap, to ensure that the dirty log is marked right away.


I was initially convinced by these first three patches, but upon
further review, I think the premise is wrong.  AIUI, the concern across
these patches is that our dirty bitmap is only populated with pages
dirtied by pinning and we only take into account the pinned page dirty
scope at the time the bitmap is retrieved by the user.  You suppose
this presents a gap: if a vendor driver has not yet identified itself
with a pinned-page dirty scope, the entire bitmap should be considered
dirty, regardless of whether that driver later pins pages before the
user retrieves the dirty bitmap.

I don't think this is how we intended the cooperation between the iommu
driver and vendor driver to work.  By pinning pages, a vendor driver is
not declaring that its future dirty page scope is limited only to
pinned pages; instead, it is declaring itself a participant in
dirty page tracking and taking responsibility for pinning any necessary
pages.  For example, we might extend VFIO_IOMMU_DIRTY_PAGES_FLAG_START
to trigger a blocking notification to groups to not only begin dirty
tracking, but also to synchronously register their current device DMA
footprint.  This patch would require a vendor driver to possibly perform
a gratuitous page pinning in order to set the scope prior to dirty
logging being enabled, or else the initial bitmap will be fully dirty.

Therefore, I don't see that this series is necessary or correct.  Kirti,
does this match your thinking?



That's correct Alex and I agree with you.


Thinking about these semantics, it seems there might still be an issue
if a group with non-pinned-page dirty scope is detached with dirty
logging enabled.  


Hot-unplugging a device while the migration process has started - is this
scenario supported?


Thanks,
Kirti


It seems this should in fact fully populate the dirty
bitmaps at the time it's removed since we don't know the extent of its
previous DMA, nor will the group be present to trigger the full bitmap
when the user retrieves the dirty bitmap.  Creating fully populated
bitmaps at the time tracking is enabled negates our ability to take
advantage of later enlightenment though.  Thanks,

Alex


Fixes: d6a4c185660c ("vfio iommu: Implementation of ioctl for dirty pages tracking")
Signed-off-by: Keqian Zhu 
---
  drivers/vfio/vfio_iommu_type1.c | 33 ++---
  1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index bceda5e8baaa..b0a26e8e0adf 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -224,7 +224,7 @@ static void vfio_dma_bitmap_free(struct vfio_dma *dma)
dma->bitmap = NULL;
  }
  
-static void vfio_dma_populate_bitmap(struct vfio_dma *dma, size_t pgsize)
+static void vfio_dma_populate_bitmap_pinned(struct vfio_dma *dma, size_t pgsize)
  {
struct rb_node *p;
unsigned long pgshift = __ffs(pgsize);
@@ -236,6 +236,25 @@ static void vfio_dma_populate_bitmap(struct vfio_dma *dma, size_t pgsize)
}
  }
  
+static void vfio_dma_populate_bitmap_full(struct vfio_dma *dma, size_t pgsize)
+{
+   unsigned long pgshift = __ffs(pgsize);
+   unsigned long nbits = dma->size >> pgshift;
+
+   bitmap_set(dma->bitmap, 0, nbits);
+}
+
+static void vfio_dma_populate_bitmap(struct vfio_iommu *iommu,
+struct vfio_dma *dma)
+{
+   size_t pgsize = (size_t)1 << __ffs(iommu->pgsize_bitmap);
+
+   if (iommu->pinned_page_dirty_scope)
+   vfio_dma_populate_bitmap_pinned(dma, pgsize);
+   else if (dma->iommu_mapped)
+   vfio_dma_populate_bitmap_full(dma, pgsize);
+}
+
  static int vfio_dma_bitmap_alloc_all(struct vfio_iommu *iommu)
  {
struct rb_node *n;
@@ -257,7 +276,7 @@ static int vfio_dma_bitmap_alloc_all(struct vfio_iommu *iommu)
}
return ret;
}
-   vfio_dma_populate_bitmap(dma, pgsize);
+   vfio_dma_populate_bitmap(iommu, dma);
}
return 0;
  }
@@ -987,13 +1006,6 @@ static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
unsigned long shift = 

Re: [RFC PATCH v3 2/6] swiotlb: Add restricted DMA pool

2021-01-13 Thread Greg KH
On Wed, Jan 13, 2021 at 12:51:26PM +0100, Christoph Hellwig wrote:
> On Wed, Jan 06, 2021 at 08:50:03AM +0100, Greg KH wrote:
> > > --- a/include/linux/device.h
> > > +++ b/include/linux/device.h
> > > @@ -413,6 +413,7 @@ struct dev_links_info {
> > >   * @dma_pools:   Dma pools (if dma'ble device).
> > >   * @dma_mem: Internal for coherent mem override.
> > >   * @cma_area:Contiguous memory area for dma allocations
> > > + * @dma_io_tlb_mem: Internal for swiotlb io_tlb_mem override.
> > 
> > Why does this have to be added here?  Shouldn't the platform-specific
> > code handle it instead?
> 
> The whole code added here is pretty generic.  What we need to eventually
> do, though is to add a separate dma_device instead of adding more and more
> bloat to struct device.

I have no objections for that happening!


Re: [RFC PATCH v3 2/6] swiotlb: Add restricted DMA pool

2021-01-13 Thread Christoph Hellwig
On Wed, Jan 06, 2021 at 08:50:03AM +0100, Greg KH wrote:
> > --- a/include/linux/device.h
> > +++ b/include/linux/device.h
> > @@ -413,6 +413,7 @@ struct dev_links_info {
> >   * @dma_pools: Dma pools (if dma'ble device).
> >   * @dma_mem:   Internal for coherent mem override.
> >   * @cma_area:  Contiguous memory area for dma allocations
> > + * @dma_io_tlb_mem: Internal for swiotlb io_tlb_mem override.
> 
> Why does this have to be added here?  Shouldn't the platform-specific
> code handle it instead?

The whole code added here is pretty generic.  What we need to eventually
do, though is to add a separate dma_device instead of adding more and more
bloat to struct device.


Re: [RFC PATCH v3 1/6] swiotlb: Add io_tlb_mem struct

2021-01-13 Thread Christoph Hellwig
On Wed, Jan 06, 2021 at 11:41:19AM +0800, Claire Chang wrote:
> Added a new struct, io_tlb_mem, as the IO TLB memory pool descriptor and
> moved relevant global variables into that struct.
> This will be useful later to allow for a restricted DMA pool.

I like where this is going, but a few comments.

Mostly I'd love to be able to entirely hide io_tlb_default_mem
and struct io_tlb_mem inside of swiotlb.c.

> --- a/arch/powerpc/platforms/pseries/svm.c
> +++ b/arch/powerpc/platforms/pseries/svm.c
> @@ -55,8 +55,8 @@ void __init svm_swiotlb_init(void)
>   if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, false))
>   return;
>  
> - if (io_tlb_start)
> - memblock_free_early(io_tlb_start,
> + if (io_tlb_default_mem.start)
> + memblock_free_early(io_tlb_default_mem.start,
>   PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));

I think this should switch to use the local vstart variable in a prep patch.

>   panic("SVM: Cannot allocate SWIOTLB buffer");
>  }
> diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
> index 2b385c1b4a99..4d17dff7ffd2 100644
> --- a/drivers/xen/swiotlb-xen.c
> +++ b/drivers/xen/swiotlb-xen.c
> @@ -192,8 +192,8 @@ int __ref xen_swiotlb_init(int verbose, bool early)
>   /*
>* IO TLB memory already allocated. Just use it.
>*/
> - if (io_tlb_start != 0) {
> - xen_io_tlb_start = phys_to_virt(io_tlb_start);
> + if (io_tlb_default_mem.start != 0) {
> + xen_io_tlb_start = phys_to_virt(io_tlb_default_mem.start);
>   goto end;

xen_io_tlb_start is interesting. It is used only in two functions:

 1) is_xen_swiotlb_buffer, where I think we should be able to just use
is_swiotlb_buffer instead of open coding it with the extra
phys_to_virt/virt_to_phys cycle.
 2) xen_swiotlb_init, where except for the assignment it is only used
locally for the case not touched above, and could thus be replaced
with a local variable.

Konrad, does this make sense to you?

>  static inline bool is_swiotlb_buffer(phys_addr_t paddr)
>  {
> - return paddr >= io_tlb_start && paddr < io_tlb_end;
> + struct io_tlb_mem *mem = &io_tlb_default_mem;
> +
> + return paddr >= mem->start && paddr < mem->end;

We'd then have to move this out of line as well.
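One way that could look (sketch only): keep struct io_tlb_mem and the default
pool private to swiotlb.c and export just the predicate:

/* include/linux/swiotlb.h */
bool is_swiotlb_buffer(phys_addr_t paddr);

/* kernel/dma/swiotlb.c */
static struct io_tlb_mem io_tlb_default_mem;

bool is_swiotlb_buffer(phys_addr_t paddr)
{
	struct io_tlb_mem *mem = &io_tlb_default_mem;

	return paddr >= mem->start && paddr < mem->end;
}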


Re: [PATCH] swiotlb: Validate bounce size in the sync/unmap path

2021-01-13 Thread Christoph Hellwig
On Tue, Jan 12, 2021 at 04:07:29PM +0100, Martin Radev wrote:
> The size of the buffer being bounced is not checked if it happens
> to be larger than the size of the mapped buffer. Because the size
> can be controlled by a device, as is the case with virtio devices,
> this can lead to memory corruption.
> 

I'm really worried about all these hodgepodge hacks for untrusted
hypervisors in the I/O stack.  Instead of trying to harden protocols
that are fundamentally not designed for this, how about coming
up with a new paravirtualized I/O interface that is specifically
designed for use with an untrusted hypervisor from the start?


RE: [PATCH v9 03/10] iommu: Separate IOMMU_DEV_FEAT_IOPF from IOMMU_DEV_FEAT_SVA

2021-01-13 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Wednesday, January 13, 2021 10:50 AM
> 
> Hi Jean,
> 
> On 1/12/21 5:16 PM, Jean-Philippe Brucker wrote:
> > Hi Baolu,
> >
> > On Tue, Jan 12, 2021 at 12:31:23PM +0800, Lu Baolu wrote:
> >> Hi Jean,
> >>
> >> On 1/8/21 10:52 PM, Jean-Philippe Brucker wrote:
> >>> Some devices manage I/O Page Faults (IOPF) themselves instead of relying
> >>> on PCIe PRI or Arm SMMU stall. Allow their drivers to enable SVA without
> >>> mandating IOMMU-managed IOPF. The other device drivers now need to first
> >>> enable IOMMU_DEV_FEAT_IOPF before enabling IOMMU_DEV_FEAT_SVA.
> >>>
> >>> Signed-off-by: Jean-Philippe Brucker 
> >>> ---
> >>> Cc: Arnd Bergmann 
> >>> Cc: David Woodhouse 
> >>> Cc: Greg Kroah-Hartman 
> >>> Cc: Joerg Roedel 
> >>> Cc: Lu Baolu 
> >>> Cc: Will Deacon 
> >>> Cc: Zhangfei Gao 
> >>> Cc: Zhou Wang 
> >>> ---
> >>>include/linux/iommu.h | 20 +---
> >>>1 file changed, 17 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> >>> index 583c734b2e87..701b2eeb0dc5 100644
> >>> --- a/include/linux/iommu.h
> >>> +++ b/include/linux/iommu.h
> >>> @@ -156,10 +156,24 @@ struct iommu_resv_region {
> >>>   enum iommu_resv_typetype;
> >>>};
> >>> -/* Per device IOMMU features */
> >>> +/**
> >>> + * enum iommu_dev_features - Per device IOMMU features
> >>> + * @IOMMU_DEV_FEAT_AUX: Auxiliary domain feature
> >>> + * @IOMMU_DEV_FEAT_SVA: Shared Virtual Addresses
> >>> + * @IOMMU_DEV_FEAT_IOPF: I/O Page Faults such as PRI or Stall. Generally using
> >>> + *   %IOMMU_DEV_FEAT_SVA requires %IOMMU_DEV_FEAT_IOPF, but
> >>> + *   some devices manage I/O Page Faults themselves instead
> >>> + *   of relying on the IOMMU. When supported, this feature
> >>> + *   must be enabled before and disabled after
> >>> + *   %IOMMU_DEV_FEAT_SVA.
> >>
> >> Is this only for SVA? We may see more scenarios of using IOPF. For
> >> example, when passing through devices to user level, the user's pages
> >> could be managed dynamically instead of being allocated and pinned
> >> statically.
> >
> > Hm, isn't that precisely what SVA does?  I don't understand the
> > difference. That said FEAT_IOPF doesn't have to be only for SVA. It could
> > later be used as a prerequisite some another feature. For special cases
> > device drivers can always use the iommu_register_device_fault_handler()
> > API and handle faults themselves.
> 
>  From the perspective of IOMMU, there is a little difference between
> these two. For SVA, the page table is from CPU side, so IOMMU only needs
> to call handle_mm_fault(); For above pass-through case, the page table
> is from IOMMU side, so the device driver (probably VFIO) needs to
> register a fault handler and call iommu_map/unmap() to serve the page
> faults.
> 
> If we think about the nested mode (or dual-stage translation), it's more
> complicated since the kernel (probably VFIO) handles the second level
> page faults, while the first level page faults need to be delivered to
> user-level guest. Obviously, this hasn't been fully implemented in any
> IOMMU driver.
> 
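(A rough sketch of the second case just described - a VFIO-style handler
serving faults by mapping on demand. The vfio_* names and the context
structure are invented for illustration; only
iommu_register_device_fault_handler() and iommu_map() are existing interfaces,
and a real handler would also have to complete the fault with a page
response.)

static int vfio_iopf_handler(struct iommu_fault *fault, void *data)
{
	struct vfio_iopf_ctx *ctx = data;	/* hypothetical per-domain context */
	unsigned long iova = fault->prm.addr & PAGE_MASK;
	phys_addr_t pa;

	/* hypothetical: pin the user page backing this IOVA */
	pa = vfio_pin_user_page(ctx, iova);
	if (!pa)
		return -EFAULT;

	/* serve the fault by installing the mapping in the IOMMU page table */
	return iommu_map(ctx->domain, iova, pa, PAGE_SIZE,
			 IOMMU_READ | IOMMU_WRITE);
}

/* registration through the existing API */
ret = iommu_register_device_fault_handler(dev, vfio_iopf_handler, ctx);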

Thinking more, the confusion might come from the fact that we mixed
hardware capability with software capability. IOMMU_FEAT describes
the hardware capability. When FEAT_IOPF is set, it purely means that whatever
page faults are enabled by the software are routed through the IOMMU.
Nothing more. Then the software (IOMMU drivers) may choose to support
only limited faulting scenarios and then evolve to support more complex
usages gradually. For example, the intel-iommu driver only supports 1st-level
faults (thus SVA) for now, while FEAT_IOPF as a separate feature may give the
impression that 2nd-level faults are also allowed. From this angle, once we
start to separate page faults from SVA, we may also need a way to report
the software capability (e.g. a set of faulting categories) and also extend
iommu_register_device_fault_handler() to allow specifying which
categories are enabled. The example categories could be:

- IOPF_BIND, for page tables which are bound/linked to the IOMMU. 
Apply to bare metal SVA and guest SVA case;
- IOPF_MAP, for page tables which are managed through explicit IOMMU
map interfaces. Apply to removing VFIO pinning restriction;

Both categories can be enabled together in nested translation, with
additional information provided in the fault data to differentiate them.
Using the paging/stage level doesn't make much sense, as that is the IOMMU
driver's internal knowledge; e.g. the VT-d driver plans to use the 1st level
for GPA if there is no nesting and then switch to the 2nd level when nesting
is enabled.
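
Purely as an illustration of that interface direction (none of these names
exist today, and the current iommu_register_device_fault_handler() takes no
category argument), it could look roughly like:

#define IOPF_BIND	BIT(0)	/* page table bound/linked to the IOMMU (SVA, guest SVA) */
#define IOPF_MAP	BIT(1)	/* page table built via explicit iommu_map()/iommu_unmap() */

/* hypothetical extension: the caller states which categories it can serve */
int iommu_register_device_fault_handler_cat(struct device *dev,
					    iommu_dev_fault_handler_t handler,
					    unsigned int categories,
					    void *data);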

Thanks
Kevin