Re: [PATCH] iommu/vt-d: call __dmar_remove_one_dev_info with valid pointer

2020-01-21 Thread Lu Baolu

On 1/22/20 8:34 AM, Jerry Snitselaar wrote:

It is possible for archdata.iommu to be set to
DEFER_DEVICE_DOMAIN_INFO or DUMMY_DEVICE_DOMAIN_INFO so check for
those values before calling __dmar_remove_one_dev_info. Without a
check it can result in a null pointer dereference. This has been seen
while booting a kdump kernel on an HP dl380 gen9.

Cc: Joerg Roedel 
Cc: Lu Baolu 
Cc: David Woodhouse 
Cc: sta...@vger.kernel.org # 5.3+
Cc: linux-ker...@vger.kernel.org
Fixes: ae23bfb68f28 ("iommu/vt-d: Detach domain before using a private one")
Signed-off-by: Jerry Snitselaar 


Acked-by: Lu Baolu 

Best regards,
baolu


Re: [RFC PATCH 3/4] iommu: Preallocate iommu group when probing devices

2020-01-21 Thread Lu Baolu

Hi Robin,

On 1/21/20 8:45 PM, Robin Murphy wrote:

On 19/01/2020 6:29 am, Lu Baolu wrote:

Hi Joerg,

On 1/17/20 6:21 PM, Joerg Roedel wrote:

On Wed, Jan 01, 2020 at 01:26:47PM +0800, Lu Baolu wrote:

This splits iommu group allocation from adding devices. This makes
it possible to determine the default domain type for each group as
all devices belonging to the group have been determined.


I think it's better to keep group allocation as it is and just defer
default domain allocation until after each device is in its group. But take


I tried deferring default domain allocation, but it doesn't seem possible.

The call path of adding devices into their groups:

iommu_probe_device
-> ops->add_device(dev)
    -> (iommu vendor driver) iommu_group_get_for_dev(dev)

After doing this, the vendor driver will get the default domain and
apply dma_ops according to the domain type. If we defer the domain
allocation, it will get a NULL default domain instead, which causes a
panic in the vendor driver.
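
For illustration, a minimal sketch of that flow (iommu_group_get_for_dev(), iommu_get_domain_for_dev() and iommu_group_put() are the real APIs, while vendor_add_device() and set_up_dma_ops() are simplified stand-ins for the vendor driver's code):

    /* Illustrative sketch only -- not the actual driver code. */
    static int vendor_add_device(struct device *dev)
    {
            struct iommu_group *group;
            struct iommu_domain *domain;

            group = iommu_group_get_for_dev(dev);   /* today: also allocates the default domain */
            if (IS_ERR(group))
                    return PTR_ERR(group);

            domain = iommu_get_domain_for_dev(dev);
            /* if default domain allocation were deferred, domain would be NULL here */
            if (domain->type == IOMMU_DOMAIN_DMA)   /* NULL dereference -> panic */
                    set_up_dma_ops(dev);            /* stand-in for the driver's dma_ops setup */

            iommu_group_put(group);
            return 0;
    }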

Any suggestions?


https://lore.kernel.org/linux-iommu/6dbbfc10-3247-744c-ae8d-443a336e0...@linux.intel.com/ 



Haven't we been here before? ;)

Since we can't (safely or reasonably) change a group's default domain 
after ops->add_device() has returned, and in general it gets impractical 
to evaluate "all device in a group" once you look beyond _bus_type 
(or consider hotplug as mentioned), then AFAICS there's no reasonable 
way to get away from the default domain type being defined by the first 
device to attach.


Yes, agreed.

But in practice it's hardly a problem anyway - if 
every device in a given group requests the same domain type then it 
doesn't matter which comes first, and if they don't then we ultimately 
end up with an impossible set of constraints, so are doomed to do the 
'wrong' thing regardless.


The third case is, for example, three devices A, B, C in a group. The
first device A is neutral about which type of default domain is used,
so the iommu framework will use the static default domain. But device B
requires a specific type which is different from that default.
Currently, this is handled in the vendor iommu driver, and one
motivation of this patch set is to handle it in the generic layer.



Thus unless anyone wants to start redefining the whole group concept to 
separate the notions of ID aliasing and peer-to-peer isolation (which 
still wouldn't necessarily help), I think this user override effectively 
boils down to just another flavour of iommu_request_*_for_dev(), and as 
such comes right back to the "just pass the bloody device to 
ops->domain_alloc() and resolve everything up-front" argument.
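
For context, the "pass the device" idea amounts to widening the allocation hook roughly along these lines (a hypothetical signature shown only for illustration; today's hook is domain_alloc(unsigned iommu_domain_type) and takes no device):

    /* Hypothetical device-aware variant of the domain_alloc hook */
    struct iommu_domain *(*domain_alloc)(struct device *dev, unsigned int type);

so the driver could resolve the default domain type for the device (and hence its group) up front.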


Robin.


Best regards,
baolu

Re: [RFC PATCH 0/4] iommu: Per-group default domain type

2020-01-21 Thread Lu Baolu

Hi,

On 1/21/20 6:14 PM, John Garry wrote:

On 21/01/2020 00:43, Lu Baolu wrote:
An IOMMU group represents the smallest set of devices that are considered
to be isolated. All devices belonging to an IOMMU group share a default
domain for DMA APIs. There are two types of default domain:
IOMMU_DOMAIN_DMA and IOMMU_DOMAIN_IDENTITY. The former means IOMMU
translation, while the latter means IOMMU by-pass.

Currently, the default domain type for the IOMMU groups is determined
globally. All IOMMU groups use a single default domain type. The global
default domain type can be adjusted by kernel build configuration or
kernel parameters.
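
(For reference, the existing global controls mentioned here include, for example, the build option and boot parameter below:)

    CONFIG_IOMMU_DEFAULT_PASSTHROUGH=y   # build time: default domains are IOMMU_DOMAIN_IDENTITY
    iommu.passthrough=1                  # boot time: same effect via the kernel command line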

More and more users are looking forward to a fine-grained default domain
type. For example, with the global default domain type set to translation,
OEM vendors or end users might want some trusted, high-speed devices to
bypass the IOMMU for performance gains. On the other hand, with the global
default domain type set to by-pass, some devices with limited system memory
addressing capability might want IOMMU translation to remove the bounce
buffer overhead.


Hi Lu Baolu,

Do you think that it would be a more common use case to want
kernel-managed devices to be passthrough for performance reasons and
some select devices to be in a DMA domain, like those with a limited
addressing capability or whose drivers request huge amounts of memory?


I just think it would be more manageable to set kernel command-line
parameters for this, i.e. for those select few devices which want a DMA domain.




Hi Baolu,



It's just two sides of the same coin. Currently, the iommu subsystem uses
the DMA domain by default; that's why I chose to let the user specify
which devices should use identity domains.



OK, understood.

There was an alternate solution here which would allow per-group type to 
be updated via sysfs:


https://lore.kernel.org/linux-iommu/cover.1566353521.git.sai.praneeth.prak...@intel.com/ 



Yes. My patch set just tries to do this statically during boot time.



Any idea what happened to that?


No idea. Sai might have more information. :-)

Best regards,
baolu


Re: [PATCH v5 5/7] iommu/vt-d: Remove VMD child device sanity check

2020-01-21 Thread Lu Baolu

On 1/21/20 9:37 PM, Jon Derrick wrote:

This removes the sanity check required for VMD child devices. The new
pci_real_dma_dev() DMA alias mechanism places them in the same IOMMU
group as the VMD endpoint. Assignment of the group would require
assigning the VMD endpoint, where unbinding the VMD endpoint removes the
child device domain from the hierarchy.

Signed-off-by: Jon Derrick


Acked-by: Lu Baolu 

Thanks,
baolu


Re: [PATCH v5 4/7] iommu/vt-d: Use pci_real_dma_dev() for mapping

2020-01-21 Thread Lu Baolu

On 1/21/20 9:37 PM, Jon Derrick wrote:

The PCI device may have a DMA requester on another bus, such as VMD
subdevices needing to use the VMD endpoint. This case requires the real
DMA device when mapping to IOMMU.

Signed-off-by: Jon Derrick


Acked-by: Lu Baolu 

Thanks,
baolu


Re: [PATCH v2] swiotlb: Adjust SWIOTBL bounce buffer size for SEV guests.

2020-01-21 Thread Ashish Kalra
On Thu, Dec 19, 2019 at 08:52:45PM -0500, Konrad Rzeszutek Wilk wrote:
> On Mon, Dec 09, 2019 at 11:13:46PM +, Ashish Kalra wrote:
> > From: Ashish Kalra 
> > 
> > For SEV, all DMA to and from guest has to use shared
> > (un-encrypted) pages. SEV uses SWIOTLB to make this happen
> > without requiring changes to device drivers. However,
> > depending on workload being run, the default 64MB of SWIOTLB
> > might not be enough and SWIOTLB may run out of buffers to
> > use for DMA, resulting in I/O errors.
> > 
> > Increase the default size of SWIOTLB for SEV guests using
> > a minimum value of 128MB and a maximum value of 512MB,
> > depending on the amount of provisioned guest memory.
> > 
> > The SWIOTLB default size adjustment is added as an
> > architecture specific interface/callback to allow
> > architectures such as those supporting memory encryption
> > to adjust/expand SWIOTLB size for their use.
> 
> What if this was made dynamic? That is if there is a memory
> pressure you end up expanding the SWIOTLB dynamically?

As of now we want to keep it as simple as possible and more
like a stop-gap arrangement till something more elegant is
available.

> 
>> Also is it worth doing this calculation based on memory or
>> more on the # of PCI devices + their MMIO ranges size?

Additional memory calculations based on the # of PCI devices and
their memory ranges would make this more complicated, with many
more permutations and combinations to explore. It is essential to
keep this patch as simple as possible by deriving the bounce buffer
size solely from the amount of provisioned guest memory.
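
As a rough illustration of the sizing policy described in the patch (a sketch, not the posted code: the function name and the scaling ratio are assumptions, only the 128MB..512MB window comes from the commit message):

    /* Sketch: scale the SEV guest's SWIOTLB with provisioned memory,
     * clamped to the 128MB..512MB window described above. */
    static unsigned long __init sev_swiotlb_size(unsigned long guest_mem_bytes)
    {
            unsigned long size = guest_mem_bytes >> 6;      /* ~1/64 of guest memory -- assumption */

            return clamp(size, 128UL << 20, 512UL << 20);
    }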

Thanks,
Ashish


[PATCH] iommu/vt-d: call __dmar_remove_one_dev_info with valid pointer

2020-01-21 Thread Jerry Snitselaar
It is possible for archdata.iommu to be set to
DEFER_DEVICE_DOMAIN_INFO or DUMMY_DEVICE_DOMAIN_INFO so check for
those values before calling __dmar_remove_one_dev_info. Without a
check it can result in a null pointer dereference. This has been seen
while booting a kdump kernel on an HP dl380 gen9.

Cc: Joerg Roedel 
Cc: Lu Baolu 
Cc: David Woodhouse 
Cc: sta...@vger.kernel.org # 5.3+
Cc: linux-ker...@vger.kernel.org
Fixes: ae23bfb68f28 ("iommu/vt-d: Detach domain before using a private one")
Signed-off-by: Jerry Snitselaar 
---
 drivers/iommu/intel-iommu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 1801f0aaf013..932267f49f9a 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5163,7 +5163,8 @@ static void dmar_remove_one_dev_info(struct device *dev)
 
 	spin_lock_irqsave(&device_domain_lock, flags);
info = dev->archdata.iommu;
-   if (info)
+   if (info && info != DEFER_DEVICE_DOMAIN_INFO
+   && info != DUMMY_DEVICE_DOMAIN_INFO)
__dmar_remove_one_dev_info(info);
 	spin_unlock_irqrestore(&device_domain_lock, flags);
 }
-- 
2.24.0



Re: [PATCH v2] swiotlb: Adjust SWIOTBL bounce buffer size for SEV guests.

2020-01-21 Thread Konrad Rzeszutek Wilk
On Tue, Jan 21, 2020 at 08:09:47PM +, Ashish Kalra wrote:
> On Thu, Dec 19, 2019 at 08:52:45PM -0500, Konrad Rzeszutek Wilk wrote:
> > On Mon, Dec 09, 2019 at 11:13:46PM +, Ashish Kalra wrote:
> > > From: Ashish Kalra 
> > > 
> > > For SEV, all DMA to and from guest has to use shared
> > > (un-encrypted) pages. SEV uses SWIOTLB to make this happen
> > > without requiring changes to device drivers. However,
> > > depending on workload being run, the default 64MB of SWIOTLB
> > > might not be enough and SWIOTLB may run out of buffers to
> > > use for DMA, resulting in I/O errors.
> > > 
> > > Increase the default size of SWIOTLB for SEV guests using
> > > a minimum value of 128MB and a maximum value of 512MB,
> > > depending on the amount of provisioned guest memory.
> > > 
> > > The SWIOTLB default size adjustment is added as an
> > > architecture specific interface/callback to allow
> > > architectures such as those supporting memory encryption
> > > to adjust/expand SWIOTLB size for their use.
> > 
> > What if this was made dynamic? That is if there is a memory
> > pressure you end up expanding the SWIOTLB dynamically?
> 
> As of now we want to keep it as simple as possible and more
> like a stop-gap arrangement till something more elegant is
> available.

That is nice. But past experience has shown that stop-gap arrangements
end up being the de facto solution.

> 
> > 
> >> Also is it worth doing this calculation based on memory or
> >> more on the # of PCI devices + their MMIO ranges size?
> 
> Additional memory calculations based on the # of PCI devices and
> their memory ranges would make this more complicated, with many
> more permutations and combinations to explore. It is essential to
> keep this patch as simple as possible by deriving the bounce buffer
> size solely from the amount of provisioned guest memory.

Please rework the patch to:

 - Use a log solution instead of the multiplication.
   Feel free to cap it at a sensible value. (A sketch of this is shown
   after this list.)

 - Also the code depends on SWIOTLB calling in to the
   adjust_swiotlb_default_size which looks wrong.

   You should not adjust io_tlb_nslabs from swiotlb_size_or_default.
   That function's purpose is to report a value.

 - Make io_tlb_nslabs be visible outside of the SWIOTLB code.

 - Can you utilize the IOMMU_INIT APIs and have your own detect routine which
   would modify the io_tlb_nslabs (and set swiotlb=1?).

   Actually you seem to be piggybacking on pci_swiotlb_detect_4gb - so
   perhaps add it in this code? Albeit it really should be in its own
   file, not in arch/x86/kernel/pci-swiotlb.c

 - Tweak the code in the swiotlb code to make sure it can deal
   with io_tlb_nslabs being modified outside of the code at
   the start. It should have no trouble, but only testing will
   tell for sure.
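
A sketch of the log-based scaling asked for in the first item above (illustrative only: the 1GB baseline and the 8x cap are assumptions; IO_TLB_DEFAULT_SIZE and IO_TLB_SHIFT are the existing swiotlb definitions):

    #include <linux/log2.h>
    #include <linux/swiotlb.h>

    /* Scale the number of SWIOTLB slabs with guest memory on a log curve
     * instead of a linear multiplication, capped at 8x the 64MB default. */
    static unsigned long __init sev_swiotlb_nslabs(unsigned long guest_mem_bytes)
    {
            unsigned long gb = guest_mem_bytes >> 30;
            unsigned int shift = gb ? ilog2(gb) : 0;        /* 0 for guests below 1GB */
            unsigned long nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;

            return nslabs << min(shift, 3U);                /* cap at a sensible value */
    }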

> 
> Thanks,
> Ashish


[PATCH v5 7/7] x86/PCI: Remove X86_DEV_DMA_OPS

2020-01-21 Thread Jon Derrick
From: Christoph Hellwig 

There are no users of X86_DEV_DMA_OPS left, so remove the code.

Reviewed-by: Jon Derrick 
Signed-off-by: Christoph Hellwig 
---
 arch/x86/Kconfig  |  3 ---
 arch/x86/include/asm/device.h | 10 --
 arch/x86/pci/common.c | 38 --
 3 files changed, 51 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5e89499..77f9426 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2955,9 +2955,6 @@ config HAVE_ATOMIC_IOMAP
def_bool y
depends on X86_32
 
-config X86_DEV_DMA_OPS
-   bool
-
 source "drivers/firmware/Kconfig"
 
 source "arch/x86/kvm/Kconfig"
diff --git a/arch/x86/include/asm/device.h b/arch/x86/include/asm/device.h
index 5e12c63..7e31f7f 100644
--- a/arch/x86/include/asm/device.h
+++ b/arch/x86/include/asm/device.h
@@ -8,16 +8,6 @@ struct dev_archdata {
 #endif
 };
 
-#if defined(CONFIG_X86_DEV_DMA_OPS) && defined(CONFIG_PCI_DOMAINS)
-struct dma_domain {
-   struct list_head node;
-   const struct dma_map_ops *dma_ops;
-   int domain_nr;
-};
-void add_dma_domain(struct dma_domain *domain);
-void del_dma_domain(struct dma_domain *domain);
-#endif
-
 struct pdev_archdata {
 };
 
diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index fe21a5c..df1d959 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -625,43 +625,6 @@ unsigned int pcibios_assign_all_busses(void)
return (pci_probe & PCI_ASSIGN_ALL_BUSSES) ? 1 : 0;
 }
 
-#if defined(CONFIG_X86_DEV_DMA_OPS) && defined(CONFIG_PCI_DOMAINS)
-static LIST_HEAD(dma_domain_list);
-static DEFINE_SPINLOCK(dma_domain_list_lock);
-
-void add_dma_domain(struct dma_domain *domain)
-{
-   spin_lock(&dma_domain_list_lock);
-   list_add(&domain->node, &dma_domain_list);
-   spin_unlock(&dma_domain_list_lock);
-}
-EXPORT_SYMBOL_GPL(add_dma_domain);
-
-void del_dma_domain(struct dma_domain *domain)
-{
-   spin_lock(&dma_domain_list_lock);
-   list_del(&domain->node);
-   spin_unlock(&dma_domain_list_lock);
-}
-EXPORT_SYMBOL_GPL(del_dma_domain);
-
-static void set_dma_domain_ops(struct pci_dev *pdev)
-{
-   struct dma_domain *domain;
-
-   spin_lock(&dma_domain_list_lock);
-   list_for_each_entry(domain, &dma_domain_list, node) {
-   if (pci_domain_nr(pdev->bus) == domain->domain_nr) {
-   pdev->dev.dma_ops = domain->dma_ops;
-   break;
-   }
-   }
-   spin_unlock(&dma_domain_list_lock);
-}
-#else
-static void set_dma_domain_ops(struct pci_dev *pdev) {}
-#endif
-
 static void set_dev_domain_options(struct pci_dev *pdev)
 {
if (is_vmd(pdev->bus))
@@ -697,7 +660,6 @@ int pcibios_add_device(struct pci_dev *dev)
pa_data = data->next;
memunmap(data);
}
-   set_dma_domain_ops(dev);
set_dev_domain_options(dev);
return 0;
 }
-- 
1.8.3.1



[PATCH v5 0/7] Clean up VMD DMA Map Ops

2020-01-21 Thread Jon Derrick
v4 Set: 
https://lore.kernel.org/linux-pci/20200120110220.gb17...@e121166-lin.cambridge.arm.com/T/#t
v3 Set: 
https://lore.kernel.org/linux-iommu/20200113181742.ga27...@e121166-lin.cambridge.arm.com/T/#t
v2 Set: 
https://lore.kernel.org/linux-iommu/1578580256-3483-1-git-send-email-jonathan.derr...@intel.com/T/#t
v1 Set: 
https://lore.kernel.org/linux-iommu/20200107134125.gd30...@8bytes.org/T/#t

VMD currently works with VT-d enabled by pointing DMA and IOMMU actions at the
VMD endpoint. The problem with this approach is that the VMD endpoint's
device-specific attributes, such as the DMA Mask Bits, are used instead of the
child device's attributes.

This set cleans up VMD by removing the override that redirects DMA map
operations to the VMD endpoint. Instead it introduces a new DMA alias mechanism
into the existing DMA alias infrastructure. This new DMA alias mechanism allows
an architecture-specific pci_real_dma_dev() function to provide a pointer from
a pci_dev to its PCI DMA device, where by default it returns the original
pci_dev.

In addition, this set removes the sanity check that was added to prevent
assigning VMD child devices. By using the DMA alias mechanism, all child
devices are assigned the same IOMMU group as the VMD endpoint. This removes the
need for restricting VMD child devices from assignment, as the whole group
would have to be assigned, requiring unbinding the VMD driver and removing the
child device domain.

v1 added a pointer in struct pci_dev that pointed to the DMA alias' struct
pci_dev and did the necessary DMA alias and IOMMU modifications.

v2 introduced a new weak function to reference the 'Direct DMA Alias', and
removed the need to add a pointer in struct device or pci_dev. Weak functions
are generally frowned upon when it's a single architecture implementation, so I
am open to alternatives.

v3 referenced the pci_dev rather than the struct device for the PCI
'Direct DMA Alias' (pci_direct_dma_alias()). This revision also allowed
pci_for_each_dma_alias() to call any DMA aliases for the Direct DMA alias
device, though I don't expect the VMD endpoint to need intra-bus DMA aliases.

v4 changes the 'Direct DMA Alias' to instead refer to the 'Real DMA Dev', which
either returns the PCI device itself or the PCI DMA device.

v5 Fixes a bad call argument to pci_real_dma_dev that would have broken
bisection. This revision also changes one of the calls to a one-liner, and
assembles the same on my system.


Changes from v4:
Fix pci_real_dma_dev() call in 4/7.
Change other pci_real_dma_dev() call in 4/7 to one-liner.

Changes from v3:
Uses pci_real_dma_dev() instead of pci_direct_dma_alias()
Split IOMMU enabling, IOMMU VMD sanity check and VMD dma_map_ops cleanup into 
three patches

Changes from v2:
Uses struct pci_dev for PCI Device 'Direct DMA aliasing' (pci_direct_dma_alias)
Allows pci_for_each_dma_alias to iterate over the alias mask of the 'Direct DMA 
alias'

Changes from v1:
Removed 1/5 & 2/5 misc fix patches that were merged
Uses Christoph's staging/cleanup patches
Introduce weak function rather than including pointer in struct device or 
pci_dev.

Based on Bjorn's next:
https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/log/?h=next

Christoph Hellwig (2):
  x86/PCI: Add a to_pci_sysdata helper
  x86/PCI: Remove X86_DEV_DMA_OPS

Jon Derrick (5):
  x86/PCI: Expose VMD's PCI Device in pci_sysdata
  PCI: Introduce pci_real_dma_dev()
  iommu/vt-d: Use pci_real_dma_dev() for mapping
  iommu/vt-d: Remove VMD child device sanity check
  PCI: vmd: Stop overriding dma_map_ops

 arch/x86/Kconfig   |   3 -
 arch/x86/include/asm/device.h  |  10 ---
 arch/x86/include/asm/pci.h |  31 -
 arch/x86/pci/common.c  |  48 +++--
 drivers/iommu/intel-iommu.c|  11 ++-
 drivers/pci/controller/Kconfig |   1 -
 drivers/pci/controller/vmd.c   | 152 +
 drivers/pci/pci.c  |  19 +-
 drivers/pci/search.c   |   6 ++
 include/linux/pci.h|   1 +
 10 files changed, 54 insertions(+), 228 deletions(-)

-- 
1.8.3.1



[PATCH v5 5/7] iommu/vt-d: Remove VMD child device sanity check

2020-01-21 Thread Jon Derrick
This removes the sanity check required for VMD child devices. The new
pci_real_dma_dev() DMA alias mechanism places them in the same IOMMU
group as the VMD endpoint. Assignment of the group would require
assigning the VMD endpoint, where unbinding the VMD endpoint removes the
child device domain from the hierarchy.

Signed-off-by: Jon Derrick 
---
 drivers/iommu/intel-iommu.c | 10 +-
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 72f26e8..7e2c492f 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -774,15 +774,7 @@ static struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8 *devfn)
if (dev_is_pci(dev)) {
struct pci_dev *pf_pdev;
 
-   pdev = to_pci_dev(dev);
-
-#ifdef CONFIG_X86
-   /* VMD child devices currently cannot be handled individually */
-   if (is_vmd(pdev->bus))
-   return NULL;
-#endif
-
-   pdev = pci_real_dma_dev(pdev);
+   pdev = pci_real_dma_dev(to_pci_dev(dev));
 
/* VFs aren't listed in scope tables; we need to look up
 * the PF instead to find the IOMMU. */
-- 
1.8.3.1



[PATCH v5 2/7] x86/PCI: Expose VMD's PCI Device in pci_sysdata

2020-01-21 Thread Jon Derrick
To be used by Intel-IOMMU code to find the correct domain.

CC: Christoph Hellwig 
Signed-off-by: Jon Derrick 
---
 arch/x86/include/asm/pci.h   | 4 ++--
 drivers/pci/controller/vmd.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
index a4e09db60..6512c54 100644
--- a/arch/x86/include/asm/pci.h
+++ b/arch/x86/include/asm/pci.h
@@ -25,7 +25,7 @@ struct pci_sysdata {
void*fwnode;/* IRQ domain for MSI assignment */
 #endif
 #if IS_ENABLED(CONFIG_VMD)
-   bool vmd_domain;/* True if in Intel VMD domain */
+   struct pci_dev  *vmd_dev;   /* VMD Device if in Intel VMD domain */
 #endif
 };
 
@@ -64,7 +64,7 @@ static inline void *_pci_root_bus_fwnode(struct pci_bus *bus)
 #if IS_ENABLED(CONFIG_VMD)
 static inline bool is_vmd(struct pci_bus *bus)
 {
-   return to_pci_sysdata(bus)->vmd_domain;
+   return to_pci_sysdata(bus)->vmd_dev != NULL;
 }
 #else
 #define is_vmd(bus)	false
diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index 2128422..d67ad56 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -679,7 +679,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned 
long features)
.parent = res,
};
 
-   sd->vmd_domain = true;
+   sd->vmd_dev = vmd->dev;
sd->domain = vmd_find_free_domain();
if (sd->domain < 0)
return sd->domain;
-- 
1.8.3.1



[PATCH v5 3/7] PCI: Introduce pci_real_dma_dev()

2020-01-21 Thread Jon Derrick
The current DMA alias implementation requires the aliased device be on
the same PCI bus as the requester ID. This introduces an arch-specific
mechanism to point to another PCI device when doing mapping and
PCI DMA alias search. The default case returns the actual device.

CC: Christoph Hellwig 
Signed-off-by: Jon Derrick 
---
 arch/x86/pci/common.c | 10 ++
 drivers/pci/pci.c | 19 ++-
 drivers/pci/search.c  |  6 ++
 include/linux/pci.h   |  1 +
 4 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index 1e59df0..fe21a5c 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -736,3 +736,13 @@ int pci_ext_cfg_avail(void)
else
return 0;
 }
+
+#if IS_ENABLED(CONFIG_VMD)
+struct pci_dev *pci_real_dma_dev(struct pci_dev *dev)
+{
+   if (is_vmd(dev->bus))
+   return to_pci_sysdata(dev->bus)->vmd_dev;
+
+   return dev;
+}
+#endif
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 581b177..36d24f2 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -6048,7 +6048,9 @@ bool pci_devs_are_dma_aliases(struct pci_dev *dev1, 
struct pci_dev *dev2)
return (dev1->dma_alias_mask &&
test_bit(dev2->devfn, dev1->dma_alias_mask)) ||
   (dev2->dma_alias_mask &&
-   test_bit(dev1->devfn, dev2->dma_alias_mask));
+   test_bit(dev1->devfn, dev2->dma_alias_mask)) ||
+  pci_real_dma_dev(dev1) == dev2 ||
+  pci_real_dma_dev(dev2) == dev1;
 }
 
 bool pci_device_is_present(struct pci_dev *pdev)
@@ -6072,6 +6074,21 @@ void pci_ignore_hotplug(struct pci_dev *dev)
 }
 EXPORT_SYMBOL_GPL(pci_ignore_hotplug);
 
+/**
+ * pci_real_dma_dev - Get PCI DMA device for PCI device
+ * @dev: the PCI device that may have a PCI DMA alias
+ *
+ * Permits the platform to provide architecture-specific functionality to
+ * devices needing to alias DMA to another PCI device on another PCI bus. If
+ * the PCI device is on the same bus, it is recommended to use
+ * pci_add_dma_alias(). This is the default implementation. Architecture
+ * implementations can override this.
+ */
+struct pci_dev __weak *pci_real_dma_dev(struct pci_dev *dev)
+{
+   return dev;
+}
+
 resource_size_t __weak pcibios_default_alignment(void)
 {
return 0;
diff --git a/drivers/pci/search.c b/drivers/pci/search.c
index e4dbdef..2061672 100644
--- a/drivers/pci/search.c
+++ b/drivers/pci/search.c
@@ -32,6 +32,12 @@ int pci_for_each_dma_alias(struct pci_dev *pdev,
struct pci_bus *bus;
int ret;
 
+   /*
+* The device may have an explicit alias requester ID for DMA where the
+* requester is on another PCI bus.
+*/
+   pdev = pci_real_dma_dev(pdev);
+
ret = fn(pdev, pci_dev_id(pdev), data);
if (ret)
return ret;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 930fab2..3840a54 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1202,6 +1202,7 @@ u32 pcie_bandwidth_available(struct pci_dev *dev, struct 
pci_dev **limiting_dev,
 int pci_select_bars(struct pci_dev *dev, unsigned long flags);
 bool pci_device_is_present(struct pci_dev *pdev);
 void pci_ignore_hotplug(struct pci_dev *dev);
+struct pci_dev *pci_real_dma_dev(struct pci_dev *dev);
 
 int __printf(6, 7) pci_request_irq(struct pci_dev *dev, unsigned int nr,
irq_handler_t handler, irq_handler_t thread_fn, void *dev_id,
-- 
1.8.3.1



[PATCH v5 6/7] PCI: vmd: Stop overriding dma_map_ops

2020-01-21 Thread Jon Derrick
Devices on the VMD domain use the VMD endpoint's requester ID and have
been relying on the VMD endpoint's DMA operations. The problem with this
was that VMD domain devices would use the VMD endpoint's attributes when
doing DMA and IOMMU mapping. We can be smarter about this by only using
the VMD endpoint when mapping and providing the correct child device's
attributes during DMA operations.

This patch removes the dma_map_ops redirect.

Signed-off-by: Jon Derrick 
---
 drivers/pci/controller/Kconfig |   1 -
 drivers/pci/controller/vmd.c   | 150 -
 2 files changed, 151 deletions(-)

diff --git a/drivers/pci/controller/Kconfig b/drivers/pci/controller/Kconfig
index 918e283..20bf00f 100644
--- a/drivers/pci/controller/Kconfig
+++ b/drivers/pci/controller/Kconfig
@@ -239,7 +239,6 @@ config PCIE_TANGO_SMP8759
 
 config VMD
depends on PCI_MSI && X86_64 && SRCU
-   select X86_DEV_DMA_OPS
tristate "Intel Volume Management Device Driver"
---help---
  Adds support for the Intel Volume Management Device (VMD). VMD is a
diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index d67ad56..fe1acb0 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -98,9 +98,6 @@ struct vmd_dev {
struct irq_domain   *irq_domain;
struct pci_bus  *bus;
u8  busn_start;
-
-   struct dma_map_ops  dma_ops;
-   struct dma_domain   dma_domain;
 };
 
 static inline struct vmd_dev *vmd_from_bus(struct pci_bus *bus)
@@ -295,151 +292,6 @@ static void vmd_set_desc(msi_alloc_info_t *arg, struct 
msi_desc *desc)
.chip   = &vmd_msi_controller,
 };
 
-/*
- * VMD replaces the requester ID with its own.  DMA mappings for devices in a
- * VMD domain need to be mapped for the VMD, not the device requiring
- * the mapping.
- */
-static struct device *to_vmd_dev(struct device *dev)
-{
-   struct pci_dev *pdev = to_pci_dev(dev);
-   struct vmd_dev *vmd = vmd_from_bus(pdev->bus);
-
-   return &vmd->dev->dev;
-}
-
-static void *vmd_alloc(struct device *dev, size_t size, dma_addr_t *addr,
-  gfp_t flag, unsigned long attrs)
-{
-   return dma_alloc_attrs(to_vmd_dev(dev), size, addr, flag, attrs);
-}
-
-static void vmd_free(struct device *dev, size_t size, void *vaddr,
-dma_addr_t addr, unsigned long attrs)
-{
-   return dma_free_attrs(to_vmd_dev(dev), size, vaddr, addr, attrs);
-}
-
-static int vmd_mmap(struct device *dev, struct vm_area_struct *vma,
-   void *cpu_addr, dma_addr_t addr, size_t size,
-   unsigned long attrs)
-{
-   return dma_mmap_attrs(to_vmd_dev(dev), vma, cpu_addr, addr, size,
-   attrs);
-}
-
-static int vmd_get_sgtable(struct device *dev, struct sg_table *sgt,
-  void *cpu_addr, dma_addr_t addr, size_t size,
-  unsigned long attrs)
-{
-   return dma_get_sgtable_attrs(to_vmd_dev(dev), sgt, cpu_addr, addr, size,
-   attrs);
-}
-
-static dma_addr_t vmd_map_page(struct device *dev, struct page *page,
-  unsigned long offset, size_t size,
-  enum dma_data_direction dir,
-  unsigned long attrs)
-{
-   return dma_map_page_attrs(to_vmd_dev(dev), page, offset, size, dir,
-   attrs);
-}
-
-static void vmd_unmap_page(struct device *dev, dma_addr_t addr, size_t size,
-  enum dma_data_direction dir, unsigned long attrs)
-{
-   dma_unmap_page_attrs(to_vmd_dev(dev), addr, size, dir, attrs);
-}
-
-static int vmd_map_sg(struct device *dev, struct scatterlist *sg, int nents,
- enum dma_data_direction dir, unsigned long attrs)
-{
-   return dma_map_sg_attrs(to_vmd_dev(dev), sg, nents, dir, attrs);
-}
-
-static void vmd_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
-enum dma_data_direction dir, unsigned long attrs)
-{
-   dma_unmap_sg_attrs(to_vmd_dev(dev), sg, nents, dir, attrs);
-}
-
-static void vmd_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
-   size_t size, enum dma_data_direction dir)
-{
-   dma_sync_single_for_cpu(to_vmd_dev(dev), addr, size, dir);
-}
-
-static void vmd_sync_single_for_device(struct device *dev, dma_addr_t addr,
-  size_t size, enum dma_data_direction dir)
-{
-   dma_sync_single_for_device(to_vmd_dev(dev), addr, size, dir);
-}
-
-static void vmd_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir)
-{
-   dma_sync_sg_for_cpu(to_vmd_dev(dev), sg, nents, dir);
-}
-
-static void vmd_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
-  int nents, 

[PATCH v5 4/7] iommu/vt-d: Use pci_real_dma_dev() for mapping

2020-01-21 Thread Jon Derrick
The PCI device may have a DMA requester on another bus, such as VMD
subdevices needing to use the VMD endpoint. This case requires the real
DMA device when mapping to IOMMU.

Signed-off-by: Jon Derrick 
---
 drivers/iommu/intel-iommu.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 0c8d81f..72f26e8 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -782,6 +782,8 @@ static struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8 *devfn)
return NULL;
 #endif
 
+   pdev = pci_real_dma_dev(pdev);
+
/* VFs aren't listed in scope tables; we need to look up
 * the PF instead to find the IOMMU. */
pf_pdev = pci_physfn(pdev);
@@ -2428,6 +2430,9 @@ static struct dmar_domain *find_domain(struct device *dev)
 dev->archdata.iommu == DUMMY_DEVICE_DOMAIN_INFO))
return NULL;
 
+   if (dev_is_pci(dev))
+   dev = &pci_real_dma_dev(to_pci_dev(dev))->dev;
+
/* No lock here, assumes no domain exit in normal case */
info = dev->archdata.iommu;
if (likely(info))
-- 
1.8.3.1



[PATCH v5 1/7] x86/PCI: Add a to_pci_sysdata helper

2020-01-21 Thread Jon Derrick
From: Christoph Hellwig 

Various helpers need the pci_sysdata just to dereference a single field
in it.  Add a little helper that returns the properly typed sysdata
pointer to require a little less boilerplate code.

Signed-off-by: Christoph Hellwig 
[jonathan.derrick: to_pci_sysdata const argument]
Signed-off-by: Jon Derrick 
---
 arch/x86/include/asm/pci.h | 29 +
 1 file changed, 13 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
index 90d0731..a4e09db60 100644
--- a/arch/x86/include/asm/pci.h
+++ b/arch/x86/include/asm/pci.h
@@ -35,12 +35,15 @@ struct pci_sysdata {
 
 #ifdef CONFIG_PCI
 
+static inline struct pci_sysdata *to_pci_sysdata(const struct pci_bus *bus)
+{
+   return bus->sysdata;
+}
+
 #ifdef CONFIG_PCI_DOMAINS
 static inline int pci_domain_nr(struct pci_bus *bus)
 {
-   struct pci_sysdata *sd = bus->sysdata;
-
-   return sd->domain;
+   return to_pci_sysdata(bus)->domain;
 }
 
 static inline int pci_proc_domain(struct pci_bus *bus)
@@ -52,24 +55,20 @@ static inline int pci_proc_domain(struct pci_bus *bus)
 #ifdef CONFIG_PCI_MSI_IRQ_DOMAIN
 static inline void *_pci_root_bus_fwnode(struct pci_bus *bus)
 {
-   struct pci_sysdata *sd = bus->sysdata;
-
-   return sd->fwnode;
+   return to_pci_sysdata(bus)->fwnode;
 }
 
 #define pci_root_bus_fwnode	_pci_root_bus_fwnode
 #endif
 
+#if IS_ENABLED(CONFIG_VMD)
 static inline bool is_vmd(struct pci_bus *bus)
 {
-#if IS_ENABLED(CONFIG_VMD)
-   struct pci_sysdata *sd = bus->sysdata;
-
-   return sd->vmd_domain;
-#else
-   return false;
-#endif
+   return to_pci_sysdata(bus)->vmd_domain;
 }
+#else
+#define is_vmd(bus)	false
+#endif /* CONFIG_VMD */
 
 /* Can be used to override the logic in pci_scan_bus for skipping
already-configured bus numbers - to be used for buggy BIOSes
@@ -124,9 +123,7 @@ static inline void early_quirks(void) { }
 /* Returns the node based on pci bus */
 static inline int __pcibus_to_node(const struct pci_bus *bus)
 {
-   const struct pci_sysdata *sd = bus->sysdata;
-
-   return sd->node;
+   return to_pci_sysdata(bus)->node;
 }
 
 static inline const struct cpumask *
-- 
1.8.3.1



Re: [PATCH v3 2/5] iommu/arm-smmu: Add support for split pagetables

2020-01-21 Thread Robin Murphy

On 21/01/2020 5:11 pm, Jordan Crouse wrote:
[...]

I'm looking at iommu_aux_attach_device() and friends, and it appears pretty
achievable to hook that up in a workable manner, even if it's just routed
straight through to the impl to only work within qcom-specific parameters to
begin with. I figure the first aux_attach_dev sanity-checks that the main
domain is using TTBR1 with a compatible split, sets TTBR0 and updates the
merged TCR value at that point. For subsequent calls it shouldn't need to do
much more than sanity-check that a new aux domain has the same parameters as
the existing one(s) (and again, such checks could potentially even start out
as just "this is OK by construction" comments). I guess we'd probably want a
count of the number of 'live' aux domains so we can simply disable TTBR0 on
the final aux_detach_dev without having to keep detailed track of whatever
the GPU has actually context switched in the hardware. Can you see any holes
in that idea?


Let me repeat this back just to be sure we're on the same page. When the quirk
is enabled on the primary domain, we'll set up TTBR1 and leave TTBR0 disabled.
Then, when the first aux domain is attached we will set up that io_pgtable
to enable TTBR0 and then let the GPU do what the GPU does until the last aux is
detached and we can switch off TTBR0 again.

I like this. I'll have to do a bit more exploration because the original aux
design assumed that we didn't need to touch the hardware and I'm not sure if
there are any resource contention issues between the primary domain and the aux
domain. Luckily, these should be solvable if they exist (and the original design
didn't take into account the TLB flush problem so this was likely something we
had to do anyway).


Yeah, sounds like you've got it (somehow I'd completely forgotten that 
you'd already prototyped the aux domain part, and I only re-read the 
cover letter after sending that review...). TBH it's not massively 
different, just being a bit more honest about the intermediate hardware 
state. As long as we can rely on all aux domains being equivalent and 
the GPU never writing nonsense to TTBR0, then all arm-smmu really wants 
to care about is whether there's *something* live or not at any given 
time, so attach (with quirk) does:


TTBR1 = primary_domain->ttbr
TCR = primary_domain->tcr | EPD0

then attach_aux comes along and adds:

TTBR0 = aux_domain->ttbr
TCR = primary_domain->tcr | aux_domain->tcr

such that arm-smmu can be happy that TTBR0 is always pointing at *some* 
valid pagetable from that point on regardless of what subsequently 
happens underneath, and nobody need touch TCR until the party's 
completely over.
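
Putting that into a compact sketch (dummy types and field names standing in for the arm-smmu context-bank state, not the driver's actual structures; TCR_EPD0 is the TCR bit that disables TTBR0 walks, and all aux domains are assumed equivalent as discussed):

    #include <linux/types.h>

    struct cb  { u64 ttbr0, ttbr1, tcr; };  /* context bank shadow */
    struct dom { u64 ttbr, tcr; };
    #define TCR_EPD0 (1ULL << 7)

    static void attach_primary(struct cb *cb, struct dom *primary)
    {
            cb->ttbr1 = primary->ttbr;
            cb->tcr   = primary->tcr | TCR_EPD0;            /* TTBR0 walks disabled */
    }

    static void attach_aux(struct cb *cb, struct dom *primary, struct dom *aux, int *live)
    {
            if ((*live)++ == 0)
                    cb->ttbr0 = aux->ttbr;
            cb->tcr = primary->tcr | aux->tcr;              /* EPD0 now clear */
    }

    static void detach_aux(struct cb *cb, struct dom *primary, int *live)
    {
            if (--(*live) == 0)
                    cb->tcr = primary->tcr | TCR_EPD0;      /* back to TTBR1-only */
    }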



I haven't thought it through in detail, but it also feels like between
aux_attach_dev and/or the TTBR1 quirk in attach_dev there ought to be enough
information to influence the context bank allocation or shuffle any existing
domains such that you can ensure that the right thing ends up in magic
context 0 when it needs to be. That could be a pretty neat and robust way to
finally put that to bed.


I'll try to wrap my brain around this as well. Seems like we could do a magic
swizzle of the SID mappings but I'm not sure how we could safely pull that off
on an existing domain. Maybe I'm overthinking it.


What I'm imagining isn't all that far from how we do normal domain 
attach, except instead of setting up the newly-allocated context for a 
new domain you simply clone the existing context into it, and instead of 
having a given device's set of Stream IDs to retarget you'd just scan 
though the S2CRs checking cbndx and rewriting as appropriate. Then 
finally rewrite domain->cfg.cbndx and the old context is all yours.



I'll spin up a new copy of the TTBR1 quirk patch and revive the aux domain stuff
and then we can go from there.


Sounds good, thanks!

Robin.


Re: [Patch v3 2/3] iommu: optimize iova_magazine_free_pfns()

2020-01-21 Thread Cong Wang
On Tue, Jan 21, 2020 at 1:52 AM Robin Murphy  wrote:
>
> On 18/12/2019 4:39 am, Cong Wang wrote:
> > If the magazine is empty, iova_magazine_free_pfns() should
> > be a nop, however it misses the case of mag->size==0. So we
> > should just call iova_magazine_empty().
> >
> > This should reduce the contention on iovad->iova_rbtree_lock
> > a little bit, not much at all.
>
> Have you measured that in any way? AFAICS the only time this can get
> called with a non-full magazine is in the CPU hotplug callback, where
> the impact of taking the rbtree lock and immediately releasing it seems
> unlikely to be significant on top of everything else involved in that
> operation.

This patchset was only tested as a whole; it is not easy to deploy
each patch to production and test it separately.

Is there anything wrong with optimizing a CPU hotplug path? :) And
it is called from alloc_iova_fast() too when, for example, the IOVA
space is over-cached.

Thanks.


Re: [Patch v3 1/3] iommu: avoid unnecessary magazine allocations

2020-01-21 Thread Cong Wang
On Tue, Jan 21, 2020 at 3:11 AM Robin Murphy  wrote:
>
> On 18/12/2019 4:39 am, Cong Wang wrote:
> > The IOVA cache algorithm implemented in IOMMU code does not
> > exactly match the original algorithm described in the paper
> > "Magazines and Vmem: Extending the Slab Allocator to Many
> > CPUs and Arbitrary Resources".
> >
> > Particularly, it doesn't need to free the loaded empty magazine
> > when trying to put it back to global depot. To make it work, we
> > have to pre-allocate magazines in the depot and only recycle them
> > when all of them are full.
> >
> > Before this patch, rcache->depot[] contains either full or
> > freed entries, after this patch, it contains either full or
> > empty (but allocated) entries.
>
> How much additional memory overhead does this impose (particularly on
> systems that may have many domains mostly used for large, long-term
> mappings)? I'm wary that trying to micro-optimise for the "churn network
> packets as fast as possible" case may penalise every other case,
> potentially quite badly. Lower-end embedded systems are using IOMMUs in
> front of their GPUs, video codecs, etc. precisely because they *don't*
> have much memory to spare (and thus need to scrape together large
> buffers out of whatever pages they can find).

The calculation is not complicated: 32 * 6 * 129 * 8 = 198144 bytes,
which is roughly 192K, per domain.
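
(For reference, a sketch of where those numbers come from, based on the rcache constants and magazine layout in drivers/iommu/iova.c:)

    #define IOVA_MAG_SIZE             128
    #define MAX_GLOBAL_MAGS           32    /* depot entries per rcache */
    #define IOVA_RANGE_CACHE_MAX_SIZE 6     /* power-of-two sizes that are cached */

    struct iova_magazine {
            unsigned long size;
            unsigned long pfns[IOVA_MAG_SIZE];
    };      /* (1 + 128) * 8 = 1032 bytes */

    /* preallocated depot: 32 * 6 * 1032 = 198144 bytes (~192K) per domain */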

>
> But on the other hand, if we were to go down this route, then why is
> there any dynamic allocation/freeing left at all? Once both the depot
> and the rcaches are preallocated, then AFAICS it would make more sense
> to rework the overflow case in __iova_rcache_insert() to just free the
> IOVAs and swap the empty mag around rather than destroying and
> recreating it entirely.

It's because the algorithm requires a swap(), which can't be done with a
statically allocated magazine. I had the same thought initially but gave it
up quickly when I realized this.

If you are suggesting changing the algorithm, that is not a goal of this
patchset. I do plan to search for a better algorithm, as IOMMU
performance still sucks (compared to no IOMMU) after this patchset,
but once again, I do not want to change it in this patchset.

(My ultimate goal is to find a spinlock-free algorithm, otherwise there is
no way to make it close to no-IOMMU performance.)

>
> Perhaps there's a reasonable compromise wherein we don't preallocate,
> but still 'free' empty magazines back to the depot, such that busy
> domains will quickly reach a steady-state. In fact, having now dug up
> the paper at this point of writing this reply, that appears to be what
> fig. 3.1b describes anyway - I don't see any mention of preallocating
> the depot.

That paper missed a lot of things; it doesn't even recommend a size
for the depot or the per-CPU cache. For an implementation, we still have
to think about those details, including whether to preallocate memory.

Thanks.


Re: [PATCH v3 2/5] iommu/arm-smmu: Add support for split pagetables

2020-01-21 Thread Jordan Crouse
On Tue, Jan 21, 2020 at 02:36:19PM +, Robin Murphy wrote:
> On 16/12/2019 4:37 pm, Jordan Crouse wrote:
> >Add support to enable split pagetables (TTBR1) if the supporting driver
> >requests it via the DOMAIN_ATTR_SPLIT_TABLES flag. When enabled, the driver
> >will set up the TTBR0 and TTBR1 regions and program the default domain
> >pagetable on TTBR1.
> >
> >After attaching the device, the value of he domain attribute can
> >be queried to see if the split pagetables were successfully programmed.
> >Furthermore the domain geometry will be updated so that the caller can
> >determine the active region for the pagetable that was programmed.
> >
> >Signed-off-by: Jordan Crouse 
> >---
> >
> >  drivers/iommu/arm-smmu.c | 40 +++-
> >  drivers/iommu/arm-smmu.h | 45 +++--
> >  2 files changed, 74 insertions(+), 11 deletions(-)
> >
> >diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> >index c106406..7b59116 100644
> >--- a/drivers/iommu/arm-smmu.c
> >+++ b/drivers/iommu/arm-smmu.c
> >@@ -538,9 +538,17 @@ static void arm_smmu_init_context_bank(struct 
> >arm_smmu_domain *smmu_domain,
> > cb->ttbr[0] = pgtbl_cfg->arm_v7s_cfg.ttbr;
> > cb->ttbr[1] = 0;
> > } else {
> >-cb->ttbr[0] = pgtbl_cfg->arm_lpae_s1_cfg.ttbr;
> >-cb->ttbr[0] |= FIELD_PREP(TTBRn_ASID, cfg->asid);
> >-cb->ttbr[1] = FIELD_PREP(TTBRn_ASID, cfg->asid);
> >+if (pgtbl_cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1) {
> >+cb->ttbr[0] = FIELD_PREP(TTBRn_ASID, cfg->asid);
> >+cb->ttbr[1] = pgtbl_cfg->arm_lpae_s1_cfg.ttbr;
> >+cb->ttbr[1] |=
> >+FIELD_PREP(TTBRn_ASID, cfg->asid);
> >+} else {
> >+cb->ttbr[0] = pgtbl_cfg->arm_lpae_s1_cfg.ttbr;
> >+cb->ttbr[0] |=
> >+FIELD_PREP(TTBRn_ASID, cfg->asid);
> >+cb->ttbr[1] = FIELD_PREP(TTBRn_ASID, cfg->asid);
> >+}
> > }
> > } else {
> > cb->ttbr[0] = pgtbl_cfg->arm_lpae_s2_cfg.vttbr;
> >@@ -651,6 +659,7 @@ static int arm_smmu_init_domain_context(struct 
> >iommu_domain *domain,
> > enum io_pgtable_fmt fmt;
> > struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> > struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
> >+u32 quirks = 0;
> > mutex_lock(&smmu_domain->init_mutex);
> > if (smmu_domain->smmu)
> >@@ -719,6 +728,8 @@ static int arm_smmu_init_domain_context(struct 
> >iommu_domain *domain,
> > oas = smmu->ipa_size;
> > if (cfg->fmt == ARM_SMMU_CTX_FMT_AARCH64) {
> > fmt = ARM_64_LPAE_S1;
> >+if (smmu_domain->split_pagetables)
> >+quirks |= IO_PGTABLE_QUIRK_ARM_TTBR1;
> > } else if (cfg->fmt == ARM_SMMU_CTX_FMT_AARCH32_L) {
> > fmt = ARM_32_LPAE_S1;
> > ias = min(ias, 32UL);
> >@@ -788,6 +799,7 @@ static int arm_smmu_init_domain_context(struct 
> >iommu_domain *domain,
> > .coherent_walk  = smmu->features & ARM_SMMU_FEAT_COHERENT_WALK,
> > .tlb= smmu_domain->flush_ops,
> > .iommu_dev  = smmu->dev,
> >+.quirks = quirks,
> > };
> > if (smmu_domain->non_strict)
> >@@ -801,8 +813,15 @@ static int arm_smmu_init_domain_context(struct 
> >iommu_domain *domain,
> > /* Update the domain's page sizes to reflect the page table format */
> > domain->pgsize_bitmap = pgtbl_cfg.pgsize_bitmap;
> >-domain->geometry.aperture_end = (1UL << ias) - 1;
> >-domain->geometry.force_aperture = true;
> >+
> >+if (pgtbl_cfg.quirks & IO_PGTABLE_QUIRK_ARM_TTBR1) {
> >+domain->geometry.aperture_start = ~((1ULL << ias) - 1);
> 
> AKA "~0UL << ias", if I'm not mistaken ;)
> 
> >+domain->geometry.aperture_end = ~0UL;
> >+} else {
> >+domain->geometry.aperture_end = (1UL << ias) - 1;
> >+domain->geometry.force_aperture = true;
> >+smmu_domain->split_pagetables = false;
> >+}
> > /* Initialise the context bank with our page table cfg */
> > arm_smmu_init_context_bank(smmu_domain, &pgtbl_cfg);
> >@@ -1484,6 +1503,9 @@ static int arm_smmu_domain_get_attr(struct 
> >iommu_domain *domain,
> > case DOMAIN_ATTR_NESTING:
> > *(int *)data = (smmu_domain->stage == 
> > ARM_SMMU_DOMAIN_NESTED);
> > return 0;
> >+case DOMAIN_ATTR_SPLIT_TABLES:
> >+*(int *)data = smmu_domain->split_pagetables;
> >+return 0;
> > default:
> > return -ENODEV;
> > }
> >@@ -1524,6 

Re: [PATCH] iommu: amd: Fix IOMMU perf counter clobbering during init

2020-01-21 Thread Shuah Khan

On 1/20/20 7:10 PM, Suravee Suthikulpanit wrote:

On 1/17/2020 5:08 PM, Joerg Roedel wrote:

Adding Suravee, who wrote the IOMMU Perf Counter code.

On Tue, Jan 14, 2020 at 08:12:20AM -0700, Shuah Khan wrote:

init_iommu_perf_ctr() clobbers the register when it checks write access
to IOMMU perf counters and fails to restore when they are writable.

Add save and restore to fix it.

Signed-off-by: Shuah Khan
---
  drivers/iommu/amd_iommu_init.c | 22 --
  1 file changed, 16 insertions(+), 6 deletions(-)

Suravee, can you please review this patch?



This looks ok. Does this fix certain issues? Or is this just for sanity.


I didn't notice any problems. Counters aren't writable on my system.
However, it certainly looks like a bug, since registers aren't restored
the way they are in other places in this file where such checks are done
on other registers.

I see 2 banks and 4 counters on my system. Is it sufficient to check
the first bank and first counter? In other words, if the first one
isn't writable, are all counters non-writable?

Should we read the config first and then try to see if any of the
counters are writable? I have a patch that does that; I can send it
out for review.
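
Something along those lines could look roughly like the sketch below (pc_read()/pc_write() are placeholders for the driver's existing perf-counter register accessors, and the bank/counter counts would come from the values read from the config):

    /* Sketch: probe every bank/counter instead of only bank 0, counter 0,
     * restoring each register after the write test. */
    static bool iommu_pc_any_writable(struct amd_iommu *iommu, u8 banks, u8 cntrs)
    {
            u64 saved, probe = 0xabcd, readback;
            u8 b, c;

            for (b = 0; b < banks; b++) {
                    for (c = 0; c < cntrs; c++) {
                            if (pc_read(iommu, b, c, &saved) ||
                                pc_write(iommu, b, c, &probe) ||
                                pc_read(iommu, b, c, &readback))
                                    continue;
                            pc_write(iommu, b, c, &saved);  /* restore original value */
                            if (readback == probe)
                                    return true;
                    }
            }
            return false;
    }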



Reviewed-by: Suravee Suthikulpanit 

Thanks for the review.

thanks,
-- Shuah

Re: [PATCH v3 2/5] iommu/arm-smmu: Add support for split pagetables

2020-01-21 Thread Robin Murphy

Oh, one more thing...

On 16/12/2019 4:37 pm, Jordan Crouse wrote:
[...]

@@ -651,6 +659,7 @@ static int arm_smmu_init_domain_context(struct iommu_domain 
*domain,
enum io_pgtable_fmt fmt;
struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+   u32 quirks = 0;
  
  	mutex_lock(&smmu_domain->init_mutex);

if (smmu_domain->smmu)
@@ -719,6 +728,8 @@ static int arm_smmu_init_domain_context(struct iommu_domain 
*domain,
oas = smmu->ipa_size;
if (cfg->fmt == ARM_SMMU_CTX_FMT_AARCH64) {
fmt = ARM_64_LPAE_S1;
+   if (smmu_domain->split_pagetables)
+   quirks |= IO_PGTABLE_QUIRK_ARM_TTBR1;


To avoid me forgetting and questioning it again in future, I'd recommend 
sticking a comment somewhere near here that we don't reduce cfg->ias in 
this case because we're currently assuming SEP_UPSTREAM.


Robin.


} else if (cfg->fmt == ARM_SMMU_CTX_FMT_AARCH32_L) {
fmt = ARM_32_LPAE_S1;
ias = min(ias, 32UL);



Re: [PATCH v3 2/5] iommu/arm-smmu: Add support for split pagetables

2020-01-21 Thread Robin Murphy

On 16/12/2019 4:37 pm, Jordan Crouse wrote:

Add support to enable split pagetables (TTBR1) if the supporting driver
requests it via the DOMAIN_ATTR_SPLIT_TABLES flag. When enabled, the driver
will set up the TTBR0 and TTBR1 regions and program the default domain
pagetable on TTBR1.

After attaching the device, the value of he domain attribute can
be queried to see if the split pagetables were successfully programmed.
Furthermore the domain geometry will be updated so that the caller can
determine the active region for the pagetable that was programmed.

Signed-off-by: Jordan Crouse 
---

  drivers/iommu/arm-smmu.c | 40 +++-
  drivers/iommu/arm-smmu.h | 45 +++--
  2 files changed, 74 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index c106406..7b59116 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -538,9 +538,17 @@ static void arm_smmu_init_context_bank(struct 
arm_smmu_domain *smmu_domain,
cb->ttbr[0] = pgtbl_cfg->arm_v7s_cfg.ttbr;
cb->ttbr[1] = 0;
} else {
-   cb->ttbr[0] = pgtbl_cfg->arm_lpae_s1_cfg.ttbr;
-   cb->ttbr[0] |= FIELD_PREP(TTBRn_ASID, cfg->asid);
-   cb->ttbr[1] = FIELD_PREP(TTBRn_ASID, cfg->asid);
+   if (pgtbl_cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1) {
+   cb->ttbr[0] = FIELD_PREP(TTBRn_ASID, cfg->asid);
+   cb->ttbr[1] = pgtbl_cfg->arm_lpae_s1_cfg.ttbr;
+   cb->ttbr[1] |=
+   FIELD_PREP(TTBRn_ASID, cfg->asid);
+   } else {
+   cb->ttbr[0] = pgtbl_cfg->arm_lpae_s1_cfg.ttbr;
+   cb->ttbr[0] |=
+   FIELD_PREP(TTBRn_ASID, cfg->asid);
+   cb->ttbr[1] = FIELD_PREP(TTBRn_ASID, cfg->asid);
+   }
}
} else {
cb->ttbr[0] = pgtbl_cfg->arm_lpae_s2_cfg.vttbr;
@@ -651,6 +659,7 @@ static int arm_smmu_init_domain_context(struct iommu_domain 
*domain,
enum io_pgtable_fmt fmt;
struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+   u32 quirks = 0;
  
  	mutex_lock(&smmu_domain->init_mutex);

if (smmu_domain->smmu)
@@ -719,6 +728,8 @@ static int arm_smmu_init_domain_context(struct iommu_domain 
*domain,
oas = smmu->ipa_size;
if (cfg->fmt == ARM_SMMU_CTX_FMT_AARCH64) {
fmt = ARM_64_LPAE_S1;
+   if (smmu_domain->split_pagetables)
+   quirks |= IO_PGTABLE_QUIRK_ARM_TTBR1;
} else if (cfg->fmt == ARM_SMMU_CTX_FMT_AARCH32_L) {
fmt = ARM_32_LPAE_S1;
ias = min(ias, 32UL);
@@ -788,6 +799,7 @@ static int arm_smmu_init_domain_context(struct iommu_domain 
*domain,
.coherent_walk  = smmu->features & ARM_SMMU_FEAT_COHERENT_WALK,
.tlb= smmu_domain->flush_ops,
.iommu_dev  = smmu->dev,
+   .quirks = quirks,
};
  
  	if (smmu_domain->non_strict)

@@ -801,8 +813,15 @@ static int arm_smmu_init_domain_context(struct 
iommu_domain *domain,
  
  	/* Update the domain's page sizes to reflect the page table format */

domain->pgsize_bitmap = pgtbl_cfg.pgsize_bitmap;
-   domain->geometry.aperture_end = (1UL << ias) - 1;
-   domain->geometry.force_aperture = true;
+
+   if (pgtbl_cfg.quirks & IO_PGTABLE_QUIRK_ARM_TTBR1) {
+   domain->geometry.aperture_start = ~((1ULL << ias) - 1);


AKA "~0UL << ias", if I'm not mistaken ;)


+   domain->geometry.aperture_end = ~0UL;
+   } else {
+   domain->geometry.aperture_end = (1UL << ias) - 1;
+   domain->geometry.force_aperture = true;
+   smmu_domain->split_pagetables = false;
+   }
  
  	/* Initialise the context bank with our page table cfg */

arm_smmu_init_context_bank(smmu_domain, &pgtbl_cfg);
@@ -1484,6 +1503,9 @@ static int arm_smmu_domain_get_attr(struct iommu_domain 
*domain,
case DOMAIN_ATTR_NESTING:
*(int *)data = (smmu_domain->stage == 
ARM_SMMU_DOMAIN_NESTED);
return 0;
+   case DOMAIN_ATTR_SPLIT_TABLES:
+   *(int *)data = smmu_domain->split_pagetables;
+   return 0;
default:
return -ENODEV;
}
@@ -1524,6 +1546,14 @@ static int arm_smmu_domain_set_attr(struct iommu_domain 
*domain,
else
smmu_domain->stage = ARM_SMMU_DOMAIN_S1;
 

Re: [RFC PATCH 2/4] PCI: Add "pci=iommu_passthrough=" parameter for iommu passthrough

2020-01-21 Thread Bjorn Helgaas
[+cc linux-pci, thread at 
https://lore.kernel.org/r/20200101052648.14295-1-baolu...@linux.intel.com]

On Wed, Jan 01, 2020 at 01:26:46PM +0800, Lu Baolu wrote:
> The new parameter takes a list of devices separated by a semicolon.
> Each device specified will have its iommu_passthrough bit in struct
> device set. This is very similar to the existing 'disable_acs_redir'
> parameter.

Almost all of this patchset is in drivers/iommu.  Should the parameter
be "iommu ..." instead of "pci=iommu_passthrough=..."?

There is already an "iommu.passthrough=" argument.  Would this fit
better there?  Since the iommu_passthrough bit is generic, it seems
like you anticipate similar situations for non-PCI devices.
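
(For illustration, the proposed parameter would be given along these lines, with the device addresses below being hypothetical examples in the usual <domain>:<bus>:<device>.<function> form:)

    pci=iommu_passthrough=0000:01:00.0;0000:02:00.0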

> Signed-off-by: Lu Baolu 
> ---
>  .../admin-guide/kernel-parameters.txt |  5 +++
>  drivers/pci/pci.c | 34 +++
>  drivers/pci/pci.h |  1 +
>  drivers/pci/probe.c   |  2 ++
>  4 files changed, 42 insertions(+)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index ade4e6ec23e0..d3edc2cb6696 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3583,6 +3583,11 @@
>   may put more devices in an IOMMU group.
>   force_floating  [S390] Force usage of floating interrupts.
>   nomio   [S390] Do not use MIO instructions.
> + iommu_passthrough=[; ...]
> + Specify one or more PCI devices (in the format
> + specified above) separated by semicolons.
> + Each device specified will bypass IOMMU DMA
> + translation.
>  
>   pcie_aspm=  [PCIE] Forcibly enable or disable PCIe Active State 
> Power
>   Management.
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 90dbd7c70371..05bf3f4acc36 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -6401,6 +6401,37 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
>  }
>  EXPORT_SYMBOL(pci_fixup_cardbus);
>  
> +static const char *iommu_passthrough_param;
> +bool pci_iommu_passthrough_match(struct pci_dev *dev)
> +{
> + int ret = 0;
> + const char *p = iommu_passthrough_param;
> +
> + if (!p)
> + return false;
> +
> + while (*p) {
> + ret = pci_dev_str_match(dev, p, &p);
> + if (ret < 0) {
> + pr_info_once("PCI: Can't parse iommu_passthrough 
> parameter: %s\n",
> +  iommu_passthrough_param);
> +
> + break;
> + } else if (ret == 1) {
> + pci_info(dev, "PCI: IOMMU passthrough\n");
> + return true;
> + }
> +
> + if (*p != ';' && *p != ',') {
> + /* End of param or invalid format */
> + break;
> + }
> + p++;
> + }
> +
> + return false;
> +}
> +
>  static int __init pci_setup(char *str)
>  {
>   while (str) {
> @@ -6462,6 +6493,8 @@ static int __init pci_setup(char *str)
>   pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
>   } else if (!strncmp(str, "disable_acs_redir=", 18)) {
>   disable_acs_redir_param = str + 18;
> + } else if (!strncmp(str, "iommu_passthrough=", 18)) {
> + iommu_passthrough_param = str + 18;
>   } else {
>   pr_err("PCI: Unknown option `%s'\n", str);
>   }
> @@ -6486,6 +6519,7 @@ static int __init pci_realloc_setup_params(void)
>   resource_alignment_param = kstrdup(resource_alignment_param,
>  GFP_KERNEL);
>   disable_acs_redir_param = kstrdup(disable_acs_redir_param, GFP_KERNEL);
> + iommu_passthrough_param = kstrdup(iommu_passthrough_param, GFP_KERNEL);
>  
>   return 0;
>  }
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index a0a53bd05a0b..95f6af06aba6 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -288,6 +288,7 @@ void pci_reassigndev_resource_alignment(struct pci_dev *dev);
>  void pci_disable_bridge_window(struct pci_dev *dev);
>  struct pci_bus *pci_bus_get(struct pci_bus *bus);
>  void pci_bus_put(struct pci_bus *bus);
> +bool pci_iommu_passthrough_match(struct pci_dev *dev);
>  
>  /* PCIe link information */
>  #define PCIE_SPEED2STR(speed) \
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 512cb4312ddd..4c571ee75621 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2404,6 +2404,8 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
>  
>   dev->state_saved = false;
>  
> + 

Re: [RFC PATCH 3/4] iommu: Preallocate iommu group when probing devices

2020-01-21 Thread Robin Murphy

On 19/01/2020 6:29 am, Lu Baolu wrote:

Hi Joerg,

On 1/17/20 6:21 PM, Joerg Roedel wrote:

On Wed, Jan 01, 2020 at 01:26:47PM +0800, Lu Baolu wrote:

This splits iommu group allocation from adding devices. This makes
it possible to determine the default domain type for each group as
all devices belonging to the group have been determined.


I think its better to keep group allocation as it is and just defer
default domain allocation after each device is in its group. But take


I tried defering default domain allocation, but it seems not possible.

The call path of adding devices into their groups:

iommu_probe_device
-> ops->add_device(dev)
    -> (iommu vendor driver) iommu_group_get_for_dev(dev)

After doing this, the vendor driver will get the default domain and
apply dma_ops according to the domain type. If we defer the domain
allocation, they will get a NULL default domain and cause panic in
the vendor driver.

Any suggestions?


https://lore.kernel.org/linux-iommu/6dbbfc10-3247-744c-ae8d-443a336e0...@linux.intel.com/

Haven't we been here before? ;)

Since we can't (safely or reasonably) change a group's default domain 
after ops->add_device() has returned, and in general it gets impractical 
to evaluate "all device in a group" once you look beyond _bus_type 
(or consider hotplug as mentioned), then AFAICS there's no reasonable 
way to get away from the default domain type being defined by the first 
device to attach. But in practice it's hardly a problem anyway - if 
every device in a given group requests the same domain type then it 
doesn't matter which comes first, and if they don't then we ultimately 
end up with an impossible set of constraints, so are doomed to do the 
'wrong' thing regardless.


Thus unless anyone wants to start redefining the whole group concept to 
separate the notions of ID aliasing and peer-to-peer isolation (which 
still wouldn't necessarily help), I think this user override effectively 
boils down to just another flavour of iommu_request_*_for_dev(), and as 
such comes right back to the "just pass the bloody device to 
ops->domain_alloc() and resolve everything up-front" argument.


Robin.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [Patch v3 1/3] iommu: avoid unnecessary magazine allocations

2020-01-21 Thread Robin Murphy

On 18/12/2019 4:39 am, Cong Wang wrote:

The IOVA cache algorithm implemented in IOMMU code does not
exactly match the original algorithm described in the paper
"Magazines and Vmem: Extending the Slab Allocator to Many
CPUs and Arbitrary Resources".

In particular, it does not need to free the loaded empty magazine
when putting it back into the global depot. To make this work, we
have to pre-allocate magazines in the depot and only recycle them
when all of them are full.

Before this patch, rcache->depot[] contains either full or
freed entries; after this patch, it contains either full or
empty (but allocated) entries.


How much additional memory overhead does this impose (particularly on 
systems that may have many domains mostly used for large, long-term 
mappings)? I'm wary that trying to micro-optimise for the "churn network 
packets as fast as possible" case may penalise every other case, 
potentially quite badly. Lower-end embedded systems are using IOMMUs in 
front of their GPUs, video codecs, etc. precisely because they *don't* 
have much memory to spare (and thus need to scrape together large 
buffers out of whatever pages they can find).


But on the other hand, if we were to go down this route, then why is 
there any dynamic allocation/freeing left at all? Once both the depot 
and the rcaches are preallocated, then AFAICS it would make more sense 
to rework the overflow case in __iova_rcache_insert() to just free the 
IOVAs and swap the empty mag around rather than destroying and 
recreating it entirely.
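
As a rough sketch of that suggestion (not a tested change), with the depot preallocated as in this patch the overflow branch of __iova_rcache_insert() could stop allocating and freeing magazines altogether: retire the full 'loaded' magazine into a free depot slot when one exists, otherwise flush its PFNs and keep reusing it in place:

	} else {
		struct iova_magazine *mag_to_flush = NULL;

		spin_lock(&rcache->lock);
		if (rcache->depot_size < MAX_GLOBAL_MAGS) {
			/* Hand the full magazine to the depot, take back a preallocated empty one. */
			swap(rcache->depot[rcache->depot_size], cpu_rcache->loaded);
			rcache->depot_size++;
		} else {
			/* Depot already full: empty this magazine instead of destroying it. */
			mag_to_flush = cpu_rcache->loaded;
		}
		spin_unlock(&rcache->lock);

		if (mag_to_flush)
			iova_magazine_free_pfns(mag_to_flush, iovad);
		can_insert = true;
	}

Either way 'loaded' ends up with room, so the subsequent push always succeeds and no magazine is ever freed or reallocated on this path.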


Perhaps there's a reasonable compromise wherein we don't preallocate, 
but still 'free' empty magazines back to the depot, such that busy 
domains will quickly reach a steady-state. In fact, having now dug up 
the paper at this point of writing this reply, that appears to be what 
fig. 3.1b describes anyway - I don't see any mention of preallocating 
the depot.
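
Concretely, a non-preallocating variant of that compromise might look like the below on the allocation side, where the empty[]/empty_size depot fields are hypothetical and added purely for illustration: instead of freeing the exhausted 'loaded' magazine when pulling a full one from the depot, __iova_rcache_get() parks it for later reuse:

	spin_lock(&rcache->lock);
	if (rcache->depot_size > 0) {
		/* Keep the exhausted magazine around rather than freeing it... */
		if (rcache->empty_size < MAX_GLOBAL_MAGS)
			rcache->empty[rcache->empty_size++] = cpu_rcache->loaded;
		else
			iova_magazine_free(cpu_rcache->loaded);
		/* ...and load a full one from the depot as before. */
		cpu_rcache->loaded = rcache->depot[--rcache->depot_size];
		has_pfn = true;
	}
	spin_unlock(&rcache->lock);

The overflow path of __iova_rcache_insert() would then try rcache->empty[] before falling back to iova_magazine_alloc(GFP_ATOMIC), so a busy domain soon stops allocating at all while an idle one never grows beyond what it has actually used.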


Robin.



Together with a few other changes to make it exactly match
the pseudo code in the paper.

Cc: Joerg Roedel 
Cc: John Garry 
Signed-off-by: Cong Wang 
---
  drivers/iommu/iova.c | 45 +++-
  1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 41c605b0058f..cb473ddce4cf 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -862,12 +862,16 @@ static void init_iova_rcaches(struct iova_domain *iovad)
struct iova_cpu_rcache *cpu_rcache;
struct iova_rcache *rcache;
unsigned int cpu;
-   int i;
+   int i, j;
  
  	for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {

rcache = &iovad->rcaches[i];
spin_lock_init(>lock);
rcache->depot_size = 0;
+   for (j = 0; j < MAX_GLOBAL_MAGS; ++j) {
+   rcache->depot[j] = iova_magazine_alloc(GFP_KERNEL);
+   WARN_ON(!rcache->depot[j]);
+   }
rcache->cpu_rcaches = __alloc_percpu(sizeof(*cpu_rcache), cache_line_size());
if (WARN_ON(!rcache->cpu_rcaches))
continue;
@@ -900,24 +904,30 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
  
  	if (!iova_magazine_full(cpu_rcache->loaded)) {

can_insert = true;
-   } else if (!iova_magazine_full(cpu_rcache->prev)) {
+   } else if (iova_magazine_empty(cpu_rcache->prev)) {
swap(cpu_rcache->prev, cpu_rcache->loaded);
can_insert = true;
} else {
-   struct iova_magazine *new_mag = iova_magazine_alloc(GFP_ATOMIC);
+   spin_lock(&rcache->lock);
+   if (rcache->depot_size < MAX_GLOBAL_MAGS) {
+   swap(rcache->depot[rcache->depot_size], cpu_rcache->prev);
+   swap(cpu_rcache->prev, cpu_rcache->loaded);
+   rcache->depot_size++;
+   can_insert = true;
+   } else {
+   mag_to_free = cpu_rcache->loaded;
+   }
+   spin_unlock(&rcache->lock);
+
+   if (mag_to_free) {
+   struct iova_magazine *new_mag = iova_magazine_alloc(GFP_ATOMIC);
  
-		if (new_mag) {

-   spin_lock(&rcache->lock);
-   if (rcache->depot_size < MAX_GLOBAL_MAGS) {
-   rcache->depot[rcache->depot_size++] =
-   cpu_rcache->loaded;
+   if (new_mag) {
+   cpu_rcache->loaded = new_mag;
+   can_insert = true;
} else {
-   mag_to_free = cpu_rcache->loaded;
+   mag_to_free = NULL;
}
-   spin_unlock(&rcache->lock);
-
-   cpu_rcache->loaded = new_mag;
-   can_insert = true;
}
}
  
@@ -963,14 +973,15 

Re: [RFC PATCH 0/4] iommu: Per-group default domain type

2020-01-21 Thread John Garry

On 21/01/2020 00:43, Lu Baolu wrote:
An IOMMU group represents the smallest set of devices that are considered
to be isolated. All devices belonging to an IOMMU group share a default
domain for DMA APIs. There are two types of default domain:
IOMMU_DOMAIN_DMA and IOMMU_DOMAIN_IDENTITY. The former means IOMMU
translation, while the latter means IOMMU bypass.

Currently, the default domain type for the IOMMU groups is determined
globally. All IOMMU groups use a single default domain type. The global
default domain type can be adjusted by kernel build configuration or
kernel parameters.

More and more users are looking forward to a fine-grained default domain
type. For example, with the global default domain type set to translation,
OEM vendors or end users might want some trusted and high-speed devices
to bypass the IOMMU for performance gains. On the other hand, with the
global default domain type set to bypass, some devices with limited system
memory addressing capability might want IOMMU translation to remove the
bounce buffer overhead.


Hi Lu Baolu,

Do you think that it would be a more common use case to want
kernel-managed devices to be passthrough for performance reasons and
some select devices to be in a DMA domain, like those with limited
addressing capability or whose drivers request huge amounts of memory?


I just think it would be more manageable to set kernel command-line
parameters for this, i.e. for those select few which want a DMA domain.




Hi Baolu,



It's just two sides of a coin. Currently, the iommu subsystem makes DMA
domains by default; that's the reason why I chose to let the user set
which devices are willing to use identity domains.



OK, understood.

There was an alternate solution here which would allow per-group type to 
be updated via sysfs:


https://lore.kernel.org/linux-iommu/cover.1566353521.git.sai.praneeth.prak...@intel.com/

Any idea what happened to that?

Cheers,
John
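
For reference, that series exposed the choice through the group's sysfs directory. Assuming an interface along the lines it proposed (the attribute name and accepted strings here are from memory and may differ), retyping a group whose drivers are unbound would look roughly like:

	# echo identity > /sys/kernel/iommu_groups/5/type
	# echo DMA > /sys/kernel/iommu_groups/5/type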

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [Patch v3 3/3] iommu: avoid taking iova_rbtree_lock twice

2020-01-21 Thread Robin Murphy

On 18/12/2019 4:39 am, Cong Wang wrote:

Both find_iova() and __free_iova() take iova_rbtree_lock;
there is no reason to take and release it twice inside
free_iova().

Fold them into one critical section by calling the unlock
versions instead.


Makes sense to me.

Reviewed-by: Robin Murphy 


Cc: Joerg Roedel 
Cc: John Garry 
Signed-off-by: Cong Wang 
---
  drivers/iommu/iova.c | 8 ++--
  1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 184d4c0e20b5..f46f8f794678 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -390,10 +390,14 @@ EXPORT_SYMBOL_GPL(__free_iova);
  void
  free_iova(struct iova_domain *iovad, unsigned long pfn)
  {
-   struct iova *iova = find_iova(iovad, pfn);
+   unsigned long flags;
+   struct iova *iova;
  
+	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);

+   iova = private_find_iova(iovad, pfn);
if (iova)
-   __free_iova(iovad, iova);
+   private_free_iova(iovad, iova);
+   spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
  
  }

  EXPORT_SYMBOL_GPL(free_iova);


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [Patch v3 2/3] iommu: optimize iova_magazine_free_pfns()

2020-01-21 Thread Robin Murphy

On 18/12/2019 4:39 am, Cong Wang wrote:

If the magazine is empty, iova_magazine_free_pfns() should
be a no-op; however, it misses the case of mag->size == 0.
So we should just call iova_magazine_empty().

This should reduce the contention on iovad->iova_rbtree_lock
a little bit, though not by much.


Have you measured that in any way? AFAICS the only time this can get 
called with a non-full magazine is in the CPU hotplug callback, where 
the impact of taking the rbtree lock and immediately releasing it seems 
unlikely to be significant on top of everything else involved in that 
operation.


Robin.


Cc: Joerg Roedel 
Cc: John Garry 
Signed-off-by: Cong Wang 
---
  drivers/iommu/iova.c | 22 +++---
  1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index cb473ddce4cf..184d4c0e20b5 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -797,13 +797,23 @@ static void iova_magazine_free(struct iova_magazine *mag)
kfree(mag);
  }
  
+static bool iova_magazine_full(struct iova_magazine *mag)

+{
+   return (mag && mag->size == IOVA_MAG_SIZE);
+}
+
+static bool iova_magazine_empty(struct iova_magazine *mag)
+{
+   return (!mag || mag->size == 0);
+}
+
  static void
  iova_magazine_free_pfns(struct iova_magazine *mag, struct iova_domain *iovad)
  {
unsigned long flags;
int i;
  
-	if (!mag)

+   if (iova_magazine_empty(mag))
return;
  
  	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);

@@ -820,16 +830,6 @@ iova_magazine_free_pfns(struct iova_magazine *mag, struct iova_domain *iovad)
mag->size = 0;
  }
  
-static bool iova_magazine_full(struct iova_magazine *mag)

-{
-   return (mag && mag->size == IOVA_MAG_SIZE);
-}
-
-static bool iova_magazine_empty(struct iova_magazine *mag)
-{
-   return (!mag || mag->size == 0);
-}
-
  static unsigned long iova_magazine_pop(struct iova_magazine *mag,
   unsigned long limit_pfn)
  {


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu