[PATCH] iommu/vt-d: check alignment before using psi

2022-03-15 Thread David Stevens
From: David Stevens 

Fall back to domain selective flush if the target address is not aligned
to the mask being used for invalidation. This is necessary because page
selective invalidation masks out the lower order bits of the target
address based on the mask value, so if a non-aligned address is targeted
for psi, then mappings at the end of [pfn, pfn+pages) may not properly
be flushed from the iotlb.

This is not normally an issue because iova.c always allocates iovas that
are aligned to their size. However, iovas which come from other sources
(e.g. userspace via VFIO) may not be aligned.
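To illustrate the failure mode, here is a minimal standalone sketch
(userspace C with illustrative values; not part of the fix itself):

	#include <stdio.h>

	/* psi clears the low `mask` pfn bits of the target address, so
	 * the flushed window starts at the aligned-down pfn. */
	static unsigned long psi_base(unsigned long pfn, unsigned int mask)
	{
		return pfn & ~((1UL << mask) - 1);
	}

	int main(void)
	{
		unsigned long pfn = 0x1003;	/* not aligned to 2^3 pages */
		unsigned int mask = 3;		/* invalidate 2^3 = 8 pages */

		/* Prints "flushed [0x1000, 0x1008)"; the caller wanted
		 * [0x1003, 0x100b), so pfns 0x1008..0x100a stay stale. */
		printf("flushed [%#lx, %#lx)\n", psi_base(pfn, mask),
		       psi_base(pfn, mask) + (1UL << mask));
		return 0;
	}

With the check added below, such a request falls back to a domain
selective flush instead.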

Signed-off-by: David Stevens 
---
 drivers/iommu/intel/iommu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 5b196cfe9ed2..c122686e0a5c 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1735,7 +1735,8 @@ static void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
 	 * and the base address is naturally aligned to the size.
 	 */
 	if (!cap_pgsel_inv(iommu->cap) ||
-	    mask > cap_max_amask_val(iommu->cap))
+	    mask > cap_max_amask_val(iommu->cap) ||
+	    unlikely(((1 << mask) - 1) & pfn))
 		iommu->flush.flush_iotlb(iommu, did, 0, 0,
 					 DMA_TLB_DSI_FLUSH);
 	else
-- 
2.35.1.723.g4982287a31-goog



Re: [PATCH 2/2] thunderbolt: Use pre-boot DMA protection on AMD systems

2022-03-15 Thread kernel test robot
Hi Mario,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on joro-iommu/next]
[also build test ERROR on arm-perf/for-next/perf linus/master v5.17-rc8 next-20220315]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/0day-ci/linux/commits/Mario-Limonciello/iommu-amd-Add-support-to-indicate-whether-DMA-remap-support-is-enabled/20220316-002821
base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
config: x86_64-randconfig-a012 (https://download.01.org/0day-ci/archive/20220316/202203160904.vb4alcdg-...@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project a6b2f50fb47da3baeee10b1906da6e30ac5d26ec)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/9b0b7079d348c607cba7af4c87eaae1a79e52d91
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Mario-Limonciello/iommu-amd-Add-support-to-indicate-whether-DMA-remap-support-is-enabled/20220316-002821
        git checkout 9b0b7079d348c607cba7af4c87eaae1a79e52d91
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

   In file included from drivers/thunderbolt/domain.c:9:
>> include/linux/amd-iommu.h:159:52: error: use of undeclared identifier 'ENODEV'
   static inline int amd_iommu_detect(void) { return -ENODEV; }
                                                      ^
   1 error generated.


vim +/ENODEV +159 include/linux/amd-iommu.h

6a9401a7ac13e6 arch/x86/include/asm/amd_iommu.h Joerg Roedel          2009-11-20  158  
480125ba49ba62 arch/x86/include/asm/amd_iommu.h Konrad Rzeszutek Wilk 2010-08-26 @159  static inline int amd_iommu_detect(void) { return -ENODEV; }
6a9401a7ac13e6 arch/x86/include/asm/amd_iommu.h Joerg Roedel          2009-11-20  160  
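The undeclared ENODEV suggests amd-iommu.h lacks an errno include and only
builds when some earlier header happens to pull one in; here it is included
standalone from drivers/thunderbolt/domain.c. A likely fix, offered as an
assumption rather than the eventual patch:

	/* include/linux/amd-iommu.h */
	#include <linux/errno.h>	/* the inline stubs return -ENODEV */

	static inline int amd_iommu_detect(void) { return -ENODEV; }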



Re: [PATCH 1/2] iommu/amd: Add support to indicate whether DMA remap support is enabled

2022-03-15 Thread kernel test robot
Hi Mario,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on joro-iommu/next]
[also build test ERROR on arm-perf/for-next/perf linus/master v5.17-rc8 next-20220315]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/0day-ci/linux/commits/Mario-Limonciello/iommu-amd-Add-support-to-indicate-whether-DMA-remap-support-is-enabled/20220316-002821
base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
config: x86_64-defconfig (https://download.01.org/0day-ci/archive/20220316/202203160844.lkviwr1q-...@intel.com/config)
compiler: gcc-9 (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/fa63035401902e438c5ef3213112901a1054c621
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Mario-Limonciello/iommu-amd-Add-support-to-indicate-whether-DMA-remap-support-is-enabled/20220316-002821
        git checkout fa63035401902e438c5ef3213112901a1054c621
        # save the config file to linux build tree
        mkdir build_dir
        make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

>> drivers/iommu/amd/init.c:3294:6: error: redefinition of 'amd_ivrs_remap_support'
    3294 | bool amd_ivrs_remap_support(void)
         |      ^~~~~~~~~~~~~~~~~~~~~~
   In file included from drivers/iommu/amd/init.c:20:
   include/linux/amd-iommu.h:198:20: note: previous definition of 'amd_ivrs_remap_support' was here
     198 | static inline bool amd_ivrs_remap_support(void)
         |                    ^~~~~~~~~~~~~~~~~~~~~~


vim +/amd_ivrs_remap_support +3294 drivers/iommu/amd/init.c

  3284  
  3285  /*
  3286   * ivrs_remap_support - Is %IOMMU_IVINFO_DMA_REMAP set in IVRS table
  3287   *
  3288   * Returns true if the platform has %IOMMU_IVINFO_DMA_REMAP% set in the IOMMU
  3289   * IVRS IVInfo field.
  3290   * Presence of this flag indicates to the OS/HV that the IOMMU is used for
  3291   * Preboot DMA protection and device accessed memory should be remapped after
  3292   * the OS has loaded.
  3293   */
> 3294  bool amd_ivrs_remap_support(void)
  3295  {
  3296  return amdr_ivrs_remap_support;
  3297  }
  3298  EXPORT_SYMBOL_GPL(amd_ivrs_remap_support);
  3299  
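The redefinition points at mismatched config guards: in patch 1/2 (later in
this digest) the static inline stub sits under the #else of
defined(CONFIG_AMD_IOMMU) && defined(CONFIG_IRQ_REMAP), while the real
definition in drivers/iommu/amd/init.c builds whenever CONFIG_AMD_IOMMU is
set, so an AMD_IOMMU=y, IRQ_REMAP=n build (x86_64-defconfig here) sees both.
A sketch of the usual split, as an assumption about the intended structure:

	/* include/linux/amd-iommu.h */
	#if IS_ENABLED(CONFIG_AMD_IOMMU)
	extern bool amd_ivrs_remap_support(void);
	#else
	static inline bool amd_ivrs_remap_support(void)
	{
		return false;
	}
	#endif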



Re: [PATCH 12/15] swiotlb: provide swiotlb_init variants that remap the buffer

2022-03-15 Thread Boris Ostrovsky




On 3/15/22 2:36 AM, Christoph Hellwig wrote:


@@ -271,12 +273,23 @@ void __init swiotlb_init(bool addressing_limit, unsigned int flags)
 * allow to pick a location everywhere for hypervisors with guest
 * memory encryption.
 */
+retry:
+   bytes = PAGE_ALIGN(default_nslabs << IO_TLB_SHIFT);
if (flags & SWIOTLB_ANY)
tlb = memblock_alloc(bytes, PAGE_SIZE);
else
tlb = memblock_alloc_low(bytes, PAGE_SIZE);
if (!tlb)
goto fail;
+   if (remap && remap(tlb, nslabs) < 0) {
+   memblock_free(tlb, PAGE_ALIGN(bytes));
+
+   if (nslabs <= IO_TLB_MIN_SLABS)
+   panic("%s: Failed to remap %zu bytes\n",
+ __func__, bytes);
+   nslabs = max(1024UL, ALIGN(nslabs >> 1, IO_TLB_SEGSIZE));



I spoke with Konrad (who wrote the original patch --- 
f4b2f07b2ed9b469ead87e06fc2fc3d12663a725) and apparently the reason for 2MB was 
to optimize for Xen's slab allocator; it had nothing to do with 
IO_TLB_MIN_SLABS. Since this is now common code we should not expose 
Xen-specific optimizations here, and smaller values will still work, so 
IO_TLB_MIN_SLABS is fine.

I think this should be mentioned in the commit message though, probably best in 
the next patch where you switch to this code.

As far as the hunk above, I don't think we need the max() here: with 
IO_TLB_MIN_SLABS being 512, the max() keeps nslabs at 1024, so the panic 
check never fires and we may get stuck in an infinite loop. Something like

nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE);
if (nslabs <= IO_TLB_MIN_SLABS)
panic()

should be sufficient.
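A quick standalone sketch of the loop being described (illustrative values,
not the kernel code): the max() clamp holds nslabs at 1024, so with
IO_TLB_MIN_SLABS at 512 the panic check can never fire and a persistently
failing remap() retries forever.

	#include <stdio.h>

	#define IO_TLB_SEGSIZE   128
	#define IO_TLB_MIN_SLABS 512
	#define ALIGN(x, a)      (((x) + (a) - 1) / (a) * (a))

	int main(void)
	{
		unsigned long nslabs = 4096;

		/* Simulate remap() failing on every pass. */
		for (int retry = 0; retry < 5; retry++) {
			unsigned long next = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE);

			nslabs = next > 1024 ? next : 1024;	/* the max() clamp */
			printf("retry %d: nslabs=%lu, panic check %s\n", retry, nslabs,
			       nslabs <= IO_TLB_MIN_SLABS ? "fires" : "does not fire");
		}
		return 0;
	}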



+   goto retry;
+   }
if (swiotlb_init_with_tbl(tlb, default_nslabs, flags))
goto fail_free_mem;
return;
@@ -287,12 +300,18 @@ void __init swiotlb_init(bool addressing_limit, unsigned int flags)
pr_warn("Cannot allocate buffer");
  }
  
+void __init swiotlb_init(bool addressing_limit, unsigned int flags)
+{
+	return swiotlb_init_remap(addressing_limit, flags, NULL);
+}
+
  /*
   * Systems with larger DMA zones (those that don't support ISA) can
   * initialize the swiotlb later using the slab allocator if needed.
   * This should be just like above, but with some error catching.
   */
-int swiotlb_init_late(size_t size, gfp_t gfp_mask)
+int swiotlb_init_late(size_t size, gfp_t gfp_mask,
+   int (*remap)(void *tlb, unsigned long nslabs))
  {
unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
unsigned long bytes;
@@ -303,6 +322,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask)
if (swiotlb_force_disable)
return 0;
  
+retry:

order = get_order(nslabs << IO_TLB_SHIFT);
nslabs = SLABS_PER_PAGE << order;
bytes = nslabs << IO_TLB_SHIFT;
@@ -317,6 +337,16 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask)
  
  	if (!vstart)

return -ENOMEM;
+   if (remap)
+   rc = remap(vstart, nslabs);
+   if (rc) {
+   free_pages((unsigned long)vstart, order);
+
+   if (IO_TLB_MIN_SLABS <= 1024)
+   return rc;
+   nslabs = max(1024UL, ALIGN(nslabs >> 1, IO_TLB_SEGSIZE));



Same here. (The 'if' check above is wrong anyway).
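Presumably the constant-vs-constant comparison was meant to gate on the
current nslabs; a guess at the intended shape, mirroring the suggestion
above but returning the error since this is the late-init path:

	nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE);
	if (nslabs <= IO_TLB_MIN_SLABS)
		return rc;
	goto retry;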

Patches 13 and 14 look good.


-boris




+   goto retry;
+   }
  
  	if (order != get_order(bytes)) {

pr_warn("only able to allocate %ld MB\n",



Re: [PATCH v2 5/8] iommu: Add PASID support for DMA mapping API users

2022-03-15 Thread Jason Gunthorpe via iommu
On Tue, Mar 15, 2022 at 09:38:10AM -0700, Jacob Pan wrote:
> > > +int iommu_enable_pasid_dma(struct device *dev, ioasid_t *pasid)
> > > +{
> > > + struct iommu_domain *dom;
> > > + ioasid_t id, max;
> > > + int ret;
> > > +
> > > + dom = iommu_get_domain_for_dev(dev);
> > > + if (!dom || !dom->ops || !dom->ops->attach_dev_pasid)
> > > + return -ENODEV;  
> > 
> > Given the purpose of this API I think it should assert that the device
> > has the DMA API in-use using the machinery from the other series.
> > 
> > ie this should not be used to clone non-DMA API iommu_domains..
> > 
> Let me try to confirm the specific here. I should check domain type and
> rejects all others except IOMMU_DOMAIN_DMA type, right? Should also allow
> IOMMU_DOMAIN_IDENTITY.
> 
> That makes sense.

That is one way, the other is to check the new group data that Lu's
patch added.

Jason
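A minimal sketch of the check being discussed, assuming the end result
rejects everything except the two DMA API domain types:

	dom = iommu_get_domain_for_dev(dev);
	if (!dom || !dom->ops || !dom->ops->attach_dev_pasid)
		return -ENODEV;
	/* Only iommu_domains owned by the DMA API may be cloned here. */
	if (dom->type != IOMMU_DOMAIN_DMA && dom->type != IOMMU_DOMAIN_IDENTITY)
		return -EINVAL;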


Re: [PATCH v2 3/8] iommu/vt-d: Implement device_pasid domain attach ops

2022-03-15 Thread Jason Gunthorpe via iommu
On Tue, Mar 15, 2022 at 03:36:20PM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Tue, 15 Mar 2022 11:33:22 -0300, Jason Gunthorpe  wrote:
> 
> > On Mon, Mar 14, 2022 at 10:07:07PM -0700, Jacob Pan wrote:
> > > +	/*
> > > +	 * Each domain could have multiple devices attached with shared or per
> > > +	 * device PASIDs. At the domain level, we keep track of unique PASIDs and
> > > +	 * device user count.
> > > +	 * E.g. If a domain has two devices attached, device A has PASID 0, 1;
> > > +	 * device B has PASID 0, 2. Then the domain would have PASID 0, 1, 2.
> > > +	 */
> > 
> > A 2d array of xarray's seems like a poor data structure for this task.
> > 
> > AFAICT this wants to store a list of (device, pasid) tuples, so a
> > simple linked list, 1d xarray vector or a red black tree seems more
> > appropriate..
> > 
> Agreed.
> It might need some surgery for dmar_domain and device_domain_info, which
> already has a simple device list. I am trying to leverage the existing data
> struct, let me take a closer look.

Maybe the core code should provide this data structure in the
iommu_domain.

Figuring out what stuff is attached is something every driver has to
do right?

Jason
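A sketch of the flat (device, pasid) tracking being suggested, assuming core
code owned it; the structure and the attachments list field are hypothetical:

	/* hypothetical core-code bookkeeping, one entry per attachment */
	struct iommu_attach_entry {
		struct list_head link;
		struct device *dev;
		ioasid_t pasid;
	};

	static int iommu_domain_track_attach(struct iommu_domain *domain,
					     struct device *dev, ioasid_t pasid)
	{
		struct iommu_attach_entry *entry = kzalloc(sizeof(*entry), GFP_KERNEL);

		if (!entry)
			return -ENOMEM;
		entry->dev = dev;
		entry->pasid = pasid;
		list_add(&entry->link, &domain->attachments);	/* hypothetical field */
		return 0;
	}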


Re: [PATCH v2 3/8] iommu/vt-d: Implement device_pasid domain attach ops

2022-03-15 Thread Jacob Pan
Hi Jason,

On Tue, 15 Mar 2022 11:33:22 -0300, Jason Gunthorpe  wrote:

> On Mon, Mar 14, 2022 at 10:07:07PM -0700, Jacob Pan wrote:
> > +	/*
> > +	 * Each domain could have multiple devices attached with shared or per
> > +	 * device PASIDs. At the domain level, we keep track of unique PASIDs and
> > +	 * device user count.
> > +	 * E.g. If a domain has two devices attached, device A has PASID 0, 1;
> > +	 * device B has PASID 0, 2. Then the domain would have PASID 0, 1, 2.
> > +	 */
> 
> A 2d array of xarray's seems like a poor data structure for this task.
> 
> AFAICT this wants to store a list of (device, pasid) tuples, so a
> simple linked list, 1d xarray vector or a red black tree seems more
> appropriate..
> 
Agreed.
It might need some surgery for dmar_domain and device_domain_info, which
already has a simple device list. I am trying to leverage the existing data
struct, let me take a closer look.

> > +	if (entry) {
> > +		pinfo = entry;
> > +	} else {
> > +		pinfo = kzalloc(sizeof(*pinfo), GFP_ATOMIC);
> > +		if (!pinfo)
> > +			return -ENOMEM;
> > +		pinfo->pasid = pasid;
> > +		/* Store the new PASID info in the per domain array */
> > +		ret = xa_err(__xa_store(&dmar_domain->pasids, pasid, pinfo,
> > +					GFP_ATOMIC));
> > +		if (ret)
> > +			goto xa_store_err;
> > +	}
> > +	/* Store PASID in per device-domain array, this is for tracking devTLB */
> > +	ret = xa_err(xa_store(&info->pasids, pasid, pinfo, GFP_ATOMIC));
> > +	if (ret)
> > +		goto xa_store_err;
> > +
> > +	atomic_inc(&pinfo->users);
> > +	xa_unlock(&dmar_domain->pasids);
> > +
> > +	return 0;
> > +
> > +xa_store_err:
> > +	xa_unlock(&dmar_domain->pasids);
> > +	spin_lock_irqsave(&iommu->lock, flags);
> > +	intel_pasid_tear_down_entry(iommu, dev, pasid, false);
> > +	spin_unlock_irqrestore(&iommu->lock, flags);
> > +
> > +	if (!atomic_read(&pinfo->users)) {
> > +		__xa_erase(&dmar_domain->pasids, pasid);
> 
> This isn't locked right
> 
good catch! need to move under xa_unlock.

Thanks,

Jacob


Re: [PATCH v2 3/8] iommu/vt-d: Implement device_pasid domain attach ops

2022-03-15 Thread Jacob Pan
Hi Kevin,

On Tue, 15 Mar 2022 10:33:08 +0000, "Tian, Kevin" wrote:

> > From: Jacob Pan 
> > Sent: Tuesday, March 15, 2022 1:07 PM
> > 
> > On VT-d platforms with scalable mode enabled, devices issue DMA requests
> > with PASID need to attach to the correct IOMMU domains.
> > The attach operation involves the following:
> > - programming the PASID into device's PASID table
> > - tracking device domain and the PASID relationship
> > - managing IOTLB and device TLB invalidations
> > 
> > This patch extends DMAR domain and device domain info with xarrays to
> > track per domain and per device PASIDs.  It provides the flexibility to
> > be used beyond DMA API PASID support.
> > 
> > Signed-off-by: Lu Baolu 
> > Signed-off-by: Jacob Pan 
> > ---
> >  drivers/iommu/intel/iommu.c | 194
> > +++-
> >  include/linux/intel-iommu.h |  12 ++-
> >  2 files changed, 202 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index 881f8361eca2..9267194eaed3 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -1622,20 +1622,48 @@ static void __iommu_flush_dev_iotlb(struct
> > device_domain_info *info,
> >qdep, addr, mask);
> >  }
> > 
> > +static void __iommu_flush_dev_piotlb(struct device_domain_info *info,  
> 
> piotlb is confusing, better be:
> 
>   __iommu_flush_dev_iotlb_pasid()
> 
yeah, that is more clear.

> > +				     u64 address,
> > +				     ioasid_t pasid, unsigned int mask)
> > +{
> > +   u16 sid, qdep;
> > +
> > +   if (!info || !info->ats_enabled)
> > +   return;
> > +
> > +   sid = info->bus << 8 | info->devfn;
> > +   qdep = info->ats_qdep;
> > +	qi_flush_dev_iotlb_pasid(info->iommu, sid, info->pfsid,
> > +				 pasid, qdep, address, mask);
> > +}
> > +
> >  static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
> >   u64 addr, unsigned mask)
> >  {
> > unsigned long flags;
> > struct device_domain_info *info;
> > struct subdev_domain_info *sinfo;
> > +   unsigned long pasid;
> > +   struct pasid_info *pinfo;
> > 
> > if (!domain->has_iotlb_device)
> > return;
> > 
> > 	spin_lock_irqsave(&device_domain_lock, flags);
> > -	list_for_each_entry(info, &domain->devices, link)
> > -		__iommu_flush_dev_iotlb(info, addr, mask);
> > -
> > +	list_for_each_entry(info, &domain->devices, link) {
> > +		/*
> > +		 * We cannot use PASID based devTLB invalidation on RID2PASID
> > +		 * Device does not understand RID2PASID/0. This is different
> 
> Lack of a conjunction word between 'RID2PASID' and 'Device'.
> 
> and what is RID2PASID/0? It would be clearer to point out that RID2PASID
> is visible only within the iommu to mark out requests without PASID, 
> thus this PASID value should never be sent to the device side.
> 
Good point, will do.

> > +		 * than IOTLB invalidation where RID2PASID is also used for
> > +		 * tagging.
> 
> Then it would be obvious because IOTLB is an iommu-internal agent and thus
> makes use of RID2PASID for tagging.
> 
ditto

> > +		 */
> > +		xa_for_each(&info->pasids, pasid, pinfo) {
> > +			if (!pasid)
> 
> this should be compared to PASID_RID2PASID (though it's zero today).
> 
ditto

> > +				__iommu_flush_dev_iotlb(info, addr, mask);
> > +			else
> > +				__iommu_flush_dev_piotlb(info, addr, pasid, mask);
> > +   }
> > +   }
> > 	list_for_each_entry(sinfo, &domain->subdevices, link_domain) {
> > info = get_domain_info(sinfo->pdev);
> > __iommu_flush_dev_iotlb(info, addr, mask);  
> 
> Thanks
> Kevin


Thanks,

Jacob
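Folding Kevin's two code comments together, the flush loop would presumably
end up looking something like this sketch (helper renamed per Kevin's
suggestion):

	xa_for_each(&info->pasids, pasid, pinfo) {
		if (pasid == PASID_RID2PASID)
			/* RID2PASID is iommu-internal; flush the devTLB by RID */
			__iommu_flush_dev_iotlb(info, addr, mask);
		else
			__iommu_flush_dev_iotlb_pasid(info, addr, pasid, mask);
	}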




Re: [PATCH v2 5/8] iommu: Add PASID support for DMA mapping API users

2022-03-15 Thread Jacob Pan
Hi Jason,

On Tue, 15 Mar 2022 14:05:07 -0300, Jason Gunthorpe  wrote:

> On Tue, Mar 15, 2022 at 09:31:35AM -0700, Jacob Pan wrote:
> 
> > > IMHO it is a device mis-design of IDXD to require all DMA be PASID
> > > tagged. Devices should be able to do DMA on their RID when the PCI  
> 
> > IDXD can do DMA w/ RID, the PASID requirement is only for shared WQ
> > where ENQCMDS is used. ENQCMDS has the benefit of avoiding locking
> > where work submission is done from multiple CPUs.
> > Tony, Dave?  
> 
> This is what I mean, it has an operating mode you want to use from the
> kernel driver that cannot do RID DMA. It is a HW mis-design, IMHO.
> 
> Something like PASID0 in the ENQCMDS should have triggered RID DMA.
> 
That would simplify things a lot, it would be just a device change I think.
From IA pov, only ENQCMD will #GP if PASID==0. I will bring this back to HW
team to consider for future generations.

> > > In any case I think we are better to wait for an actual user for multi
> > > DMA API iommu_domains to come forward before we try to build an API
> > > for it.  
> > 
> > What would you recommend in the interim?  
> 
> Oh, I mean this approach at a high level is fine - I was saying we
> shouldn't try to broaden it like Robin was suggesting without a driver
> that needs multiple iommu_domains for the DMA API.
> 
Got it. Thanks for the clarification.

> Jason


Thanks,

Jacob


Re: [PATCH 2/2] thunderbolt: Use pre-boot DMA protection on AMD systems

2022-03-15 Thread Limonciello, Mario via iommu

+ Christian Kellner (Bolt userspace maintainer)

On 3/15/2022 13:07, Robin Murphy wrote:

On 2022-03-15 16:54, Limonciello, Mario via iommu wrote:

[Public]



On Tue, Mar 15, 2022 at 11:24:55AM -0500, Mario Limonciello wrote:

- * handled natively using IOMMU. It is enabled when IOMMU is
- * enabled and ACPI DMAR table has DMAR_PLATFORM_OPT_IN set.
+ * handled natively using IOMMU. It is enabled when the IOMMU is
+ * enabled and either:
+ * ACPI DMAR table has DMAR_PLATFORM_OPT_IN set
+ * or
+ * ACPI IVRS table has DMA_REMAP bit set
   */
 	return sprintf(buf, "%d\n",
-		       iommu_present(&pci_bus_type) && dmar_platform_optin());
+		       iommu_present(&pci_bus_type) &&
+		       (dmar_platform_optin() || amd_ivrs_remap_support()));


Yikes.  No, the thunderbolt code does not have any business poking into
either dmar_platform_optin or amd_ivrs_remap_support.  This needs
a proper abstraction from the IOMMU code.


To make sure I follow your ask - it's to make a new generic iommu
function that would check dmar/ivrs, and switch out thunderbolt domain.c
to use the symbol?

I'm happy to rework that if that is what you want.
Do you have a preferred proposed function name for that?


But why? Either IOMMU translation is enabled or it isn't, and if it is, 
what's to gain from guessing at *why* it might have been? And even if 
the IOMMU's firmware table did tell the IOMMU driver to enable the 
IOMMU, why should that be Thunderbolt's business?
A lot of this comes from baggage from early Thunderbolt 3 implementation 
on systems with ICM (Intel's FW CM). On those systems there was a 
concept of "Security Levels".  This meant that downstream PCIe devices 
were not automatically authorized when a TBT3 device was plugged in.  In 
those cases there was no guarantee that the IOMMU was in use and so the 
security was passed on to the user to make a decision.


In Linux this was accomplished using the 'authorized' attribute in 
/sys/bus/thunderbolt/devices/$NUM/authorized.  When this was set to 1 
then the TBT3 device and PCIe topology behind it would be enumerated.


Further documentation explaining how this works is available here:
https://www.kernel.org/doc/html/latest/admin-guide/thunderbolt.html#security-levels-and-how-to-use-them

(Intel based) Platforms from 2018+ w/ TBT3 started to use the IOMMU 
consistently at runtime but had this existing implementation of security 
levels to worry about.  Furthermore tunnels could be created pre-boot, 
and so the thunderbolt driver may or may not re-create them based on policy.


So a new attribute was created "iommu_dma_protection" that userspace 
could use as part of a policy decision to automatically authorize 
devices.  Exporting this attribute is very similar to what Microsoft 
does to let the user see the security of the system.


https://docs.microsoft.com/en-us/windows-hardware/design/device-experiences/oem-kernel-dma-protection

In Linux today some userspace software "bolt" has a policy included by
default that will automatically authorize TBT3 and USB4 (w/ PCIe) 
devices when iommu_dma_protection is set to 1.




Furthermore, looking at patch #1 I can only conclude that this is 
entirely meaningless anyway. AFAICS it's literally reporting whether the 
firmware flag was set or not. Not whether it's actually been honoured 
and the IOMMU is enforcing any kind of DMA protection at all. Even on 
Intel where the flag does at least have some effect on the IOMMU driver, 
that can still be overridden.


Take a look at the Microsoft link I shared above.  They also make policy
decisions based on the information in these tables.



I already have a patch refactoring this to get rid of iommu_present(), 
but at the time I wasn't looking too closely at what it's trying to *do* 
with the information. If it's supposed to accurately reflect whether the 
Thunderbolt device is subject to IOMMU translation and not bypassed, I 
can fix that too (and unexport dmar_platform_optin() in the process...)


Robin.


This patch series stems from that history.  To give the best experience 
to end users you want hotplugged devices to be automatically authorized 
when software says it's safe to do so.


To summarize the flow:
* User plugs in device
* USB4 CM will query supported tunnels
* USB4 CM will create devices in /sys/bus/thunderbolt/devices for new 
plugged in TBT3/USB4 device
* "authorized" attribute will default to "0" and PCIe tunnels are not 
created

* Userspace gets a uevent that the device was added
* Userspace (bolt) reacts by reading 
/sys/bus/thunderbolt/devices/domainX/iommu_dma_protection
* If that is set to "1", bolt will write "1" to "authorized"  and USB4 
CM will create PCIe tunnels
* If that is set to "0", bolt will send an event to GUI to show a popup 
asking to authorize the device
* After user acks the authorization then it will write "1" to 
"authorized" and USB4 CM will create PCIe tunnels



Mika,

I wonder if maybe 

Re: [PATCH 2/2] thunderbolt: Use pre-boot DMA protection on AMD systems

2022-03-15 Thread Robin Murphy

On 2022-03-15 16:54, Limonciello, Mario via iommu wrote:

[Public]



On Tue, Mar 15, 2022 at 11:24:55AM -0500, Mario Limonciello wrote:

-* handled natively using IOMMU. It is enabled when IOMMU is
-* enabled and ACPI DMAR table has DMAR_PLATFORM_OPT_IN set.
+* handled natively using IOMMU. It is enabled when the IOMMU is
+* enabled and either:
+* ACPI DMAR table has DMAR_PLATFORM_OPT_IN set
+* or
+* ACPI IVRS table has DMA_REMAP bit set
 */
 	return sprintf(buf, "%d\n",
-		       iommu_present(&pci_bus_type) && dmar_platform_optin());
+		       iommu_present(&pci_bus_type) &&
+		       (dmar_platform_optin() || amd_ivrs_remap_support()));


Yikes.  No, the thunderbolt code does not have any business poking into
either dmar_platform_optin or amd_ivrs_remap_support.  This needs
a proper abstraction from the IOMMU code.


To make sure I follow your ask - it's to make a new generic iommu function
that would check dmar/ivrs, and switch out thunderbolt domain.c to use the
symbol?

I'm happy to rework that if that is what you want.
Do you have a preferred proposed function name for that?


But why? Either IOMMU translation is enabled or it isn't, and if it is, 
what's to gain from guessing at *why* it might have been? And even if 
the IOMMU's firmware table did tell the IOMMU driver to enable the 
IOMMU, why should that be Thunderbolt's business?


Furthermore, looking at patch #1 I can only conclude that this is 
entirely meaningless anyway. AFAICS it's literally reporting whether the 
firmware flag was set or not. Not whether it's actually been honoured 
and the IOMMU is enforcing any kind of DMA protection at all. Even on 
Intel where the flag does at least have some effect on the IOMMU driver, 
that can still be overridden.


I already have a patch refactoring this to get rid of iommu_present(), 
but at the time I wasn't looking too closely at what it's trying to *do* 
with the information. If it's supposed to accurately reflect whether the 
Thunderbolt device is subject to IOMMU translation and not bypassed, I 
can fix that too (and unexport dmar_platform_optin() in the process...)


Robin.


Re: [PATCH v8 00/11] ACPI/IORT: Support for IORT RMR node

2022-03-15 Thread Eric Auger
Hi Shameer,
On 2/21/22 4:43 PM, Shameer Kolothum wrote:
> Hi,
>
> Since we now have an updated verion[0] of IORT spec(E.d) which
> addresses the memory attributes issues discussed here [1],
> this series now make use of it.
>
> The pull request for ACPICA E.d related changes are already
> raised and can be found here,
> https://github.com/acpica/acpica/pull/752
>
> v7 --> v8
>   - Patch #1 has temp definitions for RMR related changes till
> the ACPICA header changes are part of kernel.
>   - No early parsing of RMR node info and is only parsed at the
> time of use.
>   - Changes to the RMR get/put API format compared to the
> previous version.
>   - Support for RMR descriptor shared by multiple stream IDs.
>
> Please take a look and let me know your thoughts.
>
> Thanks,
> Shameer
> [0] https://developer.arm.com/documentation/den0049/ed/
> [1] https://lore.kernel.org/linux-acpi/20210805160319.GB23085@lpieralisi/
>
> From old:
> We have faced issues with 3408iMR RAID controller cards which
> fail to boot when SMMU is enabled. This is because these
> controllers make use of host memory for various caching related
> purposes and when SMMU is enabled the iMR firmware fails to
> access these memory regions as there is no mapping for them.
> IORT RMR provides a way for UEFI to describe and report these
> memory regions so that the kernel can make a unity mapping for
> these in SMMU.
>
> Change History:
>
> v6 --> v7
>  -fix pointed out by Steve to the SMMUv2 SMR bypass install in patch #8.
>
> v5 --> v6
> - Addressed comments from Robin & Lorenzo.
>   : Moved iort_parse_rmr() to acpi_iort_init() from
> iort_init_platform_devices().
>   : Removed use of struct iort_rmr_entry during the initial
> parse. Using struct iommu_resv_region instead.
>   : Report RMR address alignment and overlap errors, but continue.
>   : Reworked arm_smmu_init_bypass_stes() (patch # 6).
> - Updated SMMUv2 bypass SMR code. Thanks to Jon N (patch #8).
> - Set IOMMU protection flags(IOMMU_CACHE, IOMMU_MMIO) based
>   on Type of RMR region. Suggested by Jon N.
>
> v4 --> v5
>  -Added a fw_data union to struct iommu_resv_region and removed
>   struct iommu_rmr (Based on comments from Joerg/Robin).
>  -Added iommu_put_rmrs() to release mem.
>  -Thanks to Steve for verifying on SMMUv2, but not added the Tested-by
>   yet because of the above changes.
>
> v3 -->v4
> -Included the SMMUv2 SMR bypass install changes suggested by
>  Steve(patch #7)
> -As per Robin's comments, RMR reserve implementation is now
>  more generic  (patch #8) and dropped v3 patches 8 and 10.
> -Rebase to 5.13-rc1
>
> RFC v2 --> v3
>  -Dropped RFC tag as the ACPICA header changes are now ready to be
>   part of 5.13[0]. But this series still has a dependency on that patch.
>  -Added IORT E.b related changes(node flags, _DSM function 5 checks for
>   PCIe).
>  -Changed RMR to stream id mapping from M:N to M:1 as per the spec and
>   discussion here[1].
>  -Last two patches add support for SMMUv2(Thanks to Jon Nettleton!)
>
> Jon Nettleton (1):
>   iommu/arm-smmu: Get associated RMR info and install bypass SMR
>
> Shameer Kolothum (10):
>   ACPI/IORT: Add temporary RMR node flag definitions
>   iommu: Introduce a union to struct iommu_resv_region
>   ACPI/IORT: Add helper functions to parse RMR nodes
>   iommu/dma: Introduce generic helper to retrieve RMR info
>   ACPI/IORT: Add a helper to retrieve RMR memory regions
>   iommu/arm-smmu-v3: Introduce strtab init helper
>   iommu/arm-smmu-v3: Refactor arm_smmu_init_bypass_stes() to force
> bypass
>   iommu/arm-smmu-v3: Get associated RMR info and install bypass STE
>   iommu/arm-smmu-v3: Reserve any RMR regions associated with a dev
>   iommu/arm-smmu: Reserve any RMR regions associated with a dev
fyi, the last 2 patches have conflicts with
[PATCH v4 9/9] iommu: Split struct iommu_ops
which was applied on core branch.

Thanks

Eric
>
>  drivers/acpi/arm64/iort.c   | 305 
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  91 --
>  drivers/iommu/arm/arm-smmu/arm-smmu.c   |  65 -
>  drivers/iommu/dma-iommu.c   |  25 ++
>  include/linux/acpi_iort.h   |  14 +
>  include/linux/dma-iommu.h   |  14 +
>  include/linux/iommu.h   |   9 +
>  7 files changed, 504 insertions(+), 19 deletions(-)
>



Re: [PATCH v4 15/32] vfio: introduce KVM-owned IOMMU type

2022-03-15 Thread Jason Gunthorpe via iommu
On Tue, Mar 15, 2022 at 12:29:02PM -0400, Matthew Rosato wrote:
> On 3/15/22 10:38 AM, Jason Gunthorpe wrote:
> > On Tue, Mar 15, 2022 at 09:49:01AM -0400, Matthew Rosato wrote:
> > 
> > > The rationale for splitting steps 1 and 2 are that VFIO_SET_IOMMU doesn't
> > > have a mechanism for specifying more than the type as an arg, no?  
> > > Otherwise
> > > yes, you could specify a kvm fd at this point and it would have some other
> > > advantages (e.g. skip notifier).  But we still can't use the IOMMU for
> > > mapping until step 3.
> > 
> > Stuff like this is why I'd be much happier if this could join our
iommufd project so we can have clean modeling of the multiple iommu_domains.
> > 
> 
> I'd certainly be willing to collaborate so feel free to loop me in on the
> discussions; 

Sure, I have you on my list. I've been waiting for Eric to get a bit
further along on his ARM work so you have something appropriate to
look at.

In the mean time you can certainly work out the driver details as
you've been doing here and hacking through VFIO. The iommu_domain
logic is the big work item in this series, not the integration with
the uAPI.

> but I got the impression that iommufd is not close to ready (maybe
> I'm wrong?)

Well, quite alot has been done already and I think we are getting
close to having something that can start moving forward, but yes it
will not be "tomorrow".

> -- if so I really don't want to completely delay this zPCI support
> behind it as it has a significant benefit for kvm guests on s390x :(

Every platform vendor is trying to do this exact same performance
optimization. s390 is doing it a bit different, but as we've seen, not
fundamentally so compared to ARM and Intel IOMMUs with HW nesting.

I'm not sure that s390 should jump the queue and hacky hacky uAPIs all
over the place. ARM platform has been working on this for literal
years now.

Jason


Re: [PATCH v4 15/32] vfio: introduce KVM-owned IOMMU type

2022-03-15 Thread Jason Gunthorpe via iommu
On Tue, Mar 15, 2022 at 12:04:35PM -0400, Matthew Rosato wrote:

> > You can't pin/unpin in this path, there is no real way to handle error
> > and ulimit stuff here, plus it is really slow. I didn't notice any of
> > this in your patches, so what do you mean by 'pin' above?
> 
> patch 18 does some symbol_get for gfn_to_page (which will drive hva_to_pfn
> under the covers) and kvm_release_pfn_dirty and uses those symbols for
> pin/unpin.

To be very clear, this is quite wrong.

It does not account for the memory pinned by the guest and a hostile
VM could pin more memory than the KVM process is allowed to - which is
a security hole.

It also uses the wrong kind of pin, DMA pinned pages must be
pin_user_page'd not get_page'd and undone with unpin_user_page() and
not put_page(). This allows the guest to pin ZONE_MOVABLE memory and
other things which cannot be DMA'd to which will break the hypervisor
kernel. See David Hildenbrand's various series on COW bugs for an
overview why this is really security bad.

If you want to do dynamic pinning that is a different thing and
requires more CPU work in the shadowing operations. The modeling would
be similar except that the 1st stage iommu_domain would be this
'shared with KVM' domain people have been talking about - ie the page
table is not set with type 1 map/unmap but follows the KVM page table and
here it would be appropriate to use gfn_to_page/etc to access it.

However, if you do that then you do still have to take care of the
ulimit checks and you must teach kvm to use unpin_user_page/_dirty()
and related to be correct. This looks like a pretty big deal.

My advice is to start with the fully pinned case I described and
consider a KVM approach down the road.

[Also, I'm quite excited by this series you have, I think it shows
exactly how to fix POWER to work within the modern iommu framework,
they have the same basic problem]

Jason
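For reference, the distinction being drawn, as a hedged sketch against the
5.17-era GUP API: long-term DMA pins go through the FOLL_LONGTERM pin path,
which migrates pages out of ZONE_MOVABLE/CMA before pinning, and are
released with the matching unpin helper.

	/* given a guest/user address uaddr to be DMA-mapped long term */
	struct page *page;
	int rc;

	/* long-term DMA pin: not get_page()/get_user_pages() */
	rc = pin_user_pages_fast(uaddr, 1, FOLL_WRITE | FOLL_LONGTERM, &page);
	if (rc != 1)
		return rc < 0 ? rc : -EFAULT;

	/* ... program the IOMMU / run the DMA ... */

	unpin_user_pages_dirty_lock(&page, 1, true);	/* not put_page() */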


Re: [PATCH v2 5/8] iommu: Add PASID support for DMA mapping API users

2022-03-15 Thread Jason Gunthorpe via iommu
On Tue, Mar 15, 2022 at 09:31:35AM -0700, Jacob Pan wrote:

> > IMHO it is a device mis-design of IDXD to require all DMA be PASID
> > tagged. Devices should be able to do DMA on their RID when the PCI

> IDXD can do DMA w/ RID, the PASID requirement is only for shared WQ where
> ENQCMDS is used. ENQCMDS has the benefit of avoiding locking where work
> submission is done from multiple CPUs.
> Tony, Dave?

This is what I mean, it has an operating mode you want to use from the
kernel driver that cannot do RID DMA. It is a HW mis-design, IMHO.

Something like PASID0 in the ENQCMDS should have triggered RID DMA.

> > In any case I think we are better to wait for an actual user for multi
> > DMA API iommu_domains to come forward before we try to build an API
> > for it.
> 
> What would you recommend in the interim?

Oh, I mean this approach at a high level is fine - I was saying we
shouldn't try to broaden it like Robin was suggesting without a driver
that needs multiple iommu_domains for the DMA API.

Jason


Re: [PATCH v4 15/32] vfio: introduce KVM-owned IOMMU type

2022-03-15 Thread Matthew Rosato

On 3/15/22 10:17 AM, Matthew Rosato wrote:

On 3/15/22 3:57 AM, Tian, Kevin wrote:

From: Jason Gunthorpe 
Sent: Tuesday, March 15, 2022 7:18 AM

On Mon, Mar 14, 2022 at 04:50:33PM -0600, Alex Williamson wrote:


+/*
+ * The KVM_IOMMU type implies that the hypervisor will control the mappings
+ * rather than userspace
+ */
+#define VFIO_KVM_IOMMU		11


Then why is this hosted in the type1 code that exposes a wide variety
of userspace interfaces?  Thanks,


It is really badly named, this is the root level of a 2 stage nested
IO page table, and this approach needed a special flag to distinguish
the setup from the normal iommu_domain.

If we do try to stick this into VFIO it should probably use the
VFIO_TYPE1_NESTING_IOMMU instead - however, we would like to delete
that flag entirely as it was never fully implemented, was never used,
and isn't part of what we are proposing for IOMMU nesting on ARM
anyhow. (So far I've found nobody to explain what the plan here was..)

This is why I said the second level should be an explicit iommu_domain
all on its own that is explicitly coupled to the KVM to read the page
tables, if necessary.

But I'm not sure that reading the userspace io page tables with KVM is
even the best thing to do - the iommu driver already has the pinned
memory, it would be faster and more modular to traverse the io page
tables through the pfns in the root iommu_domain than by having KVM do
the translations. Let's see what Matthew says..



Reading this thread it's sort of like an optimization to software 
nesting.


Yes, we want to avoid breaking to userspace for a very frequent 
operation (RPCIT / updating shadow mappings)



If that is the case does it make more sense to complete the basic form
of software nesting first and then add this optimization?

The basic form would allow the userspace to create a special domain
type which points to a user/guest page table (like hardware nesting)
but doesn't install the user page table to the IOMMU hardware (unlike
hardware nesting). When receiving invalidate cmd from userspace, the
iommu driver walks the user page table (1st-level) and the parent

page table (2nd-level) to generate a shadow mapping for the
invalidated range in the non-nested hardware page table of this
special domain type.

Once that works what this series does just changes the matter of
how the invalidate cmd is triggered. Previously iommu driver receives
invalidate cmd from Qemu (via iommufd uAPI) while now receiving
the cmd from kvm (via iommufd kAPI) upon interception of RPCIT.
From this angle once the connection between iommufd and kvm fd
is established there is even no direct talk between iommu driver and
kvm.


But something somewhere still needs to be responsible for 
pinning/unpinning of the guest table entries upon each RPCIT 
interception.  e.g. the RPCIT intercept can happen because the guest 
wants to invalidate some old mappings or has generated some new mappings 
over a range, so we must shadow the new mappings (by pinning the guest 
entries and placing them in the host hardware table / unpinning 
invalidated ones and clearing their entry in the host hardware table).




OK, this got clarified by Jason in another thread: What I was missing 
here was an assumption that the 1st-level has already mapped and pinned 
all of guest physical address space; in that case there's no need to 
invoke pin/unpin operations against a kvm from within the iommu domain 
(this series as-is does not pin all of the guest physical address space; 
it pins/unpins on demand at RPCIT time)


RE: [PATCH 2/2] thunderbolt: Use pre-boot DMA protection on AMD systems

2022-03-15 Thread Limonciello, Mario via iommu
[Public]


> On Tue, Mar 15, 2022 at 11:24:55AM -0500, Mario Limonciello wrote:
> > -* handled natively using IOMMU. It is enabled when IOMMU is
> > -* enabled and ACPI DMAR table has DMAR_PLATFORM_OPT_IN set.
> > +* handled natively using IOMMU. It is enabled when the IOMMU is
> > +* enabled and either:
> > +* ACPI DMAR table has DMAR_PLATFORM_OPT_IN set
> > +* or
> > +* ACPI IVRS table has DMA_REMAP bit set
> >  */
> > 	return sprintf(buf, "%d\n",
> > -		       iommu_present(&pci_bus_type) && dmar_platform_optin());
> > +		       iommu_present(&pci_bus_type) &&
> > +		       (dmar_platform_optin() || amd_ivrs_remap_support()));
> 
> Yikes.  No, the thunderbolt code does not have any business poking into
> either dmar_platform_optin or amd_ivrs_remap_support.  This needs
> a proper abstraction from the IOMMU code.

To make sure I follow your ask - it's to make a new generic iommu function
that would check dmar/ivrs, and switch out thunderbolt domain.c to use the
symbol?

I'm happy to rework that if that is what you want.
Do you have a preferred proposed function name for that?


Re: [PATCH 2/2] thunderbolt: Use pre-boot DMA protection on AMD systems

2022-03-15 Thread Christoph Hellwig
On Tue, Mar 15, 2022 at 11:24:55AM -0500, Mario Limonciello wrote:
> -  * handled natively using IOMMU. It is enabled when IOMMU is
> -  * enabled and ACPI DMAR table has DMAR_PLATFORM_OPT_IN set.
> +  * handled natively using IOMMU. It is enabled when the IOMMU is
> +  * enabled and either:
> +  * ACPI DMAR table has DMAR_PLATFORM_OPT_IN set
> +  * or
> +  * ACPI IVRS table has DMA_REMAP bit set
>*/
>   return sprintf(buf, "%d\n",
> -		       iommu_present(&pci_bus_type) && dmar_platform_optin());
> +		       iommu_present(&pci_bus_type) &&
> +		       (dmar_platform_optin() || amd_ivrs_remap_support()));

Yikes.  No, the thunderbolt code does not have any business poking into
either dmar_platform_optin or amd_ivrs_remap_support.  This needs
a proper abstraction from the IOMMU code.


Re: [PATCH v4 22/32] KVM: s390: pci: routines for (dis)associating zPCI devices with a KVM

2022-03-15 Thread Matthew Rosato

On 3/14/22 5:46 PM, Jason Gunthorpe wrote:

On Mon, Mar 14, 2022 at 03:44:41PM -0400, Matthew Rosato wrote:

+int kvm_s390_pci_zpci_start(struct kvm *kvm, struct zpci_dev *zdev)
+{
+   struct vfio_device *vdev;
+   struct pci_dev *pdev;
+   int rc;
+
+   rc = kvm_s390_pci_dev_open(zdev);
+   if (rc)
+   return rc;
+
+   pdev = pci_get_slot(zdev->zbus->bus, zdev->devfn);
+   if (!pdev) {
+   rc = -ENODEV;
+   goto exit_err;
+   }
+
+	vdev = get_vdev(&pdev->dev);
+   if (!vdev) {
+   pci_dev_put(pdev);
+   rc = -ENODEV;
+   goto exit_err;
+   }
+
+   zdev->kzdev->nb.notifier_call = kvm_s390_pci_group_notifier;
+
+   /*
+* At this point, a KVM should already be associated with this device,
+* so registering the notifier now should immediately trigger the
+* event.  We also want to know if the KVM association is later removed
+* to ensure proper cleanup happens.
+*/
+	rc = register_notifier(vdev->dev, &zdev->kzdev->nb);
+
+   put_vdev(vdev);
+   pci_dev_put(pdev);
+
+   /* Make sure the registered KVM matches the KVM issuing the ioctl */
+   if (rc || zdev->kzdev->kvm != kvm) {
+   rc = -ENODEV;
+   goto exit_err;
+   }
+
+   /* Must support KVM-managed IOMMU to proceed */
+   if (IS_ENABLED(CONFIG_S390_KVM_IOMMU))
+   rc = zpci_iommu_attach_kvm(zdev, kvm);
+   else
+   rc = -EINVAL;


This seems like kind of a strange API, shouldn't kvm be getting a
reference on the underlying iommu_domain and then calling into it to
get the mapping table instead of pushing KVM specific logic into the
iommu driver?

It would be nice if all the special kvm stuff could be more isolated in
kvm code.

I'm still a little unclear about why this is so complicated - can't
you get the iommu_domain from the group FD directly in KVM code as
power does?


Yeah, I think I could do something like that using the vfio group fd 
like power does.


Providing a reference to the kvm itself inside iommu was being used for 
the pin/unpin operations, which would not be necessary if we switched to 
the 1st layer iommu pinning all of guest memory.






[PATCH 2/2] thunderbolt: Use pre-boot DMA protection on AMD systems

2022-03-15 Thread Mario Limonciello via iommu
The IOMMU driver now exports whether pre-boot DMA protection has been
enabled on AMD systems.  Use this information to properly set
iommu_dma_protection.

Link: https://docs.microsoft.com/en-us/windows-hardware/design/device-experiences/oem-kernel-dma-protection
Link: https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
Signed-off-by: Mario Limonciello 
---
 drivers/thunderbolt/domain.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/thunderbolt/domain.c b/drivers/thunderbolt/domain.c
index 7018d959f775..e03790735c12 100644
--- a/drivers/thunderbolt/domain.c
+++ b/drivers/thunderbolt/domain.c
@@ -6,6 +6,7 @@
  * Author: Mika Westerberg 
  */
 
+#include <linux/amd-iommu.h>
 #include <linux/device.h>
 #include <linux/dmar.h>
 #include <linux/idr.h>
@@ -259,11 +260,15 @@ static ssize_t iommu_dma_protection_show(struct device *dev,
 {
/*
 * Kernel DMA protection is a feature where Thunderbolt security is
-* handled natively using IOMMU. It is enabled when IOMMU is
-* enabled and ACPI DMAR table has DMAR_PLATFORM_OPT_IN set.
+* handled natively using IOMMU. It is enabled when the IOMMU is
+* enabled and either:
+* ACPI DMAR table has DMAR_PLATFORM_OPT_IN set
+* or
+* ACPI IVRS table has DMA_REMAP bit set
 */
return sprintf(buf, "%d\n",
-  iommu_present(&pci_bus_type) && dmar_platform_optin());
+  iommu_present(&pci_bus_type) &&
+  (dmar_platform_optin() || amd_ivrs_remap_support()));
 }
 static DEVICE_ATTR_RO(iommu_dma_protection);
 
-- 
2.34.1



[PATCH 1/2] iommu/amd: Add support to indicate whether DMA remap support is enabled

2022-03-15 Thread Mario Limonciello via iommu
Bit 1 of the IVRS IVinfo field indicates that the IOMMU has been used for
pre-boot DMA protection.  Export this information to the kernel to
allow other drivers to use it.

Link: https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
Signed-off-by: Mario Limonciello 
---
 drivers/iommu/amd/amd_iommu_types.h |  1 +
 drivers/iommu/amd/init.c| 18 ++
 include/linux/amd-iommu.h   |  5 +
 3 files changed, 24 insertions(+)

diff --git a/drivers/iommu/amd/amd_iommu_types.h 
b/drivers/iommu/amd/amd_iommu_types.h
index 47108ed44fbb..68e51213f119 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -407,6 +407,7 @@
 /* IOMMU IVINFO */
 #define IOMMU_IVINFO_OFFSET 36
 #define IOMMU_IVINFO_EFRSUP BIT(0)
+#define IOMMU_IVINFO_DMA_REMAP  BIT(1)
 
 /* IOMMU Feature Reporting Field (for IVHD type 10h */
 #define IOMMU_FEAT_GASUP_SHIFT 6
diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 7bfe37e52e21..e9b669592dfc 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -182,6 +182,7 @@ u32 amd_iommu_max_pasid __read_mostly = ~0;
 
 bool amd_iommu_v2_present __read_mostly;
 static bool amd_iommu_pc_present __read_mostly;
+static bool amdr_ivrs_remap_support __read_mostly;
 
 bool amd_iommu_force_isolation __read_mostly;
 
@@ -326,6 +327,8 @@ static void __init early_iommu_features_init(struct 
amd_iommu *iommu,
 {
if (amd_iommu_ivinfo & IOMMU_IVINFO_EFRSUP)
iommu->features = h->efr_reg;
+   if (amd_iommu_ivinfo & IOMMU_IVINFO_DMA_REMAP)
+   amdr_ivrs_remap_support = true;
 }
 
 /* Access to l1 and l2 indexed register spaces */
@@ -3269,6 +3272,21 @@ struct amd_iommu *get_amd_iommu(unsigned int idx)
return NULL;
 }
 
+/*
+ * amd_ivrs_remap_support - Is %IOMMU_IVINFO_DMA_REMAP set in IVRS table
+ *
+ * Returns true if the platform has %IOMMU_IVINFO_DMA_REMAP set in the IOMMU
+ * IVRS IVInfo field.
+ * Presence of this flag indicates to the OS/HV that the IOMMU is used for
+ * Preboot DMA protection and device accessed memory should be remapped after
+ * the OS has loaded.
+ */
+bool amd_ivrs_remap_support(void)
+{
+   return amdr_ivrs_remap_support;
+}
+EXPORT_SYMBOL_GPL(amd_ivrs_remap_support);
+
/****************************************************************************
  *
  * IOMMU EFR Performance Counter support functionality. This code allows
diff --git a/include/linux/amd-iommu.h b/include/linux/amd-iommu.h
index 58e6c3806c09..d07b9fed6474 100644
--- a/include/linux/amd-iommu.h
+++ b/include/linux/amd-iommu.h
@@ -170,6 +170,7 @@ amd_iommu_update_ga(int cpu, bool is_run, void *data);
 
 extern int amd_iommu_activate_guest_mode(void *data);
 extern int amd_iommu_deactivate_guest_mode(void *data);
+extern bool amd_ivrs_remap_support(void);
 
 #else /* defined(CONFIG_AMD_IOMMU) && defined(CONFIG_IRQ_REMAP) */
 
@@ -194,6 +195,10 @@ static inline int amd_iommu_deactivate_guest_mode(void 
*data)
 {
return 0;
 }
+static inline bool amd_ivrs_remap_support(void)
+{
+   return false;
+}
 #endif /* defined(CONFIG_AMD_IOMMU) && defined(CONFIG_IRQ_REMAP) */
 
 int amd_iommu_get_num_iommus(void);
-- 
2.34.1



Re: [PATCH v2 5/8] iommu: Add PASID support for DMA mapping API users

2022-03-15 Thread Jacob Pan
Hi Jason,

On Tue, 15 Mar 2022 11:35:35 -0300, Jason Gunthorpe  wrote:

> On Mon, Mar 14, 2022 at 10:07:09PM -0700, Jacob Pan wrote:
> > DMA mapping API is the de facto standard for in-kernel DMA. It operates
> > on a per device/RID basis which is not PASID-aware.
> > 
> > For some modern devices, such as the Intel Data Streaming Accelerator, a
> > PASID is required for certain work submissions. To allow such devices to
> > use the DMA mapping API, we need the following functionalities:
> > 1. Provide device a way to retrieve a PASID for work submission within
> > the kernel
> > 2. Enable the kernel PASID on the IOMMU for the device
> > 3. Attach the kernel PASID to the device's default DMA domain, let it
> > be IOVA or physical address in case of pass-through.
> > 
> > This patch introduces a driver facing API that enables DMA API
> > PASID usage. Once enabled, device drivers can continue to use DMA APIs
> > as is. There is no difference in dma_handle between without PASID and
> > with PASID.
> > 
> > Signed-off-by: Jacob Pan 
> >  drivers/iommu/dma-iommu.c | 65 +++
> >  include/linux/dma-iommu.h |  7 +
> >  include/linux/iommu.h |  9 ++
> >  3 files changed, 81 insertions(+)
> > 
> > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> > index b22034975301..d0ff1a34b1b6 100644
> > +++ b/drivers/iommu/dma-iommu.c
> > @@ -39,6 +39,8 @@ enum iommu_dma_cookie_type {
> > IOMMU_DMA_MSI_COOKIE,
> >  };
> >  
> > +static DECLARE_IOASID_SET(iommu_dma_pasid);
> > +
> >  struct iommu_dma_cookie {
> > enum iommu_dma_cookie_type  type;
> > union {
> > @@ -370,6 +372,69 @@ void iommu_put_dma_cookie(struct iommu_domain
> > *domain) domain->iova_cookie = NULL;
> >  }
> >  
> > +/**
> > + * iommu_enable_pasid_dma --Enable in-kernel DMA request with PASID
> > + * @dev:   Device to be enabled
> > + *
> > + * DMA request with PASID will be mapped the same way as the legacy
> > DMA.
> > + * If the device is in pass-through, PASID will also pass-through. If
> > the
> > + * device is in IOVA map, the supervisor PASID will point to the same
> > IOVA
> > + * page table.
> > + *
> > + * @return the kernel PASID to be used for DMA or INVALID_IOASID on
> > failure
> > + */
> > +int iommu_enable_pasid_dma(struct device *dev, ioasid_t *pasid)
> > +{
> > +   struct iommu_domain *dom;
> > +   ioasid_t id, max;
> > +   int ret;
> > +
> > +   dom = iommu_get_domain_for_dev(dev);
> > +   if (!dom || !dom->ops || !dom->ops->attach_dev_pasid)
> > +   return -ENODEV;  
> 
> Given the purpose of this API I think it should assert that the device
> has the DMA API in-use using the machinery from the other series.
> 
> ie this should not be used to clone non-DMA API iommu_domains..
> 
Let me try to confirm the specifics here. I should check the domain type and
reject all others except the IOMMU_DOMAIN_DMA type, right? It should also
allow IOMMU_DOMAIN_IDENTITY.

That makes sense.
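
As a minimal sketch, assuming the check lives at the top of
iommu_enable_pasid_dma() (illustrative only, not the posted code):

static bool pasid_dma_domain_ok(struct iommu_domain *dom)
{
	/* only the DMA API domain types qualify; reject everything else */
	switch (dom->type) {
	case IOMMU_DOMAIN_DMA:
	case IOMMU_DOMAIN_IDENTITY:
		return true;
	default:
		return false;
	}
}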

> > diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
> > index 24607dc3c2ac..e6cb9b52a420 100644
> > +++ b/include/linux/dma-iommu.h
> > @@ -18,6 +18,13 @@ int iommu_get_dma_cookie(struct iommu_domain
> > *domain); int iommu_get_msi_cookie(struct iommu_domain *domain,
> > dma_addr_t base); void iommu_put_dma_cookie(struct iommu_domain
> > *domain); 
> > +/*
> > + * For devices that can do DMA request with PASID, setup a system
> > PASID.
> > + * Address modes (IOVA, PA) are selected by the platform code.
> > + */
> > +int iommu_enable_pasid_dma(struct device *dev, ioasid_t *pasid);
> > +void iommu_disable_pasid_dma(struct device *dev, ioasid_t pasid);  
> 
> The functions already have a kdoc, don't need two..
> 
Good point!

Thanks,

Jacob


Re: [PATCH v4 15/32] vfio: introduce KVM-owned IOMMU type

2022-03-15 Thread Matthew Rosato

On 3/15/22 10:38 AM, Jason Gunthorpe wrote:

On Tue, Mar 15, 2022 at 09:49:01AM -0400, Matthew Rosato wrote:


The rationale for splitting steps 1 and 2 is that VFIO_SET_IOMMU doesn't
have a mechanism for specifying more than the type as an arg, no?  Otherwise
yes, you could specify a kvm fd at this point and it would have some other
advantages (e.g. skip notifier).  But we still can't use the IOMMU for
mapping until step 3.


Stuff like this is why I'd be much happier if this could join our
iommufd project so we can have clean modeling of the multiple iommu_domains.



I'd certainly be willing to collaborate so feel free to loop me in on 
the discussions; but I got the impression that iommufd is not close to 
ready (maybe I'm wrong?) -- if so I really don't want to completely 
delay this zPCI support behind it as it has a significant benefit for 
kvm guests on s390x :(




Re: [PATCH v2 5/8] iommu: Add PASID support for DMA mapping API users

2022-03-15 Thread Jacob Pan
Hi Jason,

On Tue, 15 Mar 2022 11:22:16 -0300, Jason Gunthorpe  wrote:

> On Tue, Mar 15, 2022 at 11:16:41AM +, Robin Murphy wrote:
> > On 2022-03-15 05:07, Jacob Pan wrote:  
> > > DMA mapping API is the de facto standard for in-kernel DMA. It
> > > operates on a per device/RID basis which is not PASID-aware.
> > > 
> > > For some modern devices, such as the Intel Data Streaming Accelerator, a
> > > PASID is required for certain work submissions. To allow such devices to
> > > use the DMA mapping API, we need the following functionalities:
> > > 1. Provide device a way to retrieve a PASID for work submission within
> > > the kernel
> > > 2. Enable the kernel PASID on the IOMMU for the device
> > > 3. Attach the kernel PASID to the device's default DMA domain, let it
> > > be IOVA or physical address in case of pass-through.
> > > 
> > > This patch introduces a driver facing API that enables DMA API
> > > PASID usage. Once enabled, device drivers can continue to use DMA
> > > APIs as is. There is no difference in dma_handle between without
> > > PASID and with PASID.  
> > 
> > Surely the main point of PASIDs is to be able to use more than one
> > of them?  
> 
> IMHO, not for the DMA API.
> 
Right, but we really need two here: one for DMA requests w/o PASID (PASID 0)
and a kernel PASID for DMA requests tagged w/ PASID.
Since the DMA API is not per process, there is no need for more right now.

> I can't think of good reasons why a single in-kernel device should
> require more than one iommu_domain for use by the DMA API. Even with
> the SIOV cases we have been looking at we don't really see a use case
> for more than one DMA API iommu_domain on a single physical device.
> Do you know of something on the horizon?
> 
Not that I know.

> From my view the main point of PASIDs is to assign iommu_domains that
> are not used by the DMA API.
> 
Right, the DMA API defaults to PASID 0. But the IDXD device cannot use PASID 0
for enqcmds.

> IMHO it is a device mis-design of IDXD to require all DMA be PASID
> tagged. Devices should be able to do DMA on their RID when the PCI
IDXD can do DMA w/ RID; the PASID requirement is only for shared WQs, where
ENQCMDS is used. ENQCMDS has the benefit of avoiding locking when work
submission is done from multiple CPUs.
Tony, Dave?

> function is controlled by a kernel driver. I see this driver facing
> API as addressing a device quirk by aliasing the DMA API of the RID
> into a PASID and that is really all it is good for.
> 
> In any case I think we are better to wait for an actual user for multi
> DMA API iommu_domains to come forward before we try to build an API
> for it.
> 
What would you recommend in the interim?

Shall we let the VT-d driver set up a special global PASID for the DMA API?
Then the IDXD driver could retrieve it somehow. But that still needs an API
similar to what I did in the previous version, where PASID #1 was used.

Thanks,

Jacob


Re: [PATCH v2 2/8] iommu: Add attach/detach_dev_pasid domain ops

2022-03-15 Thread Jacob Pan
Hi Kevin,

On Tue, 15 Mar 2022 11:49:57 +, "Tian, Kevin" 
wrote:

> > From: Jean-Philippe Brucker 
> > Sent: Tuesday, March 15, 2022 7:27 PM
> > 
> > On Mon, Mar 14, 2022 at 10:07:06PM -0700, Jacob Pan wrote:  
> > > From: Lu Baolu 
> > >
> > > An IOMMU domain represents an address space which can be attached by
> > > devices that perform DMA within a domain. However, for platforms with
> > > PASID capability the domain attachment needs to be handled at
> > > device+PASID level. There can be multiple PASIDs within a device and
> > > multiple devices attached to a given domain.
> > > This patch introduces a new IOMMU op which supports device, PASID, and
> > > IOMMU domain attachment. The immediate use case is for PASID capable
> > > devices to perform DMA under DMA APIs.
> > >
> > > Signed-off-by: Lu Baolu 
> > > Signed-off-by: Jacob Pan 
> > > ---
> > >  include/linux/iommu.h | 6 ++
> > >  1 file changed, 6 insertions(+)
> > >
> > > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > > index 369f05c2a4e2..fde5b933dbe3 100644
> > > --- a/include/linux/iommu.h
> > > +++ b/include/linux/iommu.h
> > > @@ -227,6 +227,8 @@ struct iommu_iotlb_gather {
> > >   * @aux_get_pasid: get the pasid given an aux-domain
> > >   * @sva_bind: Bind process address space to device
> > >   * @sva_unbind: Unbind process address space from device
> > > + * @attach_dev_pasid: attach an iommu domain to a pasid of device
> > > + * @detach_dev_pasid: detach an iommu domain from a pasid of device  
> > 
> > Isn't that operation "assign a PASID to a domain" instead?  In patch 5,
> > the domain is already attached to the device, so set_domain_pasid()
> > might be clearer and to the point. If the IOMMU driver did the
> > allocation we could also avoid patch 1.
I agree, we could let the vendor driver do the allocation inside this op. On
the other hand, we could also keep the flexibility such that this op can be
used for guest PASID bind with an SVA domain.
> 
> iiuc this API can also work for future SIOV usage where each mdev attached
> to the domain has its own pasid. "assigning a PASID to a domain" sounds
> like going back to the previous aux domain approach which has one PASID
> per domain and that PASID is used on all devices attached to the aux
> domain...
> 
Yes, that is the intention. I plan to lift the requirement in patch 5 such
that device attachment will not be a prerequisite. That would come after mdev
adoption.

> > 
> > If I understand correctly this series is not about a generic PASID API
> > that allows drivers to manage multiple DMA address spaces, because there
> > still doesn't seem to be any interest in that. It's about the specific
> > IDXD use-case, so let's focus on that. We can introduce a specialized
> > call such as (iommu|dma)_set_device_pasid(), which will be easy to
> > consolidate later into a more generic "dma_enable_pasid()" API if that
> > ever seems useful.
> > 
Right, at the moment it is still a single address space, i.e. the current
domain of the device/group.

But this limitation is at the driver-facing API layer, not in the
IOMMU ops.

> > Thanks,
> > Jean
> >   
> > >   * @sva_get_pasid: Get PASID associated to a SVA handle
> > >   * @page_response: handle page request response
> > >   * @cache_invalidate: invalidate translation caches
> > > @@ -296,6 +298,10 @@ struct iommu_ops {
> > >   struct iommu_sva *(*sva_bind)(struct device *dev, struct
> > > mm_struct  
> > *mm,  
> > > void *drvdata);
> > >   void (*sva_unbind)(struct iommu_sva *handle);
> > > + int (*attach_dev_pasid)(struct iommu_domain *domain,
> > > + struct device *dev, ioasid_t id);
> > > + void (*detach_dev_pasid)(struct iommu_domain *domain,
> > > +  struct device *dev, ioasid_t id);
> > >   u32 (*sva_get_pasid)(struct iommu_sva *handle);
> > >
> > >   int (*page_response)(struct device *dev,
> > > --
> > > 2.25.1
> > >  


Thanks,

Jacob


Re: [PATCH v4 15/32] vfio: introduce KVM-owned IOMMU type

2022-03-15 Thread Matthew Rosato

On 3/15/22 10:55 AM, Jason Gunthorpe wrote:

On Tue, Mar 15, 2022 at 09:36:08AM -0400, Matthew Rosato wrote:

If we do try to stick this into VFIO it should probably use the
VFIO_TYPE1_NESTING_IOMMU instead - however, we would like to delete
that flag entirely as it was never fully implemented, was never used,
and isn't part of what we are proposing for IOMMU nesting on ARM
anyhow. (So far I've found nobody to explain what the plan here was..)



I'm open to suggestions on how better to tie this into vfio.  The scenario
basically plays out that:


Ideally I would like it to follow the same 'user space page table'
design that Eric and Kevin are working on for HW iommu.


'[RFC v16 0/9] SMMUv3 Nested Stage Setup (IOMMU part)' ??

https://lore.kernel.org/linux-iommu/20211027104428.1059740-1-eric.au...@redhat.com/



You have a 1st iommu_domain that maps and pins the entire guest physical
address space.


Ahh, I see.

@Christian would it be OK to pursue a model that pins all of guest 
memory upfront?




You have a nested iommu_domain that represents the user page table
(the ioat in your language I think)


Yes



When the guest says it wants to set a user page table then you create
the nested iommu_domain representing that user page table and pass in
the anchor (guest address of the root IOPTE) to the kernel to do the
work.
The rule for all other HW's is that the user space page table is
translated by the top level kernel page table. So when you traverse it
you fetch the CPU page storing the guest's IOPTE by doing an IOVA
translation through the first level page table - not through KVM.

Since the first level page table and the KVM GPA should be 1:1 this is
an equivalent operation.


1) the iommu will be domain_alloc'd once VFIO_SET_IOMMU is issued -- so at
that time (or earlier) we have to make the decision on whether to use the
standard IOMMU or this alternate KVM/nested IOMMU.


So in terms of iommufd, I would see this as an iommufd 'create a
device-specific iommu_domain' IOCTL, where you can pass in an S390-specific
data blob to make it into this special mode.


This is why I said the second level should be an explicit iommu_domain
all on its own that is explicitly coupled to the KVM to read the page
tables, if necessary.


Maybe I misunderstood this.  Are you proposing 2 layers of IOMMU that
interact with each other within host kernel space?

A second level runs the guest tables, pins the appropriate pieces from the
guest to get the resulting phys_addr(s) which are then passed via iommu to a
first level via map (or unmap)?



The first level iommu_domain has the 'type1' map and unmap and pins
the pages. This is the 1:1 map with the GPA and ends up pinning all
guest memory because the point is you don't want to take a memory pin
on your performance path.

The second level iommu_domain points to a single IO page table in GPA
and is created/destroyed whenever the guest traps to the hypervisor to
manipulate the anchor (ie the GPA of the guest IO page table).



That makes sense, thanks for clarifying.


But I'm not sure that reading the userspace io page tables with KVM is
even the best thing to do - the iommu driver already has the pinned
memory, it would be faster and more modular to traverse the io page
tables through the pfns in the root iommu_domain than by having KVM do
the translations. Let's see what Matthew says..


OK, you lost me a bit here.  And this may be associated with the above.

So, what the current implementation is doing is reading the guest DMA tables
(which we must pin the first time we access them) and then map the PTEs of
the associated guest DMA entries into the associated host DMA table (so,
again pin and place the address, or unpin and invalidate).  Basically we are
shadowing the first level DMA table as a copy of the second level DMA table
with the host address(es) of the pinned guest page(s).


You can't pin/unpin in this path, there is no real way to handle error
and ulimit stuff here, plus it is really slow. I didn't notice any of
this in your patches, so what do you mean by 'pin' above?


patch 18 does some symbol_get for gfn_to_page (which will drive 
hva_to_pfn under the covers) and kvm_release_pfn_dirty and uses those 
symbols for pin/unpin.


pin/unpin errors in this series are reliant on the fact that RPCIT is 
architected to include a panic response to the guest of 'mappings failed 
for the specified range, go refresh your tables and make room', thus 
allowing this to work for pageable guests.


Agreed this would be unnecessary if we've already mapped all of guest 
memory via a 1st iommu domain.




To be like other IOMMU nesting drivers the pages should already be
pinned and stored in the 1st iommu_domain, let's say in an xarray. This
xarray is populated by type1 map/unmap system calls like any
iommu_domain.

A nested iommu_domain should create the real HW IO page table and
associate it with the real HW IOMMU and record the parent 1st level 
iommu_domain.

When you 

Re: [PATCH v2 0/8] Enable PASID for DMA API users

2022-03-15 Thread Jacob Pan
Hi Kevin,

On Tue, 15 Mar 2022 08:16:59 +, "Tian, Kevin" 
wrote:

> > From: Jacob Pan 
> > Sent: Tuesday, March 15, 2022 1:07 PM
> > 
> > Some modern accelerators such as Intel's Data Streaming Accelerator
> > (DSA) require PASID in DMA requests to be operational. Specifically,
> > the work submissions with ENQCMD on shared work queues require PASIDs.
> > The use cases
> > include both user DMA with shared virtual addressing (SVA) and in-kernel
> > DMA similar to legacy DMA w/o PASID. Here we address the latter.
> > 
> > DMA mapping API is the de facto standard for in-kernel DMA. However, it
> > operates on a per device or Requester ID(RID) basis which is not
> > PASID-aware. To leverage DMA API for devices relies on PASIDs, this
> > patchset introduces the following APIs
> > 
> > 1. A driver facing API that enables DMA API PASID usage:
> > iommu_enable_pasid_dma(struct device *dev, ioasid_t *pasid);
> 
> Should this be called dma_enable_pasid() since it's about DMA API? Doing
> so also avoids the driver to include iommu.h.
> 
PASID is still tied to the IOMMU; drivers who want to use this must explicitly
depend on the IOMMU. So I prefer not to give that illusion.

> Thanks
> Kevin


Thanks,

Jacob


Re: [PATCH v4 29/32] vfio-pci/zdev: add DTSM to clp group capability

2022-03-15 Thread Jason Gunthorpe via iommu
On Tue, Mar 15, 2022 at 10:39:18AM -0400, Matthew Rosato wrote:
> > That is something that should be modeled as a nested iommu domain.
> > 
> > Querying the formats and any control logic for this should be on the
> > iommu side not built into VFIO.
> 
> I agree that the DTSM is really controlled by what the IOMMU domain can
> support (e.g. what guest table formats it can actually operate on) and so
> the DTSM value should come from there vs out of KVM; but is there harm in
> including the query response data here along with the rest of the response
> information for 'query this zPCI group'?

'Harm'? No, but I think it is wrong encapsulation and layering.

Jason


Re: [PATCH v4 15/32] vfio: introduce KVM-owned IOMMU type

2022-03-15 Thread Jason Gunthorpe via iommu
On Tue, Mar 15, 2022 at 09:36:08AM -0400, Matthew Rosato wrote:
> > If we do try to stick this into VFIO it should probably use the
> > VFIO_TYPE1_NESTING_IOMMU instead - however, we would like to delete
> > that flag entirely as it was never fully implemented, was never used,
> > and isn't part of what we are proposing for IOMMU nesting on ARM
> > anyhow. (So far I've found nobody to explain what the plan here was..)
> > 
> 
> I'm open to suggestions on how better to tie this into vfio.  The scenario
> basically plays out that:

Ideally I would like it to follow the same 'user space page table'
design that Eric and Kevin are working on for HW iommu.

You have a 1st iommu_domain that maps and pins the entire guest physical
address space.

You have a nested iommu_domain that represents the user page table
(the ioat in your language I think)

When the guest says it wants to set a user page table then you create
the nested iommu_domain representing that user page table and pass in
the anchor (guest address of the root IOPTE) to the kernel to do the
work.

The rule for all other HW's is that the user space page table is
translated by the top level kernel page table. So when you traverse it
you fetch the CPU page storing the guest's IOPTE by doing an IOVA
translation through the first level page table - not through KVM.

Since the first level page table and the KVM GPA should be 1:1 this is
an equivalent operation.

> 1) the iommu will be domain_alloc'd once VFIO_SET_IOMMU is issued -- so at
> that time (or earlier) we have to make the decision on whether to use the
> standard IOMMU or this alternate KVM/nested IOMMU.

So in terms of iommufd, I would see this as an iommufd 'create a
device-specific iommu_domain' IOCTL, where you can pass in an S390-specific
data blob to make it into this special mode.

> > This is why I said the second level should be an explicit iommu_domain
> > all on its own that is explicitly coupled to the KVM to read the page
> > tables, if necessary.
> 
> Maybe I misunderstood this.  Are you proposing 2 layers of IOMMU that
> interact with each other within host kernel space?
> 
> A second level runs the guest tables, pins the appropriate pieces from the
> guest to get the resulting phys_addr(s) which are then passed via iommu to a
> first level via map (or unmap)?


The first level iommu_domain has the 'type1' map and unmap and pins
the pages. This is the 1:1 map with the GPA and ends up pinning all
guest memory because the point is you don't want to take a memory pin
on your performance path.

The second level iommu_domain points to a single IO page table in GPA
and is created/destroyed whenever the guest traps to the hypervisor to
manipulate the anchor (ie the GPA of the guest IO page table).

> > But I'm not sure that reading the userspace io page tables with KVM is
> > even the best thing to do - the iommu driver already has the pinned
> > memory, it would be faster and more modular to traverse the io page
> > tables through the pfns in the root iommu_domain than by having KVM do
> > the translations. Let's see what Matthew says..
> 
> OK, you lost me a bit here.  And this may be associated with the above.
> 
> So, what the current implementation is doing is reading the guest DMA tables
> (which we must pin the first time we access them) and then map the PTEs of
> the associated guest DMA entries into the associated host DMA table (so,
> again pin and place the address, or unpin and invalidate).  Basically we are
> shadowing the first level DMA table as a copy of the second level DMA table
> with the host address(es) of the pinned guest page(s).

You can't pin/unpin in this path, there is no real way to handle error
and ulimit stuff here, plus it is really slow. I didn't notice any of
this in your patches, so what do you mean by 'pin' above?

To be like other IOMMU nesting drivers the pages should already be
pinned and stored in the 1st iommu_domain, let's say in an xarray. This
xarray is populated by type1 map/unmap system calls like any
iommu_domain.

A nested iommu_domain should create the real HW IO page table and
associate it with the real HW IOMMU and record the parent 1st level 
iommu_domain.
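
As a rough illustration of that shape (struct and field names are
assumptions, not from any posted patch):

struct nested_domain {
	struct iommu_domain	domain;		/* owns the real HW IO page table */
	struct iommu_domain	*parent;	/* 1st-level type1 domain, pins the GPA space */
	u64			guest_iota;	/* anchor: GPA of the guest IO table */
};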

When you do the shadowing you use the xarray of the 1st level
iommu_domain to translate from GPA to host physical and there is no
pinning/etc involved. After walking the guest table and learning the
final vIOVA it is translated through the xarray to a CPU physical and
then programmed into the real HW IO page table.
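
For illustration, that translation step might look like this, a sketch
assuming the parent domain keeps its pinned pages in a page-indexed xarray:

static phys_addr_t shadow_gpa_to_hpa(struct xarray *pinned, u64 gpa)
{
	/* look up the page pinned by the type1 map call for this GPA */
	struct page *page = xa_load(pinned, gpa >> PAGE_SHIFT);

	if (!page)
		return 0;	/* GPA was never mapped by the parent domain */
	return page_to_phys(page) | (gpa & ~PAGE_MASK);
}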

There is no reason to use KVM to do any of this, and it is actively wrong
to place CPU pages from KVM into an IOPTE that did not come through
the type1 map/unmap calls that do all the proper validation and
accounting.

Jason


Re: [PATCH v4 29/32] vfio-pci/zdev: add DTSM to clp group capability

2022-03-15 Thread Matthew Rosato

On 3/14/22 5:49 PM, Jason Gunthorpe wrote:

On Mon, Mar 14, 2022 at 03:44:48PM -0400, Matthew Rosato wrote:

The DTSM, or designation type supported mask, indicates what IOAT formats
are available to the guest.  For an interpreted device, userspace will not
know what format(s) the IOAT assist supports, so pass it via the
capability chain.  Since the value belongs to the Query PCI Function Group
clp, let's extend the existing capability with a new version.


Why is this on the VFIO device?


Current vfio_pci_zdev support adds a series of capability chains to the 
VFIO_DEVICE_GET_INFO ioctl.  These capability chains are all related to 
output values associated with what are basically s390x query instructions.


The capability chain being modified by this patch is used to populate a 
response to the 'query this zPCI group' instruction.




Maybe I don't quite understand it right, but the IOAT is the
'userspace page table'?


IOAT = I/O Address Translation tables, the head of which is called the
IOTA (translation anchor).  But yes, this would generally refer to the 
guest DMA tables.


Specifically here we are only talking about the DTSM, which describes the
formats that the guest is allowed to use for its address translation
tables, because the hardware (or in our case the intermediary kvm iommu)
can only operate on certain formats.




That is something that should be modeled as a nested iommu domain.

Querying the formats and any control logic for this should be on the
iommu side not built into VFIO.


I agree that the DTSM is really controlled by what the IOMMU domain can 
support (e.g. what guest table formats it can actually operate on) and 
so the DTSM value should come from there vs out of KVM; but is there 
harm in including the query response data here along with the rest of 
the response information for 'query this zPCI group'?



Re: [PATCH v4 15/32] vfio: introduce KVM-owned IOMMU type

2022-03-15 Thread Jason Gunthorpe via iommu
On Tue, Mar 15, 2022 at 09:49:01AM -0400, Matthew Rosato wrote:

> The rationale for splitting steps 1 and 2 is that VFIO_SET_IOMMU doesn't
> have a mechanism for specifying more than the type as an arg, no?  Otherwise
> yes, you could specify a kvm fd at this point and it would have some other
> advantages (e.g. skip notifier).  But we still can't use the IOMMU for
> mapping until step 3.

Stuff like this is why I'd be much happier if this could join our
iommufd project so we can have clean modeling of the multiple iommu_domains.

Jason


Re: [PATCH v2 5/8] iommu: Add PASID support for DMA mapping API users

2022-03-15 Thread Jason Gunthorpe via iommu
On Mon, Mar 14, 2022 at 10:07:09PM -0700, Jacob Pan wrote:
> DMA mapping API is the de facto standard for in-kernel DMA. It operates
> on a per device/RID basis which is not PASID-aware.
> 
> For some modern devices, such as the Intel Data Streaming Accelerator, a
> PASID is required for certain work submissions. To allow such devices to
> use the DMA mapping API, we need the following functionalities:
> 1. Provide device a way to retrieve a PASID for work submission within
> the kernel
> 2. Enable the kernel PASID on the IOMMU for the device
> 3. Attach the kernel PASID to the device's default DMA domain, let it
> be IOVA or physical address in case of pass-through.
> 
> This patch introduces a driver facing API that enables DMA API
> PASID usage. Once enabled, device drivers can continue to use DMA APIs as
> is. There is no difference in dma_handle between without PASID and with
> PASID.
> 
> Signed-off-by: Jacob Pan 
>  drivers/iommu/dma-iommu.c | 65 +++
>  include/linux/dma-iommu.h |  7 +
>  include/linux/iommu.h |  9 ++
>  3 files changed, 81 insertions(+)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index b22034975301..d0ff1a34b1b6 100644
> +++ b/drivers/iommu/dma-iommu.c
> @@ -39,6 +39,8 @@ enum iommu_dma_cookie_type {
>   IOMMU_DMA_MSI_COOKIE,
>  };
>  
> +static DECLARE_IOASID_SET(iommu_dma_pasid);
> +
>  struct iommu_dma_cookie {
>   enum iommu_dma_cookie_type  type;
>   union {
> @@ -370,6 +372,69 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
>   domain->iova_cookie = NULL;
>  }
>  
> +/**
> + * iommu_enable_pasid_dma --Enable in-kernel DMA request with PASID
> + * @dev: Device to be enabled
> + *
> + * DMA request with PASID will be mapped the same way as the legacy DMA.
> + * If the device is in pass-through, PASID will also pass-through. If the
> + * device is in IOVA map, the supervisor PASID will point to the same IOVA
> + * page table.
> + *
> + * @return the kernel PASID to be used for DMA or INVALID_IOASID on failure
> + */
> +int iommu_enable_pasid_dma(struct device *dev, ioasid_t *pasid)
> +{
> + struct iommu_domain *dom;
> + ioasid_t id, max;
> + int ret;
> +
> + dom = iommu_get_domain_for_dev(dev);
> + if (!dom || !dom->ops || !dom->ops->attach_dev_pasid)
> + return -ENODEV;

Given the purpose of this API I think it should assert that the device
has the DMA API in-use using the machinery from the other series.

ie this should not be used to clone non-DMA API iommu_domains..

> diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
> index 24607dc3c2ac..e6cb9b52a420 100644
> +++ b/include/linux/dma-iommu.h
> @@ -18,6 +18,13 @@ int iommu_get_dma_cookie(struct iommu_domain *domain);
>  int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base);
>  void iommu_put_dma_cookie(struct iommu_domain *domain);
>  
> +/*
> + * For devices that can do DMA request with PASID, setup a system PASID.
> + * Address modes (IOVA, PA) are selected by the platform code.
> + */
> +int iommu_enable_pasid_dma(struct device *dev, ioasid_t *pasid);
> +void iommu_disable_pasid_dma(struct device *dev, ioasid_t pasid);

The functions already have a kdoc, don't need two..

Jason


Re: [PATCH v2 3/8] iommu/vt-d: Implement device_pasid domain attach ops

2022-03-15 Thread Jason Gunthorpe via iommu
On Mon, Mar 14, 2022 at 10:07:07PM -0700, Jacob Pan wrote:
> + /*
> +  * Each domain could have multiple devices attached with shared or per
> +  * device PASIDs. At the domain level, we keep track of unique PASIDs 
> and
> +  * device user count.
> +  * E.g. If a domain has two devices attached, device A has PASID 0, 1;
> +  * device B has PASID 0, 2. Then the domain would have PASID 0, 1, 2.
> +  */

A 2d array of xarray's seems like a poor data structure for this task.

AFAICT this wants to store a list of (device, pasid) tuples, so a
simple linked list, 1d xarray vector or a red black tree seems more
appropriate..
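
For illustration, the flat structure suggested here might be as simple as (a
sketch, not code from the series):

struct domain_pasid_attach {
	struct list_head	link;	/* chained off the domain, under its lock */
	struct device		*dev;
	ioasid_t		pasid;
};

A 1-D xarray keyed by PASID or an rbtree keyed by (dev, pasid) would serve
the same lookups equally well.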

> + if (entry) {
> + pinfo = entry;
> + } else {
> + pinfo = kzalloc(sizeof(*pinfo), GFP_ATOMIC);
> + if (!pinfo)
> + return -ENOMEM;
> + pinfo->pasid = pasid;
> + /* Store the new PASID info in the per domain array */
> + ret = xa_err(__xa_store(&dmar_domain->pasids, pasid, pinfo,
> +  GFP_ATOMIC));
> + if (ret)
> + goto xa_store_err;
> + }
> + /* Store PASID in per device-domain array, this is for tracking devTLB 
> */
> + ret = xa_err(xa_store(&info->pasids, pasid, pinfo, GFP_ATOMIC));
> + if (ret)
> + goto xa_store_err;
> +
> + atomic_inc(&pinfo->users);
> + xa_unlock(&dmar_domain->pasids);
> +
> + return 0;
> +
> +xa_store_err:
> + xa_unlock(&dmar_domain->pasids);
> + spin_lock_irqsave(&iommu->lock, flags);
> + intel_pasid_tear_down_entry(iommu, dev, pasid, false);
> + spin_unlock_irqrestore(&iommu->lock, flags);
> +
> + if (!atomic_read(&pinfo->users)) {
> + __xa_erase(&dmar_domain->pasids, pasid);

This isn't locked right

Jason


Re: [PATCH v2 5/8] iommu: Add PASID support for DMA mapping API users

2022-03-15 Thread Jason Gunthorpe via iommu
On Tue, Mar 15, 2022 at 11:16:41AM +, Robin Murphy wrote:
> On 2022-03-15 05:07, Jacob Pan wrote:
> > DMA mapping API is the de facto standard for in-kernel DMA. It operates
> > on a per device/RID basis which is not PASID-aware.
> > 
> > For some modern devices, such as the Intel Data Streaming Accelerator, a
> > PASID is required for certain work submissions. To allow such devices to
> > use the DMA mapping API, we need the following functionalities:
> > 1. Provide device a way to retrieve a PASID for work submission within
> > the kernel
> > 2. Enable the kernel PASID on the IOMMU for the device
> > 3. Attach the kernel PASID to the device's default DMA domain, let it
> > be IOVA or physical address in case of pass-through.
> > 
> > This patch introduces a driver facing API that enables DMA API
> > PASID usage. Once enabled, device drivers can continue to use DMA APIs as
> > is. There is no difference in dma_handle between without PASID and with
> > PASID.
> 
> Surely the main point of PASIDs is to be able to use more than one
> of them?

IMHO, not for the DMA API.

I can't think of good reasons why a single in-kernel device should
require more than one iommu_domain for use by the DMA API. Even with
the SIOV cases we have been looking at we don't really see a use case
for more than one DMA API iommu_domain on a single physical device.
Do you know of something on the horizon?

From my view the main point of PASIDs is to assign iommu_domains that
are not used by the DMA API.

IMHO it is a device mis-design of IDXD to require all DMA be PASID
tagged. Devices should be able to do DMA on their RID when the PCI
function is controlled by a kernel driver. I see this driver facing
API as addressing a device quirk by aliasing the DMA API of the RID
into a PASID and that is really all it is good for.

In any case I think we are better to wait for an actual user for multi
DMA API iommu_domains to come forward before we try to build an API
for it.

Jason


Re: [PATCH v4 15/32] vfio: introduce KVM-owned IOMMU type

2022-03-15 Thread Matthew Rosato

On 3/15/22 3:57 AM, Tian, Kevin wrote:

From: Jason Gunthorpe 
Sent: Tuesday, March 15, 2022 7:18 AM

On Mon, Mar 14, 2022 at 04:50:33PM -0600, Alex Williamson wrote:


+/*
+ * The KVM_IOMMU type implies that the hypervisor will control the

mappings

+ * rather than userspace
+ */
+#define VFIO_KVM_IOMMU 11


Then why is this hosted in the type1 code that exposes a wide variety
of userspace interfaces?  Thanks,


It is really badly named, this is the root level of a 2 stage nested
IO page table, and this approach needed a special flag to distinguish
the setup from the normal iommu_domain.

If we do try to stick this into VFIO it should probably use the
VFIO_TYPE1_NESTING_IOMMU instead - however, we would like to delete
that flag entirely as it was never fully implemented, was never used,
and isn't part of what we are proposing for IOMMU nesting on ARM
anyhow. (So far I've found nobody to explain what the plan here was..)

This is why I said the second level should be an explicit iommu_domain
all on its own that is explicitly coupled to the KVM to read the page
tables, if necessary.

But I'm not sure that reading the userspace io page tables with KVM is
even the best thing to do - the iommu driver already has the pinned
memory, it would be faster and more modular to traverse the io page
tables through the pfns in the root iommu_domain than by having KVM do
the translations. Let's see what Matthew says..



Reading this thread, it's sort of like an optimization to software nesting.


Yes, we want to avoid breaking out to userspace for a very frequent
operation (RPCIT / updating shadow mappings)



If that is the case, does it make more sense to complete the basic form
of software nesting first and then add this optimization?

The basic form would allow the userspace to create a special domain
type which points to a user/guest page table (like hardware nesting)
but doesn't install the user page table to the IOMMU hardware (unlike
hardware nesting). When receiving an invalidate cmd from userspace, the iommu
driver walks the user page table (1st-level) and the parent
page table (2nd-level) to generate a shadow mapping for the
invalidated range in the non-nested hardware page table of this
special domain type.

Once that works, what this series does just changes how
the invalidate cmd is triggered. Previously the iommu driver received the
invalidate cmd from Qemu (via iommufd uAPI), while now it receives
the cmd from kvm (via iommufd kAPI) upon interception of RPCIT.
From this angle, once the connection between iommufd and kvm fd
is established, there is even no direct talk between the iommu driver and
kvm.


But something somewhere still needs to be responsible for 
pinning/unpinning of the guest table entries upon each RPCIT 
interception.  For example, the RPCIT intercept can happen because the guest 
wants to invalidate some old mappings or has generated some new mappings 
over a range, so we must shadow the new mappings (by pinning the guest 
entries and placing them in the host hardware table / unpinning 
invalidated ones and clearing their entry in the host hardware table).
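
As a hedged pseudocode sketch of that per-RPCIT shadow refresh (every helper
name here is an assumption, reusing the nested_domain and shadow_gpa_to_hpa
sketches from earlier in this digest):

static int shadow_refresh_range(struct nested_domain *nd,
				struct xarray *parent_pinned,
				u64 iova, u64 len)
{
	u64 off;

	for (off = 0; off < len; off += PAGE_SIZE) {
		/* walk the guest IO table to find the guest's IOPTE (a GPA) */
		u64 gpa = guest_iopte_lookup(nd->guest_iota, iova + off);
		/* resolve through the pinned 1st-level mapping, not via KVM */
		phys_addr_t hpa = shadow_gpa_to_hpa(parent_pinned, gpa);

		if (hpa)
			hw_iopte_set(&nd->domain, iova + off, hpa);
		else
			hw_iopte_clear(&nd->domain, iova + off);
	}
	return 0;
}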





Re: [PATCH v4 15/32] vfio: introduce KVM-owned IOMMU type

2022-03-15 Thread Matthew Rosato

On 3/14/22 5:38 PM, Jason Gunthorpe wrote:

On Mon, Mar 14, 2022 at 03:44:34PM -0400, Matthew Rosato wrote:


diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 9394aa9444c1..0bec97077d61 100644
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -77,6 +77,7 @@ struct vfio_iommu {
boolnesting;
booldirty_page_tracking;
boolcontainer_open;
+   boolkvm;
struct list_heademulated_iommu_groups;
  };
  
@@ -2203,7 +2204,12 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,

goto out_free_group;
  
  	ret = -EIO;

-   domain->domain = iommu_domain_alloc(bus);
+
+   if (iommu->kvm)
+   domain->domain = iommu_domain_alloc_type(bus, IOMMU_DOMAIN_KVM);
+   else
+   domain->domain = iommu_domain_alloc(bus);
+
if (!domain->domain)
goto out_free_domain;
  
@@ -2552,6 +2558,9 @@ static void *vfio_iommu_type1_open(unsigned long arg)

case VFIO_TYPE1v2_IOMMU:
iommu->v2 = true;
break;
+   case VFIO_KVM_IOMMU:
+   iommu->kvm = true;
+   break;


Same remark for this - but more - this is called KVM but it doesn't
accept a kvm FD or any thing else to link the domain to the KVM
in-use.


Right...  The name is poor, but with the current design the KVM 
association comes shortly after.  To summarize, with this series, the 
following relevant steps occur:


1) VFIO_SET_IOMMU: Indicate we wish to use the alternate IOMMU domain
	-> At this point, the IOMMU will reject any maps (no KVM, no guest 
table anchor)

2) KVM ioctl "start":
-> Register the KVM with the IOMMU domain
-> At this point, IOMMU will still reject any maps (no guest table 
anchor)
3) KVM ioctl "register ioat"
-> Register the guest DMA table head with the IOMMU domain
-> now IOMMU maps are allowed
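
Illustratively, the resulting call order is (the KVM ioctl names below are
placeholders for whatever the series actually defines):

/* 1) maps rejected: no KVM, no guest table anchor yet */
ioctl(container_fd, VFIO_SET_IOMMU, VFIO_KVM_IOMMU);
/* 2) associate the KVM with the IOMMU domain; maps still rejected */
ioctl(kvm_fd, KVM_S390_ZPCI_START, &start_args);
/* 3) register the guest DMA table anchor; maps now allowed */
ioctl(kvm_fd, KVM_S390_ZPCI_REG_IOAT, &ioat_args);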

The rationale for splitting steps 1 and 2 is that VFIO_SET_IOMMU 
doesn't have a mechanism for specifying more than the type as an arg, 
no?  Otherwise yes, you could specify a kvm fd at this point and it 
would have some other advantages (e.g. skip notifier).  But we still 
can't use the IOMMU for mapping until step 3.


The rationale for splitting steps 2 and 3 is twofold:  1) during init, 
we simply don't know where the guest anchor will be when we allocate the 
domain and 2) because the guest can technically clear and re-initialize 
their DMA space during the life of the guest, moving the location of the 
table anchor.  We would receive another ioctl operation to unregister 
the guest table anchor and again reject any map operation until a new 
table location is provided.




Re: [PATCH v4 15/32] vfio: introduce KVM-owned IOMMU type

2022-03-15 Thread Matthew Rosato

On 3/14/22 7:18 PM, Jason Gunthorpe wrote:

On Mon, Mar 14, 2022 at 04:50:33PM -0600, Alex Williamson wrote:


+/*
+ * The KVM_IOMMU type implies that the hypervisor will control the mappings
+ * rather than userspace
+ */
+#define VFIO_KVM_IOMMU 11


Then why is this hosted in the type1 code that exposes a wide variety
of userspace interfaces?  Thanks,


It is really badly named, this is the root level of a 2 stage nested
IO page table, and this approach needed a special flag to distinguish
the setup from the normal iommu_domain.


^^ Yes, this.



If we do try to stick this into VFIO it should probably use the
VFIO_TYPE1_NESTING_IOMMU instead - however, we would like to delete
that flag entirely as it was never fully implemented, was never used,
and isn't part of what we are proposing for IOMMU nesting on ARM
anyhow. (So far I've found nobody to explain what the plan here was..)



I'm open to suggestions on how better to tie this into vfio.  The 
scenario basically plays out that:


1) the iommu will be domain_alloc'd once VFIO_SET_IOMMU is issued -- so 
at that time (or earlier) we have to make the decision on whether to use 
the standard IOMMU or this alternate KVM/nested IOMMU.


2) At the time VFIO_SET_IOMMU is received, we have not yet associated 
the vfio group with a KVM, so we can't (today) use this as an indicator 
to guess which IOMMU strategy to use.


3) Ideally, even if we changed QEMU vfio to make the KVM association 
earlier, it would be nice to still be able to indicate that we want to 
use the standard iommu/type1 despite a KVM association existing (e.g. 
backwards compatibility with older QEMU that lacks 'interpretation' 
support, nested virtualization scenarios).



This is why I said the second level should be an explicit iommu_domain
all on its own that is explicitly coupled to the KVM to read the page
tables, if necessary.


Maybe I misunderstood this.  Are you proposing 2 layers of IOMMU that
interact with each other within host kernel space?

A second level runs the guest tables, pins the appropriate pieces from 
the guest to get the resulting phys_addr(s) which are then passed via 
iommu to a first level via map (or unmap)?




But I'm not sure that reading the userspace io page tables with KVM is
even the best thing to do - the iommu driver already has the pinned
memory, it would be faster and more modular to traverse the io page
tables through the pfns in the root iommu_domain than by having KVM do
the translations. Let's see what Matthew says..


OK, you lost me a bit here.  And this may be associated with the above.

So, what the current implementation is doing is reading the guest DMA 
tables (which we must pin the first time we access them) and then map 
the PTEs of the associated guest DMA entries into the associated host 
DMA table (so, again pin and place the address, or unpin and 
invalidate).  Basically we are shadowing the first level DMA table as a 
copy of the second level DMA table with the host address(es) of the 
pinned guest page(s).


I'm unclear where you are proposing the pinning be done if not by the 
iommu domain traversing the tables to perform the 'shadow' operation.






RE: [PATCH v2 2/8] iommu: Add attach/detach_dev_pasid domain ops

2022-03-15 Thread Tian, Kevin
> From: Jean-Philippe Brucker 
> Sent: Tuesday, March 15, 2022 7:27 PM
> 
> On Mon, Mar 14, 2022 at 10:07:06PM -0700, Jacob Pan wrote:
> > From: Lu Baolu 
> >
> > An IOMMU domain represents an address space which can be attached by
> > devices that perform DMA within a domain. However, for platforms with
> > PASID capability the domain attachment needs to be handled at device+PASID
> > level. There can be multiple PASIDs within a device and multiple devices
> > attached to a given domain.
> > This patch introduces a new IOMMU op which supports device, PASID, and
> > IOMMU domain attachment. The immediate use case is for PASID capable
> > devices to perform DMA under DMA APIs.
> >
> > Signed-off-by: Lu Baolu 
> > Signed-off-by: Jacob Pan 
> > ---
> >  include/linux/iommu.h | 6 ++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index 369f05c2a4e2..fde5b933dbe3 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -227,6 +227,8 @@ struct iommu_iotlb_gather {
> >   * @aux_get_pasid: get the pasid given an aux-domain
> >   * @sva_bind: Bind process address space to device
> >   * @sva_unbind: Unbind process address space from device
> > + * @attach_dev_pasid: attach an iommu domain to a pasid of device
> > + * @detach_dev_pasid: detach an iommu domain from a pasid of device
> 
> Isn't that operation "assign a PASID to a domain" instead?  In patch 5,
> the domain is already attached to the device, so set_domain_pasid() might
> be clearer and to the point. If the IOMMU driver did the allocation we
> could also avoid patch 1.

IIUC this API can also work for future SIOV usage where each mdev attached
to the domain has its own pasid. "assigning a PASID to a domain" sounds
like going back to the previous aux domain approach which has one PASID
per domain and that PASID is used on all devices attached to the aux domain...

> 
> If I understand correctly this series is not about a generic PASID API
> that allows drivers to manage multiple DMA address spaces, because there
> still doesn't seem to be any interest in that. It's about the specific
> IDXD use-case, so let's focus on that. We can introduce a specialized call
> such as (iommu|dma)_set_device_pasid(), which will be easy to consolidate
> later into a more generic "dma_enable_pasid()" API if that ever seems
> useful.
> 
> Thanks,
> Jean
> 
> >   * @sva_get_pasid: Get PASID associated to a SVA handle
> >   * @page_response: handle page request response
> >   * @cache_invalidate: invalidate translation caches
> > @@ -296,6 +298,10 @@ struct iommu_ops {
> > struct iommu_sva *(*sva_bind)(struct device *dev, struct mm_struct
> *mm,
> >   void *drvdata);
> > void (*sva_unbind)(struct iommu_sva *handle);
> > +   int (*attach_dev_pasid)(struct iommu_domain *domain,
> > +   struct device *dev, ioasid_t id);
> > +   void (*detach_dev_pasid)(struct iommu_domain *domain,
> > +struct device *dev, ioasid_t id);
> > u32 (*sva_get_pasid)(struct iommu_sva *handle);
> >
> > int (*page_response)(struct device *dev,
> > --
> > 2.25.1
> >


Re: [PATCH v2 8/8] iommu: Remove unused driver data in sva_bind_device

2022-03-15 Thread Jean-Philippe Brucker
On Mon, Mar 14, 2022 at 10:07:12PM -0700, Jacob Pan wrote:
> No one is using drvdata for sva_bind_device after kernel SVA support is
> removed from VT-d driver. Remove the drvdata parameter as well.
> 
> Signed-off-by: Jacob Pan 

Reviewed-by: Jean-Philippe Brucker 

> ---
>  drivers/dma/idxd/cdev.c | 2 +-
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 2 +-
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 5 ++---
>  drivers/iommu/intel/svm.c   | 9 -
>  drivers/iommu/iommu.c   | 4 ++--
>  drivers/misc/uacce/uacce.c  | 2 +-
>  include/linux/intel-iommu.h | 3 +--
>  include/linux/iommu.h   | 9 +++--
>  8 files changed, 15 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/dma/idxd/cdev.c b/drivers/dma/idxd/cdev.c
> index b9b2b4a4124e..312ec37ebf91 100644
> --- a/drivers/dma/idxd/cdev.c
> +++ b/drivers/dma/idxd/cdev.c
> @@ -100,7 +100,7 @@ static int idxd_cdev_open(struct inode *inode, struct 
> file *filp)
>   filp->private_data = ctx;
>  
>   if (device_pasid_enabled(idxd)) {
> - sva = iommu_sva_bind_device(dev, current->mm, NULL);
> + sva = iommu_sva_bind_device(dev, current->mm);
>   if (IS_ERR(sva)) {
>   rc = PTR_ERR(sva);
>   dev_err(dev, "pasid allocation failed: %d\n", rc);
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c 
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
> index a737ba5f727e..eb2f5cb0701a 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
> @@ -354,7 +354,7 @@ __arm_smmu_sva_bind(struct device *dev, struct mm_struct 
> *mm)
>  }
>  
>  struct iommu_sva *
> -arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm, void *drvdata)
> +arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm)
>  {
>   struct iommu_sva *handle;
>   struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index cd48590ada30..d2ba86470c42 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -754,8 +754,7 @@ bool arm_smmu_master_sva_enabled(struct arm_smmu_master 
> *master);
>  int arm_smmu_master_enable_sva(struct arm_smmu_master *master);
>  int arm_smmu_master_disable_sva(struct arm_smmu_master *master);
>  bool arm_smmu_master_iopf_supported(struct arm_smmu_master *master);
> -struct iommu_sva *arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm,
> - void *drvdata);
> +struct iommu_sva *arm_smmu_sva_bind(struct device *dev, struct mm_struct 
> *mm);
>  void arm_smmu_sva_unbind(struct iommu_sva *handle);
>  u32 arm_smmu_sva_get_pasid(struct iommu_sva *handle);
>  void arm_smmu_sva_notifier_synchronize(void);
> @@ -791,7 +790,7 @@ static inline bool arm_smmu_master_iopf_supported(struct 
> arm_smmu_master *master
>  }
>  
>  static inline struct iommu_sva *
> -arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm, void *drvdata)
> +arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm)
>  {
>   return ERR_PTR(-ENODEV);
>  }
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index 37d6218f173b..94deb58375f5 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -500,8 +500,7 @@ int intel_svm_unbind_gpasid(struct device *dev, u32 pasid)
>   return ret;
>  }
>  
> -static int intel_svm_alloc_pasid(struct device *dev, struct mm_struct *mm,
> -  unsigned int flags)
> +static int intel_svm_alloc_pasid(struct device *dev, struct mm_struct *mm)
>  {
>   ioasid_t max_pasid = dev_is_pci(dev) ?
>   pci_max_pasids(to_pci_dev(dev)) : intel_pasid_max_id;
> @@ -1002,20 +1001,20 @@ static irqreturn_t prq_event_thread(int irq, void *d)
>   return IRQ_RETVAL(handled);
>  }
>  
> -struct iommu_sva *intel_svm_bind(struct device *dev, struct mm_struct *mm, 
> void *drvdata)
> +struct iommu_sva *intel_svm_bind(struct device *dev, struct mm_struct *mm)
>  {
>   struct intel_iommu *iommu = device_to_iommu(dev, NULL, NULL);
>   struct iommu_sva *sva;
>   int ret;
>  
>   mutex_lock(&pasid_mutex);
> - ret = intel_svm_alloc_pasid(dev, mm, flags);
> + ret = intel_svm_alloc_pasid(dev, mm);
>   if (ret) {
>   mutex_unlock(&pasid_mutex);
>   return ERR_PTR(ret);
>   }
>  
> - sva = intel_svm_bind_mm(iommu, dev, mm, flags);
> + sva = intel_svm_bind_mm(iommu, dev, mm);
>   if (IS_ERR_OR_NULL(sva))
>   intel_svm_free_pasid(mm);
>   mutex_unlock(&pasid_mutex);
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 107dcf5938d6..fef34879bc0c 100644
> --- a/drivers/iommu/iommu.c
> +++ 

Re: [PATCH v2 2/8] iommu: Add attach/detach_dev_pasid domain ops

2022-03-15 Thread Jean-Philippe Brucker
On Mon, Mar 14, 2022 at 10:07:06PM -0700, Jacob Pan wrote:
> From: Lu Baolu 
> 
> An IOMMU domain represents an address space which can be attached by
> devices that perform DMA within a domain. However, for platforms with
> PASID capability the domain attachment needs to be handled at device+PASID
> level. There can be multiple PASIDs within a device and multiple devices
> attached to a given domain.
> This patch introduces a new IOMMU op which supports device, PASID, and
> IOMMU domain attachment. The immediate use case is for PASID capable
> devices to perform DMA under DMA APIs.
> 
> Signed-off-by: Lu Baolu 
> Signed-off-by: Jacob Pan 
> ---
>  include/linux/iommu.h | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 369f05c2a4e2..fde5b933dbe3 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -227,6 +227,8 @@ struct iommu_iotlb_gather {
>   * @aux_get_pasid: get the pasid given an aux-domain
>   * @sva_bind: Bind process address space to device
>   * @sva_unbind: Unbind process address space from device
> + * @attach_dev_pasid: attach an iommu domain to a pasid of a device
> + * @detach_dev_pasid: detach an iommu domain from a pasid of a device

Isn't that operation "assign a PASID to a domain" instead?  In patch 5,
the domain is already attached to the device, so set_domain_pasid() might
be clearer and to the point. If the IOMMU driver did the allocation we
could also avoid patch 1.

If I understand correctly this series is not about a generic PASID API
that allows drivers to manage multiple DMA address spaces, because there
still doesn't seem to be any interest in that. It's about the specific
IDXD use-case, so let's focus on that. We can introduce a specialized call
such as (iommu|dma)_set_device_pasid(), which will be easy to consolidate
later into a more generic "dma_enable_pasid()" API if that ever seems
useful.
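
Concretely, something like this (a rough sketch only, exact name aside):

	/*
	 * Assign @pasid to the DMA domain already attached to @dev, with
	 * the PASID allocated by the IOMMU driver.
	 */
	int iommu_set_device_pasid(struct device *dev, ioasid_t *pasid);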

Thanks,
Jean

>   * @sva_get_pasid: Get PASID associated to a SVA handle
>   * @page_response: handle page request response
>   * @cache_invalidate: invalidate translation caches
> @@ -296,6 +298,10 @@ struct iommu_ops {
>   struct iommu_sva *(*sva_bind)(struct device *dev, struct mm_struct *mm,
> void *drvdata);
>   void (*sva_unbind)(struct iommu_sva *handle);
> + int (*attach_dev_pasid)(struct iommu_domain *domain,
> + struct device *dev, ioasid_t id);
> + void (*detach_dev_pasid)(struct iommu_domain *domain,
> +  struct device *dev, ioasid_t id);
>   u32 (*sva_get_pasid)(struct iommu_sva *handle);
>  
>   int (*page_response)(struct device *dev,
> -- 
> 2.25.1
> 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 5/8] iommu: Add PASID support for DMA mapping API users

2022-03-15 Thread Robin Murphy

On 2022-03-15 05:07, Jacob Pan wrote:

The DMA mapping API is the de facto standard for in-kernel DMA. It operates
on a per-device/RID basis, which is not PASID-aware.

For some modern devices, such as the Intel Data Streaming Accelerator, a
PASID is required for certain work submissions. To allow such devices to
use the DMA mapping API, we need the following functionality:
1. Provide the device a way to retrieve a PASID for work submission within
the kernel
2. Enable the kernel PASID on the IOMMU for the device
3. Attach the kernel PASID to the device's default DMA domain, be it IOVA
or physical address in the case of pass-through.

This patch introduces a driver-facing API that enables DMA API
PASID usage. Once enabled, device drivers can continue to use the DMA APIs
as is. There is no difference in the dma_handle between DMA without PASID
and with PASID.


Surely the main point of PASIDs is to be able to use more than one of 
them? The way I expected this to work is that iommu_enable_pasid_dma() 
returns a *new* struct device representing the dev+PASID combination, 
and the driver can then pass that to subsequent DMA API and/or IOMMU API 
calls as normal, and they know to do the right thing. Automatically 
inferring a PASID for the original physical device clearly can't scale, 
and seems like a dead-end approach that only helps this one niche use-case.
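
Roughly (a sketch only, the helper name is made up):

	/* get a child struct device representing the dev+PASID combination */
	struct device *pasid_dev = iommu_get_pasid_dev(dev, &pasid);

	dma_addr_t dma = dma_map_single(pasid_dev, buf, size, DMA_TO_DEVICE);

A driver could then hold as many of these as it has address spaces, and
everything else falls out of the existing APIs.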


Either way, I think this is also still fundamentally an IOMMU API 
operation that belongs in iommu.[ch] - since the iommu_dma_ops 
consolidation I'd prefer to continue working towards making dma-iommu.h 
a private header just for IOMMU API internal helpers.


Thanks,
Robin.


Signed-off-by: Jacob Pan 
---
  drivers/iommu/dma-iommu.c | 65 +++
  include/linux/dma-iommu.h |  7 +
  include/linux/iommu.h |  9 ++
  3 files changed, 81 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index b22034975301..d0ff1a34b1b6 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -39,6 +39,8 @@ enum iommu_dma_cookie_type {
IOMMU_DMA_MSI_COOKIE,
  };
  
+static DECLARE_IOASID_SET(iommu_dma_pasid);
+
  struct iommu_dma_cookie {
enum iommu_dma_cookie_type  type;
union {
@@ -370,6 +372,69 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
domain->iova_cookie = NULL;
  }
  
+/**
+ * iommu_enable_pasid_dma --Enable in-kernel DMA request with PASID
+ * @dev:   Device to be enabled
+ *
+ * DMA request with PASID will be mapped the same way as the legacy DMA.
+ * If the device is in pass-through, PASID will also pass-through. If the
+ * device is in IOVA map, the supervisor PASID will point to the same IOVA
+ * page table.
+ *
+ * @return the kernel PASID to be used for DMA or INVALID_IOASID on failure
+ */
+int iommu_enable_pasid_dma(struct device *dev, ioasid_t *pasid)
+{
+   struct iommu_domain *dom;
+   ioasid_t id, max;
+   int ret;
+
+   dom = iommu_get_domain_for_dev(dev);
+   if (!dom || !dom->ops || !dom->ops->attach_dev_pasid)
+   return -ENODEV;
+   max = iommu_get_dev_pasid_max(dev);
+   if (!max)
+   return -EINVAL;
+
+   id = ioasid_alloc(&iommu_dma_pasid, 1, max, dev);
+   if (id == INVALID_IOASID)
+   return -ENOMEM;
+
+   ret = dom->ops->attach_dev_pasid(dom, dev, id);
+   if (ret) {
+   ioasid_put(id);
+   return ret;
+   }
+   *pasid = id;
+
+   return ret;
+}
+EXPORT_SYMBOL(iommu_enable_pasid_dma);
+
+/**
+ * iommu_disable_pasid_dma --Disable in-kernel DMA request with PASID
+ * @dev:   Device's PASID DMA to be disabled
+ *
+ * It is the device driver's responsibility to ensure no more incoming DMA
+ * requests with the kernel PASID before calling this function. IOMMU driver
+ * ensures PASID cache, IOTLBs related to the kernel PASID are cleared and
+ * drained.
+ *
+ * @return 0 on success or error code on failure
+ */
+void iommu_disable_pasid_dma(struct device *dev, ioasid_t pasid)
+{
+   struct iommu_domain *dom;
+
+   /* TODO: check the given PASID is within the ioasid_set */
+   dom = iommu_get_domain_for_dev(dev);
+   if (!dom->ops->detach_dev_pasid)
+   return;
+   dom->ops->detach_dev_pasid(dom, dev, pasid);
+   ioasid_put(pasid);
+}
+EXPORT_SYMBOL(iommu_disable_pasid_dma);
+
  /**
   * iommu_dma_get_resv_regions - Reserved region driver helper
   * @dev: Device from iommu_get_resv_regions()
diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
index 24607dc3c2ac..e6cb9b52a420 100644
--- a/include/linux/dma-iommu.h
+++ b/include/linux/dma-iommu.h
@@ -18,6 +18,13 @@ int iommu_get_dma_cookie(struct iommu_domain *domain);
  int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base);
  void iommu_put_dma_cookie(struct iommu_domain *domain);
  
+/*
+ * For devices that can do DMA request with PASID, setup a system PASID.
+ * Address modes (IOVA, PA) are selected by the 

Re: [PATCH v4 14/32] iommu: introduce iommu_domain_alloc_type and the KVM type

2022-03-15 Thread Robin Murphy

On 2022-03-14 19:44, Matthew Rosato wrote:

s390x will introduce an additional domain type that is used for
managing IOMMU owned by KVM.  Define the type here and add an
interface for allocating a specified type vs the default type.


I'm also not a huge fan of adding a new domain_alloc interface like 
this; however, if it is justifiable, then please make it take a struct 
device rather than a struct bus_type as an argument.
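
i.e. roughly (sketch only):

	struct iommu_domain *iommu_domain_alloc_type(struct device *dev,
						     unsigned int type);

so the IOMMU instance can be derived from the device rather than from
the bus.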


It also sounds like there may be a degree of conceptual overlap here 
with what Jean-Philippe is working on for sharing pagetables between KVM 
and SMMU for Android pKVM, so it's probably worth some thought over 
whether there's any scope for common interfaces in terms of actual 
implementation.


Thanks,
Robin.


Signed-off-by: Matthew Rosato 
---
  drivers/iommu/iommu.c |  7 +++
  include/linux/iommu.h | 12 
  2 files changed, 19 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index f2c45b85b9fc..8bb57e0e3945 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1976,6 +1976,13 @@ void iommu_domain_free(struct iommu_domain *domain)
  }
  EXPORT_SYMBOL_GPL(iommu_domain_free);
  
+struct iommu_domain *iommu_domain_alloc_type(struct bus_type *bus,
+					     unsigned int t)
+{
+   return __iommu_domain_alloc(bus, t);
+}
+EXPORT_SYMBOL_GPL(iommu_domain_alloc_type);
+
  static int __iommu_attach_device(struct iommu_domain *domain,
 struct device *dev)
  {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 9208eca4b0d1..b427bbb9f387 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -63,6 +63,7 @@ struct iommu_domain_geometry {
  implementation  */
  #define __IOMMU_DOMAIN_PT (1U << 2)  /* Domain is identity mapped   */
  #define __IOMMU_DOMAIN_DMA_FQ (1U << 3)  /* DMA-API uses flush queue*/
+#define __IOMMU_DOMAIN_KVM (1U << 4)  /* Domain is controlled by KVM */
  
 /*
  * This are the possible domain-types
@@ -77,6 +78,7 @@ struct iommu_domain_geometry {
   *  certain optimizations for these domains
   *IOMMU_DOMAIN_DMA_FQ - As above, but definitely using batched TLB
   *  invalidation.
+ * IOMMU_DOMAIN_KVM- DMA mappings managed by KVM, used for VMs
   */
  #define IOMMU_DOMAIN_BLOCKED  (0U)
  #define IOMMU_DOMAIN_IDENTITY (__IOMMU_DOMAIN_PT)
@@ -86,6 +88,8 @@ struct iommu_domain_geometry {
  #define IOMMU_DOMAIN_DMA_FQ   (__IOMMU_DOMAIN_PAGING |\
 __IOMMU_DOMAIN_DMA_API |   \
 __IOMMU_DOMAIN_DMA_FQ)
+#define IOMMU_DOMAIN_KVM   (__IOMMU_DOMAIN_PAGING |\
+__IOMMU_DOMAIN_KVM)
  
 struct iommu_domain {
 	unsigned type;
@@ -421,6 +425,8 @@ extern bool iommu_capable(struct bus_type *bus, enum iommu_cap cap);
  extern struct iommu_domain *iommu_domain_alloc(struct bus_type *bus);
  extern struct iommu_group *iommu_group_get_by_id(int id);
  extern void iommu_domain_free(struct iommu_domain *domain);
+extern struct iommu_domain *iommu_domain_alloc_type(struct bus_type *bus,
+   unsigned int t);
  extern int iommu_attach_device(struct iommu_domain *domain,
   struct device *dev);
  extern void iommu_detach_device(struct iommu_domain *domain,
@@ -708,6 +714,12 @@ static inline void iommu_domain_free(struct iommu_domain *domain)
  {
  }
  
+static inline struct iommu_domain *iommu_domain_alloc_type(struct bus_type *bus,
+							    unsigned int t)
+{
+   return NULL;
+}
+
  static inline int iommu_attach_device(struct iommu_domain *domain,
  struct device *dev)
  {

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v2 3/8] iommu/vt-d: Implement device_pasid domain attach ops

2022-03-15 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Tuesday, March 15, 2022 1:07 PM
> 
> On VT-d platforms with scalable mode enabled, devices issuing DMA requests
> with PASID need to attach to the correct IOMMU domains.
> The attach operation involves the following:
> - programming the PASID into the device's PASID table
> - tracking device domain and the PASID relationship
> - managing IOTLB and device TLB invalidations
> 
> This patch extends DMAR domain and device domain info with xarrays to
> track per domain and per device PASIDs.  It provides the flexibility to
> be used beyond DMA API PASID support.
> 
> Signed-off-by: Lu Baolu 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/intel/iommu.c | 194 +++-
>  include/linux/intel-iommu.h |  12 ++-
>  2 files changed, 202 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 881f8361eca2..9267194eaed3 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -1622,20 +1622,48 @@ static void __iommu_flush_dev_iotlb(struct
> device_domain_info *info,
>  qdep, addr, mask);
>  }
> 
> +static void __iommu_flush_dev_piotlb(struct device_domain_info *info,

piotlb is confusing; a better name would be:

__iommu_flush_dev_iotlb_pasid()

> + u64 address,
> +  ioasid_t pasid, unsigned int mask)
> +{
> + u16 sid, qdep;
> +
> + if (!info || !info->ats_enabled)
> + return;
> +
> + sid = info->bus << 8 | info->devfn;
> + qdep = info->ats_qdep;
> + qi_flush_dev_iotlb_pasid(info->iommu, sid, info->pfsid,
> +  pasid, qdep, address, mask);
> +}
> +
>  static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
> u64 addr, unsigned mask)
>  {
>   unsigned long flags;
>   struct device_domain_info *info;
>   struct subdev_domain_info *sinfo;
> + unsigned long pasid;
> + struct pasid_info *pinfo;
> 
>   if (!domain->has_iotlb_device)
>   return;
> 
>   spin_lock_irqsave(&device_domain_lock, flags);
> - list_for_each_entry(info, &domain->devices, link)
> - __iommu_flush_dev_iotlb(info, addr, mask);
> -
> + list_for_each_entry(info, &domain->devices, link) {
> + /*
> +  * We cannot use PASID based devTLB invalidation on
> RID2PASID
> +  * Device does not understand RID2PASID/0. This is different

There is a missing conjunction between 'RID2PASID' and 'Device'.

and what is RID2PASID/0? It would be clearer to point out that RID2PASID
is visible only within the iommu, to tag requests without a PASID, and that
this PASID value should never be sent to the device side.

> +  * than IOTLB invalidation where RID2PASID is also used for
> +  * tagging.

Then it would be obvious, because the IOTLB is an iommu-internal agent and
thus can use RID2PASID for tagging.

> +  */
> + xa_for_each(&info->pasids, pasid, pinfo) {
> + if (!pasid)

this should be compared to PASID_RID2PASID (though it's zero today).

> + __iommu_flush_dev_iotlb(info, addr, mask);
> + else
> + __iommu_flush_dev_piotlb(info, addr, pasid, mask);
> + }
> + }
>   list_for_each_entry(sinfo, >subdevices, link_domain) {
>   info = get_domain_info(sinfo->pdev);
>   __iommu_flush_dev_iotlb(info, addr, mask);
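
Putting the two comments above together, something like this (untested)
is what I would expect:

	xa_for_each(&info->pasids, pasid, pinfo) {
		if (pasid == PASID_RID2PASID)
			__iommu_flush_dev_iotlb(info, addr, mask);
		else
			__iommu_flush_dev_iotlb_pasid(info, addr,
						      pasid, mask);
	}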

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v2 2/8] iommu: Add attach/detach_dev_pasid domain ops

2022-03-15 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Tuesday, March 15, 2022 1:07 PM
> 
> From: Lu Baolu 
> 
> An IOMMU domain represents an address space which can be attached by
> devices that perform DMA within a domain. However, for platforms with
> PASID capability the domain attachment needs to be handled at device+PASID
> level. There can be multiple PASIDs within a device and multiple devices
> attached to a given domain.
> This patch introduces a new IOMMU op which supports device, PASID, and
> IOMMU domain attachment. The immediate use case is for PASID capable
> devices to perform DMA under DMA APIs.
> 
> Signed-off-by: Lu Baolu 
> Signed-off-by: Jacob Pan 
> ---
>  include/linux/iommu.h | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 369f05c2a4e2..fde5b933dbe3 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -227,6 +227,8 @@ struct iommu_iotlb_gather {
>   * @aux_get_pasid: get the pasid given an aux-domain
>   * @sva_bind: Bind process address space to device
>   * @sva_unbind: Unbind process address space from device
> + * @attach_dev_pasid: attach an iommu domain to a pasid of a device
> + * @detach_dev_pasid: detach an iommu domain from a pasid of a device
>   * @sva_get_pasid: Get PASID associated to a SVA handle
>   * @page_response: handle page request response
>   * @cache_invalidate: invalidate translation caches
> @@ -296,6 +298,10 @@ struct iommu_ops {
>   struct iommu_sva *(*sva_bind)(struct device *dev, struct mm_struct *mm,
> void *drvdata);
>   void (*sva_unbind)(struct iommu_sva *handle);
> + int (*attach_dev_pasid)(struct iommu_domain *domain,
> + struct device *dev, ioasid_t id);
> + void (*detach_dev_pasid)(struct iommu_domain *domain,
> +  struct device *dev, ioasid_t id);
>   u32 (*sva_get_pasid)(struct iommu_sva *handle);
> 
>   int (*page_response)(struct device *dev,
> --
> 2.25.1

Probably this can be combined with patch 5 to demonstrate the usage
together with the definition of the new ops, followed by the vt-d
specific change.

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v2 0/8] Enable PASID for DMA API users

2022-03-15 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Tuesday, March 15, 2022 1:07 PM
> 
> Some modern accelerators such as Intel's Data Streaming Accelerator (DSA)
> require a PASID in DMA requests to be operational. Specifically, the work
> submissions with ENQCMD on shared work queues require PASIDs. The use cases
> include both user DMA with shared virtual addressing (SVA) and in-kernel
> DMA similar to legacy DMA w/o PASID. Here we address the latter.
> 
> The DMA mapping API is the de facto standard for in-kernel DMA. However, it
> operates on a per-device or Requester ID (RID) basis, which is not
> PASID-aware. To leverage the DMA API for devices that rely on PASIDs, this
> patchset introduces the following APIs
> 
> 1. A driver-facing API that enables DMA API PASID usage:
> iommu_enable_pasid_dma(struct device *dev, ioasid_t *pasid);

Should this be called dma_enable_pasid() since it's about the DMA API? Doing
so also avoids the driver having to include iommu.h.
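
e.g. (sketch only):

	/* in dma-iommu.c, thin wrappers around the new iommu op */
	int dma_enable_pasid(struct device *dev, ioasid_t *pasid);
	void dma_disable_pasid(struct device *dev, ioasid_t pasid);

so the device driver only needs linux/dma-mapping.h.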

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v4 15/32] vfio: introduce KVM-owned IOMMU type

2022-03-15 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, March 15, 2022 7:18 AM
> 
> On Mon, Mar 14, 2022 at 04:50:33PM -0600, Alex Williamson wrote:
> 
> > > +/*
> > > + * The KVM_IOMMU type implies that the hypervisor will control the
> mappings
> > > + * rather than userspace
> > > + */
> > > +#define VFIO_KVM_IOMMU   11
> >
> > Then why is this hosted in the type1 code that exposes a wide variety
> > of userspace interfaces?  Thanks,
> 
> It is really badly named, this is the root level of a 2 stage nested
> IO page table, and this approach needed a special flag to distinguish
> the setup from the normal iommu_domain.
> 
> If we do try to stick this into VFIO it should probably use the
> VFIO_TYPE1_NESTING_IOMMU instead - however, we would like to delete
> that flag entirely as it was never fully implemented, was never used,
> and isn't part of what we are proposing for IOMMU nesting on ARM
> anyhow. (So far I've found nobody to explain what the plan here was..)
> 
> This is why I said the second level should be an explicit iommu_domain
> all on its own that is explicitly coupled to the KVM to read the page
> tables, if necessary.
> 
> But I'm not sure that reading the userspace io page tables with KVM is
> even the best thing to do - the iommu driver already has the pinned
> memory, it would be faster and more modular to traverse the io page
> tables through the pfns in the root iommu_domain than by having KVM do
> the translations. Lets see what Matthew says..
> 

Reading this thread, this sounds like an optimization of software nesting.
If that is the case, does it make more sense to complete the basic form
of software nesting first and then add this optimization?

The basic form would allow userspace to create a special domain
type which points to a user/guest page table (like hardware nesting)
but doesn't install the user page table into the IOMMU hardware (unlike
hardware nesting). When receiving an invalidate cmd from userspace,
the iommu driver walks the user page table (1st-level) and the parent
page table (2nd-level) to generate a shadow mapping for the
invalidated range in the non-nested hardware page table of this
special domain type, as sketched below.
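
In rough pseudo-code (all helpers hypothetical), the invalidate path of
such a special domain would then be:

	/* on an invalidate cmd for [iova, iova + size) from userspace */
	for_each_user_pte(domain->user_pgtable, iova, size, pte) {
		phys = walk_parent_pgtable(domain->parent, pte.gpa);
		shadow_map(domain->hwpt, pte.iova, phys, pte.prot);
	}

i.e. the two levels are shadowed into the single non-nested hardware
page table of the domain.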

Once that works, what this series does just changes how the invalidate
cmd is triggered. Previously the iommu driver received the invalidate cmd
from Qemu (via the iommufd uAPI), while now it receives the cmd from kvm
(via an iommufd kAPI) upon interception of RPCIT. From this angle, once
the connection between iommufd and the kvm fd is established, there is
even no direct talk between the iommu driver and kvm.

Thanks
Kevin

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 13/15] swiotlb: merge swiotlb-xen initialization into swiotlb

2022-03-15 Thread Christoph Hellwig
On Mon, Mar 14, 2022 at 07:07:44PM -0400, Boris Ostrovsky wrote:
>> +swiotlb_init_remap(true, x86_swiotlb_flags, xen_swiotlb_fixup);
>
>
> I think we need to have SWIOTLB_ANY set in x86_swiotlb_flags here.

Yes.

> Notice that we don't do remap() after final update to nslabs. We should.

Indeed.  I've pushed the fixed patches to the git tree and attached
the patches 12, 13 and 14 in case that is easier:
From 6d72b98620281984ae778659cedeb369e82af8d8 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig 
Date: Mon, 14 Mar 2022 08:02:57 +0100
Subject: swiotlb: provide swiotlb_init variants that remap the buffer

To share more code between swiotlb and xen-swiotlb, offer a
swiotlb_init_remap interface and add a remap callback to
swiotlb_init_late that will allow Xen to remap the buffer
without duplicating much of the logic.

Signed-off-by: Christoph Hellwig 
---
 arch/x86/pci/sta2x11-fixup.c |  2 +-
 include/linux/swiotlb.h  |  5 -
 kernel/dma/swiotlb.c | 36 +---
 3 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/arch/x86/pci/sta2x11-fixup.c b/arch/x86/pci/sta2x11-fixup.c
index c7e6faf59a861..7368afc039987 100644
--- a/arch/x86/pci/sta2x11-fixup.c
+++ b/arch/x86/pci/sta2x11-fixup.c
@@ -57,7 +57,7 @@ static void sta2x11_new_instance(struct pci_dev *pdev)
 		int size = STA2X11_SWIOTLB_SIZE;
 		/* First instance: register your own swiotlb area */
 		dev_info(&pdev->dev, "Using SWIOTLB (size %i)\n", size);
-		if (swiotlb_init_late(size, GFP_DMA))
+		if (swiotlb_init_late(size, GFP_DMA, NULL))
 			dev_emerg(&pdev->dev, "init swiotlb failed\n");
 	}
 	list_add(&instance->list, &sta2x11_instance_list);
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index ee655f2e4d28b..7b50c82f84ce9 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -36,8 +36,11 @@ struct scatterlist;
 
 int swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, unsigned int flags);
 unsigned long swiotlb_size_or_default(void);
+void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
+	int (*remap)(void *tlb, unsigned long nslabs));
+int swiotlb_init_late(size_t size, gfp_t gfp_mask,
+	int (*remap)(void *tlb, unsigned long nslabs));
 extern int swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs);
-int swiotlb_init_late(size_t size, gfp_t gfp_mask);
 extern void __init swiotlb_update_mem_attributes(void);
 
 phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t phys,
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 79641c446d284..c37fd3d1c97f7 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -256,9 +256,11 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs,
  * Statically reserve bounce buffer space and initialize bounce buffer data
  * structures for the software IO TLB used to implement the DMA API.
  */
-void __init swiotlb_init(bool addressing_limit, unsigned int flags)
+void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
+		int (*remap)(void *tlb, unsigned long nslabs))
 {
-	size_t bytes = PAGE_ALIGN(default_nslabs << IO_TLB_SHIFT);
+	unsigned long nslabs = default_nslabs;
+	size_t bytes;
 	void *tlb;
 
 	if (!addressing_limit && !swiotlb_force_bounce)
@@ -271,12 +273,23 @@ void __init swiotlb_init(bool addressing_limit, unsigned int flags)
 	 * allow to pick a location everywhere for hypervisors with guest
 	 * memory encryption.
 	 */
+retry:
+	bytes = PAGE_ALIGN(nslabs << IO_TLB_SHIFT);
 	if (flags & SWIOTLB_ANY)
 		tlb = memblock_alloc(bytes, PAGE_SIZE);
 	else
 		tlb = memblock_alloc_low(bytes, PAGE_SIZE);
 	if (!tlb)
 		goto fail;
+	if (remap && remap(tlb, nslabs) < 0) {
+		memblock_free(tlb, PAGE_ALIGN(bytes));
+
+		if (nslabs <= IO_TLB_MIN_SLABS)
+			panic("%s: Failed to remap %zu bytes\n",
+			  __func__, bytes);
+		nslabs = max(1024UL, ALIGN(nslabs >> 1, IO_TLB_SEGSIZE));
+		goto retry;
+	}
 	if (swiotlb_init_with_tbl(tlb, default_nslabs, flags))
 		goto fail_free_mem;
 	return;
@@ -287,12 +300,18 @@ void __init swiotlb_init(bool addressing_limit, unsigned int flags)
 	pr_warn("Cannot allocate buffer");
 }
 
+void __init swiotlb_init(bool addressing_limit, unsigned int flags)
+{
+	return swiotlb_init_remap(addressing_limit, flags, NULL);
+}
+
 /*
  * Systems with larger DMA zones (those that don't support ISA) can
  * initialize the swiotlb later using the slab allocator if needed.
  * This should be just like above, but with some error catching.
  */
-int swiotlb_init_late(size_t size, gfp_t gfp_mask)
+int swiotlb_init_late(size_t size, gfp_t gfp_mask,
+		int (*remap)(void *tlb, unsigned long nslabs))
 {
 	unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
 	unsigned long bytes;
@@ -303,6 +322,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask)
 	if (swiotlb_force_disable)
 		return 0;
 
+retry:
 	order = get_order(nslabs << IO_TLB_SHIFT);
 	nslabs = SLABS_PER_PAGE << order;
 	bytes = nslabs << IO_TLB_SHIFT;
@@ -323,6 

Re: [PATCH 12/15] swiotlb: provide swiotlb_init variants that remap the buffer

2022-03-15 Thread Christoph Hellwig
On Mon, Mar 14, 2022 at 06:39:21PM -0400, Boris Ostrovsky wrote:
> This is IO_TLB_MIN_SLABS, isn't it? (Xen code didn't say so but that's what 
> it meant to say I believe)

Yes, that makes much more sense.  I've switched the patch to use
IO_TLB_MIN_SLABS and drop the 2MB comment in both places.
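
With that, a remapping user ends up doing roughly (sketch, the my_* names
are made up - xen-swiotlb being the real user):

	static int my_swiotlb_remap(void *tlb, unsigned long nslabs)
	{
		/* e.g. make the buffer visible to the hypervisor */
		return my_hypervisor_fixup(tlb, nslabs << IO_TLB_SHIFT);
	}

	swiotlb_init_remap(true, SWIOTLB_ANY, my_swiotlb_remap);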

Can I get a review with that fixed up?

---
From 153085bf3e6e69d676bef0fb96395a86fb8122f5 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig 
Date: Mon, 14 Mar 2022 08:02:57 +0100
Subject: swiotlb: provide swiotlb_init variants that remap the buffer

To share more code between swiotlb and xen-swiotlb, offer a
swiotlb_init_remap interface and add a remap callback to
swiotlb_init_late that will allow Xen to remap the buffer
without duplicating much of the logic.

Signed-off-by: Christoph Hellwig 
---
 arch/x86/pci/sta2x11-fixup.c |  2 +-
 include/linux/swiotlb.h  |  5 -
 kernel/dma/swiotlb.c | 36 +---
 3 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/arch/x86/pci/sta2x11-fixup.c b/arch/x86/pci/sta2x11-fixup.c
index c7e6faf59a861..7368afc039987 100644
--- a/arch/x86/pci/sta2x11-fixup.c
+++ b/arch/x86/pci/sta2x11-fixup.c
@@ -57,7 +57,7 @@ static void sta2x11_new_instance(struct pci_dev *pdev)
int size = STA2X11_SWIOTLB_SIZE;
/* First instance: register your own swiotlb area */
dev_info(&pdev->dev, "Using SWIOTLB (size %i)\n", size);
-   if (swiotlb_init_late(size, GFP_DMA))
+   if (swiotlb_init_late(size, GFP_DMA, NULL))
dev_emerg(&pdev->dev, "init swiotlb failed\n");
}
list_add(&instance->list, &sta2x11_instance_list);
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index ee655f2e4d28b..7b50c82f84ce9 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -36,8 +36,11 @@ struct scatterlist;
 
 int swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, unsigned int flags);
 unsigned long swiotlb_size_or_default(void);
+void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
+   int (*remap)(void *tlb, unsigned long nslabs));
+int swiotlb_init_late(size_t size, gfp_t gfp_mask,
+   int (*remap)(void *tlb, unsigned long nslabs));
 extern int swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs);
-int swiotlb_init_late(size_t size, gfp_t gfp_mask);
 extern void __init swiotlb_update_mem_attributes(void);
 
 phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t phys,
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 79641c446d284..b3d4f24fb5f5e 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -256,9 +256,11 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs,
  * Statically reserve bounce buffer space and initialize bounce buffer data
  * structures for the software IO TLB used to implement the DMA API.
  */
-void __init swiotlb_init(bool addressing_limit, unsigned int flags)
+void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
+   int (*remap)(void *tlb, unsigned long nslabs))
 {
-   size_t bytes = PAGE_ALIGN(default_nslabs << IO_TLB_SHIFT);
+   unsigned long nslabs = default_nslabs;
+   size_t bytes;
void *tlb;
 
if (!addressing_limit && !swiotlb_force_bounce)
@@ -271,12 +273,23 @@ void __init swiotlb_init(bool addressing_limit, unsigned int flags)
 * allow to pick a location everywhere for hypervisors with guest
 * memory encryption.
 */
+retry:
+   bytes = PAGE_ALIGN(nslabs << IO_TLB_SHIFT);
if (flags & SWIOTLB_ANY)
tlb = memblock_alloc(bytes, PAGE_SIZE);
else
tlb = memblock_alloc_low(bytes, PAGE_SIZE);
if (!tlb)
goto fail;
+   if (remap && remap(tlb, nslabs) < 0) {
+   memblock_free(tlb, PAGE_ALIGN(bytes));
+
+   if (nslabs <= IO_TLB_MIN_SLABS)
+   panic("%s: Failed to remap %zu bytes\n",
+ __func__, bytes);
+   nslabs = max(1024UL, ALIGN(nslabs >> 1, IO_TLB_SEGSIZE));
+   goto retry;
+   }
if (swiotlb_init_with_tbl(tlb, default_nslabs, flags))
goto fail_free_mem;
return;
@@ -287,12 +300,18 @@ void __init swiotlb_init(bool addressing_limit, unsigned int flags)
pr_warn("Cannot allocate buffer");
 }
 
+void __init swiotlb_init(bool addressing_limit, unsigned int flags)
+{
+   return swiotlb_init_remap(addressing_limit, flags, NULL);
+}
+
 /*
  * Systems with larger DMA zones (those that don't support ISA) can
  * initialize the swiotlb later using the slab allocator if needed.
  * This should be just like above, but with some error catching.
  */
-int swiotlb_init_late(size_t size, gfp_t gfp_mask)
+int swiotlb_init_late(size_t size, gfp_t gfp_mask,
+   int (*remap)(void *tlb, unsigned long nslabs))
 {
unsigned