RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

2020-04-06 Thread Tian, Kevin
> From: Alex Williamson
> Sent: Saturday, April 4, 2020 1:50 AM
[...]
> > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > > index 9e843a1..298ac80 100644
> > > > --- a/include/uapi/linux/vfio.h
> > > > +++ b/include/uapi/linux/vfio.h
> > > > @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
> > > >  #define VFIO_IOMMU_ENABLE  _IO(VFIO_TYPE, VFIO_BASE + 15)
> > > >  #define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16)
> > > >
> > > > +/*
> > > > + * PASID (Process Address Space ID) is a PCIe concept which
> > > > + * has been extended to support fine-grained DMA isolation.
> > > > + * With device assigned to user space (e.g. VMs), PASID alloc
> > > > + * and free need to be system wide. This structure defines
> > > > + * the info for pasid alloc/free between user space and kernel
> > > > + * space.
> > > > + *
> > > > + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> > > > + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> > > > + */
> > > > +struct vfio_iommu_type1_pasid_request {
> > > > +   __u32   argsz;
> > > > +#define VFIO_IOMMU_PASID_ALLOC (1 << 0)
> > > > +#define VFIO_IOMMU_PASID_FREE  (1 << 1)
> > > > +   __u32   flags;
> > > > +   union {
> > > > +   struct {
> > > > +   __u32 min;
> > > > +   __u32 max;
> > > > +   __u32 result;
> > > > +   } alloc_pasid;
> > > > +   __u32 free_pasid;
> > > > +   };
> > >
> > > We seem to be using __u8 data[] lately where the struct at data is
> > > defined by the flags.  should we do that here?
> >
> > yeah, I can do that. BTW, do you want the structure in the later
> > patch to share the same structure with this one? As far as I can
> > foresee, the two structures would look similar, as both of them include
> > argsz, flags and data[] fields. The difference is the definition of
> > flags. What's your opinion?
> >
> > struct vfio_iommu_type1_pasid_request {
> > __u32   argsz;
> > #define VFIO_IOMMU_PASID_ALLOC  (1 << 0)
> > #define VFIO_IOMMU_PASID_FREE   (1 << 1)
> > __u32   flags;
> > __u8data[];
> > };
> >
> > struct vfio_iommu_type1_bind {
> > __u32   argsz;
> > __u32   flags;
> > #define VFIO_IOMMU_BIND_GUEST_PGTBL (1 << 0)
> > #define VFIO_IOMMU_UNBIND_GUEST_PGTBL   (1 << 1)
> > __u8data[];
> > };
> 
> 
> Yes, I was even wondering the same for the cache invalidate ioctl, or
> whether this is going too far for a general purpose "everything related
> to PASIDs" ioctl.  We need to factor usability into the equation too.
> I'd be interested in opinions from others here too.  Clearly I don't
> like single use, throw-away ioctls, but I can find myself on either
> side of the argument that allocation, binding, and invalidating are all
> within the domain of PASIDs and could fall within a single ioctl or
> they each represent different facets of managing PASIDs and should have
> separate ioctls.  Thanks,
> 

Looking at uapi/linux/iommu.h:

* Invalidations by %IOMMU_INV_GRANU_DOMAIN don't take any argument other than
 * @version and @cache.

Although intel-iommu handles only PASID-related invalidation now, I
suppose other vendors (or future usages?) may also allow non-PASID
based invalidation, based on the above comment.

Thanks
Kevin
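
As a rough illustration of the argsz/flags/data[] layout being discussed
above, a user-space caller might look like the sketch below. It is only a
sketch of the proposal in this thread, not merged uAPI: the payload struct,
the helper name, and the availability of VFIO_IOMMU_PASID_REQUEST in
<linux/vfio.h> are all assumptions.

/*
 * Hedged sketch: follows the argsz/flags/data[] layout proposed in this
 * thread.  The payload struct, helper name, and header definitions are
 * illustrative assumptions, not a merged uAPI.
 */
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/vfio.h>

struct pasid_alloc_payload {		/* hypothetical payload selected by flags */
	__u32 min;
	__u32 max;
	__u32 result;
};

static int vfio_alloc_pasid(int container_fd, __u32 min, __u32 max)
{
	size_t sz = sizeof(struct vfio_iommu_type1_pasid_request) +
		    sizeof(struct pasid_alloc_payload);
	struct vfio_iommu_type1_pasid_request *req = calloc(1, sz);
	struct pasid_alloc_payload *payload;
	int ret;

	if (!req)
		return -1;

	payload = (struct pasid_alloc_payload *)req->data;
	req->argsz = sz;
	req->flags = VFIO_IOMMU_PASID_ALLOC;	/* flags select how data[] is parsed */
	payload->min = min;
	payload->max = max;

	ret = ioctl(container_fd, VFIO_IOMMU_PASID_REQUEST, req);
	if (!ret)
		ret = payload->result;		/* allocated PASID on success, per the proposal */

	free(req);
	return ret;
}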


RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

2020-04-06 Thread Tian, Kevin
> From: Alex Williamson
> Sent: Friday, April 3, 2020 11:14 PM
> 
> On Fri, 3 Apr 2020 05:58:55 +
> "Tian, Kevin"  wrote:
> 
> > > From: Alex Williamson 
> > > Sent: Friday, April 3, 2020 1:50 AM
> > >
> > > On Sun, 22 Mar 2020 05:31:58 -0700
> > > "Liu, Yi L"  wrote:
> > >
> > > > From: Liu Yi L 
> > > >
> > > > For a long time, devices have had only one DMA address space from the
> > > > platform IOMMU's point of view. This is true for both bare metal and
> > > > direct device assignment in a virtualization environment. The reason is
> > > > that the source ID of a DMA in PCIe is the BDF (bus/dev/fn ID), which
> > > > results in only device-granularity DMA isolation. However, this is
> > > > changing with the latest advancements in the I/O technology area. More
> > > > and more platform vendors are utilizing the PCIe PASID TLP prefix in
> > > > DMA requests, thus giving devices multiple DMA address spaces as
> > > > identified by their individual PASIDs. For example, Shared Virtual
> > > > Addressing (SVA, a.k.a Shared Virtual Memory) is able to let a device
> > > > access multiple process virtual address spaces by binding each virtual
> > > > address space with a PASID, wherein the PASID is allocated in software
> > > > and programmed to the device in a device-specific manner. Devices which
> > > > support the PASID capability are called PASID-capable devices. If such
> > > > devices are passed through to VMs, guest software is also able to bind
> > > > guest process virtual address spaces on such devices. Therefore, the
> > > > guest software could reuse the bare metal software programming model,
> > > > which means guest software will also allocate a PASID and program it to
> > > > the device directly. This is a dangerous situation since it has
> > > > potential PASID conflicts and unauthorized address space access. It
> > > > would be safer to let the host intercept the guest software's PASID
> > > > allocation. Thus PASIDs are managed system-wide.
> > >
> > > Providing an allocation interface only allows for collaborative usage
> > > of PASIDs though.  Do we have any ability to enforce PASID usage or can
> > > a user spoof other PASIDs on the same BDF?
> >
> > A user can access only PASIDs allocated to itself, i.e. the specific IOASID
> > set tied to its mm_struct.
> 
> A user is only _supposed_ to access PASIDs allocated to itself.  AIUI
> the mm_struct is used for managing the pool of IOASIDs from which the
> user may allocate that PASID.  We also state that programming the PASID
> into the device is device specific.  Therefore, are we simply trusting
> the user to use a PASID that's been allocated to them when they program
> the device?  If a user can program an arbitrary PASID into the device,
> then what prevents them from attempting to access data from another
> user via the device?   I think I've asked this question before, so if
> there's a previous explanation or spec section I need to review, please
> point me to it.  Thanks,
> 

There are two scenarios:

(1) for PF/VF, the iommu driver maintains an individual PASID table per
BDF. Although the PASID namespace is global, the per-BDF PASID table
contains valid entries only for those PASIDs which are allocated to the
mm_struct. The user is free to program an arbitrary PASID into the assigned
device, but using an invalid PASID simply hits an iommu fault.

(2) for mdev, multiple mdev instances share the same PASID table of
the parent BDF. However, PASID programming is a privileged operation
in the multiplexing usage, and thus must be mediated by the mdev device
driver. The mediation logic guarantees that only allocated PASIDs are
forwarded to the device.

Thanks
Kevin
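
As a rough illustration of the mediation described in (2), an mdev parent
driver intercepting the guest's PASID programming might validate the value
against the PASIDs allocated to that user before touching hardware. This is
only a sketch: the struct, the register offset, and the
pasid_is_allocated_to_user() helper are made-up names, not an existing
kernel API.

/* Hedged sketch of the mediation idea in (2); all names are illustrative. */
#include <linux/errno.h>
#include <linux/io.h>
#include <linux/types.h>

struct mdev_state {
	void __iomem *parent_mmio;	/* parent device's register window */
	int ioasid_set;			/* set of PASIDs allocated to this user */
};

/* Hypothetical helper: true if @pasid was allocated from @set. */
bool pasid_is_allocated_to_user(int set, u32 pasid);

#define MDEV_PASID_REG_OFFSET	0x40	/* made-up register offset */

static int mdev_write_pasid_reg(struct mdev_state *mstate, u32 guest_pasid)
{
	/* Only forward PASIDs that were actually allocated to this user. */
	if (!pasid_is_allocated_to_user(mstate->ioasid_set, guest_pasid))
		return -EINVAL;		/* reject spoofed/foreign PASIDs */

	/* Program the validated PASID into the parent device's context. */
	writel(guest_pasid, mstate->parent_mmio + MDEV_PASID_REG_OFFSET);
	return 0;
}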


RE: [PATCH v1 2/2] vfio/pci: Emulate PASID/PRI capability for VFs

2020-04-06 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Saturday, April 4, 2020 1:26 AM
[...]
> > > > +   if (!pasid_cap.control_reg.paside) {
> > > > +   pr_debug("%s: its PF's PASID capability is not enabled\n",
> > > > +   dev_name(&vdev->pdev->dev));
> > > > +   ret = 0;
> > > > +   goto out;
> > > > +   }
> > >
> > > What happens if the PF's PASID gets disabled while we're using it??
> >
> > This is actually the open I highlighted in cover letter. Per the reply
> > from Baolu, this seems to be an open for bare-metal all the same.
> > https://lkml.org/lkml/2020/3/31/95
> 
> Seems that needs to get sorted out before we can expose this.  Maybe
> some sort of registration with the PF driver that PASID is being used
> by a VF so it cannot be disabled?

I guess we may do vSVA for PF first, and then add VF vSVA later
given the additional need above. It's not necessary to enable both
in one step.

[...]
> > > > @@ -1604,6 +1901,18 @@ static int vfio_ecap_init(struct
> vfio_pci_device *vdev)
> > > > if (!ecaps)
> > > > *(u32 *)>vconfig[PCI_CFG_SPACE_SIZE] = 0;
> > > >
> > > > +#ifdef CONFIG_PCI_ATS
> > > > +   if (pdev->is_virtfn) {
> > > > +   struct pci_dev *physfn = pdev->physfn;
> > > > +
> > > > +   ret = vfio_pci_add_emulated_cap_for_vf(vdev,
> > > > +   physfn, epos_max, prev);
> > > > +   if (ret)
> > > > +   pr_info("%s, failed to add special caps for VF %s\n",
> > > > +   __func__, dev_name(&vdev->pdev->dev));
> > > > +   }
> > > > +#endif
> > >
> > > I can only imagine that we should place the caps at the same location
> > > they exist on the PF, we don't know what hidden registers might be
> > > hiding in config space.

Is there a vendor guarantee that hidden registers are located at the
same offsets in PF and VF config space?

> >
> > but we are not sure whether the same location is available on VF. In
> > this patch, it actually places the emulated cap physically behind the
> > cap which lies farthest (its offset is largest) within the VF's config space
> > as the PCIe caps are linked in a chain.
> 
> But, as we've found on Broadcom NICs (iirc), hardware developers have a
> nasty habit of hiding random registers in PCI config space, outside of
> defined capabilities.  I feel like IGD might even do this too, is that
> true?  So I don't think we can guarantee that just because a section of
> config space isn't part of a defined capability that it's unused.  It
> only means that it's unused by common code, but it might have device
> specific purposes.  So if the PCIe spec indicates that VFs cannot
> include these capabilities and virtualization software needs to
> emulate them, we need somewhere safe to place them in config space, and
> simply placing them off the end of known capabilities doesn't give me
> any confidence.  Also, hardware has no requirement to make compact use
> of extended config space.  The first capability must be at 0x100, the
> very next capability could consume all the way to the last byte of the
> 4K extended range, and the next link in the chain could be somewhere in
> the middle.  Thanks,
> 

Then what would be a viable option? The vendors' nasty habit implies
no standard, thus I don't see how VFIO can find a safe location
by itself. I'm also curious how those hidden registers are identified
by VFIO and handled with a proper r/w policy today. If some sort of quirk
is used, then could that quirk mechanism be extended to also carry
the information about a vendor-specific safe location? When no
such quirk info is provided (the majority of cases), VFIO would then find
a free location to carry the new cap.

Thanks
Kevin
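
To make the "find a free location" idea above concrete, here is a sketch of
scanning a VF's extended capability chain and returning the offset just past
the farthest capability, which mirrors the placement strategy described in
the patch. It is only a sketch: it assumes a minimal 4-byte header per
capability and, as Alex points out, it cannot detect vendor-hidden registers.

/*
 * Hedged sketch: walk the PCIe extended capability chain of a VF and return
 * the offset just past the highest-offset capability seen, as a candidate
 * location for an emulated cap.  It cannot detect hidden vendor registers.
 */
#include <linux/pci.h>
#include <linux/pci_regs.h>

static u16 vf_find_offset_after_last_ecap(struct pci_dev *vf)
{
	u16 pos = PCI_CFG_SPACE_SIZE;	/* extended caps start at 0x100 */
	u16 end = PCI_CFG_SPACE_SIZE;
	int loops = (PCI_CFG_SPACE_EXP_SIZE - PCI_CFG_SPACE_SIZE) / 8;
	u32 header;

	while (pos && loops--) {
		if (pci_read_config_dword(vf, pos, &header) || !header)
			break;
		/*
		 * Track the end of the farthest capability seen so far,
		 * assuming a minimal 4-byte header when the size is unknown.
		 */
		if (pos + 4 > end)
			end = pos + 4;
		pos = PCI_EXT_CAP_NEXT(header);
	}

	return end;	/* candidate offset; caller must still align/validate it */
}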


[RFC PATCH] iommu/amd: fix a race in fetch_pte()

2020-04-06 Thread Qian Cai
fetch_pte() could race with increase_address_space() because it holds no
lock when called from iommu_unmap_page(). The CPU that runs fetch_pte()
could see a stale domain->pt_root together with the new, increased
domain->mode from increase_address_space(). As a result, it could trigger
invalid accesses later on. Fix it by using a pair of smp_[w|r]mb() in those
places.

 kernel BUG at drivers/iommu/amd_iommu.c:1704!
 BUG_ON(unmapped && !is_power_of_2(unmapped));
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 
07/10/2019
 RIP: 0010:amd_iommu_unmap+0x1b2/0x1d0
 Call Trace:
  
  __iommu_unmap+0x106/0x320
  iommu_unmap_fast+0xe/0x10
  __iommu_dma_unmap+0xdc/0x1a0
  iommu_dma_unmap_sg+0xae/0xd0
  scsi_dma_unmap+0xe7/0x150
  pqi_raid_io_complete+0x37/0x60 [smartpqi]
  pqi_irq_handler+0x1fc/0x13f0 [smartpqi]
  __handle_irq_event_percpu+0x78/0x4f0
  handle_irq_event_percpu+0x70/0x100
  handle_irq_event+0x5a/0x8b
  handle_edge_irq+0x10c/0x370
  do_IRQ+0x9e/0x1e0
  common_interrupt+0xf/0xf
  

Signed-off-by: Qian Cai 
---
 drivers/iommu/amd_iommu.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 20cce366e951..22328a23335f 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1434,6 +1434,11 @@ static bool increase_address_space(struct 
protection_domain *domain,
*pte = PM_LEVEL_PDE(domain->mode,
iommu_virt_to_phys(domain->pt_root));
domain->pt_root  = pte;
+   /*
+* Make sure fetch_pte() will see the new domain->pt_root before it
+* snapshots domain->mode.
+*/
+   smp_wmb();
domain->mode+= 1;
 
ret = true;
@@ -1460,6 +1465,8 @@ static u64 *alloc_pte(struct protection_domain *domain,
*updated = increase_address_space(domain, address, gfp) || 
*updated;
 
level   = domain->mode - 1;
+   /* To pair with smp_wmb() in increase_address_space(). */
+   smp_rmb();
pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
address = PAGE_SIZE_ALIGN(address, page_size);
end_lvl = PAGE_SIZE_LEVEL(page_size);
@@ -1545,6 +1552,8 @@ static u64 *fetch_pte(struct protection_domain *domain,
return NULL;
 
level  =  domain->mode - 1;
+   /* To pair with smp_wmb() in increase_address_space(). */
+   smp_rmb();
pte= &domain->pt_root[PM_LEVEL_INDEX(level, address)];
*page_size =  PTE_LEVEL_PAGE_SIZE(level);
 
-- 
2.21.0 (Apple Git-122.2)
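
For readers following the barrier pairing, here is a rough timeline of the
interleaving the patch closes; it is an illustration derived from the commit
message, not code from the patch.

/*
 * Hedged illustration (not kernel code): without the smp_wmb()/smp_rmb()
 * pair, the reader may observe the writer's two stores out of order.
 *
 *   CPU0: increase_address_space()       CPU1: fetch_pte()
 *   --------------------------------     --------------------------------
 *   domain->pt_root = pte;  (new table)
 *   smp_wmb();              <-- added
 *   domain->mode += 1;      (new level)
 *                                        level = domain->mode - 1; (sees new)
 *                                        smp_rmb();        <-- added
 *                                        pte = &domain->pt_root[...];
 *                                        (with the barriers, a CPU that saw
 *                                         the new mode is guaranteed to also
 *                                         see the new pt_root)
 */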



Re: [PATCH 00/19] [PULL REQUEST] iommu/vt-d: patches for v5.7

2020-04-06 Thread Lu Baolu

On 2020/4/6 21:36, Christoph Hellwig wrote:
> On Sun, Apr 05, 2020 at 04:30:34PM +0800, Lu Baolu wrote:
> > Hi Joerg,
> >
> > Below patches have been piled up for v5.7. They enable below
> > features:
>
> Err, this is not the time for 5.7 features that haven't been in
> linux-next before 5.6 was released.

Really, my bad. I will resend the request after 5.7-rc1 and target them
to 5.8. Thanks for the reminder.

Best regards,
baolu


[RFC/RFT][PATCH v2] dma-mapping: set default segment_boundary_mask to ULONG_MAX

2020-04-06 Thread Nicolin Chen
The default segment_boundary_mask was set to DMA_BIT_MASK(32)
a decade ago by referencing the SCSI/block subsystem, as a 32-bit
mask was good enough for most of the devices.

Now more and more drivers set dma_masks above DMA_BIT_MASK(32)
while only a handful of them call dma_set_seg_boundary(). This
means that most drivers have a 4GB segmentation boundary because
the DMA API returns a 32-bit default value, though they might not
really have such a limit.

The default segment_boundary_mask should mean "no limit" since
the device doesn't explicitly set the mask. But a 32-bit mask
certainly limits those devices capable of 32+ bits addressing.

So this patch sets the default segment_boundary_mask to ULONG_MAX.

Signed-off-by: Nicolin Chen 
---
Changelog:
v1->v2
 * Followed Robin's comments to revise the commit message by
   dropping one paragraph of not-entirely-true justification
   (no git-diff level change, so please ack if you tested v1)

 include/linux/dma-mapping.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 330ad58fbf4d..ff8cefe85f30 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -736,7 +736,7 @@ static inline unsigned long dma_get_seg_boundary(struct 
device *dev)
 {
if (dev->dma_parms && dev->dma_parms->segment_boundary_mask)
return dev->dma_parms->segment_boundary_mask;
-   return DMA_BIT_MASK(32);
+   return ULONG_MAX;
 }
 
 static inline int dma_set_seg_boundary(struct device *dev, unsigned long mask)
-- 
2.17.1
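
For context on the dma_set_seg_boundary() call mentioned in the commit
message, below is a rough sketch of how a driver with a real hardware
segment limit would declare it at probe time instead of relying on the
default; the "foo" device and the 64KB/48-bit limits are made up.

/*
 * Hedged sketch: a made-up driver whose DMA engine cannot cross 64KB
 * segment boundaries declares that explicitly rather than relying on the
 * default dma_get_seg_boundary() value discussed in this patch.
 */
#include <linux/dma-mapping.h>
#include <linux/platform_device.h>
#include <linux/sizes.h>

static int foo_probe(struct platform_device *pdev)
{
	struct device *dev = &pdev->dev;
	int ret;

	ret = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(48));
	if (ret)
		return ret;

	/* Hardware segments must not cross a 64KB boundary. */
	return dma_set_seg_boundary(dev, SZ_64K - 1);
}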



Re: [RFC/RFT][PATCH] dma-mapping: set default segment_boundary_mask to ULONG_MAX

2020-04-06 Thread Nicolin Chen
On Mon, Apr 06, 2020 at 02:48:13PM +0100, Robin Murphy wrote:
> On 2020-04-05 1:51 am, Nicolin Chen wrote:
> > The default segment_boundary_mask was set to DMA_BIT_MASK(32)
> > a decade ago by referencing the SCSI/block subsystem, as a 32-bit
> > mask was good enough for most of the devices.
> > 
> > Now more and more drivers set dma_masks above DMA_BIT_MASK(32)
> > while only a handful of them call dma_set_seg_boundary(). This
> > means that most drivers have a 4GB segmentation boundary because
> > the DMA API returns a 32-bit default value, though they might not
> > really have such a limit.
> > 
> > The default segment_boundary_mask should mean "no limit" since
> > the device doesn't explicitly set the mask. But a 32-bit mask
> > certainly limits those devices capable of 32+ bits addressing.
> > 
> > And this 32-bit boundary mask might result in a situation that
> > when dma-iommu maps a DMA buffer (size > 4GB), iommu_map_sg()
> > cuts the IOVA region into discontiguous pieces, and creates a
> > faulty IOVA mapping that overlaps some physical memory outside
> > the scatter list, which might lead to some random kernel panic
> > after DMA overwrites that faulty IOVA space.
> 
> Once again, get rid of this paragraph - it doesn't have much to do with the
> *default* value since it describes a behaviour general to any boundary mask.
> Plus it effectively says "if a driver uses a DMA-mapped scatterlist
> incorrectly, this change can help paper over the bug", which is rather the
> opposite of a good justification.

Np. Will drop it and resend.

> (for example most SATA devices end up with a 64KB boundary mask, such that
> padding the IOVAs to provide the appropriate alignment happens very
> frequently, and they've been working just fine for years now)

Okay.

Thanks!


Re: [PATCH 06/10] iommu/ioasid: Convert to set aware allocations

2020-04-06 Thread Jacob Pan
On Sat, 28 Mar 2020 06:40:58 +
"Tian, Kevin"  wrote:

> > From: Jacob Pan 
> > Sent: Saturday, March 28, 2020 1:42 AM
> > 
> > On Fri, 27 Mar 2020 09:54:11 +
> > "Tian, Kevin"  wrote:
> >   
> > > > From: Jacob Pan 
> > > > Sent: Thursday, March 26, 2020 1:55 AM
> > > >
> > > > The current ioasid_alloc function takes a token/ioasid_set then
> > > > record it on the IOASID being allocated. There is no alloc/free
> > > > on the ioasid_set.
> > > >
> > > > With the IOASID set APIs, callers must allocate an ioasid_set
> > > > before allocate IOASIDs within the set. Quota and other
> > > > ioasid_set level activities can then be enforced.
> > > >
> > > > This patch converts existing API to the new ioasid_set model.
> > > >
> > > > Signed-off-by: Liu Yi L 
> > > > Signed-off-by: Jacob Pan 
> > > > ---
> > > >  drivers/iommu/intel-iommu.c | 10 +++---
> > > >  drivers/iommu/intel-svm.c   | 10 +++---
> > > >  drivers/iommu/ioasid.c  | 78
> > > > +--- -
> > > >  include/linux/ioasid.h  | 11 +++
> > > >  4 files changed, 72 insertions(+), 37 deletions(-)
> > > >
> > > > diff --git a/drivers/iommu/intel-iommu.c
> > > > b/drivers/iommu/intel-iommu.c index af7a1ef7b31e..c571cc8d9e57
> > > > 100644 --- a/drivers/iommu/intel-iommu.c
> > > > +++ b/drivers/iommu/intel-iommu.c
> > > > @@ -3323,11 +3323,11 @@ static void intel_ioasid_free(ioasid_t
> > > > ioasid, void *data)
> > > > if (!iommu)
> > > > return;
> > > > /*
> > > > -* Sanity check the ioasid owner is done at upper
> > > > layer, e.g. VFIO
> > > > -* We can only free the PASID when all the devices are
> > > > unbound.
> > > > +* In the guest, all IOASIDs belong to the
> > > > system_ioasid set.
> > > > +* Sanity check against the system set.  
> > >
> > > below code has nothing to deal with guest, then why putting the
> > > comment specifically for guest?
> > >  
> > intel_ioasid_alloc/free() is the custom IOASID allocator only
> > registered when running in the guest.  
> 
> in that case maybe renaming the functions to
> intel_guest_ioasid_alloc/free would avoid confusion similar to what I had?
> 
Sounds good.
> > 
> > The custom allocator calls the virtual command interface. Since we don't
> > support nested guests, all IOASIDs belong to the system ioasid_set.
> 
> could you note the lack of nested guest support in the comment, so that
> later, when people want to add nested support, they will know some
> additional work is required here?
> 
will do.

> >   
> > > >  */
> > > > -   if (ioasid_find(NULL, ioasid, NULL)) {
> > > > -   pr_alert("Cannot free active IOASID %d\n", ioasid);
> > > > +   if (IS_ERR(ioasid_find(system_ioasid_sid, ioasid, NULL))) {
> > > > +   pr_err("Cannot free IOASID %d, not in system set\n", ioasid);
> > > > return;
> > > > }
> > > > vcmd_free_pasid(iommu, ioasid);
> > > > @@ -5541,7 +5541,7 @@ static int aux_domain_add_dev(struct
> > > > dmar_domain *domain,
> > > > int pasid;
> > > >
> > > > /* No private data needed for the default
> > > > pasid */
> > > > -   pasid = ioasid_alloc(NULL, PASID_MIN,
> > > > +   pasid = ioasid_alloc(system_ioasid_sid,
> > > > PASID_MIN, pci_max_pasids(to_pci_dev(dev))
> > > > - 1, NULL);
> > > > if (pasid == INVALID_IOASID) {
> > > > diff --git a/drivers/iommu/intel-svm.c
> > > > b/drivers/iommu/intel-svm.c index 1991587fd3fd..f511855d187b
> > > > 100644 --- a/drivers/iommu/intel-svm.c
> > > > +++ b/drivers/iommu/intel-svm.c
> > > > @@ -268,7 +268,7 @@ int intel_svm_bind_gpasid(struct
> > > > iommu_domain *domain,
> > > > }
> > > >
> > > > mutex_lock(&pasid_mutex);
> > > > -   svm = ioasid_find(NULL, data->hpasid, NULL);
> > > > +   svm = ioasid_find(INVALID_IOASID_SET, data->hpasid,
> > > > NULL); if (IS_ERR(svm)) {
> > > > ret = PTR_ERR(svm);
> > > > goto out;
> > > > @@ -401,7 +401,7 @@ int intel_svm_unbind_gpasid(struct device
> > > > *dev, int pasid)
> > > > return -EINVAL;
> > > >
> > > > mutex_lock(&pasid_mutex);
> > > > -   svm = ioasid_find(NULL, pasid, NULL);
> > > > +   svm = ioasid_find(INVALID_IOASID_SET, pasid, NULL);
> > > > if (!svm) {
> > > > ret = -EINVAL;
> > > > goto out;
> > > > @@ -559,7 +559,7 @@ static int intel_svm_bind_mm(struct device
> > > > *dev, int flags, struct svm_dev_ops *
> > > > pasid_max = intel_pasid_max_id;
> > > >
> > > > /* Do not use PASID 0, reserved for RID to
> > > > PASID */
> > > > -   svm->pasid = ioasid_alloc(NULL, PASID_MIN,
> > > > +   svm->pasid = ioasid_alloc(system_ioasid_sid,
> > > > PASID_MIN, pasid_max - 1, svm);
> > > > if (svm->pasid == INVALID_IOASID) {
> > > > kfree(svm);
> > > > @@ -642,7 +642,7 

Re: [PATCH 03/10] iommu/ioasid: Introduce per set allocation APIs

2020-04-06 Thread Jacob Pan
On Wed, 1 Apr 2020 15:47:45 +0200
Jean-Philippe Brucker  wrote:

> On Wed, Mar 25, 2020 at 10:55:24AM -0700, Jacob Pan wrote:
> > IOASID set defines a group of IDs that share the same token. The
> > ioasid_set concept helps to do permission checking among users as
> > in the current code.
> > 
> > With guest SVA usage, each VM has its own IOASID set. More
> > functionalities are needed:
> > 1. Enforce quota, each guest may be assigned limited quota such
> > that one guest cannot abuse all the system resource.
> > 2. Stores IOASID mapping between guest and host IOASIDs
> > 3. Per set operations, e.g. free the entire set
> > 
> > For each ioasid_set token, a unique set ID is assigned. This makes
> > reference of the set and data lookup much easier to implement.
> > 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Jacob Pan 
> > ---
> >  drivers/iommu/ioasid.c | 147
> > +
> > include/linux/ioasid.h |  13 + 2 files changed, 160
> > insertions(+)
> > 
> > diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> > index 4026e52855b9..27ee57f7079b 100644
> > --- a/drivers/iommu/ioasid.c
> > +++ b/drivers/iommu/ioasid.c
> > @@ -10,6 +10,25 @@
> >  #include 
> >  #include 
> >  
> > +static DEFINE_XARRAY_ALLOC(ioasid_sets);
> > +/**
> > + * struct ioasid_set_data - Meta data about ioasid_set
> > + *
> > + * @token: Unique to identify an IOASID set
> > + * @xa:        XArray to store subset ID and IOASID mapping
> > + * @size:  Max number of IOASIDs can be allocated within the
> > set
> > + * @nr_ioasids Number of IOASIDs allocated in the set
> > + * @sid        ID of the set
> > + */
> > +struct ioasid_set_data {
> > +   struct ioasid_set *token;
> > +   struct xarray xa;
> > +   int size;
> > +   int nr_ioasids;
> > +   int sid;
> > +   struct rcu_head rcu;
> > +};
> > +
> >  struct ioasid_data {
> > ioasid_t id;
> > struct ioasid_set *set;
> > @@ -388,6 +407,111 @@ void ioasid_free(ioasid_t ioasid)
> >  EXPORT_SYMBOL_GPL(ioasid_free);
> >  
> >  /**
> > + * ioasid_alloc_set - Allocate a set of IOASIDs
> > + * @token: Unique token of the IOASID set
> > + * @quota: Quota allowed in this set
> > + * @sid:   IOASID set ID to be assigned
> > + *
> > + * Return 0 upon success. Token will be stored internally for
> > lookup,
> > + * IOASID allocation within the set and other per set operations
> > will use
> > + * the @sid assigned.
> > + *
> > + */
> > +int ioasid_alloc_set(struct ioasid_set *token, ioasid_t quota, int *sid)
> > +{
> > +   struct ioasid_set_data *sdata;
> > +   ioasid_t id;
> > +   int ret = 0;
> > +
> > +   if (quota > ioasid_capacity_avail) {
> > +   pr_warn("Out of IOASID capacity! ask %d, avail
> > %d\n",
> > +   quota, ioasid_capacity_avail);
> > +   return -ENOSPC;
> > +   }  
> 
> This check should be in the same critical section as the quota
> substraction
> 
yes, right.

> > +
> > +   sdata = kzalloc(sizeof(*sdata), GFP_KERNEL);
> > +   if (!sdata)
> > +   return -ENOMEM;  
> 
> I don't understand why we need this structure at all, nor why we need
> the SID. Users have already allocated an ioasid_set, so why not just
> stick the content of ioasid_set_data in there, and pass the
> ioasid_set pointer to ioasid_alloc()?
> 

My thinking was that ioasid_set is an opaque user token, e.g. we use the mm
to identify a common set belonging to a VM.

This sdata is an IOASID-internal structure for managing and servicing per-set
data. If we let the user fill in the content, some of the entries need
to be managed by the IOASID code under a lock. IMO, it is not suitable to let
the user allocate and manage it.

Perhaps we should rename struct ioasid_set to ioasid_set_token?

/**
 * struct ioasid_set_data - Meta data about ioasid_set
 *
 * @token:  Unique to identify an IOASID set
 * @xa:         XArray to store ioasid_set private ID to system-wide IOASID
 *              mapping
 * @max_id:     Max number of IOASIDs can be allocated within the set
 * @nr_id       Number of IOASIDs allocated in the set
 * @sid         ID of the set
 */
struct ioasid_set_data {
struct ioasid_set *token;
struct xarray xa;
int size;
int nr_ioasids;
int sid;
struct rcu_head rcu;
};


I agree the SID is optional. The user could just use the token to reference
the set. I use the SID for performance reasons, i.e. a quick lookup from
SID to its data. Otherwise, we may have to search through a list of
sets to find a match.

> > +
> > +   spin_lock(&ioasid_allocator_lock);
> > +
> > +   ret = xa_alloc(&ioasid_sets, &id, sdata,
> > +  XA_LIMIT(0, ioasid_capacity_avail - quota),
> > +  GFP_KERNEL);  
> 
> Same as Kevin, I think the limit should be the static
> ioasid_capacity. And perhaps a comment explaining the worst case of
> one PASID per set. I found a little confusing using the same type
> ioasid_t for IOASIDs and IOASID sets, it may be clearer to use an int
> for IOASID set IDs.
> 
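
As a rough usage sketch of the set-aware API proposed in this series
(signatures taken from the RFC patch above, so they may well change in later
revisions), a caller would first allocate a set with a quota and then
allocate IOASIDs against the returned set ID:

/*
 * Hedged sketch based on the RFC API in this thread (ioasid_alloc_set(),
 * ioasid_alloc(), ioasid_free()); not a merged kernel interface.  The token,
 * quota, and range values are illustrative.
 */
#include <linux/errno.h>
#include <linux/ioasid.h>

static int example_use_ioasid_set(struct ioasid_set *my_token)
{
	ioasid_t ioasid;
	int sid;

	/* Reserve a quota of 16 IOASIDs for this token (e.g. one VM). */
	if (ioasid_alloc_set(my_token, 16, &sid))
		return -ENOSPC;

	/* Allocate one IOASID from that set; the range limits are arbitrary. */
	ioasid = ioasid_alloc(sid, 1, 1024, NULL);
	if (ioasid == INVALID_IOASID)
		return -ENOSPC;

	ioasid_free(ioasid);
	return 0;
}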

Re: [PATCH 1/2] dma-mapping: add a dma_ops_bypass flag to struct device

2020-04-06 Thread Christoph Hellwig
On Mon, Apr 06, 2020 at 11:25:09PM +1000, Alexey Kardashevskiy wrote:
> >> Do you see any serious problem with this approach? Thanks!
> > 
> > Do you have a link to the whole branch?  The github UI is unfortunately
> > unusable for that (or I'm missing something).
> 
> The UI shows the branch but since I rebased and forcepushed it, it does
> not. Here is the current one with:
> 
> https://github.com/aik/linux/commits/dma-bypass.3

Ok, so we use the core bypass without persistent memory, and then
have another bypass mode on top.  Not great, but I can't think
of anything better.  Note that your checks for the map_sg case
aren't very efficient - for one it would make sense to calculate
the limit only once, but also it would make sense to reuse the
calculated direct mapping addresses instead of doing another pass
later on in the dma-direct code.


Re: [PATCH 05/10] iommu/ioasid: Create an IOASID set for host SVA use

2020-04-06 Thread Jacob Pan
Hi Jean,

On Wed, 1 Apr 2020 15:53:16 +0200
Jean-Philippe Brucker  wrote:

> On Wed, Mar 25, 2020 at 10:55:26AM -0700, Jacob Pan wrote:
> > Bare metal SVA allocates IOASIDs for native process addresses. This
> > should be separated from VM allocated IOASIDs thus under its own
> > set.
> > 
> > This patch creates a system IOASID set with its quota set to
> > PID_MAX. This is a reasonable default in that SVM capable devices
> > can only bind to limited user processes.  
> 
> Yes realistically there won't be more than PID_MAX_DEFAULT=0x8000
> bound address spaces. My machine uses a PID_MAX of 4 million though,
> so in theory more than 0x8000 processes may want a bond.
Got it, I assume we can adjust the system set quota as necessary.

> On Arm the
> limit of shared contexts per VM is currently a little less than
> 0x1 (which is the number of CPU ASIDs).
> 
I guess shared contexts means shared address spaces? Then it makes sense
that #IOASIDs < #ASIDs.

> But quotas are only necessary for VMs, when the host shares the PASID
> space with them (which isn't a use-case for Arm systems as far as I
> know, each VM gets its own PASID space).
Is there a host-guest PASID translation, or is the PASID used by the VM the
physical PASID? When a page request comes in to the SMMU, how does it know
the owner of the PASID if the PASID ranges can overlap between host and
guest?

> Could we have quota-free IOASID sets for the host?
> 
Yes, perhaps just add a flag such that the set has its own namespace.
Do you mean having this quota-free IOASID set co-exist with VMs? I still
don't get how PRQ works in that case.

That is not the use case for VT-d, in that we have to have system-wide
allocation for host PASIDs. We have ENQCMD, which can take a PASID from
the per-task MSR and deliver it to multiple devices, so even though the
PASID table is per device, the PASID namespace must be global.

> For the SMMU I'd like to allocate two sets, one SVA and one private
> for auxiliary domains, and I don't think giving either a quota makes
> much sense at the moment.
I agree we don't need the quota if we don't support guest SVA at the
same time.

So the sva set and aux_domain set PASIDs have their own namespaces?

> There can be systems using only SVA and
> systems using only private PASIDs. I think it should be
> first-come-first-served until admins want a knob to define a policy
> themselves, based on cgroups for example.
> 
> > Signed-off-by: Jacob Pan 
> > ---
> >  drivers/iommu/intel-iommu.c | 8 +++-
> >  drivers/iommu/ioasid.c  | 9 +
> >  include/linux/ioasid.h  | 9 +
> >  3 files changed, 25 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/iommu/intel-iommu.c
> > b/drivers/iommu/intel-iommu.c index ec3fc121744a..af7a1ef7b31e
> > 100644 --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -3511,8 +3511,14 @@ static int __init init_dmars(void)
> > goto free_iommu;
> >  
> > /* PASID is needed for scalable mode irrespective to SVM */
> > -   if (intel_iommu_sm)
> > +   if (intel_iommu_sm) {
> > ioasid_install_capacity(intel_pasid_max_id);
> > +   /* We should not run out of IOASIDs at boot */
> > +   if (ioasid_alloc_system_set(PID_MAX_DEFAULT)) {
> > +   pr_err("Failed to enable host PASID
> > allocator\n");
> > +   intel_iommu_sm = 0;
> > +   }
> > +   }
> >  
> > /*
> >  * for each drhd
> > diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> > index 6265d2dbbced..9135af171a7c 100644
> > --- a/drivers/iommu/ioasid.c
> > +++ b/drivers/iommu/ioasid.c
> > @@ -39,6 +39,9 @@ struct ioasid_data {
> >  static ioasid_t ioasid_capacity;
> >  static ioasid_t ioasid_capacity_avail;
> >  
> > +int system_ioasid_sid;
> > +static DECLARE_IOASID_SET(system_ioasid);
> > +
> >  /* System capacity can only be set once */
> >  void ioasid_install_capacity(ioasid_t total)
> >  {
> > @@ -51,6 +54,12 @@ void ioasid_install_capacity(ioasid_t total)
> >  }
> >  EXPORT_SYMBOL_GPL(ioasid_install_capacity);
> >  
> > +int ioasid_alloc_system_set(int quota)
> > +{
> > +   return ioasid_alloc_set(&system_ioasid, quota, &system_ioasid_sid);
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_alloc_system_set);  
> 
> I think this helper could stay in the VT-d driver for the moment. If
> the SMMU driver ever implements auxiliary domains it will use a
> private IOASID set, separate from the shared IOASID set managed by
> iommu-sva. Both could qualify as "system set".
> 
Sounds good. Perhaps remove the special "system set". SVA code,
VFIO, VT-d, or SMMU driver can all allocate their own sets.
So to meet both SMMU and VT-d requirements, we should do:
1. add an IOASID_PRIVATE flag to ioasid_alloc_set(), indicating this is
a private set
2. All APIs operate on the set_id accordingly, e.g. ioasid_find() will
only search within the private set. A private set is excluded from
global search (VT-d needs this in PRQ).

Since VT-d already needs private PASIDs for guest SVM where

Re: arm-smmu-v3 high cpu usage for NVMe

2020-04-06 Thread John Garry

On 02/04/2020 13:10, John Garry wrote:

On 18/03/2020 20:53, Will Deacon wrote:
As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the 
testing our

batch size is 1, so we're not seeing the real benefit of the batching. I
can't help but think that we could improve this code to try to 
combine CMD

SYNCs for small batches.

Anyway, let me know your thoughts or any questions. I'll have a look 
if a

get a chance for other possible bottlenecks.

Did you ever get any more information on this? I don't have any SMMUv3
hardware any more, so I can't really dig into this myself.





Hi Will,

JFYI, I added some debug in arm_smmu_cmdq_issue_cmdlist() to get some 
idea of what is going on. Perf annotate did not tell much.


I tested NVMe performance with and without Marc's patchset to spread 
LPIs for managed interrupts.


Average duration of arm_smmu_cmdq_issue_cmdlist() mainline [all results 
are approximations]:

owner: 6ms
non-owner: 4ms

mainline + LPI spreading patchset:
owner: 25ms
non-owner: 22ms

For this, a list would be a itlb+cmd_sync.

Please note that the LPI spreading patchset is still giving circa 25% 
NVMe throughput increase. What happens there would be that we get many 
more cpus involved, which creates more inter-cpu contention. But the 
performance increase comes from just alleviating pressure on those 
overloaded cpus.


I also notice that with the LPI spreading patchset, on average a cpu is 
an "owner" in arm_smmu_cmdq_issue_cmdlist() 1 in 8, as opposed to 1 in 3 
for mainline. This means that we're just creating longer chains of lists 
to be published.


But I found that for a non-owner, average msi cmd_sync polling time is 
12ms with the LPI spreading patchset. As such, it seems to be really 
taking approx (12*2/(8-1) =) ~3ms to consume a single list. This seems 
consistent with my finding that an owner polls consumption for 3ms also. 
Without the LPI speading patchset, polling time is approx 2 and 3ms for 
both owner and non-owner, respectively.


As an experiment, I did try to hack the code to use a spinlock again for 
protecting the command queue, instead of current solution - and always 
saw a performance drop there. To be expected. But maybe we can try to 
not use a spinlock, but still serialise production+consumption to 
alleviate the long polling periods.


Let me know your thoughts.

Cheers,
John



[PATCH] dma-direct: Fix data truncation in dma_direct_get_required_mask()

2020-04-06 Thread Kishon Vijay Abraham I via iommu
The upper 32 bits of the physical address get truncated inadvertently
when dma_direct_get_required_mask() invokes phys_to_dma_direct().
This results in dma_addressing_limited() returning an incorrect value
when used on platforms with LPAE enabled.
Fix it here by explicitly casting 'max_pfn' to phys_addr_t
in order to prevent overflow of the intermediate value while evaluating
'(max_pfn - 1) << PAGE_SHIFT'.

Signed-off-by: Kishon Vijay Abraham I 
---
Fixes the following WARN DUMP (and hang thereafter) observed when
connecting NVMe card to a platform with ARM LPAE enabled
nvme :01:00.0: overflow 0x00027b3be000+270336 of DMA mask
 bus limit 
---[ cut here ]---
WARNING: CPU: 0 PID: 26 at kernel/dma/direct.c:35 report_addr+0xf0/0xf4
Modules linked in:
CPU: 0 PID: 26 Comm: kworker/u4:1 Not tainted 5.5.0-2-g1383adf7b819 #2
Hardware name: Generic DRA74X (Flattened Device Tree)
Workqueue: writeback wb_workfn (flush-259:0)
(unwind_backtrace) from [] (show_stack+0x10/0x14)
(show_stack) from [] (dump_stack+0x94/0xa8)
(dump_stack) from [] (__warn+0xbc/0xd8)
(__warn) from [] (warn_slowpath_fmt+0x60/0xb8)
(warn_slowpath_fmt) from [] (report_addr+0xf0/0xf4)
(report_addr) from [] (dma_direct_map_page+0x18c/0x19c)
(dma_direct_map_page) from [] (dma_direct_map_sg+0x64/0xb4)
(dma_direct_map_sg) from [] (nvme_queue_rq+0x778/0x9ec)
(nvme_queue_rq) from [] (__blk_mq_try_issue_directly+0x130/0x1bc)
(__blk_mq_try_issue_directly) from []
(blk_mq_request_issue_directly+0x48/0x78)
(blk_mq_request_issue_directly) from []
(blk_mq_try_issue_list_directly+0x44/0xb8)
(blk_mq_try_issue_list_directly) from []
(blk_mq_sched_insert_requests+0xe0/0x154)
(blk_mq_sched_insert_requests) from []
(blk_mq_flush_plug_list+0x150/0x184)
(blk_mq_flush_plug_list) from [] (blk_flush_plug_list+0xc8/0xe4)
(blk_flush_plug_list) from [] (blk_mq_make_request+0x24c/0x3f0)
(blk_mq_make_request) from [] (generic_make_request+0xb0/0x2d4)
(generic_make_request) from [] (submit_bio+0x44/0x180)
(submit_bio) from [] (mpage_writepages+0xac/0xe8)
(mpage_writepages) from [] (do_writepages+0x44/0xdc)
(do_writepages) from [] (__writeback_single_inode+0x2c/0x1bc)
(__writeback_single_inode) from []
(writeback_sb_inodes+0x1d8/0x404)
(writeback_sb_inodes) from [] (__writeback_inodes_wb+0x58/0x9c)
(__writeback_inodes_wb) from [] (wb_writeback+0x194/0x1d8)
(wb_writeback) from [] (wb_workfn+0x244/0x33c)
(wb_workfn) from [] (process_one_work+0x204/0x458)
(process_one_work) from [] (worker_thread+0x44/0x598)
(worker_thread) from [] (kthread+0x14c/0x150)
(kthread) from [] (ret_from_fork+0x14/0x3c)
 kernel/dma/direct.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index ac7956c38f69..4b24275e306a 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -39,7 +39,8 @@ static inline struct page *dma_direct_to_page(struct device 
*dev,
 
 u64 dma_direct_get_required_mask(struct device *dev)
 {
-   u64 max_dma = phys_to_dma_direct(dev, (max_pfn - 1) << PAGE_SHIFT);
+   phys_addr_t phys = (phys_addr_t)(max_pfn - 1) << PAGE_SHIFT;
+   u64 max_dma = phys_to_dma_direct(dev, phys);
 
return (1ULL << (fls64(max_dma) - 1)) * 2 - 1;
 }
-- 
2.17.1
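
As a quick illustration of the truncation being fixed (standalone user-space
C with made-up numbers; on 32-bit ARM with LPAE, unsigned long is 32 bits
while phys_addr_t is 64 bits):

/*
 * Hedged illustration of the overflow, not kernel code.  With max_pfn
 * 0x280000 (RAM up to 10GB) and PAGE_SHIFT 12, the 32-bit shift wraps
 * while the 64-bit one does not.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t max_pfn = 0x280000;	/* plays the role of a 32-bit unsigned long */
	unsigned int page_shift = 12;

	uint32_t truncated = (max_pfn - 1) << page_shift;          /* 0x7ffff000 */
	uint64_t correct = (uint64_t)(max_pfn - 1) << page_shift;  /* 0x27ffff000 */

	printf("truncated: %#x\ncorrect:   %#llx\n",
	       truncated, (unsigned long long)correct);
	return 0;
}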



Re: [RFC/RFT][PATCH] dma-mapping: set default segment_boundary_mask to ULONG_MAX

2020-04-06 Thread Robin Murphy

On 2020-04-05 1:51 am, Nicolin Chen wrote:

The default segment_boundary_mask was set to DMA_BIT_MASK(32)
a decade ago by referencing the SCSI/block subsystem, as a 32-bit
mask was good enough for most of the devices.

Now more and more drivers set dma_masks above DMA_BIT_MASK(32)
while only a handful of them call dma_set_seg_boundary(). This
means that most drivers have a 4GB segmentation boundary because
the DMA API returns a 32-bit default value, though they might not
really have such a limit.

The default segment_boundary_mask should mean "no limit" since
the device doesn't explicitly set the mask. But a 32-bit mask
certainly limits those devices capable of 32+ bits addressing.

And this 32-bit boundary mask might result in a situation that
when dma-iommu maps a DMA buffer (size > 4GB), iommu_map_sg()
cuts the IOVA region into discontiguous pieces, and creates a
faulty IOVA mapping that overlaps some physical memory outside
the scatter list, which might lead to some random kernel panic
after DMA overwrites that faulty IOVA space.


Once again, get rid of this paragraph - it doesn't have much to do with 
the *default* value since it describes a behaviour general to any 
boundary mask. Plus it effectively says "if a driver uses a DMA-mapped 
scatterlist incorrectly, this change can help paper over the bug", which 
is rather the opposite of a good justification.


(for example most SATA devices end up with a 64KB boundary mask, such 
that padding the IOVAs to provide the appropriate alignment happens very 
frequently, and they've been working just fine for years now)


Robin.


So this patch sets default segment_boundary_mask to ULONG_MAX.

Signed-off-by: Nicolin Chen 
---
  include/linux/dma-mapping.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 330ad58fbf4d..ff8cefe85f30 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -736,7 +736,7 @@ static inline unsigned long dma_get_seg_boundary(struct 
device *dev)
  {
if (dev->dma_parms && dev->dma_parms->segment_boundary_mask)
return dev->dma_parms->segment_boundary_mask;
-   return DMA_BIT_MASK(32);
+   return ULONG_MAX;
  }
  
  static inline int dma_set_seg_boundary(struct device *dev, unsigned long mask)





Re: [PATCH 00/19] [PULL REQUEST] iommu/vt-d: patches for v5.7

2020-04-06 Thread Christoph Hellwig
On Sun, Apr 05, 2020 at 04:30:34PM +0800, Lu Baolu wrote:
> Hi Joerg,
> 
> Below patches have been piled up for v5.7. They enable below
> features:

Err, this is not the time for 5.7 features that haven't been in
linux-next before 5.6 was released.


Re: [PATCH 1/2] dma-mapping: add a dma_ops_bypass flag to struct device

2020-04-06 Thread Alexey Kardashevskiy



On 06/04/2020 21:50, Christoph Hellwig wrote:
> On Fri, Apr 03, 2020 at 07:38:11PM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 26/03/2020 12:26, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 25/03/2020 19:37, Christoph Hellwig wrote:
 On Wed, Mar 25, 2020 at 03:51:36PM +1100, Alexey Kardashevskiy wrote:
>>> This is for persistent memory which you can DMA to/from but yet it does
>>> not appear in the system as a normal memory and therefore requires
>>> special handling anyway (O_DIRECT or DAX, I do not know the exact
>>> mechanics). All other devices in the system should just run as usual,
>>> i.e. use 1:1 mapping if possible.
>>
>> On other systems (x86 and arm) pmem as long as it is page backed does
>> not require any special handling.  This must be some weird way powerpc
>> fucked up again, and I suspect you'll have to suffer from it.
>
>
> It does not matter if it is backed by pages or not, the problem may also
> appear if we wanted for example p2p PCI via IOMMU (between PHBs) and
> MMIO might be mapped way too high in the system address space and make
> 1:1 impossible.

 How can it be mapped too high for a direct mapping with a 64-bit DMA
 mask?
>>>
>>> The window size is limited and often it is not even sparse. It requires
>>> an 8-byte entry per IOMMU page (which most commonly is 64k max), so a
>>> 1TB limit (a guest RAM size) is quite a real thing. MMIO is mapped to
>>> guest physical address space outside of this 1TB (on PPC).
>>>
>>>
>>
>> I am trying now this approach on top of yours "dma-bypass.3" (it is
>> "wip", needs an upper boundary check):
>>
>> https://github.com/aik/linux/commit/49d73c7771e3f6054804f6cfa80b4e320111662d
>>
>> Do you see any serious problem with this approach? Thanks!
> 
> Do you have a link to the whole branch?  The github UI is unfortunately
> unusable for that (or I'm missing something).

The UI shows the branch but since I rebased and forcepushed it, it does
not. Here is the current one with:

https://github.com/aik/linux/commits/dma-bypass.3


Thanks,


-- 
Alexey
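
As a back-of-the-envelope check of the IOMMU window sizing quoted above
(using the 64KB IOMMU page and 8-byte table-entry figures from the thread;
the exact PPC table layout may differ):

  1TB window / 64KB per IOMMU page = 16M table entries
  16M entries * 8 bytes per entry  = 128MB of table memory per window

So a non-sparse window much larger than ~1TB quickly becomes impractical,
which is presumably why MMIO mapped above that range makes a 1:1 layout
impossible.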


Re: [PATCH 1/2] dma-mapping: add a dma_ops_bypass flag to struct device

2020-04-06 Thread Christoph Hellwig
On Fri, Apr 03, 2020 at 07:38:11PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 26/03/2020 12:26, Alexey Kardashevskiy wrote:
> > 
> > 
> > On 25/03/2020 19:37, Christoph Hellwig wrote:
> >> On Wed, Mar 25, 2020 at 03:51:36PM +1100, Alexey Kardashevskiy wrote:
> > This is for persistent memory which you can DMA to/from but yet it does
> > not appear in the system as a normal memory and therefore requires
> > special handling anyway (O_DIRECT or DAX, I do not know the exact
> > mechanics). All other devices in the system should just run as usual,
> > i.e. use 1:1 mapping if possible.
> 
>  On other systems (x86 and arm) pmem as long as it is page backed does
>  not require any special handling.  This must be some weird way powerpc
>  fucked up again, and I suspect you'll have to suffer from it.
> >>>
> >>>
> >>> It does not matter if it is backed by pages or not, the problem may also
> >>> appear if we wanted for example p2p PCI via IOMMU (between PHBs) and
> >>> MMIO might be mapped way too high in the system address space and make
> >>> 1:1 impossible.
> >>
> >> How can it be mapped too high for a direct mapping with a 64-bit DMA
> >> mask?
> > 
> > The window size is limited and often it is not even sparse. It requires
> > an 8-byte entry per IOMMU page (which most commonly is 64k max), so a
> > 1TB limit (a guest RAM size) is quite a real thing. MMIO is mapped to
> > guest physical address space outside of this 1TB (on PPC).
> > 
> > 
> 
> I am trying now this approach on top of yours "dma-bypass.3" (it is
> "wip", needs an upper boundary check):
> 
> https://github.com/aik/linux/commit/49d73c7771e3f6054804f6cfa80b4e320111662d
> 
> Do you see any serious problem with this approach? Thanks!

Do you have a link to the whole branch?  The github UI is unfortunately
unusable for that (or I'm missing something).


RE: How to fix WARN from drivers/base/dd.c in next-20200401 if CONFIG_MODULES=y?

2020-04-06 Thread Yoshihiro Shimoda
Hi again,


> I'm guessing we should add the following flush_work for 
> deferred_probe_timeout_work().
> # Sorry, I didn't test this for some reasons yet though...
> 
> +   /* wait for the deferred probe timeout workqueue to finish */
> +   if (driver_deferred_probe_timeout > 0)
> +   flush_work(&deferred_probe_timeout_work);

I'm sorry. This code caused a build error because deferred_probe_timeout_work
is a struct delayed_work. Also, I don't think flush_delayed_work() is what
I want either (it would wait until the deferred probe timeout expires)...

Best regards,
Yoshihiro Shimoda



RE: How to fix WARN from drivers/base/dd.c in next-20200401 if CONFIG_MODULES=y?

2020-04-06 Thread Yoshihiro Shimoda
Hi John, Geert,

> From: John Stultz, Sent: Saturday, April 4, 2020 1:19 PM
> 
> On Fri, Apr 3, 2020 at 4:47 AM Geert Uytterhoeven  
> wrote:
> > On Thu, Apr 2, 2020 at 7:27 PM John Stultz  wrote:
> > > On Thu, Apr 2, 2020 at 3:17 AM Yoshihiro Shimoda
> > >  wrote:
> > > >
> > > > I found an issue after applied the following patches:
> > > > ---
> > > > 64c775f driver core: Rename deferred_probe_timeout and make it global
> > > > 0e9f8d0 driver core: Remove driver_deferred_probe_check_state_continue()
> > > > bec6c0e pinctrl: Remove use of 
> > > > driver_deferred_probe_check_state_continue()
> > > > e2cec7d driver core: Set deferred_probe_timeout to a longer default if 
> > > > CONFIG_MODULES is set
> >
> > Note that just setting deferred_probe_timeout = -1 like for the
> > CONFIG_MODULES=n case doesn't help.
> 
> Yea. I can see why in that case, as we're checking
> !IS_ENABLED(CONFIG_MODULES) directly in
> driver_deferred_probe_check_state.
> 
> I guess we could switch that to checking
> (driver_deferred_probe_timeout == -1) which would have the same logic
> and at least make it consistent if someone specifies -1 on the command
> line (since now it will effectively have it EPROBE_DEFER forever in
> that case). But also having a timeout=infinity could be useful if
> folks don't want the deferring to time out.  Maybe in the !modules
> case setting it to =0 would be the most clear.
> 
> But that's sort of a further cleanup. I'm still more worried about the
> NFS failure below.
> 
> 
> > > Hey,
> > >   Terribly sorry for the trouble. So as Robin mentioned I have a patch
> > > to remove the WARN messages, but I'm a bit more concerned about why
> > > after the 30 second delay, the ethernet driver loads:
> > >   [   36.218666] ravb e680.ethernet eth0: Base address at
> > > 0xe680, 2e:09:0a:02:eb:2d, IRQ 117.
> > > but NFS fails.
> > >
> > > Is it just that the 30 second delay is too long and NFS gives up?
> >
> > I added some debug code to mount_nfs_root(), which shows that the first
> > 3 tries happen before ravb is instantiated, and the last 3 tries happen
> > after.  So NFS root should work, if the network works.
> >
> > However, it seems the Ethernet PHY is never initialized, hence the link
> > never becomes ready.  Dmesg before/after:
> >
> >  ravb e680.ethernet eth0: Base address at 0xe680,
> > 2e:09:0a:02:ea:ff, IRQ 108.
> >
> > Good.
> >
> >  ...
> > -gpio_rcar e6052000.gpio: sense irq = 11, type = 8
> >
> > This is the GPIO the PHY IRQ is connected to.
> > Note that that GPIO controller has been instantiated before.
> >
> >  ...
> > -Micrel KSZ9031 Gigabit PHY e680.ethernet-:00:
> > attached PHY driver [Micrel KSZ9031 Gigabit PHY]
> > (mii_bus:phy_addr=e680.ethernet-:00, irq=197)
> >  ...
> > -ravb e680.ethernet eth0: Link is Up - 1Gbps/Full - flow control off
> >
> > Oops.
> >
> > -Sending DHCP requests .., OK
> > -IP-Config: Got DHCP answer from ...
> >  ...
> > +VFS: Unable to mount root fs via NFS, trying floppy.
> > +VFS: Cannot open root device "nfs" or unknown-block(2,0): error -6
> >
> > > Does booting with deferred_probe_timeout=0 work?
> >
> > It does, as now everything using optional links (DMA and IOMMU) is now
> > instantiated on first try.
> 
> Thanks so much for helping clarify this!
> 
> So it's at least good to hear that booting with
> deferred_probe_timeout=0 is working!  But I'm bummed the NFS (or as
> you pointed out in your later mail,  ip_auto_config) falls over
> because the network isn't immediately there.
> 
> Looking a little closer at the ip_auto_config() code, I think the
> issue may be that wait_for_device_probe() is effectively returning too
> early, since the probe_defer_timeout is still active? I need to dig a
> bit more on that code, on Monday, as I don't fully understand it yet.

I think so. I also investigated this issue more, and the following
patch seems to be related because its return value changed a bit:

c8c43ce driver core: Fix driver_deferred_probe_check_state() logic

# By the way, this is another topic, but IIUC we should revise the
# deferred_probe_timeout= entry in Documentation/admin-guide/kernel-parameters.txt
# for commit c8c43ce. In particular, "A timeout of 0 will timeout at the end
# of initcalls." no longer matches after that commit is applied.

I'm guessing we should add the following flush_work() for
deferred_probe_timeout_work().
# Sorry, I haven't been able to test this yet for some reasons though...

+   /* wait for the deferred probe timeout workqueue to finish */
+   if (driver_deferred_probe_timeout > 0)
+   flush_work(&deferred_probe_timeout_work);

> If I can't find a way to address that, I think the best course will be
> to set the driver_deferred_probe_timeout value to default to 0
> regardless of the value of CONFIG_MODULES, so we don't cause any
> apparent regression from previous behavior. That will also sort out
> the less intuitive = -1 
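
As a rough sketch of the switch John suggests above (keying the "defer
forever" case off driver_deferred_probe_timeout == -1 rather than
CONFIG_MODULES): the surrounding function body is a simplified approximation
of the next-20200401 drivers/base/dd.c code, not an exact copy.

/*
 * Hedged sketch only: approximates driver_deferred_probe_check_state() from
 * that era; initcalls_done and driver_deferred_probe_timeout are dd.c
 * internals, and the exact messages/body may differ.
 */
int driver_deferred_probe_check_state(struct device *dev)
{
	/* Was: if (!IS_ENABLED(CONFIG_MODULES) && initcalls_done) */
	if (driver_deferred_probe_timeout == -1 && initcalls_done) {
		dev_warn(dev, "ignoring dependency for device, assuming no driver\n");
		return -ENODEV;
	}

	if (!driver_deferred_probe_timeout) {
		dev_WARN(dev, "deferred probe timeout, ignoring dependency\n");
		return -ETIMEDOUT;
	}

	return -EPROBE_DEFER;
}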

RE: [PATCH v10 04/11] vfio/pci: Add VFIO_REGION_TYPE_NESTED region type

2020-04-06 Thread Liu, Yi L
Hi Eric,
> From: Auger Eric 
> Sent: Wednesday, April 1, 2020 9:31 PM
> To: Liu, Yi L ; eric.auger@gmail.com; 
> iommu@lists.linux-
> Subject: Re: [PATCH v10 04/11] vfio/pci: Add VFIO_REGION_TYPE_NESTED region
> type
> 
> Hi Yi,
> 
> On 4/1/20 3:18 PM, Liu, Yi L wrote:
> > Hi Eric,
> >
> > Just curious about your plan for this patch; I just heard my colleague
> > would like to reference the functions from this patch in his DSA driver
> > work.
> 
> Well I intend to respin until somebody tells me it is completely vain or dead 
> follows.
> Joking aside, feel free to embed it in any series it would be beneficial to, 
> just please
> cc me in case code diverges.

Got it. Please also cc me on future versions. :-)

Regards,
Yi Liu
