RE: [RFC PATCH 30/30] vfio: Allow to bind foreign task

2017-02-27 Thread Tian, Kevin
> From: Alex Williamson
> Sent: Tuesday, February 28, 2017 11:54 AM
> 
> On Mon, 27 Feb 2017 19:54:41 +0000
> Jean-Philippe Brucker  wrote:
> 
[...]
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 3fe4197a5ea0..41ae8a231d42 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -415,7 +415,9 @@ struct vfio_device_svm {
> > __u32   flags;
> >  #define VFIO_SVM_PASID_RELEASE_FLUSHED (1 << 0)
> >  #define VFIO_SVM_PASID_RELEASE_CLEAN   (1 << 1)
> > +#define VFIO_SVM_PID   (1 << 2)
> > __u32   pasid;
> > +   __u32   pid;
> >  };
> >  /*
> >   * VFIO_DEVICE_BIND_TASK - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
> > @@ -432,6 +434,19 @@ struct vfio_device_svm {
> >   * On success, VFIO writes a Process Address Space ID (PASID) into @pasid.
> >   * This ID is unique to a device.
> >   *
> > + * VFIO_SVM_PID: bind task @pid instead of current task. The shared address
> > + *    space identified by @pasid is that of task identified by @pid.
> > + *
> > + *    Given that the caller owns the device, setting this flag grants the
> > + *    caller read and write permissions on the entire address space of
> > + *    foreign task described by @pid. Therefore, permission to perform the
> > + *    bind operation on a foreign process is governed by the ptrace access
> > + *    mode PTRACE_MODE_ATTACH_REALCREDS check. See man ptrace(2) for more
> > + *    information.
> > + *
> > + *    If the VFIO_SVM_PID flag is not set, @pid is unused and it is the
> > + *    current task that is bound to the device.
> > + *
> >   * The bond between device and process must be removed with
> >   * VFIO_DEVICE_UNBIND_TASK before exiting.
> >   *
> 
> BTW, nice commit logs throughout this series, I probably need to read
> through them a few more times to really digest it all.  AIUI, the VFIO
> support here is really only useful for basic userspace drivers, I don't
> see how we could take advantage of it for a VM use case where the guest
> manages the PASID space for a domain.  Perhaps it hasn't spent enough
> cycles bouncing around in my head yet.  Thanks,
> 

The current definition doesn't work for the virtualization usage, at least
on Intel VT-d. To enable virtualized SVM within a VM, VT-d architecturally
needs to be in nested mode: go through the guest PASID table to find the
guest CR3, use the guest CR3 for the 1st-level translation (GVA->GPA), and
then apply the 2nd-level translation (GPA->HPA). The PASID table is fully
allocated/managed by the VM. Within the translation process, each guest
pointer (PASID table or 1st-level paging structures) is treated as a GPA,
which also goes through the 2nd-level translation. I haven't read the ARM
SMMU spec yet, but I hope the basic mechanism is similar.

Here we need an API which allows the QEMU vIOMMU to bind the guest PASID
table pointer and enable nested mode for the target device in the
underlying IOMMU hardware, whereas the proposed API only lets a userspace
driver bind a specific host address space.
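
For illustration, such a bind request might look roughly like the sketch
below (hypothetical names, just to show the shape; the VT-d series quoted
later in this digest passes a similar pasid_table_info through the IOMMU
API):

    struct vfio_device_bind_pasid_table {
            __u32   argsz;
            __u32   flags;
            __u64   pasid_table_ptr;        /* GPA of guest PASID table */
            __u32   pasid_table_size;       /* number of table entries */
    };

The key difference from vfio_device_svm is that the kernel never walks a
host address space here; it only installs the guest-provided table pointer
and turns on nested translation for the device.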

Given this difference in requirements, Alex, do you prefer a single API
covering both usages, or separate APIs for their own purposes?

btw Yi is working on an SVM virtualization prototype based on Intel
VT-d. I hope he will send out an RFC soon so we can better align on the
high-level API requirements. :-)

Thanks
Kevin


RE: [RFC Design Doc v3] Enable Shared Virtual Memory feature in pass-through scenarios

2017-02-28 Thread Tian, Kevin
> From: Konrad Rzeszutek Wilk [mailto:konrad.w...@oracle.com]
> Sent: Wednesday, March 01, 2017 6:07 AM
> 
> On Wed, Nov 30, 2016 at 08:49:24AM +0000, Liu, Yi L wrote:
> > What's changed from v2:
> > a) Detailed feature description
> > b) refine description in "Address translation in virtual SVM"
> > c) "Terms" is added
> >
> > Content
> > ===
> > 1. Feature description
> > 2. Why use it?
> > 3. How to enable it
> > 4. How to test
> > 5. Terms
> >
> > Details
> > ===
> > 1. Feature description
> > Shared virtual memory (SVM) lets an application program share its virtual
> > address space with SVM-capable devices.
> >
> > Shared virtual memory details:
> > a) SVM feature requires ATS/PRQ/PASID support on both device side and
> > IOMMU side.
> > b) SVM capable devices could send DMA requests with PASID, the address
> > in the request would be a virtual address within a program's virtual address
> > space.
> > c) IOMMU would use first level page table to translate the address in the
> > request.
> > d) First level page table is a HVA->HPA mapping on bare metal.
> >
> > The Shared Virtual Memory feature in pass-through scenarios is actually
> > SVM virtualization: letting application programs (running in a guest)
> > share their virtual address space with an assigned device (e.g. graphics
> > processors or accelerators).
> 
> I think I am missing something obvious, but the current way that DRM
> works is that the kernel sets up its VA addresses for the GPU and it uses
> that for its ring. It also sets up a user-level mapping for the GPU if the
> application (Xorg) really wants it - but most of the time the kernel is
> in charge of poking at the ring, and the memory that is shared with the
> Xorg is normal RAM allocated via alloc_pages (see
> drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
> and drivers/gpu/drm/ttm/ttm_page_alloc.c).
> 
> So are we talking about the guest applications having access to the
> ring of the GPU?

No. SVM is purely about sharing a CPU address space with the device.
Command submission still goes through the kernel driver, which controls
the rings (with SVM you can then put VAs into those commands). There are
other vendor-specific features to enable direct userspace submission, but
those are orthogonal to SVM.
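
To make that flow concrete, here is a minimal sketch of the userspace side
using the interface proposed in the VFIO series elsewhere in this digest
(VFIO_DEVICE_BIND_TASK and struct vfio_device_svm are from that RFC, not
upstream; submit_cmd() is a hypothetical vendor-specific submission path):

    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    void svm_example(int device_fd)
    {
            struct vfio_device_svm bind = {
                    .argsz = sizeof(bind),
                    .flags = 0,     /* bind the current task */
            };

            /* share this process's address space with the device */
            ioctl(device_fd, VFIO_DEVICE_BIND_TASK, &bind);

            /* bind.pasid now names this address space on the device */
            void *buf = malloc(4096);

            /* commands carry plain VAs plus the PASID; submission
             * still goes through the kernel driver's ring */
            submit_cmd(device_fd, bind.pasid, buf);
    }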

Thanks
Kevin


RE: [RFC PATCH 30/30] vfio: Allow to bind foreign task

2017-03-01 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Tuesday, February 28, 2017 11:23 PM
> 
> Hi Kevin,
> 
> On Tue, Feb 28, 2017 at 06:43:31AM +0000, Tian, Kevin wrote:
> > > From: Alex Williamson
> > > Sent: Tuesday, February 28, 2017 11:54 AM
> > >
> > > On Mon, 27 Feb 2017 19:54:41 +0000
> > > Jean-Philippe Brucker  wrote:
> > >
> > [...]
> > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > > index 3fe4197a5ea0..41ae8a231d42 100644
> > > > --- a/include/uapi/linux/vfio.h
> > > > +++ b/include/uapi/linux/vfio.h
> > > > @@ -415,7 +415,9 @@ struct vfio_device_svm {
> > > > __u32   flags;
> > > >  #define VFIO_SVM_PASID_RELEASE_FLUSHED (1 << 0)
> > > >  #define VFIO_SVM_PASID_RELEASE_CLEAN   (1 << 1)
> > > > +#define VFIO_SVM_PID   (1 << 2)
> > > > __u32   pasid;
> > > > +   __u32   pid;
> > > >  };
> > > >  /*
> > > >   * VFIO_DEVICE_BIND_TASK - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
> > > > @@ -432,6 +434,19 @@ struct vfio_device_svm {
> > > >   * On success, VFIO writes a Process Address Space ID (PASID) into @pasid.
> > > >   * This ID is unique to a device.
> > > >   *
> > > > + * VFIO_SVM_PID: bind task @pid instead of current task. The shared address
> > > > + *    space identified by @pasid is that of task identified by @pid.
> > > > + *
> > > > + *    Given that the caller owns the device, setting this flag grants the
> > > > + *    caller read and write permissions on the entire address space of
> > > > + *    foreign task described by @pid. Therefore, permission to perform the
> > > > + *    bind operation on a foreign process is governed by the ptrace access
> > > > + *    mode PTRACE_MODE_ATTACH_REALCREDS check. See man ptrace(2) for more
> > > > + *    information.
> > > > + *
> > > > + *    If the VFIO_SVM_PID flag is not set, @pid is unused and it is the
> > > > + *    current task that is bound to the device.
> > > > + *
> > > >   * The bond between device and process must be removed with
> > > >   * VFIO_DEVICE_UNBIND_TASK before exiting.
> > > >   *
> > >
> > > BTW, nice commit logs throughout this series, I probably need to read
> > > through them a few more times to really digest it all.  AIUI, the VFIO
> > > support here is really only useful for basic userspace drivers, I don't
> > > see how we could take advantage of it for a VM use case where the guest
> > > manages the PASID space for a domain.  Perhaps it hasn't spent enough
> > > cycles bouncing around in my head yet.  Thanks,
> > >
> >
> > The current definition doesn't work for the virtualization usage, at least
> > on Intel VT-d. To enable virtualized SVM within a VM, VT-d architecturally
> > needs to be in nested mode: go through the guest PASID table to find the
> > guest CR3, use the guest CR3 for the 1st-level translation (GVA->GPA),
> > and then apply the 2nd-level translation (GPA->HPA). The PASID table is
> > fully allocated/managed by the VM. Within the translation process, each
> > guest pointer (PASID table or 1st-level paging structures) is treated as
> > a GPA, which also goes through the 2nd-level translation. I haven't read
> > the ARM SMMU spec yet, but I hope the basic mechanism is similar.
> 
> If I understand correctly, it is very similar on ARM SMMU, where we have
> two stages of translation. Stage-1 is GVA->GPA and stage-2 is GPA->HPA,
> with all intermediate tables of stage-1 translation obtained via stage-2
> as well. The SMMU holds stage-1 paging structure in the PASID tables.

Good to know. :-)

> 
> > Here we need an API which allows the QEMU vIOMMU to bind the guest PASID
> > table pointer and enable nested mode for the target device in the
> > underlying IOMMU hardware, whereas the proposed API only lets a
> > userspace driver bind a specific host address space.
> >
> > Given this difference in requirements, Alex, do you prefer a single API
> > covering both usages, or separate APIs for their own purposes?
> >
> >
> > btw Yi is working on an SVM virtualization prototype based on Intel
> > VT-d. I hope

RE: [RFC PATCH 22/30] iommu: Bind/unbind tasks to/from devices

2017-03-01 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Tuesday, February 28, 2017 3:55 AM
> 
[...]
> 
>   API naming
>   ==========
> 
> I realize that "SVM" as a name isn't great because the svm namespace is
> already taken by AMD-V (Secure Virtual Machine) in arch/x86. Also, the
> name itself doesn't say much.
> 
> I personally prefer "Unified Virtual Addressing" (UVA), adopted by CUDA,
> or rather Unified Virtual Address Space (UVAS). Another possibility is
> Unified Virtual Memory (UVM). Acronym UAS for Unified Address Space is
> already used by USB. Same for Shared Address Space (SAS), already in use
> in the kernel, but SVAS would work (although it doesn't look good).
> 

'Unified' doesn't exactly match 'shared'. In some contexts it means
unifying device-local memory and system memory into one virtual address
space, while SVM is about sharing a CPU virtual address space with a
device.

What about Shared Virtual Addressing (SVA)?

Thanks
Kevin


RE: [PATCH V5] drm/i915: Disable stolen memory when i915 runs on qemu

2017-04-13 Thread Tian, Kevin
> From: Joonas Lahtinen [mailto:joonas.lahti...@linux.intel.com]
> Sent: Wednesday, April 12, 2017 9:22 PM
> 
[...]
> By my limited understanding of VT-d details: The stolen memory is never
> directly accessed by the i915 driver (because CPU access doesn't work even
> in DOM0). It is only used through the aperture, which just requires the
> GT device to have access to the RMRR. Further, the GT device needs
> to have access to stolen memory, because that's what GuC uses as
> backing storage for WOPCM.
> 
> And even if all of the above is addressed, shouldn't we rather
> try to detect the lack of RMRR than the presence of QEMU ISA?
> 
> What comes to my mind is exporting a function like device_has_rmrr() from
> intel-iommu.c and consuming that, if we end up doing this. That way,
> if somebody, some day, goes and writes the currently missing RMRR
> pass-through code, it'll start working, just like it should.
> 

I like what you proposed in the long run, e.g. in a nested virtualization
environment where the L0 VMM assigns the device to the L1 VMM, which
further wants to assign the device to an L2 VM. In such a case the RMRR
information must be propagated along the path to the L1 VMM.

However I can see one limitation in your proposal. There is no
RMRR if VT-d is disabled in the BIOS. Then you cannot use stolen memory
even on bare metal in such a configuration, which is probably not desired.
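
For reference, the consumer side of what Joonas describes could be as small
as the sketch below (device_has_rmrr() is currently a static helper in
intel-iommu.c, so it would need exporting; intel_iommu_enabled is already
exported for i915's use):

    #include <linux/pci.h>
    #include <linux/intel-iommu.h>

    /* Sketch: decide stolen memory usability from RMRR presence */
    static bool i915_stolen_usable(struct pci_dev *pdev)
    {
            /*
             * No VT-d at all (e.g. disabled in BIOS): no RMRR exists
             * either, yet stolen memory works on bare metal -- the
             * limitation noted above.
             */
            if (!intel_iommu_enabled)
                    return true;

            return device_has_rmrr(&pdev->dev);
    }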

Also, the long-term direction is to move away from RMRR for Intel
integrated devices. People have realized its limitations (especially the
objection from the KVM community; I don't think RMRR passthrough
would be an option there). So I'd go with Xiong's simple workaround
here. :-)

Thanks
Kevin


RE: [RFC 0/3] virtio-iommu: a paravirtualized IOMMU

2017-04-13 Thread Tian, Kevin
> From: Jason Wang
> Sent: Wednesday, April 12, 2017 5:07 PM
> 
> On 2017-04-08 03:17, Jean-Philippe Brucker wrote:
> > This is the initial proposal for a paravirtualized IOMMU device using
> > virtio transport. It contains a description of the device, a Linux driver,
> > and a toy implementation in kvmtool. With this prototype, you can
> > translate DMA to guest memory from emulated (virtio), or passed-through
> > (VFIO) devices.
> >
> > In its simplest form, implemented here, the device handles map/unmap
> > requests from the guest. Future extensions proposed in "RFC 3/3" should
> > allow to bind page tables to devices.
> >
> > There are a number of advantages in a paravirtualized IOMMU over a full
> > emulation. It is portable and could be reused on different architectures.
> > It is easier to implement than a full emulation, with less state tracking.
> > It might be more efficient in some cases, with less context switches to
> > the host and the possibility of in-kernel emulation.
> 
> I like the idea. Considering the complexity of IOMMU hardware, I believe we
> don't want to have, and fight bugs in, three or more different IOMMU
> implementations in either userspace or kernel.
> 

Though there are definitely positive things about the pvIOMMU approach,
it also has some limitations:

- Existing IOMMU implementations have been in old distros for quite some
time, while a pvIOMMU driver will only land in future distros. Doing pvIOMMU
only means we completely drop support for old distros in the VM;

- A similar situation exists for other guest OSes, e.g. Windows. The IOMMU
is a key kernel component, and I'm not sure a pvIOMMU delivered through
virtio would be accepted in those OSes (unlike a regular virtio device
driver);

I would imagine both fully-emulated IOMMUs and the pvIOMMU will co-exist
for some time for the above reasons. Someday, when the pvIOMMU is mature and
widespread enough in the ecosystem (and feature-wise comparable to
fully-emulated IOMMUs for all vendors), we may make that call.

Thanks,
Kevin

RE: [RFC 0/3] virtio-iommu: a paravirtualized IOMMU

2017-04-13 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Saturday, April 8, 2017 3:18 AM
> 
> This is the initial proposal for a paravirtualized IOMMU device using
> virtio transport. It contains a description of the device, a Linux driver,
> and a toy implementation in kvmtool. With this prototype, you can
> translate DMA to guest memory from emulated (virtio), or passed-through
> (VFIO) devices.
> 
> In its simplest form, implemented here, the device handles map/unmap
> requests from the guest. Future extensions proposed in "RFC 3/3" should
> allow to bind page tables to devices.
> 
> There are a number of advantages in a paravirtualized IOMMU over a full
> emulation. It is portable and could be reused on different architectures.
> It is easier to implement than a full emulation, with less state tracking.
> It might be more efficient in some cases, with less context switches to
> the host and the possibility of in-kernel emulation.
> 
> When designing it and writing the kvmtool device, I considered two main
> scenarios, illustrated below.
> 
> Scenario 1: a hardware device passed through twice via VFIO
> 
> [Scenario 1 diagram, mangled by the archive. It showed MEM, the pIOMMU
> and the PCI device at the hardware level; the pIOMMU driver and VFIO in
> the host kernel, with the virtio-iommu device in host userspace; and, on
> the guest side of the KVM boundary, the virtio-iommu driver and VFIO in
> the guest kernel with the net driver in guest userspace. Arrows (1a)-(1c)
> and (2a)-(2b) correspond to the steps below.]
> 

Usually people draw such layers in the reverse order, e.g. hardware at the
bottom, kernel in the middle, and userspace on top. :-)

> (1) a. Guest userspace is running a net driver (e.g. DPDK). It allocates a
>        buffer with mmap, obtaining virtual address VA. It then sends a
>        VFIO_IOMMU_MAP_DMA request to map VA to an IOVA (possibly VA=IOVA).
>     b. The mapping request is relayed to the host through virtio
>        (VIRTIO_IOMMU_T_MAP).
>     c. The mapping request is relayed to the physical IOMMU through VFIO.
> 
> (2) a. The guest userspace driver can now instruct the device to directly
>        access the buffer at IOVA.
>     b. IOVA accesses from the device are translated into physical
>        addresses by the IOMMU.
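
To make step (1a) concrete: from guest userspace this is the standard VFIO
type1 map path, roughly as in the sketch below (container/group setup and
error handling omitted):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/vfio.h>

    void *map_buffer(int container_fd, size_t len)
    {
            /* allocate a buffer at virtual address VA */
            void *va = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            struct vfio_iommu_type1_dma_map map = {
                    .argsz = sizeof(map),
                    .flags = VFIO_DMA_MAP_FLAG_READ |
                             VFIO_DMA_MAP_FLAG_WRITE,
                    .vaddr = (uintptr_t)va,
                    .iova  = (uintptr_t)va, /* VA=IOVA, as above */
                    .size  = len,
            };

            /* step (1a); relayed as VIRTIO_IOMMU_T_MAP in (1b) */
            ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
            return va;
    }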
> 
> Scenario 2: a virtual net device behind a virtual IOMMU.
> 
> [Scenario 2 diagram, mangled by the archive. It showed MEM, the pIOMMU
> and the PCI device at the hardware level; the pIOMMU driver, net driver
> and tap in the host kernel, with the virtio-net and virtio-iommu devices
> in host userspace; and the virtio-net driver (1a) and virtio-iommu driver
> (1b) in the guest kernel. Step (2) is the virtio-net device accessing
> guest memory through the vIOMMU.]
> 
> (1) a. Guest virtio-net driver maps the virtio ring and a buffer
> b. The mapping requests are relayed to the host through virtio.
> (2) The virtio-net device now needs to access any guest memory via the
> IOMMU.
> 
> Physical and virtual IOMMUs are completely dissociated. The net driver is
> mapping its own buffers via DMA/IOMMU API, and buffers are copied
> between
> virtio-net and tap.
> 
> 
> The description itself seemed too long for a single email, so I split it
> into three documents, and will attach Linux and kvmtool patches to this
> email.
> 
>   1. Firmware note,
>   2. device operations (draft for the virtio specification),
>   3. future

RE: [RFC 1/3] virtio-iommu: firmware description of the virtual topology

2017-04-18 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Saturday, April 8, 2017 3:18 AM
> 
> Unlike other virtio devices, the virtio-iommu doesn't work independently,
> it is linked to other virtual or assigned devices. So before jumping into
> device operations, we need to define a way for the guest to discover the
> virtual IOMMU and the devices it translates.
> 
> The host must describe the relation between IOMMU and devices to the
> guest using either device-tree or ACPI. The virtual IOMMU identifies each

Do you plan to support both device tree and ACPI?

> virtual device with a 32-bit ID, that we will call "Device ID" in this
> document. Device IDs are not necessarily unique system-wide, but they may
> not overlap within a single virtual IOMMU. Device IDs of passed-through
> devices do not need to match the IDs seen by the physical IOMMU.
> 
> The virtual IOMMU uses virtio-mmio transport exclusively, not virtio-pci,
> because with PCI the IOMMU interface would itself be an endpoint, and
> existing firmware interfaces don't allow to describe IOMMU<->master
> relations between PCI endpoints.

I'm not familiar with the virtio-mmio mechanism. Curious how virtio-mmio
devices are enumerated today? Could we use that mechanism to
identify vIOMMUs and then invent a purely para-virtualized method to
enumerate devices behind each vIOMMU?

Asking this because each vendor has its own enumeration methods.
ARM has device tree and ACPI IORT. AMD has ACPI IVRS and device
tree (same format as ARM?). Intel has ACPI DMAR and sub-tables. Your
current proposal appears to follow the ARM definitions, which I'm not sure
are extensible enough to cover features defined only in other vendors'
structures.

Since the purpose of this series is to para-virtualize, why not also
para-virtualize and simplify the enumeration method? For example,
we could define a query interface through vIOMMU registers to let the
guest query whether a device belongs to that vIOMMU. Then we could
even remove the use of any enumeration structure completely...
Just a quick example; I may not have thought through all the pros and
cons. :-)
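
As a purely illustrative sketch of such a register-based query interface
(everything below is hypothetical, not part of the proposal), written in
the same style as the spec drafts in this series:

    /* hypothetical guest-visible vIOMMU MMIO registers */
    struct viommu_query_regs {
            le32    probe_device;   /* W: endpoint ID to query */
            le32    probe_result;   /* R: 1 if translated by this vIOMMU */
    };

The guest would write an endpoint ID and read back whether this vIOMMU
translates it, trading the firmware table for a boot-time probe loop.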

> 
> The following diagram describes a situation where two virtual IOMMUs
> translate traffic from devices in the system. vIOMMU 1 translates two PCI
> domains, in which each function has a 16-bits requester ID. In order for
> the vIOMMU to differentiate guest requests targeted at devices in each
> domain, their Device ID ranges cannot overlap. vIOMMU 2 translates two PCI
> domains and a collection of platform devices.
> 
>                Device ID         Requester ID
>           /      0x0                 0x0      \
>          /        |                   |        PCI domain 1
>         /      0xffff               0xffff    /
> vIOMMU 1
>         \     0x10000                0x0      \
>          \        |                   |        PCI domain 2
>           \    0x1ffff              0xffff    /
>
>           /      0x0                          \
>          /        |                            platform devices
>         /      0x1fff                         /
> vIOMMU 2
>         \     0x2000                 0x0      \
>          \        |                   |        PCI domain 3
>           \    0x11fff              0xffff    /
> 

Shouldn't the above be (0x30000, 0x3ffff) for PCI domain 3, given that the
Device ID is 16-bit?

Thanks
Kevin


RE: [RFC 2/3] virtio-iommu: device probing and operations

2017-04-18 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Saturday, April 8, 2017 3:18 AM
> 
[...]
>   II. Feature bits
>   ================
> 
> VIRTIO_IOMMU_F_INPUT_RANGE (0)
>  Available range of virtual addresses is described in input_range

Usually only the maximum supported address bits are important.
Curious, do you see situations where the low end of the address
space is not usable (since you have both start/end defined later)?

[...]
>   1. Attach device
>   ================
> 
> struct virtio_iommu_req_attach {
>       le32    address_space;
>       le32    device;
>       le32    flags/reserved;
> };
> 
> Attach a device to an address space. 'address_space' is an identifier
> unique to the guest. If the address space doesn't exist in the IOMMU

Based on your description, this address space ID is per-operation, right?
MAP/UNMAP and page-table sharing should have different ID spaces...

> device, it is created. 'device' is an identifier unique to the IOMMU. The
> host communicates unique device ID to the guest during boot. The method
> used to communicate this ID is outside the scope of this specification,
> but the following rules must apply:
> 
> * The device ID is unique from the IOMMU point of view. Multiple devices
>   whose DMA transactions are not translated by the same IOMMU may have the
>   same device ID. Devices whose DMA transactions may be translated by the
>   same IOMMU must have different device IDs.
> 
> * Sometimes the host cannot completely isolate two devices from each
>   other. For example on a legacy PCI bus, devices can snoop DMA
>   transactions from their neighbours. In this case, the host must
>   communicate to the guest that it cannot isolate these devices from each
>   other. The method used to communicate this is outside the scope of this
>   specification. The IOMMU device must ensure that devices that cannot be

"IOMMU device" -> "IOMMU driver"

>   isolated by the host have the same address spaces.
> 

Thanks
Kevin


RE: [RFC 3/3] virtio-iommu: future work

2017-04-21 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Saturday, April 8, 2017 3:18 AM
> 
> Here I propose a few ideas for extensions and optimizations. This is all
> very exploratory, feel free to correct mistakes and suggest more things.

[...]
> 
>   II. Page table sharing
>   ======================
> 
>   1. Sharing IOMMU page tables
>   ----------------------------
> 
> VIRTIO_IOMMU_F_PT_SHARING
> 
> This is independent of the nested mode described in I.2, but relies on a
> similar feature in the physical IOMMU: having two stages of page tables,
> one for the host and one for the guest.
> 
> When this is supported, the guest can manage its own s1 page directory, to
> avoid sending MAP/UNMAP requests. Feature
> VIRTIO_IOMMU_F_PT_SHARING allows
> a driver to give a page directory pointer (pgd) to the host and send
> invalidations when removing or changing a mapping. In this mode, three
> requests are used: probe, attach and invalidate. An address space cannot
> be using the MAP/UNMAP interface and PT_SHARING at the same time.
> 
> Device and driver first need to negotiate which page table format they
> will be using. This depends on the physical IOMMU, so the request contains
> a negotiation part to probe the device capabilities.
> 
> (1) Driver attaches devices to address spaces as usual, but a flag
> VIRTIO_IOMMU_ATTACH_F_PRIVATE (working title) tells the device not to
> create page tables for use with the MAP/UNMAP API. The driver intends
> to manage the address space itself.
> 
> (2) Driver sends a PROBE_TABLE request. It sets len > 0 with the size of
> pg_format array.
> 
>   VIRTIO_IOMMU_T_PROBE_TABLE
> 
>   struct virtio_iommu_req_probe_table {
>       le32    address_space;
>       le32    flags;
>       le32    len;
> 
>       le32    nr_contexts;
>       struct {
>               le32    model;
>               u8      format[64];
>       } pg_format[len];
>   };
> 
> Introducing a probe request is more flexible than advertising those
> features in virtio config, because capabilities are dynamic, and depend on
> which devices are attached to an address space. Within a single address
> space, devices may support different numbers of contexts (PASIDs), and
> some may not support recoverable faults.
> 
> (3) Device responds success with all page table formats implemented by the
> physical IOMMU in pg_format. 'model' 0 is invalid, so driver can
> initialize the array to 0 and deduce from there which entries have
> been filled by the device.
> 
> Using a probe method seems preferable over trying to attach every possible
> format until one sticks. For instance, with an ARM guest running on an x86
> host, PROBE_TABLE would return the Intel IOMMU page table format, and the
> guest could use that page table code to handle its mappings, hidden behind
> the IOMMU API. This requires that the page-table code is reasonably
> abstracted from the architecture, as is done with drivers/iommu/io-pgtable
> (an x86 guest could use any format implemented by io-pgtable for example.)

So essentially you need to modify all existing IOMMU drivers to support
page-table sharing in pvIOMMU. After the abstraction is done, the core
pvIOMMU files can be kept vendor-agnostic. But the pvIOMMU module as a
whole actually includes vendor-specific logic, unlike typical
para-virtualized virtio drivers, which are completely vendor-agnostic. Is
this understanding accurate?

It also means the host-side pIOMMU driver needs to propagate all
supported formats through VFIO to the QEMU vIOMMU, meaning
such format definitions need to be consistently agreed upon across all
those components.

[...]

> 
>   2. Sharing MMU page tables
>   --------------------------
> 
> The guest can share process page-tables with the physical IOMMU. To do
> that, it sends PROBE_TABLE with (F_INDIRECT | F_NATIVE | F_FAULT). The
> page table format is implicit, so the pg_format array can be empty (unless
> the guest wants to query some specific property, e.g. number of levels
> supported by the pIOMMU?). If the host answers with success, guest can
> send its MMU page table details with ATTACH_TABLE and (F_NATIVE |
> F_INDIRECT | F_FAULT) flags.
> 
> F_FAULT means that the host communicates page requests from device to the
> guest, and the guest can handle them by mapping virtual address in the
> fault to pages. It is only available with VIRTIO_IOMMU_F_FAULT_QUEUE (see
> below.)
> 
> F_NATIVE means that the pIOMMU pgtable format is the same as the guest
> MMU pgtable format.
> 
> F_INDIRECT means that 'table' pointer is a context table, instead of a
> page directory. Each slot in the context table points to a page directory:
> 
>64  2 1 0
>   table > +-+
>   |   pgd   |0|1|<--- context 0
>   |   ---   |0|0|<--- context 1
>   |   p

RE: [RFC 1/3] virtio-iommu: firmware description of the virtual topology

2017-04-21 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Wednesday, April 19, 2017 2:41 AM
> 
> On 18/04/17 10:51, Tian, Kevin wrote:
> >> From: Jean-Philippe Brucker
> >> Sent: Saturday, April 8, 2017 3:18 AM
> >>
> >> Unlike other virtio devices, the virtio-iommu doesn't work independently,
> >> it is linked to other virtual or assigned devices. So before jumping into
> >> device operations, we need to define a way for the guest to discover the
> >> virtual IOMMU and the devices it translates.
> >>
> >> The host must describe the relation between IOMMU and devices to the
> >> guest
> >> using either device-tree or ACPI. The virtual IOMMU identifies each
> >
> > Do you plan to support both device tree and ACPI?
> 
> Yes, with ACPI the topology would be described using IORT nodes. I didn't
> include an example in my driver because DT is sufficient for a prototype
> and is readily available (both in Linux and kvmtool), whereas IORT would
> be quite easy to reuse in Linux, but isn't present in kvmtool at the
> moment. However, both interfaces have to be supported for the
> virtio-iommu to be portable.

Does 'portable' mean working regardless of whether the guest enables ACPI?

> 
> >> virtual device with a 32-bit ID, that we will call "Device ID" in this
> >> document. Device IDs are not necessarily unique system-wide, but they may
> >> not overlap within a single virtual IOMMU. Device IDs of passed-through
> >> devices do not need to match IDs seen by the physical IOMMU.
> >>
> >> The virtual IOMMU uses virtio-mmio transport exclusively, not virtio-pci,
> >> because with PCI the IOMMU interface would itself be an endpoint, and
> >> existing firmware interfaces don't allow to describe IOMMU<->master
> >> relations between PCI endpoints.
> >
> > I'm not familiar with the virtio-mmio mechanism. Curious how virtio-mmio
> > devices are enumerated today? Could we use that mechanism to
> > identify vIOMMUs and then invent a purely para-virtualized method to
> > enumerate devices behind each vIOMMU?
> 
> Using DT, virtio-mmio devices are described with "virtio-mmio" compatible
> node, and with ACPI they use _HID LNRO0005. Since the host already
> describes available devices to a guest using a firmware interface, I think
> we should reuse the tools provided by that interface for describing
> relations between DMA masters and IOMMU.

OK, I didn't realize virtio-mmio is defined to rely on DT for enumeration.

> 
> > Asking this because each vendor has its own enumeration methods.
> > ARM has device tree and ACPI IORT. AMD has ACPI IVRS and device
> > tree (same format as ARM?). Intel has ACPI DMAR and sub-tables. Your
> > current proposal appears to follow the ARM definitions, which I'm not
> > sure are extensible enough to cover features defined only in other
> > vendors' structures.
> 
> ACPI IORT can be extended to incorporate para-virtualized IOMMUs,
> regardless of the underlying architecture. It isn't defined solely for the
> ARM SMMU, but serves a more general purpose of describing a map of device
> identifiers communicated from one component to another. Both DMAR and
> IVRS have such description (respectively DRHD and IVHD), but they are
> designed for a specific IOMMU, whereas IORT could host other kinds.

I'll take a look at the IORT definition. DRHD includes more information
than just the device mapping.

> 
> It seems that all we really need is an interface that says "there is a
> virtio-iommu at address X, here are the devices it translates and their
> corresponding IDs", and both DT and ACPI IORT are able to fulfill this role.
> 
> > Since the purpose of this series is to go para-virtualize, why not also
> > para-virtualize and simplify the enumeration method? For example,
> > we may define a query interface through vIOMMU registers to allow
> > guest query whether a device belonging to that vIOMMU. Then we
> > can even remove use of any enumeration structure completely...
> > Just a quick example which I may not think through all the pros and
> > cons. :-)
> 
> I don't think adding a brand new topology description mechanism is worth
> the effort, we're better off reusing what already exists and is
> implemented by operating systems. Adding a query interface inside the
> vIOMMU may work (though might be very painful to integrate with fwspec in
> Linux), but would be redundant since the host has to provide a firmware
> description of the system anyway.
> 
> >> T

RE: [RFC 2/3] virtio-iommu: device probing and operations

2017-04-21 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Wednesday, April 19, 2017 2:46 AM
> 
> On 18/04/17 11:26, Tian, Kevin wrote:
> >> From: Jean-Philippe Brucker
> >> Sent: Saturday, April 8, 2017 3:18 AM
> >>
> > [...]
> >>   II. Feature bits
> >>   
> >>
> >> VIRTIO_IOMMU_F_INPUT_RANGE (0)
> >>  Available range of virtual addresses is described in input_range
> >
> > Usually only the maximum supported address bits are important.
> > Curious do you see such situation where low end of the address
> > space is not usable (since you have both start/end defined later)?
> 
> A start address would allow to provide something resembling a GART to the
> guest: an IOMMU with one address space (ioasid_bits=0) and a small IOVA
> aperture. I'm not sure how useful that would be in practice.

Intel VT-d has no such limitation, as far as I can tell. :-)

> 
> On a related note, the virtio-iommu itself doesn't provide a
> per-address-space aperture as it stands. For example, attaching a device
> to an address space might restrict the available IOVA range for the whole
> AS if that device cannot write to high memory (above 32-bit). If the guest
> attempts to map an IOVA outside this window into the device's address
> space, it should expect the MAP request to fail. And when attaching, if
> the address space already has mappings outside this window, then ATTACH
> should fail.
> 
> This too seems to be something that ought to be communicated by firmware,
> but bits are missing (I can't find anything equivalent to DT's dma-ranges
> for PCI root bridges in ACPI tables, for example). In addition VFIO
> doesn't communicate any DMA mask for devices, and doesn't check them
> itself. I guess that the host could find out the DMA mask of devices one
> way or another, but it is tricky to enforce, so I didn't make this a hard
> requirement. Although I should probably add a few words about it.

If there is no such communication on bare metal, then the same goes for the pvIOMMU.

> 
> > [...]
> >>   1. Attach device
> >>   
> >>
> >> struct virtio_iommu_req_attach {
> >>    le32    address_space;
> >>    le32    device;
> >>    le32    flags/reserved;
> >> };
> >>
> >> Attach a device to an address space. 'address_space' is an identifier
> >> unique to the guest. If the address space doesn't exist in the IOMMU
> >
> > Based on your description, this address space ID is per-operation, right?
> > MAP/UNMAP and page-table sharing should have different ID spaces...
> 
> I think it's simpler if we keep a single IOASID space per virtio-iommu
> device, because the maximum number of address spaces (described by
> ioasid_bits) might be a restriction of the pIOMMU. For page-table sharing
> you still need to define which devices will share a page directory using
> ATTACH requests, though that interface is not set in stone.

Got you. Yes, the VM is supposed to consume fewer IOASIDs than are
physically available. It doesn't hurt to have one IOASID space for both
IOVA map/unmap usages (one IOASID per device) and SVM usages (multiple
IOASIDs per device). The former is digested by software and the latter
will be bound to hardware.

Thanks
Kevin


RE: [Qemu-devel] [RFC PATCH 09/20] Memory: introduce iommu_ops->record_device

2017-05-19 Thread Tian, Kevin
> From: Liu, Yi L [mailto:yi.l@linux.intel.com]
> Sent: Friday, May 19, 2017 1:24 PM
> 
> Hi Alex,
> 
> What's your opinion on Tianyu's question? Is it acceptable
> to use the VFIO API in the intel_iommu emulator?

Do you actually need such a translation at all? The SID should be
filled in by the kernel IOMMU driver based on which device the
invalidation request is issued for, regardless of which
guest SID is used in userspace. QEMU only needs to know
which fd corresponds to the guest SID, and then initiate an
invalidation request on that fd?
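
In other words, the QEMU-side state could be as small as the sketch below
(hypothetical names; vfio_send_invalidate() stands in for whatever ioctl
ends up carrying the opaque invalidation):

    #include <stdint.h>
    #include <stddef.h>

    typedef struct VTDAssignedDev {
        uint16_t gsid;      /* guest source-id (bus:devfn) */
        int      vfio_fd;   /* fd of the assigned VFIO device */
    } VTDAssignedDev;

    static int vtd_flush_by_gsid(VTDAssignedDev *devs, int n,
                                 uint16_t gsid, void *inv, size_t len)
    {
        int i;

        for (i = 0; i < n; i++) {
            if (devs[i].gsid == gsid) {
                /* host IOMMU driver fills in the host SID itself */
                return vfio_send_invalidate(devs[i].vfio_fd, inv, len);
            }
        }
        return -1;  /* unknown guest SID */
    }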

> 
> Thanks,
> Yi L
> On Fri, Apr 28, 2017 at 02:46:16PM +0800, Lan Tianyu wrote:
> > On 2017-04-26 18:06, Liu, Yi L wrote:
> > > With a vIOMMU exposed to the guest, the vIOMMU emulator needs to do
> > > translation between host and guest, e.g. for a device-selective TLB
> > > flush the vIOMMU emulator needs to replace the guest SID with the host
> > > SID so as to limit the invalidation. This patch introduces a new
> > > callback iommu_ops->record_device() to notify the vIOMMU emulator to
> > > record necessary information about the assigned device.
> >
> > This patch is to prepare to translate guest sbdf to host sbdf.
> >
> > Alex:
> > Could we add a new VFIO API to do such translation? This will be more
> > straightforward than storing the host sbdf in the vIOMMU device model.
> >
> > >
> > > Signed-off-by: Liu, Yi L 
> > > ---
> > >  include/exec/memory.h | 11 +++
> > >  memory.c  | 12 
> > >  2 files changed, 23 insertions(+)
> > >
> > > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > > index 7bd13ab..49087ef 100644
> > > --- a/include/exec/memory.h
> > > +++ b/include/exec/memory.h
> > > @@ -203,6 +203,8 @@ struct MemoryRegionIOMMUOps {
> > >  IOMMUNotifierFlag new_flags);
> > >  /* Set this up to provide customized IOMMU replay function */
> > >  void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
> > > +void (*record_device)(MemoryRegion *iommu,
> > > +  void *device_info);
> > >  };
> > >
> > >  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> > > @@ -708,6 +710,15 @@ void memory_region_notify_iommu(MemoryRegion *mr,
> > >  void memory_region_notify_one(IOMMUNotifier *notifier,
> > >IOMMUTLBEntry *entry);
> > >
> > > +/*
> > > + * memory_region_notify_device_record: notify IOMMU to record an
> > > + * assigned device.
> > > + * @mr: the memory region to notify
> > > + * @device_info: device information
> > > + */
> > > +void memory_region_notify_device_record(MemoryRegion *mr,
> > > +void *info);
> > > +
> > >  /**
> > >   * memory_region_register_iommu_notifier: register a notifier for
> changes to
> > >   * IOMMU translation entries.
> > > diff --git a/memory.c b/memory.c
> > > index 0728e62..45ef069 100644
> > > --- a/memory.c
> > > +++ b/memory.c
> > > @@ -1600,6 +1600,18 @@ static void memory_region_update_iommu_notify_flags(MemoryRegion *mr)
> > >  mr->iommu_notify_flags = flags;
> > >  }
> > >
> > > +void memory_region_notify_device_record(MemoryRegion *mr,
> > > +void *info)
> > > +{
> > > +assert(memory_region_is_iommu(mr));
> > > +
> > > +if (mr->iommu_ops->record_device) {
> > > +mr->iommu_ops->record_device(mr, info);
> > > +}
> > > +
> > > +return;
> > > +}
> > > +
> > >  void memory_region_register_iommu_notifier(MemoryRegion *mr,
> > > IOMMUNotifier *n)
> > >  {
> > >
> >
> >

RE: [RFC PATCH 7/8] VFIO: Add new IOCTL for IOMMU TLB invalidate propagation

2017-07-04 Thread Tian, Kevin
> From: Liu, Yi L [mailto:yi.l@linux.intel.com]
> Sent: Sunday, May 14, 2017 6:55 PM
> 
> On Fri, May 12, 2017 at 03:58:43PM -0600, Alex Williamson wrote:
> > On Wed, 26 Apr 2017 18:12:04 +0800
> > "Liu, Yi L"  wrote:
> >
> > > From: "Liu, Yi L" 
> > >
> > > This patch adds VFIO_IOMMU_TLB_INVALIDATE to propagate IOMMU TLB
> > > invalidate request from guest to host.
> > >
> > > In the case of SVM virtualization on VT-d, host IOMMU driver has
> > > no knowledge of caching structure updates unless the guest
> > > invalidation activities are passed down to the host. So a new
> > > IOCTL is needed to propagate the guest cache invalidation through
> > > VFIO.
> > >
> > > Signed-off-by: Liu, Yi L 
> > > ---
> > >  include/uapi/linux/vfio.h | 9 +
> > >  1 file changed, 9 insertions(+)
> > >
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index 6b97987..50c51f8 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -564,6 +564,15 @@ struct vfio_device_svm {
> > >
> > >  #define VFIO_IOMMU_SVM_BIND_TASK _IO(VFIO_TYPE, VFIO_BASE + 22)
> > >
> > > +/* For IOMMU TLB Invalidation Propagation */
> > > +struct vfio_iommu_tlb_invalidate {
> > > + __u32   argsz;
> > > + __u32   length;
> > > + __u8    data[];
> > > +};
> > > +
> > > +#define VFIO_IOMMU_TLB_INVALIDATE _IO(VFIO_TYPE, VFIO_BASE + 23)
> >
> > I'm kind of wondering why this isn't just a new flag bit on
> > vfio_device_svm, the data structure is so similar.  Of course data
> > needs to be fully specified in uapi.
> 
> Hi Alex,
> 
> For this part, it depends on whether we use an opaque structure or not.
> The following link mentioned it in the [Open] session:
> 
> http://www.spinics.net/lists/kvm/msg148798.html
> 
> If we pick the fully opaque solution for IOMMU TLB invalidate propagation,
> then I may add a flag bit on vfio_device_svm and also add the definition
> in uapi as you suggested.
> 

There is another benefit to keeping it as a separate command. For now
we only need to invalidate the 1st-level translation (GVA->GPA) for SVM,
since the 1st-level page table is provided by the guest and directly walked
by the IOMMU. It's possible some vendor may also choose to implement
a nested 2nd-level translation (e.g. GIOVA->GPA->HPA); then hardware
can directly walk the guest GIOVA page table, so explicit invalidation is
also required. We'd better not limit the invalidation interface to the
SVM structure.

Thanks
Kevin


RE: [Qemu-devel] [RFC PATCH 7/8] VFIO: Add new IOCTL for IOMMU TLB invalidate propagation

2017-07-04 Thread Tian, Kevin
> From: Liu, Yi L
> Sent: Monday, July 3, 2017 6:31 PM
> 
> Hi Jean,
> 
> 
> >
> > > 2. Define a structure in include/uapi/linux/iommu.h (newly added
> > > header file)
> > >
> > > struct iommu_tlb_invalidate {
> > >   __u32   scope;
> > > /* pasid-selective invalidation described by @pasid */
> > > #define IOMMU_INVALIDATE_PASID (1 << 0)
> > > /* address-selective invalidation described by (@vaddr, @size) */
> > > #define IOMMU_INVALIDATE_VADDR (1 << 1)

For VT-d the above two flags are related. There is no method of flushing
(@vaddr, @size) for all PASIDs, which wouldn't make sense anyway:
address-selective invalidation is valid only for a given PASID. So it's not
appropriate to put them at the same level of scope definition, at least
for VT-d.

> > >   __u32   flags;
> > > /*  targets non-pasid mappings, @pasid is not valid */
> > > #define IOMMU_INVALIDATE_NO_PASID (1 << 0)
> >
> > Although it was my proposal, I don't like this flag. In ARM SMMU, we're
> > using a special mode where PASID 0 is reserved and any traffic without
> > PASID uses entry 0 of the PASID table. So I proposed the "NO_PASID" flag
> > to invalidate that special context explicitly. But this means that
> > invalidation packet targeted at that context will have "scope = PASID" and
> > "flags = NO_PASID", which is utterly confusing.
> >
> > I now think that we should get rid of the IOMMU_INVALIDATE_NO_PASID flag
> > and just use PASID 0 to invalidate this context on ARM. I don't think
> > other architectures would use the NO_PASID flag anyway, but might be
> > mistaken.
> 
> I may suggest keeping it for now. On VT-d, we may pass some data in the
> opaque field, so we can work without it. But if another vendor wants to
> issue non-PASID-tagged cache invalidations, they may encounter problems.

I'm worried about the criteria for which attributes should be abstracted
into the common structure and which can be left opaque. It doesn't make
much sense to do such abstraction purely because different vendor formats
have some common fields. Usually we do such abstraction because
vendor-agnostic code needs to do some common handling before going to
vendor-specific code. However, in this case VFIO is not expected to do
anything with those IOMMU-specific attributes. The structure is directly
forwarded to the IOMMU driver, which simply translates it back into
vendor-specific opaque data. Then why bother doing double translations on
the QEMU and IOMMU driver sides?

Take VT-d for example. Below is a summary of all possible selections around
invalidation of the 1st-level structures for SVM:

Scope: all PASIDs, or a single PASID
    for each PASID:
        all mappings, or page-selective mappings (addr, size)
Invalidation target:
    IOTLB entries (leaf)
    paging structure cache (non-leaf)
    PASID cache (pasid->cr3)
Invalidation hints:
    whether global pages are included
    drain reads/writes

The above are pretty architectural attributes if you just look at the
functional purpose. If we really consider defining a common structure, it
might be more natural to define a superset of all vendors' capabilities
and remove the opaque field altogether. But as said earlier, the purpose of
doing such an abstraction is not clear if there is no vendor-agnostic
user actually digesting those fields. Should we then reconsider the
fully opaque approach?
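
For the sake of argument, such a superset covering at least the VT-d
selections above (with no opaque field) might look like the sketch below;
the names are made up, not a proposal:

    struct iommu_tlb_invalidate {
            __u32   scope;
    #define IOMMU_INVALIDATE_ALL_PASID      (1 << 0)
    #define IOMMU_INVALIDATE_SINGLE_PASID   (1 << 1)
            __u32   target;
    #define IOMMU_INVALIDATE_IOTLB          (1 << 0) /* leaf */
    #define IOMMU_INVALIDATE_PAGING_STRUCT  (1 << 1) /* non-leaf */
    #define IOMMU_INVALIDATE_PASID_CACHE    (1 << 2) /* pasid->cr3 */
            __u32   flags;
    #define IOMMU_INVALIDATE_GLOBAL_PAGES   (1 << 0)
    #define IOMMU_INVALIDATE_DRAIN_READS    (1 << 1)
    #define IOMMU_INVALIDATE_DRAIN_WRITES   (1 << 2)
            __u32   pasid;
            __u64   addr;   /* page-selective only */
            __u64   size;   /* page-selective only */
    };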

Welcome comments, since I may be overlooking something here. :-)

Thanks
Kevin


RE: [PATCH 2/9] iommu/vt-d: add bind_pasid_table function

2017-07-05 Thread Tian, Kevin
> From: Jacob Pan [mailto:jacob.jun@linux.intel.com]
> Sent: Wednesday, June 28, 2017 3:48 AM
> 
> Add Intel VT-d ops to the generic iommu_bind_pasid_table API
> functions.
> 
> The primary use case is direct assignment of an SVM-capable
> device. Originating from the emulated IOMMU in the guest, the request goes
> through many layers (e.g. VFIO). Upon calling the host IOMMU driver, the
> caller passes the guest PASID table pointer (GPA) and size.
> 
> Device context table entry is modified by Intel IOMMU specific
> bind_pasid_table function. This will turn on nesting mode and matching
> translation type.
> 
> The unbind operation restores default context mapping.
> 
> Signed-off-by: Jacob Pan 
> Signed-off-by: Liu, Yi L 
> Signed-off-by: Ashok Raj 
> ---
>  drivers/iommu/intel-iommu.c   | 117 ++
>  include/linux/dma_remapping.h |   1 +
>  2 files changed, 118 insertions(+)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 8274ce3..ef05b59 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5430,6 +5430,119 @@ struct intel_iommu *intel_svm_device_to_iommu(struct device *dev)
> 
>   return iommu;
>  }
> +
> +static int intel_iommu_bind_pasid_table(struct iommu_domain *domain,
> + struct device *dev, struct pasid_table_info *pasidt_binfo)
> +{
> + struct intel_iommu *iommu;
> + struct context_entry *context;
> + struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> + struct device_domain_info *info;
> + struct pci_dev *pdev;
> + u8 bus, devfn;
> + u16 did, *sid;
> + int ret = 0;
> + unsigned long flags;
> + u64 ctx_lo;
> +
> + if (pasidt_binfo == NULL || pasidt_binfo->model != IOMMU_MODEL_INTEL_VTD) {
> + pr_warn("%s: Invalid bind request!\n", __func__);
> + return -EINVAL;
> + }
> +
> + iommu = device_to_iommu(dev, &bus, &devfn);
> + if (!iommu)
> + return -ENODEV;
> +
> + sid = (u16 *)&pasidt_binfo->opaque;
> + /*
> +  * check SID, if it is not correct, return success to allow looping
> +  * through all devices within a group
> +  */

Can you elaborate on this comment, since it's related to caller behavior?
Also, better to move it earlier, as a comment for the whole function...

> + if (PCI_DEVID(bus, devfn) != *sid)
> + return 0;
> +
> + if (!dev || !dev_is_pci(dev))
> + return -ENODEV;
> +
> + pdev = to_pci_dev(dev);
> + if (!pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PRI))
> + return -EINVAL;
> +
> + info = dev->archdata.iommu;
> + if (!info || !info->pasid_supported ||
> + !pci_enable_pasid(pdev, info->pasid_supported & ~1)) {
> + pr_err("PCI %04x:%02x:%02x.%d: has no PASID support\n",
> +pci_domain_nr(pdev->bus), bus, PCI_SLOT(devfn),
> +PCI_FUNC(devfn));
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + if (pasidt_binfo->size > intel_iommu_get_pts(iommu)) {
> + pr_err("Invalid gPASID table size %llu, host size %lu\n",
> + pasidt_binfo->size,
> + intel_iommu_get_pts(iommu));
> + ret = -EINVAL;
> + goto out;
> + }
> + spin_lock_irqsave(&iommu->lock, flags);
> + context = iommu_context_addr(iommu, bus, devfn, 0);
> + if (!context || !context_present(context)) {
> + pr_warn("%s: ctx not present for bus devfn %x:%x\n",
> + __func__, bus, devfn);
> + spin_unlock_irqrestore(&iommu->lock, flags);
> + goto out;
> + }
> +
> + /* Anticipate guest to use SVM and owns the first level */
> + ctx_lo = context[0].lo;
> + ctx_lo |= CONTEXT_NESTE;
> + ctx_lo |= CONTEXT_PRS;
> + ctx_lo |= CONTEXT_PASIDE;
> + ctx_lo &= ~CONTEXT_TT_MASK;
> + ctx_lo |= CONTEXT_TT_DEV_IOTLB << 2;
> + context[0].lo = ctx_lo;
> +
> + /* Assign guest PASID table pointer and size */
> + ctx_lo = (pasidt_binfo->ptr & VTD_PAGE_MASK) | pasidt_binfo->size;
> + context[1].lo = ctx_lo;
> + /* make sure context entry is updated before flushing */
> + wmb();
> + did = dmar_domain->iommu_did[iommu->seq_id];
> + iommu->flush.flush_context(iommu, did,
> + (((u16)bus) << 8) | devfn,
> + DMA_CCMD_MASK_NOBIT,
> + DMA_CCMD_DEVICE_INVL);
> + iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
> + spin_unlock_irqrestore(&iommu->lock, flags);
> +
> +
> +out:
> + return ret;
> +}
> +
> +static int intel_iommu_unbind_pasid_table(struct iommu_domain *domain,
> + struct device *dev)
> +{
> + struct intel_iommu *iommu;
> + struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> + u8 bus, devfn;
> +
> + iommu = 

RE: [PATCH 3/9] iommu: Introduce iommu do invalidate API function

2017-07-05 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Thursday, June 29, 2017 1:08 AM
> 
> On 28/06/17 17:09, Jacob Pan wrote:
> > On Wed, 28 Jun 2017 12:08:23 +0200
> > Joerg Roedel  wrote:
> >
> >> On Tue, Jun 27, 2017 at 12:47:57PM -0700, Jacob Pan wrote:
> >>> From: "Liu, Yi L" 
> >>>
> >>> When a SVM capable device is assigned to a guest, the first level
> >>> page tables are owned by the guest and the guest PASID table
> >>> pointer is linked to the device context entry of the physical IOMMU.
> >>>
> >>> Host IOMMU driver has no knowledge of caching structure updates
> >>> unless the guest invalidation activities are passed down to the
> >>> host. The primary usage is derived from emulated IOMMU in the
> >>> guest, where QEMU can trap invalidation activities before passing them
> >>> down to the host/physical IOMMU. There are IOMMU architectural
> >>> specific actions need to be taken which requires the generic APIs
> >>> introduced in this patch to have opaque data in the
> >>> tlb_invalidate_info argument.
> >>
> >> Which "IOMMU architectural specific actions" are you thinking of?
> >>
> > construction of queued invalidation descriptors, then submit them to
> > the IOMMU QI interface.
> >>> +int iommu_invalidate(struct iommu_domain *domain,
> >>> + struct device *dev, struct tlb_invalidate_info
> >>> *inv_info) +{
> >>> + int ret = 0;
> >>> +
> >>> + if (unlikely(!domain->ops->invalidate))
> >>> + return -ENODEV;
> >>> +
> >>> + ret = domain->ops->invalidate(domain, dev, inv_info);
> >>> +
> >>> + return ret;
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(iommu_invalidate);
> >>
> >> [...]
> >>
> >>> +struct tlb_invalidate_info {
> >>> + __u32   model;
> >>> + __u32   length;
> >>> + __u8    opaque[];
> >>> +};
> >>
> >> This interface is aweful. It requires the user of a generic api to
> >> know details about the implementation behind to do anything useful.
> >>
> >> Please explain in more detail why this is needed. My feeling is that
> >> we can make this more generic with a small set of invalidation
> >> functions in the iommu-api.

A curious question here. Joerg, which part, based on the information below,
could be generalized in your mind? Previously I also preferred defining
a common structure. However, I later realized there is little code logic
which can be further abstracted to use that structure, since the main
task here is just to construct the vendor-specific invalidation descriptor
upon request...

> >>
> > My thinking was that via configuration control, there will unlikely be
> > any mixed IOMMU models between pIOMMU and vIOMMU. We could have
> > just model-specific data passed through layers of SW (QEMU, VFIO) for
> > performance reasons. We do have an earlier hybrid version that has
> > generic data and opaque raw data. Would the below work for all IOMMU
> > models?
> 
> For reference, this was also discussed in the initial posting of the series:
> https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg03452.html
> 
> At least for ARM SMMUv2 and v3, I think the invalidation format you
> propose should be sufficient, although "device_selective" should probably
> be "domain_selective". And maybe a flag field could contain relatively
> generic hints such as "only invalidate leaf table when page_selective".
> 
> Thanks,
> Jean
> 
> > https://www.spinics.net/lists/kvm/msg148798.html
> >
> > struct tlb_invalidate_info
> > {
> > __u32   model;  /* Vendor number */
> > __u8 granularity;
> > #define DEVICE_SELECTIVE_INV (1 << 0)
> > #define PAGE_SELECTIVE_INV  (1 << 1)
> > #define PASID_SELECTIVE_INV (1 << 2)
> > __u32 pasid;
> > __u64 addr;
> > __u64 size;
> >
> > /* Since IOMMU format has already been validated for this table,
> >the IOMMU driver knows that the following structure is in a
> >format it knows */
> > __u8 opaque[];
> > };
> >

I just gave some information in another thread:

https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg00853.html

Below is a summary of all the invalidation capabilities supported by Intel
VT-d:

Scope: all PASIDs, or a single PASID
    for each PASID:
        all mappings, or page-selective mappings (addr, size)
Invalidation target:
    IOTLB entries (leaf)
    paging structure cache (non-leaf)
    PASID cache (pasid->cr3)
Invalidation hints:
    whether global pages are included
    drain reads/writes

(Jean, you may add ARM specific capabilities here)

If we want to define a common structure, do we go with a superset
of all possible capabilities from all vendors (no opaque field then), or
only a subset used by some common IOMMU abstraction?
The latter depends on what exactly needs to be generalized, which needs
to be solved first; otherwise it's difficult to judge why the proposed
format is necessary and sufficient...

Thanks
Kevin

RE: [Qemu-devel] [RFC PATCH 7/8] VFIO: Add new IOCTL for IOMMU TLB invalidate propagation

2017-07-05 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Thursday, July 6, 2017 1:28 AM
> 
> On Wed, 5 Jul 2017 13:42:03 +0100
> Jean-Philippe Brucker  wrote:
> 
> > On 05/07/17 07:45, Tian, Kevin wrote:
> > >> From: Liu, Yi L
> > >> Sent: Monday, July 3, 2017 6:31 PM
> > >>
> > >> Hi Jean,
> > >>
> > >>
> > >>>
> > >> 2. Define a structure in include/uapi/linux/iommu.h (newly added
> > >> header file)
> > >>>>
> > >>>> struct iommu_tlb_invalidate {
> > >>>>__u32   scope;
> > >>>> /* pasid-selective invalidation described by @pasid */
> > >>>> #define IOMMU_INVALIDATE_PASID (1 << 0)
> > >> /* address-selective invalidation described by (@vaddr, @size) */
> > >>>> #define IOMMU_INVALIDATE_VADDR (1 << 1)
> > >
> > > For VT-d the above two flags are related. There is no method of flushing
> > > (@vaddr, @size) for all PASIDs, which wouldn't make sense anyway:
> > > address-selective invalidation is valid only for a given PASID. So it's
> > > not appropriate to put them at the same level of scope definition, at
> > > least for VT-d.
> >
> > For ARM SMMU the "flush all by VA" operation is valid. Although it's
> > unclear at this point if we will ever allow that, it should probably stay
> > in the common format, if there is one.
> >
> > >>>>__u32   flags;
> > >>>> /*  targets non-pasid mappings, @pasid is not valid */
> > >>>> #define IOMMU_INVALIDATE_NO_PASID  (1 << 0)
> > >>>
> > >>> Although it was my proposal, I don't like this flag. In ARM SMMU, we're
> > >>> using a special mode where PASID 0 is reserved and any traffic without
> > >>> PASID uses entry 0 of the PASID table. So I proposed the "NO_PASID"
> flag
> > >>> to invalidate that special context explicitly. But this means that
> > >>> invalidation packet targeted at that context will have "scope = PASID"
> and
> > >>> "flags = NO_PASID", which is utterly confusing.
> > >>>
> > >>> I now think that we should get rid of the
> IOMMU_INVALIDATE_NO_PASID
> > >> flag
> > >>> and just use PASID 0 to invalidate this context on ARM. I don't think
> > >>> other architectures would use the NO_PASID flag anyway, but might be
> > >> mistaken.
> > >>
> > >> I may suggest to keep it so far. On VT-d, we may pass some data in
> opaque,
> > >> so
> > >> we may work without it. But if other vendor want to issue non-PASID
> tagged
> > >> cache, then may encounter problem.
> > >
> > > I'm worried about the criteria for deciding which attributes should
> > > be abstracted in the common structure and which can be left to opaque
> > > data. It doesn't make much sense to do such abstraction purely
> > > because different vendor formats have some common fields. Usually we
> > > do such abstraction because vendor-agnostic code needs to do some
> > > common handling before going to vendor-specific code. However in this
> > > case VFIO is not expected to do anything with those IOMMU-specific
> > > attributes. The structure is directly forwarded to the IOMMU driver,
> > > which simply translates it into vendor-specific opaque data again.
> > > Then why bother doing double translations on the Qemu and IOMMU
> > > driver sides?
> > >
> > > Take VT-d for example. Below is a summary of all possible selections
> > > around invalidation of 1st level structures for SVM:
> > >
> > > Scope: All PASIDs, single PASID
> > > for each PASID:
> > >   all mappings, or page-selective mappings (addr, size)
> > > invalidation target:
> > >   IOTLB entries (leaf)
> > >   paging structure cache (non-leaf)
> >
> > I'm curious, can you invalidate all intermediate paging structures for a
> > given PASID without invalidating the leaves?
> >
> > >   PASID cache (pasid->cr3)
> > I guess any implementations that gives the whole PASID table to userspace
> > will need the PASID cache invalidation. This was missing from my proposal
> > since it was from virtio-iommu.
> >
> > > invalidation hint:
> > >   whether global pages are included
> > >   drain reads/writes

RE: [Qemu-devel] [RFC PATCH 7/8] VFIO: Add new IOCTL for IOMMU TLB invalidate propagation

2017-07-05 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Wednesday, July 5, 2017 8:42 PM
> 
> On 05/07/17 07:45, Tian, Kevin wrote:
> >> From: Liu, Yi L
> >> Sent: Monday, July 3, 2017 6:31 PM
> >>
> >> Hi Jean,
> >>
> >>
> >>>
> >>>> 2. Define a structure in include/uapi/linux/iommu.h(newly added
> header
> >> file)
> >>>>
> >>>> struct iommu_tlb_invalidate {
> >>>>  __u32   scope;
> >>>> /* pasid-selective invalidation described by @pasid */
> >>>> #define IOMMU_INVALIDATE_PASID   (1 << 0)
> >>>> /* address-selective invalidation described by (@vaddr, @size) */
> >>>> #define IOMMU_INVALIDATE_VADDR   (1 << 1)
> >
> > For VT-d the above two flags are related. There is no method of
> > flushing (@vaddr, @size) for all PASIDs, which wouldn't make sense
> > anyway; address-selective invalidation is valid only for a given
> > PASID. So it's not appropriate to put them at the same level of the
> > scope definition, at least for VT-d.
> 
> For ARM SMMU the "flush all by VA" operation is valid. Although it's
> unclear at this point if we will ever allow that, it should probably stay
> in the common format, if there is one.

Fine to keep it in the common format. Earlier I was wondering whether it
should be part of the scope definition; after further thought it's
probably fine either way. :-)

> 
> >>>>  __u32   flags;
> >>>> /*  targets non-pasid mappings, @pasid is not valid */
> >>>> #define IOMMU_INVALIDATE_NO_PASID(1 << 0)
> >>>
> >>> Although it was my proposal, I don't like this flag. In ARM SMMU, we're
> >>> using a special mode where PASID 0 is reserved and any traffic without
> >>> PASID uses entry 0 of the PASID table. So I proposed the "NO_PASID" flag
> >>> to invalidate that special context explicitly. But this means that
> >>> invalidation packet targeted at that context will have "scope = PASID"
> and
> >>> "flags = NO_PASID", which is utterly confusing.
> >>>
> >>> I now think that we should get rid of the
> IOMMU_INVALIDATE_NO_PASID
> >> flag
> >>> and just use PASID 0 to invalidate this context on ARM. I don't think
> >>> other architectures would use the NO_PASID flag anyway, but might be
> >> mistaken.
> >>
> >> I may suggest to keep it so far. On VT-d, we may pass some data in
> opaque,
> >> so
> >> we may work without it. But if other vendor want to issue non-PASID
> tagged
> >> cache, then may encounter problem.
> >
> > I'm worried about the criteria for deciding which attributes should be
> > abstracted in the common structure and which can be left to opaque
> > data. It doesn't make much sense to do such abstraction purely because
> > different vendor formats have some common fields. Usually we do such
> > abstraction because vendor-agnostic code needs to do some common
> > handling before going to vendor-specific code. However in this case
> > VFIO is not expected to do anything with those IOMMU-specific
> > attributes. The structure is directly forwarded to the IOMMU driver,
> > which simply translates it into vendor-specific opaque data again.
> > Then why bother doing double translations on the Qemu and IOMMU driver
> > sides?
> >
> > Take VT-d for example. Below is a summary of all possible selections
> > around invalidation of 1st level structures for SVM:
> >
> > Scope: All PASIDs, single PASID
> > for each PASID:
> > all mappings, or page-selective mappings (addr, size)
> > invalidation target:
> > IOTLB entries (leaf)
> > paging structure cache (non-leaf)
> 
> I'm curious, can you invalidate all intermediate paging structures for a
> given PASID without invalidating the leaves?

I don't think so. Usually the IOTLB flush is the base operation; one can
further specify whether the flush should also apply to non-leaf
structures.

Thanks
Kevin


RE: [RFC PATCH 0/6] Auxiliary IOMMU domains and Arm SMMUv3

2018-10-21 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Saturday, October 20, 2018 2:12 AM
> 
> This is a first prototype adding auxiliary domain support to Arm SMMUv3,
> following Lu Baolu's latest proposal for IOMMU aware mediated devices
> [1]. It works, but the attach() API still doesn't feel right. See (2)
> below.
> 
> Patch 1 adapts iommu.c to the current proposal for auxiliary domains.
> Patches 2-4 rework the PASID allocator to make it usable for SVA and
> AUXD simultaneously. Patches 5-6 add AUXD support to SMMUv3.
> 
> 
> When a device can have multiple address space, for instance with PCI
> PASID, an auxiliary domain (AUXD) is the IOMMU representation of one
> address space. I distinguish auxiliary from "main" domain, which
> represents the non-PASID address space but also (at least for SMMUv3)
> the whole device context, PASID tables etc.

I'd like to clarify a bit. :-)

A domain can always represent one or more address spaces, where an
address space could be for IOVA or GPA, and/or other address spaces
for SVA. Address spaces may be chained together in so-called nested mode.

A domain can be associated with a full device (BDF-granular), and/or
with a partition of a device (say, PASID-granular). In the former case,
the domain becomes the master (or primary) domain of the device. In the
latter case, the domain becomes an auxiliary domain of the device.

A vendor domain structure may include vendor-specific state which is
applied to the device context at attach time, e.g. the PASID table
(SMMUv3) when representing a master domain, or the PASID value (VT-d
scalable mode) when representing an auxiliary domain.

So the domain definition is never changed by the introduction of AUXD.
'Auxiliary' is a per-device attribute which takes effect at domain
attach time. It is perfectly sane for a domain to act as a master domain
for deviceA and an auxiliary domain for deviceB at the same time (say
when deviceA and an mdev on deviceB are assigned to the same VM), while
supporting one or more address spaces.

I explained the above concept in my KVM forum session:

https://events.linuxfoundation.org/wp-content/uploads/2017/12/Hardware-Assisted-Mediated-Pass-Through-with-VFIO-Kevin-Tian-Intel.pdf
 (slide 16/17)
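To make the "same domain, different per-device attribute" point concrete,
here is a hedged sketch; devA/devB are placeholder device pointers, and
iommu_aux_attach_device()/iommu_aux_get_pasid() are API names proposed
later in this thread, not existing kernel interfaces at this point:

	struct iommu_domain *dom;
	int ret, pasid;

	dom = iommu_domain_alloc(&pci_bus_type);

	/* dom acts as the master (primary) domain of devA: whole device */
	ret = iommu_attach_device(dom, devA);

	/* the same dom acts as an auxiliary domain of devB: PASID-granular */
	ret = iommu_aux_attach_device(dom, devB);

	/* the PASID carrying dom's mappings on devB, programmed into the mdev */
	pasid = iommu_aux_get_pasid(dom, devB);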

> 
> Auxiliary domains will be used by VFIO for IOMMU-aware mdev, and by
> any
> other device driver that wants to use PASID for private address spaces
> (as opposed to SVA [2]). The following API is available to device
> drivers:
> 
> (1) Enable AUXD for a device. Enable PASID if necessary and set an AUXD
> flag on the IOMMU data associated to a device.
> 
> For my own convenience I've been using the SVA infrastructure since
> I already had the locking and IOMMU ops in place. The proposed
> interface is also missing min_pasid and max_pasid parameters, which
> could be needed by device drivers to enforce PASID limits.
> iommu_sva_init_device() without arguments already enables PASID, so
> I just added an AUXD flag to SVA features:
> 
>   iommu_sva_init_device(dev, IOMMU_SVA_FEAT_AUXD,
> min_pasid, max_pasid, NULL)
>   iommu_sva_shutdown_device(dev)
> 
> Or as proposed in [1]:
> 
>   iommu_set_dev_attr(dev, IOMMU_DEV_ATTR_AUXD_ENABLE, NULL)
>   iommu_set_dev_attr(dev, IOMMU_DEV_ATTR_AUXD_DISABLE, NULL)
> 
> Finding a compromise for this interface should be easy.
> 
> (2) Allocate a domain and attach it to the device.
> 
>   dom = iommu_domain_alloc()
>   iommu_attach_device(dom, dev)
> 
> I still have concerns about this part, which are highlighted by the
> messy changes of patch 1. I think it would make more sense to
> introduce new attach/detach_dev_aux() functions instead of reusing
> attach/detach_dev()
> 
> Can we reconsider this and avoid unnecessary complications in IOMMU
> core and drivers? Does the VFIO code greatly benefit from using the
> same attach() function? It could as well use a different one for
> devices in AUXD mode, which the mediating driver could tell by
> adding a flag in mdev_set_iommu_device(), for example.

Baolu gave some recommendations to patch 1. Please check whether it can
help reduce the mess.

IMO using same API is conceptually clearer to VFIO... let's gather in KVM
forum to have a conclusion, with Alex being there.

Thanks
Kevin

> 
> And I don't think other users of AUXD would benefit from using the
> same attach() function, since they will know whether they want to be
> using main or auxiliary domain when doing attach().
> 
> (3) Get the PASID, and program it in the device
> 
>   iommu_domain_get_attr(dom, DOMAIN_ATTR_AUXD_ID, &pasid)
> 
> (4) Create DMA mappings
> 
>   iommu_map(dom, ...)
>   iommu_unmap(dom, ...)
> 
> Ultimately it would be nice to add PASID support to the DMA API as
> well. For now drivers allocate IOVAs and pages themselves.
> 
> 
> For vfio-mdev, a driver that wants to create mdevs o

RE: [RFC PATCH 0/6] Auxiliary IOMMU domains and Arm SMMUv3

2018-10-23 Thread Tian, Kevin
> From: Raj, Ashok
> Sent: Wednesday, October 24, 2018 1:17 AM
> 
> >
> > But that's not reason enough to completely disable PASID for the
> > device,
> > it could be the only one bound to that process, or PASID could be
> > only
> > used privately by the host device driver.
> 
> Agree, there could be other use cases.
> 
> If the device was already attached during boot the driver comes early
> to get the low PASID space. If the device was hot-added and the PASID
> supported by device wasn't available its going to fail.
> 
> Enforcing something that will always work will be more reliable. But i
> agree it maybe too strict for some cases.
> 
> Maybe its a IOMMU enforced limit for the platform on the minimum
> requirement for consistency.
> 

What about keeping the flexibility in the common logic (e.g. allowing
pasid_min and pasid_max in the PASID allocator, detecting PASID width
limitations at bind time, etc.), while letting the vendor driver vote to
disable PASID on a device based on its own needs (e.g. if it doesn't
support the full 20-bit width)?
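As a sketch of that flexibility, using the ioasid_alloc() range interface
that appears later in this thread; the 16-bit limit and the bind-time
context are hypothetical:

	/* bind time: restrict allocation to the PASID width the device supports */
	ioasid_t pasid = ioasid_alloc(set, 1, (1 << 16) - 1, priv);

	if (pasid == INVALID_IOASID)
		return -ENOSPC;	/* device cannot get a usable PASID */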

Thanks
kevin


RE: [RFC PATCH 2/6] drivers core: Add I/O ASID allocator

2018-10-23 Thread Tian, Kevin
> From: Lu Baolu [mailto:baolu...@linux.intel.com]
> Sent: Tuesday, October 23, 2018 2:57 PM
> 
> Hi,
> 
> On 10/22/18 6:22 PM, Raj, Ashok wrote:
> > On Mon, Oct 22, 2018 at 12:49:47PM +0800, Lu Baolu wrote:
> >> Hi,
> >>
> >> On 10/20/18 2:11 AM, Jean-Philippe Brucker wrote:
> >>> Some devices might support multiple DMA address spaces, in particular
> >>> those that have the PCI PASID feature. PASID (Process Address Space ID)
> >>> allows to share process address spaces with devices (SVA), partition a
> >>> device into VM-assignable entities (VFIO mdev) or simply provide
> >>> multiple DMA address space to kernel drivers. Add a global PASID
> >>> allocator usable by different drivers at the same time. Name it I/O ASID
> >>> to avoid confusion with ASIDs allocated by arch code, which are usually
> >>> a separate ID space.
> >>>
> >>> The IOASID space is global. Each device can have its own PASID space,
> >>> but by convention the IOMMU ended up having a global PASID space,
> so
> >>> that with SVA, each mm_struct is associated to a single PASID.
> >>>
> >>> The allocator doesn't really belong in drivers/iommu because some
> >>> drivers would like to allocate PASIDs for devices that aren't managed by
> >>> an IOMMU, using the same ID space as IOMMU. It doesn't really
> belong in
> >>> drivers/pci either since platform device also support PASID. Add the
> >>> allocator in drivers/base.
> >>
> >> One concern of moving pasid allocator here is about paravirtual
> >> allocation of pasid.
> >>
> >> Since there is only a single set of pasid tables which is controlled by
> >
> > Minor correction: Single system wide PASID namespace, but PASID tables
> > would be created ideally per-bdf for isolation purposes.
> >
> > I'm sure you meant name space, but didn't want that to be mis-
> interpreted.
> 
> Yes, confirmed.
> 

The above essentially means that multiple IOASID allocators may exist.
There needs to be a way for the IOMMU driver to choose which one to use,
e.g. for the current two usages the core allocator will be used on the
host and also in a VM (non-VT-d cases), while a paravirtual allocator
will be used in a VM with virtual VT-d.
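As a rough sketch of how a guest-side IOMMU driver might plug in a
paravirtual allocator through the ioasid_register_allocator() interface
referenced above; the vcmd_*() calls standing in for the virtual VT-d
backend are hypothetical:

#include <linux/ioasid.h>

static ioasid_t pv_ioasid_alloc(ioasid_t min, ioasid_t max, void *data)
{
	/* hypothetical: ask the host through the virtual VT-d interface */
	return vcmd_alloc_pasid(min, max);
}

static void pv_ioasid_free(ioasid_t ioasid, void *data)
{
	vcmd_free_pasid(ioasid);
}

static struct ioasid_allocator_ops pv_ops = {
	.alloc	= pv_ioasid_alloc,
	.free	= pv_ioasid_free,
};

static int __init pv_pasid_init(void)
{
	/* the guest IOMMU driver picks the pv allocator over the core one */
	return ioasid_register_allocator(&pv_ops);
}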

Thanks
Kevin


RE: [RFC PATCH 2/6] drivers core: Add I/O ASID allocator

2018-11-21 Thread Tian, Kevin
> From: Koenig, Christian
> Sent: Thursday, November 22, 2018 3:10 AM
> 
> Am 21.11.18 um 12:16 schrieb Jean-Philippe Brucker:
> > On 12/11/2018 14:40, Joerg Roedel wrote:
> >> Hi Jean-Philippe,
> >>
> >> On Fri, Oct 19, 2018 at 07:11:54PM +0100, Jean-Philippe Brucker wrote:
> >>> The allocator doesn't really belong in drivers/iommu because some
> >>> drivers would like to allocate PASIDs for devices that aren't managed by
> >>> an IOMMU, using the same ID space as IOMMU. It doesn't really
> belong in
> >>> drivers/pci either since platform device also support PASID. Add the
> >>> allocator in drivers/base.
> >> I don't really buy this argument, in the end it is the IOMMU that
> >> enforces the PASID mappings for a device. Why should a device not
> behind
> >> an IOMMU need to allocate a pasid? Do you have examples of such
> devices
> >> which also don't have their own iommu built-in?
> > I misunderstood that requirement. Reading through the thread again
> > (https://www.mail-archive.com/iommu@lists.linux-
> foundation.org/msg25640.html)
> > it's more about a device using PASIDs as context IDs. Some contexts are
> > not bound to processes but they still need their ID in the same PASID
> > space as the contexts that are bound to process address spaces. So we
> > can keep that code in drivers/iommu
> 
> The problem with that approach is that we also need this allocator when
> IOMMU is completely disabled.
> 
> In other words PASIDs are used as contexts IDs by the hardware for
> things like signaling which application has caused an interrupt/event
> even when they are not used by IOMMU later on.
> 
> Additional to that we have some MMUs which are internal to the devices
> (GPUVM on AMD GPUs for example, similar exists for NVidia) which uses
> PASIDs for address translation internally in the device.
> 
> All of that is completely unrelated to IOMMU, but when IOMMU is enabled
> you need to use the same allocator because all use cases use the same ID
> space.
> 

IMO a context ID is a resource used internally by the device, while a
PASID is something informational to the external world (as defined in the
PCI bus packet). Ideally the two are separate in hardware but can be
associated (e.g. PASID as an optional parameter per context). That is
what I know for Intel GPUs.

For your device, is "PASIDs are used as context IDs" a software
implementation policy or an actual hardware requirement (i.e. no
dedicated PASID recording)?

Thanks
Kevin


RE: [RFC PATCH 0/6] Auxiliary IOMMU domains and Arm SMMUv3

2018-11-22 Thread Tian, Kevin
> From: j...@8bytes.org [mailto:j...@8bytes.org]
> Sent: Monday, November 12, 2018 10:56 PM
> 
> Hi Jean-Philippe,
> 
> On Thu, Nov 08, 2018 at 06:29:42PM +, Jean-Philippe Brucker wrote:
> > (1) My initial approach would have been to use the same page tables for
> > the default_domain and this new domain, but it might be precisely what
> > you were trying to avoid with this new proposal: a device attached to
> > two domains at the same time.
> 
> My request was about the initial assumptions a device driver can make
> about a device. This assumptions is that DMA-API is set up and
> initialized for it, for everything else (like SVA) the device driver
> needs to take special action, like allocating an SVA domain and attaching
> the device to it.

Hi, Joerg,

I agree special action needs to be taken for everything else (other than
the DMA-API), but the point I didn't get is why the action must be based
on a new SVA-type domain, instead of extending the default domain with
SVA capability (including the related paging structures, accessed through
the new SVA APIs). In the latter case, domain-wise attributes (e.g. IRQ
mappings) are naturally shared between capabilities (DMA-API and SVA),
and there is no need to create cross-domain connections like the two
options you listed below.

Can you help elaborate more on the motivation behind the proposal?

P.S. As you may see from other threads, we support the same IOMMU
capabilities (both IOVA and SVA) on vfio-mdev (i.e. aux domain) as on
vfio-pci, which leads to the situation below following your proposal,
if 'AUX' is treated as a separate capability like SVA:

|-default domain (DMA-API)
|-sva domain1 (SVA)
|-sva domain2 (SVA)
|-...
|-sva domainN (AUX, guest DMA-API)
|   |- sva domainN1 (AUX, guest SVA)
|   |- sva domainN2 (AUX, guest SVA)
|   |-...
|-sva domainM (AUX, guest DMA-API)
|   |- sva domainM1 (AUX, guest SVA)
|   |- sva domainM2 (AUX, guest SVA)
|   |-...

Is that what you have in mind? Would it cause unnecessary complexity in
handling those layers?

The hierarchy could be simplified if we allow aux domain to carry both 
DMA-API and SVA capability:

|-default domain (DMA-API)
|-sva domain1 (SVA)
|-sva domain2 (SVA)
|-...
|-sva domainN (AUX, guest DMA-API, guest SVA)
|-sva domainM (AUX, guest DMA-API, guest SVA)

However, doing so causes inconsistency in the domain definition: why can
an aux domain have full capability while the default domain cannot?

Further moving along that road, we could have below (as proposed in
this series):

|-default domain (DMA-API, SVA)
|-sva domainN (AUX, guest DMA-API, guest SVA)
|-sva domainM (AUX, guest DMA-API, guest SVA)
(whether putting aux domains as sub-domain or same level is not big deal)

There, each domain could support full DMA translation capabilities (IOVA,
SVA, GPA, GIOVA, etc.) with the supporting structures together (1st
level, 2nd level, IRQ mapping, etc.), and new APIs are invented to
utilize the new capabilities beyond the existing DMA-API in each domain.
'AUX' here becomes an attribute orthogonal to those DMA translation
capabilities, chosen at domain attach time based on the device
configuration. This means the same domain can be default on deviceA while
being aux on deviceB (when assigning a PCI device and an mdev to the same
VM).

Thoughts? :-)

> 
> This should of course not break any IRQ setups for the device, and also
> enforcing an ordering is not a good and maintainable solution.
> 
> So what you could do here is to either:
> 
>   1) Save the needed IRQ mappings in some extra datastructure and
>  duplicate it in the SVA domain. This also makes it easier to
>  use the same SVA domain with multiple devices.
> 
>   2) Just re-use the 'page-tables' from the default domain in the
>  sva-domain. This needs to happen at attach-time, because at
>  allocation time you don't know the device yet.
> 
> I think 1) is the best solution, what do you think?
> 
> Btw, things would be different if we could expose SVA through the
> DMA-API to drivers. In this case we could just make the default domain
> of type SVA and be done, but we are not there yet.
> 
> Regards,

Thanks
Kevin


RE: [RFC PATCH 0/6] Auxiliary IOMMU domains and Arm SMMUv3

2018-11-25 Thread Tian, Kevin
> From: j...@8bytes.org [mailto:j...@8bytes.org]
> Sent: Friday, November 23, 2018 7:21 PM
> 
> On Wed, Nov 21, 2018 at 12:40:44PM +0800, Lu Baolu wrote:
> > Can you please elaborate a bit more about the concept of subdomains?
> > From my point of view, an aux-domain is a normal un-managed domain
> which
> > has a PASID and could be attached to any ADIs through the aux-domain
> > specific attach/detach APIs. It could also be attached to a device with
> > the existing domain_attach/detach_device() APIs at the same time, hence
> > mdev and pci devices in a vfio container could share a domain.
> 
> Okay, let's think a bit about having aux-specific attach/detach
> functions, in the end I don't insist on my proposal as long as the
> IOMMU-API extensions are clean, consistent, and have well defined
> semantics.
> 
> If we have aux-domain specific attach/detach functions like
> iommu_aux_domain_attach/detach(), what happens when the primary
> domain
> of the device is changed with iommu_attach/detach()?
> 
>   1) Will the aux-domains stay in place? If yes, how does this
>  work in setups where the pasid-bound page-tables are
>  guest-owned and translated through the primary domain
>  page-tables?
> 
>   2) Will the aux-domains be unbound too? In that case, if the
>  primary domain is re-used, will the aux-domains be implicitly
>  bound too when iommu_device_attach() is called?
> 
>   3) Do we just disallow changing the primary domain through that
>  interface as long as there are aux-domains or mm_structs
>  bound to the device?

3) sounds like the right option.

IMO the primary domain represents the primary ownership of the DMA
capability of the whole device. The owner (say a host driver) may lend
partial capability (e.g. by allocating some queues) to a sub-owner
(through an aux domain to a VM, or an mm_struct to a process), while
ownership is still claimed by this owner.

A primary domain switch means an ownership change (e.g. from host driver
to guest driver if the type is changed from DMA-API to UNMANAGED), which
usually implies driver load/unload; thus all resource usages from the
previous owner must be cleaned up before the actual switch happens,
including things allocated through the DMA-API, the aux domain API, and
the SVA APIs.

> 
> Using option 2) or 3) would mean that the aux-domains are still bound to
> the primary domain, but that is not reflected in the API. Further, if an

Do you mean some reflection in the API interface definition, or in a
specific API implementation? If the former, I didn't get how to do it.
If the latter, yes, it's missing in the current RFC; we should add the
check in the necessary code paths.

> aux-domain is just a normal domain (struct iommu_domain), what
> happens
> when a domain that was used as a primary domain and has bound
> aux-domains to it, is bound with iommu_aux_domain_attach() to another
> domain?

An aux domain is essentially a way to lend partial DMA capability to a
less-privileged entity (a process or a VM), which implies that that part
of the DMA is not managed by the kernel driver. If we can make the
assumption that aux only applies to UNMANAGED domains (thus disallowing
BLOCKED/IDENTITY/DMA domains from being used as aux), then the above open
doesn't exist: if the primary domain is of UNMANAGED type, it essentially
means the whole device is fully owned by user space or a guest, so there
is no host driver entity to ever create an aux domain. If the primary
domain is not of UNMANAGED type, the aux API will fail. With this design,
each device has at most one primary and multiple aux domains, i.e. just
two layers; no aux-bind-to-aux thing. The check could be enforced
directly in the aux attach path, as sketched below.

This assumption should hold for the usages discussed around aux domains,
but I may be overlooking something... Jean?
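A minimal sketch of that check; the function name follows Joerg's earlier
proposal and the internal helper is hypothetical:

int iommu_aux_attach_device(struct iommu_domain *domain, struct device *dev)
{
	/*
	 * Aux binding only makes sense for user/guest managed address
	 * spaces; rejecting BLOCKED/IDENTITY/DMA domains keeps the
	 * two-layer rule (one primary, multiple aux) intact.
	 */
	if (domain->type != IOMMU_DOMAIN_UNMANAGED)
		return -EINVAL;

	return __iommu_aux_attach(domain, dev);	/* hypothetical helper */
}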

> 
> As you can see, this design decission raises a lot of questions, but
> maybe we can work it out and find a solution we all agree on.
> 

Thanks for raising those good opens!

Thanks
Kevin


RE: [RFC PATCH 0/6] Auxiliary IOMMU domains and Arm SMMUv3

2018-11-25 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Monday, November 26, 2018 11:01 AM
[...]
> > aux-domain is just a normal domain (struct iommu_domain), what
> > happens
> > when a domain that was used as a primary domain and has bound
> > aux-domains to it, is bound with iommu_aux_domain_attach() to another
> > domain?
> 
> aux domain is essentially a way to lend partial DMA capability to less-
> privileged entity (process or VM), which implies that part of DMA
> unmanaged by kernel driver. If we can make such assumption that aux
> only applies to UNMANAGED domain (thus disallow BLOCKED/IDENTITY/
> DMA from using aux), then above open doesn't exist, because if primary
> domain is UNMANAGED type it essentially means the whole device fully
> owned by user space or guest thus no host driver entity to ever create
> aux domain. If primary domain is not UNMANAGED type, it will fail in
> aux API. with this design, each device will have at most one primary
> and multiple aux, i.e. just two layers. no aux-bind-to-aux thing.
> 
> this assumption should be true in discussed usages around aux domain.
> but I may overlook something... Jean?
> 

Btw, Baolu just reminded me of one thing worth noting here: the 'primary'
vs. 'aux' concept makes sense only when we look from a device's point of
view. That binding relationship is not (*should not be*) carried and
forwarded across devices. Every domain must be explicitly attached to
other devices (instead of implicitly attached as in the example above),
and the primary/aux attribute on another device is decided at attach
time.

Thanks
Kevin


RE: [RFC PATCH 0/6] Auxiliary IOMMU domains and Arm SMMUv3

2018-12-09 Thread Tian, Kevin
> From: 'j...@8bytes.org' [mailto:j...@8bytes.org]
> Sent: Friday, December 7, 2018 6:29 PM
> 
> Hi,
> 
> On Mon, Nov 26, 2018 at 07:29:45AM +, Tian, Kevin wrote:
> > btw Baolu just reminded me one thing which is worthy of noting here.
> > 'primary' vs. 'aux' concept makes sense only when we look from a device
> > p.o.v. That binding relationship is not (*should not be*) carry-and-
> forwarded
> > cross devices. every domain must be explicitly attached to other devices
> > (instead of implicitly attached being above example), and new
> primary/aux
> > attribute on another device will be decided at attach time.
> 
> Okay, so after all the discussions we had I learned a few more things
> about the scalable mode feature and thought a bit longer about how to
> best support it in the IOMMU-API.

Thanks for thinking through this.

> 
> The concept of sub-domains I initially proposed certainly makes no
> sense, but scalable-mode specific attach/detach functions do. So instead
> of a sub-domain mode, I'd like to propose device-feature sets.

Can I interpret the above as you agreeing with the aux domain concept
(i.e. one device can be linked to multiple domains) in general, and that
now we're just trying to address the remaining opens at the API level?

> 
> The posted patch-set already includes this as device-attributes, but I
> don't like this naming as we are really talking about additional
> feature sets of a device. So how about we introduce this:
> 
>   enum iommu_dev_features {
>   /* ... */
>   IOMMU_DEV_FEAT_AUX,
>   IOMMU_DEV_FEAT_SVA,
>   /* ... */
>   };
> 

Does the above represent whether a device implements the aux/sva
features, or whether a device has been enabled by its driver to support
them?

>   /* Check if a device supports a given feature of the IOMMU-API */
>   bool iommu_dev_has_feature(struct device *dev, enum
> iommu_dev_features *feat);

If the latter, we also need an iommu_dev_set_feature() so the driver can
poke it based on its own configuration. A sketch follows.
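A sketch of that enable/disable pairing on top of Joerg's proposal;
iommu_dev_enable_feature()/iommu_dev_disable_feature() are illustrative
names rather than settled API:

	/* driver decides, e.g. after the user picked Scalable IOV over SR-IOV */
	if (!iommu_dev_has_feature(dev, IOMMU_DEV_FEAT_AUX))
		return -ENODEV;

	ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_AUX);
	if (ret)
		return ret;

	/* ... allocate and attach aux domains ... */

	iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);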

> 
>   /*
>* Only works if iommu_dev_has_feature(dev,
> IOMMU_DEV_FEAT_AUX)
>* returns true
>*
>* Also, as long as domains are attached to a device through
>* this interface, any trys to call iommu_attach_device() should
>* fail (iommu_detach_device() can't fail, so we fail on the
>* tryint to re-attach). This should make us safe against a
>* device being attached to a guest as a whole while there are
>* still pasid users on it (aux and sva).

yes, it makes sense.

>*/
>   int iommu_aux_attach_device(struct iommu_domain *domain,
>   struct device *dev);
> 
>   int iommu_aux_detach_device(struct iommu_domain *domain,
>   struct device *dev);
>   /*
>* I know we are targeting a system-wide pasid-space, so that
>* the pasid would be the same for one domain on all devices,
>* let's just keep the option open to have different
>* pasid-spaces in one system. Also this way we can use it to
>* check whether the domain is attached to this device at all.
>*
>* returns pasid or <0 if domain has no pasid on that device.
>*/
>   int iommu_aux_get_pasid(struct iommu_domain *domain,
>   struct device *dev);
> 
>   /* So we need a iommu_aux_detach_all()? */

for what scenario?

> 
> This concept can also be easily extended for supporting SVA in parallel
> on the same device, with the same constraints regarding the behavior of
> iommu_attach_device()/iommu_detach_device().
> 
> So what do you think about that approach?
> 
> Regards,
> 
>   Joerg
> 
> 


RE: [RFC PATCH 0/6] Auxiliary IOMMU domains and Arm SMMUv3

2018-12-12 Thread Tian, Kevin
> From: 'j...@8bytes.org'
> Sent: Monday, December 10, 2018 4:58 PM
> 
> Hi Kevin,
> 
> On Mon, Dec 10, 2018 at 02:06:44AM +, Tian, Kevin wrote:
> > Can I interpret above as that you agree with the aux domain concept (i.e.
> one
> > device can be linked to multiple domains) in general, and now we're just
> trying
> > to address the remaining open in API level?
> 
> Yes, I thouht about alternatives, but in the end they are all harder to
> use than the aux-domain concept. We just need to make sure that we have
> a clear definition of what the API extension do and how they impact the
> behavior of existing functions.

sounds great!

> 
> > >   enum iommu_dev_features {
> > >   /* ... */
> > >   IOMMU_DEV_FEAT_AUX,
> > >   IOMMU_DEV_FEAT_SVA,
> > >   /* ... */
> > >   };
> > >
> >
> > Does above represent whether a device implements aux/sva features,
> > or whether a device has been enabled by driver to support aux/sva
> > features?
> 
> These represent whether the device together with the IOMMU support
> them,
> basically whether these features are usable via the IOMMU-API.

"device together with IOMMU" or just "IOMMU itself"? 

the former might be OK for sva. As Jean replied in another mail, enabling
it in both IOMMU and device side just consumes some resources, while 
not impacting other usages on that device.

however there is a problem with aux. A device may implement both
SR-IOV and Scalable IOV capabilities, but at any time only one of them
can be enabled. Driver will provide interfaces for end user to choose.
In such case we cannot assume that device-side Scalable-IOV can be
always enabled while IOMMU is in scalable mode.

It works better if we position those features just representing IOMMU 
support only. In that case, aux is related only to scalable mode of IOMMU
itself, which is orthogonal to whether device side supports SIOV.

> 
> 
> >
> > >   /* Check if a device supports a given feature of the IOMMU-API */
> > >   bool iommu_dev_has_feature(struct device *dev, enum
> > > iommu_dev_features *feat);
> >
> > If the latter we also need iommu_dev_set_feature so driver can poke
> > it based on its own configuration.
> 
> Do you mean we need an explict enable-API for the features? I thought
> about that too, but some features clearly don't need it and I didn't
> want to complicate the usage. My assumption was that when the feature is
> available, it is also enabled.
> 
> > >   /* So we need a iommu_aux_detach_all()? */
> >
> > for what scenario?
> 
> Maybe for detaching any PASID users from a device so that it can be
> assigned to a VM as a whole. But I am not sure it is needed.
> 

yes, possibly useful in reset path as Jean pointed out.

Thanks
Kevin


RE: [RFC PATCH 0/6] Auxiliary IOMMU domains and Arm SMMUv3

2018-12-12 Thread Tian, Kevin
> From: 'j...@8bytes.org' [mailto:j...@8bytes.org]
> Sent: Wednesday, December 12, 2018 5:54 PM
> 
> Hi Kevin,
> 
> On Wed, Dec 12, 2018 at 09:31:27AM +, Tian, Kevin wrote:
> > > From: 'j...@8bytes.org'
> > > Sent: Monday, December 10, 2018 4:58 PM
> > > These represent whether the device together with the IOMMU support
> > > them,
> > > basically whether these features are usable via the IOMMU-API.
> >
> > "device together with IOMMU" or just "IOMMU itself"?
> 
> No, it should mean device together with IOMMU support it. It is a way
> for users of the IOMMU-API to check whether they can successfully use
> the aux-specific functions.
> 
> > however there is a problem with aux. A device may implement both
> > SR-IOV and Scalable IOV capabilities, but at any time only one of them
> > can be enabled. Driver will provide interfaces for end user to choose.
> > In such case we cannot assume that device-side Scalable-IOV can be
> > always enabled while IOMMU is in scalable mode.
> >
> > It works better if we position those features just representing IOMMU
> > support only. In that case, aux is related only to scalable mode of IOMMU
> > itself, which is orthogonal to whether device side supports SIOV.
> 
> Yeah, but we don't make that decision in the IOMMU code. Whether the
> device exposes SR-IOV or PASID based isolation is decided in PCI code
> based on user input (SR-IOV is also enabled in PCI code and IOMMU just
> uses the new devices that appear).
> 
> Only if the user enabled scalable mode on the device and the IOMMU
> supports it too, the feature-check function returns true for the aux
> feature.
> 

Then this is another argument for an enable/disable part in the API. :-)

Thanks
kevin


RE: [PATCH v2 2/2] iommu/vt-d: Enable PASID only if device expects PASID in PRG Response.

2019-02-13 Thread Tian, Kevin
> From: iommu-boun...@lists.linux-foundation.org [mailto:iommu-
> boun...@lists.linux-foundation.org] On Behalf Of
> sathyanarayanan.kuppusw...@linux.intel.com
> Sent: Tuesday, February 12, 2019 5:51 AM
> To: bhelg...@google.com; j...@8bytes.org; dw...@infradead.org
> Cc: Raj, Ashok ; linux-...@vger.kernel.org; linux-
> ker...@vger.kernel.org; Busch, Keith ;
> iommu@lists.linux-foundation.org; Pan, Jacob jun
> 
> Subject: [PATCH v2 2/2] iommu/vt-d: Enable PASID only if device expects
> PASID in PRG Response.
> 
> From: Kuppuswamy Sathyanarayanan
> 
> 
> In the Intel IOMMU, if the Page Request Queue (PRQ) is full, it will
> automatically respond to the device with a success message as a keep
> alive. When sending that success message, the IOMMU includes a PASID in
> the Response Message whenever the Page Request had a PASID in the
> Request Message; it does not check the device's PRG Response PASID
> requirement before sending the response. If the device receives a PRG
> response with a PASID when it is not expecting one, the device behavior
> is undefined. So enable PASID support only if the device expects a
> PASID in the PRG response message.
> 
> Cc: Ashok Raj 
> Cc: Jacob Pan 
> Cc: Keith Busch 
> Suggested-by: Ashok Raj 
> Signed-off-by: Kuppuswamy Sathyanarayanan
> 
> ---
>  drivers/iommu/intel-iommu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 1457f931218e..af2e4a011787 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -1399,7 +1399,8 @@ static void iommu_enable_dev_iotlb(struct
> device_domain_info *info)
>  undefined. So always enable PASID support on devices which
>  have it, even if we can't yet know if we're ever going to
>  use it. */
> - if (info->pasid_supported && !pci_enable_pasid(pdev, info-
> >pasid_supported & ~1))
> + if (info->pasid_supported && pci_prg_resp_pasid_required(pdev)
> &&
> + !pci_enable_pasid(pdev, info->pasid_supported & ~1))
>   info->pasid_enabled = 1;

The above logic looks problematic. As Dave commented in another thread,
PRI and PASID are orthogonal capabilities. Especially with the
introduction of VT-d scalable mode, PASID will be used alone even
without PRI...

Why not do the check when PRI is actually enabled? At that point you can
fail the request if the above condition is false, along the lines of the
sketch below.
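Concretely, a sketch of the quoted hunk reworked along those lines: PASID
enabling stays unconditional, and the PRG Response PASID requirement only
gates the PRI path. Illustrative only, not a tested patch:

	if (info->pasid_supported &&
	    !pci_enable_pasid(pdev, info->pasid_supported & ~1))
		info->pasid_enabled = 1;

	/* only enable PRI if a PASID-tagged response cannot confuse the device */
	if (info->pri_supported &&
	    (!info->pasid_enabled || pci_prg_resp_pasid_required(pdev)) &&
	    !pci_reset_pri(pdev) && !pci_enable_pri(pdev, 32))
		info->pri_enabled = 1;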

> 
>   if (info->pri_supported && !pci_reset_pri(pdev)
> && !pci_enable_pri(pdev, 32))
> --
> 2.20.1
> 


RE: [PATCH 1/1] iommu: Bind process address spaces to devices

2019-02-27 Thread Tian, Kevin
> From: Jacob Pan [mailto:jacob.jun@linux.intel.com]
> Sent: Thursday, February 28, 2019 5:41 AM
> 
> On Tue, 26 Feb 2019 12:17:43 +0100
> Joerg Roedel  wrote:
> 
> >
> > How about a 'struct iommu_sva' with an iommu-private definition that
> > is returned by this function:
> >
> > struct iommu_sva *iommu_sva_bind_device(struct device *dev,
> > struct mm_struct *mm);
> >
> Just trying to understand how to use this API.
> So if we bind the same mm to two different devices, we should get two
> different iommu_sva handle, right?
> I think intel-svm still needs a flag argument for supervisor pasid etc.
> Other than that, I think both interface should work for vt-d.
> 
> Another question is that for nested SVA, we will need to bind guest mm.
> Do you think we should try to reuse this or have it separate? I am
> working on a separate API for now.
> 

It has to be different. The host doesn't know the guest mm.

Also note that from a virtualization point of view we just focus on
'nested translation' on the host side. The 1st level may point to a guest
CPU page table (SVA) or an IOVA page table. In that sense, the API (as
currently defined in your series) is purely about setting up nested
translation for a VFIO-assigned device.
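For reference, a usage sketch of the interface discussed above: binding
the same mm to two devices yields two distinct handles. devA/devB are
placeholder device pointers, and a matching iommu_sva_unbind_device() is
assumed here:

	struct iommu_sva *h1, *h2;

	h1 = iommu_sva_bind_device(devA, current->mm);
	if (IS_ERR(h1))
		return PTR_ERR(h1);

	h2 = iommu_sva_bind_device(devB, current->mm);
	if (IS_ERR(h2)) {
		iommu_sva_unbind_device(h1);
		return PTR_ERR(h2);
	}

	/* two bonds, one per (device, mm) pair; the same PASID underneath
	   if the system uses a global PASID space */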

Thanks
Kevin


RE: [PATCH v2 1/3] iommu/virtio: Add topology description to virtio-iommu config space

2020-03-05 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Saturday, February 29, 2020 1:26 AM
> 
> Platforms without device-tree do not currently have a method for
> describing the vIOMMU topology. Provide a topology description embedded
> into the virtio device.
> 
> Use PCI FIXUP to probe the config space early, because we need to
> discover the topology before any DMA configuration takes place, and the
> virtio driver may be loaded much later. Since we discover the topology
> description when probing the PCI hierarchy, the virtual IOMMU cannot
> manage other platform devices discovered earlier.
> 
> This solution isn't elegant nor foolproof, but is the best we can do at

can you elaborate "isn't elegant nor foolproof" part? is there any other 
limitation (beside pci fixup) along the route, when comparing it to 
the ACPI-approach?

> the moment and works with existing virtio-iommu implementations. It also
> enables an IOMMU for lightweight hypervisors that do not rely on
> firmware methods for booting.
> 
> Signed-off-by: Eric Auger 
> Signed-off-by: Jean-Philippe Brucker 
> ---
>  MAINTAINERS   |   2 +
>  drivers/iommu/Kconfig |  10 +
>  drivers/iommu/Makefile|   1 +
>  drivers/iommu/virtio-iommu-topology.c | 343
> ++
>  drivers/iommu/virtio-iommu.c  |   3 +
>  include/linux/virt_iommu.h|  19 ++
>  include/uapi/linux/virtio_iommu.h |  26 ++
>  7 files changed, 404 insertions(+)
>  create mode 100644 drivers/iommu/virtio-iommu-topology.c
>  create mode 100644 include/linux/virt_iommu.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index fcd79fc38928..65a03ce53096 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -17781,6 +17781,8 @@ M:	Jean-Philippe Brucker <jean-phili...@linaro.org>
>  L:   virtualizat...@lists.linux-foundation.org
>  S:   Maintained
>  F:   drivers/iommu/virtio-iommu.c
> +F:   drivers/iommu/virtio-iommu-topology.c
> +F:   include/linux/virt_iommu.h
>  F:   include/uapi/linux/virtio_iommu.h
> 
>  VIRTUAL BOX GUEST DEVICE DRIVER
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index c5df570ef84a..f8cb45d84bb0 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -516,4 +516,14 @@ config VIRTIO_IOMMU
> 
> Say Y here if you intend to run this kernel as a guest.
> 
> +config VIRTIO_IOMMU_TOPOLOGY
> + bool "Topology properties for the virtio-iommu"
> + depends on VIRTIO_IOMMU
> + default y
> + help
> +   Enable early probing of the virtio-iommu device, to detect the
> +   built-in topology description.
> +
> +   Say Y here if you intend to run this kernel as a guest.
> +
>  endif # IOMMU_SUPPORT
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 9f33fdb3bb05..5da24280d08c 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -37,3 +37,4 @@ obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
>  obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o
>  obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
>  obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
> +obj-$(CONFIG_VIRTIO_IOMMU_TOPOLOGY) += virtio-iommu-topology.o
> diff --git a/drivers/iommu/virtio-iommu-topology.c b/drivers/iommu/virtio-
> iommu-topology.c
> new file mode 100644
> index ..2188624ef216
> --- /dev/null
> +++ b/drivers/iommu/virtio-iommu-topology.c
> @@ -0,0 +1,343 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +struct viommu_cap_config {
> + u8 bar;
> + u32 length; /* structure size */
> + u32 offset; /* structure offset within the bar */
> +};
> +
> +union viommu_topo_cfg {
> + __le16  type;
> + struct virtio_iommu_topo_pci_range  pci;
> + struct virtio_iommu_topo_endpoint   ep;
> +};
> +
> +struct viommu_spec {
> + struct device   *dev; /* transport device */
> + struct fwnode_handle*fwnode;
> + struct iommu_ops*ops;
> + struct list_headlist;
> + size_t  num_items;
> + /* The config array of length num_items follows */
> + union viommu_topo_cfg   cfg[];
> +};
> +
> +static LIST_HEAD(viommus);
> +static DEFINE_MUTEX(viommus_lock);
> +
> +#define VPCI_FIELD(field) offsetof(struct virtio_pci_cap, field)
> +
> +static inline int viommu_pci_find_capability(struct pci_dev *dev, u8 
> cfg_type,
> +  struct viommu_cap_config *cap)
> +{
> + int pos;
> + u8 bar;
> +
> + for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
> +  pos > 0;
> +  pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
> + u8 type;
> +
> + pci_read_config_byte(dev, pos + VPCI_FIELD(cfg_type), &type);
> + if (ty

RE: [PATCH v2 1/3] iommu/uapi: Define uapi version and capabilities

2020-03-26 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Friday, March 27, 2020 12:45 AM
> 
> Hi Christoph,
> 
> Thanks for the review. Please see my comments inline.
> 
> On Thu, 26 Mar 2020 02:23:16 -0700
> Christoph Hellwig  wrote:
> 
> > On Wed, Mar 25, 2020 at 04:17:05PM -0700, Jacob Pan wrote:
> > > Having a single UAPI version to govern the user-kernel data
> > > structures makes compatibility check straightforward. On the
> > > contrary, supporting combinations of multiple versions of the data
> > > can be a nightmare to maintain.
> > >
> > > This patch defines a unified UAPI version to be used for
> > > compatibility checks between user and kernel.
> >
> > This is bullshit.  Version numbers don't scale and we've avoided them
> > everywhere.  You need need flags specifying specific behavior.
> >
> We have flags or the equivalent in each UAPI structures. The flags
> are used for checking validity of extensions as well. And you are right
> we can use them for checking specific behavior.
> 
> So we have two options here:
> 1. Have a overall version for a quick compatibility check while
> starting a VM. If not compatible, we will stop guest SVA entirely.
> 
> 2. Let each API calls check its own capabilities/flags at runtime. It
> may fail later on and lead to more complex error handling.
> For example, guest starts with SVA support, allocate a PASID, bind
> guest PASID works great. Then when IO page fault happens, it try to do
> page request service and found out certain flags are not compatible.
> This case is more complex to handle.

If those API calls are interdependent when composing a feature (e.g.
SVA), shouldn't we have a way to check them together before exposing the
feature to the guest, e.g. through an iommu_get_uapi_capabilities
interface? A sketch of that idea follows.
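A sketch of what such a combined check could look like from the VMM side;
iommu_get_uapi_capabilities() (the name suggested above) and the
capability flags are hypothetical:

	__u32 caps = 0;

	/* one call reports every UAPI capability the kernel supports */
	if (iommu_get_uapi_capabilities(iommu, &caps))
		return -ENODEV;

	/* expose guest SVA only if everything it depends on is present */
	if (!(caps & IOMMU_UAPI_CAP_BIND_GPASID) ||
	    !(caps & IOMMU_UAPI_CAP_CACHE_INV) ||
	    !(caps & IOMMU_UAPI_CAP_PAGE_REQ))
		disable_guest_sva();	/* hypothetical VMM-side fallback */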

> 
> I am guessing your proposal is #2. right?
> 
> Overall, we don;t expect much change to the UAPI other than adding some
> vendor specific part of the union. Is the scalability concern based on
> frequent changes?
> 
> > > +#define IOMMU_UAPI_VERSION   1
> > > +static inline int iommu_get_uapi_version(void)
> > > +{
> > > + return IOMMU_UAPI_VERSION;
> > > +}
> >
> > Also inline functions like this in UAPI headers that actually get
> > included by userspace programs simply don't work.
> 
> I will remove that, user can just use IOMMU_UAPI_VERSION directly.


RE: [PATCH 01/10] iommu/ioasid: Introduce system-wide capacity

2020-03-27 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Thursday, March 26, 2020 1:55 AM
> 
> IOASID is a limited system-wide resource that can be allocated at
> runtime. This limitation can be enumerated during boot. For example, on
> x86 platforms, PCI Process Address Space ID (PASID) allocation uses
> IOASID service. The number of supported PASID bits are enumerated by
> extended capability register as defined in the VT-d spec.
> 
> This patch adds a helper to set the system capacity; it is expected to
> be set during boot, prior to any allocation request.
> 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/ioasid.c | 15 +++
>  include/linux/ioasid.h |  5 -
>  2 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 0f8dd377aada..4026e52855b9 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -17,6 +17,21 @@ struct ioasid_data {
>   struct rcu_head rcu;
>  };
> 
> +static ioasid_t ioasid_capacity;
> +static ioasid_t ioasid_capacity_avail;
> +
> +/* System capacity can only be set once */
> +void ioasid_install_capacity(ioasid_t total)
> +{
> + if (ioasid_capacity) {
> + pr_warn("IOASID capacity already set at %d\n",
> ioasid_capacity);
> + return;
> + }
> +
> + ioasid_capacity = ioasid_capacity_avail = total;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_install_capacity);
> +

From all the text in this patch, it looks like ioasid_set_capacity might
be a better name?

>  /*
>   * struct ioasid_allocator_data - Internal data structure to hold information
>   * about an allocator. There are two types of allocators:
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 6f000d7a0ddc..9711fa0dc357 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -40,7 +40,7 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
>  int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
>  void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
>  int ioasid_set_data(ioasid_t ioasid, void *data);
> -
> +void ioasid_install_capacity(ioasid_t total);
>  #else /* !CONFIG_IOASID */
>  static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
>   ioasid_t max, void *private)
> @@ -72,5 +72,8 @@ static inline int ioasid_set_data(ioasid_t ioasid, void
> *data)
>   return -ENOTSUPP;
>  }
> 
> +static inline void ioasid_install_capacity(ioasid_t total)
> +{
> +}
>  #endif /* CONFIG_IOASID */
>  #endif /* __LINUX_IOASID_H */
> --
> 2.7.4



RE: [PATCH 02/10] iommu/vt-d: Set IOASID capacity when SVM is enabled

2020-03-27 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Thursday, March 26, 2020 1:55 AM
> 
> Assign system-wide PASID capacity with enumerated max value.
> Currently, all Intel SVM capable devices should support full 20 bits of
> PASID value.
> 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/intel-iommu.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index a699a765c983..ec3fc121744a 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -3510,6 +3510,10 @@ static int __init init_dmars(void)
>   if (ret)
>   goto free_iommu;
> 
> + /* PASID is needed for scalable mode irrespective to SVM */
> + if (intel_iommu_sm)
> + ioasid_install_capacity(intel_pasid_max_id);
> +
>   /*
>* for each drhd
>*   enable fault log
> --
> 2.7.4

Reviewed-by: Kevin Tian 


RE: [PATCH 03/10] iommu/ioasid: Introduce per set allocation APIs

2020-03-27 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Thursday, March 26, 2020 1:55 AM
> 
> IOASID set defines a group of IDs that share the same token. The
> ioasid_set concept helps to do permission checking among users as in the
> current code.
> 
> With guest SVA usage, each VM has its own IOASID set. More
> functionalities are needed:
> 1. Enforce quota, each guest may be assigned limited quota such that one
> guest cannot abuse all the system resource.
> 2. Stores IOASID mapping between guest and host IOASIDs
> 3. Per set operations, e.g. free the entire set
> 
> For each ioasid_set token, a unique set ID is assigned. This makes
> reference of the set and data lookup much easier to implement.
> 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/ioasid.c | 147
> +
>  include/linux/ioasid.h |  13 +
>  2 files changed, 160 insertions(+)
> 
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 4026e52855b9..27ee57f7079b 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -10,6 +10,25 @@
>  #include 
>  #include 
> 
> +static DEFINE_XARRAY_ALLOC(ioasid_sets);
> +/**
> + * struct ioasid_set_data - Meta data about ioasid_set
> + *
> + * @token:   Unique to identify an IOASID set
> + * @xa:  XArray to store subset ID and IOASID mapping

What is a subset? Is it a different thing from a set?

> + * @size:Max number of IOASIDs can be allocated within the set

'size' reads more like 'current size' than 'max size'. Maybe call it
'max_ioasids' to align with 'nr_ioasids'? Or simplify both as 'max' and
'nr'?

> + * @nr_ioasids   Number of IOASIDs allocated in the set
> + * @sid  ID of the set
> + */
> +struct ioasid_set_data {
> + struct ioasid_set *token;
> + struct xarray xa;
> + int size;
> + int nr_ioasids;
> + int sid;
> + struct rcu_head rcu;
> +};
> +
>  struct ioasid_data {
>   ioasid_t id;
>   struct ioasid_set *set;
> @@ -388,6 +407,111 @@ void ioasid_free(ioasid_t ioasid)
>  EXPORT_SYMBOL_GPL(ioasid_free);
> 
>  /**
> + * ioasid_alloc_set - Allocate a set of IOASIDs

'a set of IOASIDs' sounds like 'many IOASIDs'. Just saying 'allocate an
IOASID set' is clearer. 😊

> + * @token:   Unique token of the IOASID set
> + * @quota:   Quota allowed in this set
> + * @sid: IOASID set ID to be assigned
> + *
> + * Return 0 upon success. Token will be stored internally for lookup,
> + * IOASID allocation within the set and other per set operations will use
> + * the @sid assigned.
> + *
> + */
> +int ioasid_alloc_set(struct ioasid_set *token, ioasid_t quota, int *sid)
> +{
> + struct ioasid_set_data *sdata;
> + ioasid_t id;
> + int ret = 0;
> +
> + if (quota > ioasid_capacity_avail) {
> + pr_warn("Out of IOASID capacity! ask %d, avail %d\n",
> + quota, ioasid_capacity_avail);
> + return -ENOSPC;
> + }
> +
> + sdata = kzalloc(sizeof(*sdata), GFP_KERNEL);
> + if (!sdata)
> + return -ENOMEM;
> +
> + spin_lock(&ioasid_allocator_lock);
> +
> + ret = xa_alloc(&ioasid_sets, &id, sdata,
> +XA_LIMIT(0, ioasid_capacity_avail - quota),
> +GFP_KERNEL);

Interestingly, I didn't find the definition of ioasid_sets at first; it
is not in an existing file.

I'm not sure how many sets can be created, but in any case the set
namespace is different from the IOASID namespace. Then why do we use the
IOASID capacity as the limit when allocating a set ID here? A sketch of
the alternative follows.
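If the set-ID namespace is meant to be independent of the IOASID space,
the limit would be a separate constant. A sketch with a hypothetical
IOASID_SET_MAX; GFP_ATOMIC because the quoted code allocates under
ioasid_allocator_lock:

#define IOASID_SET_MAX	1024	/* hypothetical, independent of PASID capacity */

	ret = xa_alloc(&ioasid_sets, &id, sdata,
		       XA_LIMIT(0, IOASID_SET_MAX - 1), GFP_ATOMIC);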

> + if (ret) {
> + kfree(sdata);
> + goto error;
> + }
> +
> + sdata->token = token;

Given that the token must be unique, isn't a check for conflicts required
here?

> + sdata->size = quota;
> + sdata->sid = id;
> +
> + /*
> +  * Set Xarray is used to store IDs within the set, get ready for
> +  * sub-set ID and system-wide IOASID allocation results.

Looks like 'subset' is the same thing as 'set' here; let's make it
consistent.

> +  */
> + xa_init_flags(&sdata->xa, XA_FLAGS_ALLOC);
> +
> + ioasid_capacity_avail -= quota;
> + *sid = id;
> +
> +error:
> + spin_unlock(&ioasid_allocator_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_alloc_set);
> +
> +/**
> + * ioasid_free_set - Free all IOASIDs within the set
> + *
> + * @sid: The IOASID set ID to be freed
> + * @destroy_set: Whether to keep the set for further allocation.
> + *   If true, the set will be destroyed.
> + *
> + * All IOASIDs allocated within the set will be freed upon return.
> + */
> +void ioasid_free_set(int sid, bool destroy_set)
> +{

What is the actual use case for freeing all IOASIDs while keeping the set
itself?

> + struct ioasid_set_data *sdata;
> + struct ioasid_data *entry;
> + unsigned long index;
> +
> + spin_lock(&ioasid_allocator_lock);
> + sdata = xa_load(&ioasid_sets, sid);
> + if (!sdata) {
> +   

RE: [PATCH 04/10] iommu/ioasid: Rename ioasid_set_data to avoid confusion with ioasid_set

2020-03-27 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Thursday, March 26, 2020 1:55 AM
> 
> IOASID set refers to a group of IOASIDs that shares the same token.
> ioasid_set_data() function is used to attach a private data to an IOASID,
> rename it to ioasid_attach_data() avoid being confused with the group/set
> concept.
> 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/intel-svm.c | 11 ++-
>  drivers/iommu/ioasid.c|  6 +++---
>  include/linux/ioasid.h|  4 ++--
>  3 files changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> index b6405df6cfb5..1991587fd3fd 100644
> --- a/drivers/iommu/intel-svm.c
> +++ b/drivers/iommu/intel-svm.c
> @@ -319,14 +319,15 @@ int intel_svm_bind_gpasid(struct iommu_domain
> *domain,
>   svm->gpasid = data->gpasid;
>   svm->flags |= SVM_FLAG_GUEST_PASID;
>   }
> - ioasid_set_data(data->hpasid, svm);
> +
> + ioasid_attach_data(data->hpasid, svm);
>   INIT_LIST_HEAD_RCU(&svm->devs);
>   mmput(svm->mm);
>   }
>   sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
>   if (!sdev) {
>   if (list_empty(&svm->devs)) {
> - ioasid_set_data(data->hpasid, NULL);
> + ioasid_attach_data(data->hpasid, NULL);
>   kfree(svm);
>   }
>   ret = -ENOMEM;
> @@ -346,7 +347,7 @@ int intel_svm_bind_gpasid(struct iommu_domain
> *domain,
>* was allocated in this function.
>*/
>   if (list_empty(&svm->devs)) {
> - ioasid_set_data(data->hpasid, NULL);
> + ioasid_attach_data(data->hpasid, NULL);
>   kfree(svm);
>   }
>   goto out;
> @@ -375,7 +376,7 @@ int intel_svm_bind_gpasid(struct iommu_domain
> *domain,
>*/
>   kfree(sdev);
>   if (list_empty(&svm->devs)) {
> - ioasid_set_data(data->hpasid, NULL);
> + ioasid_attach_data(data->hpasid, NULL);
>   kfree(svm);
>   }
>   goto out;
> @@ -438,7 +439,7 @@ int intel_svm_unbind_gpasid(struct device *dev, int
> pasid)
>* that PASID allocated by one guest cannot
> be
>* used by another.
>*/
> - ioasid_set_data(pasid, NULL);
> + ioasid_attach_data(pasid, NULL);
>   kfree(svm);
>   }
>   }
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 27ee57f7079b..6265d2dbbced 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -292,14 +292,14 @@ void ioasid_unregister_allocator(struct
> ioasid_allocator_ops *ops)
>  EXPORT_SYMBOL_GPL(ioasid_unregister_allocator);
> 
>  /**
> - * ioasid_set_data - Set private data for an allocated ioasid
> + * ioasid_attach_data - Set private data for an allocated ioasid
>   * @ioasid: the ID to set data
>   * @data:   the private data
>   *
>   * For IOASID that is already allocated, private data can be set
>   * via this API. Future lookup can be done via ioasid_find.
>   */
> -int ioasid_set_data(ioasid_t ioasid, void *data)
> +int ioasid_attach_data(ioasid_t ioasid, void *data)
>  {
>   struct ioasid_data *ioasid_data;
>   int ret = 0;
> @@ -321,7 +321,7 @@ int ioasid_set_data(ioasid_t ioasid, void *data)
> 
>   return ret;
>  }
> -EXPORT_SYMBOL_GPL(ioasid_set_data);
> +EXPORT_SYMBOL_GPL(ioasid_attach_data);
> 
>  /**
>   * ioasid_alloc - Allocate an IOASID
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index be158e03c034..8c82d2625671 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -39,7 +39,7 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> bool (*getter)(void *));
>  int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
>  void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> -int ioasid_set_data(ioasid_t ioasid, void *data);
> +int ioasid_attach_data(ioasid_t ioasid, void *data);
>  void ioasid_install_capacity(ioasid_t total);
>  int ioasid_alloc_set(struct ioasid_set *token, ioasid_t quota, int *sid);
>  void ioasid_free_set(int sid, bool destroy_set);
> @@ -79,7 +79,7 @@ static inline void ioasid_unregister_allocator(struct
> ioasid_allocator_ops *allo
>  {
>  }
> 
> -static inline int ioasid_set_data(ioasid_t ioasid, void *data)
> +static inline int ioasid_attach_data(ioasid_t ioasid, void *data)
>  {
>   return -ENOTSUPP;
>  }
> --
> 2.7.4

Reviewed-by: Kevin Tian 



RE: [PATCH 05/10] iommu/ioasid: Create an IOASID set for host SVA use

2020-03-27 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Thursday, March 26, 2020 1:55 AM
> 
> Bare metal SVA allocates IOASIDs for native process addresses. This
> should be separated from VM allocated IOASIDs thus under its own set.

A curious question: now that bare metal SVA uses the system set and guest
SVA uses a dynamically-created set, do we still allow ioasid_alloc()
with a NULL set pointer?

> 
> This patch creates a system IOASID set with its quota set to PID_MAX.
> This is a reasonable default in that SVM capable devices can only bind
> to limited user processes.
> 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/intel-iommu.c | 8 +++-
>  drivers/iommu/ioasid.c  | 9 +
>  include/linux/ioasid.h  | 9 +
>  3 files changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index ec3fc121744a..af7a1ef7b31e 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -3511,8 +3511,14 @@ static int __init init_dmars(void)
>   goto free_iommu;
> 
>   /* PASID is needed for scalable mode irrespective to SVM */
> - if (intel_iommu_sm)
> + if (intel_iommu_sm) {
>   ioasid_install_capacity(intel_pasid_max_id);
> + /* We should not run out of IOASIDs at boot */
> + if (ioasid_alloc_system_set(PID_MAX_DEFAULT)) {
> + pr_err("Failed to enable host PASID allocator\n");
> + intel_iommu_sm = 0;
> + }
> + }
> 
>   /*
>* for each drhd
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 6265d2dbbced..9135af171a7c 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -39,6 +39,9 @@ struct ioasid_data {
>  static ioasid_t ioasid_capacity;
>  static ioasid_t ioasid_capacity_avail;
> 
> +int system_ioasid_sid;
> +static DECLARE_IOASID_SET(system_ioasid);
> +
>  /* System capacity can only be set once */
>  void ioasid_install_capacity(ioasid_t total)
>  {
> @@ -51,6 +54,12 @@ void ioasid_install_capacity(ioasid_t total)
>  }
>  EXPORT_SYMBOL_GPL(ioasid_install_capacity);
> 
> +int ioasid_alloc_system_set(int quota)
> +{
> + return ioasid_alloc_set(&system_ioasid, quota, &system_ioasid_sid);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_alloc_system_set);
> +
>  /*
>   * struct ioasid_allocator_data - Internal data structure to hold information
>   * about an allocator. There are two types of allocators:
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 8c82d2625671..097b1cc043a3 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -29,6 +29,9 @@ struct ioasid_allocator_ops {
>   void *pdata;
>  };
> 
> +/* Shared IOASID set reserved for host system use */
> +extern int system_ioasid_sid;
> +
>  #define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
> 
>  #if IS_ENABLED(CONFIG_IOASID)
> @@ -41,6 +44,7 @@ int ioasid_register_allocator(struct ioasid_allocator_ops
> *allocator);
>  void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
>  int ioasid_attach_data(ioasid_t ioasid, void *data);
>  void ioasid_install_capacity(ioasid_t total);
> +int ioasid_alloc_system_set(int quota);
>  int ioasid_alloc_set(struct ioasid_set *token, ioasid_t quota, int *sid);
>  void ioasid_free_set(int sid, bool destroy_set);
>  int ioasid_find_sid(ioasid_t ioasid);
> @@ -88,5 +92,10 @@ static inline void ioasid_install_capacity(ioasid_t total)
>  {
>  }
> 
> +static inline int ioasid_alloc_system_set(int quota)
> +{
> + return -ENOTSUPP;
> +}
> +
>  #endif /* CONFIG_IOASID */
>  #endif /* __LINUX_IOASID_H */
> --
> 2.7.4



RE: [PATCH 06/10] iommu/ioasid: Convert to set aware allocations

2020-03-27 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Thursday, March 26, 2020 1:55 AM
> 
> The current ioasid_alloc function takes a token/ioasid_set then record it
> on the IOASID being allocated. There is no alloc/free on the ioasid_set.
> 
> With the IOASID set APIs, callers must allocate an ioasid_set before
> allocate IOASIDs within the set. Quota and other ioasid_set level
> activities can then be enforced.
> 
> This patch converts existing API to the new ioasid_set model.
> 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/intel-iommu.c | 10 +++---
>  drivers/iommu/intel-svm.c   | 10 +++---
>  drivers/iommu/ioasid.c  | 78 +----
>  include/linux/ioasid.h  | 11 +++
>  4 files changed, 72 insertions(+), 37 deletions(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index af7a1ef7b31e..c571cc8d9e57 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -3323,11 +3323,11 @@ static void intel_ioasid_free(ioasid_t ioasid, void
> *data)
>   if (!iommu)
>   return;
>   /*
> -  * Sanity check the ioasid owner is done at upper layer, e.g. VFIO
> -  * We can only free the PASID when all the devices are unbound.
> +  * In the guest, all IOASIDs belong to the system_ioasid set.
> +  * Sanity check against the system set.

The code below has nothing to do with the guest, so why put a comment
that is specifically about the guest?

>*/
> - if (ioasid_find(NULL, ioasid, NULL)) {
> - pr_alert("Cannot free active IOASID %d\n", ioasid);
> + if (IS_ERR(ioasid_find(system_ioasid_sid, ioasid, NULL))) {
> + pr_err("Cannot free IOASID %d, not in system set\n", ioasid);
>   return;
>   }
>   vcmd_free_pasid(iommu, ioasid);
> @@ -5541,7 +5541,7 @@ static int aux_domain_add_dev(struct
> dmar_domain *domain,
>   int pasid;
> 
>   /* No private data needed for the default pasid */
> - pasid = ioasid_alloc(NULL, PASID_MIN,
> + pasid = ioasid_alloc(system_ioasid_sid, PASID_MIN,
>pci_max_pasids(to_pci_dev(dev)) - 1,
>NULL);
>   if (pasid == INVALID_IOASID) {
> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> index 1991587fd3fd..f511855d187b 100644
> --- a/drivers/iommu/intel-svm.c
> +++ b/drivers/iommu/intel-svm.c
> @@ -268,7 +268,7 @@ int intel_svm_bind_gpasid(struct iommu_domain
> *domain,
>   }
> 
>   mutex_lock(&pasid_mutex);
> - svm = ioasid_find(NULL, data->hpasid, NULL);
> + svm = ioasid_find(INVALID_IOASID_SET, data->hpasid, NULL);
>   if (IS_ERR(svm)) {
>   ret = PTR_ERR(svm);
>   goto out;
> @@ -401,7 +401,7 @@ int intel_svm_unbind_gpasid(struct device *dev, int
> pasid)
>   return -EINVAL;
> 
>   mutex_lock(&pasid_mutex);
> - svm = ioasid_find(NULL, pasid, NULL);
> + svm = ioasid_find(INVALID_IOASID_SET, pasid, NULL);
>   if (!svm) {
>   ret = -EINVAL;
>   goto out;
> @@ -559,7 +559,7 @@ static int intel_svm_bind_mm(struct device *dev, int
> flags, struct svm_dev_ops *
>   pasid_max = intel_pasid_max_id;
> 
>   /* Do not use PASID 0, reserved for RID to PASID */
> - svm->pasid = ioasid_alloc(NULL, PASID_MIN,
> + svm->pasid = ioasid_alloc(system_ioasid_sid, PASID_MIN,
> pasid_max - 1, svm);
>   if (svm->pasid == INVALID_IOASID) {
>   kfree(svm);
> @@ -642,7 +642,7 @@ int intel_svm_unbind_mm(struct device *dev, int
> pasid)
>   if (!iommu)
>   goto out;
> 
> - svm = ioasid_find(NULL, pasid, NULL);
> + svm = ioasid_find(system_ioasid_sid, pasid, NULL);
>   if (!svm)
>   goto out;
> 
> @@ -778,7 +778,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
> 
>   if (!svm || svm->pasid != req->pasid) {
>   rcu_read_lock();
> - svm = ioasid_find(NULL, req->pasid, NULL);
> + svm = ioasid_find(INVALID_IOASID_SET, req->pasid,
> NULL);

Is there a criterion for when INVALID_IOASID_SET should be used?
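If there is one, maybe capture it in the header, e.g. (my wording, to be
adjusted as appropriate):

/*
 * Pass INVALID_IOASID_SET to ioasid_find() only when the owning set
 * cannot be known up front (e.g. when handling a page request where
 * only the PASID is available); otherwise pass the owning set ID so
 * that the ownership check is not bypassed.
 */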

>   /* It *can't* go away, because the driver is not
> permitted
>* to unbind the mm while any page faults are
> outstanding.
>* So we only need RCU to protect the internal idr
> code. */
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 9135af171a7c..f89a595f6978 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -31,7 +31,7 @@ struct ioasid_set_data {
> 
>  struct ioasid_data {
>   ioasid_t id;
> - struct ioasid_set *set;
> + struct ioasid_set_data *sdata;
>   void *private;
>   struct rcu_head rcu;
>  };
> @@ -334,7 +334,7 @@ E

RE: [PATCH 07/10] iommu/ioasid: Use mutex instead of spinlock

2020-03-27 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Thursday, March 26, 2020 1:55 AM
> 
> Each IOASID or set could have multiple users with its own HW context
> to maintain. Often times access to the HW context requires thread context.
> For example, consumers of IOASIDs can register notification blocks to
> sync up its states. Having an atomic notifier is not feasible for these
> update operations.
> 
> This patch converts allocator lock from spinlock to mutex in preparation
> for IOASID notifier.
> 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/ioasid.c | 45 +++--
>  1 file changed, 23 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index f89a595f6978..8612fe6477dc 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -98,7 +98,7 @@ struct ioasid_allocator_data {
>   struct rcu_head rcu;
>  };
> 
> -static DEFINE_SPINLOCK(ioasid_allocator_lock);
> +static DEFINE_MUTEX(ioasid_allocator_lock);
>  static LIST_HEAD(allocators_list);
> 
>  static ioasid_t default_alloc(ioasid_t min, ioasid_t max, void *opaque);
> @@ -121,7 +121,7 @@ static ioasid_t default_alloc(ioasid_t min, ioasid_t
> max, void *opaque)
>  {
>   ioasid_t id;
> 
> - if (xa_alloc(&default_allocator.xa, &id, opaque, XA_LIMIT(min, max),
> GFP_ATOMIC)) {
> + if (xa_alloc(&default_allocator.xa, &id, opaque, XA_LIMIT(min, max),
> GFP_KERNEL)) {
>   pr_err("Failed to alloc ioasid from %d to %d\n", min, max);
>   return INVALID_IOASID;
>   }
> @@ -142,7 +142,7 @@ static struct ioasid_allocator_data
> *ioasid_alloc_allocator(struct ioasid_alloca
>  {
>   struct ioasid_allocator_data *ia_data;
> 
> - ia_data = kzalloc(sizeof(*ia_data), GFP_ATOMIC);
> + ia_data = kzalloc(sizeof(*ia_data), GFP_KERNEL);
>   if (!ia_data)
>   return NULL;
> 
> @@ -184,7 +184,7 @@ int ioasid_register_allocator(struct
> ioasid_allocator_ops *ops)
>   struct ioasid_allocator_data *pallocator;
>   int ret = 0;
> 
> - spin_lock(&ioasid_allocator_lock);
> + mutex_lock(&ioasid_allocator_lock);
> 
>   ia_data = ioasid_alloc_allocator(ops);
>   if (!ia_data) {
> @@ -228,12 +228,12 @@ int ioasid_register_allocator(struct
> ioasid_allocator_ops *ops)
>   }
>   list_add_tail(&ia_data->list, &allocators_list);
> 
> - spin_unlock(&ioasid_allocator_lock);
> + mutex_unlock(&ioasid_allocator_lock);
>   return 0;
>  out_free:
>   kfree(ia_data);
>  out_unlock:
> - spin_unlock(&ioasid_allocator_lock);
> + mutex_unlock(&ioasid_allocator_lock);
>   return ret;
>  }
>  EXPORT_SYMBOL_GPL(ioasid_register_allocator);
> @@ -251,7 +251,7 @@ void ioasid_unregister_allocator(struct
> ioasid_allocator_ops *ops)
>   struct ioasid_allocator_data *pallocator;
>   struct ioasid_allocator_ops *sops;
> 
> - spin_lock(&ioasid_allocator_lock);
> + mutex_lock(&ioasid_allocator_lock);
>   if (list_empty(&allocators_list)) {
>   pr_warn("No custom IOASID allocators active!\n");
>   goto exit_unlock;
> @@ -296,7 +296,7 @@ void ioasid_unregister_allocator(struct
> ioasid_allocator_ops *ops)
>   }
> 
>  exit_unlock:
> - spin_unlock(&ioasid_allocator_lock);
> + mutex_unlock(&ioasid_allocator_lock);
>  }
>  EXPORT_SYMBOL_GPL(ioasid_unregister_allocator);
> 
> @@ -313,13 +313,13 @@ int ioasid_attach_data(ioasid_t ioasid, void *data)
>   struct ioasid_data *ioasid_data;
>   int ret = 0;
> 
> - spin_lock(&ioasid_allocator_lock);
> + mutex_lock(&ioasid_allocator_lock);
>   ioasid_data = xa_load(&active_allocator->xa, ioasid);
>   if (ioasid_data)
>   rcu_assign_pointer(ioasid_data->private, data);
>   else
>   ret = -ENOENT;
> - spin_unlock(&ioasid_allocator_lock);
> + mutex_unlock(&ioasid_allocator_lock);
> 
>   /*
>* Wait for readers to stop accessing the old private data, so the
> @@ -374,7 +374,7 @@ ioasid_t ioasid_alloc(int sid, ioasid_t min, ioasid_t
> max, void *private)
>* Custom allocator needs allocator data to perform platform specific
>* operations.
>*/
> - spin_lock(&ioasid_allocator_lock);
> + mutex_lock(&ioasid_allocator_lock);
>   adata = active_allocator->flags & IOASID_ALLOCATOR_CUSTOM ?
> active_allocator->ops->pdata : data;
>   id = active_allocator->ops->alloc(min, max, adata);
>   if (id == INVALID_IOASID) {
> @@ -383,7 +383,7 @@ ioasid_t ioasid_alloc(int sid, ioasid_t min, ioasid_t
> max, void *private)
>   }
> 
>   if ((active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) &&
> -  xa_alloc(&active_allocator->xa, &id, data, XA_LIMIT(id, id),
> GFP_ATOMIC)) {
> +  xa_alloc(&active_allocator->xa, &id, data, XA_LIMIT(id, id),
> GFP_KERNEL)) {
>   /* Custom allocator needs framework to store and track
> allocation results */
>   pr_err("Failed to alloc ioasid from %d\n",

RE: [PATCH 08/10] iommu/ioasid: Introduce notifier APIs

2020-03-27 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Thursday, March 26, 2020 1:55 AM
> 
> IOASID users fit into the publisher-subscriber pattern, a system wide
> blocking notifier chain can be used to inform subscribers of state
> changes. Notifier mechanism also abstracts publisher from knowing the
> private context each subscriber may have.
> 
> This patch adds APIs and a global notifier chain, a further optimization
> might be per set notifier for ioasid_set aware users.
> 
> Usage example:
> KVM register notifier block such that it can keep its guest-host PASID
> translation table in sync with any IOASID updates.
> 
> VFIO publish IOASID change by performing alloc/free, bind/unbind
> operations.
> 
> IOMMU driver gets notified when IOASID is freed by VFIO or core mm code
> such that PASID context can be cleaned up.

The above example looks mixed: you have KVM register the notifier, but
in the end it is the IOMMU driver that gets notified... 😊

> 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/ioasid.c | 61
> ++
>  include/linux/ioasid.h | 40 +
>  2 files changed, 101 insertions(+)
> 
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 8612fe6477dc..27dce2cb5af2 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -11,6 +11,22 @@
>  #include 
> 
>  static DEFINE_XARRAY_ALLOC(ioasid_sets);
> +/*
> + * An IOASID could have multiple consumers. When a status change occurs,
> + * this notifier chain is used to keep them in sync. Each consumer of the
> + * IOASID service must register notifier block early to ensure no events
> + * are missed.
> + *
> + * This is a publisher-subscriber pattern where publisher can change the
> + * state of each IOASID, e.g. alloc/free, bind IOASID to a device and mm.
> + * On the other hand, subscribers gets notified for the state change and
> + * keep local states in sync.
> + *
> + * Currently, the notifier is global. A further optimization could be per
> + * IOASID set notifier chain.
> + */
> +static BLOCKING_NOTIFIER_HEAD(ioasid_chain);
> +
>  /**
>   * struct ioasid_set_data - Meta data about ioasid_set
>   *
> @@ -408,6 +424,7 @@ static void ioasid_free_locked(ioasid_t ioasid)
>  {
>   struct ioasid_data *ioasid_data;
>   struct ioasid_set_data *sdata;
> + struct ioasid_nb_args args;
> 
>   ioasid_data = xa_load(&active_allocator->xa, ioasid);
>   if (!ioasid_data) {
> @@ -415,6 +432,13 @@ static void ioasid_free_locked(ioasid_t ioasid)
>   return;
>   }
> 
> + args.id = ioasid;
> + args.sid = ioasid_data->sdata->sid;
> + args.pdata = ioasid_data->private;
> + args.set_token = ioasid_data->sdata->token;
> +
> + /* Notify all users that this IOASID is being freed */
> + blocking_notifier_call_chain(&ioasid_chain, IOASID_FREE, &args);
>   active_allocator->ops->free(ioasid, active_allocator->ops->pdata);
>   /* Custom allocator needs additional steps to free the xa element */
>   if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
> @@ -624,6 +648,43 @@ int ioasid_find_sid(ioasid_t ioasid)
>  }
>  EXPORT_SYMBOL_GPL(ioasid_find_sid);
> 
> +int ioasid_add_notifier(struct notifier_block *nb)
> +{
> + return blocking_notifier_chain_register(&ioasid_chain, nb);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_add_notifier);
> +
> +void ioasid_remove_notifier(struct notifier_block *nb)
> +{
> + blocking_notifier_chain_unregister(&ioasid_chain, nb);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_remove_notifier);

These should probably be named register/unregister.
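i.e. rename them to match the naming of the underlying notifier-chain
API:

int ioasid_register_notifier(struct notifier_block *nb);
void ioasid_unregister_notifier(struct notifier_block *nb);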

> +
> +int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd)

Could you add a comment on when this function should be used?
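e.g. (just a strawman, wording up to you):

/*
 * ioasid_notify - Publish a state change on an IOASID.
 *
 * Called by a publisher (e.g. VFIO) for events other than alloc/free,
 * such as bind/unbind, so that subscribers can keep their per-IOASID
 * state in sync.
 */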

> +{
> + struct ioasid_data *ioasid_data;
> + struct ioasid_nb_args args;
> + int ret = 0;
> +
> + mutex_lock(&ioasid_allocator_lock);
> + ioasid_data = xa_load(&active_allocator->xa, ioasid);
> + if (!ioasid_data) {
> + pr_err("Trying to free unknown IOASID %u\n", ioasid);

Why is the message fixed to 'free'? ioasid_notify() can carry commands
other than free.
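e.g. a neutral message would fit every command:

	pr_err("Trying to notify unknown IOASID %u\n", ioasid);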

> + mutex_unlock(&ioasid_allocator_lock);
> + return -EINVAL;
> + }
> +
> + args.id = ioasid;
> + args.sid = ioasid_data->sdata->sid;
> + args.pdata = ioasid_data->private;

Why is there no token info here, as is done in ioasid_free()?
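i.e. I would expect this here as well:

	args.set_token = ioasid_data->sdata->token;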

> +
> + ret = blocking_notifier_call_chain(&ioasid_chain, cmd, &args);
> + mutex_unlock(&ioasid_allocator_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_notify);
> +
>  MODULE_AUTHOR("Jean-Philippe Brucker  philippe.bruc...@arm.com>");
>  MODULE_AUTHOR("Jacob Pan ");
>  MODULE_DESCRIPTION("IO Address Space ID (IOASID) allocator");
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index e19c0ad93bd7..32d032913828 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -4,6 +4,7 @@
> 
>  #include 
>  #include 
> +#include 
> 
>  #define INVALID_IOASID ((ioasid_t)-1)
>  #define INVALID_IOASID_SET (-1)
> @@ -30,6 +31,27 @@ str

RE: [PATCH 09/10] iommu/ioasid: Support ioasid_set quota adjustment

2020-03-27 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Thursday, March 26, 2020 1:56 AM
> 
> IOASID set is allocated with an initial quota, at runtime there may be
> needs to balance IOASID resources among different VMs/sets.
> 

I may have overlooked previous patches, but I didn't see any place
setting the initial quota...

> This patch adds a new API to adjust per set quota.

Since this is purely an internal kernel API, does this imply that the
publisher (e.g. VFIO) is responsible for exposing its own uAPI to set
the quota?

> 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/ioasid.c | 44
> 
>  include/linux/ioasid.h |  6 ++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 27dce2cb5af2..5ac28862a1db 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -578,6 +578,50 @@ void ioasid_free_set(int sid, bool destroy_set)
>  }
>  EXPORT_SYMBOL_GPL(ioasid_free_set);
> 
> +/**
> + * ioasid_adjust_set - Adjust the quota of an IOASID set
> + * @quota:   Quota allowed in this set
> + * @sid: IOASID set ID to be assigned
> + *
> + * Return 0 on success. If the new quota is smaller than the number of
> + * IOASIDs already allocated, -EINVAL will be returned. No change will be
> + * made to the existing quota.
> + */
> +int ioasid_adjust_set(int sid, int quota)
> +{
> + struct ioasid_set_data *sdata;
> + int ret = 0;
> +
> + mutex_lock(&ioasid_allocator_lock);
> + sdata = xa_load(&ioasid_sets, sid);
> + if (!sdata || sdata->nr_ioasids > quota) {
> + pr_err("Failed to adjust IOASID set %d quota %d\n",
> + sid, quota);
> + ret = -EINVAL;
> + goto done_unlock;
> + }
> +
> + if (quota >= ioasid_capacity_avail) {
> + ret = -ENOSPC;
> + goto done_unlock;
> + }
> +
> + /* Return the delta back to system pool */
> + ioasid_capacity_avail += sdata->size - quota;
> +
> + /*
> +  * May have a policy to prevent giving all available IOASIDs
> +  * to one set. But we don't enforce here, it should be in the
> +  * upper layers.
> +  */
> + sdata->size = quota;
> +
> +done_unlock:
> + mutex_unlock(&ioasid_allocator_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_adjust_set);
> 
>  /**
>   * ioasid_find - Find IOASID data
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 32d032913828..6e7de6fb91bf 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -73,6 +73,7 @@ int ioasid_alloc_set(struct ioasid_set *token, ioasid_t
> quota, int *sid);
>  void ioasid_free_set(int sid, bool destroy_set);
>  int ioasid_find_sid(ioasid_t ioasid);
>  int ioasid_notify(ioasid_t id, enum ioasid_notify_val cmd);
> +int ioasid_adjust_set(int sid, int quota);
> 
>  #else /* !CONFIG_IOASID */
>  static inline ioasid_t ioasid_alloc(int sid, ioasid_t min,
> @@ -136,5 +137,10 @@ static inline int ioasid_alloc_system_set(int quota)
>   return -ENOTSUPP;
>  }
> 
> +static inline int ioasid_adjust_set(int sid, int quota)
> +{
> + return -ENOTSUPP;
> +}
> +
>  #endif /* CONFIG_IOASID */
>  #endif /* __LINUX_IOASID_H */
> --
> 2.7.4



RE: [PATCH 10/10] iommu/vt-d: Register PASID notifier for status change

2020-03-27 Thread Tian, Kevin
> From: Jacob Pan
> Sent: Thursday, March 26, 2020 1:56 AM
> 
> In bare metal SVA, IOMMU driver ensures that IOASID free call always comes
> after IOASID unbind operation.
> 
> However, for guest SVA the unbind and free call come from user space
> via VFIO, which could be out of order. This patch registers a notifier
> block in case IOASID free() comes before unbind such that VT-d driver
> can take action to clean up PASID context and data.

Clearly the patch covers more than the above usage: it also handles the
bind case, notifying KVM so it can set up the PASID translation table.

> 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/intel-svm.c   | 68
> -
>  include/linux/intel-iommu.h |  1 +
>  2 files changed, 68 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> index f511855d187b..779dd2c6f9e1 100644
> --- a/drivers/iommu/intel-svm.c
> +++ b/drivers/iommu/intel-svm.c
> @@ -23,6 +23,7 @@
>  #include "intel-pasid.h"
> 
>  static irqreturn_t prq_event_thread(int irq, void *d);
> +static DEFINE_MUTEX(pasid_mutex);
> 
>  #define PRQ_ORDER 0
> 
> @@ -92,6 +93,65 @@ static inline bool intel_svm_capable(struct
> intel_iommu *iommu)
>   return iommu->flags & VTD_FLAG_SVM_CAPABLE;
>  }
> 
> +#define pasid_lock_held() lock_is_held(&pasid_mutex.dep_map)
> +
> +static int pasid_status_change(struct notifier_block *nb,
> + unsigned long code, void *data)
> +{
> + struct ioasid_nb_args *args = (struct ioasid_nb_args *)data;
> + struct intel_svm_dev *sdev;
> + struct intel_svm *svm;
> + int ret = NOTIFY_DONE;
> +
> + if (code == IOASID_FREE) {
> + /*
> +  * Unbind all devices associated with this PASID which is
> +  * being freed by other users such as VFIO.
> +  */
> + mutex_lock(&pasid_mutex);
> + svm = ioasid_find(INVALID_IOASID_SET, args->id, NULL);
> + if (!svm || !svm->iommu)
> + goto done_unlock;

Should we treat !svm->iommu as an error condition? If not, do you have
an example of when it may occur in a normal situation?

> +
> + if (IS_ERR(svm)) {
> + ret = NOTIFY_BAD;
> + goto done_unlock;
> + }

svm->iommu should only be referenced after the IS_ERR() check.
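i.e. roughly (untested):

	svm = ioasid_find(INVALID_IOASID_SET, args->id, NULL);
	if (IS_ERR(svm)) {
		ret = NOTIFY_BAD;
		goto done_unlock;
	}
	if (!svm || !svm->iommu)
		goto done_unlock;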

> +
> + list_for_each_entry_rcu(sdev, &svm->devs, list,
> pasid_lock_held()) {
> + /* Does not poison forward pointer */
> + list_del_rcu(&sdev->list);
> + intel_pasid_tear_down_entry(svm->iommu, sdev-
> >dev,
> + svm->pasid);
> + kfree_rcu(sdev, rcu);
> +
> + /*
> +  * Free before unbind only happens with guest
> usaged
> +  * host PASIDs. IOASID free will detach private data
> +  * and free the IOASID entry.

"guest usaged host PASIDs"?

> +  */
> + if (list_empty(&svm->devs))
> + kfree(svm);
> + }
> + mutex_unlock(&pasid_mutex);
> +
> + return NOTIFY_OK;
> + }
> +
> +done_unlock:
> + mutex_unlock(&pasid_mutex);
> + return ret;
> +}
> +
> +static struct notifier_block pasid_nb = {
> + .notifier_call = pasid_status_change,
> +};
> +
> +void intel_svm_add_pasid_notifier(void)
> +{
> + ioasid_add_notifier(&pasid_nb);
> +}
> +
>  void intel_svm_check(struct intel_iommu *iommu)
>  {
>   if (!pasid_supported(iommu))
> @@ -219,7 +279,6 @@ static const struct mmu_notifier_ops intel_mmuops =
> {
>   .invalidate_range = intel_invalidate_range,
>  };
> 
> -static DEFINE_MUTEX(pasid_mutex);
>  static LIST_HEAD(global_svm_list);
> 
>  #define for_each_svm_dev(sdev, svm, d)   \
> @@ -319,6 +378,7 @@ int intel_svm_bind_gpasid(struct iommu_domain
> *domain,
>   svm->gpasid = data->gpasid;
>   svm->flags |= SVM_FLAG_GUEST_PASID;
>   }
> + svm->iommu = iommu;

Ah, it's interesting to see that this field was defined before but never
set until now. 😊

> 
>   ioasid_attach_data(data->hpasid, svm);
>   INIT_LIST_HEAD_RCU(&svm->devs);
> @@ -383,6 +443,11 @@ int intel_svm_bind_gpasid(struct iommu_domain
> *domain,
>   }
>   svm->flags |= SVM_FLAG_GUEST_MODE;
> 
> + /*
> +  * Notify KVM new host-guest PASID bind is ready. KVM will set up
> +  * PASID translation table to support guest ENQCMD.
> +  */

The comment should present this as one example rather than the only
possible usage.

> + ioasid_notify(data->hpasid, IOASID_BIND);
>   init_rcu_head(&sdev->rcu);
>   list_add_rcu(&sdev->list, &svm->devs);
>   out:
> @@ -440,6 +505,7 @@ int intel_svm_unbind_gpasid(struct device *dev, int
> pasid)
>

RE: [PATCH V10 01/11] iommu/vt-d: Move domain helper to header

2020-03-27 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Saturday, March 21, 2020 7:28 AM
> 
> Move domain helper to header to be used by SVA code.
> 
> Signed-off-by: Jacob Pan 
> Reviewed-by: Eric Auger 
> ---
>  drivers/iommu/intel-iommu.c | 6 --
>  include/linux/intel-iommu.h | 6 ++
>  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 4be549478691..e599b2537b1c 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -446,12 +446,6 @@ static void init_translation_status(struct
> intel_iommu *iommu)
>   iommu->flags |= VTD_FLAG_TRANS_PRE_ENABLED;
>  }
> 
> -/* Convert generic 'struct iommu_domain to private struct dmar_domain */
> -static struct dmar_domain *to_dmar_domain(struct iommu_domain *dom)
> -{
> - return container_of(dom, struct dmar_domain, domain);
> -}
> -
>  static int __init intel_iommu_setup(char *str)
>  {
>   if (!str)
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 980234ae0312..ed7171d2ae1f 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -595,6 +595,12 @@ static inline void __iommu_flush_cache(
>   clflush_cache_range(addr, size);
>  }
> 
> +/* Convert generic struct iommu_domain to private struct dmar_domain */
> +static inline struct dmar_domain *to_dmar_domain(struct iommu_domain
> *dom)
> +{
> + return container_of(dom, struct dmar_domain, domain);
> +}
> +
>  /*
>   * 0: readable
>   * 1: writable
> --
> 2.7.4

Reviewed-by: Kevin Tian 


RE: [PATCH V10 02/11] iommu/uapi: Define a mask for bind data

2020-03-27 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Saturday, March 21, 2020 7:28 AM
> 
> Memory type related flags can be grouped together for one simple check.
> 
> ---
> v9 renamed from EMT to MTS since these are memory type support flags.
> ---
> 
> Signed-off-by: Jacob Pan 
> ---
>  include/uapi/linux/iommu.h | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 4ad3496e5c43..d7bcbc5f79b0 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -284,7 +284,10 @@ struct iommu_gpasid_bind_data_vtd {
>   __u32 pat;
>   __u32 emt;
>  };
> -
> +#define IOMMU_SVA_VTD_GPASID_MTS_MASK	(IOMMU_SVA_VTD_GPASID_CD | \
> +					 IOMMU_SVA_VTD_GPASID_EMTE | \
> +					 IOMMU_SVA_VTD_GPASID_PCD |  \
> +					 IOMMU_SVA_VTD_GPASID_PWT)
>  /**
>   * struct iommu_gpasid_bind_data - Information about device and guest
> PASID binding
>   * @version: Version of this data structure
> --
> 2.7.4

Reviewed-by: Kevin Tian 


RE: [PATCH V10 03/11] iommu/vt-d: Add a helper function to skip agaw

2020-03-27 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Saturday, March 21, 2020 7:28 AM
> 
> Signed-off-by: Jacob Pan 

Could you elaborate on the scenario in which this helper function is
required?
 
> ---
>  drivers/iommu/intel-pasid.c | 22 ++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> index 22b30f10b396..191508c7c03e 100644
> --- a/drivers/iommu/intel-pasid.c
> +++ b/drivers/iommu/intel-pasid.c
> @@ -500,6 +500,28 @@ int intel_pasid_setup_first_level(struct intel_iommu
> *iommu,
>  }
> 
>  /*
> + * Skip top levels of page tables for iommu which has less agaw
> + * than default. Unnecessary for PT mode.
> + */
> +static inline int iommu_skip_agaw(struct dmar_domain *domain,
> +   struct intel_iommu *iommu,
> +   struct dma_pte **pgd)
> +{
> + int agaw;
> +
> + for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) {
> + *pgd = phys_to_virt(dma_pte_addr(*pgd));
> + if (!dma_pte_present(*pgd)) {
> + return -EINVAL;
> + }
> + }
> + pr_debug_ratelimited("%s: pgd: %llx, agaw %d d_agaw %d\n",
> __func__, (u64)*pgd,
> + iommu->agaw, domain->agaw);
> +
> + return agaw;
> +}
> +
> +/*
>   * Set up the scalable mode pasid entry for second only translation type.
>   */
>  int intel_pasid_setup_second_level(struct intel_iommu *iommu,
> --
> 2.7.4



RE: [PATCH V10 04/11] iommu/vt-d: Use helper function to skip agaw for SL

2020-03-27 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Saturday, March 21, 2020 7:28 AM
> 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/intel-pasid.c | 14 --
>  1 file changed, 4 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> index 191508c7c03e..9bdb7ee228b6 100644
> --- a/drivers/iommu/intel-pasid.c
> +++ b/drivers/iommu/intel-pasid.c
> @@ -544,17 +544,11 @@ int intel_pasid_setup_second_level(struct
> intel_iommu *iommu,
>   return -EINVAL;
>   }
> 
> - /*
> -  * Skip top levels of page tables for iommu which has less agaw
> -  * than default. Unnecessary for PT mode.
> -  */
>   pgd = domain->pgd;
> - for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) {
> - pgd = phys_to_virt(dma_pte_addr(pgd));
> - if (!dma_pte_present(pgd)) {
> - dev_err(dev, "Invalid domain page table\n");
> - return -EINVAL;
> - }
> + agaw = iommu_skip_agaw(domain, iommu, &pgd);
> + if (agaw < 0) {
> + dev_err(dev, "Invalid domain page table\n");
> + return -EINVAL;
>   }

OK, I see how it is used. Possibly combine the last patch and this one,
since it is mostly moving code...

> 
>   pgd_val = virt_to_phys(pgd);
> --
> 2.7.4



RE: [PATCH V10 05/11] iommu/vt-d: Add nested translation helper function

2020-03-27 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Saturday, March 21, 2020 7:28 AM
> 
> Nested translation mode is supported in VT-d 3.0 Spec.CH 3.8.

Now the spec is already at rev 3.1. 😊

> With PASID granular translation type set to 0x11b, translation
> result from the first level(FL) also subject to a second level(SL)
> page table translation. This mode is used for SVA virtualization,
> where FL performs guest virtual to guest physical translation and
> SL performs guest physical to host physical translation.
> 
> This patch adds a helper function for setting up nested translation
> where second level comes from a domain and first level comes from
> a guest PGD.
> 
> Signed-off-by: Jacob Pan 
> Signed-off-by: Liu, Yi L 
> ---
>  drivers/iommu/intel-pasid.c | 240
> +++-
>  drivers/iommu/intel-pasid.h |  12 +++
>  include/linux/intel-iommu.h |   3 +
>  3 files changed, 252 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> index 9bdb7ee228b6..10c7856afc6b 100644
> --- a/drivers/iommu/intel-pasid.c
> +++ b/drivers/iommu/intel-pasid.c
> @@ -359,6 +359,76 @@ pasid_set_flpm(struct pasid_entry *pe, u64 value)
>   pasid_set_bits(&pe->val[2], GENMASK_ULL(3, 2), value << 2);
>  }
> 
> +/*
> + * Setup the Extended Memory Type(EMT) field (Bits 91-93)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_emt(struct pasid_entry *pe, u64 value)
> +{
> + pasid_set_bits(&pe->val[1], GENMASK_ULL(29, 27), value << 27);
> +}
> +
> +/*
> + * Setup the Page Attribute Table (PAT) field (Bits 96-127)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_pat(struct pasid_entry *pe, u64 value)
> +{
> + pasid_set_bits(&pe->val[1], GENMASK_ULL(63, 32), value << 32);
> +}
> +
> +/*
> + * Setup the Cache Disable (CD) field (Bit 89)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_cd(struct pasid_entry *pe)
> +{
> + pasid_set_bits(&pe->val[1], 1 << 25, 1 << 25);
> +}
> +
> +/*
> + * Setup the Extended Memory Type Enable (EMTE) field (Bit 90)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_emte(struct pasid_entry *pe)
> +{
> + pasid_set_bits(&pe->val[1], 1 << 26, 1 << 26);
> +}
> +
> +/*
> + * Setup the Extended Access Flag Enable (EAFE) field (Bit 135)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_eafe(struct pasid_entry *pe)
> +{
> + pasid_set_bits(&pe->val[2], 1 << 7, 1 << 7);
> +}
> +
> +/*
> + * Setup the Page-level Cache Disable (PCD) field (Bit 95)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_pcd(struct pasid_entry *pe)
> +{
> + pasid_set_bits(&pe->val[1], 1 << 31, 1 << 31);
> +}
> +
> +/*
> + * Setup the Page-level Write-Through (PWT)) field (Bit 94)
> + * of a scalable mode PASID entry.
> + */
> +static inline void
> +pasid_set_pwt(struct pasid_entry *pe)
> +{
> + pasid_set_bits(&pe->val[1], 1 << 30, 1 << 30);
> +}
> +
>  static void
>  pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
>   u16 did, int pasid)
> @@ -492,7 +562,7 @@ int intel_pasid_setup_first_level(struct intel_iommu
> *iommu,
>   pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> 
>   /* Setup Present and PASID Granular Transfer Type: */
> - pasid_set_translation_type(pte, 1);
> + pasid_set_translation_type(pte, PASID_ENTRY_PGTT_FL_ONLY);
>   pasid_set_present(pte);
>   pasid_flush_caches(iommu, pte, pasid, did);
> 
> @@ -564,7 +634,7 @@ int intel_pasid_setup_second_level(struct
> intel_iommu *iommu,
>   pasid_set_domain_id(pte, did);
>   pasid_set_slptr(pte, pgd_val);
>   pasid_set_address_width(pte, agaw);
> - pasid_set_translation_type(pte, 2);
> + pasid_set_translation_type(pte, PASID_ENTRY_PGTT_SL_ONLY);
>   pasid_set_fault_enable(pte);
>   pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> 
> @@ -598,7 +668,7 @@ int intel_pasid_setup_pass_through(struct
> intel_iommu *iommu,
>   pasid_clear_entry(pte);
>   pasid_set_domain_id(pte, did);
>   pasid_set_address_width(pte, iommu->agaw);
> - pasid_set_translation_type(pte, 4);
> + pasid_set_translation_type(pte, PASID_ENTRY_PGTT_PT);
>   pasid_set_fault_enable(pte);
>   pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> 
> @@ -612,3 +682,167 @@ int intel_pasid_setup_pass_through(struct
> intel_iommu *iommu,
> 
>   return 0;
>  }
> +
> +static int intel_pasid_setup_bind_data(struct intel_iommu *iommu,
> + struct pasid_entry *pte,
> + struct iommu_gpasid_bind_data_vtd
> *pasid_data)
> +{
> + /*
> +  * Not all guest PASID table entry fields are passed down during bind,
> +  * here we only set up the ones that are dependent on guest settings.
> +  * Execution related bits such as NXE, SMEP are not meaningful to
> IOMMU

RE: [PATCH 03/10] iommu/ioasid: Introduce per set allocation APIs

2020-03-27 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Saturday, March 28, 2020 12:59 AM
> 
> On Fri, 27 Mar 2020 08:38:44 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Jacob Pan 
> > > Sent: Thursday, March 26, 2020 1:55 AM
> > >
> > > IOASID set defines a group of IDs that share the same token. The
> > > ioasid_set concept helps to do permission checking among users as
> > > in the current code.
> > >
> > > With guest SVA usage, each VM has its own IOASID set. More
> > > functionalities are needed:
> > > 1. Enforce quota, each guest may be assigned limited quota such
> > > that one guest cannot abuse all the system resource.
> > > 2. Stores IOASID mapping between guest and host IOASIDs
> > > 3. Per set operations, e.g. free the entire set
> > >
> > > For each ioasid_set token, a unique set ID is assigned. This makes
> > > reference of the set and data lookup much easier to implement.
> > >
> > > Signed-off-by: Liu Yi L 
> > > Signed-off-by: Jacob Pan 
> > > ---
> > >  drivers/iommu/ioasid.c | 147
> > > +
> > >  include/linux/ioasid.h |  13 +
> > >  2 files changed, 160 insertions(+)
> > >
> > > diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> > > index 4026e52855b9..27ee57f7079b 100644
> > > --- a/drivers/iommu/ioasid.c
> > > +++ b/drivers/iommu/ioasid.c
> > > @@ -10,6 +10,25 @@
> > >  #include 
> > >  #include 
> > >
> > > +static DEFINE_XARRAY_ALLOC(ioasid_sets);
> > > +/**
> > > + * struct ioasid_set_data - Meta data about ioasid_set
> > > + *
> > > + * @token:   Unique to identify an IOASID set
> > > + * @xa:  XArray to store subset ID and IOASID
> > > mapping
> >
> > what is a subset? is it a different thing from set?
> >
> Subset is a set, but a subset ID is an ID only valid within the set.
> When we have non-identity Guest-Host PASID mapping, Subset ID is
> the Guest PASID but in more general terms. Or call it "Set Private ID"
> 
> This can be confusing, perhaps I rephrase it as:
> "XArray to store ioasid_set private ID to system-wide IOASID mapping"
> 
> 
> > > + * @size:Max number of IOASIDs can be allocated within the
> > > set
> >
> > 'size' reads more like 'current size' instead of 'max size'. maybe
> > call it 'max_ioasids' to align with 'nr_ioasids'? or simplify both as
> > 'max' and 'nr'?
> >
> Right, how about max_id and nr_id?

sounds good.

> 
> > > + * @nr_ioasids   Number of IOASIDs allocated in the set
> > > + * @sid  ID of the set
> > > + */
> > > +struct ioasid_set_data {
> > > + struct ioasid_set *token;
> > > + struct xarray xa;
> > > + int size;
> > > + int nr_ioasids;
> > > + int sid;
> > > + struct rcu_head rcu;
> > > +};
> > > +
> > >  struct ioasid_data {
> > >   ioasid_t id;
> > >   struct ioasid_set *set;
> > > @@ -388,6 +407,111 @@ void ioasid_free(ioasid_t ioasid)
> > >  EXPORT_SYMBOL_GPL(ioasid_free);
> > >
> > >  /**
> > > + * ioasid_alloc_set - Allocate a set of IOASIDs
> >
> > 'a set of IOASIDs' sounds like 'many IOASIDs'. Just saying 'allocate
> > an IOASID set' is clearer. 😊
> >
> Make sense
> 
> > > + * @token:   Unique token of the IOASID set
> > > + * @quota:   Quota allowed in this set
> > > + * @sid: IOASID set ID to be assigned
> > > + *
> > > + * Return 0 upon success. Token will be stored internally for
> > > lookup,
> > > + * IOASID allocation within the set and other per set operations
> > > will use
> > > + * the @sid assigned.
> > > + *
> > > + */
> > > +int ioasid_alloc_set(struct ioasid_set *token, ioasid_t quota, int
> > > *sid) +{
> > > + struct ioasid_set_data *sdata;
> > > + ioasid_t id;
> > > + int ret = 0;
> > > +
> > > + if (quota > ioasid_capacity_avail) {
> > > + pr_warn("Out of IOASID capacity! ask %d, avail
> > > %d\n",
> > > + quota, ioasid_capacity_avail);
> > > + return -ENOSPC;
> > > + }
> > > +
> > > + sdata = kzalloc(sizeof(*sdata), GFP_KERNEL);
> > > + if (!sdata)
> > > +

RE: [PATCH 05/10] iommu/ioasid: Create an IOASID set for host SVA use

2020-03-27 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Saturday, March 28, 2020 1:29 AM
> 
> On Fri, 27 Mar 2020 09:41:55 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Jacob Pan 
> > > Sent: Thursday, March 26, 2020 1:55 AM
> > >
> > > Bare metal SVA allocates IOASIDs for native process addresses. This
> > > should be separated from VM allocated IOASIDs thus under its own
> > > set.
> >
> > A curious question: now that bare metal SVA uses the system set and
> > guest SVA uses a dynamically-created set, do we still allow
> > ioasid_alloc() with a NULL set pointer?
> >
> Good point! There shouldn't be NULL set. That was one of the sticky
> point in the previous allocation API. I will add a check in
> ioasid_alloc_set().
> 
> However, there is still need for global search, e.g. PRS.
> https://lore.kernel.org/linux-arm-kernel/1d62b2e1-fe8c-066d-34e0-
> f7929f6a7...@arm.com/#t
> 
> In that case, use INVALID_IOASID_SET to indicate the search is global.
> e.g.
> ioasid_find(INVALID_IOASID_SET, data->hpasid, NULL);

ok, it makes sense.

> 
> > >
> > > This patch creates a system IOASID set with its quota set to
> > > PID_MAX. This is a reasonable default in that SVM capable devices
> > > can only bind to limited user processes.
> > >
> > > Signed-off-by: Jacob Pan 
> > > ---
> > >  drivers/iommu/intel-iommu.c | 8 +++-
> > >  drivers/iommu/ioasid.c  | 9 +
> > >  include/linux/ioasid.h  | 9 +
> > >  3 files changed, 25 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/iommu/intel-iommu.c
> > > b/drivers/iommu/intel-iommu.c index ec3fc121744a..af7a1ef7b31e
> > > 100644 --- a/drivers/iommu/intel-iommu.c
> > > +++ b/drivers/iommu/intel-iommu.c
> > > @@ -3511,8 +3511,14 @@ static int __init init_dmars(void)
> > >   goto free_iommu;
> > >
> > >   /* PASID is needed for scalable mode irrespective to SVM */
> > > - if (intel_iommu_sm)
> > > + if (intel_iommu_sm) {
> > >   ioasid_install_capacity(intel_pasid_max_id);
> > > + /* We should not run out of IOASIDs at boot */
> > > + if (ioasid_alloc_system_set(PID_MAX_DEFAULT)) {
> > > + pr_err("Failed to enable host PASID
> > > allocator\n");
> > > + intel_iommu_sm = 0;
> > > + }
> > > + }
> > >
> > >   /*
> > >* for each drhd
> > > diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> > > index 6265d2dbbced..9135af171a7c 100644
> > > --- a/drivers/iommu/ioasid.c
> > > +++ b/drivers/iommu/ioasid.c
> > > @@ -39,6 +39,9 @@ struct ioasid_data {
> > >  static ioasid_t ioasid_capacity;
> > >  static ioasid_t ioasid_capacity_avail;
> > >
> > > +int system_ioasid_sid;
> > > +static DECLARE_IOASID_SET(system_ioasid);
> > > +
> > >  /* System capacity can only be set once */
> > >  void ioasid_install_capacity(ioasid_t total)
> > >  {
> > > @@ -51,6 +54,12 @@ void ioasid_install_capacity(ioasid_t total)
> > >  }
> > >  EXPORT_SYMBOL_GPL(ioasid_install_capacity);
> > >
> > > +int ioasid_alloc_system_set(int quota)
> > > +{
> > > + return ioasid_alloc_set(&system_ioasid, quota,
> > > &system_ioasid_sid); +}
> > > +EXPORT_SYMBOL_GPL(ioasid_alloc_system_set);
> > > +
> > >  /*
> > >   * struct ioasid_allocator_data - Internal data structure to hold
> > > information
> > >   * about an allocator. There are two types of allocators:
> > > diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> > > index 8c82d2625671..097b1cc043a3 100644
> > > --- a/include/linux/ioasid.h
> > > +++ b/include/linux/ioasid.h
> > > @@ -29,6 +29,9 @@ struct ioasid_allocator_ops {
> > >   void *pdata;
> > >  };
> > >
> > > +/* Shared IOASID set reserved for host system use */
> > > +extern int system_ioasid_sid;
> > > +
> > >  #define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
> > >
> > >  #if IS_ENABLED(CONFIG_IOASID)
> > > @@ -41,6 +44,7 @@ int ioasid_register_allocator(struct
> > > ioasid_allocator_ops *allocator);
> > >  void ioasid_unregister_allocator(struct ioasid_allocator_ops
> > > *allocator); int ioasid_attach_data(ioasid_t ioasid, void *data);
> > >  void ioasid_install_capacity(ioasid_t total);
> > > +int ioasid_alloc_system_set(int quota);
> > >  int ioasid_alloc_set(struct ioasid_set *token, ioasid_t quota, int
> > > *sid); void ioasid_free_set(int sid, bool destroy_set);
> > >  int ioasid_find_sid(ioasid_t ioasid);
> > > @@ -88,5 +92,10 @@ static inline void
> > > ioasid_install_capacity(ioasid_t total) {
> > >  }
> > >
> > > +static inline int ioasid_alloc_system_set(int quota)
> > > +{
> > > + return -ENOTSUPP;
> > > +}
> > > +
> > >  #endif /* CONFIG_IOASID */
> > >  #endif /* __LINUX_IOASID_H */
> > > --
> > > 2.7.4
> >
> 
> [Jacob Pan]


RE: [PATCH 06/10] iommu/ioasid: Convert to set aware allocations

2020-03-27 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Saturday, March 28, 2020 1:42 AM
> 
> On Fri, 27 Mar 2020 09:54:11 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Jacob Pan 
> > > Sent: Thursday, March 26, 2020 1:55 AM
> > >
> > > The current ioasid_alloc function takes a token/ioasid_set then
> > > record it on the IOASID being allocated. There is no alloc/free on
> > > the ioasid_set.
> > >
> > > With the IOASID set APIs, callers must allocate an ioasid_set before
> > > allocate IOASIDs within the set. Quota and other ioasid_set level
> > > activities can then be enforced.
> > >
> > > This patch converts existing API to the new ioasid_set model.
> > >
> > > Signed-off-by: Liu Yi L 
> > > Signed-off-by: Jacob Pan 
> > > ---
> > >  drivers/iommu/intel-iommu.c | 10 +++---
> > >  drivers/iommu/intel-svm.c   | 10 +++---
> > >  drivers/iommu/ioasid.c  | 78
> > > +--- -
> > >  include/linux/ioasid.h  | 11 +++
> > >  4 files changed, 72 insertions(+), 37 deletions(-)
> > >
> > > diff --git a/drivers/iommu/intel-iommu.c
> > > b/drivers/iommu/intel-iommu.c index af7a1ef7b31e..c571cc8d9e57
> > > 100644 --- a/drivers/iommu/intel-iommu.c
> > > +++ b/drivers/iommu/intel-iommu.c
> > > @@ -3323,11 +3323,11 @@ static void intel_ioasid_free(ioasid_t
> > > ioasid, void *data)
> > >   if (!iommu)
> > >   return;
> > >   /*
> > > -  * Sanity check the ioasid owner is done at upper layer,
> > > e.g. VFIO
> > > -  * We can only free the PASID when all the devices are
> > > unbound.
> > > +  * In the guest, all IOASIDs belong to the system_ioasid
> > > set.
> > > +  * Sanity check against the system set.
> >
> > The code below has nothing to do with the guest, so why put a comment
> > that is specifically about the guest?
> >
> intel_ioasid_alloc/free() is the custom IOASID allocator only
> registered when running in the guest.

In that case, maybe renaming the functions to intel_guest_ioasid_alloc/free
would avoid confusion like mine?

> 
> The custom allocator calls virtual command. Since we don't support
> nested guest, all IOASIDs belong to the system ioasid_set.

Could you note the lack of nested guest support in the comment, so that
when people later want to add nested support they will know additional
work is required here?
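e.g.:

	/*
	 * The custom allocator is only registered when running in a
	 * guest, and nested guest is not supported, so all IOASIDs
	 * here belong to the system_ioasid set. Revisit this when
	 * nested guest support is added.
	 */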

> 
> > >*/
> > > - if (ioasid_find(NULL, ioasid, NULL)) {
> > > - pr_alert("Cannot free active IOASID %d\n", ioasid);
> > > + if (IS_ERR(ioasid_find(system_ioasid_sid, ioasid, NULL))) {
> > > + pr_err("Cannot free IOASID %d, not in system
> > > set\n", ioasid); return;
> > >   }
> > >   vcmd_free_pasid(iommu, ioasid);
> > > @@ -5541,7 +5541,7 @@ static int aux_domain_add_dev(struct
> > > dmar_domain *domain,
> > >   int pasid;
> > >
> > >   /* No private data needed for the default pasid */
> > > - pasid = ioasid_alloc(NULL, PASID_MIN,
> > > + pasid = ioasid_alloc(system_ioasid_sid, PASID_MIN,
> > >pci_max_pasids(to_pci_dev(dev))
> > > - 1, NULL);
> > >   if (pasid == INVALID_IOASID) {
> > > diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> > > index 1991587fd3fd..f511855d187b 100644
> > > --- a/drivers/iommu/intel-svm.c
> > > +++ b/drivers/iommu/intel-svm.c
> > > @@ -268,7 +268,7 @@ int intel_svm_bind_gpasid(struct iommu_domain
> > > *domain,
> > >   }
> > >
> > >   mutex_lock(&pasid_mutex);
> > > - svm = ioasid_find(NULL, data->hpasid, NULL);
> > > + svm = ioasid_find(INVALID_IOASID_SET, data->hpasid, NULL);
> > >   if (IS_ERR(svm)) {
> > >   ret = PTR_ERR(svm);
> > >   goto out;
> > > @@ -401,7 +401,7 @@ int intel_svm_unbind_gpasid(struct device *dev,
> > > int pasid)
> > >   return -EINVAL;
> > >
> > >   mutex_lock(&pasid_mutex);
> > > - svm = ioasid_find(NULL, pasid, NULL);
> > > + svm = ioasid_find(INVALID_IOASID_SET, pasid, NULL);
> > >   if (!svm) {
> > >   ret = -EINVAL;
> > >   goto out;
> > > @@ -559,7 +559,7 @@ static int intel_svm_bind_mm(struct device
> > > *dev, int flags, struct svm_dev_ops *
> > >   pasid_max = intel_pasid_max_id;

RE: [PATCH 08/10] iommu/ioasid: Introduce notifier APIs

2020-03-27 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Saturday, March 28, 2020 2:37 AM
> 
> On Fri, 27 Mar 2020 10:03:26 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Jacob Pan 
> > > Sent: Thursday, March 26, 2020 1:55 AM
> > >
> > > IOASID users fit into the publisher-subscriber pattern, a system
> > > wide blocking notifier chain can be used to inform subscribers of
> > > state changes. Notifier mechanism also abstracts publisher from
> > > knowing the private context each subscriber may have.
> > >
> > > This patch adds APIs and a global notifier chain, a further
> > > optimization might be per set notifier for ioasid_set aware users.
> > >
> > > Usage example:
> > > KVM register notifier block such that it can keep its guest-host
> > > PASID translation table in sync with any IOASID updates.
> > >
> > > VFIO publish IOASID change by performing alloc/free, bind/unbind
> > > operations.
> > >
> > > IOMMU driver gets notified when IOASID is freed by VFIO or core mm
> > > code such that PASID context can be cleaned up.
> >
> > above example looks mixed. You have KVM registers the notifier but
> > finally having IOMMU driver to get notified... 😊
> >
> Right, felt like a tale of two subscribers got mixed. I meant to list a
> few use cases with publisher and subscriber roles separate.
> I will change that to "Usage examples", and explicit state each role.
> 
> > >
> > > Signed-off-by: Liu Yi L 
> > > Signed-off-by: Jacob Pan 
> > > ---
> > >  drivers/iommu/ioasid.c | 61
> > > ++
> > >  include/linux/ioasid.h | 40 +
> > >  2 files changed, 101 insertions(+)
> > >
> > > diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> > > index 8612fe6477dc..27dce2cb5af2 100644
> > > --- a/drivers/iommu/ioasid.c
> > > +++ b/drivers/iommu/ioasid.c
> > > @@ -11,6 +11,22 @@
> > >  #include 
> > >
> > >  static DEFINE_XARRAY_ALLOC(ioasid_sets);
> > > +/*
> > > + * An IOASID could have multiple consumers. When a status change
> > > occurs,
> > > + * this notifier chain is used to keep them in sync. Each consumer
> > > of the
> > > + * IOASID service must register notifier block early to ensure no
> > > events
> > > + * are missed.
> > > + *
> > > + * This is a publisher-subscriber pattern where publisher can
> > > change the
> > > + * state of each IOASID, e.g. alloc/free, bind IOASID to a device
> > > and mm.
> > > + * On the other hand, subscribers gets notified for the state
> > > change and
> > > + * keep local states in sync.
> > > + *
> > > + * Currently, the notifier is global. A further optimization could
> > > be per
> > > + * IOASID set notifier chain.
> > > + */
> > > +static BLOCKING_NOTIFIER_HEAD(ioasid_chain);
> > > +
> > >  /**
> > >   * struct ioasid_set_data - Meta data about ioasid_set
> > >   *
> > > @@ -408,6 +424,7 @@ static void ioasid_free_locked(ioasid_t ioasid)
> > >  {
> > >   struct ioasid_data *ioasid_data;
> > >   struct ioasid_set_data *sdata;
> > > + struct ioasid_nb_args args;
> > >
> > >   ioasid_data = xa_load(&active_allocator->xa, ioasid);
> > >   if (!ioasid_data) {
> > > @@ -415,6 +432,13 @@ static void ioasid_free_locked(ioasid_t ioasid)
> > >   return;
> > >   }
> > >
> > > + args.id = ioasid;
> > > + args.sid = ioasid_data->sdata->sid;
> > > + args.pdata = ioasid_data->private;
> > > + args.set_token = ioasid_data->sdata->token;
> > > +
> > > + /* Notify all users that this IOASID is being freed */
> > > + blocking_notifier_call_chain(&ioasid_chain, IOASID_FREE,
> > > &args); active_allocator->ops->free(ioasid,
> > > active_allocator->ops->pdata); /* Custom allocator needs additional
> > > steps to free the xa element */ if (active_allocator->flags &
> > > IOASID_ALLOCATOR_CUSTOM) { @@ -624,6 +648,43 @@ int
> > > ioasid_find_sid(ioasid_t ioasid) }
> > >  EXPORT_SYMBOL_GPL(ioasid_find_sid);
> > >
> > > +int ioasid_add_notifier(struct notifier_block *nb)
> > > +{
> > > + return blocking_notifier_chain_register(&ioasid_chain, nb);
> > > +}
> > > +EXPORT_SYMBOL_GPL(ioasid_add_notifier);

RE: [PATCH 09/10] iommu/ioasid: Support ioasid_set quota adjustment

2020-03-27 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Saturday, March 28, 2020 7:31 AM
> 
> On Fri, 27 Mar 2020 10:09:04 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Jacob Pan 
> > > Sent: Thursday, March 26, 2020 1:56 AM
> > >
> > > IOASID set is allocated with an initial quota, at runtime there may
> > > be needs to balance IOASID resources among different VMs/sets.
> > >
> >
> > I may have overlooked previous patches, but I didn't see any place
> > setting the initial quota...
> >
> Initial quota is in place when the ioasid_set is allocated.
> 
> > > This patch adds a new API to adjust per set quota.
> >
> > Since this is purely an internal kernel API, does this imply that the
> > publisher (e.g. VFIO) is responsible for exposing its own uAPI to set
> > the quota?
> >
> yes, VFIO will do the adjustment. I think Alex suggested module
> parameters.

ok, I remember that.

> 
> > >
> > > Signed-off-by: Jacob Pan 
> > > ---
> > >  drivers/iommu/ioasid.c | 44
> > > 
> > >  include/linux/ioasid.h |  6 ++
> > >  2 files changed, 50 insertions(+)
> > >
> > > diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> > > index 27dce2cb5af2..5ac28862a1db 100644
> > > --- a/drivers/iommu/ioasid.c
> > > +++ b/drivers/iommu/ioasid.c
> > > @@ -578,6 +578,50 @@ void ioasid_free_set(int sid, bool destroy_set)
> > >  }
> > >  EXPORT_SYMBOL_GPL(ioasid_free_set);
> > >
> > > +/**
> > > + * ioasid_adjust_set - Adjust the quota of an IOASID set
> > > + * @quota:   Quota allowed in this set
> > > + * @sid: IOASID set ID to be assigned
> > > + *
> > > + * Return 0 on success. If the new quota is smaller than the
> > > number of
> > > + * IOASIDs already allocated, -EINVAL will be returned. No change
> > > will be
> > > + * made to the existing quota.
> > > + */
> > > +int ioasid_adjust_set(int sid, int quota)
> > > +{
> > > + struct ioasid_set_data *sdata;
> > > + int ret = 0;
> > > +
> > > + mutex_lock(&ioasid_allocator_lock);
> > > + sdata = xa_load(&ioasid_sets, sid);
> > > + if (!sdata || sdata->nr_ioasids > quota) {
> > > + pr_err("Failed to adjust IOASID set %d quota %d\n",
> > > + sid, quota);
> > > + ret = -EINVAL;
> > > + goto done_unlock;
> > > + }
> > > +
> > > + if (quota >= ioasid_capacity_avail) {
> > > + ret = -ENOSPC;
> > > + goto done_unlock;
> > > + }
> > > +
> > > + /* Return the delta back to system pool */
> > > + ioasid_capacity_avail += sdata->size - quota;
> > > +
> > > + /*
> > > +  * May have a policy to prevent giving all available
> > > IOASIDs
> > > +  * to one set. But we don't enforce here, it should be in
> > > the
> > > +  * upper layers.
> > > +  */
> > > + sdata->size = quota;
> > > +
> > > +done_unlock:
> > > + mutex_unlock(&ioasid_allocator_lock);
> > > +
> > > + return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(ioasid_adjust_set);
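
As a usage note (a sketch, not in this patch): the expected call pattern
from the privileged path would be something like

	/* shrink VM A's quota, grow VM B's; sids come from ioasid_alloc_set() */
	ret = ioasid_adjust_set(vm_a_sid, 500);
	if (ret == -EINVAL)
		return ret;	/* VM A already holds more than 500 IOASIDs */
	ret = ioasid_adjust_set(vm_b_sid, 1500);

per the kdoc above, with no change made to a set whose allocations already
exceed the requested quota.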
> > >
> > >  /**
> > >   * ioasid_find - Find IOASID data
> > > diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> > > index 32d032913828..6e7de6fb91bf 100644
> > > --- a/include/linux/ioasid.h
> > > +++ b/include/linux/ioasid.h
> > > @@ -73,6 +73,7 @@ int ioasid_alloc_set(struct ioasid_set *token,
> > > ioasid_t quota, int *sid);
> > >  void ioasid_free_set(int sid, bool destroy_set);
> > >  int ioasid_find_sid(ioasid_t ioasid);
> > >  int ioasid_notify(ioasid_t id, enum ioasid_notify_val cmd);
> > > +int ioasid_adjust_set(int sid, int quota);
> > >
> > >  #else /* !CONFIG_IOASID */
> > >  static inline ioasid_t ioasid_alloc(int sid, ioasid_t min,
> > > @@ -136,5 +137,10 @@ static inline int ioasid_alloc_system_set(int
> > > quota) return -ENOTSUPP;
> > >  }
> > >
> > > +static inline int ioasid_adjust_set(int sid, int quota)
> > > +{
> > > + return -ENOTSUPP;
> > > +}
> > > +
> > >  #endif /* CONFIG_IOASID */
> > >  #endif /* __LINUX_IOASID_H */
> > > --
> > > 2.7.4
> >
> 
> [Jacob Pan]


RE: [PATCH V10 06/11] iommu/vt-d: Add bind guest PASID support

2020-03-28 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Saturday, March 21, 2020 7:28 AM
> 
> When supporting guest SVA with emulated IOMMU, the guest PASID
> table is shadowed in VMM. Updates to guest vIOMMU PASID table
> will result in PASID cache flush which will be passed down to
> the host as bind guest PASID calls.
> 
> For the SL page tables, it will be harvested from device's
> default domain (request w/o PASID), or aux domain in case of
> mediated device.
> 
> .-------------.  .---------------------------.
> |   vIOMMU    |  | Guest process CR3, FL only|
> |             |  '---------------------------'
> .----------------/
> | PASID Entry |--- PASID cache flush -
> '-------------'   |
> |             |   V
> |             |  CR3 in GPA
> '-------------'
> Guest
> ------| Shadow |--------------------------|------------
>       v        v                          v
> Host
> .-------------.  .----------------------.
> |   pIOMMU    |  | Bind FL for GVA-GPA  |
> |             |  '----------------------'
> .----------------/  |
> | PASID Entry |     V (Nested xlate)
> '----------------\.------------------------------.
> |             |   |SL for GPA-HPA, default domain|
> |             |   '------------------------------'
> '-------------'
> Where:
>  - FL = First level/stage one page tables
>  - SL = Second level/stage two page tables
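
To make the flow concrete: on such a PASID cache flush, the VMM fills in
struct iommu_gpasid_bind_data and calls down, roughly as below (a sketch;
values are illustrative and the VT-d specific fields are omitted):

	struct iommu_gpasid_bind_data data = {
		.version    = IOMMU_GPASID_BIND_VERSION_1,
		.format     = IOMMU_PASID_FORMAT_INTEL_VTD,
		.flags      = IOMMU_SVA_GPASID_VAL,	/* gpasid carries the guest PASID */
		.gpgd       = guest_cr3_gpa,	/* FL pgd, a GPA, from the guest PASID entry */
		.hpasid     = host_pasid,	/* allocated via the VCMD path */
		.gpasid     = guest_pasid,
		.addr_width = 48,	/* guest paging mode, 4-level here */
	};

	ret = iommu_sva_bind_gpasid(domain, dev, &data);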
> 
> Signed-off-by: Jacob Pan 
> Signed-off-by: Liu, Yi L 
> ---
>  drivers/iommu/intel-iommu.c |   4 +
>  drivers/iommu/intel-svm.c   | 224
> 
>  include/linux/intel-iommu.h |   8 +-
>  include/linux/intel-svm.h   |  17 
>  4 files changed, 252 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index e599b2537b1c..b1477cd423dd 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -6203,6 +6203,10 @@ const struct iommu_ops intel_iommu_ops = {
>   .dev_disable_feat   = intel_iommu_dev_disable_feat,
>   .is_attach_deferred = intel_iommu_is_attach_deferred,
>   .pgsize_bitmap  = INTEL_IOMMU_PGSIZES,
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> + .sva_bind_gpasid= intel_svm_bind_gpasid,
> + .sva_unbind_gpasid  = intel_svm_unbind_gpasid,
> +#endif
>  };
> 
>  static void quirk_iommu_igfx(struct pci_dev *dev)
> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> index d7f2a5358900..47c0deb5ae56 100644
> --- a/drivers/iommu/intel-svm.c
> +++ b/drivers/iommu/intel-svm.c
> @@ -226,6 +226,230 @@ static LIST_HEAD(global_svm_list);
>   list_for_each_entry((sdev), &(svm)->devs, list) \
>   if ((d) != (sdev)->dev) {} else
> 
> +int intel_svm_bind_gpasid(struct iommu_domain *domain,
> + struct device *dev,
> + struct iommu_gpasid_bind_data *data)
> +{
> + struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> + struct dmar_domain *ddomain;

what about using the full name, e.g. dmar_domain? It is a bit longer
but clearer than ddomain.

> + struct intel_svm_dev *sdev;
> + struct intel_svm *svm;
> + int ret = 0;
> +
> + if (WARN_ON(!iommu) || !data)
> + return -EINVAL;
> +
> + if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
> + data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
> + return -EINVAL;
> +
> + if (dev_is_pci(dev)) {
> + /* VT-d supports devices with full 20 bit PASIDs only */
> + if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
> + return -EINVAL;
> + } else {
> + return -ENOTSUPP;
> + }
> +
> + /*
> +  * We only check host PASID range, we have no knowledge to check
> +  * guest PASID range nor do we use the guest PASID.
> +  */
> + if (data->hpasid <= 0 || data->hpasid >= PASID_MAX)
> + return -EINVAL;
> +
> + ddomain = to_dmar_domain(domain);
> +
> + /* Sanity check paging mode support match between host and guest
> */
> + if (data->addr_width == ADDR_WIDTH_5LEVEL &&
> + !cap_5lp_support(iommu->cap)) {
> + pr_err("Cannot support 5 level paging requested by
> guest!\n");
> + return -EINVAL;
> + }

-ENOTSUPP?

> +
> + mutex_lock(&pasid_mutex);
> + svm = ioasid_find(NULL, data->hpasid, NULL);
> + if (IS_ERR(svm)) {
> + ret = PTR_ERR(svm);
> + goto out;
> + }
> +
> + if (svm) {
> + /*
> +  * If we found svm for the PASID, there must be at
> +  * least one device bond, otherwise svm should be freed.
> +  */
> + if (WARN_ON(list_empty(&svm->devs))) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + if (svm->mm == get_task_mm(current) &&
> + data->hpasid == svm->pasi

RE: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function

2020-03-28 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Saturday, March 21, 2020 7:28 AM
> 
> When Shared Virtual Address (SVA) is enabled for a guest OS via
> vIOMMU, we need to provide invalidation support at IOMMU API and driver
> level. This patch adds Intel VT-d specific function to implement
> iommu passdown invalidate API for shared virtual address.
> 
> The use case is for supporting caching structure invalidation
> of assigned SVM capable devices. Emulated IOMMU exposes queue

emulated IOMMU -> vIOMMU, since virtio-iommu could use the
interface as well.

> invalidation capability and passes down all descriptors from the guest
> to the physical IOMMU.
> 
> The assumption is that guest to host device ID mapping should be
> resolved prior to calling IOMMU driver. Based on the device handle,
> host IOMMU driver can replace certain fields before submit to the
> invalidation queue.
> 
> ---
> v7 review fixed in v10
> ---
> 
> Signed-off-by: Jacob Pan 
> Signed-off-by: Ashok Raj 
> Signed-off-by: Liu, Yi L 
> ---
>  drivers/iommu/intel-iommu.c | 182
> 
>  1 file changed, 182 insertions(+)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index b1477cd423dd..a76afb0fd51a 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5619,6 +5619,187 @@ static void
> intel_iommu_aux_detach_device(struct iommu_domain *domain,
>   aux_domain_remove_dev(to_dmar_domain(domain), dev);
>  }
> 
> +/*
> + * 2D array for converting and sanitizing IOMMU generic TLB granularity to
> + * VT-d granularity. Invalidation is typically included in the unmap 
> operation
> + * as a result of DMA or VFIO unmap. However, for assigned devices guest
> + * owns the first level page tables. Invalidations of translation caches in 
> the
> + * guest are trapped and passed down to the host.
> + *
> + * vIOMMU in the guest will only expose first level page tables, therefore
> + * we do not include IOTLB granularity for request without PASID (second
> level).

I would revise above as "We do not support IOTLB granularity for request 
without PASID (second level), therefore any vIOMMU implementation that
exposes the SVA capability to the guest should only expose the first level
page tables, implying all invalidation requests from the guest will include
a valid PASID"

> + *
> + * For example, to find the VT-d granularity encoding for IOTLB
> + * type and page selective granularity within PASID:
> + * X: indexed by iommu cache type
> + * Y: indexed by enum iommu_inv_granularity
> + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> + *
> + * Granu_map array indicates validity of the table. 1: valid, 0: invalid
> + *
> + */
> +const static int inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> + /*
> +  * PASID based IOTLB invalidation: PASID selective (per PASID),
> +  * page selective (address granularity)
> +  */
> + {0, 1, 1},
> + /* PASID based dev TLBs, only support all PASIDs or single PASID */
> + {1, 1, 0},

Is this combination correct? When a single PASID is specified, it is
essentially a page-selective invalidation since you need to provide an
Address and a Size.

> + /* PASID cache */

PASID cache is fully managed by the host. Guest PASID cache invalidation
is interpreted by vIOMMU for bind and unbind operations. I don't think
we should accept any PASID cache invalidation from userspace or guest.

> + {1, 1, 0}
> +};
> +
> +const static int inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> + /* PASID based IOTLB */
> + {0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> + /* PASID based dev TLBs */
> + {QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
> + /* PASID cache */
> + {QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
> +};
> +
> +static inline int to_vtd_granularity(int type, int granu, int *vtd_granu)
> +{
> + if (type >= IOMMU_CACHE_INV_TYPE_NR || granu >=
> IOMMU_INV_GRANU_NR ||
> + !inv_type_granu_map[type][granu])
> + return -EINVAL;
> +
> + *vtd_granu = inv_type_granu_table[type][granu];
> +

btw do we really need both map and table here? Can't we just
use one table with unsupported granularity marked as a special
value?
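
i.e. something like (a sketch, assuming no valid VT-d encoding is itself 0;
otherwise a dedicated invalid sentinel would be needed):

	/* 0 marks an unsupported type/granularity combination */
	const static u64 inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
		{0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
		{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
		{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
	};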

> + return 0;
> +}
> +
> +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
> +{
> + u64 nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
> +
> + /* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
> +  * IOMMU cache invalidate API passes granu_size in bytes, and
> number of
> +  * granu size in contiguous memory.
> +  */
> + return order_base_2(nr_pages);
> +}
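
(Sanity-checking the math: granu_size = 4KB and nr_granules = 512 gives
nr_pages = (4096 * 512) >> 12 = 512, and order_base_2(512) = 9, matching
the 2MB encoding noted in the comment above.)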
> +
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
> + struct device *dev, struct iommu_cache_invalidate_info
> *inv_info)
> +{
> + struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +   

RE: [PATCH V10 09/11] iommu/vt-d: Cache virtual command capability register

2020-03-28 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Saturday, March 21, 2020 7:28 AM
> 
> Virtual command registers are used in the guest only. To avoid the
> vmexit cost, we cache the capability and store it during initialization.
> 
> Signed-off-by: Jacob Pan 
> Reviewed-by: Eric Auger 
> Reviewed-by: Lu Baolu 
> 
> ---
> v7 Reviewed by Eric & Baolu
> ---
> ---
>  drivers/iommu/dmar.c| 1 +
>  include/linux/intel-iommu.h | 5 +
>  2 files changed, 6 insertions(+)
> 
> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> index 4d6b7b5b37ee..3b36491c8bbb 100644
> --- a/drivers/iommu/dmar.c
> +++ b/drivers/iommu/dmar.c
> @@ -963,6 +963,7 @@ static int map_iommu(struct intel_iommu *iommu, u64 phys_addr)
>   warn_invalid_dmar(phys_addr, " returns all ones");
>   goto unmap;
>   }
> + iommu->vccap = dmar_readq(iommu->reg + DMAR_VCCAP_REG);
> 
>   /* the registers might be more than one page */
>   map_size = max_t(int, ecap_max_iotlb_offset(iommu->ecap),
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 43539713b3b3..ccbf164fb711 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -194,6 +194,9 @@
>  #define ecap_max_handle_mask(e) ((e >> 20) & 0xf)
>  #define ecap_sc_support(e)   ((e >> 7) & 0x1) /* Snooping Control */
> 
> +/* Virtual command interface capabilities */

capabilities -> capability

Reviewed-by: Kevin Tian 

> +#define vccap_pasid(v)   ((v & DMA_VCS_PAS)) /* PASID allocation */
> +
>  /* IOTLB_REG */
>  #define DMA_TLB_FLUSH_GRANU_OFFSET  60
>  #define DMA_TLB_GLOBAL_FLUSH (((u64)1) << 60)
> @@ -287,6 +290,7 @@
> 
>  /* PRS_REG */
>  #define DMA_PRS_PPR  ((u32)1)
> +#define DMA_VCS_PAS  ((u64)1)
> 
>  #define IOMMU_WAIT_OP(iommu, offset, op, cond, sts) \
>  do { \
> @@ -537,6 +541,7 @@ struct intel_iommu {
>   u64 reg_size; /* size of hw register set */
>   u64 cap;
>   u64 ecap;
> + u64 vccap;
> + u32 gcmd; /* Holds TE, EAFL. Don't need SRTP, SFL, WBF */
>   raw_spinlock_t  register_lock; /* protect register handling */
>   int seq_id; /* sequence id of the iommu */
> --
> 2.7.4



RE: [PATCH V10 10/11] iommu/vt-d: Enlightened PASID allocation

2020-03-28 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Saturday, March 21, 2020 7:28 AM
> 
> From: Lu Baolu 
> 
> Enabling IOMMU in a guest requires communication with the host
> driver for certain aspects. Use of a PASID to enable Shared Virtual
> Addressing (SVA) requires managing PASIDs in the host. The VT-d 3.0 spec
> provides a Virtual Command Register (VCMD) to facilitate this.
> Writes to this register in the guest are trapped by QEMU which
> proxies the call to the host driver.

Qemu -> vIOMMU

> 
> This virtual command interface consists of a capability register,
> a virtual command register, and a virtual response register. Refer
> to section 10.4.42, 10.4.43, 10.4.44 for more information.
> 
> This patch adds the enlightened PASID allocation/free interfaces
> via the virtual command interface.
> 
> Cc: Ashok Raj 
> Cc: Jacob Pan 
> Cc: Kevin Tian 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Lu Baolu 
> Signed-off-by: Jacob Pan 
> Reviewed-by: Eric Auger 
> ---
>  drivers/iommu/intel-pasid.c | 57 +
>  drivers/iommu/intel-pasid.h | 13 ++-
>  include/linux/intel-iommu.h |  1 +
>  3 files changed, 70 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> index 9f6d07410722..e87ad67aad36 100644
> --- a/drivers/iommu/intel-pasid.c
> +++ b/drivers/iommu/intel-pasid.c
> @@ -27,6 +27,63 @@
>  static DEFINE_SPINLOCK(pasid_lock);
>  u32 intel_pasid_max_id = PASID_MAX;
> 
> +int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int *pasid)
> +{
> + unsigned long flags;
> + u8 status_code;
> + int ret = 0;
> + u64 res;
> +
> + raw_spin_lock_irqsave(&iommu->register_lock, flags);
> + dmar_writeq(iommu->reg + DMAR_VCMD_REG, VCMD_CMD_ALLOC);
> + IOMMU_WAIT_OP(iommu, DMAR_VCRSP_REG, dmar_readq,
> +   !(res & VCMD_VRSP_IP), res);
> + raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
> +
> + status_code = VCMD_VRSP_SC(res);
> + switch (status_code) {
> + case VCMD_VRSP_SC_SUCCESS:
> + *pasid = VCMD_VRSP_RESULT_PASID(res);
> + break;
> + case VCMD_VRSP_SC_NO_PASID_AVAIL:
> + pr_info("IOMMU: %s: No PASID available\n", iommu->name);
> + ret = -ENOSPC;
> + break;
> + default:
> + ret = -ENODEV;
> + pr_warn("IOMMU: %s: Unexpected error code %d\n",
> + iommu->name, status_code);
> + }
> +
> + return ret;
> +}
> +
> +void vcmd_free_pasid(struct intel_iommu *iommu, unsigned int pasid)
> +{
> + unsigned long flags;
> + u8 status_code;
> + u64 res;
> +
> + raw_spin_lock_irqsave(&iommu->register_lock, flags);
> + dmar_writeq(iommu->reg + DMAR_VCMD_REG,
> + VCMD_CMD_OPERAND(pasid) | VCMD_CMD_FREE);
> + IOMMU_WAIT_OP(iommu, DMAR_VCRSP_REG, dmar_readq,
> +   !(res & VCMD_VRSP_IP), res);
> + raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
> +
> + status_code = VCMD_VRSP_SC(res);
> + switch (status_code) {
> + case VCMD_VRSP_SC_SUCCESS:
> + break;
> + case VCMD_VRSP_SC_INVALID_PASID:
> + pr_info("IOMMU: %s: Invalid PASID\n", iommu->name);
> + break;
> + default:
> + pr_warn("IOMMU: %s: Unexpected error code %d\n",
> + iommu->name, status_code);
> + }
> +}
> +
>  /*
>   * Per device pasid table management:
>   */
> diff --git a/drivers/iommu/intel-pasid.h b/drivers/iommu/intel-pasid.h
> index 698015ee3f04..cd3d63f3e936 100644
> --- a/drivers/iommu/intel-pasid.h
> +++ b/drivers/iommu/intel-pasid.h
> @@ -23,6 +23,16 @@
>  #define is_pasid_enabled(entry)  (((entry)->lo >> 3) & 0x1)
>  #define get_pasid_dir_size(entry) (1 << ((((entry)->lo >> 9) & 0x7) + 7))
> 
> +/* Virtual command interface for enlightened pasid management. */
> +#define VCMD_CMD_ALLOC   0x1
> +#define VCMD_CMD_FREE0x2
> +#define VCMD_VRSP_IP 0x1
> +#define VCMD_VRSP_SC(e)  (((e) >> 1) & 0x3)
> +#define VCMD_VRSP_SC_SUCCESS 0
> +#define VCMD_VRSP_SC_NO_PASID_AVAIL  1
> +#define VCMD_VRSP_SC_INVALID_PASID   1
> +#define VCMD_VRSP_RESULT_PASID(e)    (((e) >> 8) & 0xfffff)
> +#define VCMD_CMD_OPERAND(e)  ((e) << 8)
>  /*
>   * Domain ID reserved for pasid entries programmed for first-level
>   * only and pass-through transfer modes.
> @@ -113,5 +123,6 @@ int intel_pasid_setup_nested(struct intel_iommu
> *iommu,
>   int addr_width);
>  void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
>struct device *dev, int pasid);
> -
> +int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int *pasid);
> +void vcmd_free_pasid(struct intel_iommu *iommu, unsigned int pasid);
>  #endif /* __INTEL_PASID_H */
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index ccb

RE: [PATCH V10 11/11] iommu/vt-d: Add custom allocator for IOASID

2020-03-28 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Saturday, March 21, 2020 7:28 AM
> 
> When VT-d driver runs in the guest, PASID allocation must be
> performed via virtual command interface. This patch registers a
> custom IOASID allocator which takes precedence over the default
> XArray based allocator. The resulting IOASID allocation will always
> come from the host. This ensures that PASID namespace is system-
> wide.
> 
> Signed-off-by: Lu Baolu 
> Signed-off-by: Liu, Yi L 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/intel-iommu.c | 84 +
>  include/linux/intel-iommu.h |  2 ++
>  2 files changed, 86 insertions(+)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index a76afb0fd51a..c1c0b0fb93c3 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -1757,6 +1757,9 @@ static void free_dmar_iommu(struct intel_iommu
> *iommu)
>   if (ecap_prs(iommu->ecap))
>   intel_svm_finish_prq(iommu);
>   }
> + if (ecap_vcs(iommu->ecap) && vccap_pasid(iommu->vccap))
> + ioasid_unregister_allocator(&iommu->pasid_allocator);
> +
>  #endif
>  }
> 
> @@ -3291,6 +3294,84 @@ static int copy_translation_tables(struct
> intel_iommu *iommu)
>   return ret;
>  }
> 
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +static ioasid_t intel_ioasid_alloc(ioasid_t min, ioasid_t max, void *data)

the name is too generic... can we add vcmd in the name to clarify
its purpose, e.g. intel_vcmd_ioasid_alloc?

> +{
> + struct intel_iommu *iommu = data;
> + ioasid_t ioasid;
> +
> + if (!iommu)
> + return INVALID_IOASID;
> + /*
> +  * VT-d virtual command interface always uses the full 20 bit
> +  * PASID range. Host can partition guest PASID range based on
> +  * policies but it is out of guest's control.
> +  */
> + if (min < PASID_MIN || max > intel_pasid_max_id)
> + return INVALID_IOASID;
> +
> + if (vcmd_alloc_pasid(iommu, &ioasid))
> + return INVALID_IOASID;
> +
> + return ioasid;
> +}
> +
> +static void intel_ioasid_free(ioasid_t ioasid, void *data)
> +{
> + struct intel_iommu *iommu = data;
> +
> + if (!iommu)
> + return;
> + /*
> +  * Sanity check the ioasid owner is done at upper layer, e.g. VFIO
> +  * We can only free the PASID when all the devices are unbound.
> +  */
> + if (ioasid_find(NULL, ioasid, NULL)) {
> + pr_alert("Cannot free active IOASID %d\n", ioasid);
> + return;
> + }

However, the sanity check is not done in default_free. Is there a reason
why using vcmd adds such a new requirement?

> + vcmd_free_pasid(iommu, ioasid);
> +}
> +
> +static void register_pasid_allocator(struct intel_iommu *iommu)
> +{
> + /*
> +  * If we are running in the host, no need for custom allocator
> +  * in that PASIDs are allocated from the host system-wide.
> +  */
> + if (!cap_caching_mode(iommu->cap))
> + return;

is it more accurate to check against vcmd capability?

> +
> + if (!sm_supported(iommu)) {
> + pr_warn("VT-d Scalable Mode not enabled, no PASID
> allocation\n");
> + return;
> + }
> +
> + /*
> +  * Register a custom PASID allocator if we are running in a guest,
> +  * guest PASID must be obtained via virtual command interface.
> +  * There can be multiple vIOMMUs in each guest but only one
> allocator
> +  * is active. All vIOMMU allocators will eventually be calling the same

which one? the first or last?

> +  * host allocator.
> +  */
> + if (ecap_vcs(iommu->ecap) && vccap_pasid(iommu->vccap)) {
> + pr_info("Register custom PASID allocator\n");
> + iommu->pasid_allocator.alloc = intel_ioasid_alloc;
> + iommu->pasid_allocator.free = intel_ioasid_free;
> + iommu->pasid_allocator.pdata = (void *)iommu;
> + if (ioasid_register_allocator(&iommu->pasid_allocator)) {
> + pr_warn("Custom PASID allocator failed, scalable
> mode disabled\n");
> + /*
> +  * Disable scalable mode on this IOMMU if there
> +  * is no custom allocator. Mixing SM capable
> vIOMMU
> +  * and non-SM vIOMMU are not supported.
> +  */
> + intel_iommu_sm = 0;

since you register an allocator for every vIOMMU, means previously
registered allocators should also be unregistered here?

> + }
> + }
> +}
> +#endif
> +
>  static int __init init_dmars(void)
>  {
>   struct dmar_drhd_unit *drhd;
> @@ -3408,6 +3489,9 @@ static int __init init_dmars(void)
>*/
>   for_each_active_iommu(iommu, drhd) {
>   iommu_flush_write_buffer(iommu);
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> + register_pasid_allocator(iommu);
> +#endif
>   iommu_set_root_entry(io
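
For reference, once the custom allocator is registered, the guest-side
allocation path is roughly (a sketch):

	/* any PASID allocation in the guest now routes through VCMD: */
	ioasid_t pasid = ioasid_alloc(sid, PASID_MIN, intel_pasid_max_id - 1, NULL);
	/* -> intel_ioasid_alloc() (the registered custom allocator)
	 *    -> vcmd_alloc_pasid(): writes DMAR_VCMD_REG and polls DMAR_VCRSP_REG;
	 *       the register write traps to the VMM, which allocates from the
	 *       host's system-wide PASID namespace and returns the result. */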

RE: [PATCH v2 1/3] iommu/uapi: Define uapi version and capabilities

2020-03-29 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Saturday, March 28, 2020 7:54 AM
> 
> On Fri, 27 Mar 2020 00:47:02 -0700
> Christoph Hellwig  wrote:
> 
> > On Fri, Mar 27, 2020 at 02:49:55AM +, Tian, Kevin wrote:
> > > If those API calls are inter-dependent for composing a feature
> > > (e.g. SVA), shouldn't we need a way to check them together before
> > > exposing the feature to the guest, e.g. through a
> > > iommu_get_uapi_capabilities interface?
> >
> > Yes, that makes sense.  The important bit is to have a capability
> > flags and not version numbers.
> 
> The challenge is that there are two consumers in the kernel for this.
> 1. VFIO only look for compatibility, and size of each data struct such
> that it can copy_from_user.
> 
> 2. IOMMU driver, the "real consumer" of the content.
> 
> For 2, I agree and we do plan to use the capability flags to check
> content and maintain backward compatibility etc.
> 
> For VFIO, it is difficult to do size look up based on capability flags.
> 

Can you elaborate on the difficulty in VFIO? If, as Christoph Hellwig
pointed out, version numbers are already avoided everywhere, it is
interesting to know whether this work becomes a real exception
or just requires a different mindset.

btw the most relevant discussion which I can find out now is here:
https://lkml.org/lkml/2020/2/3/1126

It mentioned 3 options for handling extension:
--
1. Disallow adding new members to each structure other than reuse
padding bits or adding union members at the end.
2. Allow extension of the structures beyond union, but union size has
to be fixed with reserved spaces
3. Adopt VFIO argsz scheme, I don't think we need version for each
struct anymore. argsz implies the version that user is using assuming
UAPI data is extension only.
--

the first two are both version-based. It looks like most folks agreed with
option-1 (in this v2), but Alex didn't give his opinion at the moment.
The last response from him was raising option-3, using argsz to
avoid versions. So, we also need to hear from him. Alex?
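
For completeness, option-3 in kernel code follows today's VFIO pattern,
roughly (a sketch; the struct and flag names here are placeholders):

	minsz = offsetofend(struct vfio_foo, last_v1_field);
	if (copy_from_user(&hdr, (void __user *)arg, minsz))
		return -EFAULT;
	if (hdr.argsz < minsz || (hdr.flags & ~VFIO_FOO_KNOWN_FLAGS))
		return -EINVAL;
	/* fields beyond minsz are consumed only if hdr.argsz covers them */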

Thanks
Kevin


RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Sunday, March 22, 2020 8:32 PM
> 
> From: Liu Yi L 
> 
> For a long time, devices have only one DMA address space from platform
> IOMMU's point of view. This is true for both bare metal and directed-
> access in virtualization environment. Reason is the source ID of DMA in
> PCIe are BDF (bus/dev/fnc ID), which results in only device granularity

are->is

> DMA isolation. However, this is changing with the latest advancement in
> I/O technology area. More and more platform vendors are utilizing the PCIe
> PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> address spaces as identified by their individual PASIDs. For example,
> Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> let device access multiple process virtual address space by binding the

"address space" -> "address spaces"

"binding the" -> "binding each"

> virtual address space with a PASID. Wherein the PASID is allocated in
> software and programmed to device per device specific manner. Devices
> which support PASID capability are called PASID-capable devices. If such
> devices are passed through to VMs, guest software are also able to bind
> guest process virtual address space on such devices. Therefore, the guest
> software could reuse the bare metal software programming model, which
> means guest software will also allocate PASID and program it to device
> directly. This is a dangerous situation since it has potential PASID
> conflicts and unauthorized address space access. It would be safer to
> let host intercept in the guest software's PASID allocation. Thus PASID
> are managed system-wide.
> 
> This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to
> passdown
> PASID allocation/free request from the virtual IOMMU. Additionally, such

"Additionally, because such"

> requests are intended to be invoked by QEMU or other applications which

simplify to "intended to be invoked from userspace"

> are running in userspace, it is necessary to have a mechanism to prevent
> single application from abusing available PASIDs in system. With such
> consideration, this patch tracks the VFIO PASID allocation per-VM. There
> was a discussion to make quota to be per assigned devices. e.g. if a VM
> has many assigned devices, then it should have more quota. However, it
> is not sure how many PASIDs an assigned devices will use. e.g. it is

devices -> device

> possible that a VM with multiples assigned devices but requests less
> PASIDs. Therefore per-VM quota would be better.
> 
> This patch uses struct mm pointer as a per-VM token. We also considered
> using task structure pointer and vfio_iommu structure pointer. However,
> task structure is per-thread, which means it cannot achieve per-VM PASID
> alloc tracking purpose. While for vfio_iommu structure, it is visible
> only within vfio. Therefore, structure mm pointer is selected. This patch
> adds a structure vfio_mm. A vfio_mm is created when the first vfio
> container is opened by a VM. On the reverse order, vfio_mm is free when
> the last vfio container is released. Each VM is assigned with a PASID
> quota, so that it is not able to request PASID beyond its quota. This
> patch adds a default quota of 1000. This quota could be tuned by
> administrator. Making PASID quota tunable will be added in another patch
> in this series.
> 
> Previous discussions:
> https://patchwork.kernel.org/patch/11209429/
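
(For orientation, the userspace side would look roughly like the sketch
below; the exact uapi struct is defined later in this patch, and the field
and flag names here are assumed.)

	struct vfio_iommu_type1_pasid_request req = {
		.argsz = sizeof(req),
		.flags = VFIO_IOMMU_PASID_ALLOC,
		.alloc_pasid.min = 1,
		.alloc_pasid.max = 1024,	/* bounded by the per-VM quota */
	};
	int pasid = ioctl(container_fd, VFIO_IOMMU_PASID_REQUEST, &req);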
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Yi Sun 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/vfio/vfio.c | 130
> 
>  drivers/vfio/vfio_iommu_type1.c | 104
> 
>  include/linux/vfio.h|  20 +++
>  include/uapi/linux/vfio.h   |  41 +
>  4 files changed, 295 insertions(+)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index c848262..d13b483 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -32,6 +32,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #define DRIVER_VERSION   "0.3"
>  #define DRIVER_AUTHOR"Alex Williamson
> "
> @@ -46,6 +47,8 @@ static struct vfio {
>   struct mutexgroup_lock;
>   struct cdev group_cdev;
>   dev_t   group_devt;
> + struct list_headvfio_mm_list;
> + struct mutexvfio_mm_lock;
>   wait_queue_head_t   release_q;
>  } vfio;
> 
> @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device *dev,
> enum vfio_notify_type type,
>  EXPORT_SYMBOL(vfio_unregister_notifier);
> 
>  /**
> + * VFIO_MM objects - create, release, get, put, search

why capitalizing vfio_mm?

> + * Caller of the function should have held vfio.vfio_mm_lock.
> + */
> +static struct vfio_mm *vfio_create_mm(struct mm_stru

RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Sunday, March 22, 2020 8:32 PM
> 
> From: Liu Yi L 
> 
> This patch adds a module option to make the PASID quota tunable by
> administrator.
> 
> TODO: needs to think more on how to  make the tuning to be per-process.
> 
> Previous discussions:
> https://patchwork.kernel.org/patch/11209429/
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/vfio.c | 8 +++-
>  drivers/vfio/vfio_iommu_type1.c | 7 ++-
>  include/linux/vfio.h| 3 ++-
>  3 files changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index d13b483..020a792 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -2217,13 +2217,19 @@ struct vfio_mm *vfio_mm_get_from_task(struct
> task_struct *task)
>  }
>  EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> 
> -int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int quota, int min, int max)
>  {
>   ioasid_t pasid;
>   int ret = -ENOSPC;
> 
>   mutex_lock(&vmm->pasid_lock);
> 
> + /* update quota as it is tunable by admin */
> + if (vmm->pasid_quota != quota) {
> + vmm->pasid_quota = quota;
> + ioasid_adjust_set(vmm->ioasid_sid, quota);
> + }
> +

It's a bit weird to have the quota adjusted in the alloc path, since the
latter might be initiated by non-privileged users. Why not do the simple
math in vfio_create_mm to set the quota when the ioasid set is created?
Even if in the future you allow per-process quota setting, that should
come from a separate privileged path instead of through alloc...
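
i.e. roughly (a sketch):

	/* in vfio_create_mm(), read the module parameter once: */
	vmm->pasid_quota = pasid_quota;
	ret = ioasid_alloc_set(token, pasid_quota, &vmm->ioasid_sid);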

>   pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
>   if (pasid == INVALID_IOASID) {
>   ret = -ENOSPC;
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index 331ceee..e40afc0 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -60,6 +60,11 @@ module_param_named(dma_entry_limit,
> dma_entry_limit, uint, 0644);
>  MODULE_PARM_DESC(dma_entry_limit,
>"Maximum number of user DMA mappings per container
> (65535).");
> 
> +static int pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> +module_param_named(pasid_quota, pasid_quota, uint, 0644);
> +MODULE_PARM_DESC(pasid_quota,
> +  "Quota of user owned PASIDs per vfio-based application
> (1000).");
> +
>  struct vfio_iommu {
>   struct list_headdomain_list;
>   struct list_headiova_list;
> @@ -2200,7 +2205,7 @@ static int vfio_iommu_type1_pasid_alloc(struct
> vfio_iommu *iommu,
>   goto out_unlock;
>   }
>   if (vmm)
> - ret = vfio_mm_pasid_alloc(vmm, min, max);
> + ret = vfio_mm_pasid_alloc(vmm, pasid_quota, min, max);
>   else
>   ret = -EINVAL;
>  out_unlock:
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 75f9f7f1..af2ef78 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -106,7 +106,8 @@ struct vfio_mm {
> 
>  extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
>  extern void vfio_mm_put(struct vfio_mm *vmm);
> -extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> +extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm,
> + int quota, int min, int max);
>  extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
> 
>  /*
> --
> 2.7.4



RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Monday, March 30, 2020 4:53 PM
> 
> > From: Tian, Kevin 
> > Sent: Monday, March 30, 2020 4:41 PM
> > To: Liu, Yi L ; alex.william...@redhat.com;
> > Subject: RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter
> for quota
> > tuning
> >
> > > From: Liu, Yi L 
> > > Sent: Sunday, March 22, 2020 8:32 PM
> > >
> > > From: Liu Yi L 
> > >
> > > This patch adds a module option to make the PASID quota tunable by
> > > administrator.
> > >
> > > TODO: needs to think more on how to  make the tuning to be per-process.
> > >
> > > Previous discussions:
> > > https://patchwork.kernel.org/patch/11209429/
> > >
> > > Cc: Kevin Tian 
> > > CC: Jacob Pan 
> > > Cc: Alex Williamson 
> > > Cc: Eric Auger 
> > > Cc: Jean-Philippe Brucker 
> > > Signed-off-by: Liu Yi L 
> > > ---
> > >  drivers/vfio/vfio.c | 8 +++-
> > >  drivers/vfio/vfio_iommu_type1.c | 7 ++-
> > >  include/linux/vfio.h| 3 ++-
> > >  3 files changed, 15 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > > index d13b483..020a792 100644
> > > --- a/drivers/vfio/vfio.c
> > > +++ b/drivers/vfio/vfio.c
> > > @@ -2217,13 +2217,19 @@ struct vfio_mm
> *vfio_mm_get_from_task(struct
> > > task_struct *task)
> > >  }
> > >  EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > >
> > > -int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int quota, int min, int
> max)
> > >  {
> > >   ioasid_t pasid;
> > >   int ret = -ENOSPC;
> > >
> > >   mutex_lock(&vmm->pasid_lock);
> > >
> > > + /* update quota as it is tunable by admin */
> > > + if (vmm->pasid_quota != quota) {
> > > + vmm->pasid_quota = quota;
> > > + ioasid_adjust_set(vmm->ioasid_sid, quota);
> > > + }
> > > +
> >
> > It's a bit weird to have quota adjusted in the alloc path, since the latter
> might
> > be initiated by non-privileged users. Why not doing the simple math in
> vfio_
> > create_mm to set the quota when the ioasid set is created? even in the
> future
> > you may allow per-process quota setting, that should come from separate
> > privileged path instead of thru alloc..
> 
> The reason is that a kernel parameter modification has no event which
> can be used to adjust the quota. So I chose to adjust it in the pasid_alloc
> path. If that's not good, how about adding one more IOCTL to let user-
> space trigger a quota adjustment event? Then even though a non-privileged
> user could trigger quota adjustment, the quota is actually controlled
> by the privileged user. What's your opinion?
> 

why do you need an event to adjust? As I said, you can set the quota
when the set is created in vfio_create_mm...

Thanks
Kevin


RE: [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to userspace

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Sunday, March 22, 2020 8:32 PM
> 
> From: Liu Yi L 
> 
> This patch reports PASID alloc/free availability to userspace (e.g. QEMU)
> thus userspace could do a pre-check before utilizing this feature.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 28 
>  include/uapi/linux/vfio.h   |  8 
>  2 files changed, 36 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index e40afc0..ddd1ffe 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2234,6 +2234,30 @@ static int vfio_iommu_type1_pasid_free(struct
> vfio_iommu *iommu,
>   return ret;
>  }
> 
> +static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> +  struct vfio_info_cap *caps)
> +{
> + struct vfio_info_cap_header *header;
> + struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> +
> + header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
> +VFIO_IOMMU_TYPE1_INFO_CAP_NESTING,
> 1);
> + if (IS_ERR(header))
> + return PTR_ERR(header);
> +
> + nesting_cap = container_of(header,
> + struct vfio_iommu_type1_info_cap_nesting,
> + header);
> +
> + nesting_cap->nesting_capabilities = 0;
> + if (iommu->nesting) {

Is it good to report a nesting cap when iommu->nesting is disabled? I suppose
the check should move before vfio_info_cap_add...
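
i.e. (a sketch):

	if (!iommu->nesting)
		return 0;	/* don't advertise the capability at all */

	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);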

> + /* nesting iommu type supports PASID requests (alloc/free)
> */
> + nesting_cap->nesting_capabilities |=
> VFIO_IOMMU_PASID_REQS;

VFIO_IOMMU_CAP_PASID_REQ? to avoid confusion with ioctl cmd
VFIO_IOMMU_PASID_REQUEST...

> + }
> +
> + return 0;
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  unsigned int cmd, unsigned long arg)
>  {
> @@ -2283,6 +2307,10 @@ static long vfio_iommu_type1_ioctl(void
> *iommu_data,
>   if (ret)
>   return ret;
> 
> + ret = vfio_iommu_info_add_nesting_cap(iommu, &caps);
> + if (ret)
> + return ret;
> +
>   if (caps.size) {
>   info.flags |= VFIO_IOMMU_INFO_CAPS;
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 298ac80..8837219 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -748,6 +748,14 @@ struct vfio_iommu_type1_info_cap_iova_range {
>   struct  vfio_iova_range iova_ranges[];
>  };
> 
> +#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING  2
> +
> +struct vfio_iommu_type1_info_cap_nesting {
> + struct  vfio_info_cap_header header;
> +#define VFIO_IOMMU_PASID_REQS(1 << 0)
> + __u32   nesting_capabilities;
> +};
> +
>  #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> 
>  /**
> --
> 2.7.4



RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Monday, March 30, 2020 5:27 PM
> 
> > From: Tian, Kevin 
> > Sent: Monday, March 30, 2020 5:20 PM
> > To: Liu, Yi L ; alex.william...@redhat.com;
> > Subject: RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter
> for quota
> > tuning
> >
> > > From: Liu, Yi L 
> > > Sent: Monday, March 30, 2020 4:53 PM
> > >
> > > > From: Tian, Kevin 
> > > > Sent: Monday, March 30, 2020 4:41 PM
> > > > To: Liu, Yi L ; alex.william...@redhat.com;
> > > > Subject: RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1
> > > > parameter
> > > for quota
> > > > tuning
> > > >
> > > > > From: Liu, Yi L 
> > > > > Sent: Sunday, March 22, 2020 8:32 PM
> > > > >
> > > > > From: Liu Yi L 
> > > > >
> > > > > This patch adds a module option to make the PASID quota tunable by
> > > > > administrator.
> > > > >
> > > > > TODO: needs to think more on how to  make the tuning to be per-
> process.
> > > > >
> > > > > Previous discussions:
> > > > > https://patchwork.kernel.org/patch/11209429/
> > > > >
> > > > > Cc: Kevin Tian 
> > > > > CC: Jacob Pan 
> > > > > Cc: Alex Williamson 
> > > > > Cc: Eric Auger 
> > > > > Cc: Jean-Philippe Brucker 
> > > > > Signed-off-by: Liu Yi L 
> > > > > ---
> > > > >  drivers/vfio/vfio.c | 8 +++-
> > > > >  drivers/vfio/vfio_iommu_type1.c | 7 ++-
> > > > >  include/linux/vfio.h| 3 ++-
> > > > >  3 files changed, 15 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index
> > > > > d13b483..020a792 100644
> > > > > --- a/drivers/vfio/vfio.c
> > > > > +++ b/drivers/vfio/vfio.c
> > > > > @@ -2217,13 +2217,19 @@ struct vfio_mm
> > > *vfio_mm_get_from_task(struct
> > > > > task_struct *task)
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > > > >
> > > > > -int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > > > > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int quota, int min,
> > > > > +int
> > > max)
> > > > >  {
> > > > >   ioasid_t pasid;
> > > > >   int ret = -ENOSPC;
> > > > >
> > > > >   mutex_lock(&vmm->pasid_lock);
> > > > >
> > > > > + /* update quota as it is tunable by admin */
> > > > > + if (vmm->pasid_quota != quota) {
> > > > > + vmm->pasid_quota = quota;
> > > > > + ioasid_adjust_set(vmm->ioasid_sid, quota);
> > > > > + }
> > > > > +
> > > >
> > > > It's a bit weird to have quota adjusted in the alloc path, since the
> > > > latter
> > > might
> > > > be initiated by non-privileged users. Why not doing the simple math
> > > > in
> > > vfio_
> > > > create_mm to set the quota when the ioasid set is created? even in
> > > > the
> > > future
> > > > you may allow per-process quota setting, that should come from
> > > > separate privileged path instead of thru alloc..
> > >
> > > The reason is the kernel parameter modification has no event which can
> > > be used to adjust the quota. So I chose to adjust it in pasid_alloc
> > > path. If it's not good, how about adding one more IOCTL to let user-
> > > space trigger a quota adjustment event? Then even non-privileged user
> > > could trigger quota adjustment, the quota is actually controlled by
> > > privileged user. How about your opinion?
> > >
> >
> > why do you need an event to adjust? As I said, you can set the quota when
> the set is
> > created in vfio_create_mm...
> 
> oh, it's to support runtime adjustments. I guess it may be helpful to keep
> the per-VM quota tunable even while the VM is running. If we just set the
> quota in vfio_create_mm(), it cannot be adjusted at runtime.
> 

ok, I didn't notice the module parameter was granted write permission.
However there is a further problem. We cannot support PASID reclaim now.
What if the admin sets a quota smaller than the previous value while some
IOASID sets already exceed the new quota? I'm not sure how to fail a runtime
module parameter change due to that situation. Possibly a normal sysfs
node better suits the runtime change requirement...
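
(fwiw, whichever knob is chosen needs a setter that can fail; e.g. with
module_param_cb() it would be roughly the sketch below, where
any_ioasid_set_above() is a made-up helper that walks the existing sets:)

	static int pasid_quota_set(const char *val, const struct kernel_param *kp)
	{
		unsigned int q;
		int ret = kstrtouint(val, 0, &q);

		if (ret)
			return ret;
		if (any_ioasid_set_above(q))
			return -EBUSY;	/* reject shrinking below current usage */
		return param_set_uint(val, kp);
	}

	static const struct kernel_param_ops pasid_quota_ops = {
		.set = pasid_quota_set,
		.get = param_get_uint,
	};
	module_param_cb(pasid_quota, &pasid_quota_ops, &pasid_quota, 0644);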

Thanks
Kevin


RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Sunday, March 22, 2020 8:32 PM
> 
> From: Liu Yi L 
> 
> VFIO exposes IOMMU nesting translation (a.k.a dual stage translation)
> capability to userspace. Thus applications like QEMU could support
> vIOMMU with hardware's nesting translation capability for pass-through
> devices. Before setting up nesting translation for pass-through devices,
> QEMU and other applications need to learn the supported 1st-lvl/stage-1
> translation structure format like page table format.
> 
> Take vSVA (virtual Shared Virtual Addressing) as an example, to support
> vSVA for pass-through devices, QEMU sets up nesting translation for pass-
> through devices. The guest page tables are configured in the host as the
> 1st-lvl/stage-1 page tables. Therefore, the guest format should be
> compatible with the host side.
> 
> This patch reports the supported 1st-lvl/stage-1 page table format on the
> current platform to userspace. QEMU and other alike applications should
> use this format info when trying to setup IOMMU nesting translation on
> host IOMMU.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 56 +
>  include/uapi/linux/vfio.h   |  1 +
>  2 files changed, 57 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index 9aa2a67..82a9e0b 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2234,11 +2234,66 @@ static int vfio_iommu_type1_pasid_free(struct
> vfio_iommu *iommu,
>   return ret;
>  }
> 
> +static int vfio_iommu_get_stage1_format(struct vfio_iommu *iommu,
> +  u32 *stage1_format)
> +{
> + struct vfio_domain *domain;
> + u32 format = 0, tmp_format = 0;
> + int ret;
> +
> + mutex_lock(&iommu->lock);
> + if (list_empty(&iommu->domain_list)) {
> + mutex_unlock(&iommu->lock);
> + return -EINVAL;
> + }
> +
> + list_for_each_entry(domain, &iommu->domain_list, next) {
> + if (iommu_domain_get_attr(domain->domain,
> + DOMAIN_ATTR_PASID_FORMAT, &format)) {
> + ret = -EINVAL;
> + format = 0;
> + goto out_unlock;
> + }
> + /*
> +  * format is always non-zero (the first format is
> +  * IOMMU_PASID_FORMAT_INTEL_VTD which is 1). For
> +  * the reason of potential different backed IOMMU
> +  * formats, here we expect to have identical formats
> +  * in the domain list, no mixed formats support.
> +  * return -EINVAL to fail the attempt of setup
> +  * VFIO_TYPE1_NESTING_IOMMU if non-identical formats
> +  * are detected.
> +  */
> + if (tmp_format && tmp_format != format) {
> + ret = -EINVAL;
> + format = 0;
> + goto out_unlock;
> + }
> +
> + tmp_format = format;
> + }

this path is invoked only in the VFIO_IOMMU_GET_INFO path. If we don't
want to assume the status quo that one container holds only one
device w/ vIOMMU (the prerequisite for vSVA), it looks like we also need
to check the format compatibility when attaching a new group to this
container?
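
i.e. (a sketch; caching stage1_format in vfio_iommu is assumed):

	/* when attaching a group to a VFIO_TYPE1_NESTING_IOMMU container: */
	if (iommu_domain_get_attr(domain, DOMAIN_ATTR_PASID_FORMAT, &fmt))
		return -EINVAL;
	if (iommu->stage1_format && fmt != iommu->stage1_format)
		return -EINVAL;		/* no mixed stage-1 formats */
	iommu->stage1_format = fmt;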

> + ret = 0;
> +
> +out_unlock:
> + if (format)
> + *stage1_format = format;
> + mutex_unlock(&iommu->lock);
> + return ret;
> +}
> +
>  static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
>struct vfio_info_cap *caps)
>  {
>   struct vfio_info_cap_header *header;
>   struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> + u32 formats = 0;
> + int ret;
> +
> + ret = vfio_iommu_get_stage1_format(iommu, &formats);
> + if (ret) {
> + pr_warn("Failed to get stage-1 format\n");
> + return ret;
> + }
> 
>   header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
>  VFIO_IOMMU_TYPE1_INFO_CAP_NESTING,
> 1);
> @@ -2254,6 +2309,7 @@ static int vfio_iommu_info_add_nesting_cap(struct
> vfio_iommu *iommu,
>   /* nesting iommu type supports PASID requests (alloc/free)
> */
>   nesting_cap->nesting_capabilities |=
> VFIO_IOMMU_PASID_REQS;
>   }
> + nesting_cap->stage1_formats = formats;
> 
>   return 0;
>  }
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ed9881d..ebeaf3e 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -763,6 +763,7 @@ struct vfio_iommu_type1_info_cap_nesting {
>   struct  vfio_info_cap_header header;
>  #define VFIO_IOMMU_PASID_REQS(1 << 0)
>   __u32   nesting_capabilities;
> + __u32   stage1_formats;

do you plan to support multiple fo

RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Sunday, March 22, 2020 8:32 PM
> 
> From: Liu Yi L 
> 
> VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by
> hardware
> IOMMUs that have nesting DMA translation (a.k.a dual stage address
> translation). For such hardware IOMMUs, there are two stages/levels of
> address translation, and software may let userspace/VM to own the first-
> level/stage-1 translation structures. Example of such usage is vSVA (
> virtual Shared Virtual Addressing). VM owns the first-level/stage-1
> translation structures and bind the structures to host, then hardware
> IOMMU would utilize nesting translation when doing DMA translation fo
> the devices behind such hardware IOMMU.
> 
> This patch adds vfio support for binding guest translation (a.k.a stage 1)
> structure to host iommu. And for VFIO_TYPE1_NESTING_IOMMU, not only
> bind
> guest page table is needed, it also requires to expose interface to guest
> for iommu cache invalidation when guest modified the first-level/stage-1
> translation structures since hardware needs to be notified to flush stale
> iotlbs. This would be introduced in next patch.
> 
> In this patch, guest page table bind and unbind are done by using flags
> VFIO_IOMMU_BIND_GUEST_PGTBL and
> VFIO_IOMMU_UNBIND_GUEST_PGTBL under IOCTL
> VFIO_IOMMU_BIND, the bind/unbind data are conveyed by
> struct iommu_gpasid_bind_data. Before binding guest page table to host,
> VM should have got a PASID allocated by host via
> VFIO_IOMMU_PASID_REQUEST.
> 
> Bind guest translation structures (here is guest page table) to host

Bind -> Binding

> are the first step to setup vSVA (Virtual Shared Virtual Addressing).

are -> is. and you already explained vSVA earlier.

> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 158
> 
>  include/uapi/linux/vfio.h   |  46 
>  2 files changed, 204 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index 82a9e0b..a877747 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -130,6 +130,33 @@ struct vfio_regions {
>  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)  \
>   (!list_empty(&iommu->domain_list))
> 
> +struct domain_capsule {
> + struct iommu_domain *domain;
> + void *data;
> +};
> +
> +/* iommu->lock must be held */
> +static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
> +   int (*fn)(struct device *dev, void *data),
> +   void *data)
> +{
> + struct domain_capsule dc = {.data = data};
> + struct vfio_domain *d;
> + struct vfio_group *g;
> + int ret = 0;
> +
> + list_for_each_entry(d, &iommu->domain_list, next) {
> + dc.domain = d->domain;
> + list_for_each_entry(g, &d->group_list, next) {
> + ret = iommu_group_for_each_dev(g->iommu_group,
> +&dc, fn);
> + if (ret)
> + break;
> + }
> + }
> + return ret;
> +}
> +
>  static int put_pfn(unsigned long pfn, int prot);
> 
>  /*
> @@ -2314,6 +2341,88 @@ static int
> vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
>   return 0;
>  }
> 
> +static int vfio_bind_gpasid_fn(struct device *dev, void *data)
> +{
> + struct domain_capsule *dc = (struct domain_capsule *)data;
> + struct iommu_gpasid_bind_data *gbind_data =
> + (struct iommu_gpasid_bind_data *) dc->data;
> +

In Jacob's vSVA iommu series, [PATCH 06/11]:

+   /* REVISIT: upper layer/VFIO can track host process that bind 
the PASID.
+* ioasid_set = mm might be sufficient for vfio to check pasid 
VMM
+* ownership.
+*/

I asked him who exactly should be responsible for tracking the pasid
ownership. Although there is no response yet, I expect vfio/iommu to have
a clear policy, documented here as well, to provide a consistent
message.

> + return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
> +}
> +
> +static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> +{
> + struct domain_capsule *dc = (struct domain_capsule *)data;
> + struct iommu_gpasid_bind_data *gbind_data =
> + (struct iommu_gpasid_bind_data *) dc->data;
> +
> + return iommu_sva_unbind_gpasid(dc->domain, dev,
> + gbind_data->hpasid);

curious why we have to share the same bind_data structure
between bind and unbind, especially when unbind requires
only one field? I don't see a clear reason, and it is just like the
earlier ALLOC/FREE, which don't share a structure either.
The current way simply wastes space for the unbind operation...
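
e.g. a dedicated unbind payload could be as small as (a sketch, made-up
struct):

	struct iommu_gpasid_unbind_data {
		__u32	version;
		__u32	flags;
		__u64	hpasid;
	};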


RE: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Sunday, March 22, 2020 8:32 PM
> 
> From: Liu Yi L 
> 
> For VFIO IOMMUs with the type VFIO_TYPE1_NESTING_IOMMU, guest
> "owns" the
> first-level/stage-1 translation structures, the host IOMMU driver has no
> knowledge of first-level/stage-1 structure cache updates unless the guest
> invalidation requests are trapped and propagated to the host.
> 
> This patch adds a new IOCTL VFIO_IOMMU_CACHE_INVALIDATE to
> propagate guest
> first-level/stage-1 IOMMU cache invalidations to host to ensure IOMMU
> cache
> correctness.
> 
> With this patch, vSVA (Virtual Shared Virtual Addressing) can be used safely
> as host IOMMU iotlb correctness is ensured.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Eric Auger 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 49 +
>  include/uapi/linux/vfio.h   | 22 ++
>  2 files changed, 71 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index a877747..937ec3f 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2423,6 +2423,15 @@ static long
> vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
>   return ret;
>  }
> 
> +static int vfio_cache_inv_fn(struct device *dev, void *data)

vfio_iommu_cache_inv_fn

> +{
> + struct domain_capsule *dc = (struct domain_capsule *)data;
> + struct iommu_cache_invalidate_info *cache_inv_info =
> + (struct iommu_cache_invalidate_info *) dc->data;
> +
> + return iommu_cache_invalidate(dc->domain, dev, cache_inv_info);
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  unsigned int cmd, unsigned long arg)
>  {
> @@ -2629,6 +2638,46 @@ static long vfio_iommu_type1_ioctl(void
> *iommu_data,
>   }
>   kfree(gbind_data);
>   return ret;
> + } else if (cmd == VFIO_IOMMU_CACHE_INVALIDATE) {
> + struct vfio_iommu_type1_cache_invalidate cache_inv;
> + u32 version;
> + int info_size;
> + void *cache_info;
> + int ret;
> +
> + minsz = offsetofend(struct
> vfio_iommu_type1_cache_invalidate,
> + flags);
> +
> + if (copy_from_user(&cache_inv, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (cache_inv.argsz < minsz || cache_inv.flags)
> + return -EINVAL;
> +
> + /* Get the version of struct iommu_cache_invalidate_info */
> + if (copy_from_user(&version,
> + (void __user *) (arg + minsz), sizeof(version)))
> + return -EFAULT;
> +
> + info_size = iommu_uapi_get_data_size(
> + IOMMU_UAPI_CACHE_INVAL,
> version);
> +
> + cache_info = kzalloc(info_size, GFP_KERNEL);
> + if (!cache_info)
> + return -ENOMEM;
> +
> + if (copy_from_user(cache_info,
> + (void __user *) (arg + minsz), info_size)) {
> + kfree(cache_info);
> + return -EFAULT;
> + }
> +
> + mutex_lock(&iommu->lock);
> + ret = vfio_iommu_for_each_dev(iommu, vfio_cache_inv_fn,
> + cache_info);
> + mutex_unlock(&iommu->lock);
> + kfree(cache_info);
> + return ret;
>   }
> 
>   return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 2235bc6..62ca791 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -899,6 +899,28 @@ struct vfio_iommu_type1_bind {
>   */
>  #define VFIO_IOMMU_BIND  _IO(VFIO_TYPE, VFIO_BASE + 23)
> 
> +/**
> + * VFIO_IOMMU_CACHE_INVALIDATE - _IOW(VFIO_TYPE, VFIO_BASE + 24,
> + *   struct vfio_iommu_type1_cache_invalidate)
> + *
> + * Propagate guest IOMMU cache invalidation to the host. The cache
> + * invalidation information is conveyed by @cache_info, the content
> + * format follows the structures defined in uapi/linux/iommu.h. Users
> + * should be aware that struct iommu_cache_invalidate_info has a
> + * @version field; vfio needs to parse this field before copying the
> + * data from userspace.
> + *
> + * This IOCTL is only available after VFIO_SET_IOMMU.
> + *
> + * returns: 0 on success, -errno on failure.
> + */
> +struct vfio_iommu_type1_cache_invalidate {
> + __u32   argsz;
> + __u32   flags;
> + struct  iommu_cache_invalidate_info cache_info;
> +};
> +#define VFIO_IOMMU_CACHE_INVALIDATE  _IO(VFIO_TYPE, VFIO_BASE + 24)
> +
>  /*  Additional API for SPAPR TCE (Server POWERPC) IOMMU  */
> 
>  /*
> --
> 2.7.4

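For reference, a minimal userspace sketch of driving the new ioctl on an
already-initialized container fd. The cache_info field names follow the
iommu uapi proposed alongside this series and may differ in the final
version:

	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/iommu.h>
	#include <linux/vfio.h>

	static int flush_guest_iotlb_pasid(int container, __u32 pasid)
	{
		struct vfio_iommu_type1_cache_invalidate inv;

		memset(&inv, 0, sizeof(inv));
		inv.argsz = sizeof(inv);
		inv.flags = 0;
		/* embedded iommu uapi struct, names per this series */
		inv.cache_info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
		inv.cache_info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
		inv.cache_info.granularity = IOMMU_INV_GRANU_PASID;
		inv.cache_info.granu.pasid_info.pasid = pasid;
		inv.cache_info.granu.pasid_info.flags = IOMMU_INV_PASID_FLAGS_PASID;

		return ioctl(container, VFIO_IOMMU_CACHE_INVALIDATE, &inv);
	}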

RE: [PATCH v1 8/8] vfio/type1: Add vSVA support for IOMMU-backed mdevs

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Sunday, March 22, 2020 8:32 PM
> 
> From: Liu Yi L 
> 
> Recent years, mediated device pass-through framework (e.g. vfio-mdev)
> are used to achieve flexible device sharing across domains (e.g. VMs).

are->is

> Also there are hardware assisted mediated pass-through solutions from
> platform vendors. e.g. Intel VT-d scalable mode which supports Intel
> Scalable I/O Virtualization technology. Such mdevs are called IOMMU-
> backed mdevs as there are IOMMU enforced DMA isolation for such mdevs.
> In kernel, IOMMU-backed mdevs are exposed to IOMMU layer by aux-
> domain
> concept, which means mdevs are protected by an iommu domain which is
> aux-domain of its physical device. Details can be found in the KVM

"by an iommu domain which is auxiliary to the domain that the kernel
driver primarily uses for DMA API"

> presentation from Kevin Tian. IOMMU-backed equals to IOMMU-capable.
> 
> https://events19.linuxfoundation.org/wp-content/uploads/2017/12/\
> Hardware-Assisted-Mediated-Pass-Through-with-VFIO-Kevin-Tian-Intel.pdf
> 
> This patch supports NESTING IOMMU for IOMMU-backed mdevs by figuring
> out the physical device of an IOMMU-backed mdev and then invoking
> IOMMU
> requests to IOMMU layer with the physical device and the mdev's aux
> domain info.

"and then calling into the IOMMU layer to complete the vSVA operations
on the aux domain associated with that mdev"

> 
> With this patch, vSVA (Virtual Shared Virtual Addressing) can be used
> on IOMMU-backed mdevs.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> CC: Jun Tian 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 23 ---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index 937ec3f..d473665 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -132,6 +132,7 @@ struct vfio_regions {
> 
>  struct domain_capsule {
>   struct iommu_domain *domain;
> + struct vfio_group *group;
>   void *data;
>  };
> 
> @@ -148,6 +149,7 @@ static int vfio_iommu_for_each_dev(struct
> vfio_iommu *iommu,
>   list_for_each_entry(d, &iommu->domain_list, next) {
>   dc.domain = d->domain;
>   list_for_each_entry(g, &d->group_list, next) {
> + dc.group = g;
>   ret = iommu_group_for_each_dev(g->iommu_group,
>  &dc, fn);
>   if (ret)
> @@ -2347,7 +2349,12 @@ static int vfio_bind_gpasid_fn(struct device *dev,
> void *data)
>   struct iommu_gpasid_bind_data *gbind_data =
>   (struct iommu_gpasid_bind_data *) dc->data;
> 
> - return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
> + if (dc->group->mdev_group)
> + return iommu_sva_bind_gpasid(dc->domain,
> + vfio_mdev_get_iommu_device(dev), gbind_data);
> + else
> + return iommu_sva_bind_gpasid(dc->domain,
> + dev, gbind_data);
>  }
> 
>  static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> @@ -2356,8 +2363,13 @@ static int vfio_unbind_gpasid_fn(struct device
> *dev, void *data)
>   struct iommu_gpasid_bind_data *gbind_data =
>   (struct iommu_gpasid_bind_data *) dc->data;
> 
> - return iommu_sva_unbind_gpasid(dc->domain, dev,
> + if (dc->group->mdev_group)
> + return iommu_sva_unbind_gpasid(dc->domain,
> + vfio_mdev_get_iommu_device(dev),
>   gbind_data->hpasid);
> + else
> + return iommu_sva_unbind_gpasid(dc->domain, dev,
> + gbind_data->hpasid);
>  }
> 
>  /**
> @@ -2429,7 +2441,12 @@ static int vfio_cache_inv_fn(struct device *dev,
> void *data)
>   struct iommu_cache_invalidate_info *cache_inv_info =
>   (struct iommu_cache_invalidate_info *) dc->data;
> 
> - return iommu_cache_invalidate(dc->domain, dev, cache_inv_info);
> + if (dc->group->mdev_group)
> + return iommu_cache_invalidate(dc->domain,
> + vfio_mdev_get_iommu_device(dev), cache_inv_info);
> + else
> + return iommu_cache_invalidate(dc->domain,
> + dev, cache_inv_info);
>  }

possibly above could be simplified, e.g. 

static struct device *vfio_get_iommu_device(struct vfio_group *group, 
struct device *dev)
{
if  (group->mdev_group)
return vfio_mdev_get_iommu_device(dev);
else
return dev;
}

Then use it to replace plain 'dev' in all three places.
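e.g. vfio_bind_gpasid_fn() would then collapse to:

	static int vfio_bind_gpasid_fn(struct device *dev, void *data)
	{
		struct domain_capsule *dc = (struct domain_capsule *)data;
		struct iommu_gpasid_bind_data *gbind_data =
			(struct iommu_gpasid_bind_data *)dc->data;

		/* for an mdev group the helper substitutes the parent device */
		return iommu_sva_bind_gpasid(dc->domain,
				vfio_get_iommu_device(dc->group, dev),
				gbind_data);
	}

and similarly for the unbind and cache-invalidate callbacks.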

> 
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
> --
> 2.7.4


RE: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function

2020-03-30 Thread Tian, Kevin
> From: Auger Eric 
> Sent: Sunday, March 29, 2020 11:34 PM
> 
> Hi,
> 
> On 3/28/20 11:01 AM, Tian, Kevin wrote:
> >> From: Jacob Pan 
> >> Sent: Saturday, March 21, 2020 7:28 AM
> >>
> >> When Shared Virtual Address (SVA) is enabled for a guest OS via
> >> vIOMMU, we need to provide invalidation support at IOMMU API and
> driver
> >> level. This patch adds Intel VT-d specific function to implement
> >> iommu passdown invalidate API for shared virtual address.
> >>
> >> The use case is for supporting caching structure invalidation
> >> of assigned SVM capable devices. Emulated IOMMU exposes queue
> >
> > emulated IOMMU -> vIOMMU, since virito-iommu could use the
> > interface as well.
> >
> >> invalidation capability and passes down all descriptors from the guest
> >> to the physical IOMMU.
> >>
> >> The assumption is that guest to host device ID mapping should be
> >> resolved prior to calling IOMMU driver. Based on the device handle,
> >> host IOMMU driver can replace certain fields before submit to the
> >> invalidation queue.
> >>
> >> ---
> >> v7 review fixed in v10
> >> ---
> >>
> >> Signed-off-by: Jacob Pan 
> >> Signed-off-by: Ashok Raj 
> >> Signed-off-by: Liu, Yi L 
> >> ---
> >>  drivers/iommu/intel-iommu.c | 182
> >> 
> >>  1 file changed, 182 insertions(+)
> >>
> >> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> >> index b1477cd423dd..a76afb0fd51a 100644
> >> --- a/drivers/iommu/intel-iommu.c
> >> +++ b/drivers/iommu/intel-iommu.c
> >> @@ -5619,6 +5619,187 @@ static void
> >> intel_iommu_aux_detach_device(struct iommu_domain *domain,
> >>aux_domain_remove_dev(to_dmar_domain(domain), dev);
> >>  }
> >>
> >> +/*
> >> + * 2D array for converting and sanitizing IOMMU generic TLB granularity
> to
> >> + * VT-d granularity. Invalidation is typically included in the unmap
> operation
> >> + * as a result of DMA or VFIO unmap. However, for assigned devices
> guest
> >> + * owns the first level page tables. Invalidations of translation caches 
> >> in
> the
> >> + * guest are trapped and passed down to the host.
> >> + *
> >> + * vIOMMU in the guest will only expose first level page tables, therefore
> >> + * we do not include IOTLB granularity for request without PASID (second
> >> level).
> >
> > I would revise above as "We do not support IOTLB granularity for request
> > without PASID (second level), therefore any vIOMMU implementation that
> > exposes the SVA capability to the guest should only expose the first level
> > page tables, implying all invalidation requests from the guest will include
> > a valid PASID"
> >
> >> + *
> >> + * For example, to find the VT-d granularity encoding for IOTLB
> >> + * type and page selective granularity within PASID:
> >> + * X: indexed by iommu cache type
> >> + * Y: indexed by enum iommu_inv_granularity
> >> + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> >> + *
> >> + * Granu_map array indicates validity of the table. 1: valid, 0: invalid
> >> + *
> >> + */
> >> +const static int
> >> inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> >> +  /*
> >> +   * PASID based IOTLB invalidation: PASID selective (per PASID),
> >> +   * page selective (address granularity)
> >> +   */
> >> +  {0, 1, 1},
> >> +  /* PASID based dev TLBs, only support all PASIDs or single PASID */
> >> +  {1, 1, 0},
> >
> > Is this combination correct? when single PASID is being specified, it is
> > essentially a page-selective invalidation since you need provide Address
> > and Size.
> Isn't it the same when G=1? Still the addr/size is used. Doesn't it

I thought addr/size is not used when G=1, but I might be wrong. I'm
checking with our VT-d spec owner.

> correspond to IOMMU_INV_GRANU_ADDR with
> IOMMU_INV_ADDR_FLAGS_PASID flag
> unset?
> 
> so {0, 0, 1}?

I have one more open:

How does userspace know which invalidation type/gran is supported?
I didn't see such capability reporting in Yi's VFIO vSVA patch set. Do we
want the user/kernel to assume the same capability set if it is
architectural? However, the kernel could also do some optimization,
e.g. hide the devtlb invalidation capability given that the kernel
already invalidates the devtlb automatically when serving an iotlb
invalidation...
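btw for readers of the archive, my reading of how the two tables above are
meant to be consumed in the driver — just a sketch, the helper name is my
own, not code from the patch:

	static inline int to_vtd_granularity(int type, int granu, int *vtd_granu)
	{
		if (type >= IOMMU_CACHE_INV_TYPE_NR ||
		    granu >= IOMMU_INV_GRANU_NR ||
		    !inv_type_granu_map[type][granu])
			return -EINVAL;	/* (type, granu) pair not supported */

		*vtd_granu = inv_type_granu_table[type][granu];
		return 0;
	}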

RE: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function

2020-03-30 Thread Tian, Kevin
> From: Auger Eric 
> Sent: Monday, March 30, 2020 12:05 AM
> 
> On 3/28/20 11:01 AM, Tian, Kevin wrote:
> >> From: Jacob Pan 
> >> Sent: Saturday, March 21, 2020 7:28 AM
> >>
> >> When Shared Virtual Address (SVA) is enabled for a guest OS via
> >> vIOMMU, we need to provide invalidation support at IOMMU API and
> driver
> >> level. This patch adds Intel VT-d specific function to implement
> >> iommu passdown invalidate API for shared virtual address.
> >>
> >> The use case is for supporting caching structure invalidation
> >> of assigned SVM capable devices. Emulated IOMMU exposes queue
> >
> > emulated IOMMU -> vIOMMU, since virito-iommu could use the
> > interface as well.
> >
> >> invalidation capability and passes down all descriptors from the guest
> >> to the physical IOMMU.
> >>
> >> The assumption is that guest to host device ID mapping should be
> >> resolved prior to calling IOMMU driver. Based on the device handle,
> >> host IOMMU driver can replace certain fields before submit to the
> >> invalidation queue.
> >>
> >> ---
> >> v7 review fixed in v10
> >> ---
> >>
> >> Signed-off-by: Jacob Pan 
> >> Signed-off-by: Ashok Raj 
> >> Signed-off-by: Liu, Yi L 
> >> ---
> >>  drivers/iommu/intel-iommu.c | 182
> >> 
> >>  1 file changed, 182 insertions(+)
> >>
> >> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> >> index b1477cd423dd..a76afb0fd51a 100644
> >> --- a/drivers/iommu/intel-iommu.c
> >> +++ b/drivers/iommu/intel-iommu.c
> >> @@ -5619,6 +5619,187 @@ static void
> >> intel_iommu_aux_detach_device(struct iommu_domain *domain,
> >>aux_domain_remove_dev(to_dmar_domain(domain), dev);
> >>  }
> >>
> >> +/*
> >> + * 2D array for converting and sanitizing IOMMU generic TLB granularity
> to
> >> + * VT-d granularity. Invalidation is typically included in the unmap
> operation
> >> + * as a result of DMA or VFIO unmap. However, for assigned devices
> guest
> >> + * owns the first level page tables. Invalidations of translation caches 
> >> in
> the
> >> + * guest are trapped and passed down to the host.
> >> + *
> >> + * vIOMMU in the guest will only expose first level page tables, therefore
> >> + * we do not include IOTLB granularity for request without PASID (second
> >> level).
> >
> > I would revise above as "We do not support IOTLB granularity for request
> > without PASID (second level), therefore any vIOMMU implementation that
> > exposes the SVA capability to the guest should only expose the first level
> > page tables, implying all invalidation requests from the guest will include
> > a valid PASID"
> >
> >> + *
> >> + * For example, to find the VT-d granularity encoding for IOTLB
> >> + * type and page selective granularity within PASID:
> >> + * X: indexed by iommu cache type
> >> + * Y: indexed by enum iommu_inv_granularity
> >> + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> >> + *
> >> + * Granu_map array indicates validity of the table. 1: valid, 0: invalid
> >> + *
> >> + */
> >> +const static int
> >> inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> >> +  /*
> >> +   * PASID based IOTLB invalidation: PASID selective (per PASID),
> >> +   * page selective (address granularity)
> >> +   */
> >> +  {0, 1, 1},
> >> +  /* PASID based dev TLBs, only support all PASIDs or single PASID */
> >> +  {1, 1, 0},
> >
> > Is this combination correct? when single PASID is being specified, it is
> > essentially a page-selective invalidation since you need provide Address
> > and Size.
> >
> >> +  /* PASID cache */
> >
> > PASID cache is fully managed by the host. Guest PASID cache invalidation
> > is interpreted by vIOMMU for bind and unbind operations. I don't think
> > we should accept any PASID cache invalidation from userspace or guest.
> I tend to agree here.
> >
> >> +  {1, 1, 0}
> >> +};
> >> +
> >> +const static int
> >> inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> >> +  /* PASID based IOTLB */
> >> +  {0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> >> +  /* PASID

RE: [PATCH V10 05/11] iommu/vt-d: Add nested translation helper function

2020-03-30 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Tuesday, March 31, 2020 2:22 AM
> 
> On Sun, 29 Mar 2020 16:03:36 +0800
> Lu Baolu  wrote:
> 
> > On 2020/3/27 20:21, Tian, Kevin wrote:
> > >> From: Jacob Pan 
> > >> Sent: Saturday, March 21, 2020 7:28 AM
> > >>
> > >> Nested translation mode is supported in VT-d 3.0 Spec.CH 3.8.
> > >
> > > now the spec is already at rev3.1 😊
> >
> > Updated.
> >
> > >
> > >> With PASID granular translation type set to 0x11b, translation
> > >> result from the first level(FL) also subject to a second level(SL)
> > >> page table translation. This mode is used for SVA virtualization,
> > >> where FL performs guest virtual to guest physical translation and
> > >> SL performs guest physical to host physical translation.
> > >>
> > >> This patch adds a helper function for setting up nested translation
> > >> where second level comes from a domain and first level comes from
> > >> a guest PGD.
> > >>
> > >> Signed-off-by: Jacob Pan 
> > >> Signed-off-by: Liu, Yi L 
> > >> ---
> > >>   drivers/iommu/intel-pasid.c | 240
> > >> +++-
> > >>   drivers/iommu/intel-pasid.h |  12 +++
> > >>   include/linux/intel-iommu.h |   3 +
> > >>   3 files changed, 252 insertions(+), 3 deletions(-)
> > >>
> > >> diff --git a/drivers/iommu/intel-pasid.c
> > >> b/drivers/iommu/intel-pasid.c index 9bdb7ee228b6..10c7856afc6b
> > >> 100644 --- a/drivers/iommu/intel-pasid.c
> > >> +++ b/drivers/iommu/intel-pasid.c
> > >> @@ -359,6 +359,76 @@ pasid_set_flpm(struct pasid_entry *pe, u64
> > >> value) pasid_set_bits(&pe->val[2], GENMASK_ULL(3, 2), value << 2);
> > >>   }
> > >>
> > >> +/*
> > >> + * Setup the Extended Memory Type(EMT) field (Bits 91-93)
> > >> + * of a scalable mode PASID entry.
> > >> + */
> > >> +static inline void
> > >> +pasid_set_emt(struct pasid_entry *pe, u64 value)
> > >> +{
> > >> +pasid_set_bits(&pe->val[1], GENMASK_ULL(29, 27), value <<
> > >> 27); +}
> > >> +
> > >> +/*
> > >> + * Setup the Page Attribute Table (PAT) field (Bits 96-127)
> > >> + * of a scalable mode PASID entry.
> > >> + */
> > >> +static inline void
> > >> +pasid_set_pat(struct pasid_entry *pe, u64 value)
> > >> +{
> > >> +pasid_set_bits(&pe->val[1], GENMASK_ULL(63, 32), value <<
> > >> 32); +}
> > >> +
> > >> +/*
> > >> + * Setup the Cache Disable (CD) field (Bit 89)
> > >> + * of a scalable mode PASID entry.
> > >> + */
> > >> +static inline void
> > >> +pasid_set_cd(struct pasid_entry *pe)
> > >> +{
> > >> +pasid_set_bits(&pe->val[1], 1 << 25, 1 << 25);
> > >> +}
> > >> +
> > >> +/*
> > >> + * Setup the Extended Memory Type Enable (EMTE) field (Bit 90)
> > >> + * of a scalable mode PASID entry.
> > >> + */
> > >> +static inline void
> > >> +pasid_set_emte(struct pasid_entry *pe)
> > >> +{
> > >> +pasid_set_bits(&pe->val[1], 1 << 26, 1 << 26);
> > >> +}
> > >> +
> > >> +/*
> > >> + * Setup the Extended Access Flag Enable (EAFE) field (Bit 135)
> > >> + * of a scalable mode PASID entry.
> > >> + */
> > >> +static inline void
> > >> +pasid_set_eafe(struct pasid_entry *pe)
> > >> +{
> > >> +pasid_set_bits(&pe->val[2], 1 << 7, 1 << 7);
> > >> +}
> > >> +
> > >> +/*
> > >> + * Setup the Page-level Cache Disable (PCD) field (Bit 95)
> > >> + * of a scalable mode PASID entry.
> > >> + */
> > >> +static inline void
> > >> +pasid_set_pcd(struct pasid_entry *pe)
> > >> +{
> > >> +pasid_set_bits(&pe->val[1], 1 << 31, 1 << 31);
> > >> +}
> > >> +
> > >> +/*
> > >> + * Setup the Page-level Write-Through (PWT)) field (Bit 94)
> > >> + * of a scalable mode PASID entry.
> > >> + */
> > >> +static inline void
> > >> +

RE: [PATCH V10 06/11] iommu/vt-d: Add bind guest PASID support

2020-03-30 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Tuesday, March 31, 2020 4:52 AM
> 
> On Sat, 28 Mar 2020 08:02:01 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Jacob Pan 
> > > Sent: Saturday, March 21, 2020 7:28 AM
> > >
> > > When supporting guest SVA with emulated IOMMU, the guest PASID
> > > table is shadowed in VMM. Updates to guest vIOMMU PASID table
> > > will result in PASID cache flush which will be passed down to
> > > the host as bind guest PASID calls.
> > >
> > > For the SL page tables, it will be harvested from device's
> > > default domain (request w/o PASID), or aux domain in case of
> > > mediated device.
> > >
> > >  .-------------.  .---------------------------.
> > >  |   vIOMMU    |  | Guest process CR3, FL only |
> > >  |             |  '---------------------------'
> > >  .----------------/
> > >  | PASID Entry |--- PASID cache flush -
> > >  '-------------'   |
> > >  |             |   V
> > >  |             |  CR3 in GPA
> > >  '-------------'
> > > Guest
> > > ------| Shadow |--------------------------|--------
> > >       v        v                          v
> > > Host
> > >  .-------------.  .----------------------.
> > >  |   pIOMMU    |  | Bind FL for GVA-GPA  |
> > >  |             |  '----------------------'
> > >  .----------------/  |
> > >  | PASID Entry | V (Nested xlate)
> > >  '----------------------------------------------.
> > >  |             |   | SL for GPA-HPA, default domain |
> > >  |             |   '--------------------------------'
> > >  '-------------'
> > > Where:
> > >  - FL = First level/stage one page tables
> > >  - SL = Second level/stage two page tables
> > >
> > > Signed-off-by: Jacob Pan 
> > > Signed-off-by: Liu, Yi L 
> > > ---
> > >  drivers/iommu/intel-iommu.c |   4 +
> > >  drivers/iommu/intel-svm.c   | 224
> > > 
> > >  include/linux/intel-iommu.h |   8 +-
> > >  include/linux/intel-svm.h   |  17 
> > >  4 files changed, 252 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/iommu/intel-iommu.c
> > > b/drivers/iommu/intel-iommu.c index e599b2537b1c..b1477cd423dd
> > > 100644 --- a/drivers/iommu/intel-iommu.c
> > > +++ b/drivers/iommu/intel-iommu.c
> > > @@ -6203,6 +6203,10 @@ const struct iommu_ops intel_iommu_ops = {
> > >   .dev_disable_feat   = intel_iommu_dev_disable_feat,
> > >   .is_attach_deferred =
> > > intel_iommu_is_attach_deferred, .pgsize_bitmap=
> > > INTEL_IOMMU_PGSIZES, +#ifdef CONFIG_INTEL_IOMMU_SVM
> > > + .sva_bind_gpasid= intel_svm_bind_gpasid,
> > > + .sva_unbind_gpasid  = intel_svm_unbind_gpasid,
> > > +#endif
> > >  };
> > >
> > >  static void quirk_iommu_igfx(struct pci_dev *dev)
> > > diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> > > index d7f2a5358900..47c0deb5ae56 100644
> > > --- a/drivers/iommu/intel-svm.c
> > > +++ b/drivers/iommu/intel-svm.c
> > > @@ -226,6 +226,230 @@ static LIST_HEAD(global_svm_list);
> > >   list_for_each_entry((sdev), &(svm)->devs, list) \
> > >   if ((d) != (sdev)->dev) {} else
> > >
> > > +int intel_svm_bind_gpasid(struct iommu_domain *domain,
> > > + struct device *dev,
> > > + struct iommu_gpasid_bind_data *data)
> > > +{
> > > + struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > > + struct dmar_domain *ddomain;
> >
> > what about the full name e.g. dmar_domain? though a bit longer
> > but clearer than ddomain.
> >
> Sure, I don't have preference.
> 
> > > + struct intel_svm_dev *sdev;
> > > + struct intel_svm *svm;
> > > + int ret = 0;
> > > +
> > > + if (WARN_ON(!iommu) || !data)
> > > + return -EINVAL;
> > > +
> > > + if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
> > > + data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
> > > + return -EINVAL;
> > > +
> > > + if (dev_is_pci(dev)) {
> > > + /* VT-d supports devices with full 20 bit PASIDs
> > > only

RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Monday, March 30, 2020 10:37 PM
> 
> > From: Tian, Kevin 
> > Sent: Monday, March 30, 2020 4:32 PM
> > To: Liu, Yi L ; alex.william...@redhat.com;
> > Subject: RE: [PATCH v1 1/8] vfio: Add
> VFIO_IOMMU_PASID_REQUEST(alloc/free)
> >
> > > From: Liu, Yi L 
> > > Sent: Sunday, March 22, 2020 8:32 PM
> > >
> > > From: Liu Yi L 
> > >
> > > For a long time, devices have only one DMA address space from platform
> > > IOMMU's point of view. This is true for both bare metal and directed-
> > > access in virtualization environment. Reason is the source ID of DMA in
> > > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> >
> > are->is
> 
> thanks.
> 
> > > DMA isolation. However, this is changing with the latest advancement in
> > > I/O technology area. More and more platform vendors are utilizing the
> PCIe
> > > PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> > > address spaces as identified by their individual PASIDs. For example,
> > > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > > let device access multiple process virtual address space by binding the
> >
> > "address space" -> "address spaces"
> >
> > "binding the" -> "binding each"
> 
> will correct both.
> 
> > > virtual address space with a PASID. Wherein the PASID is allocated in
> > > software and programmed to device per device specific manner. Devices
> > > which support PASID capability are called PASID-capable devices. If such
> > > devices are passed through to VMs, guest software are also able to bind
> > > guest process virtual address space on such devices. Therefore, the guest
> > > software could reuse the bare metal software programming model,
> which
> > > means guest software will also allocate PASID and program it to device
> > > directly. This is a dangerous situation since it has potential PASID
> > > conflicts and unauthorized address space access. It would be safer to
> > > let host intercept in the guest software's PASID allocation. Thus PASID
> > > are managed system-wide.
> > >
> > > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to
> > > passdown
> > > PASID allocation/free request from the virtual IOMMU. Additionally, such
> >
> > "Additionally, because such"
> >
> > > requests are intended to be invoked by QEMU or other applications
> which
> >
> > simplify to "intended to be invoked from userspace"
> 
> got it.
> 
> > > are running in userspace, it is necessary to have a mechanism to prevent
> > > single application from abusing available PASIDs in system. With such
> > > consideration, this patch tracks the VFIO PASID allocation per-VM. There
> > > was a discussion to make quota to be per assigned devices. e.g. if a VM
> > > has many assigned devices, then it should have more quota. However, it
> > > is not sure how many PASIDs an assigned devices will use. e.g. it is
> >
> > devices -> device
> 
> got it.
> 
> > > possible that a VM with multiples assigned devices but requests less
> > > PASIDs. Therefore per-VM quota would be better.
> > >
> > > This patch uses struct mm pointer as a per-VM token. We also considered
> > > using task structure pointer and vfio_iommu structure pointer. However,
> > > task structure is per-thread, which means it cannot achieve per-VM PASID
> > > alloc tracking purpose. While for vfio_iommu structure, it is visible
> > > only within vfio. Therefore, structure mm pointer is selected. This patch
> > > adds a structure vfio_mm. A vfio_mm is created when the first vfio
> > > container is opened by a VM. On the reverse order, vfio_mm is free when
> > > the last vfio container is released. Each VM is assigned with a PASID
> > > quota, so that it is not able to request PASID beyond its quota. This
> > > patch adds a default quota of 1000. This quota could be tuned by
> > > administrator. Making PASID quota tunable will be added in another
> patch
> > > in this series.
> > >
> > > Previous discussions:
> > > https://patchwork.kernel.org/patch/11209429/
> > >
> > > Cc: Kevin Tian 
> > > CC: Jacob Pan 
> > > Cc: Alex Williamson 
> > > Cc: Eric Auger 
> > > Cc: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
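(side note for readers of the archive: the per-VM tracking described in
the commit message presumably reduces to a structure along the lines
below — a sketch, the field names are my guess rather than the actual
patch:)

	/* sketch: one vfio_mm per VM, keyed on the opener's mm_struct */
	struct vfio_mm {
		struct kref		kref;
		int			pasid_quota;	/* default 1000, admin-tunable */
		struct mm_struct	*mm;		/* the per-VM token */
		struct list_head	next;
	};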

RE: [PATCH v2 1/3] iommu/uapi: Define uapi version and capabilities

2020-03-30 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Tuesday, March 31, 2020 12:08 AM
> 
> On Mon, 30 Mar 2020 05:40:40 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Jacob Pan 
> > > Sent: Saturday, March 28, 2020 7:54 AM
> > >
> > > On Fri, 27 Mar 2020 00:47:02 -0700
> > > Christoph Hellwig  wrote:
> > >
> > > > On Fri, Mar 27, 2020 at 02:49:55AM +, Tian, Kevin wrote:
> > > > > If those API calls are inter-dependent for composing a feature
> > > > > (e.g. SVA), shouldn't we need a way to check them together
> > > > > before exposing the feature to the guest, e.g. through a
> > > > > iommu_get_uapi_capabilities interface?
> > > >
> > > > Yes, that makes sense.  The important bit is to have a capability
> > > > flags and not version numbers.
> > >
> > > The challenge is that there are two consumers in the kernel for
> > > this. 1. VFIO only look for compatibility, and size of each data
> > > struct such that it can copy_from_user.
> > >
> > > 2. IOMMU driver, the "real consumer" of the content.
> > >
> > > For 2, I agree and we do plan to use the capability flags to check
> > > content and maintain backward compatibility etc.
> > >
> > > For VFIO, it is difficult to do size look up based on capability
> > > flags.
> >
> > Can you elaborate the difficulty in VFIO? if, as Christoph Hellwig
> > pointed out, version number is already avoided everywhere, it is
> > interesting to know whether this work becomes a real exception
> > or just requires a different mindset.
> >
> From VFIO p.o.v. the IOMMU UAPI data is opaque, it only needs to do two
> things:
> 1. is the UAPI compatible?
> 2. what is the size to copy?
> 
> If you look at the version number, this is really a "version as size"
> lookup, as provided by the helper function in this patch. An example
> can be the newly introduced clone3 syscall.
> https://lwn.net/Articles/792628/
> In clone3, new version must have new size. The slight difference here
> is that, unlike clone3, we have multiple data structures instead of a
> single struct clone_args {}. And each struct has flags to enumerate its
> contents besides size.

Thanks for providing that link. However clone3 doesn't include a version
field to do "version as size" lookup. Instead, as you said, it includes a
size parameter, which sounds like option 3 (argsz) listed below.
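(For reference, a userspace sketch of the clone3 convention — the caller
passes the sizeof() of the clone_args it was compiled against, which is
exactly the "size as version" idea:)

	#define _GNU_SOURCE
	#include <string.h>
	#include <signal.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/sched.h>	/* struct clone_args */

	static pid_t plain_fork_via_clone3(void)
	{
		struct clone_args args;

		memset(&args, 0, sizeof(args));
		args.exit_signal = SIGCHLD;	/* plain fork() semantics */

		/* the explicit size tells the kernel which generation of
		 * clone_args this binary was built against */
		return syscall(__NR_clone3, &args, sizeof(args));
	}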

> 
> Besides breaching data abstraction, if VFIO has to check IOMMU flags to
> determine the sizes, it has many combinations.
> 
> We also separate the responsibilities into two parts
> 1. compatibility - version, size by VFIO
> 2. sanity check - capability flags - by IOMMU

I feel the argsz+flags approach can perfectly meet the above requirement.
Userspace sets the size and flags for whatever capabilities it uses, and
VFIO simply copies the parameters by size and passes them to the IOMMU
layer for further sanity checking. Of course the assumption is that we do
provide an interface for userspace to enumerate all supported capabilities.

Is there anything that I have overlooked here? I suppose there might be
some difficulties that block you from going the argsz way...
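Concretely, something like below on the VFIO side (a sketch only;
iommu_uapi_cache_invalidate() taking the __user pointer is my assumption
about the IOMMU-side entry point, not an existing function):

	struct vfio_iommu_type1_cache_invalidate inv;
	unsigned long minsz;

	minsz = offsetofend(struct vfio_iommu_type1_cache_invalidate, flags);
	if (copy_from_user(&inv, (void __user *)arg, minsz))
		return -EFAULT;
	if (inv.argsz < minsz || inv.flags)
		return -EINVAL;

	/* VFIO stops parsing here; the IOMMU layer validates the embedded
	 * iommu uapi struct (argsz/flags/capabilities) and copies it itself */
	return iommu_uapi_cache_invalidate(domain, dev,
					   (void __user *)(arg + minsz));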

Thanks
Kevin

> 
> I think the latter matches what Christoph's comments. So we are in
> agreement at the IOMMU level :)
> 
> For example:
> During guest PASID bind, IOMMU driver operates on the data passed from
> VFIO and check format & flags to take different code path.
> 
> #define IOMMU_PASID_FORMAT_INTEL_VTD  1
>   __u32 format;
> #define IOMMU_SVA_GPASID_VAL  (1 << 0) /* guest PASID valid */
>   __u64 flags;
> 
> Jacob
> 
> > btw the most relevant discussion which I can find out now is here:
> > https://lkml.org/lkml/2020/2/3/1126
> >
> > It mentioned 3 options for handling extension:
> > --
> > 1. Disallow adding new members to each structure other than reuse
> > padding bits or adding union members at the end.
> > 2. Allow extension of the structures beyond union, but union size has
> > to be fixed with reserved spaces
> > 3. Adopt VFIO argsz scheme, I don't think we need version for each
> > struct anymore. argsz implies the version that user is using assuming
> > UAPI data is extension only.
> > --
> >
> > the first two are both version-based. Looks most guys agreed with
> > option-1 (in this v2), but Alex didn't give his opinion at the
> > moment. The last response from him was the raise of option-3 using
> > argsz to avoid version. So, we also need hear from him. Alex?
> >
> > Thanks
> > Kevin
> 
> [Jacob Pan]


RE: [PATCH v1 0/2] vfio/pci: expose device's PASID capability to VMs

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Sunday, March 22, 2020 8:33 PM
> 
> From: Liu Yi L 
> 
> Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
> Intel platforms allows address space sharing between device DMA and
> applications. SVA can reduce programming complexity and enhance security.
> 
> To enable SVA, device needs to have PASID capability, which is a key
> capability for SVA. This patchset exposes the device's PASID capability
> to guest instead of hiding it from guest.
> 
> The second patch emulates PASID capability for VFs (Virtual Function) since
> VFs don't implement such capability per PCIe spec. This patch emulates such
> capability and expose to VM if the capability is enabled in PF (Physical
> Function).
> 
> However, there is an open for PASID emulation. If PF driver disables PASID
> capability at runtime, then it may be an issue. e.g. PF should not disable
> PASID capability if there is guest using this capability on any VF related
> to this PF. To solve it, may need to introduce a generic communication
> framework between vfio-pci driver and PF drivers. Please feel free to give
> your suggestions on it.

I'm not sure how this is addressed on bare metal today, i.e. between normal
kernel PF and VF drivers. I looked at the pasid enable/disable code in
intel-iommu.c. There is no check on PF/VF dependency so far. The cap is
toggled when attaching/detaching the PF to its domain. Let's see how the
IOMMU guys respond, and if there is a way for a VF driver to block the PF
driver from disabling the pasid cap while it is being actively used by the
VF driver, then we may leverage the same trick in VFIO when emulation is
provided to the guest.

Thanks
Kevin

> 
> Regards,
> Yi Liu
> 
> Changelog:
>   - RFC v1 -> Patch v1:
> Add CONFIG_PCI_ATS #ifdef control to avoid compiling error.
> 
> Liu Yi L (2):
>   vfio/pci: Expose PCIe PASID capability to guest
>   vfio/pci: Emulate PASID/PRI capability for VFs
> 
>  drivers/vfio/pci/vfio_pci_config.c | 327
> -
>  1 file changed, 324 insertions(+), 3 deletions(-)
> 
> --
> 2.7.4



RE: [PATCH v1 1/2] vfio/pci: Expose PCIe PASID capability to guest

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Sunday, March 22, 2020 8:33 PM
> 
> From: Liu Yi L 
> 
> This patch exposes PCIe PASID capability to guest. Existing vfio_pci
> driver hides it from guest by setting the capability length as 0 in
> pci_ext_cap_length[].
> 
> This capability is required for vSVA enabling on pass-through PCIe
> devices.

should this be [PATCH 2/2], after you have the emulation in place?

and it might be worth noting that PRI is already exposed, to avoid
confusion for someone like me about why two capabilities are emulated
in this series while only one is being exposed. 😊

> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/pci/vfio_pci_config.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c
> b/drivers/vfio/pci/vfio_pci_config.c
> index 90c0b80..4b9af99 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -95,7 +95,7 @@ static const u16
> pci_ext_cap_length[PCI_EXT_CAP_ID_MAX + 1] = {
>   [PCI_EXT_CAP_ID_LTR]=   PCI_EXT_CAP_LTR_SIZEOF,
>   [PCI_EXT_CAP_ID_SECPCI] =   0,  /* not yet */
>   [PCI_EXT_CAP_ID_PMUX]   =   0,  /* not yet */
> - [PCI_EXT_CAP_ID_PASID]  =   0,  /* not yet */
> + [PCI_EXT_CAP_ID_PASID]  =   PCI_EXT_CAP_PASID_SIZEOF,
>  };
> 
>  /*
> --
> 2.7.4
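btw for anyone wanting to verify the change from the guest/userspace side,
a sketch that walks the emulated extended config space of an opened VFIO
device fd looking for the PASID capability:

	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/pci_regs.h>
	#include <linux/vfio.h>

	static int has_pasid_cap(int device_fd)
	{
		struct vfio_region_info reg = {
			.argsz = sizeof(reg),
			.index = VFIO_PCI_CONFIG_REGION_INDEX,
		};
		__u32 hdr;
		__u16 off = 0x100;	/* extended capabilities start here */

		if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &reg))
			return 0;
		do {
			if (pread(device_fd, &hdr, sizeof(hdr),
				  reg.offset + off) != sizeof(hdr))
				return 0;
			if ((hdr & 0xffff) == PCI_EXT_CAP_ID_PASID)
				return 1;
			off = hdr >> 20;	/* next capability pointer */
		} while (off);
		return 0;
	}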


RE: [PATCH v2 1/3] iommu/uapi: Define uapi version and capabilities

2020-03-31 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Tuesday, March 31, 2020 11:55 PM
> 
> On Tue, 31 Mar 2020 06:06:38 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Jacob Pan 
> > > Sent: Tuesday, March 31, 2020 12:08 AM
> > >
> > > On Mon, 30 Mar 2020 05:40:40 +
> > > "Tian, Kevin"  wrote:
> > >
> > > > > From: Jacob Pan 
> > > > > Sent: Saturday, March 28, 2020 7:54 AM
> > > > >
> > > > > On Fri, 27 Mar 2020 00:47:02 -0700
> > > > > Christoph Hellwig  wrote:
> > > > >
> > > > > > On Fri, Mar 27, 2020 at 02:49:55AM +, Tian, Kevin wrote:
> > > > > > > If those API calls are inter-dependent for composing a
> > > > > > > feature (e.g. SVA), shouldn't we need a way to check them
> > > > > > > together before exposing the feature to the guest, e.g.
> > > > > > > through a iommu_get_uapi_capabilities interface?
> > > > > >
> > > > > > Yes, that makes sense.  The important bit is to have a
> > > > > > capability flags and not version numbers.
> > > > >
> > > > > The challenge is that there are two consumers in the kernel for
> > > > > this. 1. VFIO only look for compatibility, and size of each data
> > > > > struct such that it can copy_from_user.
> > > > >
> > > > > 2. IOMMU driver, the "real consumer" of the content.
> > > > >
> > > > > For 2, I agree and we do plan to use the capability flags to
> > > > > check content and maintain backward compatibility etc.
> > > > >
> > > > > For VFIO, it is difficult to do size look up based on capability
> > > > > flags.
> > > >
> > > > Can you elaborate the difficulty in VFIO? if, as Christoph Hellwig
> > > > pointed out, version number is already avoided everywhere, it is
> > > > interesting to know whether this work becomes a real exception
> > > > or just requires a different mindset.
> > > >
> > > From VFIO p.o.v. the IOMMU UAPI data is opaque, it only needs to do
> > > two things:
> > > 1. is the UAPI compatible?
> > > 2. what is the size to copy?
> > >
> > > If you look at the version number, this is really a "version as
> > > size" lookup, as provided by the helper function in this patch. An
> > > example can be the newly introduced clone3 syscall.
> > > https://lwn.net/Articles/792628/
> > > In clone3, new version must have new size. The slight difference
> > > here is that, unlike clone3, we have multiple data structures
> > > instead of a single struct clone_args {}. And each struct has flags
> > > to enumerate its contents besides size.
> >
> > Thanks for providing that link. However clone3 doesn't include a
> > version field to do "version as size" lookup. Instead, as you said,
> > it includes a size parameter which sounds like the option 3 (argsz)
> > listed below.
> >
> Right, there is no version in clone3. size = version. I view this as
> a 1:1 lookup.
> 
> > >
> > > Besides breaching data abstraction, if VFIO has to check IOMMU
> > > flags to determine the sizes, it has many combinations.
> > >
> > > We also separate the responsibilities into two parts
> > > 1. compatibility - version, size by VFIO
> > > 2. sanity check - capability flags - by IOMMU
> >
> > I feel argsz+flags approach can perfectly meet above requirement. The
> > userspace set the size and flags for whatever capabilities it uses,
> > and VFIO simply copies the parameters by size and pass to IOMMU for
> > further sanity check. Of course the assumption is that we do provide
> > an interface for userspace to enumerate all supported capabilities.
> >
> You cannot trust user for argsz. the size to be copied from user must
> be based on knowledge in kernel. That is why we have this version to
> size lookup.
> 
> In VFIO, the size to copy is based on knowledge of each VFIO UAPI
> structures and VFIO flags. But here the flags are IOMMU UAPI flags. As
> you pointed out in another thread, VFIO is one user.

If that is the case, can we let VFIO only copy its own UAPI fields while
simply passing the user pointer of the IOMMU UAPI structure to the IOMMU
driver for further size check and copy? Otherwise we are entering a
dead end where VFIO doesn't want to parse a structure which is not
defined by it while using version...

RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

2020-03-31 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Tuesday, March 31, 2020 9:22 PM
> 
> > From: Tian, Kevin 
> > Sent: Tuesday, March 31, 2020 1:41 PM
> > To: Liu, Yi L ; alex.william...@redhat.com;
> > eric.au...@redhat.com
> > Subject: RE: [PATCH v1 1/8] vfio: Add
> VFIO_IOMMU_PASID_REQUEST(alloc/free)
> >
> > > From: Liu, Yi L 
> > > Sent: Monday, March 30, 2020 10:37 PM
> > >
> > > > From: Tian, Kevin 
> > > > Sent: Monday, March 30, 2020 4:32 PM
> > > > To: Liu, Yi L ; alex.william...@redhat.com;
> > > > Subject: RE: [PATCH v1 1/8] vfio: Add
> > > VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > > >
> > > > > From: Liu, Yi L 
> > > > > Sent: Sunday, March 22, 2020 8:32 PM
> > > > >
> > > > > From: Liu Yi L 
> > > > >
> > > > > For a long time, devices have only one DMA address space from
> > > > > platform IOMMU's point of view. This is true for both bare metal
> > > > > and directed- access in virtualization environment. Reason is the
> > > > > source ID of DMA in PCIe are BDF (bus/dev/fnc ID), which results
> > > > > in only device granularity
> > > >
> > > > are->is
> > >
> > > thanks.
> > >
> > > > > DMA isolation. However, this is changing with the latest
> > > > > advancement in I/O technology area. More and more platform
> vendors
> > > > > are utilizing the
> > > PCIe
> > > > > PASID TLP prefix in DMA requests, thus to give devices with
> > > > > multiple DMA address spaces as identified by their individual
> > > > > PASIDs. For example, Shared Virtual Addressing (SVA, a.k.a Shared
> > > > > Virtual Memory) is able to let device access multiple process
> > > > > virtual address space by binding the
> > > >
> > > > "address space" -> "address spaces"
> > > >
> > > > "binding the" -> "binding each"
> > >
> > > will correct both.
> > >
> > > > > virtual address space with a PASID. Wherein the PASID is allocated
> > > > > in software and programmed to device per device specific manner.
> > > > > Devices which support PASID capability are called PASID-capable
> > > > > devices. If such devices are passed through to VMs, guest software
> > > > > are also able to bind guest process virtual address space on such
> > > > > devices. Therefore, the guest software could reuse the bare metal
> > > > > software programming model,
> > > which
> > > > > means guest software will also allocate PASID and program it to
> > > > > device directly. This is a dangerous situation since it has
> > > > > potential PASID conflicts and unauthorized address space access.
> > > > > It would be safer to let host intercept in the guest software's
> > > > > PASID allocation. Thus PASID are managed system-wide.
> > > > >
> > > > > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to
> > > > > passdown PASID allocation/free request from the virtual IOMMU.
> > > > > Additionally, such
> > > >
> > > > "Additionally, because such"
> > > >
> > > > > requests are intended to be invoked by QEMU or other applications
> > > which
> > > >
> > > > simplify to "intended to be invoked from userspace"
> > >
> > > got it.
> > >
> > > > > are running in userspace, it is necessary to have a mechanism to
> > > > > prevent single application from abusing available PASIDs in
> > > > > system. With such consideration, this patch tracks the VFIO PASID
> > > > > allocation per-VM. There was a discussion to make quota to be per
> > > > > assigned devices. e.g. if a VM has many assigned devices, then it
> > > > > should have more quota. However, it is not sure how many PASIDs an
> > > > > assigned devices will use. e.g. it is
> > > >
> > > > devices -> device
> > >
> > > got it.
> > >
> > > > > possible that a VM with multiples assigned devices but requests
> > > > > less PASIDs. Therefore per-VM quota would be better.
> > > > >
> > > > > This patch uses struct mm pointer as a per-VM token...

RE: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function

2020-03-31 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Wednesday, April 1, 2020 2:14 AM
> 
> On Sat, 28 Mar 2020 10:01:42 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Jacob Pan 
> > > Sent: Saturday, March 21, 2020 7:28 AM
> > >
> > > When Shared Virtual Address (SVA) is enabled for a guest OS via
> > > vIOMMU, we need to provide invalidation support at IOMMU API and
> > > driver level. This patch adds Intel VT-d specific function to
> > > implement iommu passdown invalidate API for shared virtual address.
> > >
> > > The use case is for supporting caching structure invalidation
> > > of assigned SVM capable devices. Emulated IOMMU exposes queue
> >
> > emulated IOMMU -> vIOMMU, since virito-iommu could use the
> > interface as well.
> >
> True, but it does not invalidate this statement about emulated IOMMU. I
> will add another statement saying "the same interface can be used for
> virtio-IOMMU as well". OK?

sure

> 
> > > invalidation capability and passes down all descriptors from the
> > > guest to the physical IOMMU.
> > >
> > > The assumption is that guest to host device ID mapping should be
> > > resolved prior to calling IOMMU driver. Based on the device handle,
> > > host IOMMU driver can replace certain fields before submit to the
> > > invalidation queue.
> > >
> > > ---
> > > v7 review fixed in v10
> > > ---
> > >
> > > Signed-off-by: Jacob Pan 
> > > Signed-off-by: Ashok Raj 
> > > Signed-off-by: Liu, Yi L 
> > > ---
> > >  drivers/iommu/intel-iommu.c | 182
> > > 
> > >  1 file changed, 182 insertions(+)
> > >
> > > diff --git a/drivers/iommu/intel-iommu.c
> > > b/drivers/iommu/intel-iommu.c index b1477cd423dd..a76afb0fd51a
> > > 100644 --- a/drivers/iommu/intel-iommu.c
> > > +++ b/drivers/iommu/intel-iommu.c
> > > @@ -5619,6 +5619,187 @@ static void
> > > intel_iommu_aux_detach_device(struct iommu_domain *domain,
> > >   aux_domain_remove_dev(to_dmar_domain(domain), dev);
> > >  }
> > >
> > > +/*
> > > + * 2D array for converting and sanitizing IOMMU generic TLB
> > > granularity to
> > > + * VT-d granularity. Invalidation is typically included in the
> > > unmap operation
> > > + * as a result of DMA or VFIO unmap. However, for assigned devices
> > > guest
> > > + * owns the first level page tables. Invalidations of translation
> > > caches in the
> > > + * guest are trapped and passed down to the host.
> > > + *
> > > + * vIOMMU in the guest will only expose first level page tables,
> > > therefore
> > > + * we do not include IOTLB granularity for request without PASID
> > > (second level).
> >
> > I would revise above as "We do not support IOTLB granularity for
> > request without PASID (second level), therefore any vIOMMU
> > implementation that exposes the SVA capability to the guest should
> > only expose the first level page tables, implying all invalidation
> > requests from the guest will include a valid PASID"
> >
> Sounds good.
> 
> > > + *
> > > + * For example, to find the VT-d granularity encoding for IOTLB
> > > + * type and page selective granularity within PASID:
> > > + * X: indexed by iommu cache type
> > > + * Y: indexed by enum iommu_inv_granularity
> > > + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> > > + *
> > > + * Granu_map array indicates validity of the table. 1: valid, 0:
> > > invalid
> > > + *
> > > + */
> > > +const static int
> > > inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> > > + /*
> > > +  * PASID based IOTLB invalidation: PASID selective (per
> > > PASID),
> > > +  * page selective (address granularity)
> > > +  */
> > > + {0, 1, 1},
> > > + /* PASID based dev TLBs, only support all PASIDs or single
> > > PASID */
> > > + {1, 1, 0},
> >
> > Is this combination correct? when single PASID is being specified, it
> > is essentially a page-selective invalidation since you need provide
> > Address and Size.
> >
> This is for translation between generic UAPI granu to VT-d granu, it
> has nothing to do with address and size.

Generic UAPI defines three granularities: domain, pasid and addr.
From the definition, domain applies to all entries related to the domain...

RE: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function

2020-03-31 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Wednesday, April 1, 2020 4:58 AM
> 
> On Tue, 31 Mar 2020 02:49:21 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Auger Eric 
> > > Sent: Sunday, March 29, 2020 11:34 PM
> > >
> > > Hi,
> > >
> > > On 3/28/20 11:01 AM, Tian, Kevin wrote:
> > > >> From: Jacob Pan 
> > > >> Sent: Saturday, March 21, 2020 7:28 AM
> > > >>
> > > >> When Shared Virtual Address (SVA) is enabled for a guest OS via
> > > >> vIOMMU, we need to provide invalidation support at IOMMU API
> > > >> and
> > > driver
> > > >> level. This patch adds Intel VT-d specific function to implement
> > > >> iommu passdown invalidate API for shared virtual address.
> > > >>
> > > >> The use case is for supporting caching structure invalidation
> > > >> of assigned SVM capable devices. Emulated IOMMU exposes queue
> >  [...]
> >  [...]
> > > to
> > > >> + * VT-d granularity. Invalidation is typically included in the
> > > >> unmap
> > > operation
> > > >> + * as a result of DMA or VFIO unmap. However, for assigned
> > > >> devices
> > > guest
> > > >> + * owns the first level page tables. Invalidations of
> > > >> translation caches in
> > > the
> >  [...]
> >  [...]
> >  [...]
> > > >> inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> > > >> +  /*
> > > >> +   * PASID based IOTLB invalidation: PASID selective (per
> > > >> PASID),
> > > >> +   * page selective (address granularity)
> > > >> +   */
> > > >> +  {0, 1, 1},
> > > >> +  /* PASID based dev TLBs, only support all PASIDs or
> > > >> single PASID */
> > > >> +  {1, 1, 0},
> > > >
> > > > Is this combination correct? when single PASID is being
> > > > specified, it is essentially a page-selective invalidation since
> > > > you need provide Address and Size.
> > > Isn't it the same when G=1? Still the addr/size is used. Doesn't
> > > it
> >
> > I thought addr/size is not used when G=1, but it might be wrong. I'm
> > checking with our vt-d spec owner.
> >
> 
> > > correspond to IOMMU_INV_GRANU_ADDR with
> > > IOMMU_INV_ADDR_FLAGS_PASID flag
> > > unset?
> > >
> > > so {0, 0, 1}?
> >
> I am not sure I got your logic. The three fields correspond to
> 	IOMMU_INV_GRANU_DOMAIN,	/* domain-selective invalidation */
> 	IOMMU_INV_GRANU_PASID,	/* PASID-selective invalidation */
> 	IOMMU_INV_GRANU_ADDR,	/* page-selective invalidation */
> 
> For devTLB, we use domain as global since there is no domain. Then I
> came up with {1, 1, 0}, which means we could have global and pasid
> granu invalidation for PASID based devTLB.
> 
> If the caller also provide addr and S bit, the flush routine will put

"also" -> "must", because vt-d requires addr/size must be provided
in devtlb descriptor, that is why Eric suggests {0, 0, 1}.

> that into QI descriptor. I know this is a little odd, but from the
> granu translation p.o.v. VT-d spec has no G bit for page selective
> invalidation.

We don't need such an odd way if we can do it properly. 😊

> 
> > I have one more open:
> >
> > How does userspace know which invalidation type/gran is supported?
> > I didn't see such capability reporting in Yi's VFIO vSVA patch set.
> > Do we want the user/kernel assume the same capability set if they are
> > architectural? However the kernel could also do some optimization
> > e.g. hide devtlb invalidation capability given that the kernel
> > already invalidate devtlb automatically when serving iotlb
> > invalidation...
> >
> In general, we are trending to use VFIO capability chain to expose iommu
> capabilities.
> 
> But for architectural features such as type/granu, we have to assume
> the same capability between host & guest. Granu and types are not
> enumerated on the host IOMMU either.
> 
> For devTLB optimization, I agree we need to expose a capability to
> the guest stating that implicit devtlb invalidation is supported.
> Otherwise, if Linux guest runs on other OSes may not support implicit
> devtlb invalidation.
> 
> Right Yi?

Thanks for the explanation. So we are assumed to support all operations...
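If we do go the capability chain route, I imagine something like the below
(purely a sketch; the cap id and field names are made up here, not an
accepted uapi):

	/* reported through the VFIO_IOMMU_GET_INFO capability chain */
	struct vfio_iommu_type1_info_cap_nesting {
		struct vfio_info_cap_header header;
	#define VFIO_IOMMU_PASID_REQS		(1 << 0)
	#define VFIO_IOMMU_CACHE_INV_IOTLB	(1 << 1)
	#define VFIO_IOMMU_CACHE_INV_DEVTLB	(1 << 2)
		__u32	nesting_capabilities;
		__u32	stage1_format;
	};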

RE: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function

2020-03-31 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Wednesday, April 1, 2020 5:08 AM
> 
> On Tue, 31 Mar 2020 03:34:22 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Auger Eric 
> > > Sent: Monday, March 30, 2020 12:05 AM
> > >
> > > On 3/28/20 11:01 AM, Tian, Kevin wrote:
> > > >> From: Jacob Pan 
> > > >> Sent: Saturday, March 21, 2020 7:28 AM
> > > >>
> > > >> When Shared Virtual Address (SVA) is enabled for a guest OS via
> > > >> vIOMMU, we need to provide invalidation support at IOMMU API
> > > >> and
> > > driver
> > > >> level. This patch adds Intel VT-d specific function to implement
> > > >> iommu passdown invalidate API for shared virtual address.
> > > >>
> > > >> The use case is for supporting caching structure invalidation
> > > >> of assigned SVM capable devices. Emulated IOMMU exposes queue
> > > >
> > > > emulated IOMMU -> vIOMMU, since virito-iommu could use the
> > > > interface as well.
> > > >
> > > >> invalidation capability and passes down all descriptors from the
> > > >> guest to the physical IOMMU.
> > > >>
> > > >> The assumption is that guest to host device ID mapping should be
> > > >> resolved prior to calling IOMMU driver. Based on the device
> > > >> handle, host IOMMU driver can replace certain fields before
> > > >> submit to the invalidation queue.
> > > >>
> > > >> ---
> > > >> v7 review fixed in v10
> > > >> ---
> > > >>
> > > >> Signed-off-by: Jacob Pan 
> > > >> Signed-off-by: Ashok Raj 
> > > >> Signed-off-by: Liu, Yi L 
> > > >> ---
> > > >>  drivers/iommu/intel-iommu.c | 182
> > > >> 
> > > >>  1 file changed, 182 insertions(+)
> > > >>
> > > >> diff --git a/drivers/iommu/intel-iommu.c
> > > >> b/drivers/iommu/intel-iommu.c index b1477cd423dd..a76afb0fd51a
> > > >> 100644 --- a/drivers/iommu/intel-iommu.c
> > > >> +++ b/drivers/iommu/intel-iommu.c
> > > >> @@ -5619,6 +5619,187 @@ static void
> > > >> intel_iommu_aux_detach_device(struct iommu_domain *domain,
> > > >>aux_domain_remove_dev(to_dmar_domain(domain), dev);
> > > >>  }
> > > >>
> > > >> +/*
> > > >> + * 2D array for converting and sanitizing IOMMU generic TLB
> > > >> granularity
> > > to
> > > >> + * VT-d granularity. Invalidation is typically included in the
> > > >> unmap
> > > operation
> > > >> + * as a result of DMA or VFIO unmap. However, for assigned
> > > >> devices
> > > guest
> > > >> + * owns the first level page tables. Invalidations of
> > > >> translation caches in
> > > the
> > > >> + * guest are trapped and passed down to the host.
> > > >> + *
> > > >> + * vIOMMU in the guest will only expose first level page
> > > >> tables, therefore
> > > >> + * we do not include IOTLB granularity for request without
> > > >> PASID (second level).
> > > >
> > > > I would revise above as "We do not support IOTLB granularity for
> > > > request without PASID (second level), therefore any vIOMMU
> > > > implementation that exposes the SVA capability to the guest
> > > > should only expose the first level page tables, implying all
> > > > invalidation requests from the guest will include a valid PASID"
> > > >
> > > >> + *
> > > >> + * For example, to find the VT-d granularity encoding for IOTLB
> > > >> + * type and page selective granularity within PASID:
> > > >> + * X: indexed by iommu cache type
> > > >> + * Y: indexed by enum iommu_inv_granularity
> > > >> + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> > > >> + *
> > > >> + * Granu_map array indicates validity of the table. 1: valid,
> > > >> 0: invalid
> > > >> + *
> > > >> + */
> > > >> +const static int
> > > >> inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> > > >> +   

RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace

2020-04-01 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Wednesday, April 1, 2020 3:38 PM
> 
>  > From: Tian, Kevin 
> > Sent: Monday, March 30, 2020 7:49 PM
> > To: Liu, Yi L ; alex.william...@redhat.com;
> > Subject: RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to
> > userspace
> >
> > > From: Liu, Yi L 
> > > Sent: Sunday, March 22, 2020 8:32 PM
> > >
> > > From: Liu Yi L 
> > >
> > > VFIO exposes IOMMU nesting translation (a.k.a dual stage translation)
> > > capability to userspace. Thus applications like QEMU could support
> > > vIOMMU with hardware's nesting translation capability for pass-through
> > > devices. Before setting up nesting translation for pass-through
> > > devices, QEMU and other applications need to learn the supported
> > > 1st-lvl/stage-1 translation structure format like page table format.
> > >
> > > Take vSVA (virtual Shared Virtual Addressing) as an example, to
> > > support vSVA for pass-through devices, QEMU setup nesting translation
> > > for pass- through devices. The guest page table are configured to host
> > > as 1st-lvl/
> > > stage-1 page table. Therefore, guest format should be compatible with
> > > host side.
> > >
> > > This patch reports the supported 1st-lvl/stage-1 page table format on
> > > the current platform to userspace. QEMU and other alike applications
> > > should use this format info when trying to setup IOMMU nesting
> > > translation on host IOMMU.
> > >
> > > Cc: Kevin Tian 
> > > CC: Jacob Pan 
> > > Cc: Alex Williamson 
> > > Cc: Eric Auger 
> > > Cc: Jean-Philippe Brucker 
> > > Signed-off-by: Liu Yi L 
> > > ---
> > >  drivers/vfio/vfio_iommu_type1.c | 56
> > > +
> > >  include/uapi/linux/vfio.h   |  1 +
> > >  2 files changed, 57 insertions(+)
> > >
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > b/drivers/vfio/vfio_iommu_type1.c index 9aa2a67..82a9e0b 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -2234,11 +2234,66 @@ static int
> vfio_iommu_type1_pasid_free(struct
> > > vfio_iommu *iommu,
> > >   return ret;
> > >  }
> > >
> > > +static int vfio_iommu_get_stage1_format(struct vfio_iommu *iommu,
> > > +  u32 *stage1_format)
> > > +{
> > > + struct vfio_domain *domain;
> > > + u32 format = 0, tmp_format = 0;
> > > + int ret;
> > > +
> > > + mutex_lock(&iommu->lock);
> > > + if (list_empty(&iommu->domain_list)) {
> > > + mutex_unlock(&iommu->lock);
> > > + return -EINVAL;
> > > + }
> > > +
> > > + list_for_each_entry(domain, &iommu->domain_list, next) {
> > > + if (iommu_domain_get_attr(domain->domain,
> > > + DOMAIN_ATTR_PASID_FORMAT, &format)) {
> > > + ret = -EINVAL;
> > > + format = 0;
> > > + goto out_unlock;
> > > + }
> > > + /*
> > > +  * format is always non-zero (the first format is
> > > +  * IOMMU_PASID_FORMAT_INTEL_VTD which is 1). For
> > > +  * the reason of potential different backed IOMMU
> > > +  * formats, here we expect to have identical formats
> > > +  * in the domain list, no mixed formats support.
> > > +  * return -EINVAL to fail the attempt of setup
> > > +  * VFIO_TYPE1_NESTING_IOMMU if non-identical formats
> > > +  * are detected.
> > > +  */
> > > + if (tmp_format && tmp_format != format) {
> > > + ret = -EINVAL;
> > > + format = 0;
> > > + goto out_unlock;
> > > + }
> > > +
> > > + tmp_format = format;
> > > + }
> >
> > this path is invoked only in VFIO_IOMMU_GET_INFO path. If we don't want
> to
> > assume the status quo that one container holds only one device w/
> vIOMMU
> > (the prerequisite for vSVA), looks we also need check the format
> > compatibility when attaching a new group to this container?
> 
> right. if attaching to a nesting type container (the vfio_iommu.nesting bit
> indicates it), it should check whether it is compatible with prior domains in
> the domain list.
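
For illustration, such an attach-time check could mirror the GET_INFO path
above; a minimal sketch, assuming a hypothetical helper called when a group
is attached to a nesting-type container (names are illustrative, not from
the series):

    /*
     * Illustrative only: verify a newly attached domain reports the
     * same stage-1 PASID format as the domains already in the
     * container, mirroring the GET_INFO check above.
     */
    static int vfio_iommu_check_stage1_format(struct vfio_iommu *iommu,
                                              struct iommu_domain *new_dom)
    {
            struct vfio_domain *d;
            u32 format, prev;

            if (iommu_domain_get_attr(new_dom, DOMAIN_ATTR_PASID_FORMAT,
                                      &format))
                    return -EINVAL;

            list_for_each_entry(d, &iommu->domain_list, next) {
                    if (iommu_domain_get_attr(d->domain,
                                              DOMAIN_ATTR_PASID_FORMAT,
                                              &prev))
                            return -EINVAL;
                    if (prev != format)
                            return -EINVAL; /* no mixed formats */
            }
            return 0;
    }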

RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace

2020-04-01 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Wednesday, April 1, 2020 4:07 PM
> 
> > From: Tian, Kevin 
> > Sent: Wednesday, April 1, 2020 3:56 PM
> > To: Liu, Yi L ; alex.william...@redhat.com;
> > Subject: RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to
> > userspace
> >
> > > From: Liu, Yi L 
> > > Sent: Wednesday, April 1, 2020 3:38 PM
> > >
> > >  > From: Tian, Kevin 
> > > > Sent: Monday, March 30, 2020 7:49 PM
> > > > To: Liu, Yi L ; alex.william...@redhat.com;
> > > > Subject: RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1
> > > > format to userspace
> > > >
> > > > > From: Liu, Yi L 
> > > > > Sent: Sunday, March 22, 2020 8:32 PM
> > > > >
> > > > > From: Liu Yi L 
> > > > >
> > > > > VFIO exposes IOMMU nesting translation (a.k.a dual stage
> > > > > translation) capability to userspace. Thus applications like QEMU
> > > > > could support vIOMMU with hardware's nesting translation
> > > > > capability for pass-through devices. Before setting up nesting
> > > > > translation for pass-through devices, QEMU and other applications
> > > > > need to learn the supported
> > > > > 1st-lvl/stage-1 translation structure format like page table format.
> > > > >
> > > > > Take vSVA (virtual Shared Virtual Addressing) as an example, to
> > > > > support vSVA for pass-through devices, QEMU setup nesting
> > > > > translation for pass- through devices. The guest page table are
> > > > > configured to host as 1st-lvl/
> > > > > stage-1 page table. Therefore, guest format should be compatible
> > > > > with host side.
> > > > >
> > > > > This patch reports the supported 1st-lvl/stage-1 page table format
> > > > > on the current platform to userspace. QEMU and other alike
> > > > > applications should use this format info when trying to setup
> > > > > IOMMU nesting translation on host IOMMU.
> > > > >
> > > > > Cc: Kevin Tian 
> > > > > CC: Jacob Pan 
> > > > > Cc: Alex Williamson 
> > > > > Cc: Eric Auger 
> > > > > Cc: Jean-Philippe Brucker 
> > > > > Signed-off-by: Liu Yi L 
> > > > > ---
> > > > >  drivers/vfio/vfio_iommu_type1.c | 56
> > > > > +
> > > > >  include/uapi/linux/vfio.h   |  1 +
> > > > >  2 files changed, 57 insertions(+)
> > > > >
> > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > > > b/drivers/vfio/vfio_iommu_type1.c index 9aa2a67..82a9e0b 100644
> > > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > > @@ -2234,11 +2234,66 @@ static int
> > > vfio_iommu_type1_pasid_free(struct
> > > > > vfio_iommu *iommu,
> > > > >   return ret;
> > > > >  }
> > > > >
> > > > > +static int vfio_iommu_get_stage1_format(struct vfio_iommu *iommu,
> > > > > +  u32 *stage1_format)
> > > > > +{
> > > > > + struct vfio_domain *domain;
> > > > > + u32 format = 0, tmp_format = 0;
> > > > > + int ret;
> > > > > +
> > > > > + mutex_lock(&iommu->lock);
> > > > > + if (list_empty(&iommu->domain_list)) {
> > > > > + mutex_unlock(&iommu->lock);
> > > > > + return -EINVAL;
> > > > > + }
> > > > > +
> > > > > + list_for_each_entry(domain, &iommu->domain_list, next) {
> > > > > + if (iommu_domain_get_attr(domain->domain,
> > > > > + DOMAIN_ATTR_PASID_FORMAT, &format)) {
> > > > > + ret = -EINVAL;
> > > > > + format = 0;
> > > > > + goto out_unlock;
> > > > > + }
> > > > > + /*
> > > > > +  * format is always non-zero (the first format is
> > > > > +  * IOMMU_PASID_FORMAT_INTEL_VTD which is 1).
> For
> > > > > +  * the reason of potential different

RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host

2020-04-01 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Wednesday, April 1, 2020 5:13 PM
> 
> > From: Tian, Kevin 
> > Sent: Monday, March 30, 2020 8:46 PM
> > Subject: RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
> >
> > > From: Liu, Yi L 
> > > Sent: Sunday, March 22, 2020 8:32 PM
> > >
> > > From: Liu Yi L 
> > >
> > > VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by
> hardware
> > > IOMMUs that have nesting DMA translation (a.k.a dual stage address
> > > translation). For such hardware IOMMUs, there are two stages/levels of
> > > address translation, and software may let userspace/VM to own the
> > > first-
> > > level/stage-1 translation structures. Example of such usage is vSVA (
> > > virtual Shared Virtual Addressing). VM owns the first-level/stage-1
> > > translation structures and bind the structures to host, then hardware
> > > IOMMU would utilize nesting translation when doing DMA translation fo
> > > the devices behind such hardware IOMMU.
> > >
> > > This patch adds vfio support for binding guest translation (a.k.a
> > > stage 1) structure to host iommu. And for VFIO_TYPE1_NESTING_IOMMU,
> > > not only bind guest page table is needed, it also requires to expose
> > > interface to guest for iommu cache invalidation when guest modified
> > > the first-level/stage-1 translation structures since hardware needs to
> > > be notified to flush stale iotlbs. This would be introduced in next
> > > patch.
> > >
> > > In this patch, guest page table bind and unbind are done by using
> > > flags VFIO_IOMMU_BIND_GUEST_PGTBL and
> > VFIO_IOMMU_UNBIND_GUEST_PGTBL
> > > under IOCTL VFIO_IOMMU_BIND, the bind/unbind data are conveyed by
> > > struct iommu_gpasid_bind_data. Before binding guest page table to
> > > host, VM should have got a PASID allocated by host via
> > > VFIO_IOMMU_PASID_REQUEST.
> > >
> > > Bind guest translation structures (here is guest page table) to host
> >
> > Bind -> Binding
> got it.
> > > are the first step to setup vSVA (Virtual Shared Virtual Addressing).
> >
> > are -> is. and you already explained vSVA earlier.
> oh yes, it is.
> > >
> > > Cc: Kevin Tian 
> > > CC: Jacob Pan 
> > > Cc: Alex Williamson 
> > > Cc: Eric Auger 
> > > Cc: Jean-Philippe Brucker 
> > > Signed-off-by: Jean-Philippe Brucker 
> > > Signed-off-by: Liu Yi L 
> > > Signed-off-by: Jacob Pan 
> > > ---
> > >  drivers/vfio/vfio_iommu_type1.c | 158
> > > 
> > >  include/uapi/linux/vfio.h   |  46 
> > >  2 files changed, 204 insertions(+)
> > >
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > b/drivers/vfio/vfio_iommu_type1.c index 82a9e0b..a877747 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -130,6 +130,33 @@ struct vfio_regions {
> > >  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)  \
> > >   (!list_empty(&iommu->domain_list))
> > >
> > > +struct domain_capsule {
> > > + struct iommu_domain *domain;
> > > + void *data;
> > > +};
> > > +
> > > +/* iommu->lock must be held */
> > > +static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
> > > +   int (*fn)(struct device *dev, void *data),
> > > +   void *data)
> > > +{
> > > + struct domain_capsule dc = {.data = data};
> > > + struct vfio_domain *d;
> > > + struct vfio_group *g;
> > > + int ret = 0;
> > > +
> > > + list_for_each_entry(d, &iommu->domain_list, next) {
> > > + dc.domain = d->domain;
> > > + list_for_each_entry(g, &d->group_list, next) {
> > > + ret = iommu_group_for_each_dev(g->iommu_group,
> > > +&dc, fn);
> > > + if (ret)
> > > + break;
> > > + }
> > > + }
> > > + return ret;
> > > +}
> > > +
> > >  static int put_pfn(unsigned long pfn, int prot);
> > >
> > >  /*
> > > @@ -2314,6 +2341,88 @@ static int
> > > vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> > >   return 0;
> > >  }
> > >
> > > +sta

RE: [PATCH V10 11/11] iommu/vt-d: Add custom allocator for IOASID

2020-04-01 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Wednesday, April 1, 2020 11:48 PM
> 
> On Sat, 28 Mar 2020 10:22:41 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Jacob Pan 
> > > Sent: Saturday, March 21, 2020 7:28 AM
> > >
> > > When VT-d driver runs in the guest, PASID allocation must be
> > > performed via virtual command interface. This patch registers a
> > > custom IOASID allocator which takes precedence over the default
> > > XArray based allocator. The resulting IOASID allocation will always
> > > come from the host. This ensures that PASID namespace is system-
> > > wide.
> > >
> > > Signed-off-by: Lu Baolu 
> > > Signed-off-by: Liu, Yi L 
> > > Signed-off-by: Jacob Pan 
> > > ---
> > >  drivers/iommu/intel-iommu.c | 84
> > > +
> > >  include/linux/intel-iommu.h |  2 ++
> > >  2 files changed, 86 insertions(+)
> > >
> > > diff --git a/drivers/iommu/intel-iommu.c
> > > b/drivers/iommu/intel-iommu.c index a76afb0fd51a..c1c0b0fb93c3
> > > 100644 --- a/drivers/iommu/intel-iommu.c
> > > +++ b/drivers/iommu/intel-iommu.c
> > > @@ -1757,6 +1757,9 @@ static void free_dmar_iommu(struct
> intel_iommu
> > > *iommu)
> > >   if (ecap_prs(iommu->ecap))
> > >   intel_svm_finish_prq(iommu);
> > >   }
> > > + if (ecap_vcs(iommu->ecap) && vccap_pasid(iommu->vccap))
> > > +
> > > ioasid_unregister_allocator(&iommu->pasid_allocator); +
> > >  #endif
> > >  }
> > >
> > > @@ -3291,6 +3294,84 @@ static int copy_translation_tables(struct
> > > intel_iommu *iommu)
> > >   return ret;
> > >  }
> > >
> > > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > > +static ioasid_t intel_ioasid_alloc(ioasid_t min, ioasid_t max,
> > > void *data)
> >
> > the name is too generic... can we add vcmd in the name to clarify
> > its purpose, e.g. intel_vcmd_ioasid_alloc?
> >
> I feel the intel_ prefix is a natural extension of a generic API, we do
> that for other IOMMU APIs, right?

other IOMMU APIs behave the same on host and guest, but this one
applies only to a guest with the vcmd interface.

> 
> > > +{
> > > + struct intel_iommu *iommu = data;
> > > + ioasid_t ioasid;
> > > +
> > > + if (!iommu)
> > > + return INVALID_IOASID;
> > > + /*
> > > +  * VT-d virtual command interface always uses the full 20
> > > bit
> > > +  * PASID range. Host can partition guest PASID range based
> > > on
> > > +  * policies but it is out of guest's control.
> > > +  */
> > > + if (min < PASID_MIN || max > intel_pasid_max_id)
> > > + return INVALID_IOASID;
> > > +
> > > + if (vcmd_alloc_pasid(iommu, &ioasid))
> > > + return INVALID_IOASID;
> > > +
> > > + return ioasid;
> > > +}
> > > +
> > > +static void intel_ioasid_free(ioasid_t ioasid, void *data)
> > > +{
> > > + struct intel_iommu *iommu = data;
> > > +
> > > + if (!iommu)
> > > + return;
> > > + /*
> > > +  * Sanity check the ioasid owner is done at upper layer,
> > > e.g. VFIO
> > > +  * We can only free the PASID when all the devices are
> > > unbound.
> > > +  */
> > > + if (ioasid_find(NULL, ioasid, NULL)) {
> > > + pr_alert("Cannot free active IOASID %d\n", ioasid);
> > > + return;
> > > + }
> >
> > However the sanity check is not done in default_free. Is there a
> > reason why using vcmd adds such  new requirement?
> >
> Since we don't support nested guest. This vcmd allocator is only used
> by the guest IOMMU driver not VFIO. We expect IOMMU driver to have
> control of the free()/unbind() ordering.
> 
> For default_free, it can come from user space and host VFIO which can
> be out of order. But we will solve that issue with the blocking
> notifier.
> 
> > > + vcmd_free_pasid(iommu, ioasid);
> > > +}
> > > +
> > > +static void register_pasid_allocator(struct intel_iommu *iommu)
> > > +{
> > > + /*
> > > +  * If we are running in the host, no need for custom
> > > allocator
> > > +  * in that PASIDs are allocated from the host system-wide.
> > > +  */
> > > + if (!cap_caching_mode(iommu->cap))
> > > + return;
> >
> >

RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

2020-04-02 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Friday, April 3, 2020 1:50 AM
> 
> On Sun, 22 Mar 2020 05:31:58 -0700
> "Liu, Yi L"  wrote:
> 
> > From: Liu Yi L 
> >
> > For a long time, devices have only one DMA address space from platform
> > IOMMU's point of view. This is true for both bare metal and directed-
> > access in virtualization environment. Reason is the source ID of DMA in
> > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> > DMA isolation. However, this is changing with the latest advancement in
> > I/O technology area. More and more platform vendors are utilizing the
> PCIe
> > PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> > address spaces as identified by their individual PASIDs. For example,
> > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > let device access multiple process virtual address space by binding the
> > virtual address space with a PASID. Wherein the PASID is allocated in
> > software and programmed to device per device specific manner. Devices
> > which support PASID capability are called PASID-capable devices. If such
> > devices are passed through to VMs, guest software are also able to bind
> > guest process virtual address space on such devices. Therefore, the guest
> > software could reuse the bare metal software programming model, which
> > means guest software will also allocate PASID and program it to device
> > directly. This is a dangerous situation since it has potential PASID
> > conflicts and unauthorized address space access. It would be safer to
> > let host intercept in the guest software's PASID allocation. Thus PASID
> > are managed system-wide.
> 
> Providing an allocation interface only allows for collaborative usage
> of PASIDs though.  Do we have any ability to enforce PASID usage or can
> a user spoof other PASIDs on the same BDF?

A user can access only the PASIDs allocated to itself, i.e. those in the
specific IOASID set tied to its mm_struct.
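
For illustration, the ownership check could look roughly like this
(hypothetical helper and fields; the actual tracking lives in the
vfio_mm/ioasid code of the series):

    /*
     * Illustrative only: honor a free request only if the PASID was
     * allocated from the IOASID set tied to the caller's mm.
     * ioasid_find() returns ERR_PTR(-ENOENT) when the PASID is not
     * found in the given set.
     */
    static int vfio_mm_pasid_free(struct vfio_mm *vmm, u32 pasid)
    {
            if (IS_ERR(ioasid_find(vmm->ioasid_set, pasid, NULL)))
                    return -EINVAL; /* not allocated to this VM */

            ioasid_free(pasid);
            return 0;
    }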

Thanks
Kevin

> 
> > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to
> passdown
> > PASID allocation/free request from the virtual IOMMU. Additionally, such
> > requests are intended to be invoked by QEMU or other applications which
> > are running in userspace, it is necessary to have a mechanism to prevent
> > single application from abusing available PASIDs in system. With such
> > consideration, this patch tracks the VFIO PASID allocation per-VM. There
> > was a discussion to make quota to be per assigned devices. e.g. if a VM
> > has many assigned devices, then it should have more quota. However, it
> > is not sure how many PASIDs an assigned devices will use. e.g. it is
> > possible that a VM with multiples assigned devices but requests less
> > PASIDs. Therefore per-VM quota would be better.
> >
> > This patch uses struct mm pointer as a per-VM token. We also considered
> > using task structure pointer and vfio_iommu structure pointer. However,
> > task structure is per-thread, which means it cannot achieve per-VM PASID
> > alloc tracking purpose. While for vfio_iommu structure, it is visible
> > only within vfio. Therefore, structure mm pointer is selected. This patch
> > adds a structure vfio_mm. A vfio_mm is created when the first vfio
> > container is opened by a VM. On the reverse order, vfio_mm is free when
> > the last vfio container is released. Each VM is assigned with a PASID
> > quota, so that it is not able to request PASID beyond its quota. This
> > patch adds a default quota of 1000. This quota could be tuned by
> > administrator. Making PASID quota tunable will be added in another patch
> > in this series.
> >
> > Previous discussions:
> > https://patchwork.kernel.org/patch/11209429/
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Yi Sun 
> > Signed-off-by: Jacob Pan 
> > ---
> >  drivers/vfio/vfio.c | 130
> 
> >  drivers/vfio/vfio_iommu_type1.c | 104
> 
> >  include/linux/vfio.h|  20 +++
> >  include/uapi/linux/vfio.h   |  41 +
> >  4 files changed, 295 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index c848262..d13b483 100644
> > --- a/drivers/vfio/vfio.c
> > +++ b/drivers/vfio/vfio.c
> > @@ -32,6 +32,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  #define DRIVER_VERSION "0.3"
> >  #define DRIVER_AUTHOR  "Alex Williamson
> "
> > @@ -46,6 +47,8 @@ static struct vfio {
> > struct mutexgroup_lock;
> > struct cdev group_cdev;
> > dev_t   group_devt;
> > +   struct list_headvfio_mm_list;
> > +   struct mutexvfio_mm_lock;
> > wait_queue_head_t   release_q;
> >  } vfio;
> >
> > @@ -2129,6 +2132,131

RE: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE

2020-04-02 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Friday, April 3, 2020 4:24 AM
> 
> On Sun, 22 Mar 2020 05:32:04 -0700
> "Liu, Yi L"  wrote:
> 
> > From: Liu Yi L 
> >
> > For VFIO IOMMUs with the type VFIO_TYPE1_NESTING_IOMMU, guest
> "owns" the
> > first-level/stage-1 translation structures, the host IOMMU driver has no
> > knowledge of first-level/stage-1 structure cache updates unless the guest
> > invalidation requests are trapped and propagated to the host.
> >
> > This patch adds a new IOCTL VFIO_IOMMU_CACHE_INVALIDATE to
> propagate guest
> > first-level/stage-1 IOMMU cache invalidations to host to ensure IOMMU
> cache
> > correctness.
> >
> > With this patch, vSVA (Virtual Shared Virtual Addressing) can be used safely
> > as the host IOMMU iotlb correctness are ensured.
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Eric Auger 
> > Signed-off-by: Jacob Pan 
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 49
> +
> >  include/uapi/linux/vfio.h   | 22 ++
> >  2 files changed, 71 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> > index a877747..937ec3f 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -2423,6 +2423,15 @@ static long
> vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
> > return ret;
> >  }
> >
> > +static int vfio_cache_inv_fn(struct device *dev, void *data)
> > +{
> > +   struct domain_capsule *dc = (struct domain_capsule *)data;
> > +   struct iommu_cache_invalidate_info *cache_inv_info =
> > +   (struct iommu_cache_invalidate_info *) dc->data;
> > +
> > +   return iommu_cache_invalidate(dc->domain, dev, cache_inv_info);
> > +}
> > +
> >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> >unsigned int cmd, unsigned long arg)
> >  {
> > @@ -2629,6 +2638,46 @@ static long vfio_iommu_type1_ioctl(void
> *iommu_data,
> > }
> > kfree(gbind_data);
> > return ret;
> > +   } else if (cmd == VFIO_IOMMU_CACHE_INVALIDATE) {
> > +   struct vfio_iommu_type1_cache_invalidate cache_inv;
> > +   u32 version;
> > +   int info_size;
> > +   void *cache_info;
> > +   int ret;
> > +
> > +   minsz = offsetofend(struct
> vfio_iommu_type1_cache_invalidate,
> > +   flags);
> 
> This breaks backward compatibility as soon as struct
> iommu_cache_invalidate_info changes size by its defined versioning
> scheme.  ie. a field gets added, the version is bumped, all existing
> userspace breaks.  Our minsz is offsetofend to the version field,
> interpret the version to size, then reevaluate argsz.

btw the version scheme has been challenged by Christoph Hellwig. After
some discussion, we need your guidance on how to move forward. Jacob
summarized the available options here:
https://lkml.org/lkml/2020/4/2/876
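
The flow you suggest would look roughly like this (a sketch only, leaving
the versioning question above open):

    /* Sketch: copy up to flags, then size the payload by its version. */
    minsz = offsetofend(struct vfio_iommu_type1_cache_invalidate, flags);
    if (copy_from_user(&cache_inv, (void __user *)arg, minsz))
            return -EFAULT;
    if (cache_inv.argsz < minsz || cache_inv.flags)
            return -EINVAL;

    if (copy_from_user(&version, (void __user *)(arg + minsz),
                       sizeof(version)))
            return -EFAULT;

    info_size = iommu_uapi_get_data_size(IOMMU_UAPI_CACHE_INVAL, version);

    /* reevaluate argsz now that the versioned payload size is known */
    if (cache_inv.argsz < minsz + info_size)
            return -EINVAL;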

> 
> > +
> > +   if (copy_from_user(&cache_inv, (void __user *)arg, minsz))
> > +   return -EFAULT;
> > +
> > +   if (cache_inv.argsz < minsz || cache_inv.flags)
> > +   return -EINVAL;
> > +
> > +   /* Get the version of struct iommu_cache_invalidate_info */
> > +   if (copy_from_user(&version,
> > +   (void __user *) (arg + minsz), sizeof(version)))
> > +   return -EFAULT;
> > +
> > +   info_size = iommu_uapi_get_data_size(
> > +   IOMMU_UAPI_CACHE_INVAL,
> version);
> > +
> > +   cache_info = kzalloc(info_size, GFP_KERNEL);
> > +   if (!cache_info)
> > +   return -ENOMEM;
> > +
> > +   if (copy_from_user(cache_info,
> > +   (void __user *) (arg + minsz), info_size)) {
> > +   kfree(cache_info);
> > +   return -EFAULT;
> > +   }
> > +
> > +   mutex_lock(&iommu->lock);
> > +   ret = vfio_iommu_for_each_dev(iommu, vfio_cache_inv_fn,
> > +   cache_info);
> 
> How does a user respond when their cache invalidate fails?  Isn't this
> also another case where our for_each_dev can fail at an arbitrary point
> leaving us with no idea whether each device even had the opportunity to
> perform the invalidation request.  I don't see how we have any chance
> to maintain coherency after this faults.

Then can we keep it simple and support only singleton groups?
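
If we went that way, the restriction itself would be cheap to enforce; a
rough sketch (hypothetical helpers, not from the series):

    static int count_dev(struct device *dev, void *data)
    {
            int *count = data;

            (*count)++;
            return 0;
    }

    /* Illustrative only: true if every group holds exactly one device. */
    static bool vfio_iommu_is_singleton(struct vfio_iommu *iommu)
    {
            struct vfio_domain *d;
            struct vfio_group *g;

            list_for_each_entry(d, &iommu->domain_list, next) {
                    list_for_each_entry(g, &d->group_list, next) {
                            int count = 0;

                            iommu_group_for_each_dev(g->iommu_group,
                                                     &count, count_dev);
                            if (count != 1)
                                    return false;
                    }
            }
            return true;
    }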

> 
> > +   mutex_unlock(&iommu->lock);
> > +   kfree(cache_info);
> > +   return ret;
> > }
> >
> > return -ENOTTY;
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 2235bc6..62ca791 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -899,6 +899,28 @@ struct vfio_iommu_

RE: [PATCH v1 2/2] vfio/pci: Emulate PASID/PRI capability for VFs

2020-04-06 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Saturday, April 4, 2020 1:26 AM
[...]
> > > > +   if (!pasid_cap.control_reg.paside) {
> > > > +   pr_debug("%s: its PF's PASID capability is not 
> > > > enabled\n",
> > > > +   dev_name(&vdev->pdev->dev));
> > > > +   ret = 0;
> > > > +   goto out;
> > > > +   }
> > >
> > > What happens if the PF's PASID gets disabled while we're using it??
> >
> > This is actually the open I highlighted in cover letter. Per the reply
> > from Baolu, this seems to be an open for bare-metal all the same.
> > https://lkml.org/lkml/2020/3/31/95
> 
> Seems that needs to get sorted out before we can expose this.  Maybe
> some sort of registration with the PF driver that PASID is being used
> by a VF so it cannot be disabled?

I guess we may do vSVA for the PF first, and then add VF vSVA later
given the additional need above. It's not necessary to enable both
in one step.

[...]
> > > > @@ -1604,6 +1901,18 @@ static int vfio_ecap_init(struct
> vfio_pci_device *vdev)
> > > > if (!ecaps)
> > > > *(u32 *)&vdev->vconfig[PCI_CFG_SPACE_SIZE] = 0;
> > > >
> > > > +#ifdef CONFIG_PCI_ATS
> > > > +   if (pdev->is_virtfn) {
> > > > +   struct pci_dev *physfn = pdev->physfn;
> > > > +
> > > > +   ret = vfio_pci_add_emulated_cap_for_vf(vdev,
> > > > +   physfn, epos_max, prev);
> > > > +   if (ret)
> > > > +   pr_info("%s, failed to add special caps for VF 
> > > > %s\n",
> > > > +   __func__, dev_name(&vdev->pdev->dev));
> > > > +   }
> > > > +#endif
> > >
> > > I can only imagine that we should place the caps at the same location
> > > they exist on the PF, we don't know what hidden registers might be
> > > hiding in config space.

Is there any vendor guarantee that hidden registers will be located at
the same offset in PF and VF config space?

> >
> > but we are not sure whether the same location is available on VF. In
> > this patch, it actually places the emulated cap physically behind the
> > cap which lays farthest (its offset is largest) within VF's config space
> > as the PCIe caps are linked in a chain.
> 
> But, as we've found on Broadcom NICs (iirc), hardware developers have a
> nasty habit of hiding random registers in PCI config space, outside of
> defined capabilities.  I feel like IGD might even do this too, is that
> true?  So I don't think we can guarantee that just because a section of
> config space isn't part of a defined capability that its unused.  It
> only means that it's unused by common code, but it might have device
> specific purposes.  So of the PCIe spec indicates that VFs cannot
> include these capabilities and virtialization software needs to
> emulate them, we need somewhere safe to place them in config space, and
> simply placing them off the end of known capabilities doesn't give me
> any confidence.  Also, hardware has no requirement to make compact use
> of extended config space.  The first capability must be at 0x100, the
> very next capability could consume all the way to the last byte of the
> 4K extended range, and the next link in the chain could be somewhere in
> the middle.  Thanks,
> 

Then what would be a viable option? Vendors' nasty habits imply there
is no standard, so I don't see how VFIO can find a safe location by
itself. I'm also curious how those hidden registers are identified by
VFIO and handled with a proper r/w policy today. If some sort of quirk
is used, could the quirk mechanism be extended to also carry information
about a vendor-specific safe location? When no such quirk info is
provided (the majority case), VFIO would then find a free location to
carry the new cap.
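
If such a quirk mechanism were added, it could be as simple as a lookup
table consulted before scanning for a free offset (purely hypothetical
sketch):

    /* Hypothetical quirk table: vendor-declared safe config offsets. */
    struct vfio_ecap_quirk {
            u16 vendor;
            u16 device;
            u16 pasid_offset; /* safe spot for the emulated PASID cap */
            u16 pri_offset;   /* safe spot for the emulated PRI cap */
    };

    static const struct vfio_ecap_quirk vfio_ecap_quirks[] = {
            /* { 0x8086, 0x1234, 0x300, 0x380 }, example entry only */
    };

    static const struct vfio_ecap_quirk *
    vfio_find_ecap_quirk(struct pci_dev *pdev)
    {
            int i;

            for (i = 0; i < ARRAY_SIZE(vfio_ecap_quirks); i++)
                    if (vfio_ecap_quirks[i].vendor == pdev->vendor &&
                        vfio_ecap_quirks[i].device == pdev->device)
                            return &vfio_ecap_quirks[i];
            return NULL; /* fall back to finding a free location */
    }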

Thanks
Kevin


RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

2020-04-06 Thread Tian, Kevin
> From: Alex Williamson
> Sent: Friday, April 3, 2020 11:14 PM
> 
> On Fri, 3 Apr 2020 05:58:55 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Alex Williamson 
> > > Sent: Friday, April 3, 2020 1:50 AM
> > >
> > > On Sun, 22 Mar 2020 05:31:58 -0700
> > > "Liu, Yi L"  wrote:
> > >
> > > > From: Liu Yi L 
> > > >
> > > > For a long time, devices have only one DMA address space from
> platform
> > > > IOMMU's point of view. This is true for both bare metal and directed-
> > > > access in virtualization environment. Reason is the source ID of DMA in
> > > > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> > > > DMA isolation. However, this is changing with the latest advancement in
> > > > I/O technology area. More and more platform vendors are utilizing the
> > > PCIe
> > > > PASID TLP prefix in DMA requests, thus to give devices with multiple
> DMA
> > > > address spaces as identified by their individual PASIDs. For example,
> > > > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > > > let device access multiple process virtual address space by binding the
> > > > virtual address space with a PASID. Wherein the PASID is allocated in
> > > > software and programmed to device per device specific manner.
> Devices
> > > > which support PASID capability are called PASID-capable devices. If such
> > > > devices are passed through to VMs, guest software are also able to bind
> > > > guest process virtual address space on such devices. Therefore, the
> guest
> > > > software could reuse the bare metal software programming model,
> which
> > > > means guest software will also allocate PASID and program it to device
> > > > directly. This is a dangerous situation since it has potential PASID
> > > > conflicts and unauthorized address space access. It would be safer to
> > > > let host intercept in the guest software's PASID allocation. Thus PASID
> > > > are managed system-wide.
> > >
> > > Providing an allocation interface only allows for collaborative usage
> > > of PASIDs though.  Do we have any ability to enforce PASID usage or can
> > > a user spoof other PASIDs on the same BDF?
> >
> > An user can access only PASIDs allocated to itself, i.e. the specific IOASID
> > set tied to its mm_struct.
> 
> A user is only _supposed_ to access PASIDs allocated to itself.  AIUI
> the mm_struct is used for managing the pool of IOASIDs from which the
> user may allocate that PASID.  We also state that programming the PASID
> into the device is device specific.  Therefore, are we simply trusting
> the user to use a PASID that's been allocated to them when they program
> the device?  If a user can program an arbitrary PASID into the device,
> then what prevents them from attempting to access data from another
> user via the device?   I think I've asked this question before, so if
> there's a previous explanation or spec section I need to review, please
> point me to it.  Thanks,
> 

There are two scenarios:

(1) for PF/VF, the iommu driver maintains an individual PASID table per
BDF. Although the PASID namespace is global, the per-BDF PASID table
contains valid entries only for those PASIDs which are allocated to the
mm_struct. The user is free to program an arbitrary PASID into the assigned
device, but using an invalid PASID simply hits an iommu fault.

(2) for mdev, multiple mdev instances share the PASID table of the
parent BDF. However, PASID programming is a privileged operation in this
multiplexing usage, and thus must be mediated by the mdev device driver.
The mediation logic guarantees that only allocated PASIDs are
forwarded to the device.
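
For case (2), the mediation could be as simple as the parent driver
validating the guest-programmed value before it reaches hardware
(illustrative sketch; the state structure, register offset and ioasid_set
plumbing are hypothetical):

    /*
     * Illustrative only: the mdev parent driver intercepts the PASID
     * write and forwards it only if the PASID belongs to this user.
     */
    static int mdev_program_pasid(struct my_mdev_state *mstate, u32 pasid)
    {
            if (IS_ERR(ioasid_find(mstate->ioasid_set, pasid, NULL)))
                    return -EINVAL; /* not allocated to this user */

            writel(pasid, mstate->mmio_base + MY_DEV_PASID_REG);
            return 0;
    }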

Thanks
Kevin


RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

2020-04-06 Thread Tian, Kevin
> From: Alex Williamson
> Sent: Saturday, April 4, 2020 1:50 AM
[...]
> > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > > index 9e843a1..298ac80 100644
> > > > --- a/include/uapi/linux/vfio.h
> > > > +++ b/include/uapi/linux/vfio.h
> > > > @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
> > > >  #define VFIO_IOMMU_ENABLE  _IO(VFIO_TYPE, VFIO_BASE + 15)
> > > >  #define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16)
> > > >
> > > > +/*
> > > > + * PASID (Process Address Space ID) is a PCIe concept which
> > > > + * has been extended to support DMA isolation in fine-grain.
> > > > + * With device assigned to user space (e.g. VMs), PASID alloc
> > > > + * and free need to be system wide. This structure defines
> > > > + * the info for pasid alloc/free between user space and kernel
> > > > + * space.
> > > > + *
> > > > + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> > > > + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> > > > + */
> > > > +struct vfio_iommu_type1_pasid_request {
> > > > +   __u32   argsz;
> > > > +#define VFIO_IOMMU_PASID_ALLOC (1 << 0)
> > > > +#define VFIO_IOMMU_PASID_FREE  (1 << 1)
> > > > +   __u32   flags;
> > > > +   union {
> > > > +   struct {
> > > > +   __u32 min;
> > > > +   __u32 max;
> > > > +   __u32 result;
> > > > +   } alloc_pasid;
> > > > +   __u32 free_pasid;
> > > > +   };
> > >
> > > We seem to be using __u8 data[] lately where the struct at data is
> > > defined by the flags.  should we do that here?
> >
> > yeah, I can do that. BTW. Do you want to let the structure in the
> > lately patch share the same structure with this one? As I can foresee,
> > the two structures would look like similar as both of them include
> > argsz, flags and data[] fields. The difference is the definition of
> > flags. what about your opinion?
> >
> > struct vfio_iommu_type1_pasid_request {
> > __u32   argsz;
> > #define VFIO_IOMMU_PASID_ALLOC  (1 << 0)
> > #define VFIO_IOMMU_PASID_FREE   (1 << 1)
> > __u32   flags;
> > __u8data[];
> > };
> >
> > struct vfio_iommu_type1_bind {
> > __u32   argsz;
> > __u32   flags;
> > #define VFIO_IOMMU_BIND_GUEST_PGTBL (1 << 0)
> > #define VFIO_IOMMU_UNBIND_GUEST_PGTBL   (1 << 1)
> > __u8data[];
> > };
> 
> 
> Yes, I was even wondering the same for the cache invalidate ioctl, or
> whether this is going too far for a general purpose "everything related
> to PASIDs" ioctl.  We need to factor usability into the equation too.
> I'd be interested in opinions from others here too.  Clearly I don't
> like single use, throw-away ioctls, but I can find myself on either
> side of the argument that allocation, binding, and invalidating are all
> within the domain of PASIDs and could fall within a single ioctl or
> they each represent different facets of managing PASIDs and should have
> separate ioctls.  Thanks,
> 

Looking at uapi/linux/iommu.h:

 * Invalidations by %IOMMU_INV_GRANU_DOMAIN don't take any argument other than
 * @version and @cache.

Although intel-iommu handles only PASID-related invalidation now, I
suppose other vendors (or future usages?) may allow non-PASID-based
invalidation too, based on the above comment.
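
For example, a domain-granularity invalidation would carry nothing beyond
the header fields (a sketch based on the uapi comment above):

    /* Sketch: domain-selective IOTLB invalidation, no PASID needed. */
    struct iommu_cache_invalidate_info inv_info = {
            .version     = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1,
            .cache       = IOMMU_CACHE_INV_TYPE_IOTLB,
            .granularity = IOMMU_INV_GRANU_DOMAIN,
            /* no @pasid_info/@addr_info at this granularity */
    };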

Thanks
Kevin


RE: [PATCH v1 2/2] vfio/pci: Emulate PASID/PRI capability for VFs

2020-04-07 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Tuesday, April 7, 2020 11:58 PM
> 
> On Tue, 7 Apr 2020 04:26:23 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Alex Williamson 
> > > Sent: Saturday, April 4, 2020 1:26 AM
> > [...]
> > > > > > +   if (!pasid_cap.control_reg.paside) {
> > > > > > +   pr_debug("%s: its PF's PASID capability is not
> enabled\n",
> > > > > > +   dev_name(&vdev->pdev->dev));
> > > > > > +   ret = 0;
> > > > > > +   goto out;
> > > > > > +   }
> > > > >
> > > > > What happens if the PF's PASID gets disabled while we're using it??
> > > >
> > > > This is actually the open I highlighted in cover letter. Per the reply
> > > > from Baolu, this seems to be an open for bare-metal all the same.
> > > > https://lkml.org/lkml/2020/3/31/95
> > >
> > > Seems that needs to get sorted out before we can expose this.  Maybe
> > > some sort of registration with the PF driver that PASID is being used
> > > by a VF so it cannot be disabled?
> >
> > I guess we may do vSVA for PF first, and then adding VF vSVA later
> > given above additional need. It's not necessarily to enable both
> > in one step.
> >
> > [...]
> > > > > > @@ -1604,6 +1901,18 @@ static int vfio_ecap_init(struct
> > > vfio_pci_device *vdev)
> > > > > > if (!ecaps)
> > > > > > *(u32 *)&vdev->vconfig[PCI_CFG_SPACE_SIZE] = 0;
> > > > > >
> > > > > > +#ifdef CONFIG_PCI_ATS
> > > > > > +   if (pdev->is_virtfn) {
> > > > > > +   struct pci_dev *physfn = pdev->physfn;
> > > > > > +
> > > > > > +   ret = vfio_pci_add_emulated_cap_for_vf(vdev,
> > > > > > +   physfn, epos_max, prev);
> > > > > > +   if (ret)
> > > > > > +   pr_info("%s, failed to add special caps for
> VF %s\n",
> > > > > > +   __func__, dev_name(&vdev->pdev-
> >dev));
> > > > > > +   }
> > > > > > +#endif
> > > > >
> > > > > I can only imagine that we should place the caps at the same location
> > > > > they exist on the PF, we don't know what hidden registers might be
> > > > > hiding in config space.
> >
> > Is there vendor guarantee that hidden registers will locate at the
> > same offset between PF and VF config space?
> 
> I'm not sure if the spec really precludes hidden registers, but the
> fact that these registers are explicitly outside of the capability
> chain implies they're only intended for device specific use, so I'd say
> there are no guarantees about anything related to these registers.
> 
> FWIW, vfio started out being more strict about restricting config space
> access to defined capabilities, until...
> 
> commit a7d1ea1c11b33bda2691f3294b4d735ed635535a
> Author: Alex Williamson 
> Date:   Mon Apr 1 09:04:12 2013 -0600
> 
> vfio-pci: Enable raw access to unassigned config space
> 
> Devices like be2net hide registers between the gaps in capabilities
> and architected regions of PCI config space.  Our choices to support
> such devices is to either build an ever growing and unmanageable white
> list or rely on hardware isolation to protect us.  These registers are
> really no different than MMIO or I/O port space registers, which we
> don't attempt to regulate, so treat PCI config space in the same way.
> 
> > > > but we are not sure whether the same location is available on VF. In
> > > > this patch, it actually places the emulated cap physically behind the
> > > > cap which lays farthest (its offset is largest) within VF's config space
> > > > as the PCIe caps are linked in a chain.
> > >
> > > But, as we've found on Broadcom NICs (iirc), hardware developers have a
> > > nasty habit of hiding random registers in PCI config space, outside of
> > > defined capabilities.  I feel like IGD might even do this too, is that
> > > true?  So I don't think we can guarantee that just because a section of
> > > config space isn't part of a defined capability that its unused.  It
> > > only means that it's unused by common code, but it might h

RE: [PATCH v1 2/2] vfio/pci: Emulate PASID/PRI capability for VFs

2020-04-13 Thread Tian, Kevin
> From: Raj, Ashok 
> Sent: Monday, April 13, 2020 11:11 AM
> 
> On Wed, Apr 08, 2020 at 10:19:40AM -0600, Alex Williamson wrote:
> > On Tue, 7 Apr 2020 21:00:21 -0700
> > "Raj, Ashok"  wrote:
> >
> > > Hi Alex
> > >
> > > + Bjorn
> >
> >  + Don
> >
> > > FWIW I can't understand why PCI SIG went different ways with ATS,
> > > where its enumerated on PF and VF. But for PASID and PRI its only
> > > in PF.
> > >
> > > I'm checking with our internal SIG reps to followup on that.
> > >
> > > On Tue, Apr 07, 2020 at 09:58:01AM -0600, Alex Williamson wrote:
> > > > > Is there vendor guarantee that hidden registers will locate at the
> > > > > same offset between PF and VF config space?
> > > >
> > > > I'm not sure if the spec really precludes hidden registers, but the
> > > > fact that these registers are explicitly outside of the capability
> > > > chain implies they're only intended for device specific use, so I'd say
> > > > there are no guarantees about anything related to these registers.
> > >
> > > As you had suggested in the other thread, we could consider
> > > using the same offset as in PF, but even that's a better guess
> > > still not reliable.
> > >
> > > The other option is to maybe extend driver ops in the PF to expose
> > > where the offsets should be. Sort of adding the quirk in the
> > > implementation.
> > >
> > > I'm not sure how prevalent are PASID and PRI in VF devices. If SIG is
> resisting
> > > making VF's first class citizen, we might ask them to add some verbiage
> > > to suggest leave the same offsets as PF open to help emulation software.
> >
> > Even if we know where to expose these capabilities on the VF, it's not
> > clear to me how we can actually virtualize the capability itself.  If
> > the spec defines, for example, an enable bit as r/w then software that
> > interacts with that register expects the bit is settable.  There's no
> > protocol for "try to set the bit and re-read it to see if the hardware
> > accepted it".  Therefore a capability with a fixed enable bit
> > representing the state of the PF, not settable by the VF, is
> > disingenuous to the spec.
> 
> I think we are all in violent agreement. A lot of times the pci spec gets
> defined several years ahead of real products and no one remembers
> the justification on why they restricted things the way they did.
> 
> Maybe someone early product wasn't quite exposing these features to the
> VF
> and hence the spec is bug compatible :-)
> 
> >
> > If what we're trying to do is expose that PASID and PRI are enabled on
> > the PF to a VF driver, maybe duplicating the PF capabilities on the VF
> > without the ability to control it is not the right approach.  Maybe we
> 
> As long as the capability enable is only provided when the PF has enabled
> the feature. Then it seems the hardware seems to do the right thing.
> 
> Assume we expose PASID/PRI only when PF has enabled it. It will be the
> case since the PF driver needs to exist, and IOMMU would have set the
> PASID/PRI/ATS on PF.
> 
> If the emulation is purely spoofing the capability. Once vIOMMU driver
> enables PASID, the context entries for the VF are completely independent
> from the PF context entries.
> 
> vIOMMU would enable PASID, and we just spoof the PASID capability.
> 
> If vIOMMU or guest for some reason does disable_pasid(), then the
> vIOMMU driver can disaable PASID on the VF context entries. So the VF
> although the capability is blanket enabled on PF, IOMMU gaurantees the
> transactions are blocked.
> 
> 
> In the interim, it seems like the intent of the virtual capability
> can be honored via help from the IOMMU for the controlling aspect..
> 
> Did i miss anything?

The above works for emulating the enable bit (under the assumption that
the PF driver won't disable PASID while a VF is assigned). However, there
are also "Execute permission enable" and "Privileged mode enable" bits in
the PASID control register. I don't know how those bits could be cleanly
emulated when the guest writes a value different from the PF's...

A similar problem exists for PRI emulation, e.g. to enable PRI the
software usually waits until the 'stopped' bit is set (indicating all
previously issued requests have completed). How do we emulate this bit
accurately when one guest toggles the enable bit while the PF and other
VFs are actively issuing page requests through the shared page request
interface? From the PCIe spec I didn't find a way to detect when all
previously issued requests from a specific VF have completed. Can a
conservative, big-enough timeout value help here? I don't know... a
similar puzzle exists for emulating the 'reset' control bit, which is
supposed to clear the pending-request state for the whole page request
interface.
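
The timeout idea would amount to something like this on the PF side
(a sketch only; whether any timeout is truly "big enough" is exactly the
open question):

    /*
     * Illustrative only: poll the PF's PRI 'stopped' bit for a bounded
     * time before reflecting it into the emulated VF capability.
     */
    static bool pf_pri_stopped(struct pci_dev *pf, unsigned int timeout_ms)
    {
            int pos = pci_find_ext_capability(pf, PCI_EXT_CAP_ID_PRI);
            u16 status;

            if (!pos)
                    return false;

            while (timeout_ms--) {
                    pci_read_config_word(pf, pos + PCI_PRI_STATUS, &status);
                    if (status & PCI_PRI_STATUS_STOPPED)
                            return true;
                    msleep(1);
            }
            return false;
    }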

I feel the main problem in the PCIe spec is that, while SR-IOV was
invented to address the I/O virtualization requirement (where strict
isolation is required), the boundary was blurred by leaving some resources
shared across the PF and VFs, which implies sort

RE: [PATCH v1 2/2] vfio/pci: Emulate PASID/PRI capability for VFs

2020-04-13 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Monday, April 13, 2020 3:55 PM
> 
> > From: Raj, Ashok 
> > Sent: Monday, April 13, 2020 11:11 AM
> >
> > On Wed, Apr 08, 2020 at 10:19:40AM -0600, Alex Williamson wrote:
> > > On Tue, 7 Apr 2020 21:00:21 -0700
> > > "Raj, Ashok"  wrote:
> > >
> > > > Hi Alex
> > > >
> > > > + Bjorn
> > >
> > >  + Don
> > >
> > > > FWIW I can't understand why PCI SIG went different ways with ATS,
> > > > where its enumerated on PF and VF. But for PASID and PRI its only
> > > > in PF.
> > > >
> > > > I'm checking with our internal SIG reps to followup on that.
> > > >
> > > > On Tue, Apr 07, 2020 at 09:58:01AM -0600, Alex Williamson wrote:
> > > > > > Is there vendor guarantee that hidden registers will locate at the
> > > > > > same offset between PF and VF config space?
> > > > >
> > > > > I'm not sure if the spec really precludes hidden registers, but the
> > > > > fact that these registers are explicitly outside of the capability
> > > > > chain implies they're only intended for device specific use, so I'd 
> > > > > say
> > > > > there are no guarantees about anything related to these registers.
> > > >
> > > > As you had suggested in the other thread, we could consider
> > > > using the same offset as in PF, but even that's a better guess
> > > > still not reliable.
> > > >
> > > > The other option is to maybe extend driver ops in the PF to expose
> > > > where the offsets should be. Sort of adding the quirk in the
> > > > implementation.
> > > >
> > > > I'm not sure how prevalent are PASID and PRI in VF devices. If SIG is
> > resisting
> > > > making VF's first class citizen, we might ask them to add some verbiage
> > > > to suggest leave the same offsets as PF open to help emulation software.
> > >
> > > Even if we know where to expose these capabilities on the VF, it's not
> > > clear to me how we can actually virtualize the capability itself.  If
> > > the spec defines, for example, an enable bit as r/w then software that
> > > interacts with that register expects the bit is settable.  There's no
> > > protocol for "try to set the bit and re-read it to see if the hardware
> > > accepted it".  Therefore a capability with a fixed enable bit
> > > representing the state of the PF, not settable by the VF, is
> > > disingenuous to the spec.
> >
> > I think we are all in violent agreement. A lot of times the pci spec gets
> > defined several years ahead of real products and no one remembers
> > the justification on why they restricted things the way they did.
> >
> > Maybe someone early product wasn't quite exposing these features to the
> > VF
> > and hence the spec is bug compatible :-)
> >
> > >
> > > If what we're trying to do is expose that PASID and PRI are enabled on
> > > the PF to a VF driver, maybe duplicating the PF capabilities on the VF
> > > without the ability to control it is not the right approach.  Maybe we
> >
> > As long as the capability enable is only provided when the PF has enabled
> > the feature. Then it seems the hardware seems to do the right thing.
> >
> > Assume we expose PASID/PRI only when PF has enabled it. It will be the
> > case since the PF driver needs to exist, and IOMMU would have set the
> > PASID/PRI/ATS on PF.
> >
> > If the emulation is purely spoofing the capability. Once vIOMMU driver
> > enables PASID, the context entries for the VF are completely independent
> > from the PF context entries.
> >
> > vIOMMU would enable PASID, and we just spoof the PASID capability.
> >
> > If vIOMMU or guest for some reason does disable_pasid(), then the
> > vIOMMU driver can disaable PASID on the VF context entries. So the VF
> > although the capability is blanket enabled on PF, IOMMU gaurantees the
> > transactions are blocked.
> >
> >
> > In the interim, it seems like the intent of the virtual capability
> > can be honored via help from the IOMMU for the controlling aspect..
> >
> > Did i miss anything?
> 
> Above works for emulating the enable bit (under the assumption that
> PF driver won't disable pasid when vf is assigned). However, there are
> also "Execute permission

RE: [PATCH v1 2/2] vfio/pci: Emulate PASID/PRI capability for VFs

2020-04-13 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Tuesday, April 14, 2020 3:21 AM
> 
> On Mon, 13 Apr 2020 08:05:33 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Tian, Kevin
> > > Sent: Monday, April 13, 2020 3:55 PM
> > >
> > > > From: Raj, Ashok 
> > > > Sent: Monday, April 13, 2020 11:11 AM
> > > >
> > > > On Wed, Apr 08, 2020 at 10:19:40AM -0600, Alex Williamson wrote:
> > > > > On Tue, 7 Apr 2020 21:00:21 -0700
> > > > > "Raj, Ashok"  wrote:
> > > > >
> > > > > > Hi Alex
> > > > > >
> > > > > > + Bjorn
> > > > >
> > > > >  + Don
> > > > >
> > > > > > FWIW I can't understand why PCI SIG went different ways with ATS,
> > > > > > where its enumerated on PF and VF. But for PASID and PRI its only
> > > > > > in PF.
> > > > > >
> > > > > > I'm checking with our internal SIG reps to followup on that.
> > > > > >
> > > > > > On Tue, Apr 07, 2020 at 09:58:01AM -0600, Alex Williamson wrote:
> > > > > > > > Is there vendor guarantee that hidden registers will locate at 
> > > > > > > > the
> > > > > > > > same offset between PF and VF config space?
> > > > > > >
> > > > > > > I'm not sure if the spec really precludes hidden registers, but 
> > > > > > > the
> > > > > > > fact that these registers are explicitly outside of the capability
> > > > > > > chain implies they're only intended for device specific use, so 
> > > > > > > I'd
> say
> > > > > > > there are no guarantees about anything related to these registers.
> > > > > >
> > > > > > As you had suggested in the other thread, we could consider
> > > > > > using the same offset as in PF, but even that's a better guess
> > > > > > still not reliable.
> > > > > >
> > > > > > The other option is to maybe extend driver ops in the PF to expose
> > > > > > where the offsets should be. Sort of adding the quirk in the
> > > > > > implementation.
> > > > > >
> > > > > > I'm not sure how prevalent are PASID and PRI in VF devices. If SIG 
> > > > > > is
> > > > resisting
> > > > > > making VF's first class citizen, we might ask them to add some
> verbiage
> > > > > > to suggest leave the same offsets as PF open to help emulation
> software.
> > > > >
> > > > > Even if we know where to expose these capabilities on the VF, it's not
> > > > > clear to me how we can actually virtualize the capability itself.  If
> > > > > the spec defines, for example, an enable bit as r/w then software that
> > > > > interacts with that register expects the bit is settable.  There's no
> > > > > protocol for "try to set the bit and re-read it to see if the hardware
> > > > > accepted it".  Therefore a capability with a fixed enable bit
> > > > > representing the state of the PF, not settable by the VF, is
> > > > > disingenuous to the spec.
> > > >
> > > > I think we are all in violent agreement. A lot of times the pci spec 
> > > > gets
> > > > defined several years ahead of real products and no one remembers
> > > > the justification on why they restricted things the way they did.
> > > >
> > > > Maybe someone early product wasn't quite exposing these features to
> the
> > > > VF
> > > > and hence the spec is bug compatible :-)
> > > >
> > > > >
> > > > > If what we're trying to do is expose that PASID and PRI are enabled on
> > > > > the PF to a VF driver, maybe duplicating the PF capabilities on the VF
> > > > > without the ability to control it is not the right approach.  Maybe we
> > > >
> > > > As long as the capability enable is only provided when the PF has
> enabled
> > > > the feature. Then it seems the hardware seems to do the right thing.
> > > >
> > > > Assume we expose PASID/PRI only when PF has enabled it. It will be the
> > > > case since the PF driver needs to exist, and IOMMU would have set 

RE: [PATCH v1 2/2] vfio/pci: Emulate PASID/PRI capability for VFs

2020-04-13 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Tuesday, April 14, 2020 11:29 AM
> 
> On Tue, 14 Apr 2020 02:40:58 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Alex Williamson 
> > > Sent: Tuesday, April 14, 2020 3:21 AM
> > >
> > > On Mon, 13 Apr 2020 08:05:33 +0000
> > > "Tian, Kevin"  wrote:
> > >
> > > > > From: Tian, Kevin
> > > > > Sent: Monday, April 13, 2020 3:55 PM
> > > > >
> > > > > > From: Raj, Ashok 
> > > > > > Sent: Monday, April 13, 2020 11:11 AM
> > > > > >
> > > > > > On Wed, Apr 08, 2020 at 10:19:40AM -0600, Alex Williamson wrote:
> > > > > > > On Tue, 7 Apr 2020 21:00:21 -0700
> > > > > > > "Raj, Ashok"  wrote:
> > > > > > >
> > > > > > > > Hi Alex
> > > > > > > >
> > > > > > > > + Bjorn
> > > > > > >
> > > > > > >  + Don
> > > > > > >
> > > > > > > > FWIW I can't understand why PCI SIG went different ways with
> ATS,
> > > > > > > > where its enumerated on PF and VF. But for PASID and PRI its
> only
> > > > > > > > in PF.
> > > > > > > >
> > > > > > > > I'm checking with our internal SIG reps to followup on that.
> > > > > > > >
> > > > > > > > On Tue, Apr 07, 2020 at 09:58:01AM -0600, Alex Williamson
> wrote:
> > > > > > > > > > Is there vendor guarantee that hidden registers will locate 
> > > > > > > > > > at
> the
> > > > > > > > > > same offset between PF and VF config space?
> > > > > > > > >
> > > > > > > > > I'm not sure if the spec really precludes hidden registers, 
> > > > > > > > > but
> the
> > > > > > > > > fact that these registers are explicitly outside of the 
> > > > > > > > > capability
> > > > > > > > > chain implies they're only intended for device specific use, 
> > > > > > > > > so
> I'd
> > > say
> > > > > > > > > there are no guarantees about anything related to these
> registers.
> > > > > > > >
> > > > > > > > As you had suggested in the other thread, we could consider
> > > > > > > > using the same offset as in PF, but even that's a better guess
> > > > > > > > still not reliable.
> > > > > > > >
> > > > > > > > The other option is to maybe extend driver ops in the PF to
> expose
> > > > > > > > where the offsets should be. Sort of adding the quirk in the
> > > > > > > > implementation.
> > > > > > > >
> > > > > > > > I'm not sure how prevalent are PASID and PRI in VF devices. If
> SIG is
> > > > > > resisting
> > > > > > > > making VF's first class citizen, we might ask them to add some
> > > verbiage
> > > > > > > > to suggest leave the same offsets as PF open to help emulation
> > > software.
> > > > > > >
> > > > > > > Even if we know where to expose these capabilities on the VF, it's
> not
> > > > > > > clear to me how we can actually virtualize the capability itself. 
> > > > > > >  If
> > > > > > > the spec defines, for example, an enable bit as r/w then software
> that
> > > > > > > interacts with that register expects the bit is settable.  
> > > > > > > There's no
> > > > > > > protocol for "try to set the bit and re-read it to see if the 
> > > > > > > hardware
> > > > > > > accepted it".  Therefore a capability with a fixed enable bit
> > > > > > > representing the state of the PF, not settable by the VF, is
> > > > > > > disingenuous to the spec.
> > > > > >
> > > > > > I think we are all in violent agreement. A lot of times the pci spec
> gets
> > > > > > defined several years ahead of real products and no one
> remembers
> > > > > > the j

RE: [PATCH v2 1/3] iommu/uapi: Define uapi version and capabilities

2020-04-14 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Wednesday, April 15, 2020 6:32 AM
> 
> On Tue, 14 Apr 2020 10:13:04 -0700
> Jacob Pan  wrote:
> 
> > > > >  In any of the proposed solutions, the
> > > > > IOMMU driver is ultimately responsible for validating the user
> > > > > data, so do we want vfio performing the copy_from_user() to an
> > > > > object that could later be assumed to be sanitized, or should
> > > > > vfio just pass a user pointer to make it obvious that the
> > > > > consumer is responsible for all the user protections?  Seems
> > > > > like the latter.
> > > > I like the latter as well.
> > > >
> On a second thought, I think the former is better. Two reasons:
> 
> 1. IOMMU API such as page_response is also used in baremetal. So it is
> not suitable to pass a __user *.
> https://www.spinics.net/lists/arm-kernel/msg798677.html

You can have a wrapper version accepting a __user * and an internal
version for kernel pointers.

> 
> 2. Some data are in the mandatory (fixed offset, never removed or
> extended) portion of the uAPI structure. It is simpler for VFIO to
> extract that and pass it to IOMMU API. For example, the PASID value used
> for unbind_gpasid(). VFIO also need to sanitize the PASID value to make
> sure it belongs to the same VM that did the allocation.

I don't think this makes much difference. If you still plan to let the
IOMMU driver parse some user pointers anyway, why not make a clean
split and have it parse all IOMMU-specific fields?
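
The wrapper split would look roughly like this (the uapi wrapper name is
hypothetical; iommu_page_response() is the existing kernel-pointer API):

    /* Sketch: __user wrapper around the kernel-pointer API. */
    int iommu_uapi_page_response(struct device *dev,
                                 struct iommu_page_response __user *umsg)
    {
            struct iommu_page_response msg;

            if (copy_from_user(&msg, umsg, sizeof(msg)))
                    return -EFAULT;

            /* sanitize version/flags here before calling the driver */
            return iommu_page_response(dev, &msg);
    }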

Thanks
Kevin

> 
> 
> > > > >  That still really
> > > > > doesn't address what's in that user data blob yet, but the vfio
> > > > > interface could be:
> > > > >
> > > > > struct {
> > > > >   __u32 argsz;
> > > > >   __u32 flags;
> > > > >   __u8  data[];
> > > > > }
> > > > >
> > > > > Where flags might be partitioned like we do for DEVICE_FEATURE
> > > > > to indicate the format of data and what vfio should do with it,
> > > > > and data might simply be defined as a (__u64 __user *).
> > > > >
> > > > So, __user * will be passed to IOMMU driver if VFIO checks minsz
> > > > include flags and they are valid.
> > > > IOMMU driver can copy the rest based on the mandatory
> > > > version/minsz and flags in the IOMMU uAPI structs.
> > > > Does it sound right? This is really choice #2.
> > >
> > > Sounds like each IOMMU UAPI struct just needs to have an embedded
> > > size and flags field, but yes.
> > >
> > Yes, an argsz field can be added to each UAPI. There are already flags
> > or the equivalent. IOMMU driver can process the __user * based on the
> > argsz, flags, check argsz against offsetofend(iommu_uapi_struct,
> > last_element), etc.;


RE: [PATCH v1 2/2] vfio/pci: Emulate PASID/PRI capability for VFs

2020-04-14 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Tuesday, April 14, 2020 11:24 PM
> 
> On Tue, 14 Apr 2020 03:42:42 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Alex Williamson 
> > > Sent: Tuesday, April 14, 2020 11:29 AM
> > >
> > > On Tue, 14 Apr 2020 02:40:58 +
> > > "Tian, Kevin"  wrote:
> > >
> > > > > From: Alex Williamson 
> > > > > Sent: Tuesday, April 14, 2020 3:21 AM
> > > > >
> > > > > On Mon, 13 Apr 2020 08:05:33 +
> > > > > "Tian, Kevin"  wrote:
> > > > >
> > > > > > > From: Tian, Kevin
> > > > > > > Sent: Monday, April 13, 2020 3:55 PM
> > > > > > >
> > > > > > > > From: Raj, Ashok 
> > > > > > > > Sent: Monday, April 13, 2020 11:11 AM
> > > > > > > >
> > > > > > > > On Wed, Apr 08, 2020 at 10:19:40AM -0600, Alex Williamson wrote:
> > > > > > > > > On Tue, 7 Apr 2020 21:00:21 -0700
> > > > > > > > > "Raj, Ashok"  wrote:
> > > > > > > > >
> > > > > > > > > > Hi Alex
> > > > > > > > > >
> > > > > > > > > > + Bjorn
> > > > > > > > >
> > > > > > > > >  + Don
> > > > > > > > >
> > > > > > > > > > FWIW I can't understand why PCI SIG went different ways with
> > > > > > > > > > ATS, where it's enumerated on PF and VF. But for PASID and
> > > > > > > > > > PRI it's only in PF.
> > > > > > > > > >
> > > > > > > > > > I'm checking with our internal SIG reps to follow up on that.
> > > > > > > > > >
> > > > > > > > > > On Tue, Apr 07, 2020 at 09:58:01AM -0600, Alex Williamson wrote:
> > > > > > > > > > > > Is there vendor guarantee that hidden registers will locate
> > > > > > > > > > > > at the same offset between PF and VF config space?
> > > > > > > > > > >
> > > > > > > > > > > I'm not sure if the spec really precludes hidden registers,
> > > > > > > > > > > but the fact that these registers are explicitly outside of
> > > > > > > > > > > the capability chain implies they're only intended for device
> > > > > > > > > > > specific use, so I'd say there are no guarantees about
> > > > > > > > > > > anything related to these registers.
> > > > > > > > > >
> > > > > > > > > > As you had suggested in the other thread, we could consider
> > > > > > > > > > using the same offset as in PF, but even that's a better guess,
> > > > > > > > > > still not reliable.
> > > > > > > > > >
> > > > > > > > > > The other option is to maybe extend driver ops in the PF to
> > > > > > > > > > expose where the offsets should be. Sort of adding the quirk
> > > > > > > > > > in the implementation.
> > > > > > > > > >
> > > > > > > > > > I'm not sure how prevalent PASID and PRI are in VF devices. If
> > > > > > > > > > SIG is resisting making VFs first-class citizens, we might ask
> > > > > > > > > > them to add some verbiage to suggest leaving the same offsets
> > > > > > > > > > as PF open to help emulation software.
> > > > > > > > >
> > > > > > > > > Even if we know where to expose these capabilities on the VF,
> > > > > > > > > it's not clear to me h
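
For reference, the "extend driver ops in the PF" option floated above
could look roughly like this (purely hypothetical; no such callback
exists in the PCI core, and all names here are invented):

        /*
         * Hypothetical: the PF driver reports where emulated PASID/PRI
         * capabilities can safely be placed in a VF's config space, since
         * any hidden VF registers sit at vendor-specific offsets that only
         * the PF driver knows about.
         */
        struct vf_cap_layout {
                u16 pasid_cap_off;      /* 0: don't emulate PASID on the VF */
                u16 pri_cap_off;        /* 0: don't emulate PRI on the VF */
        };

        /* a new callback that could be added to the PF's struct pci_driver */
        int (*sriov_vf_cap_layout)(struct pci_dev *pf, int vf_id,
                                   struct vf_cap_layout *layout);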


RE: [PATCH v2 0/7] iommu/vt-d: Add page request draining support

2020-04-15 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Wednesday, April 15, 2020 1:26 PM
> 
> When a PASID is stopped or terminated, there can be pending PRQs
> (page requests that haven't received responses) in the software and
> the remapping hardware. The pending page requests must be drained
> so that the PASID can be reused. The register-level interface
> for page request draining is defined in 7.11 of the VT-d spec.
> This series adds support for page request draining.

Section 7.11 doesn't include a register-level interface. It just
describes the general requirements on system software, the endpoint
device, and its driver.

Thanks
Kevin

> 
> This includes two parts:
>  - PATCH 1/7 ~ 3/7: refactor qi_submit_sync() to support
>    multiple descriptors per submission, which will be used by
>    PATCH 6/7.
>  - PATCH 4/7 ~ 7/7: add page request drain support after a
>    PASID entry is torn down due to an unbind operation (see the
>    sketch below).
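
A rough sketch of that unbind-time ordering (a reviewer's illustration;
the helper names follow the series description and may not match the
final code):

        /* stop the PASID first so no new page requests are accepted */
        intel_pasid_tear_down_entry(iommu, dev, pasid);
        /* drain page requests still pending in software and hardware */
        intel_svm_drain_prq(dev, pasid);
        /* only now is the PASID safe to reallocate */
        ioasid_free(pasid);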
> 
> Please help to review.
> 
> Best regards,
> baolu
> 
> Change log:
>  v1->v2:
>   - Fix race between multiple prq handling threads
> 
> Lu Baolu (7):
>   iommu/vt-d: Refactor parameters for qi_submit_sync()
>   iommu/vt-d: Multiple descriptors per qi_submit_sync()
>   iommu/vt-d: debugfs: Add support to show inv queue internals
>   iommu/vt-d: Refactor prq_event_thread()
>   iommu/vt-d: Save prq descriptors in an internal list
>   iommu/vt-d: Add page request draining support
>   iommu/vt-d: Remove redundant IOTLB flush
> 
>  drivers/iommu/dmar.c                |  63 +++--
>  drivers/iommu/intel-iommu-debugfs.c |  62 +
>  drivers/iommu/intel-pasid.c         |   4 +-
>  drivers/iommu/intel-svm.c           | 383 ++--
>  drivers/iommu/intel_irq_remapping.c |   2 +-
>  include/linux/intel-iommu.h         |  12 +-
> 
> --
> 2.17.1



RE: [PATCH v2 1/7] iommu/vt-d: Refactor parameters for qi_submit_sync()

2020-04-15 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Wednesday, April 15, 2020 1:26 PM
> 
> Currently qi_submit_sync() supports a single invalidation descriptor
> per submission and appends a wait descriptor after each submission
> to poll for hardware completion. This patch adjusts the parameters
> of this function so that multiple descriptors per submission can
> be supported.
> 
> Signed-off-by: Jacob Pan 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/dmar.c                | 24 ++--
>  drivers/iommu/intel-pasid.c         |  4 ++--
>  drivers/iommu/intel-svm.c           |  6 +++---
>  drivers/iommu/intel_irq_remapping.c |  2 +-
>  include/linux/intel-iommu.h         |  8 +++-
>  5 files changed, 27 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> index d9dc787feef7..bb42177e2369 100644
> --- a/drivers/iommu/dmar.c
> +++ b/drivers/iommu/dmar.c
> @@ -1225,10 +1225,14 @@ static int qi_check_fault(struct intel_iommu
> *iommu, int index)
>  }
> 
>  /*
> - * Submit the queued invalidation descriptor to the remapping
> - * hardware unit and wait for its completion.
> + * Function to submit invalidation descriptors of all types to the queued
> + * invalidation interface (QI). Multiple descriptors can be submitted at a
> + * time; a wait descriptor is appended to each submission to ensure that
> + * hardware has completed the invalidation before returning. Wait descriptors
> + * can be part of the submission, but they will not be polled for completion.
>   */
> -int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu)
> +int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc,
> +                unsigned int count, unsigned long options)

Adding parameters without actually using them is not the typical way of
splitting patches. Better to squash this together with 2/7.
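
For illustration, with the new signature a caller could then batch
related invalidations into a single submission, e.g. (sketch only,
descriptor setup elided):

        struct qi_desc desc[2];

        /* e.g. desc[0] = PASID cache invalidation, desc[1] = PASID-based
         * IOTLB invalidation; both are submitted and waited on together */
        qi_submit_sync(iommu, desc, 2, 0);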

>  {
>   int rc;
>   struct q_inval *qi = iommu->qi;
> @@ -1318,7 +1322,7 @@ void qi_global_iec(struct intel_iommu *iommu)
>   desc.qw3 = 0;
> 
>   /* should never fail */
> - qi_submit_sync(&desc, iommu);
> + qi_submit_sync(iommu, &desc, 1, 0);
>  }
> 
>  void qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid, u8 fm,
> @@ -1332,7 +1336,7 @@ void qi_flush_context(struct intel_iommu *iommu,
> u16 did, u16 sid, u8 fm,
>   desc.qw2 = 0;
>   desc.qw3 = 0;
> 
> - qi_submit_sync(&desc, iommu);
> + qi_submit_sync(iommu, &desc, 1, 0);
>  }
> 
>  void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
> @@ -1356,7 +1360,7 @@ void qi_flush_iotlb(struct intel_iommu *iommu,
> u16 did, u64 addr,
>   desc.qw2 = 0;
>   desc.qw3 = 0;
> 
> - qi_submit_sync(&desc, iommu);
> + qi_submit_sync(iommu, &desc, 1, 0);
>  }
> 
>  void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
> @@ -1378,7 +1382,7 @@ void qi_flush_dev_iotlb(struct intel_iommu
> *iommu, u16 sid, u16 pfsid,
>   desc.qw2 = 0;
>   desc.qw3 = 0;
> 
> - qi_submit_sync(&desc, iommu);
> + qi_submit_sync(iommu, &desc, 1, 0);
>  }
> 
>  /* PASID-based IOTLB invalidation */
> @@ -1419,7 +1423,7 @@ void qi_flush_piotlb(struct intel_iommu *iommu,
> u16 did, u32 pasid, u64 addr,
>   QI_EIOTLB_AM(mask);
>   }
> 
> - qi_submit_sync(&desc, iommu);
> + qi_submit_sync(iommu, &desc, 1, 0);
>  }
> 
>  /* PASID-based device IOTLB Invalidate */
> @@ -1448,7 +1452,7 @@ void qi_flush_dev_iotlb_pasid(struct intel_iommu
> *iommu, u16 sid, u16 pfsid,
>   if (size_order)
>   desc.qw1 |= QI_DEV_EIOTLB_SIZE;
> 
> - qi_submit_sync(&desc, iommu);
> + qi_submit_sync(iommu, &desc, 1, 0);
>  }
> 
>  void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did,
> @@ -1458,7 +1462,7 @@ void qi_flush_pasid_cache(struct intel_iommu
> *iommu, u16 did,
> 
>   desc.qw0 = QI_PC_PASID(pasid) | QI_PC_DID(did) |
>   QI_PC_GRAN(granu) | QI_PC_TYPE;
> - qi_submit_sync(&desc, iommu);
> + qi_submit_sync(iommu, &desc, 1, 0);
>  }
> 
>  /*
> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> index 48cc9ca5f3dc..7969e3dac2ad 100644
> --- a/drivers/iommu/intel-pasid.c
> +++ b/drivers/iommu/intel-pasid.c
> @@ -498,7 +498,7 @@ pasid_cache_invalidation_with_pasid(struct
> intel_iommu *iommu,
>   desc.qw2 = 0;
>   desc.qw3 = 0;
> 
> - qi_submit_sync(&desc, iommu);
> + qi_submit_sync(iommu, &desc, 1, 0);
>  }
> 
>  static void
> @@ -512,7 +512,7 @@ iotlb_invalidation_with_pasid(struct intel_iommu
> *iommu, u16 did, u32 pasid)
>   desc.qw2 = 0;
>   desc.qw3 = 0;
> 
> - qi_submit_sync(&desc, iommu);
> + qi_submit_sync(iommu, &desc, 1, 0);
>  }
> 
>  static void
> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> index e9f4e979a71f..83dc4319f661 100644
> --- a/drivers/iommu/intel-svm.c
> +++ b/drivers/iommu/intel-svm.c
> @@ -138,7 +138,7 @@ static void intel_flush_svm_range_dev (struct
> intel_svm *svm, struct intel_svm_d
>   }
>   desc.qw2 =
