RE: [PATCH v18 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-17 Thread Tian, Kevin
> From: ank...@nvidia.com 
> Sent: Friday, February 16, 2024 11:01 AM
> 
> From: Ankit Agrawal 
> 
> NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> for the on-chip GPU that is the logical OS representation of the
> internal proprietary chip-to-chip cache coherent interconnect.
> 
> The device is peculiar compared to a real PCI device in that whilst
> there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
> device, it is not used to access device memory once the faster
> chip-to-chip interconnect is initialized (occurs at the time of host
> system boot). The device memory is accessed instead using the chip-to-chip
> interconnect that is exposed as a contiguous physically addressable
> region on the host. This device memory aperture can be obtained from host
> ACPI table using device_property_read_u64(), according to the FW
> specification. Since the device memory is cache coherent with the CPU,
> it can be mmapped into the user VMA with a cacheable mapping using
> remap_pfn_range() and used like regular RAM. The device memory is not
> added to the host kernel, but mapped directly, as this avoids the memory
> wastage of struct pages.
> 
> There is also a requirement of a minimum reserved 1G uncached region
> (termed as resmem) to support the Multi-Instance GPU (MIG) feature [1].
> This is to work around a HW defect. Based on [2], the requisite properties
> (uncached, unaligned access) can be achieved through a VM mapping (S1)
> of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide
> a different non-cached property to the reserved 1G region, it needs to
> be carved out from the device memory and mapped as a separate region
> in Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the
> Qemu VMA page properties (pgprot) as NORMAL_NC.
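
A minimal sketch of the mmap split just described (names illustrative; the real driver's mmap path is more involved):

	#include <linux/mm.h>

	/* Sketch: map usemem cacheable, resmem as NORMAL_NC, in the user VMA */
	static int nvgrace_gpu_mmap_region(struct vm_area_struct *vma,
					   u64 memphys, bool cacheable)
	{
		unsigned long size = vma->vm_end - vma->vm_start;

		if (!cacheable)
			vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);

		return remap_pfn_range(vma, vma->vm_start, PHYS_PFN(memphys),
				       size, vma->vm_page_prot);
	}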
> 
> Provide a VFIO PCI variant driver that adapts the unique device memory
> representation into a more standard PCI representation facing userspace.
> 
> The variant driver exposes these two regions - the non-cached reserved
> (resmem) and the cached rest of the device memory (termed as usemem) as
> separate VFIO 64b BAR regions. This is divergent from the baremetal
> approach, where the device memory is exposed as a device memory region.
> The decision for a different approach was taken in view of the fact that
> it would necessitate additional code in Qemu to discover and insert those
> regions in the VM IPA, along with the additional VM ACPI DSDT changes to
> communicate the device memory region IPA to the VM workloads. Moreover,
> this behavior would have to be added to a variety of emulators (beyond
> top of tree Qemu) out there desiring grace hopper support.
> 
> Since the device implements 64-bit BAR0, the VFIO PCI variant driver
> maps the uncached carved-out region to the next available PCI BAR (i.e.
> comprising regions 2 and 3). The cached device memory aperture is
> assigned BAR regions 4 and 5. Qemu will then naturally generate a PCI
> device in the VM with the uncached aperture reported as BAR2 region,
> the cacheable as BAR4. The variant driver provides emulation for these
> fake BARs' PCI config space offset registers.
> 
> The hardware ensures that the system does not crash when the memory
> is accessed with the memory enable turned off. It synthesizes ~0 for reads
> and drops writes on such access. So there is no need to support the
> disablement/enablement of BAR through PCI_COMMAND config space
> register.
> 
> The memory layout on the host looks like the following:
>                   devmem (memlength)
> |--------------------------------------------------|
> |-------------cached------------------------|--NC--|
> |                                           |
> usemem.memphys                              resmem.memphys
> 
> PCI BARs need to be aligned to a power-of-2, but the actual memory on the
> device may not be. A read or write access to the physical address from the
> last device PFN up to the next power-of-2 aligned physical address
> reads ~0 and drops writes. Note that the GPU device
> driver [6] is capable of knowing the exact device memory size through
> separate means. The device memory size is primarily kept in the system
> ACPI tables for use by the VFIO PCI variant module.
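
A sketch of the sizing rule just described (illustrative):

	#include <linux/log2.h>

	/* Sketch: report a power-of-2 BAR size for a non-power-of-2 memlength.
	 * Accesses in [memlength, bar_size) read ~0 and drop writes in HW. */
	static u64 nvgrace_gpu_reported_bar_size(u64 memlength)
	{
		return roundup_pow_of_two(memlength);
	}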
> 
> Note that the usemem memory is added by the VM Nvidia device driver [5]
> to the VM kernel as memblocks. Hence make the usable memory size
> memblock (MEMBLK_SIZE) aligned. This is a hardwired ABI value between
> the GPU FW and the VFIO driver. The VM device driver makes use of the
> same value in its calculation to determine the USEMEM size.
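
In code, the alignment amounts to something like this sketch (MEMBLK_SIZE as in the quoted patch):

	#include <linux/kernel.h>

	/* Sketch: round the usable size up to the memblock-aligned ABI value */
	static u64 nvgrace_gpu_usemem_size(u64 memlength)
	{
		return round_up(memlength, MEMBLK_SIZE);
	}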
> 
> Currently there is no provision in KVM for a S2 mapping with
> MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3].
> As previously mentioned, resmem is mapped with pgprot_writecombine(), which
> sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Using the
> proposed changes in [3] and [4], KVM marks the region with
> MemAttr[2:0]=0b101 

RE: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-07 Thread Tian, Kevin
> From: Ankit Agrawal 
> Sent: Thursday, February 8, 2024 3:13 PM
> >> > +	/*
> >> > +	 * Determine how many bytes to be actually read from the device memory.
> >> > +	 * Read requests beyond the actual device memory size are filled with ~0,
> >> > +	 * while those beyond the actual reported size are skipped.
> >> > +	 */
> >> > +   if (offset >= memregion->memlength)
> >> > +   mem_count = 0;
> >>
> >> If mem_count == 0, going through nvgrace_gpu_map_and_read() is not
> >> necessary.
> >
> > Harmless, other than the possibly unnecessary call through to
> > nvgrace_gpu_map_device_mem().  Maybe both nvgrace_gpu_map_and_read()
> > and nvgrace_gpu_map_and_write() could conditionally return 0 as their
> > first operation when !mem_count.  Thanks,
> >
> > Alex
> 
> IMO, this seems like adding too much code to reduce the call length for a
> very specific case. If there aren't any strong opinions on this, I'm
> planning to leave this code as it is.

a slight difference: if mem_count == 0 the result should always succeed
no matter whether nvgrace_gpu_map_device_mem() succeeds or not. Of course,
if it fails it's already a big problem, so probably nobody cares about the
subtle difference when reading a non-existent range.

but regarding readability it's still clearer:

	if (mem_count)
		nvgrace_gpu_map_and_read();
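
Put together with the clamping, the read path under discussion might look like this sketch (function names from the quoted patch, bodies illustrative):

	/* Sketch: clamp the read to the actual memory and skip the device
	 * access entirely when nothing real is being read. */
	mem_count = 0;
	if (offset < memregion->memlength)
		mem_count = min(count, (size_t)(memregion->memlength - offset));

	if (mem_count) {
		ret = nvgrace_gpu_map_and_read(nvdev, buf, mem_count, ppos);
		if (ret)
			return ret;
	}
	/* The remainder of the request, up to the reported size, reads as ~0 */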



RE: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-07 Thread Tian, Kevin
> From: ank...@nvidia.com 
> Sent: Tuesday, February 6, 2024 7:01 AM
> 
> Note that the usemem memory is added by the VM Nvidia device driver [5]
> to the VM kernel as memblocks. Hence make the usable memory size
> memblock aligned.

Is memblock size defined in spec or purely a guest implementation choice?

> 
> If the bare metal properties are not present, the driver registers the
> vfio-pci-core function pointers.

so if Qemu doesn't generate such a property, the variant driver running
inside the guest will always fall back to the core functions, and the guest
vfio userspace will observe both the resmem and usemem BARs. But then there
is nothing in the field to prohibit mapping the resmem BAR as cacheable.

should this driver check the presence of either the ACPI property or the
resmem/usemem BARs to enable the variant function pointers?
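
The check being suggested could be as simple as the following sketch (the property name and both ops symbols are illustrative assumptions):

	/* Sketch: only install the variant ops when the bare metal ACPI
	 * properties exist; otherwise use the plain vfio-pci-core ops. */
	if (device_property_present(&pdev->dev, "nvidia,gpu-mem-base-pa"))
		ops = &nvgrace_gpu_pci_ops;	/* hypothetical variant ops */
	else
		ops = &nvgrace_gpu_pci_core_ops;	/* hypothetical plain ops */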

> +config NVGRACE_GPU_VFIO_PCI
> +	tristate "VFIO support for the GPU in the NVIDIA Grace Hopper Superchip"
> +	depends on ARM64 || (COMPILE_TEST && 64BIT)
> + select VFIO_PCI_CORE
> + help
> +   VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
> +   required to assign the GPU device using KVM/qemu/etc.

"assign the GPU device to userspace"

> +
> +/* Memory size expected as non cached and reserved by the VM driver */
> +#define RESMEM_SIZE 0x40000000
> +#define MEMBLK_SIZE 0x20000000

also add a comment for MEMBLK_SIZE
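
Something along these lines would address it (the RESMEM comment is from the quoted patch; the MEMBLK wording is a suggestion based on the commit message):

	/* Memory size expected as non cached and reserved by the VM driver */
	#define RESMEM_SIZE 0x40000000
	/*
	 * Memblock size the VM driver uses when adding usemem to the VM kernel;
	 * a hardwired ABI value between the GPU FW and the VFIO driver.
	 */
	#define MEMBLK_SIZE 0x20000000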

> +
> +struct nvgrace_gpu_vfio_pci_core_device {

will nvgrace refer to a non-gpu device? if not, probably all prefixes with
'nvgrace_gpu' can be simplified to 'nvgrace'.

btw, following other variant drivers, 'vfio' can be removed too.

> +
> +/*
> + * Both the usable (usemem) and the reserved (resmem) device memory regions
> + * are exposed as 64b fake BARs in the VM. These fake BARs must respond

s/VM/device/

> + * to the accesses on their respective PCI config space offsets.
> + *
> + * resmem BAR owns PCI_BASE_ADDRESS_2 & PCI_BASE_ADDRESS_3.
> + * usemem BAR owns PCI_BASE_ADDRESS_4 & PCI_BASE_ADDRESS_5.
> + */
> +static ssize_t
> +nvgrace_gpu_read_config_emu(struct vfio_device *core_vdev,
> + char __user *buf, size_t count, loff_t *ppos)
> +{
> +	struct nvgrace_gpu_vfio_pci_core_device *nvdev =
> +		container_of(core_vdev, struct nvgrace_gpu_vfio_pci_core_device,
> +			     core_device.vdev);
> + struct mem_region *memregion = NULL;
> + u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + __le64 val64;
> + size_t register_offset;
> + loff_t copy_offset;
> + size_t copy_count;
> + int ret;
> +
> + ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
> + if (ret < 0)
> + return ret;

here if core_read succeeds *ppos has been updated...

> +
> +	if (vfio_pci_core_range_intersect_range(pos, count, PCI_BASE_ADDRESS_2,
> +						sizeof(val64),
> +						&copy_offset, &copy_count,
> +						&register_offset))
> +		memregion = nvgrace_gpu_memregion(RESMEM_REGION_INDEX, nvdev);
> +	else if (vfio_pci_core_range_intersect_range(pos, count,
> +						     PCI_BASE_ADDRESS_4,
> +						     sizeof(val64),
> +						     &copy_offset, &copy_count,
> +						     &register_offset))
> +		memregion = nvgrace_gpu_memregion(USEMEM_REGION_INDEX, nvdev);
> +
> +	if (memregion) {
> +		val64 = nvgrace_gpu_get_read_value(memregion->bar_size,
> +						   PCI_BASE_ADDRESS_MEM_TYPE_64 |
> +						   PCI_BASE_ADDRESS_MEM_PREFETCH,
> +						   memregion->bar_val);
> +		if (copy_to_user(buf + copy_offset,
> +				 (void *)&val64 + register_offset, copy_count))
> +			return -EFAULT;

...but here it's not adjusted back upon error.

> +
> +/*
> + * Read the data from the device memory (mapped either through ioremap
> + * or memremap) into the user buffer.
> + */
> +static int
> +nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
> +			 char __user *buf, size_t mem_count, loff_t *ppos)
> +{
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
> + int ret;
> +
> + /*
> +  * Handle read on the BAR regions. Map to the target device memory
> +  * physical address and copy to the request read buffer.
> +  */

duplicate with the earlier comment for the function. 

> +/*
> + * Read count bytes from the device memory at an offset. The actual device
> + * memory size (available) may not be a power-of-2. So the driver fakes
> + * the size to a power-of-2 (reported) when exposing to a user space driver.
> + *
> + * Reads extending beyond the reported size are truncated; reads starting
> + * beyond the reported size generate 
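
A sketch of the ~0 fill behavior described above (illustrative):

	/* Sketch: fill the tail of the request, which lies past the actual
	 * memlength but within the reported power-of-2 size, with ~0. */
	if (mem_count < count) {
		size_t i;

		for (i = mem_count; i < count; i++)
			if (put_user((u8)0xff, (u8 __user *)buf + i))
				return -EFAULT;
	}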

RE: [PATCH v17 1/3] vfio/pci: rename and export do_io_rw()

2024-02-07 Thread Tian, Kevin
> From: ank...@nvidia.com 
> Sent: Tuesday, February 6, 2024 7:01 AM
> 
> From: Ankit Agrawal 
> 
> do_io_rw() is used to read/write to the device MMIO. The Grace Hopper
> VFIO PCI variant driver requires this functionality to read/write to
> its memory.
> 
> Rename this as vfio_pci_core functions and export as GPL.
> 
> Signed-off-by: Ankit Agrawal 

Reviewed-by: Kevin Tian 



RE: [PATCH v17 2/3] vfio/pci: rename and export range_intesect_range

2024-02-07 Thread Tian, Kevin
> From: ank...@nvidia.com 
> Sent: Tuesday, February 6, 2024 7:01 AM
> 
> From: Ankit Agrawal 
> 
> range_intesect_range determines an overlap between two ranges. If an

s/intesect/intersect/

> overlap exists, the helper function returns the overlapping offset and size.
> 
> The VFIO PCI variant driver emulates the PCI config space BAR offset
> registers. These offsets may be accessed for read/write with a variety
> of lengths including sub-word sizes from sub-word offsets. The driver
> makes use of this helper function to read/write the targeted part of
> the emulated register.
> 
> Make this a vfio_pci_core function, rename and export as GPL. Also
> update references in virtio driver.
> 
> Signed-off-by: Ankit Agrawal 
> ---
>  drivers/vfio/pci/vfio_pci_config.c | 45 +++
>  drivers/vfio/pci/virtio/main.c | 72 +++---
>  include/linux/vfio_pci_core.h  |  5 +++
>  3 files changed, 76 insertions(+), 46 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c
> b/drivers/vfio/pci/vfio_pci_config.c
> index 672a1804af6a..4fc3c605af13 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -1966,3 +1966,48 @@ ssize_t vfio_pci_config_rw(struct
> vfio_pci_core_device *vdev, char __user *buf,
> 
>   return done;
>  }
> +
> +/**
> + * vfio_pci_core_range_intersect_range() - Determine overlap between a
> + *					     buffer and register offset ranges.
> + * @range1_start: start offset of the buffer
> + * @count1: number of buffer bytes
> + * @range2_start: start register offset
> + * @count2: number of bytes of register
> + * @start_offset: start offset of overlap start in the buffer
> + * @intersect_count: number of overlapping bytes
> + * @register_offset: start offset of overlap start in register
> + *
> + * The function determines an overlap between a register and a buffer.
> + * range1 represents the buffer and range2 represents register.
> + *
> + * Returns: true if there is overlap, false if not.
> + * The overlap start and range is returned through function args.
> + */
> +bool vfio_pci_core_range_intersect_range(loff_t range1_start, size_t count1,
> +  loff_t range2_start, size_t count2,
> +  loff_t *start_offset,
> +  size_t *intersect_count,
> +  size_t *register_offset)

based on description it's probably clearer to rename:

range1_start -> buf_start
count1 -> buf_cnt
range2_start -> reg_start
count2 -> reg_cnt
start_offset -> buf_offset

but not big deal, so:

Reviewed-by: Kevin Tian 
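
For reference, a sketch of how a variant driver consumes this helper to emulate a read of a 4-byte register at offset REG (names illustrative):

	/* Sketch: copy only the overlapping bytes of an emulated register */
	loff_t copy_offset;
	size_t copy_count, register_offset;
	u32 val = 0;	/* emulated register value */

	if (vfio_pci_core_range_intersect_range(pos, count, REG, sizeof(val),
						&copy_offset, &copy_count,
						&register_offset)) {
		if (copy_to_user(buf + copy_offset,
				 (void *)&val + register_offset, copy_count))
			return -EFAULT;
	}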



RE: [PATCH] vfio: fix virtio-pci dependency

2024-01-09 Thread Tian, Kevin
> From: Arnd Bergmann 
> Sent: Tuesday, January 9, 2024 3:57 PM
> 
> From: Arnd Bergmann 
> 
> The new vfio-virtio driver already has a dependency on
> VIRTIO_PCI_ADMIN_LEGACY, but that is a bool symbol and allows vfio-virtio
> to be built-in even if virtio-pci itself is a loadable module. This leads
> to a link failure:
> 
> aarch64-linux-ld: drivers/vfio/pci/virtio/main.o: in function `virtiovf_pci_probe':
> main.c:(.text+0xec): undefined reference to `virtio_pci_admin_has_legacy_io'
> aarch64-linux-ld: drivers/vfio/pci/virtio/main.o: in function `virtiovf_pci_init_device':
> main.c:(.text+0x260): undefined reference to `virtio_pci_admin_legacy_io_notify_info'
> aarch64-linux-ld: drivers/vfio/pci/virtio/main.o: in function `virtiovf_pci_bar0_rw':
> main.c:(.text+0x6ec): undefined reference to `virtio_pci_admin_legacy_common_io_read'
> aarch64-linux-ld: main.c:(.text+0x6f4): undefined reference to `virtio_pci_admin_legacy_device_io_read'
> aarch64-linux-ld: main.c:(.text+0x7f0): undefined reference to `virtio_pci_admin_legacy_common_io_write'
> aarch64-linux-ld: main.c:(.text+0x7f8): undefined reference to `virtio_pci_admin_legacy_device_io_write'
> 
> Add another explicit dependency on the tristate symbol.
> 
> Fixes: eb61eca0e8c3 ("vfio/virtio: Introduce a vfio driver over virtio devices")
> Signed-off-by: Arnd Bergmann 

Reviewed-by: Kevin Tian 
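
The fix amounts to a one-line Kconfig change along these lines (a sketch; symbol names taken from the quoted log):

	config VIRTIO_VFIO_PCI
		tristate "VFIO support for VIRTIO NET PCI devices"
		depends on VIRTIO_PCI && VIRTIO_PCI_ADMIN_LEGACY
		select VFIO_PCI_CORE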



RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-07 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, April 7, 2021 8:21 PM
> 
> On Wed, Apr 07, 2021 at 02:08:33AM +, Tian, Kevin wrote:
> 
> > > Because if you don't then we enter insane world where a PASID is being
> > > created under /dev/ioasid but its translation path flows through setup
> > > done by VFIO and the whole user API becomes an incomprehensible
> mess.
> > >
> > > How will you even associate the PASID with the other translation??
> >
> > PASID is attached to a specific iommu domain (created by VFIO/VDPA),
> which
> > has GPA->HPA mappings already configured. If we view that mapping as an
> > attribute of the iommu domain, it's reasonable to have the userspace-
> bound
> > pgtable through /dev/ioasid to nest on it.
> 
> A user controlled page table should absolutely not be an attribute of
> a hidden kernel object, nor should two parts of the kernel silently
> connect to each other via a hidden internal objects like this.
> 
> Security is important - the kind of connection must use some explicit
> FD authorization to access shared objects, not be made implicit!
> 
> IMHO this direction is a dead end for this reason.
> 

Could you elaborate what exact security problem arises with this
approach? Isn't ALLOW_PASID the authorization interface for the
connection?

Based on all your replies now I see what you actually want is generalizing
all IOMMU related stuff through /dev/ioasid (sort of /dev/iommu), which
requires factoring out vfio_iommu_type1 into the general part. This is
a huge amount of work.

Is it really the only practice in Linux that any new feature has to be
blocked as long as a refactoring work is identified? Don't people accept
any balance between enabling new features and completing refactoring
work through a staging approach, as long as we don't introduce an uAPI
specifically for the staging purpose? ☹

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-07 Thread Tian, Kevin
> From: Jason Gunthorpe
> Sent: Tuesday, April 6, 2021 8:43 PM
> 
> On Tue, Apr 06, 2021 at 09:35:17AM +0800, Jason Wang wrote:
> 
> > > VFIO and VDPA have no business having map/unmap interfaces once we
> > > have /dev/ioasid. That all belongs on the ioasid side.
> > >
> > > I know they have those interfaces today, but that doesn't mean we have
> > > to keep using them for PASID use cases, they should be replaced with a
> > > 'do dma from this pasid on /dev/ioasid' interface certainly not a
> > > 'here is a pasid from /dev/ioasid, go ahead and configure it yourself'
> > > interface
> >
> > So it looks like the PASID was bound to SVA in this design. I think it's not
> > necessarily the case:
> 
> No, I wish people would stop talking about SVA.
> 
> SVA and vSVA are a very special narrow configuration of a PASID. There
> are lots of other PASID configurations! That is the whole point, a
> PASID is complicated, there are many configuration scenarios, they
> need to be in one place with a very clearly defined uAPI
> 

I feel it also makes sense to allow a subsystem to specify which configurations
are permitted when allowing a PASID on its device, e.g. excluding things like
GPA mappings that existing subsystems (VFIO/VDPA) already handle well:

- Share GPA mappings between multiple devices (w/ or w/o PASID) for 
better IOTLB efficiency;

- Share GPA mappings between transactions w/ PASID and transactions w/o
PASID from the same device (e.g. GPU) for better IOTLB efficiency;

- Use the same page table for GPA mappings before and after the guest 
turns on/off the PASID capability;

All above are given as long as we continue to let VFIO/VDPA manage the
iommu domain and associated GPA mappings for PASID. The IOMMU driver 
already ensures a nested PASID entry linking to the established GPA paging 
structure of the domain when the 1st-level pgtable is bound through 
/dev/ioasid. 

In contrast, above merits are lost if forcing a model where GPA mappings
for PASID must be constructed through /dev/ioasid, as this will lead to
multiple paging structures for the same GPA mappings implying worse 
IOTLB usage and unnecessary cost of invalidations.

Therefore, I envision a scheme where the subsystem could specify 
permitted PASID configurations when doing ALLOW_PASID, and then 
userspace queries per-PASID capability to learn which operations
are allowed, e.g.:

1) To enable vSVA, VFIO/VDPA allows pgtable binding and related invalidation/
fault ops through /dev/ioasid;

2) for vDPA control vq usage, no configuration is allowed through /dev/ioasid;

3) for new subsystem which doesn't carry any legacy or similar usage as 
VFIO/VDPA, it could permit all configurations through /dev/ioasid including 
1st-level binding and 2nd-level mapping ops;

This approach also allows us to grow the uAPI in a staging approach. Now 
focus on 1) and 2) as VFIO/VDPA are the only two users for now with good
legacy to cover the GPA mappings. More ops can be introduced for 3) when 
there is a real example to show what exact ops are required for such a new 
subsystem.

Is this a good strategy to move forward?

btw this discussion was raised when discussing the I/O page fault handling
process. Currently the IOMMU layer implements a per-device fault reporting
mechanism, which requires VFIO to register a handler to receive all faults
on its device and then forward them to ioasid if they are due to the 1st
level. Possibly it makes more sense to convert it into a per-pgtable
reporting scheme, and then the owner of each pgtable should register its
own handler. It means for 1) VFIO will register a 2nd-level pgtable handler
while /dev/ioasid will register a 1st-level pgtable handler, and for 3)
/dev/ioasid will register handlers for both 1st-level and 2nd-level
pgtables. Jean? also want to know your thoughts...

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-06 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, April 6, 2021 8:21 PM
> 
> On Tue, Apr 06, 2021 at 01:02:05AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Tuesday, April 6, 2021 7:40 AM
> > >
> > > On Fri, Apr 02, 2021 at 07:58:02AM +, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe 
> > > > > Sent: Thursday, April 1, 2021 9:47 PM
> > > > >
> > > > > On Thu, Apr 01, 2021 at 01:43:36PM +, Liu, Yi L wrote:
> > > > > > > From: Jason Gunthorpe 
> > > > > > > Sent: Thursday, April 1, 2021 9:16 PM
> > > > > > >
> > > > > > > On Thu, Apr 01, 2021 at 01:10:48PM +, Liu, Yi L wrote:
> > > > > > > > > From: Jason Gunthorpe 
> > > > > > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > > > > > [...]
> > > > > > > > > I'm worried Intel views the only use of PASID in a guest is 
> > > > > > > > > with
> > > > > > > > > ENQCMD, but that is not consistent with the industry. We need
> to
> > > see
> > > > > > > > > normal nested PASID support with assigned PCI VFs.
> > > > > > > >
> > > > > > > > I'm not quite following here. Intel also allows PASID usage in guest
> > > without
> > > > > > > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it
> without
> > > > > > > ENQCMD.
> > > > > > >
> > > > > > > Then you need all the parts, the hypervisor calls from the vIOMMU,
> > > and
> > > > > > > you can't really use a vPASID.
> > > > > >
> > > > > > This is a diagram shows the vSVA setup.
> > > > >
> > > > > I'm not talking only about vSVA. Generic PASID support with arbitrary
> > > > > mappings.
> > > > >
> > > > > And how do you deal with the vPASID vs pPASID issue if the system
> has
> > > > > a mix of physical devices and mdevs?
> > > > >
> > > >
> > > > We plan to support two schemes. One is vPASID identity-mapped to
> > > > pPASID then the mixed scenario just works, with the limitation of
> > > > lacking live migration support. The other is non-identity-mapped
> > > > scheme, where live migration is supported but physical devices and
> > > > mdevs should not be mixed in one VM if both expose SVA capability
> > > > (requires some filtering check in Qemu).
> > >
> > > That just becomes "block vPASID support if any device that
> > > doesn't use ENQCMD is plugged into the guest"
> >
> > The limitation is only for physical devices, and in reality it is not that
> > bad. To support live migration with physical device we anyway need
> > additional work to migrate the device state (e.g. based on Max's work),
> > then it's not unreasonable to also mediate guest programming of
> > device specific PASID register to enable vPASID (need to translate in
> > the whole VM lifespan but likely is not a hot path).
> 
> IMHO that is pretty unreasonable.. More likely we end up with vPASID
> tables in each migratable device like KVM has.

just like mdev needs to maintain an allowed PASID list, this extends it to
all migratable devices.

> 
> > > Which needs a special VFIO capability of some kind so qemu knows to
> > > block it. This really needs to all be laid out together so someone
> > > can understand it :(
> >
> > Or it could simply be based on whether the VFIO device supports live migration.
> 
> You need to define affirmative caps that indicate that vPASID will be
> supported by the VFIO device.

Yes, this is required as I acked in another mail.

> 
> > > Why doesn't the siov cookbook explain this stuff??
> > >
> > > > We hope the /dev/ioasid can support both schemes, with the minimal
> > > > requirement of allowing userspace to tag a vPASID to a pPASID and
> > > > allowing mdev to translate vPASID into pPASID, i.e. not assuming that
> > > > the guest will always use pPASID.
> > >
> > > What I'm unclear of is if /dev/ioasid even needs to care about
> > > vPASID or if vPASID is just a hidden artifact of the KVM connection to
> > > setup the translation table and the vIOMMU driver in qemu.
> >
> > Not just for KVM. Also required by mdev, which needs to translate
> > vPASID into pPASID when ENQCMD is not used.
> 
> Do we have any mdev's that will do this?

definitely. Actually any mdev which doesn't do ENQCMD needs to do this.
In the normal case, the PASID is programmed to an MMIO register (or
in-memory context) associated with the backend resource of the mdev. The
value programmed from the guest is a vPASID, and thus must be translated
into a pPASID before updating the physical register.
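
A hypothetical sketch of that translation path (every name here is illustrative; no real mdev API is implied):

	#include <linux/io.h>

	struct my_mdev {			/* hypothetical */
		void __iomem *backend_regs;
	};

	/* Sketch: intercept a guest write of a vPASID and load the pPASID */
	static int mdev_write_pasid_reg(struct my_mdev *mdev, u32 vpasid)
	{
		u32 ppasid;

		if (!mdev_vpasid_allowed(mdev, vpasid))	/* hypothetical check */
			return -EPERM;

		ppasid = mdev_vpasid_to_ppasid(mdev, vpasid); /* hypothetical */

		/* Program the physical PASID into the backend MMIO register */
		writel(ppasid, mdev->backend_regs + 0x0 /* PASID reg offset */);
		return 0;
	}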

> 
> > should only care about the operations related to pPASID. VFIO could
> > carry vPASID information to mdev.
> 
> It depends how common this is, I suppose
> 

based on above I think it's a common case.

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-06 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, April 6, 2021 8:35 PM
> 
> On Tue, Apr 06, 2021 at 01:27:15AM +, Tian, Kevin wrote:
> >
> > and here is one example why using existing VFIO/VDPA interface makes
> > sense. say dev1 (w/ sva) and dev2 (w/o sva) are placed in a single VFIO
> > container.
> 
> Forget about SVA, it is an irrelevant detail of how a PASID is
> configured.
> 
> > The container is associated to an iommu domain which contains a
> > single 2nd-level page table, shared by both devices (when attached
> > to the domain).
> 
> This level should be described by an ioasid.
> 
> > The VFIO MAP operation is applied to the 2nd-level
> > page table thus naturally applied to both devices. Then userspace
> > could use /dev/ioasid to further allocate IOASIDs and bind multiple
> > 1st-level page tables for dev1, nested on the shared 2nd-level page
> > table.
> 
> Because if you don't then we enter insane world where a PASID is being
> created under /dev/ioasid but its translation path flows through setup
> done by VFIO and the whole user API becomes an incomprehensible mess.
> 
> How will you even associate the PASID with the other translation??

PASID is attached to a specific iommu domain (created by VFIO/VDPA), which
has GPA->HPA mappings already configured. If we view that mapping as an
attribute of the iommu domain, it's reasonable to have the userspace-bound
pgtable through /dev/ioasid to nest on it.


> 
> The entire translation path for any ioasid or PASID should be defined
> only by /dev/ioasid. Everything else is a legacy API.
> 
> > If following your suggestion then VFIO must deny VFIO MAP operations
> > on sva1 (assume userspace should not mix sva1 and sva2 in the same
> > container and instead use /dev/ioasid to map for sva1)?
> 
> No, userspace creates an iosaid for the guest physical mapping and
> passes this ioasid to VFIO PCI which will assign it as the first layer
> mapping on the RID

Is it a dummy ioasid just for providing GPA mappings for the nesting
purposes of other IOASIDs? Then we waste one per VM?

> 
> When PASIDs are allocated the uAPI will be told to logically nested
> under the first ioasid. When VFIO authorizes a PASID for a RID it
> checks that all the HW rules are being followed.

As I explained above, why can't we just use the iommu domain to connect
the dots? Every passthrough framework needs to create an iommu domain
first, and it needs to support both devices w/ PASID and devices w/o PASID.
For devices w/o PASID it needs to invent its own MAP interface anyway.
Then why do we bother creating another MAP interface through /dev/ioasid,
which not only duplicates but also creates a transition burden between
two sets of MAP interfaces when the guest turns the pasid capability
on/off on the device?

> 
> If there are rules like groups of VFIO devices must always use the
> same IOASID then VFIO will check these too (and realistically qemu
> will have only one guest physical map ioasid anyhow)
> 
> There is no real difference between setting up an IOMMU table for a
> (RID,PASID) tuple or just a RID. We can do it universally with
> one interface for all consumers.
> 

'universally' depends on which angle you look at this problem from. From an
IOASID p.o.v. possibly yes, but from a device passthrough p.o.v. it's the
opposite, since the passthrough framework needs to handle devices w/o PASID
anyway (and even a device w/ PASID could send traffic w/o PASID), thus
'universally' makes more sense if the passthrough framework can use one
interface of its own to manage GPA mappings for all consumers (applying to
the case when a PASID is allowed/authorized).

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-05 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Tuesday, April 6, 2021 9:02 AM
> 
> > From: Jason Gunthorpe 
> > Sent: Tuesday, April 6, 2021 7:40 AM
> >
> > On Fri, Apr 02, 2021 at 07:58:02AM +, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe 
> > > > Sent: Thursday, April 1, 2021 9:47 PM
> > > >
> > > > On Thu, Apr 01, 2021 at 01:43:36PM +, Liu, Yi L wrote:
> > > > > > From: Jason Gunthorpe 
> > > > > > Sent: Thursday, April 1, 2021 9:16 PM
> > > > > >
> > > > > > On Thu, Apr 01, 2021 at 01:10:48PM +, Liu, Yi L wrote:
> > > > > > > > From: Jason Gunthorpe 
> > > > > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > > > > [...]
> > > > > > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > > > > > ENQCMD, but that is not consistent with the industry. We need
> to
> > see
> > > > > > > > normal nested PASID support with assigned PCI VFs.
> > > > > > >
> > > > > > > I'm not quite following here. Intel also allows PASID usage in guest
> > without
> > > > > > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without
> > > > > > ENQCMD.
> > > > > >
> > > > > > Then you need all the parts, the hypervisor calls from the vIOMMU,
> > and
> > > > > > you can't really use a vPASID.
> > > > >
> > > > > This is a diagram shows the vSVA setup.
> > > >
> > > > I'm not talking only about vSVA. Generic PASID support with arbitrary
> > > > mappings.
> > > >
> > > > And how do you deal with the vPASID vs pPASID issue if the system has
> > > > a mix of physical devices and mdevs?
> > > >
> > >
> > > We plan to support two schemes. One is vPASID identity-mapped to
> > > pPASID then the mixed scenario just works, with the limitation of
> > > lacking live migration support. The other is non-identity-mapped
> > > scheme, where live migration is supported but physical devices and
> > > mdevs should not be mixed in one VM if both expose SVA capability
> > > (requires some filtering check in Qemu).
> >
> > That just becomes "block vPASID support if any device that
> > doesn't use ENQCMD is plugged into the guest"
> 
> The limitation is only for physical devices, and in reality it is not that
> bad. To support live migration with physical device we anyway need
> additional work to migrate the device state (e.g. based on Max's work),
> then it's not unreasonable to also mediate guest programming of
> device specific PASID register to enable vPASID (need to translate in
> the whole VM lifespan but likely is not a hot path).
> 
> >
> > Which needs a special VFIO capability of some kind so qemu knows to
> > block it. This really needs to all be laid out together so someone
> > can understand it :(
> 
> Or it could simply be based on whether the VFIO device supports live migration.

Actually you are right on this point. VFIO should provide a per-device
capability to indicate whether vPASID is allowed on this device: likely
yes for mdev, by default no for pdev (unless explicitly opted in). Qemu
should enable vPASID only if all assigned devices support it, and then 
provide vPASID information when using VFIO API to allow pPASIDs.

> 
> >
> > Why doesn't the siov cookbook explain this stuff??
> >
> > > We hope the /dev/ioasid can support both schemes, with the minimal
> > > requirement of allowing userspace to tag a vPASID to a pPASID and
> > > allowing mdev to translate vPASID into pPASID, i.e. not assuming that
> > > the guest will always use pPASID.
> >
> > What I'm unclear of is if /dev/ioasid even needs to care about
> > vPASID or if vPASID is just a hidden artifact of the KVM connection to
> > setup the translation table and the vIOMMU driver in qemu.
> 
> Not just for KVM. Also required by mdev, which needs to translate
> vPASID into pPASID when ENQCMD is not used. As I replied in another
> mail, possibly we don't need /dev/ioasid to know this fact, which
> should only care about the operations related to pPASID. VFIO could
> carry vPASID information to mdev. KVM should have its own interface
> to know this information, as you suggested earlier.
> 
> Thanks
> Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-05 Thread Tian, Kevin
> From: Jason Gunthorpe
> Sent: Tuesday, April 6, 2021 7:43 AM
> 
> On Fri, Apr 02, 2021 at 08:22:28AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Tuesday, March 30, 2021 9:29 PM
> > >
> > > >
> > > > First, userspace may use ioasid in a non-SVA scenario where ioasid is
> > > > bound to specific security context (e.g. a control vq in vDPA) instead 
> > > > of
> > > > tying to mm. In this case there is no pgtable binding initiated from 
> > > > user
> > > > space. Instead, ioasid is allocated from /dev/ioasid and then
> programmed
> > > > to the intended security context through specific passthrough
> framework
> > > > which manages that context.
> > >
> > > This sounds like the exact opposite of what I'd like to see.
> > >
> > > I do not want to see every subsystem gaining APIs to program a
> > > PASID. All of that should be consolidated in *one place*.
> > >
> > > I do not want to see VDPA and VFIO have two nearly identical sets of
> > > APIs to control the PASID.
> > >
> > > Drivers consuming a PASID, like VDPA, should consume the PASID and do
> > > nothing more than authorize the HW to use it.
> > >
> > > quemu should have general code under the viommu driver that drives
> > > /dev/ioasid to create PASID's and manage the IO mapping according to
> > > the guest's needs.
> > >
> > > Drivers like VDPA and VFIO should simply accept that PASID and
> > > configure/authorize their HW to do DMA's with its tag.
> > >
> >
> > I agree with you on consolidating things in one place (especially for the
> > general SVA support). But here I was referring to a usage without
> > pgtable binding (Possibly Jason. W can say more here), where the
> > userspace just wants to allocate PASIDs, program/accept PASIDs to
> > various workqueues (device specific), and then use MAP/UNMAP
> > interface to manage address spaces associated with each PASID.
> > I just wanted to point out that the latter two steps are through
> > VFIO/VDPA specific interfaces.
> 
> No, don't do that.
> 
> VFIO and VDPA have no business having map/unmap interfaces once we have
> /dev/ioasid. That all belongs on the ioasid side.
>
> I know they have those interfaces today, but that doesn't mean we have
> to keep using them for PASID use cases, they should be replaced with a
> 'do dma from this pasid on /dev/ioasid' interface certainly not a
> 'here is a pasid from /dev/ioasid, go ahead and configure it yourself'
> interface
> 
> This is because PASID is *complicated* in the general case! For
> instance all the two level stuff you are talking about must not leak
> into every user!
> 

Hi, Jason,

I didn't get your last comment how the two level stuff is leaked into every
user. Could you elaborate it a bit?

and here is one example why using the existing VFIO/VDPA interface makes
sense. say dev1 (w/ sva) and dev2 (w/o sva) are placed in a single VFIO
container. The container is associated with an iommu domain which contains
a single 2nd-level page table, shared by both devices (when attached to
the domain). The VFIO MAP operation is applied to the 2nd-level page
table and thus naturally applies to both devices. Then userspace could use
/dev/ioasid to further allocate IOASIDs and bind multiple 1st-level page
tables for dev1, nested on the shared 2nd-level page table.

If we follow your suggestion then VFIO must deny VFIO MAP operations
on sva1 (assuming userspace should not mix sva1 and sva2 in the same
container and instead uses /dev/ioasid to map for sva1)? And even for
a sva-capable device there is a window before the guest actually enables
sva on that device, so should VFIO still accept MAP in that window
and then deny it after sva is enabled by the guest? This all sounds
unnecessarily complex while there is already a clean way to achieve it...

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-05 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, April 6, 2021 7:40 AM
> 
> On Fri, Apr 02, 2021 at 07:58:02AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Thursday, April 1, 2021 9:47 PM
> > >
> > > On Thu, Apr 01, 2021 at 01:43:36PM +, Liu, Yi L wrote:
> > > > > From: Jason Gunthorpe 
> > > > > Sent: Thursday, April 1, 2021 9:16 PM
> > > > >
> > > > > On Thu, Apr 01, 2021 at 01:10:48PM +, Liu, Yi L wrote:
> > > > > > > From: Jason Gunthorpe 
> > > > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > > > [...]
> > > > > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > > > > ENQCMD, but that is not consistent with the industry. We need to
> see
> > > > > > > normal nested PASID support with assigned PCI VFs.
> > > > > >
> > > > > > I'm not quite following here. Intel also allows PASID usage in guest
> without
> > > > > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without
> > > > > ENQCMD.
> > > > >
> > > > > Then you need all the parts, the hypervisor calls from the vIOMMU,
> and
> > > > > you can't really use a vPASID.
> > > >
> > > > This is a diagram shows the vSVA setup.
> > >
> > > I'm not talking only about vSVA. Generic PASID support with arbitrary
> > > mappings.
> > >
> > > And how do you deal with the vPASID vs pPASID issue if the system has
> > > a mix of physical devices and mdevs?
> > >
> >
> > We plan to support two schemes. One is vPASID identity-mapped to
> > pPASID then the mixed scenario just works, with the limitation of
> > lacking live migration support. The other is non-identity-mapped
> > scheme, where live migration is supported but physical devices and
> > mdevs should not be mixed in one VM if both expose SVA capability
> > (requires some filtering check in Qemu).
> 
> That just becomes "block vPASID support if any device that
> doesn't use ENQCMD is plugged into the guest"

The limitation is only for physical devices, and in reality it is not that
bad. To support live migration with physical device we anyway need 
additional work to migrate the device state (e.g. based on Max's work), 
then it's not unreasonable to also mediate guest programming of 
device specific PASID register to enable vPASID (need to translate in
the whole VM lifespan but likely is not a hot path).

> 
> Which needs a special VFIO capability of some kind so qemu knows to
> block it. This really needs to all be laid out together so someone
> can understand it :(

Or it could simply be based on whether the VFIO device supports live migration.

> 
> Why doesn't the siov cookbook explain this stuff??
> 
> > We hope the /dev/ioasid can support both schemes, with the minimal
> > requirement of allowing userspace to tag a vPASID to a pPASID and
> > allowing mdev to translate vPASID into pPASID, i.e. not assuming that
> > the guest will always use pPASID.
> 
> What I'm unclear of is if /dev/ioasid even needs to care about
> vPASID or if vPASID is just a hidden artifact of the KVM connection to
> setup the translation table and the vIOMMU driver in qemu.

Not just for KVM. Also required by mdev, which needs to translate
vPASID into pPASID when ENQCMD is not used. As I replied in another
mail, possibly we don't need /dev/ioasid to know this fact, which 
should only care about the operations related to pPASID. VFIO could
carry vPASID information to mdev. KVM should have its own interface
to know this information, as you suggested earlier.

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-05 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, April 6, 2021 7:35 AM
> 
> On Fri, Apr 02, 2021 at 07:30:23AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Friday, April 2, 2021 12:04 AM
> > >
> > > On Thu, Apr 01, 2021 at 02:08:17PM +, Liu, Yi L wrote:
> > >
> > > > DMA page faults are delivered to root-complex via page request
> message
> > > and
> > > > it is per-device according to PCIe spec. Page request handling flow is:
> > > >
> > > > 1) iommu driver receives a page request from device
> > > > 2) iommu driver parses the page request message. Get the RID,PASID,
> > > faulted
> > > >page and requested permissions etc.
> > > > 3) iommu driver triggers fault handler registered by device driver with
> > > >iommu_report_device_fault()
> > >
> > > This seems confused.
> > >
> > > The PASID should define how to handle the page fault, not the driver.
> > >
> > > I don't remember any device specific actions in ATS, so what is the
> > > driver supposed to do?
> > >
> > > > 4) device driver's fault handler signals an event FD to notify userspace
> to
> > > >fetch the information about the page fault. If it's VM case, inject 
> > > > the
> > > >page fault to VM and let guest to solve it.
> > >
> > > If the PASID is set to 'report page fault to userspace' then some
> > > event should come out of /dev/ioasid, or be reported to a linked
> > > eventfd, or whatever.
> > >
> > > If the PASID is set to 'SVM' then the fault should be passed to
> > > handle_mm_fault
> > >
> > > And so on.
> > >
> > > Userspace chooses what happens based on how they configure the PASID
> > > through /dev/ioasid.
> > >
> > > Why would a device driver get involved here?
> > >
> > > > Eric has sent below series for the page fault reporting for VM with
> passthru
> > > > device.
> > > > https://lore.kernel.org/kvm/20210223210625.604517-5-
> > > eric.au...@redhat.com/
> > >
> > > It certainly should not be in vfio pci. Everything using a PASID needs
> > > this infrastructure, VDPA, mdev, PCI, CXL, etc.
> > >
> >
> > This touches an interesting fact:
> >
> > The fault may be triggered in either 1st-level or 2nd-level page table,
> > when nested translation is enabled (in vSVA case). The 1st-level is bound
> > by the user space, which therefore needs to receive the fault event. The
> > 2nd-level is managed by VFIO (or vDPA), which needs to fix the fault in
> > kernel (e.g. find HVA per faulting GPA, call handle_mm_fault and map
> > GPA->HPA to IOMMU). Yi's current proposal lets VFIO register the
> > device fault handler, which then forwards the event through /dev/ioasid
> > to userspace only if it is a 1st-level fault. Are you suggesting a pgtable-
> > centric fault reporting mechanism to separate handlers in each level,
> > i.e. letting VFIO register handler only for 2nd-level fault and then /dev/
> > ioasid register handler for 1st-level fault?
> 
> This I'm struggling to understand. /dev/ioasid should handle all the
> faults cases, why would VFIO ever get involved in a fault? What would
> it even do?
> 
> If the fault needs to be fixed in the hypervisor then it is a kernel
> fault and it does handle_mm_fault. This absolutely should not be in
> VFIO or VDPA

With nested translation it is GVA->GPA->HPA. The kernel needs to
fix faults related to GPA->HPA (managed by VFIO/VDPA) while
handle_mm_fault only handles HVA->HPA. In this case, the 2nd-level
page fault is expected to be delivered to VFIO/VDPA first which then
find HVA related to GPA, call handle_mm_fault to fix HVA->HPA,
and then call iommu_map to fix GPA->HPA in the IOMMU page table.
This is exactly like how CPU EPT violation is handled.
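
A hypothetical sketch of that 2nd-level fix-up, mirroring how VFIO type1 pins and maps guest memory (names and flow are illustrative only):

	/* Sketch: fix a 2nd-level (GPA->HPA) fault in the kernel */
	static int fixup_s2_fault(struct iommu_domain *domain, unsigned long hva,
				  unsigned long gpa)
	{
		struct page *page;
		long ret;

		/* Fault in and pin the host page backing the faulting GPA;
		 * this internally does the handle_mm_fault work for HVA->HPA */
		ret = pin_user_pages_unlocked(hva, 1, &page,
					      FOLL_WRITE | FOLL_LONGTERM);
		if (ret != 1)
			return ret < 0 ? ret : -EFAULT;

		/* Install the GPA->HPA translation in the 2nd-level table */
		ret = iommu_map(domain, gpa, page_to_phys(page), PAGE_SIZE,
				IOMMU_READ | IOMMU_WRITE);
		if (ret)
			unpin_user_page(page);
		return ret;
	}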

> 
> If the fault needs to be fixed in the guest, then it needs to be
> delivered over /dev/ioasid in some way and injected into the
> vIOMMU. VFIO and VDPA have nothing to do with vIOMMU driver in quemu.
> 
> You need to have an interface under /dev/ioasid to create both page
> table levels and part of that will be to tell the kernel what VA is
> mapped and how to handle faults.

VFIO/VDPA already have their own interface to manage GPA->HPA
mappings. Why do we want to duplicate it in /dev/ioasid? 

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-02 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Friday, April 2, 2021 3:58 PM
> 
> > From: Jason Gunthorpe 
> > Sent: Thursday, April 1, 2021 9:47 PM
> >
> > On Thu, Apr 01, 2021 at 01:43:36PM +, Liu, Yi L wrote:
> > > > From: Jason Gunthorpe 
> > > > Sent: Thursday, April 1, 2021 9:16 PM
> > > >
> > > > On Thu, Apr 01, 2021 at 01:10:48PM +, Liu, Yi L wrote:
> > > > > > From: Jason Gunthorpe 
> > > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > > [...]
> > > > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > > > ENQCMD, but that is not consistent with the industry. We need to
> see
> > > > > > normal nested PASID support with assigned PCI VFs.
> > > > >
> > > > > I'm not quite following here. Intel also allows PASID usage in guest
> > > > > without
> > > > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without
> > > > ENQCMD.
> > > >
> > > > Then you need all the parts, the hypervisor calls from the vIOMMU, and
> > > > you can't really use a vPASID.
> > >
> > > This is a diagram shows the vSVA setup.
> >
> > I'm not talking only about vSVA. Generic PASID support with arbitrary
> > mappings.
> >
> > And how do you deal with the vPASID vs pPASID issue if the system has
> > a mix of physical devices and mdevs?
> >
> 
> We plan to support two schemes. One is vPASID identity-mapped to
> pPASID then the mixed scenario just works, with the limitation of
> lacking live migration support. The other is non-identity-mapped
> scheme, where live migration is supported but physical devices and
> mdevs should not be mixed in one VM if both expose SVA capability
> (requires some filtering check in Qemu). Although we have some
> ideas for relaxing this restriction in the non-identity scheme, it requires
> more thinking given how the vSVA uAPI is being refactored.
> 
> In both cases the virtual VT-d will report a virtual capability to the guest,
> indicating that the guest must request PASID through a vcmd register
> instead of creating its own namespace. The vIOMMU returns a vPASID
> to the guest upon request. The vPASID could be directly mapped to a
> pPASID or allocated from a new namespace based on user configuration.
> 
> We hope the /dev/ioasid can support both schemes, with the minimal
> requirement of allowing userspace to tag a vPASID to a pPASID and
> allowing mdev to translate vPASID into pPASID, i.e. not assuming that
> the guest will always use pPASID.
> 

Per your comments in other threads I suppose this requirement should
be implemented in the VFIO_ALLOW_PASID command instead of going
through /dev/ioasid, which only needs to know the pPASID and its pgtable
management.

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-02 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, March 30, 2021 9:29 PM
> 
> >
> > First, userspace may use ioasid in a non-SVA scenario where ioasid is
> > bound to specific security context (e.g. a control vq in vDPA) instead of
> > tying to mm. In this case there is no pgtable binding initiated from user
> > space. Instead, ioasid is allocated from /dev/ioasid and then programmed
> > to the intended security context through specific passthrough framework
> > which manages that context.
> 
> This sounds like the exact opposite of what I'd like to see.
> 
> I do not want to see every subsystem gaining APIs to program a
> PASID. All of that should be consolidated in *one place*.
> 
> I do not want to see VDPA and VFIO have two nearly identical sets of
> APIs to control the PASID.
> 
> Drivers consuming a PASID, like VDPA, should consume the PASID and do
> nothing more than authorize the HW to use it.
> 
> quemu should have general code under the viommu driver that drives
> /dev/ioasid to create PASID's and manage the IO mapping according to
> the guest's needs.
> 
> Drivers like VDPA and VFIO should simply accept that PASID and
> configure/authorize their HW to do DMA's with its tag.
> 

I agree with you on consolidating things in one place (especially for the
general SVA support). But here I was referring to a usage without
pgtable binding (Possibly Jason. W can say more here), where the 
userspace just wants to allocate PASIDs, program/accept PASIDs to 
various workqueues (device specific), and then use MAP/UNMAP 
interface to manage address spaces associated with each PASID. 
I just wanted to point out that the latter two steps are through 
VFIO/VDPA specific interfaces. 

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-02 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, April 1, 2021 9:47 PM
> 
> On Thu, Apr 01, 2021 at 01:43:36PM +, Liu, Yi L wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Thursday, April 1, 2021 9:16 PM
> > >
> > > On Thu, Apr 01, 2021 at 01:10:48PM +, Liu, Yi L wrote:
> > > > > From: Jason Gunthorpe 
> > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > [...]
> > > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > > ENQCMD, but that is not consistent with the industry. We need to see
> > > > > normal nested PASID support with assigned PCI VFs.
> > > >
> > > > > I'm not quite following here. Intel also allows PASID usage in guest without
> > > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without
> > > ENQCMD.
> > >
> > > Then you need all the parts, the hypervisor calls from the vIOMMU, and
> > > you can't really use a vPASID.
> >
> > This is a diagram shows the vSVA setup.
> 
> I'm not talking only about vSVA. Generic PASID support with arbitrary
> mappings.
> 
> And how do you deal with the vPASID vs pPASID issue if the system has
> a mix of physical devices and mdevs?
> 

We plan to support two schemes. One is vPASID identity-mapped to
pPASID then the mixed scenario just works, with the limitation of 
lacking live migration support. The other is non-identity-mapped
scheme, where live migration is supported but physical devices and 
mdevs should not be mixed in one VM if both expose SVA capability 
(requires some filtering check in Qemu). Although we have some
ideas for relaxing this restriction in the non-identity scheme, it requires
more thinking given how the vSVA uAPI is being refactored.

In both cases the virtual VT-d will report a virtual capability to the guest,
indicating that the guest must request PASID through a vcmd register
instead of creating its own namespace. The vIOMMU returns a vPASID 
to the guest upon request. The vPASID could be directly mapped to a 
pPASID or allocated from a new namespace based on user configuration.

We hope the /dev/ioasid can support both schemes, with the minimal
requirement of allowing userspace to tag a vPASID to a pPASID and
allowing mdev to translate vPASID into pPASID, i.e. not assuming that
the guest will always use pPASID.

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-02 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Friday, April 2, 2021 12:04 AM
> 
> On Thu, Apr 01, 2021 at 02:08:17PM +, Liu, Yi L wrote:
> 
> > DMA page faults are delivered to root-complex via page request message
> and
> > it is per-device according to PCIe spec. Page request handling flow is:
> >
> > 1) iommu driver receives a page request from device
> > 2) iommu driver parses the page request message. Get the RID,PASID,
> faulted
> >page and requested permissions etc.
> > 3) iommu driver triggers fault handler registered by device driver with
> >iommu_report_device_fault()
> 
> This seems confused.
> 
> The PASID should define how to handle the page fault, not the driver.
> 
> I don't remember any device specific actions in ATS, so what is the
> driver supposed to do?
> 
> > 4) device driver's fault handler signals an event FD to notify userspace to
> >fetch the information about the page fault. If it's VM case, inject the
> >page fault to VM and let guest to solve it.
> 
> If the PASID is set to 'report page fault to userspace' then some
> event should come out of /dev/ioasid, or be reported to a linked
> eventfd, or whatever.
> 
> If the PASID is set to 'SVM' then the fault should be passed to
> handle_mm_fault
> 
> And so on.
> 
> Userspace chooses what happens based on how they configure the PASID
> through /dev/ioasid.
> 
> Why would a device driver get involved here?
> 
> > Eric has sent below series for the page fault reporting for VM with passthru
> > device.
> > https://lore.kernel.org/kvm/20210223210625.604517-5-
> eric.au...@redhat.com/
> 
> It certainly should not be in vfio pci. Everything using a PASID needs
> this infrastructure, VDPA, mdev, PCI, CXL, etc.
> 

This touches an interesting fact:

The fault may be triggered in either 1st-level or 2nd-level page table, 
when nested translation is enabled (in vSVA case). The 1st-level is bound 
by the user space, which therefore needs to receive the fault event. The 
2nd-level is managed by VFIO (or vDPA), which needs to fix the fault in 
kernel (e.g. find HVA per faulting GPA, call handle_mm_fault and map 
GPA->HPA to IOMMU). Yi's current proposal lets VFIO to register the 
device fault handler, which then forward the event through /dev/ioasid 
to userspace only if it is a 1st-level fault. Are you suggesting a pgtable-
centric fault reporting mechanism to separate handlers in each level, 
i.e. letting VFIO register handler only for 2nd-level fault and then /dev/
ioasid register handler for 1st-level fault?

Thanks
Kevin



RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-29 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Tuesday, March 30, 2021 10:24 AM
> 
> > From: Jason Gunthorpe 
> > Sent: Tuesday, March 30, 2021 12:32 AM
> > > In terms of usage for guest SVA, an ioasid_set is mostly tied to a host 
> > > mm,
> > > the use case is as the following:
> >
> > From that doc:
> >
> >   It is imperative to enforce
> >   VM-IOASID ownership such that a malicious guest cannot target DMA
> >   traffic outside its own IOASIDs, or free an active IOASID that belongs
> >   to another VM.
> >
> > Huh?
> >
> > Security in a PASID world comes from the IOMMU blocking access to the
> > PASID except from approved PCI-ID's. If a VF/PF is assigned to a guest
> > then that guest can cause the device to issue any PASID by having
> > complete control and the vIOMMU is supposed to tell the real IOMMU
> > what PASID's the device is allowed to access.
> >
> > If a device is sharing a single PCI function with different security
> > contexts (eg vfio mdev) then the device itself is responsible to
> > ensure that only the secure interface can program a PASID and a less
> > secure context can never self-enroll.
> >
> > Here the mdev driver would have to consult with the vIOMMU to ensure
> > the mdev device is allowed to access the PASID - is that what this
> > set stuff is about?
> >
> > If yes, it is backwards. The MDEV is the thing doing the security, the
> > MDEV should have the list of allowed PASID's and a single PASID
> > created under /dev/ioasid should be loaded into MDEV with some 'Ok you
> > can use PASID xyz from FD abc' command.
> >
> 
> The 'set' is per-VM. Once the mdev is assigned to a VM, all valid PASID's
> in the set of that VM are considered legitimate on this mdev. The mdev
> driver will mediate guest operations which program PASID to the backend
> context and load the PASID only if it is within the 'set' (i.e. already
> allocated through /dev/ioasid). This prevents a malicious VM from attacking
> others. Though it's not the mdev that directly maintains the list of allowed
> PASID's, the effect is the same in concept.
> 

One correction. The mdev should still construct the list of allowed PASID's as
you said (by listening to IOASID_BIND/UNBIND events), in addition to the ioasid 
set maintained per VM (updated when a PASID is allocated/freed). The per-VM
set is required for inter-VM isolation (verified when a pgtable is bound to the 
mdev/PASID), while the mdev's own list is necessary for intra-VM isolation when 
multiple mdevs are assigned to the same VM (verified before loading a PASID 
into the mdev). This series just handles the general part, i.e. the per-VM 
ioasid set, and leaves the mdev's own list to be managed by the specific mdev 
driver (which listens to various IOASID events).
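
In code the two checks might look roughly like this (all helper names are
hypothetical, for illustration only):

	/* inter-VM: verified when a pgtable is bound to the mdev/PASID */
	if (!ioasid_set_contains(vm->ioasid_set, pasid))
		return -EPERM;	/* PASID belongs to another VM */

	/* intra-VM: verified before loading a PASID into this mdev */
	if (!mdev_pasid_list_contains(mdev, pasid))
		return -EPERM;	/* PASID not bound to this mdev */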

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-29 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, March 30, 2021 12:32 AM
> > In terms of usage for guest SVA, an ioasid_set is mostly tied to a host mm,
> > the use case is as the following:
> 
> From that doc:
> 
>   It is imperative to enforce
>   VM-IOASID ownership such that a malicious guest cannot target DMA
>   traffic outside its own IOASIDs, or free an active IOASID that belongs
>   to another VM.
> 
> Huh?
> 
> Security in a PASID world comes from the IOMMU blocking access to the
> PASID except from approved PCI-ID's. If a VF/PF is assigned to a guest
> then that guest can cause the device to issue any PASID by having
> complete control and the vIOMMU is supposed to tell the real IOMMU
> what PASID's the device is allowed to access.
> 
> If a device is sharing a single PCI function with different security
> contexts (eg vfio mdev) then the device itself is responsible to
> ensure that only the secure interface can program a PASID and a less
> secure context can never self-enroll.
> 
> Here the mdev driver would have to consult with the vIOMMU to ensure
> the mdev device is allowed to access the PASID - is that what this
> set stuff is about?
> 
> If yes, it is backwards. The MDEV is the thing doing the security, the
> MDEV should have the list of allowed PASID's and a single PASID
> created under /dev/ioasid should be loaded into MDEV with some 'Ok you
> can use PASID xyz from FD abc' command.
> 

The 'set' is per-VM. Once the mdev is assigned to a VM, all valid PASID's
in the set of that VM are considered legitimate on this mdev. The mdev
driver will mediate guest operations which program PASID to the backend
context and load the PASID only if it is within the 'set' (i.e. already 
allocated through /dev/ioasid). This prevents a malicious VM from attacking
others. Though it's not mdev which directly maintaining the list of allowed 
PASID's, the effect is the same in concept.

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-29 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, March 30, 2021 12:32 AM
> 
> On Wed, Mar 24, 2021 at 12:05:28PM -0700, Jacob Pan wrote:
> 
> > > IMHO a use created PASID is either bound to a mm (current) at creation
> > > time, or it will never be bound to a mm and its page table is under
> > > user control via /dev/ioasid.
> > >
> > True for PASID used in native SVA bind. But for binding with a guest mm,
> > PASID is allocated first (VT-d virtual cmd interface Spec 10.4.44), then
> > bound with the host IOMMU when the vIOMMU PASID cache is invalidated.
> >
> > Our intention is to have two separate interfaces:
> > 1. /dev/ioasid (allocation/free only)
> > 2. /dev/sva (handles all SVA related activities including page tables)
> 
> I'm not sure I understand why you'd want to have two things. Doesn't
> that just complicate everything?
> 
> Manipulating the ioasid, including filling it with page tables, seems
> an integral inseparable part of the whole interface. Why have two?

Hi, Jason,

Actually the above is a major open question while we are refactoring the vSVA
uAPI in this direction. We have two concerns about merging /dev/ioasid with
/dev/sva, and would like to hear your thoughts on whether they are valid.

First, userspace may use an ioasid in a non-SVA scenario where the ioasid is 
bound to a specific security context (e.g. a control vq in vDPA) instead of 
being tied to a mm. In this case there is no pgtable binding initiated from 
userspace. Instead, the ioasid is allocated from /dev/ioasid and then programmed
into the intended security context through the specific passthrough framework
which manages that context.

Second, an ioasid is managed per process/VM while pgtable binding is a
per-device operation. The userspace flow would look like below for an integral
/dev/ioasid interface:

---initialization---
- ioctl(container->fd, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU)
- ioasid_fd = open(/dev/ioasid)
- ioctl(ioasid_fd, IOASID_GET_USVA_FD, &sva_fd) //an empty context
- ioctl(device->fd, VFIO_DEVICE_SET_SVA, &sva_fd); //sva_fd ties to device
- ioctl(sva_fd, USVA_GET_INFO, &sva_info);
---runtime---
- ioctl(ioasid_fd, IOMMU_ALLOC_IOASID, &ioasid);
- ioctl(sva_fd, USVA_BIND_PGTBL, &bind_data);
- ioctl(sva_fd, USVA_FLUSH_CACHE, &inv_info);
- ioctl(sva_fd, USVA_UNBIND_PGTBL, &bind_data);
---destroy---
- ioctl(device->fd, VFIO_DEVICE_UNSET_SVA, &sva_fd);
- close(sva_fd)
- close(ioasid_fd)

Our hesitation here is based on one of your earlier comments that
you are not a fan of constructing fd's through ioctl. Are you OK with
the above flow, or do you have a better idea of handling it?

With separate interfaces, userspace just opens /dev/sva instead 
of getting it through the ioasid_fd:

- ioasid_fd = open(/dev/ioasid)
- sva_fd = open(/dev/sva)

Thanks
Kevin


RE: [RFC PATCH v1 0/4] vfio: Add IOPF support for VFIO passthrough

2021-03-18 Thread Tian, Kevin
> From: Shenming Lu 
> Sent: Thursday, March 18, 2021 7:54 PM
> 
> On 2021/3/18 17:07, Tian, Kevin wrote:
> >> From: Shenming Lu 
> >> Sent: Thursday, March 18, 2021 3:53 PM
> >>
> >> On 2021/2/4 14:52, Tian, Kevin wrote:>>> In reality, many
> >>>>> devices allow I/O faulting only in selective contexts. However, there
> >>>>> is no standard way (e.g. PCISIG) for the device to report whether
> >>>>> arbitrary I/O fault is allowed. Then we may have to maintain device
> >>>>> specific knowledge in software, e.g. in an opt-in table to list devices
> >>>>> which allows arbitrary faults. For devices which only support selective
> >>>>> faulting, a mediator (either through vendor extensions on vfio-pci-core
> >>>>> or a mdev wrapper) might be necessary to help lock down non-
> faultable
> >>>>> mappings and then enable faulting on the rest mappings.
> >>>>
> >>>> For devices which only support selective faulting, they could tell it to 
> >>>> the
> >>>> IOMMU driver and let it filter out non-faultable faults? Do I get it 
> >>>> wrong?
> >>>
> >>> Not exactly to IOMMU driver. There is already a vfio_pin_pages() for
> >>> selectively page-pinning. The matter is that 'they' imply some device
> >>> specific logic to decide which pages must be pinned and such knowledge
> >>> is outside of VFIO.
> >>>
> >>> From enabling p.o.v we could possibly do it in phased approach. First
> >>> handles devices which tolerate arbitrary DMA faults, and then extends
> >>> to devices with selective-faulting. The former is simpler, but with one
> >>> main open whether we want to maintain such device IDs in a static
> >>> table in VFIO or rely on some hints from other components (e.g. PF
> >>> driver in VF assignment case). Let's see how Alex thinks about it.
> >>
> >> Hi Kevin,
> >>
> >> You mentioned selective-faulting some time ago. I still have some doubt
> >> about it:
> >> There is already a vfio_pin_pages() which is used for limiting the IOMMU
> >> group dirty scope to pinned pages, could it also be used for indicating
> >> the faultable scope is limited to the pinned pages and the rest mappings
> >> is non-faultable that should be pinned and mapped immediately? But it
> >> seems to be a little weird and not exactly to what you meant... I will
> >> be grateful if you can help to explain further. :-)
> >>
> >
> > The opposite, i.e. the vendor driver uses vfio_pin_pages to lock down
> > pages that are not faultable (based on its specific knowledge) and then
> > the rest memory becomes faultable.
> 
> Ahh...
> Thus, from the perspective of VFIO IOMMU, if IOPF enabled for such device,
> only the page faults within the pinned range are valid in the registered
> iommu fault handler...
> I have another question here, for the IOMMU backed devices, they are
> already
> all pinned and mapped when attaching, is there a need to call
> vfio_pin_pages()
> to lock down pages for them? Did I miss something?...
> 

If a device is marked as supporting I/O page faults (fully or selectively), 
there should be no pinning at attach or DMA_MAP time (as this series 
assumes). Then for devices with selective-faulting, the vendor driver will 
lock down the pages which are not faultable at run-time, 
e.g. when intercepting guest registration of a ring buffer...
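
A rough sketch of that run-time lock-down, using the pfn-array based
vfio_pin_pages() of this era (error handling trimmed; the interception
point and the helper name are made up):

#include <linux/iommu.h>
#include <linux/slab.h>
#include <linux/vfio.h>

/* pin the guest ring buffer; only the rest of memory stays faultable */
static int lock_down_ring_buffer(struct device *dev, unsigned long gfn,
				 int npages)
{
	unsigned long *user_pfn, *phys_pfn;
	int i, ret;

	user_pfn = kcalloc(npages, sizeof(*user_pfn), GFP_KERNEL);
	phys_pfn = kcalloc(npages, sizeof(*phys_pfn), GFP_KERNEL);
	if (!user_pfn || !phys_pfn) {
		ret = -ENOMEM;
		goto out;
	}

	for (i = 0; i < npages; i++)
		user_pfn[i] = gfn + i;

	/* pinned pages become the non-faultable subset */
	ret = vfio_pin_pages(dev, user_pfn, npages,
			     IOMMU_READ | IOMMU_WRITE, phys_pfn);
	if (ret > 0)
		ret = (ret == npages) ? 0 : -EFAULT;
out:
	kfree(user_pfn);
	kfree(phys_pfn);
	return ret;
}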

Thanks
Kevin


RE: [RFC PATCH v1 0/4] vfio: Add IOPF support for VFIO passthrough

2021-03-18 Thread Tian, Kevin
> From: Shenming Lu 
> Sent: Thursday, March 18, 2021 3:53 PM
> 
> On 2021/2/4 14:52, Tian, Kevin wrote:>>> In reality, many
> >>> devices allow I/O faulting only in selective contexts. However, there
> >>> is no standard way (e.g. PCISIG) for the device to report whether
> >>> arbitrary I/O fault is allowed. Then we may have to maintain device
> >>> specific knowledge in software, e.g. in an opt-in table to list devices
> >>> which allows arbitrary faults. For devices which only support selective
> >>> faulting, a mediator (either through vendor extensions on vfio-pci-core
> >>> or a mdev wrapper) might be necessary to help lock down non-faultable
> >>> mappings and then enable faulting on the rest mappings.
> >>
> >> For devices which only support selective faulting, they could tell it to 
> >> the
> >> IOMMU driver and let it filter out non-faultable faults? Do I get it wrong?
> >
> > Not exactly to IOMMU driver. There is already a vfio_pin_pages() for
> > selectively page-pinning. The matter is that 'they' imply some device
> > specific logic to decide which pages must be pinned and such knowledge
> > is outside of VFIO.
> >
> > From enabling p.o.v we could possibly do it in phased approach. First
> > handles devices which tolerate arbitrary DMA faults, and then extends
> > to devices with selective-faulting. The former is simpler, but with one
> > main open whether we want to maintain such device IDs in a static
> > table in VFIO or rely on some hints from other components (e.g. PF
> > driver in VF assignment case). Let's see how Alex thinks about it.
> 
> Hi Kevin,
> 
> You mentioned selective-faulting some time ago. I still have some doubt
> about it:
> There is already a vfio_pin_pages() which is used for limiting the IOMMU
> group dirty scope to pinned pages, could it also be used for indicating
> the faultable scope is limited to the pinned pages and the rest mappings
> is non-faultable that should be pinned and mapped immediately? But it
> seems to be a little weird and not exactly to what you meant... I will
> be grateful if you can help to explain further. :-)
> 

The opposite, i.e. the vendor driver uses vfio_pin_pages to lock down
pages that are not faultable (based on its specific knowledge) and then
the rest memory becomes faultable.

Thanks
Kevin


RE: A problem of Intel IOMMU hardware ?

2021-03-18 Thread Tian, Kevin
> From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> 
> 
> > -Original Message-
> > From: Tian, Kevin [mailto:kevin.t...@intel.com]
> > Sent: Thursday, March 18, 2021 4:27 PM
> > To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > ; Nadav Amit 
> > Cc: chenjiashang ; David Woodhouse
> > ; io...@lists.linux-foundation.org; LKML
> > ; alex.william...@redhat.com; Gonglei
> (Arei)
> > ; w...@kernel.org
> > Subject: RE: A problem of Intel IOMMU hardware ?
> >
> > > From: iommu  On Behalf Of
> > > Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > >
> > > > 2. Consider ensuring that the problem is not somehow related to
> > > > queued invalidations. Try to use __iommu_flush_iotlb() instead of
> > qi_flush_iotlb().
> > > >
> > >
> > > I tried to force to use __iommu_flush_iotlb(), but maybe something
> > > wrong, the system crashed, so I prefer to lower the priority of this
> operation.
> > >
> >
> > The VT-d spec clearly says that register-based invalidation can be used only
> when
> > queued-invalidations are not enabled. Intel-IOMMU driver doesn't provide
> an
> > option to disable queued-invalidation though, when the hardware is
> capable. If you
> > really want to try, tweak the code in intel_iommu_init_qi.
> >
> 
> Hi Kevin,
> 
> Thanks for pointing this out. Do you have any ideas about this problem? I tried
> to describe the problem more clearly in my reply to Alex; hope you could have
> a look if you're interested.
> 

btw I saw you used a 4.18 kernel in this test. What about the latest kernel?

Also, one way to separate a software bug from a hardware bug is to trace the
low level interface (e.g. qi_flush_iotlb) which actually sends invalidation
descriptors to the IOMMU hardware. Check the window between b) and c) and
see whether the software does the right thing as expected there. 
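
For instance, a throwaway kprobe module can log every submission
(debug-only sketch; it assumes qi_flush_iotlb is kprobe-able and reads
the arguments from registers per the x86-64 calling convention, which
is fragile but fine for this kind of experiment):

#include <linux/kprobes.h>
#include <linux/module.h>

/* log each IOTLB invalidation descriptor submission */
static int qi_pre(struct kprobe *p, struct pt_regs *regs)
{
	/* x86-64 SysV: rdi=iommu, rsi=did, rdx=addr, rcx=size_order, r8=type */
	pr_info("qi_flush_iotlb: did=%lu addr=0x%lx order=%lu type=%lu\n",
		regs->si & 0xffff, regs->dx, regs->cx, regs->r8);
	return 0;
}

static struct kprobe kp = {
	.symbol_name = "qi_flush_iotlb",
	.pre_handler = qi_pre,
};

static int __init qi_trace_init(void)
{
	return register_kprobe(&kp);
}

static void __exit qi_trace_exit(void)
{
	unregister_kprobe(&kp);
}

module_init(qi_trace_init);
module_exit(qi_trace_exit);
MODULE_LICENSE("GPL");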

Thanks
Kevin


RE: A problem of Intel IOMMU hardware ?

2021-03-18 Thread Tian, Kevin
> From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> 
> 
> 
> > -Original Message-----
> > From: Tian, Kevin [mailto:kevin.t...@intel.com]
> > Sent: Thursday, March 18, 2021 4:27 PM
> > To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > ; Nadav Amit 
> > Cc: chenjiashang ; David Woodhouse
> > ; io...@lists.linux-foundation.org; LKML
> > ; alex.william...@redhat.com; Gonglei
> (Arei)
> > ; w...@kernel.org
> > Subject: RE: A problem of Intel IOMMU hardware ?
> >
> > > From: iommu  On Behalf Of
> > > Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > >
> > > > 2. Consider ensuring that the problem is not somehow related to
> > > > queued invalidations. Try to use __iommu_flush_iotlb() instead of
> > qi_flush_iotlb().
> > > >
> > >
> > > I tried to force to use __iommu_flush_iotlb(), but maybe something
> > > wrong, the system crashed, so I prefer to lower the priority of this
> operation.
> > >
> >
> > The VT-d spec clearly says that register-based invalidation can be used only
> when
> > queued-invalidations are not enabled. Intel-IOMMU driver doesn't provide
> an
> > option to disable queued-invalidation though, when the hardware is
> capable. If you
> > really want to try, tweak the code in intel_iommu_init_qi.
> >
> 
> Hi Kevin,
> 
> Thanks for pointing this out. Do you have any ideas about this problem? I tried
> to describe the problem more clearly in my reply to Alex; hope you could have
> a look if you're interested.
> 

I agree with Nadav. It looks like some stale paging-structure cache entry
(e.g. a PMD) is not being invalidated properly. It would be better if Baolu can
reproduce this problem in his local environment and then do more debugging to
identify whether it's a software or hardware defect.

btw what is the device under test? Does it support ATS?

Thanks
Kevin


RE: A problem of Intel IOMMU hardware ?

2021-03-18 Thread Tian, Kevin
> From: iommu  On Behalf Of
> Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> 
> > 2. Consider ensuring that the problem is not somehow related to queued
> > invalidations. Try to use __iommu_flush_iotlb() instead of qi_flush_iotlb().
> >
> 
> I tried to force the use of __iommu_flush_iotlb(), but maybe something went
> wrong, the system crashed, so I prefer to lower the priority of this operation.
> 

The VT-d spec clearly says that register-based invalidation can be used
only when queued-invalidations are not enabled. Intel-IOMMU driver
doesn't provide an option to disable queued-invalidation though, when
the hardware is capable. If you really want to try, tweak the code in
intel_iommu_init_qi.

Thanks
Kevin


RE: [PATCH RFC v1 12/15] iommu/virtio: Add support for INVALIDATE request

2021-03-03 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Thursday, March 4, 2021 2:29 AM
> 
> Hi Vivek,
> 
> On Fri, 15 Jan 2021 17:43:39 +0530, Vivek Gautam 
> wrote:
> 
> > From: Jean-Philippe Brucker 
> >
> > Add support for tlb invalidation ops that can send invalidation
> > requests to back-end virtio-iommu when stage-1 page tables are
> > supported.
> >
> Just curious if it possible to reuse the iommu uapi for invalidation and 
> others.
> When we started out designing the iommu uapi, the intention was to support
> both emulated and virtio iommu.

IIUC this patch is about the protocol between the virtio-iommu frontend and
backend. After the virtio-iommu backend receives invalidation ops, it then
needs to forward the request to the host IOMMU driver through the existing
iommu uapi that you referred to, as an emulated VT-d or SMMU would do.

Thanks
Kevin

> 
> > Signed-off-by: Jean-Philippe Brucker 
> > [Vivek: Refactoring the iommu_flush_ops, and adding only one pasid sync
> > op that's needed with current iommu-pasid-table infrastructure.
> > Also updating uapi defines as required by latest changes]
> > Signed-off-by: Vivek Gautam 
> > Cc: Joerg Roedel 
> > Cc: Will Deacon 
> > Cc: Michael S. Tsirkin 
> > Cc: Robin Murphy 
> > Cc: Jean-Philippe Brucker 
> > Cc: Eric Auger 
> > Cc: Alex Williamson 
> > Cc: Kevin Tian 
> > Cc: Jacob Pan 
> > Cc: Liu Yi L 
> > Cc: Lorenzo Pieralisi 
> > Cc: Shameerali Kolothum Thodi 
> > ---
> >  drivers/iommu/virtio-iommu.c | 95
> 
> >  1 file changed, 95 insertions(+)
> >
> > diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
> > index ae5dfd3f8269..004ea94e3731 100644
> > --- a/drivers/iommu/virtio-iommu.c
> > +++ b/drivers/iommu/virtio-iommu.c
> > @@ -13,6 +13,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -63,6 +64,8 @@ struct viommu_mapping {
> >  };
> >
> >  struct viommu_mm {
> > +   int pasid;
> > +   u64 archid;
> > struct io_pgtable_ops   *ops;
> > struct viommu_domain*domain;
> >  };
> > @@ -692,6 +695,98 @@ static void viommu_event_handler(struct
> virtqueue
> > *vq) virtqueue_kick(vq);
> >  }
> >
> > +/* PASID and pgtable APIs */
> > +
> > +static void __viommu_flush_pasid_tlb_all(struct viommu_domain
> *vdomain,
> > +int pasid, u64 arch_id, int
> > type) +{
> > +   struct virtio_iommu_req_invalidate req = {
> > +   .head.type  = VIRTIO_IOMMU_T_INVALIDATE,
> > +   .inv_gran   =
> > cpu_to_le32(VIRTIO_IOMMU_INVAL_G_PASID),
> > +   .flags  =
> > cpu_to_le32(VIRTIO_IOMMU_INVAL_F_PASID),
> > +   .inv_type   = cpu_to_le32(type),
> > +
> > +   .domain = cpu_to_le32(vdomain->id),
> > +   .pasid  = cpu_to_le32(pasid),
> > +   .archid = cpu_to_le64(arch_id),
> > +   };
> > +
> > +   if (viommu_send_req_sync(vdomain->viommu, , sizeof(req)))
> > +   pr_debug("could not send invalidate request\n");
> > +}
> > +
> > +static void viommu_flush_tlb_add(struct iommu_iotlb_gather *gather,
> > +unsigned long iova, size_t granule,
> > +void *cookie)
> > +{
> > +   struct viommu_mm *viommu_mm = cookie;
> > +   struct viommu_domain *vdomain = viommu_mm->domain;
> > +   struct iommu_domain *domain = >domain;
> > +
> > +   iommu_iotlb_gather_add_page(domain, gather, iova, granule);
> > +}
> > +
> > +static void viommu_flush_tlb_walk(unsigned long iova, size_t size,
> > + size_t granule, void *cookie)
> > +{
> > +   struct viommu_mm *viommu_mm = cookie;
> > +   struct viommu_domain *vdomain = viommu_mm->domain;
> > +   struct virtio_iommu_req_invalidate req = {
> > +   .head.type  = VIRTIO_IOMMU_T_INVALIDATE,
> > +   .inv_gran   = cpu_to_le32(VIRTIO_IOMMU_INVAL_G_VA),
> > +   .inv_type   =
> cpu_to_le32(VIRTIO_IOMMU_INV_T_IOTLB),
> > +   .flags  =
> > cpu_to_le32(VIRTIO_IOMMU_INVAL_F_ARCHID), +
> > +   .domain = cpu_to_le32(vdomain->id),
> > +   .pasid  = cpu_to_le32(viommu_mm->pasid),
> > +   .archid = cpu_to_le64(viommu_mm->archid),
> > +   .virt_start = cpu_to_le64(iova),
> > +   .nr_pages   = cpu_to_le64(size / granule),
> > +   .granule= ilog2(granule),
> > +   };
> > +
> > +   if (viommu_add_req(vdomain->viommu, , sizeof(req)))
> > +   pr_debug("could not add invalidate request\n");
> > +}
> > +
> > +static void viommu_flush_tlb_all(void *cookie)
> > +{
> > +   struct viommu_mm *viommu_mm = cookie;
> > +
> > +   if (!viommu_mm->archid)
> > +   return;
> > +
> > +   __viommu_flush_pasid_tlb_all(viommu_mm->domain,
> viommu_mm->pasid,
> > +viommu_mm->archid,
> > + 

RE: [PATCH 2/4] iommu/vt-d: Enable write protect propagation from guest

2021-02-19 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Saturday, February 20, 2021 1:09 AM
> 
> Hi Kevin,
> 
> On Fri, 19 Feb 2021 06:19:04 +, "Tian, Kevin" 
> wrote:
> 
> > > From: Jacob Pan 
> > > Sent: Friday, February 19, 2021 5:31 AM
> > >
> > > Write protect bit, when set, inhibits supervisor writes to the read-only
> > > pages. In guest supervisor shared virtual addressing (SVA),
> > > write-protect should be honored upon guest bind supervisor PASID
> > > request.
> > >
> > > This patch extends the VT-d portion of the IOMMU UAPI to include WP bit.
> > > WPE bit of the  supervisor PASID entry will be set to match CPU CR0.WP
> > > bit.
> > >
> > > Signed-off-by: Sanjay Kumar 
> > > Signed-off-by: Jacob Pan 
> > > ---
> > >  drivers/iommu/intel/pasid.c | 5 +
> > >  include/uapi/linux/iommu.h  | 3 ++-
> > >  2 files changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
> > > index 0b7e0e726ade..c7a2ec930af4 100644
> > > --- a/drivers/iommu/intel/pasid.c
> > > +++ b/drivers/iommu/intel/pasid.c
> > > @@ -763,6 +763,11 @@ intel_pasid_setup_bind_data(struct
> intel_iommu
> > > *iommu, struct pasid_entry *pte,
> > >   return -EINVAL;
> > >   }
> > >   pasid_set_sre(pte);
> > > + /* Enable write protect WP if guest requested */
> > > + if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_WPE) {
> > > + if (pasid_enable_wpe(pte))
> > > + return -EINVAL;
> >
> > We should call pasid_set_wpe directly, as this binding is about guest
> > page table and suppose the guest has done whatever check required
> > (e.g. gcr0.wp) before setting this bit. pasid_enable_wpe has an
> > additional check on host cr0.wp thus is logically incorrect here.
> >
> If the host CPU does not support WP, can guest VCPU still support WP? If
> so, I agree.
> 

If you change 'support' to 'enable', then the answer is yes.


RE: [PATCH 2/4] iommu/vt-d: Enable write protect propagation from guest

2021-02-18 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Friday, February 19, 2021 5:31 AM
> 
> Write protect bit, when set, inhibits supervisor writes to the read-only
> pages. In guest supervisor shared virtual addressing (SVA), write-protect
> should be honored upon guest bind supervisor PASID request.
> 
> This patch extends the VT-d portion of the IOMMU UAPI to include WP bit.
> WPE bit of the  supervisor PASID entry will be set to match CPU CR0.WP bit.
> 
> Signed-off-by: Sanjay Kumar 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/intel/pasid.c | 5 +
>  include/uapi/linux/iommu.h  | 3 ++-
>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
> index 0b7e0e726ade..c7a2ec930af4 100644
> --- a/drivers/iommu/intel/pasid.c
> +++ b/drivers/iommu/intel/pasid.c
> @@ -763,6 +763,11 @@ intel_pasid_setup_bind_data(struct intel_iommu
> *iommu, struct pasid_entry *pte,
>   return -EINVAL;
>   }
>   pasid_set_sre(pte);
> + /* Enable write protect WP if guest requested */
> + if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_WPE) {
> + if (pasid_enable_wpe(pte))
> + return -EINVAL;

We should call pasid_set_wpe directly, as this binding is about the guest
page table and we assume the guest has done whatever check is required
(e.g. gcr0.wp) before setting this bit. pasid_enable_wpe has an additional 
check on host cr0.wp and is thus logically incorrect here.
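
i.e. something along these lines (a sketch of the suggested change, not
tested):

	/* set WPE per guest request; the guest already checked gCR0.WP */
	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_WPE)
		pasid_set_wpe(pte);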

Thanks
Kevin

> + }
>   }
> 
>   if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_EAFE) {
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 68cb558fe8db..33f3dc7a91de 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -288,7 +288,8 @@ struct iommu_gpasid_bind_data_vtd {
>  #define IOMMU_SVA_VTD_GPASID_PWT (1 << 3) /* page-level write
> through */
>  #define IOMMU_SVA_VTD_GPASID_EMTE(1 << 4) /* extended mem
> type enable */
>  #define IOMMU_SVA_VTD_GPASID_CD  (1 << 5) /* PASID-level
> cache disable */
> -#define IOMMU_SVA_VTD_GPASID_LAST(1 << 6)
> +#define IOMMU_SVA_VTD_GPASID_WPE (1 << 6) /* Write protect
> enable */
> +#define IOMMU_SVA_VTD_GPASID_LAST(1 << 7)
>   __u64 flags;
>   __u32 pat;
>   __u32 emt;
> --
> 2.25.1



RE: [PATCH v5 05/14] vfio/mdev: idxd: add basic mdev registration and helper functions

2021-02-18 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, February 17, 2021 5:33 AM
> 
> On Tue, Feb 16, 2021 at 12:04:55PM -0700, Dave Jiang wrote:
> 
> > > > +   return remap_pfn_range(vma, vma->vm_start, pgoff, req_size,
> pg_prot);
> > > Nothing validated req_size - did you copy this from the Intel RDMA
> > > driver? It had a huge security bug just like this.
> 
> > Thanks. Will add. Some of the code came from the Intel i915 mdev
> > driver.
> 
> Please make sure it is fixed as well, the security bug is huge.
> 

It was already fixed two years ago:

commit 51b00d8509dc69c98740da2ad07308b630d3eb7d
Author: Zhenyu Wang 
Date:   Fri Jan 11 13:58:53 2019 +0800

drm/i915/gvt: Fix mmap range check

This is to fix missed mmap range check on vGPU bar2 region
and only allow to map vGPU allocated GMADDR range, which means
user space should support sparse mmap to get proper offset for
mmap vGPU aperture. And this takes care of actual pgoff in mmap
request as original code always does from beginning of vGPU
aperture.

Fixes: 659643f7d814 ("drm/i915/gvt/kvmgt: add vfio/mdev support to KVMGT")
Cc: "Monroy, Rodrigo Axel" 
Cc: "Orrala Contreras, Alfredo" 
Cc: sta...@vger.kernel.org # v4.10+
Reviewed-by: Hang Yuan 
Signed-off-by: Zhenyu Wang 

Thanks
Kevin



RE: [PATCH v2 0/9] Introduce vfio-pci-core subsystem

2021-02-09 Thread Tian, Kevin
Hi, Max,

> From: Max Gurtovoy 
> Sent: Tuesday, February 2, 2021 12:28 AM
> 
> Hi Alex and Cornelia,
> 
> This series split the vfio_pci driver into 2 parts: pci driver and a
> subsystem driver that will also be library of code. The pci driver,
> vfio_pci.ko will be used as before and it will bind to the subsystem
> driver vfio_pci_core.ko to register to the VFIO subsystem. This patchset
> is fully backward compatible. This is a typical Linux subsystem
> framework behaviour. This framework can be also adopted by vfio_mdev
> devices as we'll see in the below sketch.
> 
> This series is coming to solve the issues that were raised in the
> previous attempt for extending vfio-pci for vendor specific
> functionality: https://lkml.org/lkml/2020/5/17/376 by Yan Zhao.
> 
> This solution is also deterministic in a sense that when a user will
> bind to a vendor specific vfio-pci driver, it will get all the special
> goodies of the HW.
> 
> This subsystem framework will also ease on adding vendor specific
> functionality to VFIO devices in the future by allowing another module
> to provide the pci_driver that can setup number of details before
> registering to VFIO subsystem (such as inject its own operations).

I'm a bit confused about the change from v1 to v2, especially about
how to inject module specific operations. From a live migration p.o.v
it may require at least two hook points for some devices (e.g. i40e 
in Yan's original example): registering a migration region and intercepting 
guest writes to specific registers. [PATCH 4/9] demonstrates the 
former but not the latter (which was allowed in v1). I saw some concerns 
about exposing struct vfio_pci_device outside of vfio-pci-core in v1,
which should be the main reason driving this change. But I'm still 
curious to know how we plan to address this requirement (allowing a 
vendor driver to tweak specific vfio_device_ops) moving forward.
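
For reference, the kind of hook I have in mind is a vendor module supplying
its own vfio_device_ops that wraps the core handlers, e.g. to intercept
guest register writes. A hypothetical sketch, assuming the core exports its
default handlers as this series proposes:

/* vendor interception of writes, falling through to the core handler */
static ssize_t mlx5_vfio_pci_write(void *device_data,
				   const char __user *buf,
				   size_t count, loff_t *ppos)
{
	/* ... inspect/virtualize migration-sensitive registers here ... */
	return vfio_pci_core_write(device_data, buf, count, ppos);
}

static const struct vfio_device_ops mlx5_vfio_pci_ops = {
	.name	= "mlx5-vfio-pci",
	.write	= mlx5_vfio_pci_write,	/* vendor hook */
	.read	= vfio_pci_core_read,	/* core defaults for the rest */
	.ioctl	= vfio_pci_core_ioctl,
	.mmap	= vfio_pci_core_mmap,
};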

Then another question. Once we have this framework in place, do we 
mandate this approach for any vendor specific tweak, or do we still allow
doing it as vfio_pci_core extensions (such as igd and zdev in this series)? 
If the latter, what are the criteria for judging which way is desired? Also, 
what about the scenarios where we just want one-time vendor information, 
e.g. to tell whether a device can tolerate arbitrary I/O page faults [1] or
the offset in VF PCI config space to put the PASID/ATS/PRI capabilities [2]?
Do we expect to create a module for each device to provide such info?
Having those questions answered is helpful for better understanding of
this proposal IMO. 

[1] 
https://lore.kernel.org/kvm/d4c51504-24ed-2592-37b4-f390b97fd...@huawei.com/T/
[2] https://lore.kernel.org/kvm/20200407095801.648b1...@w520.home/

> 
> Below we can see the proposed changes (this patchset only deals with
> VFIO_PCI subsystem but it can easily be extended to VFIO_MDEV subsystem
> as well):
> 
> +--+
> |  |
> |   VFIO   |
> |  |
> +--+
> 
> +--++--+
> |  ||  |
> |VFIO_PCI_CORE || VFIO_MDEV_CORE   |
> |  ||  |
> +--++--+
> 
> +--+ +-++-+ +--+
> |  | | || | |  |
> |  | | || | |  |
> |   VFIO_PCI   | |MLX5_VFIO_PCI||  VFIO_MDEV  | |MLX5_VFIO_MDEV|
> |  | | || | |  |
> |  | | || | |  |
> +--+ +-++-+ +--+
> 

The VFIO_PCI part is clear, but I didn't get how this concept is applied
to VFIO_MDEV. The mdev sub-system looks like below in my mind:

+--+
|  |
|   VFIO   |
|  |
+--+

+--++--+
|  ||  |
|VFIO_PCI_CORE || VFIO_MDEV|
|  ||  |
+--++--+

+--+ +-+

RE: [RFC PATCH v1 0/4] vfio: Add IOPF support for VFIO passthrough

2021-02-07 Thread Tian, Kevin
> From: Jean-Philippe Brucker 
> Sent: Friday, February 5, 2021 6:37 PM
> 
> Hi,
> 
> On Thu, Feb 04, 2021 at 06:52:10AM +, Tian, Kevin wrote:
> > > >>> The static pinning and mapping problem in VFIO and possible
> solutions
> > > >>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
> > > >>> page fault support for VFIO devices. Different from those relatively
> > > >>> complicated software approaches such as presenting a vIOMMU that
> > > >> provides
> > > >>> the DMA buffer information (might include para-virtualized
> > > optimizations),
> 
> I'm curious about the performance difference between this and the
> map/unmap vIOMMU, as well as the coIOMMU. This is probably a lot faster
> but those don't depend on IOPF which is a pretty rare feature at the
> moment.
> 
> [...]
> > > > In reality, many
> > > > devices allow I/O faulting only in selective contexts. However, there
> > > > is no standard way (e.g. PCISIG) for the device to report whether
> > > > arbitrary I/O fault is allowed. Then we may have to maintain device
> > > > specific knowledge in software, e.g. in an opt-in table to list devices
> > > > which allows arbitrary faults. For devices which only support selective
> > > > faulting, a mediator (either through vendor extensions on vfio-pci-core
> > > > or a mdev wrapper) might be necessary to help lock down non-
> faultable
> > > > mappings and then enable faulting on the rest mappings.
> > >
> > > For devices which only support selective faulting, they could tell it to 
> > > the
> > > IOMMU driver and let it filter out non-faultable faults? Do I get it 
> > > wrong?
> >
> > Not exactly to IOMMU driver. There is already a vfio_pin_pages() for
> > selectively page-pinning. The matter is that 'they' imply some device
> > specific logic to decide which pages must be pinned and such knowledge
> > is outside of VFIO.
> >
> > From enabling p.o.v we could possibly do it in phased approach. First
> > handles devices which tolerate arbitrary DMA faults, and then extends
> > to devices with selective-faulting. The former is simpler, but with one
> > main open whether we want to maintain such device IDs in a static
> > table in VFIO or rely on some hints from other components (e.g. PF
> > driver in VF assignment case). Let's see how Alex thinks about it.
> 
> Do you think selective-faulting will be the norm, or only a problem for
> initial IOPF implementations?  To me it's the selective-faulting kind of
> device that will be the odd one out, but that's pure speculation. Either
> way maintaining a device list seems like a pain.

I would think it's the norm for quite some time (e.g. multiple years), as from
what I learned, turning a complex accelerator into an implementation that 
tolerates arbitrary DMA faults is very complex (in every critical path) and
not cost effective (tracking in-flight requests). It might be OK for some 
purposely-built devices in specific usages, but for most it has to be an 
evolving path toward the 100%-faultable goal...

> 
> [...]
> > Yes, it's in plan but just not happened yet. We are still focusing on guest
> > SVA part thus only the 1st-level page fault (+Yi/Jacob). It's always
> welcomed
> > to collaborate/help if you have time. 
> 
> By the way the current fault report API is missing a way to invalidate
> partial faults: when the IOMMU device's PRI queue overflows, it may
> auto-respond to page request groups that were already partially reported
> by the IOMMU driver. Upon detecting an overflow, the IOMMU driver needs
> to
> tell all fault consumers to discard their partial groups.
> iopf_queue_discard_partial() [1] does this for the internal IOPF handler
> but we have nothing for the lower-level fault handler at the moment. And
> it gets more complicated when injecting IOPFs to guests, we'd need a
> mechanism to recall partial groups all the way through kernel->userspace
> and userspace->guest.

I don't know how to recall partial groups through emulated vIOMMUs
(at least for virtual VT-d). Possibly it could be supported by virtio-iommu.
But in any case I consider it more of an optimization than a functional
requirement (and it could be avoided with Shenming's suggestion below).

> 
> Shenming suggests [2] to also use the IOPF handler for IOPFs managed by
> device drivers. It's worth considering in my opinion because we could hold
> partial groups within the kernel and only report full groups to device
> drivers (and guests). In addition we'd consolidate tracking of IOPFs,
> sin

RE: [RFC PATCH v1 0/4] vfio: Add IOPF support for VFIO passthrough

2021-02-03 Thread Tian, Kevin
> From: Shenming Lu 
> Sent: Tuesday, February 2, 2021 2:42 PM
> 
> On 2021/2/1 15:56, Tian, Kevin wrote:
> >> From: Alex Williamson 
> >> Sent: Saturday, January 30, 2021 6:58 AM
> >>
> >> On Mon, 25 Jan 2021 17:03:58 +0800
> >> Shenming Lu  wrote:
> >>
> >>> Hi,
> >>>
> >>> The static pinning and mapping problem in VFIO and possible solutions
> >>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
> >>> page fault support for VFIO devices. Different from those relatively
> >>> complicated software approaches such as presenting a vIOMMU that
> >> provides
> >>> the DMA buffer information (might include para-virtualized
> optimizations),
> >>> IOPF mainly depends on the hardware faulting capability, such as the
> PCIe
> >>> PRI extension or Arm SMMU stall model. What's more, the IOPF support
> in
> >>> the IOMMU driver is being implemented in SVA [3]. So do we consider to
> >>> add IOPF support for VFIO passthrough based on the IOPF part of SVA at
> >>> present?
> >>>
> >>> We have implemented a basic demo only for one stage of translation
> (GPA
> >>> -> HPA in virtualization, note that it can be configured at either stage),
> >>> and tested on Hisilicon Kunpeng920 board. The nested mode is more
> >> complicated
> >>> since VFIO only handles the second stage page faults (same as the non-
> >> nested
> >>> case), while the first stage page faults need to be further delivered to
> >>> the guest, which is being implemented in [4] on ARM. My thought on this
> >>> is to report the page faults to VFIO regardless of the occured stage (try
> >>> to carry the stage information), and handle respectively according to the
> >>> configured mode in VFIO. Or the IOMMU driver might evolve to support
> >> more...
> >>>
> >>> Might TODO:
> >>>  - Optimize the faulting path, and measure the performance (it might still
> >>>be a big issue).
> >>>  - Add support for PRI.
> >>>  - Add a MMU notifier to avoid pinning.
> >>>  - Add support for the nested mode.
> >>> ...
> >>>
> >>> Any comments and suggestions are very welcome. :-)
> >>
> >> I expect performance to be pretty bad here, the lookup involved per
> >> fault is excessive.  There are cases where a user is not going to be
> >> willing to have a slow ramp up of performance for their devices as they
> >> fault in pages, so we might need to considering making this
> >> configurable through the vfio interface.  Our page mapping also only
> >
> > There is another factor to be considered. The presence of IOMMU_
> > DEV_FEAT_IOPF just indicates the device capability of triggering I/O
> > page fault through the IOMMU, but not exactly means that the device
> > can tolerate I/O page fault for arbitrary DMA requests.
> 
> Yes, so I add a iopf_enabled field in VFIO to indicate the whole path faulting
> capability and set it to true after registering a VFIO page fault handler.
> 
> > In reality, many
> > devices allow I/O faulting only in selective contexts. However, there
> > is no standard way (e.g. PCISIG) for the device to report whether
> > arbitrary I/O fault is allowed. Then we may have to maintain device
> > specific knowledge in software, e.g. in an opt-in table to list devices
> > which allows arbitrary faults. For devices which only support selective
> > faulting, a mediator (either through vendor extensions on vfio-pci-core
> > or a mdev wrapper) might be necessary to help lock down non-faultable
> > mappings and then enable faulting on the rest mappings.
> 
> For devices which only support selective faulting, they could tell it to the
> IOMMU driver and let it filter out non-faultable faults? Do I get it wrong?

Not exactly to the IOMMU driver. There is already a vfio_pin_pages() for
selective page-pinning. The matter is that 'they' imply some device
specific logic to decide which pages must be pinned, and such knowledge
is outside of VFIO.

From an enabling p.o.v we could possibly do it in a phased approach: first 
handle devices which tolerate arbitrary DMA faults, and then extend
to devices with selective-faulting. The former is simpler, but with one
main open question: whether we want to maintain such device IDs in a static
table in VFIO or rely on hints from other components (e.g. the PF
driver in the VF assignment case). Let's see how Alex thinks about it.

> 
> >
> >> grows here, should map

RE: [PATCH v2 3/3] iommu/vt-d: Apply SATC policy

2021-02-03 Thread Tian, Kevin
> From: Lu Baolu
> Sent: Wednesday, February 3, 2021 5:33 PM
> 
> From: Yian Chen 
> 
> Starting from Intel VT-d v3.2, Intel platform BIOS can provide a new SATC
> table structure. SATC table lists a set of SoC integrated devices that
> require ATC to work (VT-d specification v3.2, section 8.8). Furthermore,

This statement is not accurate. The purpose of SATC is to tell whether a
SoC integrated device has been validated to meet the isolation requirements 
of using a device TLB. All devices listed in SATC can have ATC safely enabled by 
the OS. In addition, there is a flag for each listed device indicating whether 
ATC is a functional requirement. However, the above description only captures 
the last point.

> the new version of IOMMU supports SoC device ATS in both its Scalable
> mode
> and legacy mode.
> 
> When IOMMU is working in scalable mode, software must enable device ATS
> support. 

"must enable" is misleading here. You need describe the policies for three
categories:

- SATC devices with ATC_REQUIRED=1
- SATC devices with ATC_REQUIRED=0
- devices not listed in SATC, or when SATC is missing
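
An illustrative sketch of such a per-category policy (device_atc_required()
and sm_supported() exist in this patch/driver; device_listed_in_satc() and
legacy_atsr_policy() are hypothetical placeholders):

	if (device_atc_required(pdev)) {
		/* ATC_REQUIRED=1: in legacy mode hardware-managed ATS
		 * takes over, so software enables ATS only in scalable
		 * mode */
		info->ats_supported = sm_supported(iommu);
	} else if (device_listed_in_satc(pdev)) {
		/* ATC_REQUIRED=0: validated for isolation, safe to enable */
		info->ats_supported = 1;
	} else {
		/* not listed / no SATC: fall back to existing ATSR policy */
		info->ats_supported = legacy_atsr_policy(pdev);
	}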

> On the other hand, when IOMMU is in legacy mode for whatever
> reason, the hardware managed ATS will automatically take effect and the
> SATC required devices can work transparently to the software. As the

No background is given about hardware-managed ATS. 

> result, software shouldn't enable ATS on that device, otherwise duplicate
> device TLB invalidations will occur.

This description equates legacy mode with hardware-managed ATS. Do we
care about the scenario where there is no hardware-managed ATS but people
still want to turn on the ATC in legacy mode?

Thanks
Kevin

> 
> Signed-off-by: Yian Chen 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/intel/iommu.c | 73
> +++--
>  1 file changed, 69 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index ee0932307d64..3e30c340e6a9 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -872,6 +872,60 @@ static bool iommu_is_dummy(struct intel_iommu
> *iommu, struct device *dev)
>   return false;
>  }
> 
> +static bool iommu_support_ats(struct intel_iommu *iommu)
> +{
> + return ecap_dev_iotlb_support(iommu->ecap);
> +}
> +
> +static bool device_support_ats(struct pci_dev *dev)
> +{
> + return pci_ats_supported(dev) &&
> dmar_find_matched_atsr_unit(dev);
> +}
> +
> +static int segment_atc_required(u16 segment)
> +{
> + struct acpi_dmar_satc *satc;
> + struct dmar_satc_unit *satcu;
> + int ret = 1;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(satcu, _satc_units, list) {
> + satc = container_of(satcu->hdr, struct acpi_dmar_satc,
> header);
> + if (satcu->atc_required && satcu->devices_cnt &&
> + satc->segment == segment)
> + goto out;
> + }
> + ret = 0;
> +out:
> + rcu_read_unlock();
> + return ret;
> +}
> +
> +static int device_atc_required(struct pci_dev *dev)
> +{
> + struct dmar_satc_unit *satcu;
> + struct acpi_dmar_satc *satc;
> + struct device *tmp;
> + int i, ret = 1;
> +
> + dev = pci_physfn(dev);
> + rcu_read_lock();
> + list_for_each_entry_rcu(satcu, _satc_units, list) {
> + satc = container_of(satcu->hdr, struct acpi_dmar_satc,
> header);
> + if (!satcu->atc_required ||
> + satc->segment != pci_domain_nr(dev->bus))
> + continue;
> +
> + for_each_dev_scope(satcu->devices, satcu->devices_cnt, i,
> tmp)
> + if (to_pci_dev(tmp) == dev)
> + goto out;
> + }
> + ret = 0;
> +out:
> + rcu_read_unlock();
> + return ret;
> +}
> +
>  struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8
> *devfn)
>  {
>   struct dmar_drhd_unit *drhd = NULL;
> @@ -2555,10 +2609,16 @@ static struct dmar_domain
> *dmar_insert_one_dev_info(struct intel_iommu *iommu,
>   if (dev && dev_is_pci(dev)) {
>   struct pci_dev *pdev = to_pci_dev(info->dev);
> 
> - if (ecap_dev_iotlb_support(iommu->ecap) &&
> - pci_ats_supported(pdev) &&
> - dmar_find_matched_atsr_unit(pdev))
> - info->ats_supported = 1;
> + /*
> +  * Support ATS by default if it's supported by both IOMMU
> and
> +  * client sides, except that the device's ATS is required by
> +  * ACPI/SATC but the IOMMU scalable mode is disabled. In
> that
> +  * case the hardware managed ATS will be automatically used.
> +  */
> + if (iommu_support_ats(iommu) &&
> device_support_ats(pdev)) {
> + if (!device_atc_required(pdev) ||
> sm_supported(iommu))
> + info->ats_supported = 1;
> + }
> 
>   if 

RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device

2021-02-01 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, February 2, 2021 7:44 AM
> 
> On Fri, Jan 29, 2021 at 10:09:03AM +, Tian, Kevin wrote:
> > > SVA is not doom to work with IO page fault only. If we have SVA+pin,
> > > we would get both sharing address and stable I/O latency.
> >
> > Isn't it like a traditional MAP_DMA API (imply pinning) plus specifying
> > cpu_va of the memory pool as the iova?
> 
> I think their issue is the HW can't do the cpu_va trick without also
> involving the system IOMMU in a SVA mode
> 

This is the part that I didn't understand. Using the cpu_va in a MAP_DMA
interface doesn't require device support. It's just a user-specified
address to be mapped into the IOMMU page table. On the other hand,
sharing the CPU page table through an SVA interface for a usage where I/O 
page faults must be completely avoided seems a misleading attempt. 
Even if people do want this model (e.g. mixing pinning+fault), it should be
an mm syscall as Greg pointed out, not specific to sva.
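
Concretely, with the existing type1 API the user can simply pass its own
CPU VA as the IOVA (userspace sketch against the current uapi):

#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* map a user buffer so that IOVA == CPU VA; type1 pins the pages */
static int map_cpu_va(int container_fd, void *buf, size_t len)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (uintptr_t)buf,
		.iova  = (uintptr_t)buf,	/* the "cpu_va trick" */
		.size  = len,
	};

	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}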

Thanks
Kevin


RE: [RFC PATCH v1 0/4] vfio: Add IOPF support for VFIO passthrough

2021-02-01 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Saturday, January 30, 2021 6:58 AM
> 
> On Mon, 25 Jan 2021 17:03:58 +0800
> Shenming Lu  wrote:
> 
> > Hi,
> >
> > The static pinning and mapping problem in VFIO and possible solutions
> > have been discussed a lot [1, 2]. One of the solutions is to add I/O
> > page fault support for VFIO devices. Different from those relatively
> > complicated software approaches such as presenting a vIOMMU that
> provides
> > the DMA buffer information (might include para-virtualized optimizations),
> > IOPF mainly depends on the hardware faulting capability, such as the PCIe
> > PRI extension or Arm SMMU stall model. What's more, the IOPF support in
> > the IOMMU driver is being implemented in SVA [3]. So do we consider to
> > add IOPF support for VFIO passthrough based on the IOPF part of SVA at
> > present?
> >
> > We have implemented a basic demo only for one stage of translation (GPA
> > -> HPA in virtualization, note that it can be configured at either stage),
> > and tested on Hisilicon Kunpeng920 board. The nested mode is more
> complicated
> > since VFIO only handles the second stage page faults (same as the non-
> nested
> > case), while the first stage page faults need to be further delivered to
> > the guest, which is being implemented in [4] on ARM. My thought on this
> > is to report the page faults to VFIO regardless of the occured stage (try
> > to carry the stage information), and handle respectively according to the
> > configured mode in VFIO. Or the IOMMU driver might evolve to support
> more...
> >
> > Might TODO:
> >  - Optimize the faulting path, and measure the performance (it might still
> >be a big issue).
> >  - Add support for PRI.
> >  - Add a MMU notifier to avoid pinning.
> >  - Add support for the nested mode.
> > ...
> >
> > Any comments and suggestions are very welcome. :-)
> 
> I expect performance to be pretty bad here, the lookup involved per
> fault is excessive.  There are cases where a user is not going to be
> willing to have a slow ramp up of performance for their devices as they
> fault in pages, so we might need to considering making this
> configurable through the vfio interface.  Our page mapping also only

There is another factor to be considered. The presence of
IOMMU_DEV_FEAT_IOPF just indicates the device capability of triggering I/O 
page faults through the IOMMU, but not exactly that the device 
can tolerate I/O page faults for arbitrary DMA requests. In reality, many 
devices allow I/O faulting only in selective contexts. However, there
is no standard way (e.g. PCISIG) for the device to report whether 
arbitrary I/O faults are allowed. Then we may have to maintain device
specific knowledge in software, e.g. in an opt-in table listing devices
which allow arbitrary faults. For devices which only support selective 
faulting, a mediator (either through vendor extensions on vfio-pci-core
or a mdev wrapper) might be necessary to help lock down non-faultable 
mappings and then enable faulting on the remaining mappings.

> grows here, should mappings expire or do we need a least recently
> mapped tracker to avoid exceeding the user's locked memory limit?  How
> does a user know what to set for a locked memory limit?  The behavior
> here would lead to cases where an idle system might be ok, but as soon
> as load increases with more inflight DMA, we start seeing
> "unpredictable" I/O faults from the user perspective.  Seems like there
> are lots of outstanding considerations and I'd also like to hear from
> the SVA folks about how this meshes with their work.  Thanks,
> 

The main overlap between this feature and SVA is the IOPF reporting
framework, which currently still has a gap in supporting both in nested
mode, as discussed here:

https://lore.kernel.org/linux-acpi/YAaxjmJW+ZMvrhac@myrica/

Once that gap is resolved in the future, the VFIO fault handler just 
adopts different actions according to the fault level: 1st-level faults
are forwarded to userspace through the vSVA path while 2nd-level faults
are fixed (or warned about if not intended) by VFIO itself through the IOMMU
mapping interface.

Thanks
Kevin


RE: [PATCH 1/3] iommu/vt-d: Add rate limited information when PRQ overflows

2021-01-25 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Monday, January 25, 2021 2:29 PM
> 
> Hi Kevin,
> 
> On 2021/1/22 14:38, Tian, Kevin wrote:
> >> From: Lu Baolu 
> >> Sent: Thursday, January 21, 2021 9:45 AM
> >>
> >> So that the users could get a chance to know what happened.
> >>
> >> Suggested-by: Ashok Raj 
> >> Signed-off-by: Lu Baolu 
> >> ---
> >>   drivers/iommu/intel/svm.c | 10 --
> >>   1 file changed, 8 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> >> index 033b25886e57..f49fe715477b 100644
> >> --- a/drivers/iommu/intel/svm.c
> >> +++ b/drivers/iommu/intel/svm.c
> >> @@ -895,6 +895,7 @@ static irqreturn_t prq_event_thread(int irq, void
> *d)
> >>struct intel_iommu *iommu = d;
> >>struct intel_svm *svm = NULL;
> >>int head, tail, handled = 0;
> >> +  struct page_req_dsc *req;
> >>
> >>/* Clear PPR bit before reading head/tail registers, to
> >> * ensure that we get a new interrupt if needed. */
> >> @@ -904,7 +905,6 @@ static irqreturn_t prq_event_thread(int irq, void
> *d)
> >>head = dmar_readq(iommu->reg + DMAR_PQH_REG) &
> >> PRQ_RING_MASK;
> >>while (head != tail) {
> >>struct vm_area_struct *vma;
> >> -  struct page_req_dsc *req;
> >>struct qi_desc resp;
> >>int result;
> >>vm_fault_t ret;
> >> @@ -1042,8 +1042,14 @@ static irqreturn_t prq_event_thread(int irq,
> void
> >> *d)
> >> * Clear the page request overflow bit and wake up all threads that
> >> * are waiting for the completion of this handling.
> >> */
> >> -  if (readl(iommu->reg + DMAR_PRS_REG) & DMA_PRS_PRO)
> >> +  if (readl(iommu->reg + DMAR_PRS_REG) & DMA_PRS_PRO) {
> >> +  head = dmar_readq(iommu->reg + DMAR_PQH_REG) &
> >> PRQ_RING_MASK;
> >> +  req = >prq[head / sizeof(*req)];
> >> +  pr_warn_ratelimited("IOMMU: %s: Page request overflow:
> >> HEAD: %08llx %08llx",
> >> +  iommu->name, ((unsigned long long
> >> *)req)[0],
> >> +  ((unsigned long long *)req)[1]);
> >>writel(DMA_PRS_PRO, iommu->reg + DMAR_PRS_REG);
> >> +  }
> >>
> >
> > Not about rate limiting but I think we may have a problem in above
> > logic. It is incorrect to always clear PRO when it's set w/o first checking
> > whether the overflow condition has been cleared. This code assumes
> > that if an overflow condition occurs it must have been cleared by earlier
> > loop when hitting this check. However since this code runs in a threaded
> > context, the overflow condition could occur even after you reset the head
> > to the tail (under some extreme condition). To be sane I think we'd better
> > read both head/tail again after seeing a pending PRO here and only clear
> > PRO when it becomes a false indicator based on latest head/tail.
> >
> 
> Yes, agreed. We can check the head and tail and clear the overflow bit
> until the queue is empty. The finial code looks like:
> 
>  /*
>   * Clear the page request overflow bit and wake up all threads that
>   * are waiting for the completion of this handling.
>   */
>  if (readl(iommu->reg + DMAR_PRS_REG) & DMA_PRS_PRO) {
>  head = dmar_readq(iommu->reg + DMAR_PQH_REG) &
> PRQ_RING_MASK;
>  tail = dmar_readq(iommu->reg + DMAR_PQT_REG) &
> PRQ_RING_MASK;
>  if (head == tail) {
>  req = >prq[head / sizeof(*req)];
>  pr_warn_ratelimited("IOMMU: %s: Page request
> overflow cleared: HEAD: %08llx %08llx",
>  iommu->name, ((unsigned
> long long *)req)[0],
>  ((unsigned long long
> *)req)[1]);
>  writel(DMA_PRS_PRO, iommu->reg + DMAR_PRS_REG);
>  }
>  }
> 
> Thought?
> 

Just a small comment. Is it useful to also print a warning in the true
overflow condition, which has to wait for the next interrupt to be cleared?
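
e.g. an else branch along these lines (just a sketch):

	} else {
		/*
		 * True overflow: the queue refilled after the head was
		 * reset, so leave PRO set (it is cleared on the next PRQ
		 * interrupt) and just report the condition.
		 */
		pr_warn_ratelimited("IOMMU: %s: Page request overflow still pending\n",
				    iommu->name);
	}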

Thanks
Kevin


RE: [PATCH 1/3] iommu/vt-d: Add rate limited information when PRQ overflows

2021-01-21 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Thursday, January 21, 2021 9:45 AM
> 
> So that the users could get a chance to know what happened.
> 
> Suggested-by: Ashok Raj 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/intel/svm.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index 033b25886e57..f49fe715477b 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -895,6 +895,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
>   struct intel_iommu *iommu = d;
>   struct intel_svm *svm = NULL;
>   int head, tail, handled = 0;
> + struct page_req_dsc *req;
> 
>   /* Clear PPR bit before reading head/tail registers, to
>* ensure that we get a new interrupt if needed. */
> @@ -904,7 +905,6 @@ static irqreturn_t prq_event_thread(int irq, void *d)
>   head = dmar_readq(iommu->reg + DMAR_PQH_REG) &
> PRQ_RING_MASK;
>   while (head != tail) {
>   struct vm_area_struct *vma;
> - struct page_req_dsc *req;
>   struct qi_desc resp;
>   int result;
>   vm_fault_t ret;
> @@ -1042,8 +1042,14 @@ static irqreturn_t prq_event_thread(int irq, void
> *d)
>* Clear the page request overflow bit and wake up all threads that
>* are waiting for the completion of this handling.
>*/
> - if (readl(iommu->reg + DMAR_PRS_REG) & DMA_PRS_PRO)
> + if (readl(iommu->reg + DMAR_PRS_REG) & DMA_PRS_PRO) {
> + head = dmar_readq(iommu->reg + DMAR_PQH_REG) &
> PRQ_RING_MASK;
> + req = >prq[head / sizeof(*req)];
> + pr_warn_ratelimited("IOMMU: %s: Page request overflow:
> HEAD: %08llx %08llx",
> + iommu->name, ((unsigned long long
> *)req)[0],
> + ((unsigned long long *)req)[1]);
>   writel(DMA_PRS_PRO, iommu->reg + DMAR_PRS_REG);
> + }
> 

Not about rate limiting, but I think we may have a problem in the above
logic. It is incorrect to always clear PRO when it's set without first checking
whether the overflow condition has been cleared. This code assumes
that if an overflow condition occurred it must have been cleared by the earlier
loop when hitting this check. However, since this code runs in a threaded 
context, the overflow condition could occur even after you reset the head 
to the tail (under some extreme condition). To be safe I think we'd better
read both head/tail again after seeing a pending PRO here, and only clear 
PRO when it becomes a false indicator based on the latest head/tail.

Thanks
Kevin


RE: [RFC PATCH v3 2/2] platform-msi: Add platform check for subdevice irq domain

2021-01-13 Thread Tian, Kevin
> From: Lu Baolu
> Sent: Thursday, January 14, 2021 9:30 AM
> 
> The pci_subdevice_msi_create_irq_domain() should fail if the underlying
> platform is not able to support IMS (Interrupt Message Storage). Otherwise,
> the isolation of interrupt is not guaranteed.
> 
> For x86, IMS is only supported on bare metal for now. We could enable it
> in the virtualization environments in the future if interrupt HYPERCALL
> domain is supported or the hardware has the capability of interrupt
> isolation for subdevices.
> 
> Cc: David Woodhouse 
> Cc: Leon Romanovsky 
> Cc: Kevin Tian 
> Suggested-by: Thomas Gleixner 
> Link: https://lore.kernel.org/linux-
> pci/87pn4nk7nn@nanos.tec.linutronix.de/
> Link: https://lore.kernel.org/linux-
> pci/877dqrnzr3@nanos.tec.linutronix.de/
> Link: https://lore.kernel.org/linux-
> pci/877dqqmc2h@nanos.tec.linutronix.de/
> Signed-off-by: Lu Baolu 
> ---
>  arch/x86/pci/common.c   | 71
> +
>  drivers/base/platform-msi.c |  8 +
>  include/linux/msi.h |  1 +
>  3 files changed, 80 insertions(+)
> 
> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
> index 3507f456fcd0..9deb826fb242 100644
> --- a/arch/x86/pci/common.c
> +++ b/arch/x86/pci/common.c
> @@ -12,6 +12,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #include 
>  #include 
> @@ -724,3 +725,73 @@ struct pci_dev *pci_real_dma_dev(struct pci_dev
> *dev)
>   return dev;
>  }
>  #endif
> +
> +/*
> + * We want to figure out which context we are running in. But the hardware
> + * does not introduce a reliable way (instruction, CPUID leaf, MSR, whatever)
> + * which can be manipulated by the VMM to let the OS figure out where it runs.
> + * So we go with the below probably_on_bare_metal() function as a replacement
> + * for definitely_on_bare_metal() to go forward only for the very simple reason
> + * that this is the only option we have.
> + */
> +static const char * const vmm_vendor_name[] = {
> + "QEMU", "Bochs", "KVM", "Xen", "VMware", "VMW", "VMware Inc.",
> + "innotek GmbH", "Oracle Corporation", "Parallels", "BHYVE"
> +};
> +
> +static void read_type0_virtual_machine(const struct dmi_header *dm, void *p)
> +{
> + u8 *data = (u8 *)dm + 0x13;
> +
> + /* BIOS Information (Type 0) */
> + if (dm->type != 0 || dm->length < 0x14)
> + return;
> +
> + /* Bit 4 of BIOS Characteristics Extension Byte 2 */
> + if (*data & BIT(4))
> + *((bool *)p) = true;
> +}
> +
> +static bool smbios_virtual_machine(void)
> +{
> + bool bit_present = false;
> +
> + dmi_walk(read_type0_virtual_machine, &bit_present);
> +
> + return bit_present;
> +}
> +
> +static bool on_bare_metal(struct device *dev)
> +{
> + int i;
> +
> + if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
> + return false;
> +
> + if (smbios_virtual_machine())
> + return false;
> +
> + if (iommu_capable(dev->bus, IOMMU_CAP_VIOMMU))
> + return false;
> +
> + for (i = 0; i < ARRAY_SIZE(vmm_vendor_name); i++)
> + if (dmi_match(DMI_SYS_VENDOR, vmm_vendor_name[i]))
> + return false;

Thinking more about it, I wonder whether this check is actually useful here.
As Leon and David commented, the same vendor name can be used for both VM
and bare metal instances, which implies that either could be misinterpreted
by this check. That runs against the original intent: find heuristics that
indicate a VM environment, and tolerate misinterpreting a VM as bare metal
in corner cases (but not vice versa).

Thomas?

Thanks
Kevin



RE: [RFC PATCH v2 1/1] platform-msi: Add platform check for subdevice irq domain

2021-01-11 Thread Tian, Kevin
> From: Leon Romanovsky 
> Sent: Tuesday, January 12, 2021 1:53 PM
> 
> On Tue, Jan 12, 2021 at 01:17:11PM +0800, Lu Baolu wrote:
> > Hi,
> >
> > On 1/7/21 3:16 PM, Leon Romanovsky wrote:
> > > > On Thu, Jan 07, 2021 at 06:55:16AM +0000, Tian, Kevin wrote:
> > > > > From: Leon Romanovsky 
> > > > > Sent: Thursday, January 7, 2021 2:09 PM
> > > > >
> > > > > On Thu, Jan 07, 2021 at 02:04:29AM +, Tian, Kevin wrote:
> > > > > > > From: Leon Romanovsky 
> > > > > > > Sent: Thursday, January 7, 2021 12:02 AM
> > > > > > >
> > > > > > > On Wed, Jan 06, 2021 at 11:23:39AM -0400, Jason Gunthorpe
> wrote:
> > > > > > > > On Wed, Jan 06, 2021 at 12:40:17PM +0200, Leon Romanovsky
> wrote:
> > > > > > > >
> > > > > > > > > I asked what will you do when QEMU will gain needed
> functionality?
> > > > > > > > > Will you remove QEMU from this list? If yes, how such "new"
> kernel
> > > > > will
> > > > > > > > > work on old QEMU versions?
> > > > > > > >
> > > > > > > > The needed functionality is some VMM hypercall, so presumably
> new
> > > > > > > > kernels that support calling this hypercall will be able to 
> > > > > > > > discover
> > > > > > > > if the VMM hypercall exists and if so superceed this entire 
> > > > > > > > check.
> > > > > > >
> > > > > > > Let's not speculate, do we have well-known path?
> > > > > > > Will such patch be taken to stable@/distros?
> > > > > > >
> > > > > >
> > > > > > There are two functions introduced in this patch. One is to detect
> whether
> > > > > > running on bare metal or in a virtual machine. The other is for
> deciding
> > > > > > whether the platform supports ims. Currently the two are identical
> because
> > > > > > ims is supported only on bare metal at current stage. In the future 
> > > > > > it
> will
> > > > > look
> > > > > > like below when ims can be enabled in a VM:
> > > > > >
> > > > > > bool arch_support_pci_device_ims(struct pci_dev *pdev)
> > > > > > {
> > > > > > return on_bare_metal() ||
> hypercall_irq_domain_supported();
> > > > > > }
> > > > > >
> > > > > > The VMM vendor list is for on_bare_metal, and suppose a vendor
> will
> > > > > > never be removed once being added to the list since the fact of
> running
> > > > > > in a VM never changes, regardless of whether this hypervisor
> supports
> > > > > > extra VMM hypercalls.
> > > > >
> > > > > This is what I imagined, this list will be forever, and this worries 
> > > > > me.
> > > > >
> > > > > I don't know if it is true or not, but guess that at least Oracle and
> > > > > Microsoft bare metal devices and VMs will have same
> DMI_SYS_VENDOR.
> > > >
> > > > It's true. David Woodhouse also said it's the case for Amazon EC2
> instances.
> > > >
> > > > >
> > > > > It means that this on_bare_metal() function won't work reliably in
> many
> > > > > cases. Also being part of include/linux/msi.h, at some point of time,
> > > > > this function will be picked by the users outside for the non-IMS 
> > > > > cases.
> > > > >
> > > > > I didn't even mention custom forks of QEMU which are prohibited to
> change
> > > > > DMI_SYS_VENDOR and private clouds with custom solutions.
> > > >
> > > > In this case the private QEMU forks are encouraged to set CPUID (X86_
> > > > FEATURE_HYPERVISOR) if they do plan to adopt a different vendor
> name.
> > >
> > > Does QEMU set this bit when it runs in host-passthrough CPU model?
> > >
> > > >
> > > > >
> > > > > The current array makes DMI_SYS_VENDOR interface as some sort of
> ABI. If
> > > > > in the future,
> > > > > the QEMU will decide to use more hipster name, for example "qEmU",
> > > > > this function won't work.

RE: [RFC PATCH 1/1] platform-msi: Add platform check for subdevice irq domain

2021-01-06 Thread Tian, Kevin
> From: David Woodhouse 
> Sent: Thursday, December 10, 2020 4:23 PM
> 
> On Thu, 2020-12-10 at 08:46 +0800, Lu Baolu wrote:
> > +/*
> > + * We want to figure out which context we are running in. But the
> hardware
> > + * does not introduce a reliable way (instruction, CPUID leaf, MSR,
> whatever)
> > + * which can be manipulated by the VMM to let the OS figure out where it
> runs.
> > + * So we go with the below probably_on_bare_metal() function as a
> replacement
> > + * for definitely_on_bare_metal() to go forward only for the very simple
> reason
> > + * that this is the only option we have.
> > + */
> > +static const char * const possible_vmm_vendor_name[] = {
> > +   "QEMU", "Bochs", "KVM", "Xen", "VMware", "VMW", "VMware Inc.",
> > +   "innotek GmbH", "Oracle Corporation", "Parallels", "BHYVE",
> > +   "Microsoft Corporation"
> > +};
> 
> People do use SeaBIOS ("Bochs") on bare metal.
> 
> You'll also see "Amazon EC2" on virt instances as well as bare metal
> instances. Although in that case I believe the virt instances do have
> the 'virtual machine' flag set in bit 4 of the BIOS Characteristics
> Extension Byte 2, and the bare metal obviously don't.
> 

Do those virtual instances have the CPUID hypervisor bit set? If yes,
they can be differentiated from bare metal instances w/o checking
the vendor list.

btw do you know whether this 'virtual machine' flag is widely used
in virtualization environments? If yes, we should probably check this
flag even before checking DMI_SYS_VENDOR. It sounds more general...

Thanks
Kevin



RE: [RFC PATCH v2 1/1] platform-msi: Add platform check for subdevice irq domain

2021-01-06 Thread Tian, Kevin
> From: Leon Romanovsky 
> Sent: Thursday, January 7, 2021 2:09 PM
> 
> On Thu, Jan 07, 2021 at 02:04:29AM +0000, Tian, Kevin wrote:
> > > From: Leon Romanovsky 
> > > Sent: Thursday, January 7, 2021 12:02 AM
> > >
> > > On Wed, Jan 06, 2021 at 11:23:39AM -0400, Jason Gunthorpe wrote:
> > > > On Wed, Jan 06, 2021 at 12:40:17PM +0200, Leon Romanovsky wrote:
> > > >
> > > > > I asked what will you do when QEMU will gain needed functionality?
> > > > > Will you remove QEMU from this list? If yes, how such "new" kernel
> will
> > > > > work on old QEMU versions?
> > > >
> > > > The needed functionality is some VMM hypercall, so presumably new
> > > > kernels that support calling this hypercall will be able to discover
> > > > if the VMM hypercall exists and if so superceed this entire check.
> > >
> > > Let's not speculate, do we have well-known path?
> > > Will such patch be taken to stable@/distros?
> > >
> >
> > There are two functions introduced in this patch. One is to detect whether
> > running on bare metal or in a virtual machine. The other is for deciding
> > whether the platform supports ims. Currently the two are identical because
> > ims is supported only on bare metal at current stage. In the future it will
> look
> > like below when ims can be enabled in a VM:
> >
> > bool arch_support_pci_device_ims(struct pci_dev *pdev)
> > {
> > return on_bare_metal() || hypercall_irq_domain_supported();
> > }
> >
> > The VMM vendor list is for on_bare_metal, and suppose a vendor will
> > never be removed once being added to the list since the fact of running
> > in a VM never changes, regardless of whether this hypervisor supports
> > extra VMM hypercalls.
> 
> This is what I imagined, this list will be forever, and this worries me.
> 
> I don't know if it is true or not, but guess that at least Oracle and
> Microsoft bare metal devices and VMs will have same DMI_SYS_VENDOR.

It's true. David Woodhouse also said it's the case for Amazon EC2 instances.

> 
> It means that this on_bare_metal() function won't work reliably in many
> cases. Also being part of include/linux/msi.h, at some point of time,
> this function will be picked by the users outside for the non-IMS cases.
> 
> I didn't even mention custom forks of QEMU which are prohibited to change
> DMI_SYS_VENDOR and private clouds with custom solutions.

In this case the private QEMU forks are encouraged to set CPUID (X86_
FEATURE_HYPERVISOR) if they do plan to adopt a different vendor name.

> 
> The current array makes DMI_SYS_VENDOR interface as some sort of ABI. If
> in the future,
> the QEMU will decide to use more hipster name, for example "qEmU", this
> function
> won't work.
> 
> I'm aware that DMI_SYS_VENDOR is used heavily in the kernel code and
> various names for the same company are good example how not reliable it.
> 
> The most hilarious example is "Dell/Dell Inc./Dell Inc/Dell Computer
> Corporation/Dell Computer",
> but other companies are not far from them.
> 
> Luckily enough, this identification is used for hardware product that
> was released to the market and their name will be stable for that
> specific model. It is not the case here where we need to ensure future
> compatibility too (old kernel on new VM emulator).
> 
> I'm not in position to say yes or no to this patch and don't have plans to do 
> it.
> Just expressing my feeling that this solution is too hacky for my taste.
> 

I agree with your worries, and solely relying on DMI_SYS_VENDOR is
definitely too hacky. Per previous discussions with Thomas, there is no
elegant way to handle this situation; it has to be a heuristic approach.
First, we hope the CPUID bit is set properly in most cases, thus it is
checked first. Then other heuristics can cover the remaining cases.
DMI_SYS_VENDOR is the first such hint and more can be added later. For
example, when an IOMMU is present there are vendor-specific ways to detect
whether it's real or virtual. Dave also mentioned a BIOS flag indicating a
virtual machine. Now the real question here is probably whether people
are OK with the CPUID+DMI_SYS_VENDOR combo check for now (and growing
it later) or prefer having all identified heuristics in place
together...
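
In code form the combo check might look like below (a rough sketch only;
the vendor-list helper is hypothetical):

static bool probably_on_bare_metal(void)
{
	/* first hint: the software-convention CPUID bit */
	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
		return false;

	/* second hint: known VMM names in DMI_SYS_VENDOR */
	if (dmi_sys_vendor_is_known_vmm())	/* hypothetical helper */
		return false;

	/* more heuristics (SMBIOS VM flag, virtual IOMMU, ...) can grow here */
	return true;
}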

Thanks
Kevin


RE: [RFC PATCH v2 1/1] platform-msi: Add platform check for subdevice irq domain

2021-01-06 Thread Tian, Kevin
> From: Leon Romanovsky 
> Sent: Thursday, January 7, 2021 12:02 AM
> 
> On Wed, Jan 06, 2021 at 11:23:39AM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 06, 2021 at 12:40:17PM +0200, Leon Romanovsky wrote:
> >
> > > I asked what will you do when QEMU will gain needed functionality?
> > > Will you remove QEMU from this list? If yes, how such "new" kernel will
> > > work on old QEMU versions?
> >
> > The needed functionality is some VMM hypercall, so presumably new
> > kernels that support calling this hypercall will be able to discover
> > if the VMM hypercall exists and if so superceed this entire check.
> 
> Let's not speculate, do we have well-known path?
> Will such patch be taken to stable@/distros?
> 

There are two functions introduced in this patch. One is to detect whether
we are running on bare metal or in a virtual machine. The other is for
deciding whether the platform supports ims. Currently the two are identical
because ims is supported only on bare metal at the current stage. In the
future it will look like below when ims can be enabled in a VM:

bool arch_support_pci_device_ims(struct pci_dev *pdev)
{
	return on_bare_metal() || hypercall_irq_domain_supported();
}

The VMM vendor list is for on_bare_metal(), and we expect a vendor will
never be removed once added to the list, since the fact of running
in a VM never changes, regardless of whether a given hypervisor supports
extra VMM hypercalls. hypercall_irq_domain_supported() will detect in a
hypervisor-specific way whether ims can be enabled in a VM (returning
true only when a 'new' kernel runs on a 'new' hypervisor). This way no
backporting is required when running a 'new' kernel on an
'old' hypervisor.
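
A minimal sketch of that second function, assuming a hypothetical
per-hypervisor query (no hypervisor implements such an interface today):

static bool hypercall_irq_domain_supported(void)
{
	/* not in a VM at all, so nothing to ask */
	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
		return false;

	/*
	 * Hypothetical: query the detected hypervisor for an interrupt
	 * HYPERCALL domain. Until such an interface exists, report false.
	 */
	return false;
}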

Thanks
Kevin


RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-16 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, November 17, 2020 2:03 AM
> 
> On Mon, Nov 16, 2020 at 06:56:33PM +0100, Thomas Gleixner wrote:
> > On Mon, Nov 16 2020 at 11:46, Jason Gunthorpe wrote:
> >
> > > On Mon, Nov 16, 2020 at 07:31:49AM +0000, Tian, Kevin wrote:
> > >
> > >> > The subdevices require PASID & IOMMU in native, but inside the guest
> there
> > >> > is no
> > >> > need for IOMMU unless you want to build SVM on top. subdevices
> work
> > >> > without
> > >> > any vIOMMU or hypercall in the guest. Only because they look like
> normal
> > >> > PCI devices we could map interrupts to legacy MSIx.
> > >>
> > >> Guest managed subdevices on PF/VF requires vIOMMU.
> > >
> > > Why? I've never heard we need vIOMMU for our existing SRIOV flows in
> > > VMs??
> >
> > Handing PF/VF into the guest does not require it.
> >
> > But if the PF/VF driver in the guest wants to create and manage the
> > magic mdev subdevices which require PASID support then you surely need
> > it.
> 
> 'magic mdevs' are only one reason to use IMS in a guest. On mlx5 we
> might want to use IMS for VPDA devices. mlx5 can spawn a VDPA device
> in a guest, against a 'ADI', without ever requiring an IOMMU to do it.
> 
> We don't even need IOMMU in the hypervisor to create the ADI, mlx5 has
> an internal secure IOMMU that can be used instead of the platform
> IOMMU.
> 
> Not saying this is a major use case, or a reason not to link things to
> IOMMU detection, but lets be clear that a hard need for IOMMU is a
> another IDXD thing, not general.
> 

I should have used "may require" in the original post. And one thing that I
obviously mixed up is the requirement of PASID-granular interrupt isolation
in the physical IOMMU rather than the virtual IOMMU. But anyway, I didn't
attempt to use the above to build a hard need for the IOMMU; quite the
opposite when looking at all three cases together.

btw Jason/Thomas, what do you think about the proposal further down in this
thread (ims=[auto|on|off])? Does it sound like a good tradeoff to move forward?

Thanks
Kevin


RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-15 Thread Tian, Kevin
> From: Raj, Ashok 
> Sent: Monday, November 16, 2020 8:23 AM
> 
> On Sun, Nov 15, 2020 at 11:11:27PM +0100, Thomas Gleixner wrote:
> > On Sun, Nov 15 2020 at 11:31, Ashok Raj wrote:
> > > On Sun, Nov 15, 2020 at 12:26:22PM +0100, Thomas Gleixner wrote:
> > >> > opt-in by device or kernel? The way we are planning to support this is:
> > >> >
> > >> > Device support for IMS - Can discover in device specific means
> > >> > Kernel support for IMS. - Supported by IOMMU driver.
> > >>
> > >> And why exactly do we have to enforce IOMMU support? Please stop
> looking
> > >> at IMS purely from the IDXD perspective. We are talking about the
> > >> general concept here and not about the restricted Intel universe.
> > >
> > > I think you have mentioned it almost every reply :-)..Got that! Point 
> > > taken
> > > several emails ago!! :-)
> >
> > You sure? I _try_ to not mention it again then. No promise though. :)
> 
> Hey.. anything that's entertaining go for it :-)
> 
> >
> > > I didn't mean just for idxd, I said for *ANY* device driver that wants to
> > > use IMS.
> >
> > Which is wrong. Again:
> >
> > A) For PF/VF on bare metal there is absolutely no IOMMU dependency
> >because it does not have a PASID requirement. It's just an
> >alternative solution to MSI[X], which allows optimizations like
> >storing the message in driver manages queue memory or lifting the
> >restriction of 2048 interrupts per device. Nothing else.
> 
> You are right.. my eyes were clouded by virtualization.. no dependency for
> native absolutely.
> 
> >
> > B) For PF/VF in a guest the IOMMU dependency of IMS is a red herring.
> >There is no direct dependency on the IOMMU.
> >
> >The problem is the inability of the VMM to trap the message write to
> >the IMS storage if the storage is in guest driver managed memory.
> >This can be solved with either
> >
> >- a hypercall which translates the guest MSI message
> >or
> >- a vIOMMU which uses a hypercall or whatever to translate the guest
> >  MSI message
> >
> > C) Subdevices ala mdev are a different story. They require PASID which
> >enforces IOMMU and the IMS part is not managed by the users anyway.
> 
> You are right again :)
> 
> The subdevices require PASID & IOMMU in native, but inside the guest there
> is no
> need for IOMMU unless you want to build SVM on top. subdevices work
> without
> any vIOMMU or hypercall in the guest. Only because they look like normal
> PCI devices we could map interrupts to legacy MSIx.

Guest-managed subdevices on PF/VF require a vIOMMU. Anyway, I think
Thomas was just pointing out that subdevices are the only category out
of the above three which may have business tied to the IOMMU. 

> 
> >
> > So we have a couple of problems to solve:
> >
> >   1) Figure out whether the OS runs on bare metal
> >
> >  There is no reliable answer to that, so we either:
> >
> >   - Use heuristics and assume that failure is unlikely and in case
> > of failure blame the incompetence of VMM authors and/or
> > sysadmins
> >
> >  or
> >
> >   - Default to IMS disabled and let the sysadmin enable it via
> > command line option.
> >
> > If the kernel detects to run in a VM it yells and disables it
> > unless the OS and the hypervisor agree to provide support for
> > that scenario (see #2).
> >
> > That's fails as well if the sysadmin does so when the OS runs on
> > a VMM which is not identifiable, but at least we can rightfully
> > blame the sysadmin in that case.
> 
> cmdline isn't nice, best to have this functional out of box.
> 
> >
> >  or
> >
> >   - Declare that IMS always depends on IOMMU
> 
> As you had mentioned IMS has no real dependency on IOMMU in native.
> 
> we just need to make sure if running in guest we have support for it
> plumbed.
> 
> >
> > I personaly don't care, but people working on these kind of
> > device already said, that they want to avoid it when possible.
> >
> > If you want to go that route, then please talk to those folks
> > and ask them to agree in public.
> >
> >  You also need to take into account that this must work on all
> >  architectures which support virtualization because IMS is
> >  architecture independent.
> 
> What you suggest makes perfect sense. We can certainly get buy in from
> iommu list and have this co-ordinated between all existing iommu variants.

Does a hybrid scheme sound good here?

- Say a cmdline parameter: ims=[auto|on|off], with 'auto' as default;

- if ims=auto:

* If arch doesn't implement probably_on_bare_metal, disallow ims;

* If probably_on_bare_metal returns false, disallow ims;
# (future) if hypercall is supported, allow ims;

* If probably_on_bare_metal returns true, allow ims, with the caveat of
possible misinterpretation when running on an old hypervisor. The sysadmin
may need to double-confirm through other means (see the sketch below).
# 
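
For illustration, a rough sketch of the ims=auto decision (helper names
follow the earlier discussion; the hypercall part is hypothetical):

static enum { IMS_AUTO, IMS_ON, IMS_OFF } ims_param = IMS_AUTO;

static bool ims_allowed(void)
{
	if (ims_param == IMS_OFF)
		return false;
	if (ims_param == IMS_ON)
		return true;

	/* ims=auto */
	if (probably_on_bare_metal())
		return true;

	/* (future) in a VM: allow only with hypervisor support */
	return hypercall_irq_domain_supported();
}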

RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-12 Thread Tian, Kevin
> From: Thomas Gleixner 
> Sent: Friday, November 13, 2020 6:43 AM
> 
> On Thu, Nov 12 2020 at 14:32, Konrad Rzeszutek Wilk wrote:
> >> 4. Using CPUID to detect running as guest. But as Thomas pointed out, this
> >> approach is less reliable as not all hypervisors do this way.
> >
> > Is that truly true? It is the first time I see the argument that extra
> > steps are needed and that checking for X86_FEATURE_HYPERVISOR is not
> enough.
> >
> > Or is it more "Some hypervisor probably forgot about it, so lets make sure
> we patch
> > over that possible hole?"
> 
> Nothing enforces that bit to be set. The bit is a pure software
> convention and was proposed by VMWare in 2008 with the following
> changelog:
> 
>  "This patch proposes to use a cpuid interface to detect if we are
>   running on an hypervisor.
> 
>   The discovery of a hypervisor is determined by bit 31 of CPUID#1_ECX,
>   which is defined to be "hypervisor present bit". For a VM, the bit is
>   1, otherwise it is set to 0. This bit is not officially documented by
>   either Intel/AMD yet, but they plan to do so some time soon, in the
>   meanwhile they have promised to keep it reserved for virtualization."
> 
> The reserved promise seems to hold. AMDs APM has it documented. The
> Intel SDM not so.
> 
> Also the kernel side of KVM does not enforce that bit, it's up to the user
> space management to set it.
> 
> And yes, I've tripped over this with some hypervisors and even qemu KVM
> failed to set it in the early days because it was masked with host CPUID
> trimming as there the bit is obviously 0.
> 
> DMI vendor name is pretty good final check when the bit is 0. The
> strings I'm aware of are:
> 
> QEMU, Bochs, KVM, Xen, VMware, VMW, VMware Inc., innotek GmbH,
> Oracle
> Corporation, Parallels, BHYVE, Microsoft Corporation
> 
> which is not complete but better than nothing ;)
> 
> Thanks,
> 
> tglx

Hi, Thomas,

CPUID#1_ECX is an x86 thing. Do we need to figure out
probably_on_bare_metal for every architecture altogether, or is it OK to
just handle it for x86 at this stage? Based on previous discussions,
ims is just one of multiple technologies needed to enable SIOV-like
scalability. Ideally arch-specific enablement beyond ims (e.g. the
IOMMU part) will be required for such scaled usage anyway, so we
may just leave ims disabled for non-x86 and wait until that time to
figure out an arch-specific probably_on_bare_metal?
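
In code this could be as simple as a weak default that non-x86 architectures
inherit (sketch only):

/*
 * Conservative default: unless the arch opts in with its own detection,
 * assume we cannot prove bare metal, which keeps ims disabled.
 */
bool __weak probably_on_bare_metal(void)
{
	return false;
}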

Thanks
Kevin


RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-10 Thread Tian, Kevin
> From: Raj, Ashok 
> Sent: Tuesday, November 10, 2020 10:13 PM
> 
> Thomas,
> 
> With all these interrupt message storms ;-), I'm missing how to move
> towards
> an end goal.
> 
> On Tue, Nov 10, 2020 at 11:27:29AM +0100, Thomas Gleixner wrote:
> > Ashok,
> >
> > On Mon, Nov 09 2020 at 21:14, Ashok Raj wrote:
> > > On Mon, Nov 09, 2020 at 11:42:29PM +0100, Thomas Gleixner wrote:
> > >> On Mon, Nov 09 2020 at 13:30, Jason Gunthorpe wrote:
> > > Approach to IMS is more of a phased approach.
> > >
> > > #1 Allow physical device to scale beyond limits of PCIe MSIx
> > >Follows current methodology for guest interrupt programming and
> > >evolutionary changes rather than drastic.
> >
> > Trapping MSI[X] writes is there because it allows to hand a device to an
> > unmodified guest OS and to handle the case where the MSI[X] entries
> > storage cannot be mapped exclusively to the guest.
> >
> > But aside of this, it's not required if the storage can be mapped
> > exclusively, the guest is hypervisor aware and can get a host composed
> > message via a hypercall. That works for physical functions and SRIOV,
> > but not for SIOV.
> 
> It would greatly help if you can put down what you see is blocking
> to move forward in the following areas.
> 

Agree. We really need some guidance on how to move forward. I think all
people in this thread are aligned now that this is not an Intel or IDXD
specific thing, e.g. it needs an architectural solution, enabling IMS on
PF/VF is important, etc. But what we are not sure about is whether we need
to complete all requirements in one batch, or can evolve step-by-step as
long as the growing path is clearly defined.

IMHO finding a way to disable IMS in the guest is more important than
supporting IMS on PF/VF, since the latter requires a hypercall which is not
available in all scenarios. Even if Linux includes hypercall support for all
existing archs and hypervisors, it could run as an unmodified guest on a new
hypervisor before that hypervisor gets its enlightenments into Linux. So it
is more pressing to find a way to force using MSI/MSI-X inside the guest, as
it keeps such PFs/VFs functional, though without all the scalability merits
of IMS.

If such a two-step plan can be agreed on, then the next open is how to
disable IMS in the guest. We need a sane solution when checking in the
initial host-only-IMS support. There are several options discussed in this
thread:

1. An industry standard (e.g. a vendor-agnostic ACPI flag) followed by all
platforms, hypervisors and OSes. It will require collaboration beyond the
Linux community;

2. IOMMU-vendor specific standards (DMAR, IORT, etc.) to report whether
IMS is allowed, implying that IMS is tied to the IOMMU. This tradeoff is
acceptable since IMS alone cannot make SIOV work, which relies on the
IOMMU anyway, and this might be an easier path to move forward that does
not even require waiting for all vendors to extend their tables together.
On a physical platform the FW always reports IMS as 'allowed' and there is
time to change it. On a virtual platform the hypervisor can choose to hide
IMS in three ways:
a) do not expose the IOMMU
b) expose the IOMMU, but using the old format
c) expose the IOMMU, using the new format with IMS reported 'disallowed'

a/b can well support the legacy software stack.

However, there is one potential issue with options 1/2. The construction
of the virtual ACPI table happens at VM creation time, likely based on
whether a PV interrupt controller is exposed to this guest. But in most
cases the hypervisor doesn't know which guest OS is running and whether it
will use the PV controller when the VM is being created. If IMS is marked
as 'allowed' in the virtual DMAR table, an unmodified guest might just go
and enable it as if it were on a native platform. Maybe what we really
require is a flag to tell the guest that although IMS is available it
cannot be used with traditional interrupt controllers?

3. Use the IOMMU 'caching mode' as the hint of running as a guest and
disable IMS by default whenever 'caching mode' is detected. iirc all IOMMU
vendors provide such a capability for constructing shadow IOMMU page
tables. Later, when hypercall support is detected for a specific
hypervisor/arch, that path can override the IOMMU hint to enable IMS.

Unlike the first two options, this will be a Linux-specific policy but
self-contained. Other guest OSes may not follow this way though.
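
On VT-d, for instance, the hint in option 3 could be derived from the
existing capability bit (a sketch, reusing the cap_caching_mode() helper):

/* 'caching mode' is set by virtual VT-d implementations for shadowing */
static bool vtd_probably_in_guest(struct intel_iommu *iommu)
{
	return cap_caching_mode(iommu->cap);
}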

4. Use CPUID to detect running as a guest. But as Thomas pointed out, this
approach is less reliable as not all hypervisors behave this way.

Thoughts?

Thanks
Kevin


RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-10 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, November 10, 2020 10:19 PM
> On Mon, Nov 09, 2020 at 09:14:12PM -0800, Raj, Ashok wrote:
> 
> > was used for interrupt message storage (on the wire they follow the
> > same format), and also to ensure interoperability of devices
> > supporting IMS across CPU vendors (who may not support PASID TLP
> > prefix).  This is one reason that led to interrupts from IMS to not
> > use PASID (and match the wire format of MSI/MSI-X generated
> > interrupts).  The other problem was disambiguation between DMA to
> > SVM v/s interrupts.
> 
> This is a defect in the IOMMU, not something fundamental.
> 
> The IOMMU needs to know if the interrupt range is active or not for
> each PASID. Process based SVA will, of course, not enable interrupts
> on the PASID, VM Guest based PASID will.
> 

Unfortunately it's more than that. The interrupt message is first recognized
at the root complex today and then routed to the IOMMU, unlike other DMA
requests. I'm not saying it's an unsolvable limitation, but just want to point
out that to achieve such a goal there are more things to be considered beyond
the IOMMU.

Thanks
Kevin


RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-10 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, November 10, 2020 10:24 PM
> 
> On Tue, Nov 10, 2020 at 06:13:23AM -0800, Raj, Ashok wrote:
> 
> > This isn't just for idxd, as I mentioned earlier, there are vendors other
> > than Intel already working on this. In all cases the need for guest direct
> > manipulation of interrupt store hasn't come up. From the discussion, it
> > seems like there are devices today or in future that will require direct
> > manipulation of interrupt store in the guest. This needs additional work
> > in both the device hardware providing the right plumbing and OS work to
> > comprehend those.
> 
> We'd want to see SRIOV's assigned to guests to be able to use
> IMS. This allows a SRIOV instance in a guest to spawn SIOV's which is
> useful.

Does your VF support both MSI and IMS, or IMS only? If the former, can't
we adopt a phased approach, or a parallel effort between forcing the guest
to use MSI and adding a hypercall to enable IMS on the VF? Per the earlier
discussion, finding a way to disable IMS is required anyway when a hypercall
is not available, and it would still provide a functional though suboptimal
model for such VFs.

> 
> SIOV's assigned to guests could use IMS, but the use cases we see in
> the short term can be handled by using SRIOV instead.
> 
> I would expect in general for SIOV to use MSI-X emulation to expose
> interrupts - it would be really weird for a SIOV emulator to do
> something else and we should probably discourage that.
> 

I agree with this point. This makes the hardware gaps in the IOMMU and root
complex less of an immediate blocker, to be addressed in the long term.

Thanks
Kevin


RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-08 Thread Tian, Kevin
> From: Raj, Ashok 
> Sent: Monday, November 9, 2020 7:59 AM
> 
> Hi Thomas,
> 
> [-] Jing, She isn't working at Intel anymore.
> 
> Now this is getting compiled as a book :-).. Thanks a ton!
> 
> One question on the hypercall case that isn't immediately
> clear to me.
> 
> On Sun, Nov 08, 2020 at 07:47:24PM +0100, Thomas Gleixner wrote:
> >
> >
> > Now if we look at the virtualization scenario and device hand through
> > then the structure in the guest view is not any different from the basic
> > case. This works with PCI-MSI[X] and the IDXD IMS variant because the
> > hypervisor can trap the access to the storage and translate the message:
> >
> >                 |
> >                 |
> >   [CPU] ----- [Bri | dge] ----- Bus ----- [Device]
> >                 |
> >   Alloc +
> >   Compose                       Store       Use
> >                                   |
> >                                   | Trap
> >                                   v
> >                 Hypervisor translates and stores
> >
> 
> The above case, VMM is responsible for writing to the message
> store. In both cases if its IMS or Legacy MSI/MSIx. VMM handles
> the writes to the device interrupt region and to the IRTE tables.
> 
> > But obviously with an IMS storage location which is software controlled
> > by the guest side driver (the case Jason is interested in) the above
> > cannot work for obvious reasons.
> >
> > That means the guest needs a way to ask the hypervisor for a proper
> > translation, i.e. a hypercall. Now where to do that? Looking at the
> > above remapping case it's pretty obvious:
> >
> >
> >                 |
> >                 |
> >   [CPU] ----- [VI | RT] ----- [Bridge] -- Bus -- [Device]
> >                 |
> >   Alloc      "Compose"                   Store     Use
> >
> >   Vectordomain   HCALLdomain             Busdomain
> >                 |        ^
> >                 |        |
> >                 v        |
> >              Hypervisor
> >           Alloc + Compose
> >
> > Why? Because it reflects the boundaries and leaves the busdomain part
> > agnostic as it should be. And it works for _all_ variants of Busdomains.
> >
> > Now the question which I can't answer is whether this can work correctly
> > in terms of isolation. If the IMS storage is in guest memory (queue
> > storage) then the guest driver can obviously write random crap into it
> > which the device will happily send. (For MSI and IDXD style IMS it
> > still can trap the store).
> 
> The isolation problem is not just the guest memory being used as interrupt
> store right? If the Store to device region is not trapped and controlled by
> VMM, there is no guarantee the guest OS has done the right thing?
> 
> 
> Thinking about it, guest memory might be more problematic since it's not
> trappable and the VMM can't enforce what is written. This is something that
> needs more attention. But for now, for the devices supporting memory on the
> device, the trap and store by the VMM seems to satisfy the security
> properties you highlight here.
> 

Just want to clarify the trap part.

Guest memory is not trappable in Jason's example, which has queue/IMS
storage swapped between device/memory and requires a special command
to sync the state.

But there are also other forms of in-memory IMS implementations. e.g. some
devices serve work requests based on command buffers instead of HW work
queues. The command buffers are linked into per-process contexts (both in
memory), thus IMS could similarly be stored in each context too. There is no
swap per se. The context is allocated by the driver and then registered to
the device through a mgmt. interface. When the mgmt. interface is mediated,
the hypervisor knows the IMS location and could mark it as read-only in the
EPT page table to enable trapping of guest writes. Of course this approach
is awkward if the complexity is paid just for virtualizing IMS.

Thanks
Kevin


RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-08 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Monday, November 9, 2020 7:24 AM
> 
> On Sun, Nov 08, 2020 at 07:47:24PM +0100, Thomas Gleixner wrote:
> >
> > That means the guest needs a way to ask the hypervisor for a proper
> > translation, i.e. a hypercall. Now where to do that? Looking at the
> > above remapping case it's pretty obvious:
> >
> >
> >                 |
> >                 |
> >   [CPU] ----- [VI | RT] ----- [Bridge] -- Bus -- [Device]
> >                 |
> >   Alloc      "Compose"                   Store     Use
> >
> >   Vectordomain   HCALLdomain             Busdomain
> >                 |        ^
> >                 |        |
> >                 v        |
> >              Hypervisor
> >           Alloc + Compose
> 
> Yes, this will describes what I have been thinking

Agree

> 
> > Now the question which I can't answer is whether this can work correctly
> > in terms of isolation. If the IMS storage is in guest memory (queue
> > storage) then the guest driver can obviously write random crap into it
> > which the device will happily send. (For MSI and IDXD style IMS it
> > still can trap the store).
> 
> There are four cases of interest here:
> 
>  1) Bare metal, PF and VF devices just deliver whatever addr/data pairs
> to the APIC. IMS works perfectly with
> pci_subdevice_msi_create_irq_domain()
> 
>  2) SRIOV VF assigned to the guest.
> 
> The guest can cause any MemWr TLP to any addr/data pair
> and the iommu/platform/vmm is supposed to use the
> Bus/device/function to isolate & secure the interrupt address
> range.
> 
> IMS can work in the guest if the guest knows the details of the
> address range and can make hypercalls to setup routing. So
> pci_subdevice_msi_create_irq_domain() works if the hypercalls
> exist and fails if they don't.
> 
>  3) SIOV sub device assigned to the guest.
> 
> The difference between SIOV and SRIOV is the device must attach a
> PASID to every TLP triggered by the guest. Logically we'd expect
> when IMS is used in this situation the interrupt MemWr is tagged
> with bus/device/function/PASID to uniquly ID the guest and the same
> security protection scheme from #2 applies.

Unfortunately no. Intel VT-d only treats MemWr w/o PASID to 0xFEEx
as an interrupt request. MemWr w/ PASID, even to 0xFEE, is translated
normally through the DMA remapping page table. I don't know about other
IOMMU vendors, but at least on the Intel platform such a device would not
get the desired effect, since the IOMMU only guarantees interrupt isolation
at BDF level.

Does your device already implement such a capability? We can bring this
request back to the hardware team. 

> 
>  4) SIOV sub device assigned to the guest, but with emulation.
> 
> This SIOV device cannot tag interrupts with PASID so cannot do #2
> (or the platform cannot recieve a PASID tagged interrupt message).
> 
> Since the interrupts are being delivered with TLPs pointing at the
> hypervisor the only solution is for the hypervisor to exclusively
> control the interrupt table. MSI table like emulation for IMS is
> needed and the hypervisor will use
> pci_subdevice_msi_create_irq_domain()
> to get the real interrupts.
> 
> pci_subdevice_msi_create_irq_domain() needs to return the 'fake'
> addr/data pairs which are actually an ABI between the guest and
> hypervisor carried in the hidden hypercall of the emulation.
> (ie it works like MSI works today)
> 
> IDXD is worring about case #4, I think, but I didn't follow in that
> whole discussion about the IMS table layout if they PASID tag the IMS
> MemWr or not?? Ashok can you clarify?
> 
> > Is the IOMMU/Interrupt remapping unit able to catch such messages which
> > go outside the space to which the guest is allowed to signal to? If yes,
> > problem solved. If no, then IMS storage in guest memory can't ever work.
> 
> Right. Only PASID on the interrupt messages can resolve this securely.
> 
> > So in case that the HCALL domain is missing, the Vector domain needs
> > return an error code on domain creation. If the HCALL domain is there
> > then the domain creation works and in case of actual interrupt
> > allocation the hypercall either returns a valid composed message or an
> > appropriate error code.
> 
> Yes
> 
> > But there's a catch:
> >
> > This only works when the guest OS actually knows that it runs in a
> > VM. If the guest can't figure that out, i.e. via CPUID, this cannot be
> > solved because from the guest OS view that's the same as running on bare
> > metal. Obviously on bare metal the Vector domain can and must handle
> > this.
> 
> Yes
> 
> The flip side is today, the way pci_subdevice_msi_create_irq_domain()
> works a VF using it on baremetal will succeed and if that same VF is
> assigned to a guest then pci_subdevice_msi_create_irq_domain()
> succeeds but the interrupt never comes - so the driver is broken.

Yes, this is the main worry 

RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-08 Thread Tian, Kevin



> -Original Message-
> From: Thomas Gleixner 
> Sent: Saturday, November 7, 2020 8:32 AM
> To: Tian, Kevin ; Jason Gunthorpe 
> Cc: Jiang, Dave ; Bjorn Helgaas ;
> vk...@kernel.org; Dey, Megha ; m...@kernel.org;
> bhelg...@google.com; alex.william...@redhat.com; Pan, Jacob jun
> ; Raj, Ashok ; Liu, Yi L
> ; Lu, Baolu ; Kumar, Sanjay K
> ; Luck, Tony ;
> jing@intel.com; Williams, Dan J ;
> kwankh...@nvidia.com; eric.au...@redhat.com; pa...@mellanox.com;
> raf...@kernel.org; netan...@mellanox.com; shah...@mellanox.com;
> yan.y.z...@linux.intel.com; pbonz...@redhat.com; Ortiz, Samuel
> ; Hossain, Mona ;
> dmaeng...@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> p...@vger.kernel.org; k...@vger.kernel.org
> Subject: RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
> 
> On Fri, Nov 06 2020 at 09:48, Kevin Tian wrote:
> >> From: Jason Gunthorpe 
> >> On Wed, Nov 04, 2020 at 01:34:08PM +, Tian, Kevin wrote:
> >> The interrupt controller is responsible to create an addr/data pair
> >> for an interrupt message. It sets the message format and ensures it
> >> routes to the proper CPU interrupt handler. Everything about the
> >> addr/data pair is owned by the platform interrupt controller.
> >>
> >> Devices do not create interrupts. They only trigger the addr/data pair
> >> the platform gives them.
> >
> > I guess that we may just view it from different angles. On x86 platform,
> > a MSI/IMS capable device directly composes interrupt messages, with
> > addr/data pair filled by OS. If there is no IOMMU remapping enabled in
> > the middle, the message just hits the CPU. Your description possibly
> > is from software side, e.g. describing the hierarchical IRQ domain
> > concept?
> 
> No. The device composes nothing. If the interrupt is raised in the
> device then the MSI block sends the message which was composed by the OS
> and stored in the device's message store. For PCI/MSI that's the MSI or
> MSIX table and for IMS that's either on device memory (as IDXD uses) or
> some completely different location which Jason described.

Sorry for being inaccurate here. I actually meant the same thing as
you described, since I did mention the addr/data pair being filled by the OS.
Unfortunately I mistakenly thought that 'compose' has a similar
meaning to 'send' in English, but clearly it doesn't; it's
just about the message content. And for sure I also agree with your
other clarifications regarding the architecture-independent manner.

Thanks
Kevin

> 
> This has absolutely nothing to do with the X86 platform. MSI is a
> architecture independent mechanism: Send whatever the OS put into the
> storage to raise an interrupt in the CPU. The device does neither know
> whether that message is going to be intercepted by an interrupt
> remapping unit or not.
> 
> Stop claiming that any of this has anything to do with x86. It has
> absolutely nothing to do with x86 and looking at MSI from an x86
> perspective instead of looking at it from the architecture agnostic
> technical reality of MSI is the reason why we have this discussion at
> all.
> 
> We had a similar discussion vs. the way how IMS interrupts have to be
> dealt with in terms of irq domains. Can you finally stop looking at
> everything as a big x86/intel/platform lump and understand that things
> are very well structured and seperated both at the hardware and at the
> software level?
> 
> > Do you mind providing the link? There were lots of discussions between
> > you and Thomas. I failed to locate the exact mail when searching above
> > keywords.
> 
> In this thread: 20200821002424.119492...@linutronix.de and you were on
> Cc
> 
> Thanks,
> 
> tglx
> 



RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-06 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, November 4, 2020 9:54 PM
> 
> On Wed, Nov 04, 2020 at 01:34:08PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Wednesday, November 4, 2020 8:40 PM
> > >
> > > On Wed, Nov 04, 2020 at 03:41:33AM +0000, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe 
> > > > > Sent: Tuesday, November 3, 2020 8:44 PM
> > > > >
> > > > > On Tue, Nov 03, 2020 at 02:49:27AM +0000, Tian, Kevin wrote:
> > > > >
> > > > > > > There is a missing hypercall to allow the guest to do this on its 
> > > > > > > own,
> > > > > > > presumably it will someday be fixed so IMS can work in guests.
> > > > > >
> > > > > > Hypercall is VMM specific, while IMS cap provides a VMM-agnostic
> > > > > > interface so any guest driver (if following the spec) can seamlessly
> > > > > > work on all hypervisors.
> > > > >
> > > > > It is a *VMM* issue, not PCI. Adding a PCI cap to describe a VMM
> issue
> > > > > is architecturally wrong.
> > > > >
> > > > > IMS *can not work* in any hypervsior without some special
> > > > > hypercall. Just block it in the platform code and forget about the PCI
> > > > > cap.
> > > > >
> > > >
> > > > It's per-device thing instead of platform thing. If the VMM understands
> > > > the IMS format of a specific device and virtualize it to the guest,
> > >
> > > Please no! Adding device specific emulation is just going down deeper
> > > into this bad architecture.
> > >
> > > Interrupts is a platform issue. Using emulation of MSI to dynamically
> >
> > Interrupt controller is a platform issue. Interrupt source is about device.
> 
> The interrupt controller is responsible to create an addr/data pair
> for an interrupt message. It sets the message format and ensures it
> routes to the proper CPU interrupt handler. Everything about the
> addr/data pair is owned by the platform interrupt controller.
> 
> Devices do not create interrupts. They only trigger the addr/data pair
> the platform gives them.

I guess we may just be viewing it from different angles. On the x86 platform,
an MSI/IMS capable device directly composes interrupt messages, with the
addr/data pair filled by the OS. If there is no IOMMU remapping enabled in
the middle, the message just hits the CPU. Your description is possibly
from the software side, e.g. describing the hierarchical IRQ domain
concept?

> 
> > > insert vectors to a VM was a reasonable, but hacky thing. Now it needs
> > > proper platform support.
> >
> > why is MSI emulation a hacky thing? isn't it defined by PCISIG? I guess
> > that I must misunderstand your real point here...
> 
> It means the interrupt controller in the VM's platform is a fiction,
> the addr/data pairs it creates are not real.
> 
> A PCI device assigned to a VM is supposed to be fully contained by the
> IOMMU, interrupts included, so there is no reason to do MSI emulation
> if the VM's interrupt controller is aware of what addr/data pairs it
> can use with the device - eg by getting them through a hypercall. This
> is much cleaner and supports things like IMS

I agree with this point; that is just how pci-hyperv.c works. In concept a
Linux guest driver should be able to use IMS when running on Hyper-V. There
is no such thing for KVM, but possibly one day we will need similar stuff.
Before that happens, the guest could choose to simply disallow devmsi
by default in the platform code (inventing a hypercall just for 'disable'
doesn't make sense) and ignore the IMS cap. One small open is whether
this can be done in one central place. The detection of running as a guest
is done in arch-specific code. Do we need to disable devmsi for every arch?
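
As a sketch, the x86 side of such a default-disallow policy could be as
small as below (reusing the arch hook name from earlier in this thread):

/* x86: disallow device-msi/IMS whenever we can tell we are in a VM */
bool arch_support_pci_device_ims(struct pci_dev *pdev)
{
	return !boot_cpu_has(X86_FEATURE_HYPERVISOR);
}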

But when talking about virtualization it's not good to assume guest
behavior. It's perfectly sane to run a guest OS which doesn't implement
any PV stuff (thus doesn't know it's running in a VM) but does support IMS.
In such a scenario the IMS cap allows the hypervisor to educate the guest
driver to use MSI instead of IMS, as long as the driver follows the device
spec. In this regard I don't think the IMS cap will be a short-term
thing, although Linux may choose not to use it.

> 
> Trying to do IMS emulation is nutz, the entire point of IMS is the
> device can do what it likes, and emulating that is not going to
> feasible. For instance go read the discussion I had with Thomas how a
> object-centric device would manage interrupts.
> 

Do you mind providing the link? There were lots of discussions between
you and Thomas. I failed to locate the exact mail when searching above
keywords. 

Thanks
Kevin


RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-04 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, November 4, 2020 8:40 PM
> 
> On Wed, Nov 04, 2020 at 03:41:33AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Tuesday, November 3, 2020 8:44 PM
> > >
> > > On Tue, Nov 03, 2020 at 02:49:27AM +0000, Tian, Kevin wrote:
> > >
> > > > > There is a missing hypercall to allow the guest to do this on its own,
> > > > > presumably it will someday be fixed so IMS can work in guests.
> > > >
> > > > Hypercall is VMM specific, while IMS cap provides a VMM-agnostic
> > > > interface so any guest driver (if following the spec) can seamlessly
> > > > work on all hypervisors.
> > >
> > > It is a *VMM* issue, not PCI. Adding a PCI cap to describe a VMM issue
> > > is architecturally wrong.
> > >
> > > IMS *can not work* in any hypervsior without some special
> > > hypercall. Just block it in the platform code and forget about the PCI
> > > cap.
> > >
> >
> > It's per-device thing instead of platform thing. If the VMM understands
> > the IMS format of a specific device and virtualize it to the guest,
> 
> Please no! Adding device specific emulation is just going down deeper
> into this bad architecture.
> 
> Interrupts is a platform issue. Using emulation of MSI to dynamically

Interrupt controller is a platform issue. Interrupt source is about device.

> insert vectors to a VM was a reasonable, but hacky thing. Now it needs
> proper platform support.
> 

why is MSI emulation a hacky thing? isn't it defined by PCISIG? I guess
that I must misunderstand your real point here...

Thanks
Kevin


RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-03 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, November 3, 2020 8:44 PM
> 
> On Tue, Nov 03, 2020 at 02:49:27AM +0000, Tian, Kevin wrote:
> 
> > > There is a missing hypercall to allow the guest to do this on its own,
> > > presumably it will someday be fixed so IMS can work in guests.
> >
> > Hypercall is VMM specific, while IMS cap provides a VMM-agnostic
> > interface so any guest driver (if following the spec) can seamlessly
> > work on all hypervisors.
> 
> It is a *VMM* issue, not PCI. Adding a PCI cap to describe a VMM issue
> is architecturally wrong.
> 
> IMS *can not work* in any hypervsior without some special
> hypercall. Just block it in the platform code and forget about the PCI
> cap.
> 

It's a per-device thing instead of a platform thing. If the VMM understands
the IMS format of a specific device and virtualizes it to the guest, the
guest can use IMS w/o any hypercall. If the VMM doesn't understand it, it
simply clears the IMS cap bit for this device, which forces the guest to
use the standard PCI MSI/MSI-X interface. On the VMM side the decision is
based on device virtualization knowledge, e.g. in VFIO, instead of
platform virtualization logic. Your platform argument is based on the
hypercall assumption, which is exactly what we want to avoid.

Thanks
Kevin


RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-02 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Monday, November 2, 2020 9:22 PM
> 
> On Fri, Oct 30, 2020 at 03:49:22PM -0700, Dave Jiang wrote:
> >
> >
> > On 10/30/2020 3:45 PM, Jason Gunthorpe wrote:
> > > On Fri, Oct 30, 2020 at 02:20:03PM -0700, Dave Jiang wrote:
> > > > So the intel-iommu driver checks for the SIOV cap. And the idxd driver
> > > > checks for SIOV and IMS cap. There will be other upcoming drivers that
> will
> > > > check for such cap too. It is Intel vendor specific right now, but SIOV 
> > > > is
> > > > public and other vendors may implement to the spec. Is there a good
> place to
> > > > put the common capability check for that?
> > >
> > > I'm still really unhappy with these SIOV caps. It was explained this
> > > is just a hack to make up for pci_ims_array_create_msi_irq_domain()
> > > succeeding in VM cases when it doesn't actually work.
> > >
> > > Someday this is likely to get fixed, so tying platform behavior to PCI
> > > caps is completely wrong.
> > >
> > > This needs to be solved in the platform code,
> > > pci_ims_array_create_msi_irq_domain() should not succeed in these
> > > cases.
> >
> > That sounds reasonable. Are you asking that the IMS cap check should gate
> > the success/failure of pci_ims_array_create_msi_irq_domain() rather than
> the
> > driver?
> 
> There shouldn't be an IMS cap at all
> 
> As I understand, the problem here is the only way to establish new
> VT-d IRQ routing is by trapping and emulating MSI/MSI-X related
> activities and triggering routing of the vectors into the guest.
> 
> There is a missing hypercall to allow the guest to do this on its own,
> presumably it will someday be fixed so IMS can work in guests.

Hypercall is VMM specific, while IMS cap provides a VMM-agnostic
interface so any guest driver (if following the spec) can seamlessly
work on all hypervisors.

Thanks
Kevin




RE: [PATCH v6 5/5] vfio/type1: Use mdev bus iommu_ops for IOMMU callbacks

2020-10-30 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Friday, October 30, 2020 12:58 PM
> 
> With the IOMMU driver registering iommu_ops for the mdev_bus, the IOMMU
> operations on an mdev could be done in the same way as any normal device
> (for example, PCI/PCIe). There's no need to distinguish an mdev from
> others for iommu operations. Remove the unnecessary code.

This change results in a really nice cleanup! :)

Thanks
Kevin

> 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/vfio/mdev/mdev_core.c    |  18 -
>  drivers/vfio/mdev/mdev_driver.c  |   6 ++
>  drivers/vfio/mdev/mdev_private.h |   1 -
>  drivers/vfio/vfio_iommu_type1.c  | 128 +++
>  include/linux/mdev.h             |  14 -
>  5 files changed, 18 insertions(+), 149 deletions(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 6b9ab71f89e7..f4fd5f237c49 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -386,24 +386,6 @@ int mdev_device_remove(struct device *dev)
>   return 0;
>  }
> 
> -int mdev_set_iommu_device(struct device *dev, struct device *iommu_device)
> -{
> - struct mdev_device *mdev = to_mdev_device(dev);
> -
> - mdev->iommu_device = iommu_device;
> -
> - return 0;
> -}
> -EXPORT_SYMBOL(mdev_set_iommu_device);
> -
> -struct device *mdev_get_iommu_device(struct device *dev)
> -{
> - struct mdev_device *mdev = to_mdev_device(dev);
> -
> - return mdev->iommu_device;
> -}
> -EXPORT_SYMBOL(mdev_get_iommu_device);
> -
>  static int __init mdev_init(void)
>  {
>   return mdev_bus_register();
> diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
> index 0d3223aee20b..487402f16355 100644
> --- a/drivers/vfio/mdev/mdev_driver.c
> +++ b/drivers/vfio/mdev/mdev_driver.c
> @@ -18,6 +18,9 @@ static int mdev_attach_iommu(struct mdev_device *mdev)
>   int ret;
>   struct iommu_group *group;
> 
> + if (iommu_present(&mdev_bus_type))
> + return 0;
> +
>   group = iommu_group_alloc();
>   if (IS_ERR(group))
>   return PTR_ERR(group);
> @@ -33,6 +36,9 @@ static int mdev_attach_iommu(struct mdev_device *mdev)
> 
>  static void mdev_detach_iommu(struct mdev_device *mdev)
>  {
> + if (iommu_present(&mdev_bus_type))
> + return;
> +
>   iommu_group_remove_device(&mdev->dev);
>   dev_info(&mdev->dev, "MDEV: detaching iommu\n");
>  }
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> index 7d922950caaf..efe0aefdb52f 100644
> --- a/drivers/vfio/mdev/mdev_private.h
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -31,7 +31,6 @@ struct mdev_device {
>   void *driver_data;
>   struct list_head next;
>   struct kobject *type_kobj;
> - struct device *iommu_device;
>   bool active;
>  };
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index bb2684cc245e..e231b7070ca5 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -100,7 +100,6 @@ struct vfio_dma {
>  struct vfio_group {
>   struct iommu_group  *iommu_group;
>   struct list_head    next;
> - bool                mdev_group; /* An mdev group */
>   bool                pinned_page_dirty_scope;
>  };
> 
> @@ -1675,102 +1674,6 @@ static bool vfio_iommu_has_sw_msi(struct list_head *group_resv_regions,
>   return ret;
>  }
> 
> -static struct device *vfio_mdev_get_iommu_device(struct device *dev)
> -{
> - struct device *(*fn)(struct device *dev);
> - struct device *iommu_device;
> -
> - fn = symbol_get(mdev_get_iommu_device);
> - if (fn) {
> - iommu_device = fn(dev);
> - symbol_put(mdev_get_iommu_device);
> -
> - return iommu_device;
> - }
> -
> - return NULL;
> -}
> -
> -static int vfio_mdev_attach_domain(struct device *dev, void *data)
> -{
> - struct iommu_domain *domain = data;
> - struct device *iommu_device;
> -
> - iommu_device = vfio_mdev_get_iommu_device(dev);
> - if (iommu_device) {
> - if (iommu_dev_feature_enabled(iommu_device, IOMMU_DEV_FEAT_AUX))
> - return iommu_aux_attach_device(domain, iommu_device);
> - else
> - return iommu_attach_device(domain, iommu_device);
> - }
> -
> - return -EINVAL;
> -}
> -
> -static int vfio_mdev_detach_domain(struct device *dev, void *data)
> -{
> - struct iommu_domain *domain = data;
> - struct device *iommu_device;
> -
> - iommu_device = vfio_mdev_get_iommu_device(dev);
> - if (iommu_device) {
> - if (iommu_dev_feature_enabled(iommu_device, IOMMU_DEV_FEAT_AUX))
> - iommu_aux_detach_device(domain, iommu_device);
> - else
> - iommu_detach_device(domain, iommu_device);
> - }
> -
> - return 0;
> -}
> -
> -static int vfio_iommu_attach_group(struct vfio_domain 

RE: [PATCH v6 4/5] iommu/vt-d: Add iommu_ops support for subdevice bus

2020-10-30 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Friday, October 30, 2020 12:58 PM
> 
> The iommu_ops will only take effect when INTEL_IOMMU_SCALABLE_IOV
> kernel
> option is selected. It applies to any device passthrough framework which
> implements an underlying bus for the subdevices.
> 
> - Subdevice probe:
>   When a subdevice is created and added to the bus, iommu_probe_device()
>   will be called, where the device will be probed by the iommu core. An
>   iommu group will be allocated and the device will be added to it. The
>   default domain won't be allocated since there's no use case for using a
>   subdevice in the host kernel for the time being. However, it's pretty
>   easy to add this support later.
> 
> - Domain alloc/free/map/unmap/iova_to_phys operations:
>   For such ops, we just reuse those for PCI/PCIe devices.

One question. Just curious whether every IOMMU vendor supports
only one iommu_ops for all bus types. For Intel the answer is obviously
yes. But if a vendor supports bus-type-specific iommu_ops for physical
devices, this may impose a restriction on VFIO or other passthrough 
frameworks, because VFIO today maintains only one mdev bus while the 
parent devices could come from different bus types. In the end the ops
for a subdevice should be the same as the ones used for its parent, so it 
may require VFIO to organize subdevices based on parent bus types. 
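To put the constraint in code (just a sketch of the invariant;
subdev_get_parent() is a made-up lookup, not an existing helper):

static const struct iommu_ops *subdev_iommu_ops(struct device *subdev)
{
	/* whatever the subdevice framework does, the ops it ends up
	 * with must be the iommu_ops of the parent device's bus */
	struct device *parent = subdev_get_parent(subdev);

	return parent->bus->iommu_ops;
}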

> 
> - Domain attach/detach operations:
>   It depends on whether the parent device supports IOMMU_DEV_FEAT_AUX
>   feature. If so, the domain will be attached to the parent device as an
>   aux-domain; Otherwise, it will be attached to the parent as a primary
>   domain.
> 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/intel/Kconfig  |  13 
>  drivers/iommu/intel/Makefile |   1 +
>  drivers/iommu/intel/iommu.c  |   5 ++
>  drivers/iommu/intel/siov.c   | 119 +++
>  include/linux/intel-iommu.h  |   4 ++
>  5 files changed, 142 insertions(+)
>  create mode 100644 drivers/iommu/intel/siov.c
> 
> diff --git a/drivers/iommu/intel/Kconfig b/drivers/iommu/intel/Kconfig
> index 28a3d1596c76..94edc332f558 100644
> --- a/drivers/iommu/intel/Kconfig
> +++ b/drivers/iommu/intel/Kconfig
> @@ -86,3 +86,16 @@ config INTEL_IOMMU_SCALABLE_MODE_DEFAULT_ON
> is not selected, scalable mode support could also be enabled by
> passing intel_iommu=sm_on to the kernel. If not sure, please use
> the default value.
> +
> +config INTEL_IOMMU_SCALABLE_IOV

INTEL_IOMMU_SUBDEVICE? if just talking from IOMMU p.o.v...

> + bool "Support for Intel Scalable I/O Virtualization"
> + depends on INTEL_IOMMU
> + select VFIO
> + select VFIO_MDEV
> + select VFIO_MDEV_DEVICE
> + help
> +   Intel Scalable I/O virtualization (SIOV) is hardware-assisted
> +   virtualization of PCIe subdevices. With each subdevice tagged with
> +   a unique ID (PCI/PASID), the VT-d hardware can identify, and hence
> +   isolate, DMA transactions from different subdevices on the same
> +   PCIe device. Selecting this option will enable the support.
> diff --git a/drivers/iommu/intel/Makefile b/drivers/iommu/intel/Makefile
> index fb8e1e8c8029..f216385d5d59 100644
> --- a/drivers/iommu/intel/Makefile
> +++ b/drivers/iommu/intel/Makefile
> @@ -4,4 +4,5 @@ obj-$(CONFIG_INTEL_IOMMU) += iommu.o pasid.o
>  obj-$(CONFIG_INTEL_IOMMU) += trace.o
>  obj-$(CONFIG_INTEL_IOMMU_DEBUGFS) += debugfs.o
>  obj-$(CONFIG_INTEL_IOMMU_SVM) += svm.o
> +obj-$(CONFIG_INTEL_IOMMU_SCALABLE_IOV) += siov.o
>  obj-$(CONFIG_IRQ_REMAP) += irq_remapping.o
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 1454fe74f3ba..dafd8069c2af 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -4298,6 +4298,11 @@ int __init intel_iommu_init(void)
>   up_read(&dmar_global_lock);
> 
>   bus_set_iommu(&pci_bus_type, &intel_iommu_ops);
> +
> +#ifdef CONFIG_INTEL_IOMMU_SCALABLE_IOV
> + intel_siov_init();
> +#endif /* CONFIG_INTEL_IOMMU_SCALABLE_IOV */
> +
>   if (si_domain && !hw_pass_through)
>   register_memory_notifier(&intel_iommu_memory_nb);
>   cpuhp_setup_state(CPUHP_IOMMU_INTEL_DEAD,
> "iommu/intel:dead", NULL,
> diff --git a/drivers/iommu/intel/siov.c b/drivers/iommu/intel/siov.c
> new file mode 100644
> index ..b9470e7ab3d6
> --- /dev/null
> +++ b/drivers/iommu/intel/siov.c
> @@ -0,0 +1,119 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/**
> + * siov.c - Intel Scalable I/O virtualization support
> + *
> + * Copyright (C) 2020 Intel Corporation
> + *
> + * Author: Lu Baolu 
> + */
> +
> +#define pr_fmt(fmt)  "DMAR: " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +static struct device *subdev_lookup_parent(struct device *dev)
> +{
> + if (dev->bus == &mdev_bus_type)
> + return mdev_parent_dev(mdev_from_dev(dev));

What about finding the parent through the device core? Then the logic
would work for all subdevice frameworks.

> +
> + return NULL;
> 
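To be concrete, something like the below would drop the mdev dependency
(a rough sketch; it relies on every subdevice being created as a child
of its parent device):

static struct device *subdev_lookup_parent(struct device *dev)
{
	/* use the generic device core parent link instead of
	 * framework-specific helpers like mdev_parent_dev() */
	return dev->parent;
}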

RE: [PATCH v6 2/5] iommu: Use bus iommu ops for aux related callback

2020-10-29 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Friday, October 30, 2020 12:58 PM
> 
> The aux-domain APIs were designed for the macro-driver model, where the
> subdevices are created and used inside a device driver. Use the device's
> bus iommu ops instead of those in the iommu domain for the various callbacks.

IIRC there are only two users of these APIs. One is VFIO, and the other
is on the ARM side (not checked in yet). Jean, can you help confirm 
whether the ARM-side usage still relies on the aux APIs even with this
change? If not, can they possibly be removed completely?

Thanks
Kevin

> 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/iommu.c | 16 ++--
>  1 file changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 6bbdd959f9f3..17f2686664db 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -2913,10 +2913,11 @@
> EXPORT_SYMBOL_GPL(iommu_dev_feature_enabled);
>   */
>  int iommu_aux_attach_device(struct iommu_domain *domain, struct device
> *dev)
>  {
> + const struct iommu_ops *ops = dev->bus->iommu_ops;
>   int ret = -ENODEV;
> 
> - if (domain->ops->aux_attach_dev)
> - ret = domain->ops->aux_attach_dev(domain, dev);
> + if (ops && ops->aux_attach_dev)
> + ret = ops->aux_attach_dev(domain, dev);
> 
>   if (!ret)
>   trace_attach_device_to_domain(dev);
> @@ -2927,8 +2928,10 @@
> EXPORT_SYMBOL_GPL(iommu_aux_attach_device);
> 
>  void iommu_aux_detach_device(struct iommu_domain *domain, struct
> device *dev)
>  {
> - if (domain->ops->aux_detach_dev) {
> - domain->ops->aux_detach_dev(domain, dev);
> + const struct iommu_ops *ops = dev->bus->iommu_ops;
> +
> + if (ops && ops->aux_detach_dev) {
> + ops->aux_detach_dev(domain, dev);
>   trace_detach_device_from_domain(dev);
>   }
>  }
> @@ -2936,10 +2939,11 @@
> EXPORT_SYMBOL_GPL(iommu_aux_detach_device);
> 
>  int iommu_aux_get_pasid(struct iommu_domain *domain, struct device
> *dev)
>  {
> + const struct iommu_ops *ops = dev->bus->iommu_ops;
>   int ret = -ENODEV;
> 
> - if (domain->ops->aux_get_pasid)
> - ret = domain->ops->aux_get_pasid(domain, dev);
> + if (ops && ops->aux_get_pasid)
> + ret = ops->aux_get_pasid(domain, dev);
> 
>   return ret;
>  }
> --
> 2.25.1



RE: [PATCH v2 1/1] iommu/vt-d: Use device numa domain if RHSA is missing

2020-09-03 Thread Tian, Kevin
> From: Lu Baolu
> Sent: Friday, September 4, 2020 9:03 AM
> 
> If there are multiple NUMA domains but the RHSA is missing in the ACPI/DMAR
> table, we could default to the device NUMA domain as a fallback. This could
> also benefit a vIOMMU use case where only a single vIOMMU is exposed, hence
> no RHSA will be present but the device NUMA domain can still be correct.

My comment on this is not addressed. This is not restricted to the
single-vIOMMU situation, and it may actually also happen on a physical
platform if some FW doesn't provide RHSA information.

with that being fixed:

Reviewed-by: Kevin Tian 

> 
> Cc: Jacob Pan 
> Cc: Kevin Tian 
> Cc: Ashok Raj 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/intel/iommu.c | 37
> +++--
>  1 file changed, 35 insertions(+), 2 deletions(-)
> 
> Change log:
> v1->v2:
>   - Add a comment as suggested by Kevin.
> https://lore.kernel.org/linux-
> iommu/MWHPR11MB1645E6D6BD1EFDFA139AA37C8C520@MWHPR11MB1
> 645.namprd11.prod.outlook.com/
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 7f844d1c8cd9..69d5a87188f4 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -698,12 +698,47 @@ static int
> domain_update_iommu_superpage(struct dmar_domain *domain,
>   return fls(mask);
>  }
> 
> +static int domain_update_device_node(struct dmar_domain *domain)
> +{
> + struct device_domain_info *info;
> + int nid = NUMA_NO_NODE;
> +
> + assert_spin_locked(&device_domain_lock);
> +
> + if (list_empty(&domain->devices))
> + return NUMA_NO_NODE;
> +
> + list_for_each_entry(info, &domain->devices, link) {
> + if (!info->dev)
> + continue;
> +
> + /*
> +  * There could possibly be multiple device numa nodes as devices
> +  * within the same domain may sit behind different IOMMUs. There
> +  * isn't a perfect answer in such a situation, so we select a
> +  * first-come, first-served policy.
> +  */
> + nid = dev_to_node(info->dev);
> + if (nid != NUMA_NO_NODE)
> + break;
> + }
> +
> + return nid;
> +}
> +
>  /* Some capabilities may be different across iommus */
>  static void domain_update_iommu_cap(struct dmar_domain *domain)
>  {
>   domain_update_iommu_coherency(domain);
>   domain->iommu_snooping =
> domain_update_iommu_snooping(NULL);
>   domain->iommu_superpage =
> domain_update_iommu_superpage(domain, NULL);
> +
> + /*
> +  * If RHSA is missing, we should default to the device numa domain
> +  * as fall back.
> +  */
> + if (domain->nid == NUMA_NO_NODE)
> + domain->nid = domain_update_device_node(domain);
>  }
> 
>  struct context_entry *iommu_context_addr(struct intel_iommu *iommu, u8
> bus,
> @@ -5096,8 +5131,6 @@ static struct iommu_domain
> *intel_iommu_domain_alloc(unsigned type)
>   if (type == IOMMU_DOMAIN_DMA)
>   intel_init_iova_domain(dmar_domain);
> 
> - domain_update_iommu_cap(dmar_domain);
> -
>   domain = &dmar_domain->domain;
>   domain->geometry.aperture_start = 0;
>   domain->geometry.aperture_end   =
> --
> 2.17.1
> 


RE: [PATCH v3] PCI: Introduce flag for detached virtual functions

2020-09-01 Thread Tian, Kevin
> From: kvm-ow...@vger.kernel.org  On Behalf
> Of Niklas Schnelle
> Sent: Friday, August 28, 2020 5:10 PM
> To: Bjorn Helgaas ; Alex Williamson
> 
> 
[...]
> >>
> >> FWIW, pci_physfn() never returns NULL, it returns the provided pdev if
> >> is_virtfn is not set.  This proposal wouldn't change that return value.
> >> AIUI pci_physfn(), the caller needs to test that the returned device is
> >> different from the provided device if there's really code that wants to
> >> traverse to the PF.
> >
> > Oh, so this VF has is_virtfn==0.  That seems weird.  There are lots of
> > other ways that a VF is different: Vendor/Device IDs are 0x, BARs
> > are zeroes, etc.
> >
> > It sounds like you're sweeping those under the rug by avoiding the
> > normal enumeration path (e.g., you don't have to size the BARs), but
> > if it actually is a VF, it seems like there might be fewer surprises
> > if we treat it as one.
> >
> > Why don't you just set is_virtfn=1 since it *is* a VF, and then deal
> > with the special cases where you want to touch the PF?
> >
> > Bjorn
> >
> 
> As we are always running under at least a machine-level hypervisor
> we're somewhat in the same situation as e.g. a KVM guest in
> that the VFs we see have some emulation that makes them act more like
> normal PCI functions. It just so happens that the machine-level
> hypervisor does not emulate PCI_COMMAND_MEMORY; it does emulate BARs
> and Vendor/Device IDs though.
> So is_virtfn is 0 for some VF for the same reason it is 0 on
> KVM/ESXi/HyperV/Jailhouse… guests on other architectures.

I wonder whether it's a good idea to also find a way to set is_virtfn
for a normal KVM guest which gets a VF assigned. There are other cases 
where faithful emulation of certain PCI capabilities is difficult, e.g. 
when enabling guest SVA-related features (PASID/ATS/PRS). Per the PCIe 
spec, some or all fields of those capabilities are shared between PF 
and VF. Among them:

1) Some could be emulated properly and indirectly reflected in hardware, 
e.g. Intel VT-d allows additional control per VF over whether to accept 
page requests, execute/privileged permission, etc., thus allowing 
VF-specific control even when the device-side setting is shared;

2) Some could be purely emulated in software, and it's harmless to leave
the hardware following the PF setting, e.g. ATS enable, STU(?), outstanding
page request allocation, etc.;

3) However, I didn't see a clean way of emulating page_request_ctrl.reset
and page_request_status.stopped. Those two have clear definitions regarding
outstanding page requests. They are shared, thus we cannot take a physical
action just due to a request on one VF, while pure software emulation 
cannot guarantee the desired behavior. Of course this issue also exists
even on bare metal - pci_enable/disable/reset_pri just do nothing for a
VF. There is at least a chance to mitigate it (e.g. via a timeout), but that
is not possible in a guest if the guest doesn't know it's actually a VF.

Setting is_virtfn=1 allows the guest to be cooperative, like running together
with the PF driver. But there is an ordering issue. The guest knows whether
a device is a VF only when the VF driver is loaded (based on PCI_ID), but
related capabilities might already be enabled when attaching the device
to the IOMMU (at least for intel_iommu). I suppose it's not a hard fix, though.
Last, a detached VF is not a PCI-SIG definition, so the host still needs to
do proper emulation (even if not faithful) of those capabilities for guests 
that don't recognize detached VFs.

Thoughts?

Thanks
Kevin


RE: [PATCH 1/1] iommu/vt-d: Use device numa domain if RHSA is missing

2020-08-27 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Thursday, August 27, 2020 1:57 PM
> 
> If there are multiple NUMA domains but the RHSA is missing in the ACPI/DMAR
> table, we could default to the device NUMA domain as a fallback. This also
> benefits the vIOMMU use case where only a single vIOMMU is exposed, hence
> no RHSA will be present but the device NUMA domain can be correct.

This benefits vIOMMU but is not necessarily limited to the single-vIOMMU
case. The logic still holds in multiple-vIOMMU cases as long as RHSA is
not provided.

> 
> Cc: Jacob Pan 
> Cc: Kevin Tian 
> Cc: Ashok Raj 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/intel/iommu.c | 31 +--
>  1 file changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index e0516d64d7a3..bce158468abf 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -700,12 +700,41 @@ static int
> domain_update_iommu_superpage(struct dmar_domain *domain,
>   return fls(mask);
>  }
> 
> +static int domain_update_device_node(struct dmar_domain *domain)
> +{
> + struct device_domain_info *info;
> + int nid = NUMA_NO_NODE;
> +
> + assert_spin_locked(&device_domain_lock);
> +
> + if (list_empty(&domain->devices))
> + return NUMA_NO_NODE;
> +
> + list_for_each_entry(info, &domain->devices, link) {
> + if (!info->dev)
> + continue;
> +
> + nid = dev_to_node(info->dev);
> + if (nid != NUMA_NO_NODE)
> + break;
> + }

There could be multiple device numa nodes as devices within the
same domain may sit behind different IOMMUs. Of course there
is no perfect answer in such a situation, and this patch is still an
obvious improvement on the current always-on-node0 policy. But 
a comment about this implication would be welcome.

> +
> + return nid;
> +}
> +
>  /* Some capabilities may be different across iommus */
>  static void domain_update_iommu_cap(struct dmar_domain *domain)
>  {
>   domain_update_iommu_coherency(domain);
>   domain->iommu_snooping =
> domain_update_iommu_snooping(NULL);
>   domain->iommu_superpage =
> domain_update_iommu_superpage(domain, NULL);
> +
> + /*
> +  * If RHSA is missing, we should default to the device numa domain
> +  * as fall back.
> +  */
> + if (domain->nid == NUMA_NO_NODE)
> + domain->nid = domain_update_device_node(domain);
>  }
> 
>  struct context_entry *iommu_context_addr(struct intel_iommu *iommu, u8
> bus,
> @@ -5086,8 +5115,6 @@ static struct iommu_domain
> *intel_iommu_domain_alloc(unsigned type)
>   if (type == IOMMU_DOMAIN_DMA)
>   intel_init_iova_domain(dmar_domain);
> 
> - domain_update_iommu_cap(dmar_domain);
> -

Is this intended or a mistake? If the former, it looks like a separate fix...

>   domain = &dmar_domain->domain;
>   domain->geometry.aperture_start = 0;
>   domain->geometry.aperture_end   =
> --
> 2.17.1



RE: [PATCH v3 1/1] iommu/vt-d: Serialize IOMMU GCMD register modifications

2020-08-27 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Friday, August 28, 2020 8:06 AM
> 
> The VT-d spec requires (10.4.4 Global Command Register, GCMD_REG
> General
> Description) that:
> 
> If multiple control fields in this register need to be modified, software
> must serialize the modifications through multiple writes to this register.
> 
> However, in irq_remapping.c, modifications of IRE and CFI are done in one
> write. We need to do two separate writes with STS checking after each. It
> also checks the status register before writing the command register to
> avoid an unnecessary register write.
> 
> Fixes: af8d102f999a4 ("x86/intel/irq_remapping: Clean up x2apic opt-out
> security warning mess")
> Cc: Andy Lutomirski 
> Cc: Jacob Pan 
> Cc: Kevin Tian 
> Cc: Ashok Raj 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/intel/irq_remapping.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> Change log:
> v1->v2:
>   - v1 posted here
> https://lore.kernel.org/linux-iommu/20200826025825.2322-1-
> baolu...@linux.intel.com/
>   - Add status check before disabling CFI (Kevin)
> v2->v3:
>   - v2 posted here
> https://lore.kernel.org/linux-iommu/20200827042513.30292-1-
> baolu...@linux.intel.com/
>   - Remove unnecessary register read (Kevin)
> 
> diff --git a/drivers/iommu/intel/irq_remapping.c
> b/drivers/iommu/intel/irq_remapping.c
> index 9564d23d094f..a91dd997d268 100644
> --- a/drivers/iommu/intel/irq_remapping.c
> +++ b/drivers/iommu/intel/irq_remapping.c
> @@ -507,12 +507,18 @@ static void iommu_enable_irq_remapping(struct
> intel_iommu *iommu)
> 
>   /* Enable interrupt-remapping */
>   iommu->gcmd |= DMA_GCMD_IRE;
> - iommu->gcmd &= ~DMA_GCMD_CFI;  /* Block compatibility-format
> MSIs */
>   writel(iommu->gcmd, iommu->reg + DMAR_GCMD_REG);
> -
>   IOMMU_WAIT_OP(iommu, DMAR_GSTS_REG,
> readl, (sts & DMA_GSTS_IRES), sts);
> 
> + /* Block compatibility-format MSIs */
> + if (sts & DMA_GSTS_CFIS) {
> + iommu->gcmd &= ~DMA_GCMD_CFI;
> + writel(iommu->gcmd, iommu->reg + DMAR_GCMD_REG);
> + IOMMU_WAIT_OP(iommu, DMAR_GSTS_REG,
> +   readl, !(sts & DMA_GSTS_CFIS), sts);
> + }
> +
>   /*
>* With CFI clear in the Global Command register, we should be
>* protected from dangerous (i.e. compatibility) interrupts
> --
> 2.17.1

Reviewed-by: Kevin Tian 


RE: [PATCH v2 1/1] iommu/vt-d: Serialize IOMMU GCMD register modifications

2020-08-26 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Thursday, August 27, 2020 12:25 PM
> 
> The VT-d spec requires (10.4.4 Global Command Register, GCMD_REG
> General
> Description) that:
> 
> If multiple control fields in this register need to be modified, software
> must serialize the modifications through multiple writes to this register.
> 
> However, in irq_remapping.c, modifications of IRE and CFI are done in one
> write. We need to do two separate writes with STS checking after each.
> 
> Fixes: af8d102f999a4 ("x86/intel/irq_remapping: Clean up x2apic opt-out
> security warning mess")
> Cc: Andy Lutomirski 
> Cc: Jacob Pan 
> Cc: Kevin Tian 
> Cc: Ashok Raj 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/intel/irq_remapping.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> Change log:
> v1->v2:
>   - v1 posted here
> https://lore.kernel.org/linux-iommu/20200826025825.2322-1-
> baolu...@linux.intel.com/;
>   - Add status check before disabling CFI. (Kevin)
> 
> diff --git a/drivers/iommu/intel/irq_remapping.c
> b/drivers/iommu/intel/irq_remapping.c
> index 9564d23d094f..7552bb7e92c8 100644
> --- a/drivers/iommu/intel/irq_remapping.c
> +++ b/drivers/iommu/intel/irq_remapping.c
> @@ -507,12 +507,19 @@ static void iommu_enable_irq_remapping(struct
> intel_iommu *iommu)
> 
>   /* Enable interrupt-remapping */
>   iommu->gcmd |= DMA_GCMD_IRE;
> - iommu->gcmd &= ~DMA_GCMD_CFI;  /* Block compatibility-format
> MSIs */
>   writel(iommu->gcmd, iommu->reg + DMAR_GCMD_REG);
> -
>   IOMMU_WAIT_OP(iommu, DMAR_GSTS_REG,
> readl, (sts & DMA_GSTS_IRES), sts);
> 
> + /* Block compatibility-format MSIs */
> + sts = readl(iommu->reg + DMAR_GSTS_REG);

No need for this readl as the status is already there in IOMMU_WAIT_OP.
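(For reference, IOMMU_WAIT_OP() is roughly the following polling macro -
simplified from memory, the real one also panics on timeout - so sts
already holds the last status read:)

#define IOMMU_WAIT_OP(iommu, offset, op, cond, sts)		\
do {								\
	while (1) {						\
		sts = op(iommu->reg + offset);			\
		if (cond)					\
			break;					\
		cpu_relax();					\
	}							\
} while (0)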

> + if (sts & DMA_GSTS_CFIS) {
> + iommu->gcmd &= ~DMA_GCMD_CFI;
> + writel(iommu->gcmd, iommu->reg + DMAR_GCMD_REG);
> + IOMMU_WAIT_OP(iommu, DMAR_GSTS_REG,
> +   readl, !(sts & DMA_GSTS_CFIS), sts);
> + }
> +
>   /*
>* With CFI clear in the Global Command register, we should be
>* protected from dangerous (i.e. compatibility) interrupts
> --
> 2.17.1



RE: [PATCH 1/1] iommu/vt-d: Serialize IOMMU GCMD register modifications

2020-08-25 Thread Tian, Kevin
> From: Lu Baolu
> Sent: Wednesday, August 26, 2020 10:58 AM
> 
> The VT-d spec requires (10.4.4 Global Command Register, GCMD_REG
> General
> Description) that:
> 
> If multiple control fields in this register need to be modified, software
> must serialize the modifications through multiple writes to this register.
> 
> However, in irq_remapping.c, modifications of IRE and CFI are done in one
> write. We need to do two separate writes with STS checking after each.
> 
> Fixes: af8d102f999a4 ("x86/intel/irq_remapping: Clean up x2apic opt-out
> security warning mess")
> Cc: Andy Lutomirski 
> Cc: Jacob Pan 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/intel/irq_remapping.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/intel/irq_remapping.c
> b/drivers/iommu/intel/irq_remapping.c
> index 9564d23d094f..19d7e18876fe 100644
> --- a/drivers/iommu/intel/irq_remapping.c
> +++ b/drivers/iommu/intel/irq_remapping.c
> @@ -507,12 +507,16 @@ static void iommu_enable_irq_remapping(struct
> intel_iommu *iommu)
> 
>   /* Enable interrupt-remapping */
>   iommu->gcmd |= DMA_GCMD_IRE;
> - iommu->gcmd &= ~DMA_GCMD_CFI;  /* Block compatibility-format
> MSIs */
>   writel(iommu->gcmd, iommu->reg + DMAR_GCMD_REG);
> -
>   IOMMU_WAIT_OP(iommu, DMAR_GSTS_REG,
> readl, (sts & DMA_GSTS_IRES), sts);
> 
> + /* Block compatibility-format MSIs */
> + iommu->gcmd &= ~DMA_GCMD_CFI;
> + writel(iommu->gcmd, iommu->reg + DMAR_GCMD_REG);
> + IOMMU_WAIT_OP(iommu, DMAR_GSTS_REG,
> +   readl, !(sts & DMA_GSTS_CFIS), sts);
> +

Better do it only when CFI is actually enabled (by checking sts).
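i.e. something like this (sketch of the suggestion):

	/* only clear CFI if compatibility-format interrupts are
	 * currently enabled */
	if (sts & DMA_GSTS_CFIS) {
		iommu->gcmd &= ~DMA_GCMD_CFI;
		writel(iommu->gcmd, iommu->reg + DMAR_GCMD_REG);
		IOMMU_WAIT_OP(iommu, DMAR_GSTS_REG,
			      readl, !(sts & DMA_GSTS_CFIS), sts);
	}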

>   /*
>* With CFI clear in the Global Command register, we should be
>* protected from dangerous (i.e. compatibility) interrupts
> --
> 2.17.1
> 


RE: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-19 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, August 18, 2020 7:50 PM
> 
> On Tue, Aug 18, 2020 at 01:09:01AM +, Tian, Kevin wrote:
> > The difference in my reply is not just about the implementation gap
> > of growing a userspace DMA framework to a passthrough framework.
> > My real point is about the different goals that each wants to achieve.
> > Userspace DMA is purely about allowing userspace to directly access
> > the portal and do DMA, but the wq configuration is always under kernel
> > driver's control. On the other hand, passthrough means delegating full
> > control of the wq to the guest and then associated mock-up (live migration,
> > vSVA, posted interrupt, etc.) for that to work. I really didn't see the
> > value of mixing them together when there is already a good candidate
> > to handle passthrough...
> 
> In Linux a 'VM' and virtualization has always been a normal system
> process that uses a few extra kernel features. This has been more or
> less the cornerstone of that design since the start.
> 
> In that view it doesn't make any sense to say that uAPI from idxd that
> is useful for virtualization somehow doesn't belong as part of the
> standard uAPI.

The point is that we already have a more standard uAPI (VFIO) which
is unified and vendor-agnostic to userspace. Creating an idxd-specific
uAPI to absorb similar requirements that VFIO already covers is not 
compelling, and instead causes more trouble for Qemu or other VMMs 
as they would need to deal with every such driver uAPI even when Qemu 
itself has no interest in the device detail (since the real user is inside 
the guest). 

> 
> Especially when it is such a small detail like what APIs are used to
> configure the wq.
> 
> For instance, what about suspend/resume of containers using idxd?
> Wouldn't you want to have the same basic approach of controlling the
> wq from userspace that virtualization uses?
> 

I'm not familiar with how container suspend/resume is done today.
But my gut feeling is that it's different from virtualization. For 
virtualization, the whole wq is assigned to the guest, thus the uAPI
must provide a way to save the wq state (including its configuration) 
at suspend, and then restore the state to what the guest expects at
resume. However in the container case, which does userspace DMA, the
wq is managed by the host kernel and could be shared between multiple
containers. So the wq state is irrelevant to the container. The only
relevant state is the in-flight workloads, which need a draining interface.
In this view I think the two have a major difference.

Thanks
Kevin


RE: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-17 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, August 18, 2020 8:44 AM
> 
> On Mon, Aug 17, 2020 at 02:12:44AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Friday, August 14, 2020 9:35 PM
> > >
> > > On Mon, Aug 10, 2020 at 07:32:24AM +, Tian, Kevin wrote:
> > >
> > > > > I would prefer to see that the existing userspace interface have the
> > > > > extra needed bits for virtualization (eg by having appropriate
> > > > > internal kernel APIs to make this easy) and all the emulation to build
> > > > > the synthetic PCI device be done in userspace.
> > > >
> > > > In the end what decides the direction is the amount of changes that
> > > > we have to put in kernel, not whether we call it 'emulation'.
> > >
> > > No, this is not right. The decision should be based on what will end
> > > up more maintable in the long run.
> > >
> > > Yes it would be more code to dis-aggregate some of the things
> > > currently only bundled as uAPI inside VFIO (eg your vSVA argument
> > > above) but once it is disaggregated the maintability of the whole
> > > solution will be better overall, and more drivers will be able to use
> > > this functionality.
> > >
> >
> > Disaggregation is an orthogonal topic to the main divergence in
> > this thread, which is passthrough vs. userspace DMA. I gave detail
> > explanation about the difference between the two in last reply.
> 
> You said the first issue was related to SVA, which is understandable
> because we have no SVA uAPIs outside VFIO.
> 
> Imagine if we had some /dev/sva that provided this API and user space
> DMA drivers could simply accept an FD and work properly. It is not
> such a big leap anymore, nor is it customized code in idxd.
> 
> The other pass through issue was IRQ, which last time I looked, was
> fairly trivial to connect via interrupt remapping in the kernel, or
> could be made extremely trivial.
> 
> The last, seemed to be a concern that the current uapi for idxd was
> lacking seems idxd specific features, which seems like an quite weak
> reason to use VFIO.
> 

The difference in my reply is not just about the implementation gap
of growing a userspace DMA framework to a passthrough framework.
My real point is about the different goals that each wants to achieve.
Userspace DMA is purely about allowing userspace to directly access
the portal and do DMA, but the wq configuration is always under kernel 
driver's control. On the other hand, passthrough means delegating full 
control of the wq to the guest and then associated mock-up (live migration,
vSVA, posted interrupt, etc.) for that to work. I really didn't see the
value of mixing them together when there is already a good candidate
to handle passthrough...

Thanks
Kevin


RE: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-16 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Friday, August 14, 2020 9:24 PM
> 
> The same basic argument goes for all the points - the issue is really
> the only uAPI we have for this stuff is under VFIO, and the better
> solution is to disagregate that uAPI, not to try and make everything
> pretend to be a VFIO device.
> 

Nobody is proposing to make everything VFIO. There must be some
criteria, which can be brainstormed at LPC. But the opposite also holds - 
the fact that we should not make everything VFIO doesn't imply a
prohibition on anyone using it. There is a clear difference between 
passthrough and userspace DMA requirements in the idxd context, and we
see good reasons to use VFIO for our passthrough requirements.


Thanks
Kevin


RE: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-16 Thread Tian, Kevin
> From: Jason Gunthorpe
> Sent: Friday, August 14, 2020 9:35 PM
> 
> On Mon, Aug 10, 2020 at 07:32:24AM +, Tian, Kevin wrote:
> 
> > > I would prefer to see that the existing userspace interface have the
> > > extra needed bits for virtualization (eg by having appropriate
> > > internal kernel APIs to make this easy) and all the emulation to build
> > > the synthetic PCI device be done in userspace.
> >
> > In the end what decides the direction is the amount of changes that
> > we have to put in kernel, not whether we call it 'emulation'.
> 
> No, this is not right. The decision should be based on what will end
> up more maintable in the long run.
> 
> Yes it would be more code to dis-aggregate some of the things
> currently only bundled as uAPI inside VFIO (eg your vSVA argument
> above) but once it is disaggregated the maintability of the whole
> solution will be better overall, and more drivers will be able to use
> this functionality.
> 

Disaggregation is an orthogonal topic to the main divergence in 
this thread, which is passthrough vs. userspace DMA. I gave a detailed
explanation of the difference between the two in my last reply.
The possibility of disaggregating something between passthrough
frameworks (e.g. VFIO and vDPA) is not a reason for growing 
every userspace DMA framework into a passthrough framework.
Doing that instead hurts maintainability in general...

Thanks
Kevin


RE: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-12 Thread Tian, Kevin
> From: Jason Wang 
> Sent: Thursday, August 13, 2020 12:34 PM
> 
> 
> On 2020/8/12 12:05 PM, Tian, Kevin wrote:
> >> The problem is that if we tie all controls via VFIO uAPI, the other
> >> subsystem like vDPA is likely to duplicate them. I wonder if there is a
> >> way to decouple the vSVA out of VFIO uAPI?
> > vSVA is a per-device (either pdev or mdev) feature thus naturally should
> > be managed by its device driver (VFIO or vDPA). From this angle some
> > duplication is inevitable given VFIO and vDPA are orthogonal passthrough
> > frameworks. Within the kernel the majority of vSVA handling is done by
> > IOMMU and IOASID modules thus most logic are shared.
> 
> 
> So why not introduce vSVA uAPI at IOMMU or IOASID layer?

One may ask a similar question: why doesn't the IOMMU layer expose
map/unmap as uAPI...

> 
> 
> >
> >>>If an userspace DMA interface can be easily
> >>> adapted to be a passthrough one, it might be the choice.
> >> It's not that easy even for VFIO which requires a lot of new uAPIs and
> >> infrastructures(e.g mdev) to be invented.
> >>
> >>
> >>> But for idxd,
> >>> we see mdev a much better fit here, given the big difference between
> >>> what userspace DMA requires and what guest driver requires in this hw.
> >> A weak point for mdev is that it can't serve kernel subsystem other than
> >> VFIO. In this case, you need some other infrastructures (like [1]) to do
> >> this.
> > mdev is not exclusive from kernel usages. It's perfectly fine for a driver
> > to reserve some work queues for host usages, while wrapping others
> > into mdevs.
> 
> 
> I meant you may want slices to be an independent device from the kernel
> point of view:
> 
> E.g for ethernet devices, you may want 10K mdevs to be passed to guest.
> 
> Similarly, you may want 10K net devices which is connected to the kernel
> networking subsystems.
> 
> In this case it's not simply reserving queues but you need some other
> type of device abstraction. There could be some kind of duplication
> between this and mdev.
> 

Yes, some abstraction is required, but isn't that what the driver should
care about instead of the mdev framework itself? If the driver reports
the same set of resources to both mdev and networking, it needs to
make sure that when a resource is claimed through one interface it
is marked in-use in the other. E.g. each mdev type includes an
available_instances attribute. The driver could report 10K available
instances initially and then update it to 5K when another 5K are used
for net devices later. See the sketch below.
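Roughly like this on the driver side (a hypothetical sketch, not real
idxd code; my_parent_drv and the wq counters are made up, and the show()
signature follows the mdev type-attribute convention):

static ssize_t available_instances_show(struct kobject *kobj,
					struct device *dev, char *buf)
{
	struct my_parent_drv *drv = dev_get_drvdata(dev);

	/* wqs already claimed by the kernel networking path are no
	 * longer available as mdev instances */
	return sprintf(buf, "%d\n",
		       drv->total_wqs - drv->wqs_used_by_netdev);
}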

Mdev definitely has its usage limitations. Some may be improved 
in the future, some may not. But those are distracting from the
original purpose of this thread (mdev vs. userspace DMA) and are better
discussed in other places, e.g. LPC... 

Thanks
Kevin


RE: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-11 Thread Tian, Kevin
> From: Jason Wang 
> Sent: Wednesday, August 12, 2020 11:28 AM
> 
> 
> On 2020/8/10 3:32 PM, Tian, Kevin wrote:
> >> From: Jason Gunthorpe 
> >> Sent: Friday, August 7, 2020 8:20 PM
> >>
> >> On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
> >>
> >>> If you see this as an abuse of the framework, then let's identify those
> >>> specific issues and come up with a better approach.  As we've discussed
> >>> before, things like basic PCI config space emulation are acceptable
> >>> overhead and low risk (imo) and some degree of register emulation is
> >>> well within the territory of an mdev driver.
> >> What troubles me is that idxd already has a direct userspace interface
> >> to its HW, and does userspace DMA. The purpose of this mdev is to
> >> provide a second direct userspace interface that is a little different
> >> and trivially plugs into the virtualization stack.
> > No. Userspace DMA and subdevice passthrough (what mdev provides)
> > are two distinct usages IMO (at least in idxd context). and this might
> > be the main divergence between us, thus let me put more words here.
> > If we could reach consensus in this matter, which direction to go
> > would be clearer.
> >
> > First, a passthrough interface requires some unique requirements
> > which are not commonly observed in an userspace DMA interface, e.g.:
> >
> > - Tracking DMA dirty pages for live migration;
> > - A set of interfaces for using SVA inside guest;
> > * PASID allocation/free (on some platforms);
> > * bind/unbind guest mm/page table (nested translation);
> > * invalidate IOMMU cache/iotlb for guest page table changes;
> > * report page request from device to guest;
> > * forward page response from guest to device;
> > - Configuring irqbypass for posted interrupt;
> > - ...
> >
> > Second, a passthrough interface requires delegating raw controllability
> > of subdevice to guest driver, while the same delegation might not be
> > required for implementing an userspace DMA interface (especially for
> > modern devices which support SVA). For example, idxd allows following
> > setting per wq (guest driver may configure them in any combination):
> > - put in dedicated or shared mode;
> > - enable/disable SVA;
> > - Associate guest-provided PASID to MSI/IMS entry;
> > - set threshold;
> > - allow/deny privileged access;
> > - allocate/free interrupt handle (enlightened for guest);
> > - collect error status;
> > - ...
> >
> > We plan to support idxd userspace DMA with SVA. The driver just needs
> > to prepare a wq with a predefined configuration (e.g. shared, SVA,
> > etc.), bind the process mm to IOMMU (non-nested) and then map
> > the portal to userspace. The goal that userspace can do DMA to
> > associated wq doesn't change the fact that the wq is still *owned*
> > and *controlled* by kernel driver. However as far as passthrough
> > is concerned, the wq is considered 'owned' by the guest driver thus
> > we need an interface which can support low-level *controllability*
> > from guest driver. It is sort of a mess in uAPI when mixing the
> > two together.
> 
> 
> So for userspace drivers like DPDK, it can use both of the two uAPIs?

yes.

> 
> 
> >
> > Based on above two reasons, we see distinct requirements between
> > userspace DMA and passthrough interfaces, at least in idxd context
> > (though other devices may have less distinction in-between). Therefore,
> > we didn't see the value/necessity of reinventing the wheel that mdev
> > already handles well to evolve a simple application-oriented userspace
> > DMA interface into a complex guest-driver-oriented passthrough interface.
> > The complexity of doing so would incur far more kernel-side changes
> > than the portion of emulation code that you've been concerned about...
> >
> >> I don't think VFIO should be the only entry point to
> >> virtualization. If we say the universe of devices doing user space DMA
> >> must also implement a VFIO mdev to plug into virtualization then it
> >> will be alot of mdevs.
> > Certainly VFIO will not be the only entry point. and This has to be a
> > case-by-case decision.
> 
> 
> The problem is that if we tie all controls via VFIO uAPI, the other
> subsystem like vDPA is likely to duplicate them. I wonder if there is a
> way to decouple the vSVA out of VFIO uAPI?

vSVA is a per-device (either pdev or mdev) feature thus naturally should 
be managed by its device driver (VFIO or vDPA). From this angle some
duplication is inevitable given VFIO and vDPA are orthogonal passthrough
frameworks. Within the kernel the majority of vSVA handling is done by
the IOMMU and IOASID modules, thus most of the logic is shared.

RE: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-11 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Wednesday, August 12, 2020 10:36 AM
> On Wed, 12 Aug 2020 01:58:00 +0000
> "Tian, Kevin"  wrote:
> 
> > >
> > > I'll also remind folks that LPC is coming up in just a couple short
> > > weeks and this might be something we should discuss (virtually)
> > > in-person.  uconf CfPs are currently open.  Thanks,
> > >
> >
> > Yes, LPC is a good place to reach consensus. btw I saw there is
> > already one VFIO topic called "device assignment/sub-assignment".
> > Do you think whether this can be covered under that topic, or
> > makes more sense to be a new one?
> 
> All the things listed in the CFP are only potential topics to get ideas
> flowing, there is currently no proposal to talk about sub-assignment.
> I'd suggest submitting separate topics for each and if we run into time
> constraints we can ask that they might be combined together.  Thanks,
> 

Done.
--
title: Criteria of using VFIO mdev (vs. userspace DMA)

Content:
VFIO mdev provides a framework for subdevice assignment and reuses 
the existing VFIO uAPI to handle common passthrough-related requirements. 
However, a subdevice (e.g. an ADI defined in Intel Scalable IOV) might not be 
a PCI endpoint (e.g. just a work queue), thus requiring some degree of 
emulation/mediation in the kernel to fit into the VFIO device API. Then there 
is a concern about putting emulation in the kernel, and how to judge abuse 
of the mdev framework by simply using it as an easy path to hook into the 
virtualization stack. An associated open is about differentiating mdev 
from userspace DMA frameworks (such as uacce), and whether building 
passthrough features on top of a userspace DMA framework is a better 
choice than using mdev. 

Thanks
Kevin


RE: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-11 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Wednesday, August 12, 2020 1:01 AM
> 
> On Mon, 10 Aug 2020 07:32:24 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Jason Gunthorpe 
> > > Sent: Friday, August 7, 2020 8:20 PM
> > >
> > > On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
> > >
> > > > If you see this as an abuse of the framework, then let's identify those
> > > > specific issues and come up with a better approach.  As we've discussed
> > > > before, things like basic PCI config space emulation are acceptable
> > > > overhead and low risk (imo) and some degree of register emulation is
> > > > well within the territory of an mdev driver.
> > >
> > > What troubles me is that idxd already has a direct userspace interface
> > > to its HW, and does userspace DMA. The purpose of this mdev is to
> > > provide a second direct userspace interface that is a little different
> > > and trivially plugs into the virtualization stack.
> >
> > No. Userspace DMA and subdevice passthrough (what mdev provides)
> > are two distinct usages IMO (at least in idxd context). and this might
> > be the main divergence between us, thus let me put more words here.
> > If we could reach consensus in this matter, which direction to go
> > would be clearer.
> >
> > First, a passthrough interface requires some unique requirements
> > which are not commonly observed in an userspace DMA interface, e.g.:
> >
> > - Tracking DMA dirty pages for live migration;
> > - A set of interfaces for using SVA inside guest;
> > * PASID allocation/free (on some platforms);
> > * bind/unbind guest mm/page table (nested translation);
> > * invalidate IOMMU cache/iotlb for guest page table changes;
> > * report page request from device to guest;
> > * forward page response from guest to device;
> > - Configuring irqbypass for posted interrupt;
> > - ...
> >
> > Second, a passthrough interface requires delegating raw controllability
> > of subdevice to guest driver, while the same delegation might not be
> > required for implementing an userspace DMA interface (especially for
> > modern devices which support SVA). For example, idxd allows following
> > setting per wq (guest driver may configure them in any combination):
> > - put in dedicated or shared mode;
> > - enable/disable SVA;
> > - Associate guest-provided PASID to MSI/IMS entry;
> > - set threshold;
> > - allow/deny privileged access;
> > - allocate/free interrupt handle (enlightened for guest);
> > - collect error status;
> > - ...
> >
> > We plan to support idxd userspace DMA with SVA. The driver just needs
> > to prepare a wq with a predefined configuration (e.g. shared, SVA,
> > etc.), bind the process mm to IOMMU (non-nested) and then map
> > the portal to userspace. The goal that userspace can do DMA to
> > associated wq doesn't change the fact that the wq is still *owned*
> > and *controlled* by kernel driver. However as far as passthrough
> > is concerned, the wq is considered 'owned' by the guest driver thus
> > we need an interface which can support low-level *controllability*
> > from guest driver. It is sort of a mess in uAPI when mixing the
> > two together.
> >
> > Based on above two reasons, we see distinct requirements between
> > userspace DMA and passthrough interfaces, at least in idxd context
> > (though other devices may have less distinction in-between). Therefore,
> > we didn't see the value/necessity of reinventing the wheel that mdev
> > already handles well to evolve a simple application-oriented userspace
> > DMA interface into a complex guest-driver-oriented passthrough interface.
> > The complexity of doing so would incur far more kernel-side changes
> > than the portion of emulation code that you've been concerned about...
> >
> > >
> > > I don't think VFIO should be the only entry point to
> > > virtualization. If we say the universe of devices doing user space DMA
> > > must also implement a VFIO mdev to plug into virtualization then it
> > > will be alot of mdevs.
> >
> > Certainly VFIO will not be the only entry point. and This has to be a
> > case-by-case decision.  If an userspace DMA interface can be easily
> > adapted to be a passthrough one, it might be the choice. But for idxd,
> > we see mdev a much better fit here, given the big difference between
> > what userspace DMA requires and what guest driver requires in this hw.

RE: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-10 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Friday, August 7, 2020 8:20 PM
> 
> On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
> 
> > If you see this as an abuse of the framework, then let's identify those
> > specific issues and come up with a better approach.  As we've discussed
> > before, things like basic PCI config space emulation are acceptable
> > overhead and low risk (imo) and some degree of register emulation is
> > well within the territory of an mdev driver.
> 
> What troubles me is that idxd already has a direct userspace interface
> to its HW, and does userspace DMA. The purpose of this mdev is to
> provide a second direct userspace interface that is a little different
> and trivially plugs into the virtualization stack.

No. Userspace DMA and subdevice passthrough (what mdev provides)
are two distinct usages IMO (at least in the idxd context), and this might 
be the main divergence between us, thus let me put more words here. 
If we could reach consensus in this matter, which direction to go 
would be clearer.

First, a passthrough interface requires some unique requirements 
which are not commonly observed in an userspace DMA interface, e.g.:

- Tracking DMA dirty pages for live migration;
- A set of interfaces for using SVA inside guest;
* PASID allocation/free (on some platforms);
* bind/unbind guest mm/page table (nested translation);
* invalidate IOMMU cache/iotlb for guest page table changes;
* report page request from device to guest;
* forward page response from guest to device;
- Configuring irqbypass for posted interrupt;
- ...

Second, a passthrough interface requires delegating raw controllability
of subdevice to guest driver, while the same delegation might not be
required for implementing an userspace DMA interface (especially for
modern devices which support SVA). For example, idxd allows following
setting per wq (guest driver may configure them in any combination):
- put in dedicated or shared mode;
- enable/disable SVA;
- Associate guest-provided PASID to MSI/IMS entry;
- set threshold;
- allow/deny privileged access;
- allocate/free interrupt handle (enlightened for guest);
- collect error status;
- ...

We plan to support idxd userspace DMA with SVA. The driver just needs 
to prepare a wq with a predefined configuration (e.g. shared, SVA, 
etc.), bind the process mm to the IOMMU (non-nested) and then map 
the portal to userspace. That userspace can do DMA to the associated 
wq doesn't change the fact that the wq is still *owned* and *controlled* 
by the kernel driver. However as far as passthrough is concerned, the wq 
is considered 'owned' by the guest driver, thus we need an interface which 
can support low-level *controllability* from the guest driver. It is sort 
of a mess in uAPI when mixing the two together.
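For reference, the userspace DMA open path is then roughly just the
below (a sketch under the above assumptions; idxd_wq_cdev_open(),
inode_wq() and wq_dev() are illustrative names, not the actual driver
symbols):

static int idxd_wq_cdev_open(struct inode *inode, struct file *filp)
{
	struct idxd_wq *wq = inode_wq(inode);
	struct iommu_sva *sva;

	/* bind the opening process's mm to the IOMMU, non-nested */
	sva = iommu_sva_bind_device(wq_dev(wq), current->mm, NULL);
	if (IS_ERR(sva))
		return PTR_ERR(sva);

	filp->private_data = sva;
	return 0;	/* ->mmap() then maps the wq portal to userspace */
}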

Based on the above two reasons, we see distinct requirements between 
userspace DMA and passthrough interfaces, at least in the idxd context 
(though other devices may have less distinction in-between). Therefore,
we didn't see the value/necessity of reinventing the wheel that mdev 
already handles well to evolve a simple application-oriented userspace 
DMA interface into a complex guest-driver-oriented passthrough interface. 
The complexity of doing so would incur far more kernel-side changes 
than the portion of emulation code that you've been concerned about...
 
> 
> I don't think VFIO should be the only entry point to
> virtualization. If we say the universe of devices doing user space DMA
> must also implement a VFIO mdev to plug into virtualization then it
> will be alot of mdevs.

Certainly VFIO will not be the only entry point, and this has to be a 
case-by-case decision. If a userspace DMA interface can be easily 
adapted to be a passthrough one, it might be the choice. But for idxd, 
we see mdev as a much better fit here, given the big difference between 
what userspace DMA requires and what the guest driver requires in this hw.

> 
> I would prefer to see that the existing userspace interface have the
> extra needed bits for virtualization (eg by having appropriate
> internal kernel APIs to make this easy) and all the emulation to build
> the synthetic PCI device be done in userspace.

In the end what decides the direction is the amount of changes that
we have to put in the kernel, not whether we call it 'emulation'. For idxd,
adding the special passthrough requirements (guest SVA, dirty tracking,
etc.) and raw controllability to the simple userspace DMA interface 
would for sure make the kernel more complex than reusing the mdev
framework (plus some degree of emulation mockup behind it). Not to
mention the merit of uAPI compatibility with mdev...

Thanks
Kevin


RE: [PATCH v3 3/4] iommu: Add iommu_aux_get_domain_for_dev()

2020-07-30 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Friday, July 31, 2020 8:26 AM
> 
> > From: Alex Williamson 
> > Sent: Friday, July 31, 2020 4:17 AM
> >
> > On Wed, 29 Jul 2020 23:49:20 +
> > "Tian, Kevin"  wrote:
> >
> > > > From: Alex Williamson 
> > > > Sent: Thursday, July 30, 2020 4:25 AM
> > > >
> > > > On Tue, 14 Jul 2020 13:57:02 +0800
> > > > Lu Baolu  wrote:
> > > >
> > > > > The device driver needs an API to get its aux-domain. A typical usage
> > > > > scenario is:
> > > > >
> > > > > unsigned long pasid;
> > > > > struct iommu_domain *domain;
> > > > > struct device *dev = mdev_dev(mdev);
> > > > > struct device *iommu_device =
> vfio_mdev_get_iommu_device(dev);
> > > > >
> > > > > domain = iommu_aux_get_domain_for_dev(dev);
> > > > > if (!domain)
> > > > > return -ENODEV;
> > > > >
> > > > > pasid = iommu_aux_get_pasid(domain, iommu_device);
> > > > > if (pasid <= 0)
> > > > > return -EINVAL;
> > > > >
> > > > >  /* Program the device context */
> > > > >  
> > > > >
> > > > > This adds an API for such use case.
> > > > >
> > > > > Suggested-by: Alex Williamson 
> > > > > Signed-off-by: Lu Baolu 
> > > > > ---
> > > > >  drivers/iommu/iommu.c | 18 ++
> > > > >  include/linux/iommu.h |  7 +++
> > > > >  2 files changed, 25 insertions(+)
> > > > >
> > > > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > > > > index cad5a19ebf22..434bf42b6b9b 100644
> > > > > --- a/drivers/iommu/iommu.c
> > > > > +++ b/drivers/iommu/iommu.c
> > > > > @@ -2817,6 +2817,24 @@ void iommu_aux_detach_group(struct
> > > > iommu_domain *domain,
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(iommu_aux_detach_group);
> > > > >
> > > > > +struct iommu_domain *iommu_aux_get_domain_for_dev(struct
> > device
> > > > *dev)
> > > > > +{
> > > > > + struct iommu_domain *domain = NULL;
> > > > > + struct iommu_group *group;
> > > > > +
> > > > > + group = iommu_group_get(dev);
> > > > > + if (!group)
> > > > > + return NULL;
> > > > > +
> > > > > + if (group->aux_domain_attached)
> > > > > + domain = group->domain;
> > > >
> > > > Why wouldn't the aux domain flag be on the domain itself rather than
> > > > the group?  Then if we wanted sanity checking in patch 1/ we'd only
> > > > need to test the flag on the object we're provided.
> > > >
> > > > If we had such a flag, we could create an iommu_domain_is_aux()
> > > > function and then simply use iommu_get_domain_for_dev() and test
> that
> > > > it's an aux domain in the example use case.  It seems like that would
> > >
> > > IOMMU layer manages domains per parent device. Here given a
> >
> > Is this the IOMMU layer or the VT-d driver?  I don't see any notion of
> > managing domains relative to a parent in the IOMMU layer.  Please point
> > to something more specific if I'm wrong here.
> 
> it's maintained in VT-d driver (include/linux/intel-iommu.h)
> 
> struct device_domain_info {
> struct list_head link;  /* link to domain siblings */
> struct list_head global; /* link to global list */
> struct list_head table; /* link to pasid table */
> struct list_head auxiliary_domains; /* auxiliary domains
>  * attached to this device
>  */
>   ...
> 
> >
> > > dev (of mdev), we need a way to find its associated domain under its
> > > parent device. And we cannot simply use iommu_get_domain_for_dev
> > > on the parent device of the mdev, as it will give us the primary domain
> > > of parent device.
> >
> > Not the parent device of the mdev, but the mdev_dev(mdev) device.
> > Isn't that what this series is enabling, being able to return the
> > domain from the group that contains the mdev_dev?  We shouldn't need
> to
> > leave breadcrumbs on the group to know about the domain, the domain
> > itself should be the source of knowledge, or provide a mechanism/ops to
> > learn that knowledge.  Thanks,
> >
> > Alex
> 
> It's the tradeoff between leaving breadcrumb in domain or in group.
> Today the domain has no knowledge of mdev. It just includes a list
> of physical devices which are attached to the domain (either due to
> the device is assigned in a whole or as the parent device of an assigned
> mdev). Then we have two choices. One is to save the mdev_dev info
> in device_domain_info and maintain a mapping between mdev_dev
> and related aux domain. The other is to record the domain info directly
> in group. Earlier we choose the latter one as it looks simpler. If you
> prefer to the former one, we can think more and have a try.
> 

Please skip this comment. I had a wrong understanding of the problem
and am discussing it with Baolu. He will reply with our conclusion later.

Thanks
Kevin


RE: [PATCH v3 3/4] iommu: Add iommu_aux_get_domain_for_dev()

2020-07-30 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Friday, July 31, 2020 4:17 AM
> 
> On Wed, 29 Jul 2020 23:49:20 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Alex Williamson 
> > > Sent: Thursday, July 30, 2020 4:25 AM
> > >
> > > On Tue, 14 Jul 2020 13:57:02 +0800
> > > Lu Baolu  wrote:
> > >
> > > > The device driver needs an API to get its aux-domain. A typical usage
> > > > scenario is:
> > > >
> > > > unsigned long pasid;
> > > > struct iommu_domain *domain;
> > > > struct device *dev = mdev_dev(mdev);
> > > > struct device *iommu_device = vfio_mdev_get_iommu_device(dev);
> > > >
> > > > domain = iommu_aux_get_domain_for_dev(dev);
> > > > if (!domain)
> > > > return -ENODEV;
> > > >
> > > > pasid = iommu_aux_get_pasid(domain, iommu_device);
> > > > if (pasid <= 0)
> > > > return -EINVAL;
> > > >
> > > >  /* Program the device context */
> > > >  
> > > >
> > > > This adds an API for such use case.
> > > >
> > > > Suggested-by: Alex Williamson 
> > > > Signed-off-by: Lu Baolu 
> > > > ---
> > > >  drivers/iommu/iommu.c | 18 ++
> > > >  include/linux/iommu.h |  7 +++
> > > >  2 files changed, 25 insertions(+)
> > > >
> > > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > > > index cad5a19ebf22..434bf42b6b9b 100644
> > > > --- a/drivers/iommu/iommu.c
> > > > +++ b/drivers/iommu/iommu.c
> > > > @@ -2817,6 +2817,24 @@ void iommu_aux_detach_group(struct
> > > iommu_domain *domain,
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(iommu_aux_detach_group);
> > > >
> > > > +struct iommu_domain *iommu_aux_get_domain_for_dev(struct
> device
> > > *dev)
> > > > +{
> > > > +   struct iommu_domain *domain = NULL;
> > > > +   struct iommu_group *group;
> > > > +
> > > > +   group = iommu_group_get(dev);
> > > > +   if (!group)
> > > > +   return NULL;
> > > > +
> > > > +   if (group->aux_domain_attached)
> > > > +   domain = group->domain;
> > >
> > > Why wouldn't the aux domain flag be on the domain itself rather than
> > > the group?  Then if we wanted sanity checking in patch 1/ we'd only
> > > need to test the flag on the object we're provided.
> > >
> > > If we had such a flag, we could create an iommu_domain_is_aux()
> > > function and then simply use iommu_get_domain_for_dev() and test that
> > > it's an aux domain in the example use case.  It seems like that would
> >
> > The IOMMU layer manages domains per parent device. Here, given a
> 
> Is this the IOMMU layer or the VT-d driver?  I don't see any notion of
> managing domains relative to a parent in the IOMMU layer.  Please point
> to something more specific if I'm wrong here.

It's maintained in the VT-d driver (include/linux/intel-iommu.h):

struct device_domain_info {
struct list_head link;  /* link to domain siblings */
struct list_head global; /* link to global list */
struct list_head table; /* link to pasid table */
struct list_head auxiliary_domains; /* auxiliary domains
 * attached to this device
 */
...

> 
> > dev (of an mdev), we need a way to find its associated domain under its
> > parent device. And we cannot simply use iommu_get_domain_for_dev
> > on the parent device of the mdev, as it will give us the primary domain
> > of the parent device.
> 
> Not the parent device of the mdev, but the mdev_dev(mdev) device.
> Isn't that what this series is enabling, being able to return the
> domain from the group that contains the mdev_dev?  We shouldn't need to
> leave breadcrumbs on the group to know about the domain, the domain
> itself should be the source of knowledge, or provide a mechanism/ops to
> learn that knowledge.  Thanks,
> 
> Alex

It's a tradeoff between leaving breadcrumbs in the domain or in the group.
Today the domain has no knowledge of mdevs. It just includes a list
of physical devices which are attached to the domain (either because
the device is assigned as a whole, or because it is the parent device
of an assigned mdev). Then we have two choices. One is to save the
mdev_dev info in device_domain_info and maintain a mapping between
mdev_dev and the related aux domain. The other is to record the domain
info directly in the group. Earlier we chose the latter as it looked
simpler. If you prefer the former, we can think more and have a try.
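
For illustration, a rough sketch of the former option (the struct and
helper names here are hypothetical, just to show the shape of the idea):

	/* per-subdevice record hung off the parent's device_domain_info */
	struct subdev_domain_info {
		struct list_head link;		/* link to info->auxiliary_domains */
		struct device *dev;		/* the mdev_dev(mdev) device */
		struct iommu_domain *domain;	/* aux domain attached for it */
	};

	/* lookup behind iommu_aux_get_domain_for_dev(dev) */
	static struct iommu_domain *
	subdev_to_aux_domain(struct device_domain_info *info, struct device *dev)
	{
		struct subdev_domain_info *sinfo;

		list_for_each_entry(sinfo, &info->auxiliary_domains, link)
			if (sinfo->dev == dev)
				return sinfo->domain;
		return NULL;
	}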

Thanks
Kevin


RE: [PATCH v3 3/4] iommu: Add iommu_aux_get_domain_for_dev()

2020-07-29 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Thursday, July 30, 2020 4:25 AM
> 
> On Tue, 14 Jul 2020 13:57:02 +0800
> Lu Baolu  wrote:
> 
> > The device driver needs an API to get its aux-domain. A typical usage
> > scenario is:
> >
> > unsigned long pasid;
> > struct iommu_domain *domain;
> > struct device *dev = mdev_dev(mdev);
> > struct device *iommu_device = vfio_mdev_get_iommu_device(dev);
> >
> > domain = iommu_aux_get_domain_for_dev(dev);
> > if (!domain)
> > return -ENODEV;
> >
> > pasid = iommu_aux_get_pasid(domain, iommu_device);
> > if (pasid <= 0)
> > return -EINVAL;
> >
> >  /* Program the device context */
> >  
> >
> > This adds an API for such use case.
> >
> > Suggested-by: Alex Williamson 
> > Signed-off-by: Lu Baolu 
> > ---
> >  drivers/iommu/iommu.c | 18 ++
> >  include/linux/iommu.h |  7 +++
> >  2 files changed, 25 insertions(+)
> >
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index cad5a19ebf22..434bf42b6b9b 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -2817,6 +2817,24 @@ void iommu_aux_detach_group(struct
> iommu_domain *domain,
> >  }
> >  EXPORT_SYMBOL_GPL(iommu_aux_detach_group);
> >
> > +struct iommu_domain *iommu_aux_get_domain_for_dev(struct device
> *dev)
> > +{
> > +   struct iommu_domain *domain = NULL;
> > +   struct iommu_group *group;
> > +
> > +   group = iommu_group_get(dev);
> > +   if (!group)
> > +   return NULL;
> > +
> > +   if (group->aux_domain_attached)
> > +   domain = group->domain;
> 
> Why wouldn't the aux domain flag be on the domain itself rather than
> the group?  Then if we wanted sanity checking in patch 1/ we'd only
> need to test the flag on the object we're provided.
> 
> If we had such a flag, we could create an iommu_domain_is_aux()
> function and then simply use iommu_get_domain_for_dev() and test that
> it's an aux domain in the example use case.  It seems like that would

The IOMMU layer manages domains per parent device. Here, given a
dev (of an mdev), we need a way to find its associated domain under its
parent device. And we cannot simply use iommu_get_domain_for_dev
on the parent device of the mdev, as it will give us the primary domain
of the parent device.
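
For illustration (pdev being the parent of the mdev):

	/* returns the parent's primary (RID) domain - not what we want */
	domain = iommu_get_domain_for_dev(&pdev->dev);

	/* what we need: the aux domain attached for this mdev */
	domain = iommu_aux_get_domain_for_dev(mdev_dev(mdev));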

Thanks
Kevin

> resolve the jump from a domain to an aux-domain just as well as adding
> this separate iommu_aux_get_domain_for_dev() interface.  The is_aux
> test might also be useful in other cases too.  Thanks,
> 
> Alex
> 
> > +
> > +   iommu_group_put(group);
> > +
> > +   return domain;
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_aux_get_domain_for_dev);
> > +
> >  /**
> >   * iommu_sva_bind_device() - Bind a process address space to a device
> >   * @dev: the device
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index 9506551139ab..cda6cef7579e 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -639,6 +639,7 @@ int iommu_aux_attach_group(struct
> iommu_domain *domain,
> >struct iommu_group *group, struct device *dev);
> >  void iommu_aux_detach_group(struct iommu_domain *domain,
> >struct iommu_group *group, struct device *dev);
> > +struct iommu_domain *iommu_aux_get_domain_for_dev(struct device
> *dev);
> >
> >  struct iommu_sva *iommu_sva_bind_device(struct device *dev,
> > struct mm_struct *mm,
> > @@ -1040,6 +1041,12 @@ iommu_aux_detach_group(struct
> iommu_domain *domain,
> >  {
> >  }
> >
> > +static inline struct iommu_domain *
> > +iommu_aux_get_domain_for_dev(struct device *dev)
> > +{
> > +   return NULL;
> > +}
> > +
> >  static inline struct iommu_sva *
> >  iommu_sva_bind_device(struct device *dev, struct mm_struct *mm, void
> *drvdata)
> >  {



RE: [PATCH v3 2/4] iommu: Add iommu_aux_at(de)tach_group()

2020-07-29 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Thursday, July 30, 2020 4:04 AM
> 
> On Thu, 16 Jul 2020 09:07:46 +0800
> Lu Baolu  wrote:
> 
> > Hi Jacob,
> >
> > On 7/16/20 12:01 AM, Jacob Pan wrote:
> > > On Wed, 15 Jul 2020 08:47:36 +0800
> > > Lu Baolu  wrote:
> > >
> > >> Hi Jacob,
> > >>
> > >> On 7/15/20 12:39 AM, Jacob Pan wrote:
> > >>> On Tue, 14 Jul 2020 13:57:01 +0800
> > >>> Lu Baolu  wrote:
> > >>>
> >  This adds two new aux-domain APIs for a use case like vfio/mdev
> >  where sub-devices derived from an aux-domain capable device are
> >  created and put in an iommu_group.
> > 
> >  /**
> > * iommu_aux_attach_group - attach an aux-domain to an
> iommu_group
> >  which
> > *  contains sub-devices (for example
> >  mdevs) derived
> > *  from @dev.
> > * @domain: an aux-domain;
> > * @group:  an iommu_group which contains sub-devices derived
> from
> >  @dev;
> > * @dev:the physical device which supports
> IOMMU_DEV_FEAT_AUX.
> > *
> > * Returns 0 on success, or an error value.
> > */
> >  int iommu_aux_attach_group(struct iommu_domain *domain,
> >   struct iommu_group *group,
> >   struct device *dev)
> > 
> >  /**
> > * iommu_aux_detach_group - detach an aux-domain from an
> >  iommu_group *
> > * @domain: an aux-domain;
> > * @group:  an iommu_group which contains sub-devices derived
> from
> >  @dev;
> > * @dev:the physical device which supports
> IOMMU_DEV_FEAT_AUX.
> > *
> > * @domain must have been attached to @group via
> >  iommu_aux_attach_group(). */
> >  void iommu_aux_detach_group(struct iommu_domain *domain,
> >    struct iommu_group *group,
> >    struct device *dev)
> > 
> >  It also adds a flag in the iommu_group data structure to identify
> >  an iommu_group with aux-domain attached from those normal ones.
> > 
> >  Signed-off-by: Lu Baolu
> >  ---
> > drivers/iommu/iommu.c | 58
> >  +++
> include/linux/iommu.h |
> >  17 + 2 files changed, 75 insertions(+)
> > 
> >  diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> >  index e1fdd3531d65..cad5a19ebf22 100644
> >  --- a/drivers/iommu/iommu.c
> >  +++ b/drivers/iommu/iommu.c
> >  @@ -45,6 +45,7 @@ struct iommu_group {
> > struct iommu_domain *default_domain;
> > struct iommu_domain *domain;
> > struct list_head entry;
> >  +  unsigned int aux_domain_attached:1;
> > };
> > 
> > struct group_device {
> >  @@ -2759,6 +2760,63 @@ int iommu_aux_get_pasid(struct
> iommu_domain
> >  *domain, struct device *dev) }
> > EXPORT_SYMBOL_GPL(iommu_aux_get_pasid);
> > 
> >  +/**
> >  + * iommu_aux_attach_group - attach an aux-domain to an
> iommu_group
> >  which
> >  + *  contains sub-devices (for example
> >  mdevs) derived
> >  + *  from @dev.
> >  + * @domain: an aux-domain;
> >  + * @group:  an iommu_group which contains sub-devices derived
> from
> >  @dev;
> >  + * @dev:the physical device which supports
> IOMMU_DEV_FEAT_AUX.
> >  + *
> >  + * Returns 0 on success, or an error value.
> >  + */
> >  +int iommu_aux_attach_group(struct iommu_domain *domain,
> >  + struct iommu_group *group, struct
> >  device *dev) +{
> >  +  int ret = -EBUSY;
> >  +
> >  +  mutex_lock(&group->mutex);
> >  +  if (group->domain)
> >  +  goto out_unlock;
> >  +
> > >>> Perhaps I missed something but are we assuming only one mdev per
> > >>> mdev group? That seems to change the logic where vfio does:
> > >>> iommu_group_for_each_dev()
> > >>> iommu_aux_attach_device()
> > >>>
> > >>
> > >> It has been changed in PATCH 4/4:
> > >>
> > >> static int vfio_iommu_attach_group(struct vfio_domain *domain,
> > >>  struct vfio_group *group)
> > >> {
> > >>   if (group->mdev_group)
> > >>   return iommu_aux_attach_group(domain->domain,
> > >> group->iommu_group,
> > >> group->iommu_device);
> > >>   else
> > >>   return iommu_attach_group(domain->domain,
> > >> group->iommu_group);
> > >> }
> > >>
> > >> So, for both normal domain and aux-domain, we use the same concept:
> > >> attach a domain to a group.
> > >>
> > > I get that, but don't you have to attach all the devices within the
> >
> > This iommu_group includes only mediated devices derived from an
> 

RE: [PATCH v6 1/6] docs: IOMMU user API

2020-07-28 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Wednesday, July 29, 2020 3:20 AM
> 
[...]
> > +
> > +For example, IOTLB invalidations should always succeed. There is no
> > +architectural way to report back to the vIOMMU if the UAPI data is
> > +incompatible. If that happens, in order to guarantee IOMMU iosolation,
> 
> isolation
> 
> > +we have to resort to not giving completion status in vIOMMU. This may
> > +result in VM hang.
> > +
> > +For this reason the following IOMMU UAPIs cannot fail without
> > +catastrophic effect:
> > +
> > +1. Free PASID
> > +2. Unbind guest PASID
> > +3. Unbind guest PASID table (SMMU)
> > +4. Cache invalidate
> 
> I'm not entirely sure what we're trying to assert here.  Clearly cache
> invalidation can fail and patch 5/6 goes on to add over a dozen checks
> of the user provided data that return an -errno.  Any user ioctl can
> fail if the user botches the parameters.  So are we just trying to
> explain the feature checking that should allow the user to know
> supported combinations and if they adhere to them, these should not
> fail?  It's not quite worded to that effect.  Thanks,
> 

I guess the above wording mixes up how a UAPI should behave
with whether the vIOMMU reports associated errors.
A UAPI can always fail, as you pointed out. The vIOMMU may not
have a matching error code though, e.g. on Intel VT-d there is no
error reporting mechanism for cache invalidation. However,
it is not wise to define UAPI behavior according to the vIOMMU
definition. An error is an error. The vIOMMU should just react to
UAPI errors according to its architecture definition (e.g. ignore,
forward to guest, hang, etc.). For this reason I feel the above
section would better be removed.

Thanks
Kevin


RE: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-07-21 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, July 22, 2020 12:45 AM
> 
> > Link to previous discussions with Jason:
> > https://lore.kernel.org/lkml/57296ad1-20fe-caf2-b83f-
> 46d823ca0...@intel.com/
> > The emulation part that can be moved to user space is very small due to
> the majority of the
> > emulations being control bits and need to reside in the kernel. We can
> revisit the necessity of
> > moving the small emulation part to userspace and required architectural
> changes at a later time.
> 
> The point here is that you already have a user space interface for
> these queues that already has kernel support to twiddle the control
> bits. Generally I'd expect extending that existing kernel code to do
> the small bit more needed for mapping the queue through to PCI
> emulation to be smaller than the 2kloc of new code here to put all the
> emulation and support framework in the kernel, and exposes a lower
> attack surface of kernel code to the guest.
> 

We discussed in v1 why extending the user space interface is not
strongly motivated at the current stage:

https://lore.kernel.org/lkml/20200513124016.gg19...@mellanox.com/

In a nutshell, applications don't require the raw WQ controllability that
guest kernel drivers may expect. Extending the DSA user space interface to be
another passthrough interface just for virtualization needs is less compelling
than leveraging the established VFIO/mdev framework (with the major merit that
existing user space VMMs just work w/o any change, as long as they already
support the VFIO uAPI).

And in this version we split the previous 2kloc mdev patch into three parts:
[09] the mdev framework and callbacks; [10] mmio/pci_cfg emulation; and
[11] handling of control commands. Only patch [10] is purely about
emulation (~500 LOC), while the other two parts are tightly coupled to
physical resource management.

In the last review you said that you didn't hard-NAK this approach and would
like to hear opinions from virtualization folks. In this version we CCed the
KVM mailing list, Paolo (VFIO/Qemu), Alex (VFIO), Samuel (Rust-VMM/Cloud
Hypervisor), etc. Let's see how they feel about this approach.

Thanks
Kevin


RE: [PATCH v3 4/4] vfio/type1: Use iommu_aux_at(de)tach_group() APIs

2020-07-14 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Wednesday, July 15, 2020 9:00 AM
> 
> Hi Christoph and Jacob,
> 
> On 7/15/20 12:29 AM, Jacob Pan wrote:
> > On Tue, 14 Jul 2020 09:25:14 +0100
> > Christoph Hellwig  wrote:
> >
> >> On Tue, Jul 14, 2020 at 01:57:03PM +0800, Lu Baolu wrote:
> >>> Replace iommu_aux_at(de)tach_device() with
> >>> iommu_aux_at(de)tach_group(). It also saves the
> >>> IOMMU_DEV_FEAT_AUX-capable physical device in the vfio_group data
> >>> structure so that it could be reused in other places.
> >> This removes the last user of iommu_aux_attach_device and
> >> iommu_aux_detach_device, which can be removed now.
> > it is still used in patch 2/4 inside iommu_aux_attach_group(), right?
> >
> 
> There is a need to use this interface. For example, an aux-domain is
> attached to a subset of a physical device and used in the kernel. In
> this usage scenario, there's no need to use vfio/mdev. The device driver
> could just allocate an aux-domain and call iommu_aux_attach_device() to
> setup the iommu.
> 

and here is one example usage for adding per-instance pagetables for drm/msm:
https://lore.kernel.org/lkml/20200626200414.14382-5-jcro...@codeaurora.org/
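
Roughly, such an in-kernel user would look like the sketch below (error
handling trimmed, composed from the aux-domain APIs quoted earlier in
this thread, so treat it as illustrative rather than final code):

	struct iommu_domain *domain;
	int pasid;

	/* allocate an aux domain and attach it to (a subset of) the device */
	domain = iommu_domain_alloc(dev->bus);
	if (!domain)
		return -ENOMEM;

	if (iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_AUX) ||
	    iommu_aux_attach_device(domain, dev)) {
		iommu_domain_free(domain);
		return -ENODEV;
	}

	/* the PASID to program into the device's context */
	pasid = iommu_aux_get_pasid(domain, dev);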

Thanks
Kevin


RE: [PATCH v3 4/4] iommu/vt-d: Add page response ops support

2020-07-09 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Friday, July 10, 2020 1:37 PM
> 
> Hi Kevin,
> 
> On 2020/7/10 10:42, Tian, Kevin wrote:
> >> From: Lu Baolu 
> >> Sent: Thursday, July 9, 2020 3:06 PM
> >>
> >> After page requests are handled, software must respond to the device
> >> which raised the page request with the result. This is done through
> >> the iommu ops.page_response if the request was reported to outside of
> >> vendor iommu driver through iommu_report_device_fault(). This adds
> the
> >> VT-d implementation of page_response ops.
> >>
> >> Co-developed-by: Jacob Pan 
> >> Signed-off-by: Jacob Pan 
> >> Co-developed-by: Liu Yi L 
> >> Signed-off-by: Liu Yi L 
> >> Signed-off-by: Lu Baolu 
> >> ---
> >>   drivers/iommu/intel/iommu.c |   1 +
> >>   drivers/iommu/intel/svm.c   | 100
> >> 
> >>   include/linux/intel-iommu.h |   3 ++
> >>   3 files changed, 104 insertions(+)
> >>
> >> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> >> index 4a6b6960fc32..98390a6d8113 100644
> >> --- a/drivers/iommu/intel/iommu.c
> >> +++ b/drivers/iommu/intel/iommu.c
> >> @@ -6057,6 +6057,7 @@ const struct iommu_ops intel_iommu_ops = {
> >>.sva_bind   = intel_svm_bind,
> >>.sva_unbind = intel_svm_unbind,
> >>.sva_get_pasid  = intel_svm_get_pasid,
> >> +  .page_response  = intel_svm_page_response,
> >>   #endif
> >>   };
> >>
> >> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> >> index d24e71bac8db..839d2af377b6 100644
> >> --- a/drivers/iommu/intel/svm.c
> >> +++ b/drivers/iommu/intel/svm.c
> >> @@ -1082,3 +1082,103 @@ int intel_svm_get_pasid(struct iommu_sva
> *sva)
> >>
> >>return pasid;
> >>   }
> >> +
> >> +int intel_svm_page_response(struct device *dev,
> >> +  struct iommu_fault_event *evt,
> >> +  struct iommu_page_response *msg)
> >> +{
> >> +  struct iommu_fault_page_request *prm;
> >> +  struct intel_svm_dev *sdev = NULL;
> >> +  struct intel_svm *svm = NULL;
> >> +  struct intel_iommu *iommu;
> >> +  bool private_present;
> >> +  bool pasid_present;
> >> +  bool last_page;
> >> +  u8 bus, devfn;
> >> +  int ret = 0;
> >> +  u16 sid;
> >> +
> >> +  if (!dev || !dev_is_pci(dev))
> >> +  return -ENODEV;
> >> +
> >> +  iommu = device_to_iommu(dev, &bus, &devfn);
> >> +  if (!iommu)
> >> +  return -ENODEV;
> >> +
> >> +  if (!msg || !evt)
> >> +  return -EINVAL;
> >> +
> >> +  mutex_lock(&pasid_mutex);
> >> +
> >> +  prm = &evt->fault.prm;
> >> +  sid = PCI_DEVID(bus, devfn);
> >> +  pasid_present = prm->flags &
> >> IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
> >> +  private_present = prm->flags &
> >> IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA;
> >> +  last_page = prm->flags &
> >> IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE;
> >> +
> >> +  if (pasid_present) {
> >> +  if (prm->pasid == 0 || prm->pasid >= PASID_MAX) {
> >> +  ret = -EINVAL;
> >> +  goto out;
> >> +  }
> >> +
> >> +  ret = pasid_to_svm_sdev(dev, prm->pasid, &svm, &sdev);
> >> +  if (ret || !sdev) {
> >> +  ret = -ENODEV;
> >> +  goto out;
> >> +  }
> >> +
> >> +  /*
> >> +   * For responses from userspace, need to make sure that the
> >> +   * pasid has been bound to its mm.
> >> +  */
> >> +  if (svm->flags & SVM_FLAG_GUEST_MODE) {
> >> +  struct mm_struct *mm;
> >> +
> >> +  mm = get_task_mm(current);
> >> +  if (!mm) {
> >> +  ret = -EINVAL;
> >> +  goto out;
> >> +  }
> >> +
> >> +  if (mm != svm->mm) {
> >> +  ret = -ENODEV;
> >> +  mmput(mm);
> >> +  goto out;
> >> +  }
> >> +
> >> +  mmput(mm);

RE: [PATCH v3 4/4] iommu/vt-d: Add page response ops support

2020-07-09 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Thursday, July 9, 2020 3:06 PM
> 
> After page requests are handled, software must respond to the device
> which raised the page request with the result. This is done through
> the iommu ops.page_response if the request was reported to outside of
> vendor iommu driver through iommu_report_device_fault(). This adds the
> VT-d implementation of page_response ops.
> 
> Co-developed-by: Jacob Pan 
> Signed-off-by: Jacob Pan 
> Co-developed-by: Liu Yi L 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/intel/iommu.c |   1 +
>  drivers/iommu/intel/svm.c   | 100
> 
>  include/linux/intel-iommu.h |   3 ++
>  3 files changed, 104 insertions(+)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 4a6b6960fc32..98390a6d8113 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -6057,6 +6057,7 @@ const struct iommu_ops intel_iommu_ops = {
>   .sva_bind   = intel_svm_bind,
>   .sva_unbind = intel_svm_unbind,
>   .sva_get_pasid  = intel_svm_get_pasid,
> + .page_response  = intel_svm_page_response,
>  #endif
>  };
> 
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index d24e71bac8db..839d2af377b6 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -1082,3 +1082,103 @@ int intel_svm_get_pasid(struct iommu_sva *sva)
> 
>   return pasid;
>  }
> +
> +int intel_svm_page_response(struct device *dev,
> + struct iommu_fault_event *evt,
> + struct iommu_page_response *msg)
> +{
> + struct iommu_fault_page_request *prm;
> + struct intel_svm_dev *sdev = NULL;
> + struct intel_svm *svm = NULL;
> + struct intel_iommu *iommu;
> + bool private_present;
> + bool pasid_present;
> + bool last_page;
> + u8 bus, devfn;
> + int ret = 0;
> + u16 sid;
> +
> + if (!dev || !dev_is_pci(dev))
> + return -ENODEV;
> +
> + iommu = device_to_iommu(dev, &bus, &devfn);
> + if (!iommu)
> + return -ENODEV;
> +
> + if (!msg || !evt)
> + return -EINVAL;
> +
> + mutex_lock(&pasid_mutex);
> +
> + prm = &evt->fault.prm;
> + sid = PCI_DEVID(bus, devfn);
> + pasid_present = prm->flags &
> IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
> + private_present = prm->flags &
> IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA;
> + last_page = prm->flags &
> IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE;
> +
> + if (pasid_present) {
> + if (prm->pasid == 0 || prm->pasid >= PASID_MAX) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + ret = pasid_to_svm_sdev(dev, prm->pasid, &svm, &sdev);
> + if (ret || !sdev) {
> + ret = -ENODEV;
> + goto out;
> + }
> +
> + /*
> +  * For responses from userspace, need to make sure that the
> +  * pasid has been bound to its mm.
> + */
> + if (svm->flags & SVM_FLAG_GUEST_MODE) {
> + struct mm_struct *mm;
> +
> + mm = get_task_mm(current);
> + if (!mm) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + if (mm != svm->mm) {
> + ret = -ENODEV;
> + mmput(mm);
> + goto out;
> + }
> +
> + mmput(mm);
> + }
> + } else {
> + pr_err_ratelimited("Invalid page response: no pasid\n");
> + ret = -EINVAL;
> + goto out;

Check the no-pasid case first; then there is no need to indent so many lines above.
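i.e. something like:

	if (!pasid_present) {
		pr_err_ratelimited("Invalid page response: no pasid\n");
		ret = -EINVAL;
		goto out;
	}
	/* ... the pasid handling then loses one level of indentation ... */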

> + }
> +
> + /*
> +  * Per VT-d spec. v3.0 ch7.7, system software must respond
> +  * with page group response if private data is present (PDP)
> +  * or last page in group (LPIG) bit is set. This is an
> +  * additional VT-d requirement beyond PCI ATS spec.
> +  */

What is the behavior if system software doesn't follow the requirement?
Hmm... maybe the question is really about whether the information in prm
comes from userspace or from internally-recorded info in the iommu core.
The former cannot be trusted. The latter is OK.

Thanks
Kevin

> + if (last_page || private_present) {
> + struct qi_desc desc;
> +
> + desc.qw0 = QI_PGRP_PASID(prm->pasid) | QI_PGRP_DID(sid)
> |
> + QI_PGRP_PASID_P(pasid_present) |
> + QI_PGRP_PDP(private_present) |
> + QI_PGRP_RESP_CODE(msg->code) |
> + QI_PGRP_RESP_TYPE;
> + desc.qw1 = QI_PGRP_IDX(prm->grpid) |
> QI_PGRP_LPIG(last_page);
> + desc.qw2 = 0;
> + 

RE: [PATCH v3 3/4] iommu/vt-d: Report page request faults for guest SVA

2020-07-09 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Thursday, July 9, 2020 3:06 PM
> 
> A pasid might be bound to a page table from a VM guest via the iommu
> ops.sva_bind_gpasid. In this case, when a DMA page fault is detected
> on the physical IOMMU, we need to inject the page fault request into
> the guest. After the guest completes handling the page fault, a page
> response need to be sent back via the iommu ops.page_response().
> 
> This adds support to report a page request fault. Any external module
> which is interested in handling this fault should register a notifier
> with iommu_register_device_fault_handler().
> 
> Co-developed-by: Jacob Pan 
> Signed-off-by: Jacob Pan 
> Co-developed-by: Liu Yi L 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/intel/svm.c | 103 +++---
>  1 file changed, 85 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index c23167877b2b..d24e71bac8db 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -815,8 +815,63 @@ static void intel_svm_drain_prq(struct device *dev,
> int pasid)
>   }
>  }
> 
> +static int prq_to_iommu_prot(struct page_req_dsc *req)
> +{
> + int prot = 0;
> +
> + if (req->rd_req)
> + prot |= IOMMU_FAULT_PERM_READ;
> + if (req->wr_req)
> + prot |= IOMMU_FAULT_PERM_WRITE;
> + if (req->exe_req)
> + prot |= IOMMU_FAULT_PERM_EXEC;
> + if (req->pm_req)
> + prot |= IOMMU_FAULT_PERM_PRIV;
> +
> + return prot;
> +}
> +
> +static int
> +intel_svm_prq_report(struct device *dev, struct page_req_dsc *desc)
> +{
> + struct iommu_fault_event event;
> +
> + /* Fill in event data for device specific processing */
> > + memset(&event, 0, sizeof(struct iommu_fault_event));
> + event.fault.type = IOMMU_FAULT_PAGE_REQ;
> + event.fault.prm.addr = desc->addr;
> + event.fault.prm.pasid = desc->pasid;
> + event.fault.prm.grpid = desc->prg_index;
> + event.fault.prm.perm = prq_to_iommu_prot(desc);
> +
> + if (!dev || !dev_is_pci(dev))
> + return -ENODEV;

move the check before memset.

> +
> + if (desc->lpig)
> + event.fault.prm.flags |=
> IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE;
> + if (desc->pasid_present) {
> + event.fault.prm.flags |=
> IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
> + event.fault.prm.flags |=
> IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID;
> + }

If the pasid is not present, should we return an error directly instead of
submitting the req and letting the iommu core figure it out? I don't have
a strong preference, thus:

Reviewed-by: Kevin Tian 

> + if (desc->priv_data_present) {
> + /*
> +  * Set last page in group bit if private data is present,
> +  * page response is required as it does for LPIG.
> +  * iommu_report_device_fault() doesn't understand this
> vendor
> +  * specific requirement thus we set last_page as a
> workaround.
> +  */
> + event.fault.prm.flags |=
> IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE;
> + event.fault.prm.flags |=
> IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA;
> + memcpy(event.fault.prm.private_data, desc->priv_data,
> +sizeof(desc->priv_data));
> + }
> +
> > + return iommu_report_device_fault(dev, &event);
> +}
> +
>  static irqreturn_t prq_event_thread(int irq, void *d)
>  {
> + struct intel_svm_dev *sdev = NULL;
>   struct intel_iommu *iommu = d;
>   struct intel_svm *svm = NULL;
>   int head, tail, handled = 0;
> @@ -828,7 +883,6 @@ static irqreturn_t prq_event_thread(int irq, void *d)
>   tail = dmar_readq(iommu->reg + DMAR_PQT_REG) &
> PRQ_RING_MASK;
>   head = dmar_readq(iommu->reg + DMAR_PQH_REG) &
> PRQ_RING_MASK;
>   while (head != tail) {
> - struct intel_svm_dev *sdev;
>   struct vm_area_struct *vma;
>   struct page_req_dsc *req;
>   struct qi_desc resp;
> @@ -864,6 +918,20 @@ static irqreturn_t prq_event_thread(int irq, void *d)
>   }
>   }
> 
> + if (!sdev || sdev->sid != req->rid) {
> + struct intel_svm_dev *t;
> +
> + sdev = NULL;
> + rcu_read_lock();
> > + list_for_each_entry_rcu(t, &svm->devs, list) {
> + if (t->sid == req->rid) {
> + sdev = t;
> + break;
> + }
> + }
> + rcu_read_unlock();
> + }
> +
>   result = QI_RESP_INVALID;
>   /* Since we're using init_mm.pgd directly, we should never
> take
>* any faults on kernel addresses. */
> @@ -874,6 +942,17 @@ static irqreturn_t prq_event_thread(int irq, void *d)
>   if 

RE: [PATCH v3 06/14] vfio/type1: Add VFIO_IOMMU_PASID_REQUEST (alloc/free)

2020-07-08 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Thursday, July 9, 2020 10:08 AM
> 
> Hi Kevin,
> 
> > From: Tian, Kevin 
> > Sent: Thursday, July 9, 2020 9:57 AM
> >
> > > From: Liu, Yi L 
> > > Sent: Thursday, July 9, 2020 8:32 AM
> > >
> > > Hi Alex,
> > >
> > > > Alex Williamson 
> > > > Sent: Thursday, July 9, 2020 3:55 AM
> > > >
> > > > On Wed, 8 Jul 2020 08:16:16 +
> > > > "Liu, Yi L"  wrote:
> > > >
> > > > > Hi Alex,
> > > > >
> > > > > > From: Liu, Yi L < yi.l@intel.com>
> > > > > > Sent: Friday, July 3, 2020 2:28 PM
> > > > > >
> > > > > > Hi Alex,
> > > > > >
> > > > > > > From: Alex Williamson 
> > > > > > > Sent: Friday, July 3, 2020 5:19 AM
> > > > > > >
> > > > > > > On Wed, 24 Jun 2020 01:55:19 -0700 Liu Yi L
> > > > > > >  wrote:
> > > > > > >
> > > > > > > > This patch allows user space to request PASID allocation/free,
> e.g.
> > > > > > > > when serving the request from the guest.
> > > > > > > >
> > > > > > > > PASIDs that are not freed by userspace are automatically
> > > > > > > > freed
> > > when
> > > > > > > > the IOASID set is destroyed when process exits.
> > > > > [...]
> > > > > > > > +static int vfio_iommu_type1_pasid_request(struct vfio_iommu
> > > *iommu,
> > > > > > > > + unsigned long arg)
> > > > > > > > +{
> > > > > > > > +   struct vfio_iommu_type1_pasid_request req;
> > > > > > > > +   unsigned long minsz;
> > > > > > > > +
> > > > > > > > +   minsz = offsetofend(struct 
> > > > > > > > vfio_iommu_type1_pasid_request,
> > > > range);
> > > > > > > > +
> > > > > > > > +   if (copy_from_user(&req, (void __user *)arg, minsz))
> > > > > > > > +   return -EFAULT;
> > > > > > > > +
> > > > > > > > +   if (req.argsz < minsz || (req.flags &
> > > > ~VFIO_PASID_REQUEST_MASK))
> > > > > > > > +   return -EINVAL;
> > > > > > > > +
> > > > > > > > +   if (req.range.min > req.range.max)
> > > > > > >
> > > > > > > Is it exploitable that a user can spin the kernel for a long
> > > > > > > time in the case of a free by calling this with [0, MAX_UINT]
> > > > > > > regardless of their
> > > > actual
> > > > > > allocations?
> > > > > >
> > > > > > IOASID can ensure that user can only free the PASIDs allocated
> > > > > > to the
> > > user.
> > > > but
> > > > > > it's true, kernel needs to loop all the PASIDs within the range
> > > > > > provided by user.
> > > > it
> > > > > > may take a long time. is there anything we can do? one thing may
> > > > > > limit
> > > the
> > > > range
> > > > > > provided by user?
> > > > >
> > > > > thought about it more, we have per-VM pasid quota (say 1000), so
> > > > > even if user passed down [0, MAX_UNIT], kernel will only loop the
> > > > > 1000 pasids at most. do you think we still need to do something on it?
> > > >
> > > > How do you figure that?  vfio_iommu_type1_pasid_request() accepts
> > > > the user's min/max so long as (max > min) and passes that to
> > > > vfio_iommu_type1_pasid_free(), then to vfio_pasid_free_range()
> > > > which loops as:
> > > >
> > > > ioasid_t pasid = min;
> > > > for (; pasid <= max; pasid++)
> > > > ioasid_free(pasid);
> > > >
> > > > A user might only be able to allocate 1000 pasids, but apparently
> > > > they can ask to free all they want.
> > > >
> > > > It's also not obvious to me that calling ioasid_free() is only
> > > > allowing the user to free their own pa

RE: [PATCH v3 06/14] vfio/type1: Add VFIO_IOMMU_PASID_REQUEST (alloc/free)

2020-07-08 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Thursday, July 9, 2020 8:32 AM
> 
> Hi Alex,
> 
> > Alex Williamson 
> > Sent: Thursday, July 9, 2020 3:55 AM
> >
> > On Wed, 8 Jul 2020 08:16:16 +
> > "Liu, Yi L"  wrote:
> >
> > > Hi Alex,
> > >
> > > > From: Liu, Yi L < yi.l@intel.com>
> > > > Sent: Friday, July 3, 2020 2:28 PM
> > > >
> > > > Hi Alex,
> > > >
> > > > > From: Alex Williamson 
> > > > > Sent: Friday, July 3, 2020 5:19 AM
> > > > >
> > > > > On Wed, 24 Jun 2020 01:55:19 -0700
> > > > > Liu Yi L  wrote:
> > > > >
> > > > > > This patch allows user space to request PASID allocation/free, e.g.
> > > > > > when serving the request from the guest.
> > > > > >
> > > > > > PASIDs that are not freed by userspace are automatically freed
> when
> > > > > > the IOASID set is destroyed when process exits.
> > > [...]
> > > > > > +static int vfio_iommu_type1_pasid_request(struct vfio_iommu
> *iommu,
> > > > > > + unsigned long arg)
> > > > > > +{
> > > > > > +   struct vfio_iommu_type1_pasid_request req;
> > > > > > +   unsigned long minsz;
> > > > > > +
> > > > > > +   minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> > range);
> > > > > > +
> > > > > > +   if (copy_from_user(&req, (void __user *)arg, minsz))
> > > > > > +   return -EFAULT;
> > > > > > +
> > > > > > +   if (req.argsz < minsz || (req.flags &
> > ~VFIO_PASID_REQUEST_MASK))
> > > > > > +   return -EINVAL;
> > > > > > +
> > > > > > +   if (req.range.min > req.range.max)
> > > > >
> > > > > Is it exploitable that a user can spin the kernel for a long time in
> > > > > the case of a free by calling this with [0, MAX_UINT] regardless of 
> > > > > their
> > actual
> > > > allocations?
> > > >
> > > > IOASID can ensure that user can only free the PASIDs allocated to the
> user.
> > but
> > > > it's true, kernel needs to loop all the PASIDs within the range provided
> > > > by user.
> > it
> > > > may take a long time. is there anything we can do? one thing may limit
> the
> > range
> > > > provided by user?
> > >
> > > thought about it more, we have per-VM pasid quota (say 1000), so even if
> > > user passed down [0, MAX_UNIT], kernel will only loop the 1000 pasids at
> > > most. do you think we still need to do something on it?
> >
> > How do you figure that?  vfio_iommu_type1_pasid_request() accepts the
> > user's min/max so long as (max > min) and passes that to
> > vfio_iommu_type1_pasid_free(), then to vfio_pasid_free_range()  which
> > loops as:
> >
> > ioasid_t pasid = min;
> > for (; pasid <= max; pasid++)
> > ioasid_free(pasid);
> >
> > A user might only be able to allocate 1000 pasids, but apparently they
> > can ask to free all they want.
> >
> > It's also not obvious to me that calling ioasid_free() is only allowing
> > the user to free their own passid.  Does it?  It would be a pretty

Agree. I think ioasid_free should at least carry a token, since user
space is only allowed to manage PASIDs in its own set...

> > gaping hole if a user could free arbitrary pasids.  A r-b tree of
> > passids might help both for security and to bound spinning in a loop.
> 
> oh, yes. BTW. instead of r-b tree in VFIO, maybe we can add an ioasid_set
> parameter for ioasid_free(), thus to prevent the user from freeing PASIDs
> that doesn't belong to it. I remember Jacob mentioned it before.
> 

check current ioasid_free:

spin_lock(&ioasid_allocator_lock);
ioasid_data = xa_load(&active_allocator->xa, ioasid);
if (!ioasid_data) {
pr_err("Trying to free unknown IOASID %u\n", ioasid);
goto exit_unlock;
}

Allowing a user to trigger the above lock path up to MAX_UINT times
might still be bad.
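
A hypothetical sketch of a set-scoped free (the function name is made
up; the point is that the set both authorizes the free and bounds any
loop to the caller's own PASIDs):

	void ioasid_set_free(struct ioasid_set *set, ioasid_t ioasid)
	{
		struct ioasid_data *data;

		spin_lock(&ioasid_allocator_lock);
		data = xa_load(&active_allocator->xa, ioasid);
		/* reject PASIDs that were not allocated from this set */
		if (!data || data->set != set)
			goto exit_unlock;

		/* ... actual free path as in today's ioasid_free() ... */
	exit_unlock:
		spin_unlock(&ioasid_allocator_lock);
	}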

Thanks
Kevin


RE: [PATCH v2 4/4] iommu/vt-d: Add page response ops support

2020-07-05 Thread Tian, Kevin
> From: Lu Baolu
> Sent: Monday, July 6, 2020 8:26 AM
> 
> After a page request is handled, software must response the device which
> raised the page request with the handling result. This is done through

'response' is a noun. 

> the iommu ops.page_response if the request was reported to outside of
> vendor iommu driver through iommu_report_device_fault(). This adds the
> VT-d implementation of page_response ops.
> 
> Co-developed-by: Jacob Pan 
> Signed-off-by: Jacob Pan 
> Co-developed-by: Liu Yi L 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/intel/iommu.c |  1 +
>  drivers/iommu/intel/svm.c   | 74
> +
>  include/linux/intel-iommu.h |  3 ++
>  3 files changed, 78 insertions(+)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index de17952ed133..7eb29167e8f9 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -6057,6 +6057,7 @@ const struct iommu_ops intel_iommu_ops = {
>   .sva_bind   = intel_svm_bind,
>   .sva_unbind = intel_svm_unbind,
>   .sva_get_pasid  = intel_svm_get_pasid,
> + .page_response  = intel_svm_page_response,
>  #endif
>  };
> 
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index 08c58c2b1a06..1c7d8a9ea124 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -1078,3 +1078,77 @@ int intel_svm_get_pasid(struct iommu_sva *sva)
> 
>   return pasid;
>  }
> +
> +int intel_svm_page_response(struct device *dev,
> + struct iommu_fault_event *evt,
> + struct iommu_page_response *msg)
> +{
> + struct iommu_fault_page_request *prm;
> + struct intel_svm_dev *sdev;
> + struct intel_iommu *iommu;
> + struct intel_svm *svm;
> + bool private_present;
> + bool pasid_present;
> + bool last_page;
> + u8 bus, devfn;
> + int ret = 0;
> + u16 sid;
> +
> + if (!dev || !dev_is_pci(dev))
> + return -ENODEV;

but we didn't do the same check when reporting the fault?

> +
> + iommu = device_to_iommu(dev, &bus, &devfn);
> + if (!iommu)
> + return -ENODEV;
> +
> + if (!msg || !evt)
> + return -EINVAL;
> +
> + mutex_lock(&pasid_mutex);
> +
> + prm = &evt->fault.prm;
> + sid = PCI_DEVID(bus, devfn);
> + pasid_present = prm->flags &
> IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
> + private_present = prm->flags &
> IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA;
> + last_page = prm->flags &
> IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE;
> +
> + if (pasid_present) {
> + if (prm->pasid == 0 || prm->pasid >= PASID_MAX) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + ret = pasid_to_svm_sdev(dev, prm->pasid, &svm, &sdev);
> + if (ret || !sdev) {
> + ret = -ENODEV;
> + goto out;
> + }
> + }

What about pasid_present==0? Do we support recoverable faults now
with this patch?

And who guarantees that the external fault handler (e.g. the guest)
cannot do bad things with this interface, e.g. by specifying a PASID
belonging to another guest (when Scalable IOV is enabled)?

Thanks
Kevin

> +
> + /*
> +  * Per VT-d spec. v3.0 ch7.7, system software must respond
> +  * with page group response if private data is present (PDP)
> +  * or last page in group (LPIG) bit is set. This is an
> +  * additional VT-d requirement beyond PCI ATS spec.
> +  */
> + if (last_page || private_present) {
> + struct qi_desc desc;
> +
> + desc.qw0 = QI_PGRP_PASID(prm->pasid) | QI_PGRP_DID(sid)
> |
> + QI_PGRP_PASID_P(pasid_present) |
> + QI_PGRP_PDP(private_present) |
> + QI_PGRP_RESP_CODE(msg->code) |
> + QI_PGRP_RESP_TYPE;
> + desc.qw1 = QI_PGRP_IDX(prm->grpid) |
> QI_PGRP_LPIG(last_page);
> + desc.qw2 = 0;
> + desc.qw3 = 0;
> + if (private_present)
> + memcpy(&desc.qw2, prm->private_data,
> +sizeof(prm->private_data));
> +
> + qi_submit_sync(iommu, &desc, 1, 0);
> + }
> +out:
> + mutex_unlock(_mutex);
> + return ret;
> +}
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index fc2cfc3db6e1..bf6009a344f5 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -741,6 +741,9 @@ struct iommu_sva *intel_svm_bind(struct device
> *dev, struct mm_struct *mm,
>void *drvdata);
>  void intel_svm_unbind(struct iommu_sva *handle);
>  int intel_svm_get_pasid(struct iommu_sva *handle);
> +int intel_svm_page_response(struct device *dev, struct iommu_fault_event
> *evt,
> + struct iommu_page_response *msg);
> +
>  struct 

RE: [PATCH v2 3/4] iommu/vt-d: Report page request faults for guest SVA

2020-07-05 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Monday, July 6, 2020 9:30 AM
> 
> > From: Lu Baolu 
> > Sent: Monday, July 6, 2020 8:26 AM
> >
> > A pasid might be bound to a page table from a VM guest via the iommu
> > ops.sva_bind_gpasid. In this case, when a DMA page fault is detected
> > on the physical IOMMU, we need to inject the page fault request into
> > the guest. After the guest completes handling the page fault, a page
> > response need to be sent back via the iommu ops.page_response().
> >
> > This adds support to report a page request fault. Any external module
> > which is interested in handling this fault should register a notifier
> > callback.
> 
> be specific about which notifier is to be registered...
> 
> >
> > Co-developed-by: Jacob Pan 
> > Signed-off-by: Jacob Pan 
> > Co-developed-by: Liu Yi L 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Lu Baolu 
> > ---
> >  drivers/iommu/intel/svm.c | 99 -
> --
> >  1 file changed, 81 insertions(+), 18 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> > index c23167877b2b..08c58c2b1a06 100644
> > --- a/drivers/iommu/intel/svm.c
> > +++ b/drivers/iommu/intel/svm.c
> > @@ -815,6 +815,57 @@ static void intel_svm_drain_prq(struct device *dev,
> > int pasid)
> > }
> >  }
> >
> > +static int prq_to_iommu_prot(struct page_req_dsc *req)
> > +{
> > +   int prot = 0;
> > +
> > +   if (req->rd_req)
> > +   prot |= IOMMU_FAULT_PERM_READ;
> > +   if (req->wr_req)
> > +   prot |= IOMMU_FAULT_PERM_WRITE;
> > +   if (req->exe_req)
> > +   prot |= IOMMU_FAULT_PERM_EXEC;
> > +   if (req->pm_req)
> > +   prot |= IOMMU_FAULT_PERM_PRIV;
> > +
> > +   return prot;
> > +}
> > +
> > +static int
> > +intel_svm_prq_report(struct device *dev, struct page_req_dsc *desc)
> > +{
> > +   struct iommu_fault_event event;
> > +   u8 bus, devfn;
> > +
> > +   memset(&event, 0, sizeof(struct iommu_fault_event));
> > +   bus = PCI_BUS_NUM(desc->rid);
> > +   devfn = desc->rid & 0xff;
> 
> not required.
> 
> > +
> > +   /* Fill in event data for device specific processing */
> > +   event.fault.type = IOMMU_FAULT_PAGE_REQ;
> > +   event.fault.prm.addr = desc->addr;
> > +   event.fault.prm.pasid = desc->pasid;
> > +   event.fault.prm.grpid = desc->prg_index;
> > +   event.fault.prm.perm = prq_to_iommu_prot(desc);
> > +
> > +   /*
> > +* Set last page in group bit if private data is present,
> > +* page response is required as it does for LPIG.
> > +*/
> 
> move to priv_data_present check?
> 
> > +   if (desc->lpig)
> > +   event.fault.prm.flags |=
> > IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE;
> > +   if (desc->pasid_present)
> > +   event.fault.prm.flags |=
> > IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
> > +   if (desc->priv_data_present) {
> > +   event.fault.prm.flags |=
> > IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE;

btw the earlier comment is more about the behavior of the fault
handler (e.g. the guest), not about why we need to convert to
the last_page prm flag. Let's make it clear that we do so
because iommu_report_device_fault doesn't understand this
VT-d specific requirement, thus we set last_page as a workaround.

Thanks
Kevin

> > +   event.fault.prm.flags |=
> > IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA;
> > +   memcpy(event.fault.prm.private_data, desc->priv_data,
> > +  sizeof(desc->priv_data));
> > +   }
> > +
> > +   return iommu_report_device_fault(dev, &event);
> > +}
> > +
> >  static irqreturn_t prq_event_thread(int irq, void *d)
> >  {
> > struct intel_iommu *iommu = d;
> > @@ -828,7 +879,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
> > tail = dmar_readq(iommu->reg + DMAR_PQT_REG) &
> > PRQ_RING_MASK;
> > head = dmar_readq(iommu->reg + DMAR_PQH_REG) &
> > PRQ_RING_MASK;
> > while (head != tail) {
> > -   struct intel_svm_dev *sdev;
> > +   struct intel_svm_dev *sdev = NULL;
> 
> move it outside of the loop, otherwise the later check always hits "if (!sdev)"
> 
> > struct vm_area_struct *vma;
> > struct page_req_dsc *req;
> > struct qi_desc resp;
> > @@ -864,6 +915,20 @@ static irqreturn_t prq_event_thread(int irq, void
> *d)
> > }

RE: [PATCH v2 3/4] iommu/vt-d: Report page request faults for guest SVA

2020-07-05 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Monday, July 6, 2020 8:26 AM
> 
> A pasid might be bound to a page table from a VM guest via the iommu
> ops.sva_bind_gpasid. In this case, when a DMA page fault is detected
> on the physical IOMMU, we need to inject the page fault request into
> the guest. After the guest completes handling the page fault, a page
> response need to be sent back via the iommu ops.page_response().
> 
> This adds support to report a page request fault. Any external module
> which is interested in handling this fault should register a notifier
> callback.

be specific about which notifier is to be registered...

> 
> Co-developed-by: Jacob Pan 
> Signed-off-by: Jacob Pan 
> Co-developed-by: Liu Yi L 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/intel/svm.c | 99 ---
>  1 file changed, 81 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index c23167877b2b..08c58c2b1a06 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -815,6 +815,57 @@ static void intel_svm_drain_prq(struct device *dev,
> int pasid)
>   }
>  }
> 
> +static int prq_to_iommu_prot(struct page_req_dsc *req)
> +{
> + int prot = 0;
> +
> + if (req->rd_req)
> + prot |= IOMMU_FAULT_PERM_READ;
> + if (req->wr_req)
> + prot |= IOMMU_FAULT_PERM_WRITE;
> + if (req->exe_req)
> + prot |= IOMMU_FAULT_PERM_EXEC;
> + if (req->pm_req)
> + prot |= IOMMU_FAULT_PERM_PRIV;
> +
> + return prot;
> +}
> +
> +static int
> +intel_svm_prq_report(struct device *dev, struct page_req_dsc *desc)
> +{
> + struct iommu_fault_event event;
> + u8 bus, devfn;
> +
> + memset(&event, 0, sizeof(struct iommu_fault_event));
> + bus = PCI_BUS_NUM(desc->rid);
> + devfn = desc->rid & 0xff;

not required.

> +
> + /* Fill in event data for device specific processing */
> + event.fault.type = IOMMU_FAULT_PAGE_REQ;
> + event.fault.prm.addr = desc->addr;
> + event.fault.prm.pasid = desc->pasid;
> + event.fault.prm.grpid = desc->prg_index;
> + event.fault.prm.perm = prq_to_iommu_prot(desc);
> +
> + /*
> +  * Set last page in group bit if private data is present,
> +  * page response is required as it does for LPIG.
> +  */

move to priv_data_present check?

> + if (desc->lpig)
> + event.fault.prm.flags |=
> IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE;
> + if (desc->pasid_present)
> + event.fault.prm.flags |=
> IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
> + if (desc->priv_data_present) {
> + event.fault.prm.flags |=
> IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE;
> + event.fault.prm.flags |=
> IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA;
> + memcpy(event.fault.prm.private_data, desc->priv_data,
> +sizeof(desc->priv_data));
> + }
> +
> + return iommu_report_device_fault(dev, &event);
> +}
> +
>  static irqreturn_t prq_event_thread(int irq, void *d)
>  {
>   struct intel_iommu *iommu = d;
> @@ -828,7 +879,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
>   tail = dmar_readq(iommu->reg + DMAR_PQT_REG) &
> PRQ_RING_MASK;
>   head = dmar_readq(iommu->reg + DMAR_PQH_REG) &
> PRQ_RING_MASK;
>   while (head != tail) {
> - struct intel_svm_dev *sdev;
> + struct intel_svm_dev *sdev = NULL;

move it outside of the loop, otherwise the later check always hits "if (!sdev)"

>   struct vm_area_struct *vma;
>   struct page_req_dsc *req;
>   struct qi_desc resp;
> @@ -864,6 +915,20 @@ static irqreturn_t prq_event_thread(int irq, void *d)
>   }
>   }
> 
> + if (!sdev || sdev->sid != req->rid) {
> + struct intel_svm_dev *t;
> +
> + sdev = NULL;
> + rcu_read_lock();
> + list_for_each_entry_rcu(t, &svm->devs, list) {
> + if (t->sid == req->rid) {
> + sdev = t;
> + break;
> + }
> + }
> + rcu_read_unlock();
> + }
> +
>   result = QI_RESP_INVALID;
>   /* Since we're using init_mm.pgd directly, we should never
> take
>* any faults on kernel addresses. */
> @@ -874,6 +939,17 @@ static irqreturn_t prq_event_thread(int irq, void *d)
>   if (!is_canonical_address(address))
>   goto bad_req;
> 
> + /*
> +  * If prq is to be handled outside iommu driver via receiver of
> +  * the fault notifiers, we skip the page response here.
> +  */
> + if (svm->flags & SVM_FLAG_GUEST_MODE) {
> + if (sdev && !intel_svm_prq_report(sdev->dev, req))
> +

RE: [PATCH v2 2/4] iommu/vt-d: Add a helper to get svm and sdev for pasid

2020-07-05 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Monday, July 6, 2020 8:26 AM
> 
> There are several places in the code that need to get the pointers of
> svm and sdev according to a pasid and device. Add a helper to achieve
> this for code consolidation and readability.
> 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/intel/svm.c | 121 +-
>  1 file changed, 68 insertions(+), 53 deletions(-)
> 
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index 25dd74f27252..c23167877b2b 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -228,6 +228,50 @@ static LIST_HEAD(global_svm_list);
>   list_for_each_entry((sdev), &(svm)->devs, list) \
>   if ((d) != (sdev)->dev) {} else
> 
> +static int pasid_to_svm_sdev(struct device *dev, unsigned int pasid,
> +  struct intel_svm **rsvm,
> +  struct intel_svm_dev **rsdev)
> +{
> + struct intel_svm_dev *d, *sdev = NULL;
> + struct intel_svm *svm;
> +
> + /* The caller should hold the pasid_mutex lock */
> + if (WARN_ON(!mutex_is_locked(&pasid_mutex)))
> + return -EINVAL;
> +
> + if (pasid == INVALID_IOASID || pasid >= PASID_MAX)
> + return -EINVAL;
> +
> + svm = ioasid_find(NULL, pasid, NULL);
> + if (IS_ERR(svm))
> + return PTR_ERR(svm);
> +
> + if (!svm)
> + goto out;
> +
> + /*
> +  * If we found svm for the PASID, there must be at least one device
> +  * bond.
> +  */
> + if (WARN_ON(list_empty(&svm->devs)))
> + return -EINVAL;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(d, &svm->devs, list) {
> + if (d->dev == dev) {
> + sdev = d;
> + break;
> + }
> + }
> + rcu_read_unlock();
> +
> +out:
> + *rsvm = svm;
> + *rsdev = sdev;
> +
> + return 0;
> +}
> +
>  int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device
> *dev,
> struct iommu_gpasid_bind_data *data)
>  {
> @@ -261,39 +305,27 @@ int intel_svm_bind_gpasid(struct iommu_domain
> *domain, struct device *dev,
>   dmar_domain = to_dmar_domain(domain);
> 
>   mutex_lock(_mutex);
> - svm = ioasid_find(NULL, data->hpasid, NULL);
> - if (IS_ERR(svm)) {
> - ret = PTR_ERR(svm);
> + ret = pasid_to_svm_sdev(dev, data->hpasid, &svm, &sdev);
> + if (ret)
>   goto out;
> - }
> 
> - if (svm) {
> + if (sdev) {
>   /*
> -  * If we found svm for the PASID, there must be at
> -  * least one device bond, otherwise svm should be freed.
> +  * For devices with aux domains, we should allow
> +  * multiple bind calls with the same PASID and pdev.
>*/
> - if (WARN_ON(list_empty(&svm->devs))) {
> - ret = -EINVAL;
> - goto out;
> + if (iommu_dev_feature_enabled(dev,
> IOMMU_DEV_FEAT_AUX)) {
> + sdev->users++;
> + } else {
> + dev_warn_ratelimited(dev,
> +  "Already bound with PASID %u\n",
> +  svm->pasid);
> + ret = -EBUSY;
>   }
> + goto out;
> + }
> 
> - for_each_svm_dev(sdev, svm, dev) {
> - /*
> -  * For devices with aux domains, we should allow
> -  * multiple bind calls with the same PASID and pdev.
> -  */
> - if (iommu_dev_feature_enabled(dev,
> -   IOMMU_DEV_FEAT_AUX))
> {
> - sdev->users++;
> - } else {
> - dev_warn_ratelimited(dev,
> -  "Already bound with
> PASID %u\n",
> -  svm->pasid);
> - ret = -EBUSY;
> - }
> - goto out;
> - }
> - } else {
> + if (!svm) {
>   /* We come here when PASID has never been bond to a
> device. */
>   svm = kzalloc(sizeof(*svm), GFP_KERNEL);
>   if (!svm) {
> @@ -376,25 +408,17 @@ int intel_svm_unbind_gpasid(struct device *dev,
> int pasid)
>   struct intel_iommu *iommu = device_to_iommu(dev, NULL, NULL);
>   struct intel_svm_dev *sdev;
>   struct intel_svm *svm;
> - int ret = -EINVAL;
> + int ret;
> 
>   if (WARN_ON(!iommu))
>   return -EINVAL;
> 
>   mutex_lock(_mutex);
> - svm = ioasid_find(NULL, pasid, NULL);
> - if (!svm) {
> - ret = -EINVAL;
> - goto out;
> - }
> -
> - if (IS_ERR(svm)) {
> - ret = PTR_ERR(svm);
> + ret = 

RE: [PATCH v2 1/4] iommu/vt-d: Refactor device_to_iommu() helper

2020-07-05 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Monday, July 6, 2020 8:26 AM
> 
> It is refactored in two ways:
> 
> - Make it global so that it could be used in other files.
> 
> - Make bus/devfn optional so that callers could ignore these two returned
> values when they only want to get the coresponding iommu pointer.
> 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/intel/iommu.c | 55 +++--
>  drivers/iommu/intel/svm.c   |  8 +++---
>  include/linux/intel-iommu.h |  3 +-
>  3 files changed, 21 insertions(+), 45 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index d759e7234e98..de17952ed133 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -778,16 +778,16 @@ is_downstream_to_pci_bridge(struct device *dev,
> struct device *bridge)
>   return false;
>  }
> 
> -static struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8
> *devfn)
> +struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8
> *devfn)
>  {
>   struct dmar_drhd_unit *drhd = NULL;
> + struct pci_dev *pdev = NULL;
>   struct intel_iommu *iommu;
>   struct device *tmp;
> - struct pci_dev *pdev = NULL;
>   u16 segment = 0;
>   int i;
> 
> - if (iommu_dummy(dev))
> + if (!dev || iommu_dummy(dev))
>   return NULL;
> 
>   if (dev_is_pci(dev)) {
> @@ -818,8 +818,10 @@ static struct intel_iommu *device_to_iommu(struct
> device *dev, u8 *bus, u8 *devf
>   if (pdev && pdev->is_virtfn)
>   goto got_pdev;
> 
> - *bus = drhd->devices[i].bus;
> - *devfn = drhd->devices[i].devfn;
> + if (bus && devfn) {
> + *bus = drhd->devices[i].bus;
> + *devfn = drhd->devices[i].devfn;
> + }
>   goto out;
>   }
> 
> @@ -829,8 +831,10 @@ static struct intel_iommu *device_to_iommu(struct
> device *dev, u8 *bus, u8 *devf
> 
>   if (pdev && drhd->include_all) {
>   got_pdev:
> - *bus = pdev->bus->number;
> - *devfn = pdev->devfn;
> + if (bus && devfn) {
> + *bus = pdev->bus->number;
> + *devfn = pdev->devfn;
> + }
>   goto out;
>   }
>   }
> @@ -5146,11 +5150,10 @@ static int aux_domain_add_dev(struct
> dmar_domain *domain,
> struct device *dev)
>  {
>   int ret;
> - u8 bus, devfn;
>   unsigned long flags;
>   struct intel_iommu *iommu;
> 
> - iommu = device_to_iommu(dev, &bus, &devfn);
> + iommu = device_to_iommu(dev, NULL, NULL);
>   if (!iommu)
>   return -ENODEV;
> 
> @@ -5236,9 +5239,8 @@ static int prepare_domain_attach_device(struct
> iommu_domain *domain,
>   struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>   struct intel_iommu *iommu;
>   int addr_width;
> - u8 bus, devfn;
> 
> - iommu = device_to_iommu(dev, &bus, &devfn);
> + iommu = device_to_iommu(dev, NULL, NULL);
>   if (!iommu)
>   return -ENODEV;
> 
> @@ -5658,9 +5660,8 @@ static bool intel_iommu_capable(enum
> iommu_cap cap)
>  static struct iommu_device *intel_iommu_probe_device(struct device *dev)
>  {
>   struct intel_iommu *iommu;
> - u8 bus, devfn;
> 
> - iommu = device_to_iommu(dev, &bus, &devfn);
> + iommu = device_to_iommu(dev, NULL, NULL);
>   if (!iommu)
>   return ERR_PTR(-ENODEV);
> 
> @@ -5673,9 +5674,8 @@ static struct iommu_device
> *intel_iommu_probe_device(struct device *dev)
>  static void intel_iommu_release_device(struct device *dev)
>  {
>   struct intel_iommu *iommu;
> - u8 bus, devfn;
> 
> - iommu = device_to_iommu(dev, &bus, &devfn);
> + iommu = device_to_iommu(dev, NULL, NULL);
>   if (!iommu)
>   return;
> 
> @@ -5825,37 +5825,14 @@ static struct iommu_group
> *intel_iommu_device_group(struct device *dev)
>   return generic_device_group(dev);
>  }
> 
> -#ifdef CONFIG_INTEL_IOMMU_SVM
> -struct intel_iommu *intel_svm_device_to_iommu(struct device *dev)
> -{
> - struct intel_iommu *iommu;
> - u8 bus, devfn;
> -
> - if (iommu_dummy(dev)) {
> - dev_warn(dev,
> -  "No IOMMU translation for device; cannot enable
> SVM\n");
> - return NULL;
> - }
> -
> - iommu = device_to_iommu(dev, &bus, &devfn);
> - if ((!iommu)) {
> - dev_err(dev, "No IOMMU for device; cannot enable SVM\n");
> - return NULL;
> - }
> -
> - return iommu;
> -}
> -#endif /* CONFIG_INTEL_IOMMU_SVM */
> -
>  static int intel_iommu_enable_auxd(struct device *dev)
>  {
>   struct device_domain_info *info;
>   struct intel_iommu *iommu;
>  

RE: [PATCH 4/4] iommu/vt-d: Add page response ops support

2020-06-30 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Sunday, June 28, 2020 8:34 AM
> 
> After a page request is handled, software must response the device which
> raised the page request with the handling result. This is done through
> the iommu ops.page_response if the request was reported to outside of
> vendor iommu driver through iommu_report_device_fault(). This adds the
> VT-d implementation of page_response ops.
> 
> Co-developed-by: Jacob Pan 
> Signed-off-by: Jacob Pan 
> Co-developed-by: Liu Yi L 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/intel/iommu.c |  1 +
>  drivers/iommu/intel/svm.c   | 73
> +
>  include/linux/intel-iommu.h |  3 ++
>  3 files changed, 77 insertions(+)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index de17952ed133..7eb29167e8f9 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -6057,6 +6057,7 @@ const struct iommu_ops intel_iommu_ops = {
>   .sva_bind   = intel_svm_bind,
>   .sva_unbind = intel_svm_unbind,
>   .sva_get_pasid  = intel_svm_get_pasid,
> + .page_response  = intel_svm_page_response,
>  #endif
>  };
> 
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index 4800bb6f8794..003ea9579632 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -1092,3 +1092,76 @@ int intel_svm_get_pasid(struct iommu_sva *sva)
> 
>   return pasid;
>  }
> +
> +int intel_svm_page_response(struct device *dev,
> + struct iommu_fault_event *evt,
> + struct iommu_page_response *msg)
> +{
> + struct iommu_fault_page_request *prm;
> + struct intel_svm_dev *sdev;
> + struct intel_iommu *iommu;
> + struct intel_svm *svm;
> + bool private_present;
> + bool pasid_present;
> + bool last_page;
> + u8 bus, devfn;
> + int ret = 0;
> + u16 sid;
> +
> + if (!dev || !dev_is_pci(dev))
> + return -ENODEV;
> +
> + iommu = device_to_iommu(dev, &bus, &devfn);
> + if (!iommu)
> + return -ENODEV;

Move this to the place where iommu is first referenced; this point is too early.
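
Something like below, i.e. keep the cheap argument checks up front and do
the lookup only where iommu is actually consumed (an untested sketch, just
to illustrate the ordering):

	if (!dev || !dev_is_pci(dev))
		return -ENODEV;

	if (!msg || !evt)
		return -EINVAL;

	mutex_lock(&pasid_mutex);
	...
	if (last_page || private_present) {
		struct qi_desc desc;

		/* sid = PCI_DEVID(bus, devfn) would move down here too */
		iommu = device_to_iommu(dev, &bus, &devfn);
		if (!iommu) {
			ret = -ENODEV;
			goto out;
		}
		...
		qi_submit_sync(iommu, &desc, 1, 0);
	}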

> +
> + if (!msg || !evt)
> + return -EINVAL;
> +
> + mutex_lock(&pasid_mutex);
> +
> + prm = &evt->fault.prm;
> + sid = PCI_DEVID(bus, devfn);
> + pasid_present = prm->flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
> + private_present = prm->flags & IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA;
> + last_page = prm->flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE;
> +
> + if (pasid_present) {
> + /* VT-d supports devices with full 20 bit PASIDs only */
> + if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX) {
> + ret = -EINVAL;
> + goto out;
> + }

Shouldn't we check prm->pasid here? The check above is more reasonably
done when the page request is reported.
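
e.g. something along these lines (a sketch; the exact bound check is
illustrative only):

	if (pasid_present) {
		if (prm->pasid >= PASID_MAX) {
			ret = -EINVAL;
			goto out;
		}
		...
	}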

> +
> + ret = pasid_to_svm_sdev(dev, prm->pasid, &svm, &sdev);
> + if (ret || !sdev)

If sdev == NULL, shouldn't an error (-ENODEV) be returned here?
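
i.e. (sketch):

	ret = pasid_to_svm_sdev(dev, prm->pasid, &svm, &sdev);
	if (ret)
		goto out;
	if (!sdev) {
		ret = -ENODEV;
		goto out;
	}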

> + goto out;
> + }
> +
> + /*
> +  * Per VT-d spec. v3.0 ch7.7, system software must respond
> +  * with page group response if private data is present (PDP)
> +  * or last page in group (LPIG) bit is set. This is an
> +  * additional VT-d feature beyond PCI ATS spec.

feature->requirement

Thanks
Kevin

> +  */
> + if (last_page || private_present) {
> + struct qi_desc desc;
> +
> + desc.qw0 = QI_PGRP_PASID(prm->pasid) | QI_PGRP_DID(sid) |
> + QI_PGRP_PASID_P(pasid_present) |
> + QI_PGRP_PDP(private_present) |
> + QI_PGRP_RESP_CODE(msg->code) |
> + QI_PGRP_RESP_TYPE;
> + desc.qw1 = QI_PGRP_IDX(prm->grpid) | QI_PGRP_LPIG(last_page);
> + desc.qw2 = 0;
> + desc.qw3 = 0;
> + if (private_present)
> + memcpy(&desc.qw2, prm->private_data,
> +sizeof(prm->private_data));
> +
> + qi_submit_sync(iommu, &desc, 1, 0);
> + }
> +out:
> + mutex_unlock(&pasid_mutex);
> + return ret;
> +}
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index fc2cfc3db6e1..bf6009a344f5 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -741,6 +741,9 @@ struct iommu_sva *intel_svm_bind(struct device
> *dev, struct mm_struct *mm,
>void *drvdata);
>  void intel_svm_unbind(struct iommu_sva *handle);
>  int intel_svm_get_pasid(struct iommu_sva *handle);
> +int intel_svm_page_response(struct device *dev, struct iommu_fault_event
> *evt,
> + struct iommu_page_response *msg);
> +
>  struct svm_dev_ops;
> 
>  struct intel_svm_dev {
> --
> 2.17.1



RE: [PATCH 3/4] iommu/vt-d: Report page request faults for guest SVA

2020-06-30 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Sunday, June 28, 2020 8:34 AM
> 
> A pasid might be bound to a page table from a VM guest via the iommu
> ops.sva_bind_gpasid. In this case, when a DMA page fault is detected
> on the physical IOMMU, we need to inject the page fault request into
> the guest. After the guest completes handling the page fault, a page
> response needs to be sent back via the iommu ops.page_response().
> 
> This adds support to report a page request fault. Any external module
> which is interested in handling this fault should register a notifier
> callback.
> 
> Co-developed-by: Jacob Pan 
> Signed-off-by: Jacob Pan 
> Co-developed-by: Liu Yi L 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/intel/svm.c | 83 +--
>  1 file changed, 80 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index c23167877b2b..4800bb6f8794 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -815,6 +815,69 @@ static void intel_svm_drain_prq(struct device *dev,
> int pasid)
>   }
>  }
> 
> +static int prq_to_iommu_prot(struct page_req_dsc *req)
> +{
> + int prot = 0;
> +
> + if (req->rd_req)
> + prot |= IOMMU_FAULT_PERM_READ;
> + if (req->wr_req)
> + prot |= IOMMU_FAULT_PERM_WRITE;
> + if (req->exe_req)
> + prot |= IOMMU_FAULT_PERM_EXEC;
> + if (req->pm_req)
> + prot |= IOMMU_FAULT_PERM_PRIV;
> +
> + return prot;
> +}
> +
> +static int
> +intel_svm_prq_report(struct intel_iommu *iommu, struct page_req_dsc *desc)
> +{
> + struct iommu_fault_event event;
> + struct pci_dev *pdev;
> + u8 bus, devfn;
> + int ret = 0;
> +
> + memset(&event, 0, sizeof(struct iommu_fault_event));
> + bus = PCI_BUS_NUM(desc->rid);
> + devfn = desc->rid & 0xff;
> + pdev = pci_get_domain_bus_and_slot(iommu->segment, bus, devfn);

Is this step necessary? dev can be passed in (based on sdev), and more
importantly iommu_report_device_fault() already handles the ref counting,
e.g. get_device(dev) when the fault handler is valid...
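
i.e. roughly (a sketch; the changed signature is hypothetical, assuming
the caller already holds the struct device and hands it in):

static int
intel_svm_prq_report(struct device *dev, struct page_req_dsc *desc)
{
	struct iommu_fault_event event;

	memset(&event, 0, sizeof(event));
	/* fill in event.fault.prm from desc as the patch does today */
	...
	return iommu_report_device_fault(dev, &event);
}

and the pci_get_domain_bus_and_slot()/pci_dev_put() pair goes away.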

> +
> + if (!pdev) {
> + pr_err("No PCI device found for PRQ [%02x:%02x.%d]\n",
> +bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
> + return -ENODEV;
> + }
> +
> + /* Fill in event data for device specific processing */
> + event.fault.type = IOMMU_FAULT_PAGE_REQ;
> + event.fault.prm.addr = desc->addr;
> + event.fault.prm.pasid = desc->pasid;
> + event.fault.prm.grpid = desc->prg_index;
> + event.fault.prm.perm = prq_to_iommu_prot(desc);
> +
> + /*
> +  * Set last page in group bit if private data is present,
> +  * page response is required as it does for LPIG.
> +  */
> + if (desc->lpig)
> + event.fault.prm.flags |= IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE;
> + if (desc->pasid_present)
> + event.fault.prm.flags |= IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
> + if (desc->priv_data_present) {
> + event.fault.prm.flags |= IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE;

Why is LPIG set under this condition?

> + event.fault.prm.flags |= IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA;
> + memcpy(event.fault.prm.private_data, desc->priv_data,
> +sizeof(desc->priv_data));
> + }
> +
> + ret = iommu_report_device_fault(&pdev->dev, &event);
> + pci_dev_put(pdev);
> +
> + return ret;
> +}
> +
>  static irqreturn_t prq_event_thread(int irq, void *d)
>  {
>   struct intel_iommu *iommu = d;
> @@ -874,6 +937,19 @@ static irqreturn_t prq_event_thread(int irq, void *d)
>   if (!is_canonical_address(address))
>   goto bad_req;
> 
> + /*
> +  * If prq is to be handled outside iommu driver via receiver of
> +  * the fault notifiers, we skip the page response here.
> +  */
> + if (svm->flags & SVM_FLAG_GUEST_MODE) {
> + int res = intel_svm_prq_report(iommu, req);
> +
> + if (!res)
> + goto prq_advance;
> + else
> + goto bad_req;
> + }
> +

I noted that there is another reporting path in bad_req:

	if (sdev && sdev->ops && sdev->ops->fault_cb) {
		int rwxp = (req->rd_req << 3) | (req->wr_req << 2) |
			   (req->exe_req << 1) | (req->pm_req);
		sdev->ops->fault_cb(sdev->dev, req->pasid, req->addr,
				    req->priv_data, rwxp, result);
	}

It was introduced in the 1st version of svm.c. It might be unrelated to
this patch, but I wonder whether that one should be replaced with 
iommu_report_device_fault too?
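
i.e. roughly (a sketch; whether the semantics really match needs checking):

	struct iommu_fault_event event;

	memset(&event, 0, sizeof(event));
	event.fault.type = IOMMU_FAULT_PAGE_REQ;
	event.fault.prm.addr = req->addr;
	event.fault.prm.pasid = req->pasid;
	event.fault.prm.perm = prq_to_iommu_prot(req);
	ret = iommu_report_device_fault(sdev->dev, &event);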

Thanks
Kevin

>   /* If the mm is already defunct, don't handle faults. */
>   

RE: [PATCH 3/7] iommu/vt-d: Fix PASID devTLB invalidation

2020-06-29 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Thursday, June 25, 2020 3:26 PM
> 
> On 2020/6/23 23:43, Jacob Pan wrote:
> > DevTLB flush can be used for DMA requests both with and without PASIDs.
> > The former uses PASID#0 (RID2PASID), the latter uses a non-zero PASID
> > for SVA usage.
> >
> > This patch adds a check for the PASID value such that devTLB flush with
> > PASID is used for the SVA case. This is more efficient in that multiple
> > PASIDs can be used by a single device; when tearing down a PASID entry
> > we shall flush only the devTLB specific to that PASID.
> >
> > Fixes: 6f7db75e1c46 ("iommu/vt-d: Add second level page table")

btw, is it really a fix? From the description it reads more like an optimization...

> > Signed-off-by: Jacob Pan 
> > ---
> >   drivers/iommu/intel/pasid.c | 11 ++-
> >   1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
> > index c81f0f17c6ba..3991a24539a1 100644
> > --- a/drivers/iommu/intel/pasid.c
> > +++ b/drivers/iommu/intel/pasid.c
> > @@ -486,7 +486,16 @@ devtlb_invalidation_with_pasid(struct
> intel_iommu *iommu,
> > qdep = info->ats_qdep;
> > pfsid = info->pfsid;
> >
> > -   qi_flush_dev_iotlb(iommu, sid, pfsid, qdep, 0, 64 - VTD_PAGE_SHIFT);
> > +   /*
> > +* When PASID 0 is used, it indicates RID2PASID (DMA request w/o PASID),
> > +* devTLB flush w/o PASID should be used. For non-zero PASID under
> > +* SVA usage, device could do DMA with multiple PASIDs. It is more
> > +* efficient to flush devTLB specific to the PASID.
> > +*/
> > +   if (pasid)
> 
> How about
> 
>   if (pasid == PASID_RID2PASID)
>   qi_flush_dev_iotlb(iommu, sid, pfsid, qdep, 0, 64 - VTD_PAGE_SHIFT);
>   else
>   qi_flush_dev_iotlb_pasid(iommu, sid, pfsid, pasid, qdep, 0, 64 - VTD_PAGE_SHIFT);
> 
> ?
> 
> It makes the code more readable and still works even we reassign another
> pasid for RID2PASID.
> 
> Best regards,
> baolu
> 
> > +   qi_flush_dev_iotlb_pasid(iommu, sid, pfsid, pasid, qdep, 0, 64 - VTD_PAGE_SHIFT);
> > +   else
> > +   qi_flush_dev_iotlb(iommu, sid, pfsid, qdep, 0, 64 - VTD_PAGE_SHIFT);
> >   }
> >
> >   void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct
> device *dev,
> >


RE: [PATCH v3 1/5] docs: IOMMU user API

2020-06-29 Thread Tian, Kevin
> From: Jacob Pan
> Sent: Tuesday, June 30, 2020 7:05 AM
> 
> On Fri, 26 Jun 2020 16:19:23 -0600
> Alex Williamson  wrote:
> 
> > On Tue, 23 Jun 2020 10:03:53 -0700
> > Jacob Pan  wrote:
> >
> > > IOMMU UAPI is newly introduced to support communications between
> > > guest virtual IOMMU and host IOMMU. There have been lots of
> > > discussions on how it should work with VFIO UAPI and userspace in
> > > general.
> > >
> > > This document is intended to clarify the UAPI design and usage. The
> > > mechanics of how future extensions should be achieved are also
> > > covered in this documentation.
> > >
> > > Signed-off-by: Liu Yi L 
> > > Signed-off-by: Jacob Pan 
> > > ---
> > >  Documentation/userspace-api/iommu.rst | 244 ++
> > >  1 file changed, 244 insertions(+)
> > >  create mode 100644 Documentation/userspace-api/iommu.rst
> > >
> > > diff --git a/Documentation/userspace-api/iommu.rst
> > > b/Documentation/userspace-api/iommu.rst new file mode 100644
> > > index ..f9e4ed90a413
> > > --- /dev/null
> > > +++ b/Documentation/userspace-api/iommu.rst
> > > @@ -0,0 +1,244 @@
> > > +.. SPDX-License-Identifier: GPL-2.0
> > > +.. iommu:
> > > +
> > > +===================
> > > +IOMMU Userspace API
> > > +===================
> > > +
> > > +IOMMU UAPI is used for virtualization cases where communications are
> > > +needed between physical and virtual IOMMU drivers. For native
> > > +usage, IOMMU is a system device which does not need to communicate
> > > +with user space directly.
> > > +
> > > +The primary use cases are guest Shared Virtual Address (SVA) and
> > > +guest IO virtual address (IOVA), wherein a virtual IOMMU (vIOMMU) is
> > > +required to communicate with the physical IOMMU in the host.
> > > +
> > > +.. contents:: :local:
> > > +
> > > +Functionalities
> > > +===============
> > > +Communication between user and kernel involves both directions. The
> > > +supported user-kernel APIs are as follows:
> > > +
> > > +1. Alloc/Free PASID
> > > +2. Bind/unbind guest PASID (e.g. Intel VT-d)
> > > +3. Bind/unbind guest PASID table (e.g. ARM sMMU)
> > > +4. Invalidate IOMMU caches
> > > +5. Service page requests
> > > +
> > > +Requirements
> > > +============
> > > +The IOMMU UAPIs are generic and extensible to meet the following
> > > +requirements:
> > > +
> > > +1. Emulated and para-virtualised vIOMMUs
> > > +2. Multiple vendors (Intel VT-d, ARM sMMU, etc.)
> > > +3. Extensions to the UAPI shall not break existing user space
> > > +
> > > +Interfaces
> > > +==========
> > > +Although the data structures defined in IOMMU UAPI are self-contained,
> > > +there are no user API functions introduced. Instead, IOMMU UAPI is
> > > +designed to work with existing user driver frameworks such as VFIO.
> > > +
> > > +Extension Rules & Precautions
> > > +-----------------------------
> > > +When IOMMU UAPI gets extended, the data structures can *only* be
> > > +modified in two ways:
> > > +
> > > +1. Adding new fields by re-purposing the padding[] field. No size
> > > +   change.
> > > +2. Adding new union members at the end. May increase in size.
> > > +
> > > +No new fields can be added *after* the variable sized union, in that it
> > > +will break backward compatibility when the offset moves. In both cases,
> > > +a new flag must accompany a new field such that the IOMMU driver can
> > > +process the data based on the new flag. The version field is only
> > > +reserved for the unlikely event of a UAPI upgrade in its entirety.
> > > +
> > > +It's *always* the caller's responsibility to indicate the size of the
> > > +structure passed, by setting argsz appropriately.
> > > +Though at the same time, argsz is user-provided data which is not
> > > +trusted. The argsz field allows the user to indicate how much data
> > > +they're providing; it's still the kernel's responsibility to validate
> > > +whether it's correct and sufficient for the requested operation.
> > > +
> > > +Compatibility Checking
> > > +----------------------
> > > +When IOMMU UAPI extension results in size increase, user such as VFIO
> > > +has to handle the following cases:
> > > +
> > > +1. User and kernel have an exact size match
> > > +2. An older user with an older kernel header (smaller UAPI size)
> > > +   running on a newer kernel (larger UAPI size)
> > > +3. A newer user with a newer kernel header (larger UAPI size) running
> > > +   on an older kernel.
> > > +4. A malicious/misbehaving user passes an illegal/invalid size but
> > > +   within range. The data may contain garbage.
> >
> > What exactly does vfio need to do to handle these?
> >
> VFIO does nothing other than returning the status from IOMMU driver.
> Based on the return status, users such as QEMU can cause fault
> conditions within the vIOMMU.

But from above description, "user such as VFIO has to handle the
following cases"...

Thanks
Kevin

> 
> > > +
> > > +Feature Checking
> > > +----------------
> > > +While launching 

RE: [PATCH v3 02/14] iommu: Report domain nesting info

2020-06-29 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Monday, June 29, 2020 8:23 PM
> 
> Hi Stefan,
> 
> > From: Stefan Hajnoczi 
> > Sent: Monday, June 29, 2020 5:25 PM
> >
> > On Wed, Jun 24, 2020 at 01:55:15AM -0700, Liu Yi L wrote:
> > > +/*
> > > + * struct iommu_nesting_info - Information for nesting-capable IOMMU.
> > > + *   user space should check it before using
> > > + *   nesting capability.
> > > + *
> > > + * @size:	size of the whole structure
> > > + * @format:	PASID table entry format, the same definition with
> > > + *		@format of struct iommu_gpasid_bind_data.
> > > + * @features:	supported nesting features.
> > > + * @flags:	currently reserved for future extension.
> > > + * @data:	vendor specific cap info.
> > > + *
> > > + * +---------------+----------------------------------------------------+
> > > + * | feature       |  Notes                                             |
> > > + * +===============+====================================================+
> > > + * | SYSWIDE_PASID |  Kernel manages PASID in system wide, PASIDs used  |
> > > + * |               |  in the system should be allocated by host kernel  |
> > > + * +---------------+----------------------------------------------------+
> > > + * | BIND_PGTBL    |  bind page tables to host PASID, the PASID could   |
> > > + * |               |  either be a host PASID passed in bind request or  |
> > > + * |               |  default PASIDs (e.g. default PASID of aux-domain) |
> > > + * +---------------+----------------------------------------------------+
> > > + * | CACHE_INVLD   |  mandatory feature for nesting capable IOMMU       |
> > > + * +---------------+----------------------------------------------------+
> >
> > This feature description is vague about what CACHE_INVLD does and how to
> > use it. If I understand correctly, the presence of this feature means
> > that VFIO_IOMMU_NESTING_OP_CACHE_INVLD must be used?
> >
> > The same kind of clarification could be done for SYSWIDE_PASID and
> > BIND_PGTBL too.
> 
> For SYSWIDE_PASID and BIND_PGTBL, yes, presence of the feature bit
> means it must be used. So the two are requirements on user space if it
> wants to set up nesting. While for CACHE_INVLD, it's more about
> availability here. How about removing CACHE_INVLD, as presence of
> BIND_PGTBL should indicate support of CACHE_INVLD?
> 

So far this assumption is correct, but it may not hold going forward.
For example, a vendor might find a way to allow the owner of the 1st-level
page table to directly invalidate the cache w/o going through the host
IOMMU driver. From this angle I feel explicitly reporting this capability
is more robust.

Regarding the description, what about the below?

--
SYSWIDE_PASID: PASIDs are managed system-wide, instead of per-device.
When a device is assigned to userspace or a VM, the proper uAPI (provided
by the userspace driver framework, e.g. VFIO) must be used to allocate/free
PASIDs for the assigned device.

BIND_PGTBL: The owner of the first-level/stage-1 page table must explicitly
bind the page table to the associated PASID (either the one specified in
the bind request or the default PASID of the iommu domain), through
VFIO_IOMMU_NESTING_OP.

CACHE_INVLD: The owner of the first-level/stage-1 page table must
explicitly invalidate the IOMMU cache through VFIO_IOMMU_NESTING_OP,
according to vendor-specific requirements when changing the page table.
--

Thanks
Kevin





RE: [PATCH v3 02/14] iommu: Report domain nesting info

2020-06-29 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Saturday, June 27, 2020 2:53 PM
> 
> Hi Robin,
> 
> > From: Robin Murphy 
> > Sent: Saturday, June 27, 2020 12:05 AM
> >
> > On 2020-06-26 08:47, Jean-Philippe Brucker wrote:
> > > On Wed, Jun 24, 2020 at 01:55:15AM -0700, Liu Yi L wrote:
> > >> IOMMUs that support nesting translation need to report the capability
> > >> info to userspace, e.g. the format of first level/stage paging 
> > >> structures.
> > >>
> > >> This patch reports nesting info by DOMAIN_ATTR_NESTING. Caller can
> > >> get nesting info after setting DOMAIN_ATTR_NESTING.
> > >>
> > >> v2 -> v3:
> > >> *) remove cap/ecap_mask in iommu_nesting_info.
> > >> *) reuse DOMAIN_ATTR_NESTING to get nesting info.
> > >> *) return an empty iommu_nesting_info for SMMU drivers per Jean'
> > >> suggestion.
> > >>
> > >> Cc: Kevin Tian 
> > >> CC: Jacob Pan 
> > >> Cc: Alex Williamson 
> > >> Cc: Eric Auger 
> > >> Cc: Jean-Philippe Brucker 
> > >> Cc: Joerg Roedel 
> > >> Cc: Lu Baolu 
> > >> Signed-off-by: Liu Yi L 
> > >> Signed-off-by: Jacob Pan 
> > >> ---
> > >>   drivers/iommu/arm-smmu-v3.c | 29 --
> > >>   drivers/iommu/arm-smmu.c| 29 --
> > >
> > > Looks reasonable to me. Please move the SMMU changes to a separate
> > > patch and Cc the SMMU maintainers:
> >
> > Cheers Jean, I'll admit I've been skipping over a lot of these patches 
> > lately :)
> >
> > A couple of comments below...
> >
> > >
> > > Cc: Will Deacon 
> > > Cc: Robin Murphy 
> > >
> > > Thanks,
> > > Jean
> > >
> > >>   include/uapi/linux/iommu.h  | 59 +
> > >>   3 files changed, 113 insertions(+), 4 deletions(-)
> > >>
> > >> diff --git a/drivers/iommu/arm-smmu-v3.c
> > >> b/drivers/iommu/arm-smmu-v3.c index f578677..0c45d4d 100644
> > >> --- a/drivers/iommu/arm-smmu-v3.c
> > >> +++ b/drivers/iommu/arm-smmu-v3.c
> > >> @@ -3019,6 +3019,32 @@ static struct iommu_group
> > *arm_smmu_device_group(struct device *dev)
> > >>  return group;
> > >>   }
> > >>
> > >> +static int arm_smmu_domain_nesting_info(struct arm_smmu_domain *smmu_domain,
> > >> +void *data)
> > >> +{
> > >> +struct iommu_nesting_info *info = (struct iommu_nesting_info *)data;
> > >> +u32 size;
> > >> +
> > >> +if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
> > >> +return -ENODEV;
> > >> +
> > >> +size = sizeof(struct iommu_nesting_info);
> > >> +
> > >> +/*
> > >> + * if provided buffer size is not equal to the size, should
> > >> + * return 0 and also the expected buffer size to caller.
> > >> + */
> > >> +if (info->size != size) {
> > >> +info->size = size;
> > >> +return 0;
> > >> +}
> > >> +
> > >> +/* report an empty iommu_nesting_info for now */
> > >> +memset(info, 0x0, size);
> > >> +info->size = size;
> > >> +return 0;
> > >> +}
> > >> +
> > >>   static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
> > >>  enum iommu_attr attr, void *data)
> > >>   {
> > >> @@ -3028,8 +3054,7 @@ static int arm_smmu_domain_get_attr(struct
> > iommu_domain *domain,
> > >>  case IOMMU_DOMAIN_UNMANAGED:
> > >>  switch (attr) {
> > >>  case DOMAIN_ATTR_NESTING:
> > >> -*(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED);
> > >> -return 0;
> > >> +return arm_smmu_domain_nesting_info(smmu_domain, data);
> > >>  default:
> > >>  return -ENODEV;
> > >>  }
> > >> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> > >> index 243bc4c..908607d 100644
> > >> --- a/drivers/iommu/arm-smmu.c
> > >> +++ b/drivers/iommu/arm-smmu.c
> > >> @@ -1506,6 +1506,32 @@ static struct iommu_group
> > *arm_smmu_device_group(struct device *dev)
> > >>  return group;
> > >>   }
> > >>
> > >> +static int arm_smmu_domain_nesting_info(struct arm_smmu_domain *smmu_domain,
> > >> +void *data)
> > >> +{
> > >> +struct iommu_nesting_info *info = (struct iommu_nesting_info *)data;
> > >> +u32 size;
> > >> +
> > >> +if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
> > >> +return -ENODEV;
> > >> +
> > >> +size = sizeof(struct iommu_nesting_info);
> > >> +
> > >> +/*
> > >> + * if provided buffer size is not equal to the size, should
> > >> + * return 0 and also the expected buffer size to caller.
> > >> + */
> > >> +if (info->size != size) {
> > >> +info->size = size;
> > >> +return 0;
> > >> +}
> > >> +
> > >> +/* report an empty iommu_nesting_info for now */
> > >> +

RE: [PATCH v2 1/3] docs: IOMMU user API

2020-06-17 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Wednesday, June 17, 2020 2:20 PM
> 
> > From: Jacob Pan 
> > Sent: Tuesday, June 16, 2020 11:22 PM
> >
> > On Thu, 11 Jun 2020 17:27:27 -0700
> > Jacob Pan  wrote:
> >
> > > >
> > > > But then I thought it even better if VFIO leaves the entire
> > > > copy_from_user() to the layer consuming it.
> > > >
> > > OK. Sounds good, that was what Kevin suggested also. I just wasn't
> > > sure how much VFIO wants to inspect, I thought VFIO layer wanted to do
> > > a sanity check.
> > >
> > > Anyway, I will move copy_from_user to iommu uapi layer.
> >
> > Just one more point brought up by Yi when we discussed this offline.
> >
> > If we move copy_from_user to iommu uapi layer, then there will be multiple
> > copy_from_user calls for the same data when a VFIO container has multiple
> > domains and devices. For bind, it might be OK, but there might be
> > additional overhead for TLB flush requests from the guest.
> 
> I think it is the same for both the bind and TLB flush paths: there will
> be multiple copy_from_user calls.

Multiple copies are possibly fine. In reality we allow only one group per
nesting container (as described in patch [03/15]), and usually there
is just one SVA-capable device per group.

> 
> BTW, for moving the data copy to the iommu layer, there is another point
> to consider. VFIO needs to do unbind in the bind path if bind failed,
> so it will assemble unbind_data and pass it to the iommu layer. If the
> iommu layer does the copy_from_user, I think it will fail. Any idea?
> 

This might be mitigated if we go back to using the same bind_data for both
bind and unbind. Then you can reuse the user object for unwinding, e.g. as
sketched below.
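
e.g. conceptually (a sketch; the function names are hypothetical, only the
unwinding idea matters):

	/* the iommu uapi layer does the single copy_from_user internally */
	ret = iommu_uapi_sva_bind_gpasid(domain, dev, udata);
	if (ret)
		/* reuse the very same user object to unwind */
		iommu_uapi_sva_unbind_gpasid(domain, dev, udata);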

However there is another case where VFIO may need to assemble the
bind_data itself. When a VM is killed, VFIO needs to walk the allocated
PASIDs and unbind them one-by-one. In such a case copy_from_user doesn't
work since the data is created in the kernel. Alex, do you have a
suggestion for how this usage can be supported? e.g. asking the IOMMU
driver to provide two sets of APIs to handle user/kernel generated requests?

Thanks
Kevin


RE: [PATCH v2 00/15] vfio: expose virtual Shared Virtual Addressing to VMs

2020-06-15 Thread Tian, Kevin
> From: Stefan Hajnoczi 
> Sent: Monday, June 15, 2020 6:02 PM
> 
> On Thu, Jun 11, 2020 at 05:15:19AM -0700, Liu Yi L wrote:
> > Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
> > Intel platforms allows address space sharing between device DMA and
> > applications. SVA can reduce programming complexity and enhance
> security.
> >
> > This VFIO series is intended to expose SVA usage to VMs. i.e. Sharing
> > guest application address space with passthru devices. This is called
> > vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU
> > changes. For IOMMU and QEMU changes, they are in separate series (listed
> > in the "Related series").
> >
> > The high-level architecture for SVA virtualization is as below, the key
> > design of vSVA support is to utilize the dual-stage IOMMU translation (
> > also known as IOMMU nesting translation) capability in host IOMMU.
> >
> >
> > .-------------.  .---------------------------.
> > |   vIOMMU    |  | Guest process CR3, FL only|
> > |             |  '---------------------------'
> > .----------------/
> > | PASID Entry |--- PASID cache flush -+
> > '-------------'                       |
> > |             |                       V
> > |             |                CR3 in GPA
> > '-------------'
> > Guest
> > ------| Shadow |--------------------------|--------
> >       v        v                          v
> > Host
> > .-------------.  .----------------------------------.
> > |   pIOMMU    |  | Bind FL for GVA-GPA              |
> > |             |  '----------------------------------'
> > .----------------/  |
> > | PASID Entry |     V (Nested xlate)
> > '----------------\.------------------------------------.
> > |             |   |SL for GPA-HPA, default domain      |
> > |             |   '------------------------------------'
> > '-------------'
> > Where:
> >  - FL = First level/stage one page tables
> >  - SL = Second level/stage two page tables
> 
> Hi,
> Looks like an interesting feature!
> 
> To check I understand this feature: can applications now pass virtual
> addresses to devices instead of translating to IOVAs?
> 
> If yes, can guest applications restrict the vSVA address space so the
> device only has access to certain regions?
> 
> On one hand replacing IOVA translation with virtual addresses simplifies
> the application programming model, but does it give up isolation if the
> device can now access all application memory?
> 

with SVA, each application is allocated a unique PASID to tag its
virtual address space. The device that claims SVA support must guarantee
that one application can only program the device to access its own virtual
address space (i.e. all DMAs triggered by this application are tagged with
the application's PASID, and are translated by the IOMMU's PASID-granular
page table). So, isolation is not sacrificed in SVA.
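
For reference, the host-side flow is roughly the below (a sketch using the
generic SVA API for illustration):

	/* bind the process mm; a PASID is allocated to tag this mm */
	handle = iommu_sva_bind_device(dev, current->mm, NULL);
	pasid = iommu_sva_get_pasid(handle);

	/*
	 * 'pasid' is then programmed into the device; every DMA the device
	 * issues on behalf of this application carries that PASID and is
	 * translated through the application's own page table.
	 */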

Thanks
Kevin


RE: [PATCH v2 02/15] iommu: Report domain nesting info

2020-06-15 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Monday, June 15, 2020 2:05 PM
> 
> Hi Kevin,
> 
> > From: Tian, Kevin 
> > Sent: Monday, June 15, 2020 9:23 AM
> >
> > > From: Liu, Yi L 
> > > Sent: Friday, June 12, 2020 5:05 PM
> > >
> > > Hi Alex,
> > >
> > > > From: Alex Williamson 
> > > > Sent: Friday, June 12, 2020 3:30 AM
> > > >
> > > > On Thu, 11 Jun 2020 05:15:21 -0700
> > > > Liu Yi L  wrote:
> > > >
> > > > > IOMMUs that support nesting translation need to report the
> > > > > capability info to userspace, e.g. the format of first level/stage
> > > > > paging structures.
> > > > >
> > > > > Cc: Kevin Tian 
> > > > > CC: Jacob Pan 
> > > > > Cc: Alex Williamson 
> > > > > Cc: Eric Auger 
> > > > > Cc: Jean-Philippe Brucker 
> > > > > Cc: Joerg Roedel 
> > > > > Cc: Lu Baolu 
> > > > > Signed-off-by: Liu Yi L 
> > > > > Signed-off-by: Jacob Pan 
> > > > > ---
> > > > > @Jean, Eric: nesting was introduced for ARM, but it looks like there
> > > > > is no actual user of it, right? So I'm wondering if we can reuse
> > > > > DOMAIN_ATTR_NESTING to retrieve nesting info? How about your
> > > > > opinions?
> > > > >
> > > > >  include/linux/iommu.h  |  1 +
> > > > >  include/uapi/linux/iommu.h | 34 ++
> > > > >  2 files changed, 35 insertions(+)
> > > > >
> > > > > diff --git a/include/linux/iommu.h b/include/linux/iommu.h index
> > > > > 78a26ae..f6e4b49 100644
> > > > > --- a/include/linux/iommu.h
> > > > > +++ b/include/linux/iommu.h
> > > > > @@ -126,6 +126,7 @@ enum iommu_attr {
> > > > >   DOMAIN_ATTR_FSL_PAMUV1,
> > > > >   DOMAIN_ATTR_NESTING,/* two stages of translation */
> > > > >   DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE,
> > > > > + DOMAIN_ATTR_NESTING_INFO,
> > > > >   DOMAIN_ATTR_MAX,
> > > > >  };
> > > > >
> > > > > diff --git a/include/uapi/linux/iommu.h
> > > > > b/include/uapi/linux/iommu.h index 303f148..02eac73 100644
> > > > > --- a/include/uapi/linux/iommu.h
> > > > > +++ b/include/uapi/linux/iommu.h
> > > > > @@ -332,4 +332,38 @@ struct iommu_gpasid_bind_data {
> > > > >   };
> > > > >  };
> > > > >
> > > > > +struct iommu_nesting_info {
> > > > > + __u32   size;
> > > > > + __u32   format;
> > > > > + __u32   features;
> > > > > +#define IOMMU_NESTING_FEAT_SYSWIDE_PASID (1 << 0)
> > > > > +#define IOMMU_NESTING_FEAT_BIND_PGTBL    (1 << 1)
> > > > > +#define IOMMU_NESTING_FEAT_CACHE_INVLD   (1 << 2)
> > > > > + __u32   flags;
> > > > > + __u8data[];
> > > > > +};
> > > > > +
> > > > > +/*
> > > > > + * @flags:	VT-d specific flags. Currently reserved for future
> > > > > + *		extension.
> > > > > + * @addr_width:	The output addr width of first level/stage translation
> > > > > + * @pasid_bits:	Maximum supported PASID bits, 0 represents no PASID
> > > > > + *		support.
> > > > > + * @cap_reg:	Describe basic capabilities as defined in VT-d
> > > > > + *		capability register.
> > > > > + * @cap_mask:	Mark valid capability bits in @cap_reg.
> > > > > + * @ecap_reg:	Describe the extended capabilities as defined in VT-d
> > > > > + *		extended capability register.
> > > > > + * @ecap_mask:	Mark the valid capability bits in @ecap_reg.
> > > >
> > > > Please explain this a little further, why do we need to tell
> > > > userspace about cap/ecap register bits that aren't valid through this
> > > > interface?
> > > > Thanks,
> > >
> > > we only want to tell userspace about the bits marked in the
> > > cap/ecap_mask.
> > > cap/ecap_mask is kind of a white-list of the cap/ecap register.
> > > userspace should only care about the bits in t
