Re: [PATCH v3 06/14] vfio/type1: Add VFIO_IOMMU_PASID_REQUEST (alloc/free)

2020-07-09 Thread Alex Williamson
On Thu, 9 Jul 2020 07:16:31 +
"Liu, Yi L"  wrote:

> Hi Alex,
> 
> After more thinking, it looks like adding an r-b tree is still not enough to
> solve the potential problem of freeing a range of PASIDs in one ioctl. If the
> caller gives [0, MAX_UINT] in the free request, the kernel anyhow has to
> loop over all the PASIDs and search in the r-b tree. Even if VFIO tracks the
> smallest/largest allocated PASID and limits the free range to an accurate
> range, it is still not efficient. For example, the user has allocated two
> PASIDs (1 and 999), and the user gives the [0, MAX_UINT] range in the free
> request. VFIO will limit the free range to [1, 999], but it still needs to
> loop over PASIDs 1 - 999 and search in the r-b tree.

That sounds like a poor tree implementation.  Look at vfio_find_dma()
for instance, it returns a node within the specified range.  If the
tree has two nodes within the specified range we should never need to
call a search function like vfio_find_dma() more than three times.  We
call it once, get the first node, remove it.  Call it again, get the
other node, remove it.  Call a third time, find no matches, we're done.
So such an implementation limits searches to N+1 where N is the number
of nodes within the range.
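
For illustration, that search-and-remove pattern could look roughly like the
sketch below (vfio_pasid_node and the helpers are made-up names, not from the
posted series; assumes <linux/rbtree.h> and the ioasid API):

	/* Hypothetical per-user tree of allocated PASIDs. */
	struct vfio_pasid_node {
		struct rb_node node;
		ioasid_t pasid;
	};

	/* Return any node with min <= pasid <= max, or NULL (cf. vfio_find_dma()). */
	static struct vfio_pasid_node *vfio_find_pasid(struct rb_root *root,
						       ioasid_t min, ioasid_t max)
	{
		struct rb_node *n = root->rb_node;

		while (n) {
			struct vfio_pasid_node *p =
				rb_entry(n, struct vfio_pasid_node, node);

			if (max < p->pasid)
				n = n->rb_left;
			else if (min > p->pasid)
				n = n->rb_right;
			else
				return p;
		}
		return NULL;
	}

	/* At most N + 1 searches for N allocated PASIDs within [min, max]. */
	static void vfio_free_pasid_range(struct rb_root *root,
					  ioasid_t min, ioasid_t max)
	{
		struct vfio_pasid_node *p;

		while ((p = vfio_find_pasid(root, min, max))) {
			ioasid_free(p->pasid);
			rb_erase(&p->node, root);
			kfree(p);
		}
	}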

> So I'm wondering whether we can fall back to the prior proposal, which only
> frees one PASID per free request. What's your opinion?

Doesn't it still seem like it would be a useful user interface to have
a mechanism to free all pasids, by calling with exactly [0, MAX_UINT]?
I'm not sure if there's another use case for this given that the user
doesn't have strict control of the pasid values they get.  Thanks,

Alex
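
For reference, a userspace "free everything" request under that interface
could look roughly like the following; the struct and ioctl names follow the
posted patch, but the free flag name below is only a placeholder:

	struct vfio_iommu_type1_pasid_request req = {
		.argsz = sizeof(req),
		.flags = VFIO_IOMMU_FREE_PASID,		/* placeholder flag name */
		.range = {
			.min = 0,
			.max = UINT_MAX,		/* free every PASID we own */
		},
	};

	if (ioctl(container_fd, VFIO_IOMMU_PASID_REQUEST, &req))
		perror("VFIO_IOMMU_PASID_REQUEST");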

> > From: Liu, Yi L 
> > Sent: Thursday, July 9, 2020 10:26 AM
> > 
> > Hi Kevin,
> >   
> > > From: Tian, Kevin 
> > > Sent: Thursday, July 9, 2020 10:18 AM
> > >  
> > > > From: Liu, Yi L 
> > > > Sent: Thursday, July 9, 2020 10:08 AM
> > > >
> > > > Hi Kevin,
> > > >  
> > > > > From: Tian, Kevin 
> > > > > Sent: Thursday, July 9, 2020 9:57 AM
> > > > >  
> > > > > > From: Liu, Yi L 
> > > > > > Sent: Thursday, July 9, 2020 8:32 AM
> > > > > >
> > > > > > Hi Alex,
> > > > > >  
> > > > > > > Alex Williamson 
> > > > > > > Sent: Thursday, July 9, 2020 3:55 AM
> > > > > > >
> > > > > > > On Wed, 8 Jul 2020 08:16:16 + "Liu, Yi L"
> > > > > > >  wrote:
> > > > > > >  
> > > > > > > > Hi Alex,
> > > > > > > >  
> > > > > > > > > From: Liu, Yi L < yi.l@intel.com>
> > > > > > > > > Sent: Friday, July 3, 2020 2:28 PM
> > > > > > > > >
> > > > > > > > > Hi Alex,
> > > > > > > > >  
> > > > > > > > > > From: Alex Williamson 
> > > > > > > > > > Sent: Friday, July 3, 2020 5:19 AM
> > > > > > > > > >
> > > > > > > > > > On Wed, 24 Jun 2020 01:55:19 -0700 Liu Yi L
> > > > > > > > > >  wrote:
> > > > > > > > > >  
> > > > > > > > > > > This patch allows user space to request PASID
> > > > > > > > > > > allocation/free, e.g. when serving the request from
> > > > > > > > > > > the guest.
> > > > > > > > > > >
> > > > > > > > > > > PASIDs that are not freed by userspace are
> > > > > > > > > > > automatically freed when the IOASID set is destroyed
> > > > > > > > > > > when process exits.
> > > > > > > > [...]
> > > > > > > > > > > +static int vfio_iommu_type1_pasid_request(struct vfio_iommu *iommu,
> > > > > > > > > > > +   unsigned long arg)
> > > > > > > > > > > +{
> > > > > > > > > > > + struct vfio_iommu_type1_pasid_request req;
> > > > > > > > > > > + unsigned long minsz;
> > > > > > > > > > > +
> > > > > > > > > > > + minsz = offsetofend(struct vfio_iommu_type1_pasid_request,

Re: [PATCH v3 06/14] vfio/type1: Add VFIO_IOMMU_PASID_REQUEST (alloc/free)

2020-07-08 Thread Alex Williamson
On Wed, 8 Jul 2020 08:16:16 +
"Liu, Yi L"  wrote:

> Hi Alex,
> 
> > From: Liu, Yi L < yi.l@intel.com>
> > Sent: Friday, July 3, 2020 2:28 PM
> > 
> > Hi Alex,
> >   
> > > From: Alex Williamson 
> > > Sent: Friday, July 3, 2020 5:19 AM
> > >
> > > On Wed, 24 Jun 2020 01:55:19 -0700
> > > Liu Yi L  wrote:
> > >  
> > > > This patch allows user space to request PASID allocation/free, e.g.
> > > > when serving the request from the guest.
> > > >
> > > > PASIDs that are not freed by userspace are automatically freed when
> > > > the IOASID set is destroyed when process exits.  
> [...]
> > > > +static int vfio_iommu_type1_pasid_request(struct vfio_iommu *iommu,
> > > > + unsigned long arg)
> > > > +{
> > > > +   struct vfio_iommu_type1_pasid_request req;
> > > > +   unsigned long minsz;
> > > > +
> > > > +   minsz = offsetofend(struct vfio_iommu_type1_pasid_request, range);
> > > > +
> > > > +   if (copy_from_user(&req, (void __user *)arg, minsz))
> > > > +   return -EFAULT;
> > > > +
> > > > +   if (req.argsz < minsz || (req.flags & ~VFIO_PASID_REQUEST_MASK))
> > > > +   return -EINVAL;
> > > > +
> > > > +   if (req.range.min > req.range.max)  
> > >
> > > Is it exploitable that a user can spin the kernel for a long time in
> > > the case of a free by calling this with [0, MAX_UINT] regardless of
> > > their actual allocations?
> > 
> > IOASID can ensure that a user can only free the PASIDs allocated to that
> > user. But it's true, the kernel needs to loop over all the PASIDs within
> > the range provided by the user, and that may take a long time. Is there
> > anything we can do? One option may be to limit the range provided by the
> > user?  
> 
> Thinking about it more, we have a per-VM pasid quota (say 1000), so even if
> the user passes down [0, MAX_UINT], the kernel will only loop over at most
> 1000 pasids. Do you think we still need to do something about it?

How do you figure that?  vfio_iommu_type1_pasid_request() accepts the
user's min/max so long as (max > min) and passes that to
vfio_iommu_type1_pasid_free(), then to vfio_pasid_free_range()  which
loops as:

	ioasid_t pasid = min;
	for (; pasid <= max; pasid++)
		ioasid_free(pasid);

A user might only be able to allocate 1000 pasids, but apparently they
can ask to free all they want.

It's also not obvious to me that calling ioasid_free() is only allowing
the user to free their own pasid.  Does it?  It would be a pretty
gaping hole if a user could free arbitrary pasids.  An r-b tree of
pasids might help both for security and to bound spinning in a loop.
Thanks,

Alex



Re: [PATCH v4 04/15] vfio/type1: Report iommu nesting info to userspace

2020-07-08 Thread Alex Williamson
On Wed, 8 Jul 2020 08:08:40 +
"Liu, Yi L"  wrote:

> Hi Alex,
> 
> Eric asked if we will have a data struct other than struct iommu_nesting_info
> in the struct vfio_iommu_type1_info_cap_nesting @info[] field. I'm not quite
> sure about it. I guess the answer may be no, as VFIO's nesting support should
> be based on the IOMMU UAPI. What's your opinion?
> 
> +#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING  3
> +
> +/*
> + * Reporting nesting info to user space.
> + *
> + * @info:the nesting info provided by IOMMU driver. Today
> + *   it is expected to be a struct iommu_nesting_info
> + *   data.
> + */
> +struct vfio_iommu_type1_info_cap_nesting {
> + struct  vfio_info_cap_header header;
> + __u32   flags;
> + __u32   padding;
> + __u8info[];
> +};

It's not a very useful uAPI if the user can't be sure what they're
getting out of it.  Info capabilities are "cheap", they don't need to
be as extensible as an ioctl.  It's not clear that we really even need
the flags (and therefore the padding), just define it to return the
IOMMU uAPI structure with no extensibility.  If we need to expose
something else, create a new capability.  Thanks,

Alex
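
A sketch of the trimmed-down capability being suggested (illustrative only,
not a final uAPI layout):

	/* No flags/padding; the capability is just the IOMMU uAPI struct. */
	struct vfio_iommu_type1_info_cap_nesting {
		struct vfio_info_cap_header	header;
		struct iommu_nesting_info	info;
	};

Anything beyond struct iommu_nesting_info would then be exposed through a new
capability rather than new flags.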

> 
> https://lore.kernel.org/linux-iommu/dm5pr11mb1435290b6cd561ec61027892c3...@dm5pr11mb1435.namprd11.prod.outlook.com/
> 
> Regards,
> Yi Liu
> 
> > From: Liu, Yi L
> > Sent: Tuesday, July 7, 2020 5:32 PM
> >   
> [...]
> > > >  
> > > >>> +
> > > >>> +/*
> > > >>> + * Reporting nesting info to user space.
> > > >>> + *
> > > >>> + * @info:the nesting info provided by IOMMU driver. Today
> > > >>> + *   it is expected to be a struct iommu_nesting_info
> > > >>> + *   data.  
> > > >> Is it expected to change?  
> > > >
> > > > honestly, I'm not quite sure about it. I did consider embedding struct
> > > > iommu_nesting_info here instead of using info[], but I hesitated as
> > > > using info[] may leave more flexibility in this struct. What's your
> > > > opinion? Perhaps it's fine to embed the struct iommu_nesting_info
> > > > here as long as VFIO sets up nesting based on the IOMMU UAPI.
> > > >  
> > > >>> + */
> > > >>> +struct vfio_iommu_type1_info_cap_nesting {
> > > >>> + struct  vfio_info_cap_header header;
> > > >>> + __u32   flags;  
> > > >> You may document flags.  
> > > >
> > > > sure. it's reserved for future.
> > > >
> > > > Regards,
> > > > Yi Liu
> > > >  
> > > >>> + __u32   padding;
> > > >>> + __u8info[];
> > > >>> +};
> > > >>> +
> > > >>>  #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> > > >>>
> > > >>>  /**
> > > >>>  
> > > >> Thanks
> > > >>
> > > >> Eric  
> > > >  
> 



Re: [PATCH v2 1/2] iommu: iommu_aux_at(de)tach_device() extension

2020-07-08 Thread Alex Williamson
On Wed, 8 Jul 2020 10:53:12 +0800
Lu Baolu  wrote:

> Hi Alex,
> 
> Thanks a lot for your comments. Please check my reply inline.
> 
> On 7/8/20 5:04 AM, Alex Williamson wrote:
> > On Tue,  7 Jul 2020 09:39:56 +0800
> > Lu Baolu  wrote:
> >   
> >> The hardware-assisted vfio mediated device is a use case of iommu
> >> aux-domain. The interactions between vfio/mdev and iommu during mdev
> >> creation and passthrough are:
> >>
> >> - Create a group for mdev with iommu_group_alloc();
> >> - Add the device to the group with
> >>  group = iommu_group_alloc();
> >>  if (IS_ERR(group))
> >>  return PTR_ERR(group);
> >>
> >>  ret = iommu_group_add_device(group, &mdev->dev);
> >>  if (!ret)
> >>  dev_info(&mdev->dev, "MDEV: group_id = %d\n",
> >>   iommu_group_id(group));
> >> - Allocate an aux-domain
> >>  iommu_domain_alloc()
> >> - Attach the aux-domain to the physical device from which the mdev is
> >>created.
> >>  iommu_aux_attach_device()
> >>
> >> In the whole process, an iommu group was allocated for the mdev and an
> >> iommu domain was attached to the group, but group->domain is left
> >> NULL. As a result, iommu_get_domain_for_dev() doesn't work anymore.
> >>
> >> The iommu_get_domain_for_dev() is a necessary interface for device
> >> drivers that want to support aux-domain. For example,
> >>
> >>  struct iommu_domain *domain;
> >>  struct device *dev = mdev_dev(mdev);
> >>  unsigned long pasid;
> >>
> >>  domain = iommu_get_domain_for_dev(dev);
> >>  if (!domain)
> >>  return -ENODEV;
> >>
> >>  pasid = iommu_aux_get_pasid(domain, dev->parent);  
> > How did we know this was an aux domain? ie. How did we know we could
> > use it with iommu_aux_get_pasid()?  
> 
> Yes. It's a bit confusing if iommu_get_domain_for_dev() is reused here
> for aux-domain.
> 
> > 
> > Why did we assume the parent device is the iommu device for the aux
> > domain?  Should that level of detail be already known by the aux domain?
> > 
> > Nits - The iommu device of an mdev device is found via
> > mdev_get_iommu_device(dev); it should not be assumed to be the parent.
> > The parent of an mdev device is found via mdev_parent_dev(mdev).  
> 
> My bad. The driver should use mdev_get_iommu_device() instead.
> 
> > 
> > The leaps in logic here make me wonder if we should instead be exposing
> > more of an aux domain API rather than blurring the differences between
> > these domains.  Thanks,  
> 
> How about add below API?
> 
> /**
>   * iommu_aux_get_domain_for_dev - get aux domain for a device
>   * @dev: the accessory device
>   *
>   * The caller should pass a valid @dev to iommu_aux_attach_device() before
>   * calling this api. Return an attached aux-domain, or NULL otherwise.

That's not necessarily the caller's responsibility, that might happen
elsewhere, this function simply returns an aux domain for the device if
it's attached to one.

>   */
> struct iommu_domain *iommu_aux_get_domain_for_dev(struct device *dev)
> {
>  struct iommu_domain *domain = NULL;
>  struct iommu_group *group;
> 
>  group = iommu_group_get(dev);
>  if (!group)
>  return NULL;
> 
>  if (group->aux_domain_attached)
>  domain = group->domain;
> 
>  iommu_group_put(group);
> 
>  return domain;
> }
> EXPORT_SYMBOL_GPL(iommu_aux_get_domain_for_dev);

For your example use case, this seems more clear to me.  Thanks,

Alex
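
For reference, a caller in an mdev driver could then look roughly like this
sketch; it assumes the iommu_aux_get_domain_for_dev() proposed above plus the
existing mdev_dev()/mdev_get_iommu_device() helpers:

	struct device *dev = mdev_dev(mdev);
	struct device *iommu_dev = mdev_get_iommu_device(dev);
	struct iommu_domain *domain;
	unsigned long pasid;

	domain = iommu_aux_get_domain_for_dev(dev);
	if (!domain)
		return -ENODEV;

	/* Ask with the iommu device, not the mdev's parent. */
	pasid = iommu_aux_get_pasid(domain, iommu_dev);
	if (pasid == IOASID_INVALID)
		return -EINVAL;

	/* Program the device context with the PASID value. */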



Re: [PATCH v3 1/5] docs: IOMMU user API

2020-07-07 Thread Alex Williamson
On Mon, 29 Jun 2020 16:05:18 -0700
Jacob Pan  wrote:

> On Fri, 26 Jun 2020 16:19:23 -0600
> Alex Williamson  wrote:
> 
> > On Tue, 23 Jun 2020 10:03:53 -0700
> > Jacob Pan  wrote:
> >   
> > > IOMMU UAPI is newly introduced to support communications between
> > > guest virtual IOMMU and host IOMMU. There has been lots of
> > > discussions on how it should work with VFIO UAPI and userspace in
> > > general.
> > > 
> > > This document is intended to clarify the UAPI design and usage. The
> > > mechanics of how future extensions should be achieved are also
> > > covered in this documentation.
> > > 
> > > Signed-off-by: Liu Yi L 
> > > Signed-off-by: Jacob Pan 
> > > ---
> > >  Documentation/userspace-api/iommu.rst | 244
> > > ++ 1 file changed, 244 insertions(+)
> > >  create mode 100644 Documentation/userspace-api/iommu.rst
> > > 
> > > diff --git a/Documentation/userspace-api/iommu.rst
> > > b/Documentation/userspace-api/iommu.rst new file mode 100644
> > > index ..f9e4ed90a413
> > > --- /dev/null
> > > +++ b/Documentation/userspace-api/iommu.rst
> > > @@ -0,0 +1,244 @@
> > > +.. SPDX-License-Identifier: GPL-2.0
> > > +.. iommu:
> > > +
> > > +=
> > > +IOMMU Userspace API
> > > +=
> > > +
> > > +IOMMU UAPI is used for virtualization cases where communications
> > > are +needed between physical and virtual IOMMU drivers. For native
> > > +usage, IOMMU is a system device which does not need to communicate
> > > +with user space directly.
> > > +
> > > +The primary use cases are guest Shared Virtual Address (SVA) and
> > > +guest IO virtual address (IOVA), wherein a virtual IOMMU (vIOMMU)
> > > is +required to communicate with the physical IOMMU in the host.
> > > +
> > > +.. contents:: :local:
> > > +
> > > +Functionalities
> > > +===
> > > +Communications of user and kernel involve both directions. The
> > > +supported user-kernel APIs are as follows:
> > > +
> > > +1. Alloc/Free PASID
> > > +2. Bind/unbind guest PASID (e.g. Intel VT-d)
> > > +3. Bind/unbind guest PASID table (e.g. ARM sMMU)
> > > +4. Invalidate IOMMU caches
> > > +5. Service page requests
> > > +
> > > +Requirements
> > > +
> > > +The IOMMU UAPIs are generic and extensible to meet the following
> > > +requirements:
> > > +
> > > +1. Emulated and para-virtualised vIOMMUs
> > > +2. Multiple vendors (Intel VT-d, ARM sMMU, etc.)
> > > +3. Extensions to the UAPI shall not break existing user space
> > > +
> > > +Interfaces
> > > +==
> > > +Although the data structures defined in IOMMU UAPI are
> > > self-contained, +there is no user API functions introduced.
> > > Instead, IOMMU UAPI is +designed to work with existing user driver
> > > frameworks such as VFIO. +
> > > +Extension Rules & Precautions
> > > +-
> > > +When IOMMU UAPI gets extended, the data structures can *only* be
> > > +modified in two ways:
> > > +
> > > +1. Adding new fields by re-purposing the padding[] field. No size
> > > change. +2. Adding new union members at the end. May increase in
> > > size. +
> > > +No new fields can be added *after* the variable sized union in
> > > that it +will break backward compatibility when offset moves. In
> > > both cases, a +new flag must be accompanied with a new field such
> > > that the IOMMU +driver can process the data based on the new flag.
> > > Version field is +only reserved for the unlikely event of UAPI
> > > upgrade at its entirety. +
> > > +It's *always* the caller's responsibility to indicate the size of
> > > the +structure passed by setting argsz appropriately.
> > > +Though at the same time, argsz is user provided data which is not
> > > +trusted. The argsz field allows the user to indicate how much data
> > > +they're providing, it's still the kernel's responsibility to
> > > validate +whether it's correct and sufficient for the requested
> > > operation. +
> > > +Compatibility Checking
> > > +--
> > > +When IOMMU UAPI extension results in size increase, user such as
> > > VFIO +has to handle 

Re: [PATCH v2 1/2] iommu: iommu_aux_at(de)tach_device() extension

2020-07-07 Thread Alex Williamson
On Tue,  7 Jul 2020 09:39:56 +0800
Lu Baolu  wrote:

> The hardware-assisted vfio mediated device is a use case of iommu
> aux-domain. The interactions between vfio/mdev and iommu during mdev
> creation and passthrough are:
> 
> - Create a group for mdev with iommu_group_alloc();
> - Add the device to the group with
> group = iommu_group_alloc();
> if (IS_ERR(group))
> return PTR_ERR(group);
> 
> ret = iommu_group_add_device(group, &mdev->dev);
> if (!ret)
> dev_info(&mdev->dev, "MDEV: group_id = %d\n",
>  iommu_group_id(group));
> - Allocate an aux-domain
> iommu_domain_alloc()
> - Attach the aux-domain to the physical device from which the mdev is
>   created.
> iommu_aux_attach_device()
> 
> In the whole process, an iommu group was allocated for the mdev and an
> iommu domain was attached to the group, but group->domain is left
> NULL. As a result, iommu_get_domain_for_dev() doesn't work anymore.
> 
> The iommu_get_domain_for_dev() is a necessary interface for device
> drivers that want to support aux-domain. For example,
> 
> struct iommu_domain *domain;
> struct device *dev = mdev_dev(mdev);
> unsigned long pasid;
> 
> domain = iommu_get_domain_for_dev(dev);
> if (!domain)
> return -ENODEV;
> 
> pasid = iommu_aux_get_pasid(domain, dev->parent);

How did we know this was an aux domain? ie. How did we know we could
use it with iommu_aux_get_pasid()?

Why did we assume the parent device is the iommu device for the aux
domain?  Should that level of detail be already known by the aux domain?

Nits - The iommu device of an mdev device is found via
mdev_get_iommu_device(dev); it should not be assumed to be the parent.
The parent of an mdev device is found via mdev_parent_dev(mdev).

The leaps in logic here make me wonder if we should instead be exposing
more of an aux domain API rather than blurring the differences between
these domains.  Thanks,

Alex

> if (pasid == IOASID_INVALID)
> return -EINVAL;
> 
>  /* Program the device context with the PASID value */
>  
> 
> This extends iommu_aux_at(de)tach_device() so that the users could pass
> in an optional device pointer (struct device for vfio/mdev for example),
> and the necessary check and data link could be done.
> 
> Fixes: a3a195929d40b ("iommu: Add APIs for multiple domains per device")
> Cc: Robin Murphy 
> Cc: Alex Williamson 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/iommu.c   | 86 +
>  drivers/vfio/vfio_iommu_type1.c |  5 +-
>  include/linux/iommu.h   | 12 +++--
>  3 files changed, 87 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 1ed1e14a1f0c..435835058209 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -2723,26 +2723,92 @@ EXPORT_SYMBOL_GPL(iommu_dev_feature_enabled);
>   * This should make us safe against a device being attached to a guest as a
>   * whole while there are still pasid users on it (aux and sva).
>   */
> -int iommu_aux_attach_device(struct iommu_domain *domain, struct device *dev)
> +int iommu_aux_attach_device(struct iommu_domain *domain,
> + struct device *phys_dev, struct device *dev)
>  {
> - int ret = -ENODEV;
> + struct iommu_group *group;
> + int ret;
>  
> - if (domain->ops->aux_attach_dev)
> - ret = domain->ops->aux_attach_dev(domain, dev);
> + if (!domain->ops->aux_attach_dev ||
> + !iommu_dev_feature_enabled(phys_dev, IOMMU_DEV_FEAT_AUX))
> + return -ENODEV;
>  
> - if (!ret)
> - trace_attach_device_to_domain(dev);
> + /* Bare use only. */
> + if (!dev) {
> + ret = domain->ops->aux_attach_dev(domain, phys_dev);
> + if (!ret)
> + trace_attach_device_to_domain(phys_dev);
> +
> + return ret;
> + }
> +
> + /*
> +  * The caller has created a made-up device (for example, vfio/mdev)
> +  * and allocated an iommu_group for user level direct assignment.
> +  * Make sure that the group has only single device and hasn't been
> +  * attached by any other domain.
> +  */
> + group = iommu_group_get(dev);
> + if (!group)
> + return -ENODEV;
> +
> + /*
> +  * Lock the group to make sure the device-count doesn't change while
> +  * we are attaching.
> +  */
> + mutex_lock(&group->mutex);
> + ret = -EINVAL;
>

Re: [PATCH v3 01/14] vfio/type1: Refactor vfio_iommu_type1_ioctl()

2020-07-02 Thread Alex Williamson
On Wed, 24 Jun 2020 01:55:14 -0700
Liu Yi L  wrote:

> This patch refactors vfio_iommu_type1_ioctl() to use a switch instead of
> if-else, and each cmd gets a helper function.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Cc: Joerg Roedel 
> Cc: Lu Baolu 
> Suggested-by: Christoph Hellwig 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 392 
> ++--
>  1 file changed, 213 insertions(+), 179 deletions(-)

I can go ahead and grab this one for my v5.9 next branch.  Thanks,

Alex
 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 5e556ac..7accb59 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2453,6 +2453,23 @@ static int vfio_domains_have_iommu_cache(struct 
> vfio_iommu *iommu)
>   return ret;
>  }
>  
> +static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
> + unsigned long arg)
> +{
> + switch (arg) {
> + case VFIO_TYPE1_IOMMU:
> + case VFIO_TYPE1v2_IOMMU:
> + case VFIO_TYPE1_NESTING_IOMMU:
> + return 1;
> + case VFIO_DMA_CC_IOMMU:
> + if (!iommu)
> + return 0;
> + return vfio_domains_have_iommu_cache(iommu);
> + default:
> + return 0;
> + }
> +}
> +
>  static int vfio_iommu_iova_add_cap(struct vfio_info_cap *caps,
>struct vfio_iommu_type1_info_cap_iova_range *cap_iovas,
>size_t size)
> @@ -2529,238 +2546,255 @@ static int vfio_iommu_migration_build_caps(struct 
> vfio_iommu *iommu,
>   return vfio_info_add_capability(caps, &cap_mig.header, sizeof(cap_mig));
>  }
>  
> -static long vfio_iommu_type1_ioctl(void *iommu_data,
> -unsigned int cmd, unsigned long arg)
> +static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu,
> +  unsigned long arg)
>  {
> - struct vfio_iommu *iommu = iommu_data;
> + struct vfio_iommu_type1_info info;
>   unsigned long minsz;
> + struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
> + unsigned long capsz;
> + int ret;
>  
> - if (cmd == VFIO_CHECK_EXTENSION) {
> - switch (arg) {
> - case VFIO_TYPE1_IOMMU:
> - case VFIO_TYPE1v2_IOMMU:
> - case VFIO_TYPE1_NESTING_IOMMU:
> - return 1;
> - case VFIO_DMA_CC_IOMMU:
> - if (!iommu)
> - return 0;
> - return vfio_domains_have_iommu_cache(iommu);
> - default:
> - return 0;
> - }
> - } else if (cmd == VFIO_IOMMU_GET_INFO) {
> - struct vfio_iommu_type1_info info;
> - struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
> - unsigned long capsz;
> - int ret;
> -
> - minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
> + minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
>  
> - /* For backward compatibility, cannot require this */
> - capsz = offsetofend(struct vfio_iommu_type1_info, cap_offset);
> + /* For backward compatibility, cannot require this */
> + capsz = offsetofend(struct vfio_iommu_type1_info, cap_offset);
>  
> - if (copy_from_user(&info, (void __user *)arg, minsz))
> - return -EFAULT;
> + if (copy_from_user(&info, (void __user *)arg, minsz))
> + return -EFAULT;
>  
> - if (info.argsz < minsz)
> - return -EINVAL;
> + if (info.argsz < minsz)
> + return -EINVAL;
>  
> - if (info.argsz >= capsz) {
> - minsz = capsz;
> - info.cap_offset = 0; /* output, no-recopy necessary */
> - }
> + if (info.argsz >= capsz) {
> + minsz = capsz;
> + info.cap_offset = 0; /* output, no-recopy necessary */
> + }
>  
> - mutex_lock(&iommu->lock);
> - info.flags = VFIO_IOMMU_INFO_PGSIZES;
> + mutex_lock(&iommu->lock);
> + info.flags = VFIO_IOMMU_INFO_PGSIZES;
>  
> - info.iova_pgsizes = iommu->pgsize_bitmap;
> + info.iova_pgsizes = iommu->pgsize_bitmap;
>  
> - ret = vfio_iommu_migration_build_caps(iommu, &caps);
> + ret = vfio_iommu_migration_build_caps(iommu, &caps);
>  
> - if (!ret)
> - ret = vf

Re: [PATCH v3 10/14] vfio/type1: Allow invalidating first-level/stage IOMMU cache

2020-07-02 Thread Alex Williamson
On Wed, 24 Jun 2020 01:55:23 -0700
Liu Yi L  wrote:

> This patch provides an interface allowing the userspace to invalidate
> IOMMU cache for first-level page table. It is required when the first
> level IOMMU page table is not managed by the host kernel in the nested
> translation setup.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Cc: Joerg Roedel 
> Cc: Lu Baolu 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Eric Auger 
> Signed-off-by: Jacob Pan 
> ---
> v1 -> v2:
> *) rename from "vfio/type1: Flush stage-1 IOMMU cache for nesting type"
> *) rename vfio_cache_inv_fn() to vfio_dev_cache_invalidate_fn()
> *) vfio_dev_cache_inv_fn() always successful
> *) remove VFIO_IOMMU_CACHE_INVALIDATE, and reuse VFIO_IOMMU_NESTING_OP
> ---
>  drivers/vfio/vfio_iommu_type1.c | 52 
> +
>  include/uapi/linux/vfio.h   |  3 +++
>  2 files changed, 55 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 5926533..4c21300 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -3080,6 +3080,53 @@ static long vfio_iommu_handle_pgtbl_op(struct 
> vfio_iommu *iommu,
>   return ret;
>  }
>  
> +static int vfio_dev_cache_invalidate_fn(struct device *dev, void *data)
> +{
> + struct domain_capsule *dc = (struct domain_capsule *)data;
> + unsigned long arg = *(unsigned long *) dc->data;
> +
> + iommu_cache_invalidate(dc->domain, dev, (void __user *) arg);
> + return 0;
> +}
> +
> +static long vfio_iommu_invalidate_cache(struct vfio_iommu *iommu,
> + unsigned long arg)
> +{
> + struct domain_capsule dc = { .data = &arg };
> + struct vfio_group *group;
> + struct vfio_domain *domain;
> + int ret = 0;
> + struct iommu_nesting_info *info;
> +
> + mutex_lock(&iommu->lock);
> + /*
> +  * Cache invalidation is required for any nesting IOMMU,
> +  * so no need to check system-wide PASID support.
> +  */
> + info = iommu->nesting_info;
> + if (!info || !(info->features & IOMMU_NESTING_FEAT_CACHE_INVLD)) {
> + ret = -ENOTSUPP;
> + goto out_unlock;
> + }
> +
> + group = vfio_find_nesting_group(iommu);
> + if (!group) {
> + ret = -EINVAL;
> + goto out_unlock;
> + }
> +
> + domain = list_first_entry(&iommu->domain_list,
> +   struct vfio_domain, next);
> + dc.group = group;
> + dc.domain = domain->domain;
> + iommu_group_for_each_dev(group->iommu_group, &dc,
> +  vfio_dev_cache_invalidate_fn);
> +
> +out_unlock:
> + mutex_unlock(&iommu->lock);
> + return ret;
> +}
> +
>  static long vfio_iommu_type1_nesting_op(struct vfio_iommu *iommu,
>   unsigned long arg)
>  {
> @@ -3102,6 +3149,11 @@ static long vfio_iommu_type1_nesting_op(struct 
> vfio_iommu *iommu,
>   case VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL:
>   ret = vfio_iommu_handle_pgtbl_op(iommu, false, arg + minsz);
>   break;
> + case VFIO_IOMMU_NESTING_OP_CACHE_INVLD:
> + {
> + ret = vfio_iommu_invalidate_cache(iommu, arg + minsz);
> + break;
> + }


Why the {} brackets?  Thanks,

Alex


>   default:
>   ret = -EINVAL;
>   }
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 2c9def8..7f8678e 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1213,6 +1213,8 @@ struct vfio_iommu_type1_pasid_request {
>   * +-+---+
>   * | UNBIND_PGTBL|  struct iommu_gpasid_bind_data|
>   * +-+---+
> + * | CACHE_INVLD |  struct iommu_cache_invalidate_info   |
> + * +-+---+
>   *
>   * returns: 0 on success, -errno on failure.
>   */
> @@ -1225,6 +1227,7 @@ struct vfio_iommu_type1_nesting_op {
>  
>  #define VFIO_IOMMU_NESTING_OP_BIND_PGTBL (0)
>  #define VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL   (1)
> +#define VFIO_IOMMU_NESTING_OP_CACHE_INVLD(2)
>  
>  #define VFIO_IOMMU_NESTING_OP_IO(VFIO_TYPE, VFIO_BASE + 19)
>  



Re: [PATCH v3 09/14] vfio/type1: Support binding guest page tables to PASID

2020-07-02 Thread Alex Williamson
On Wed, 24 Jun 2020 01:55:22 -0700
Liu Yi L  wrote:

> Nesting translation allows two-level/stage page tables, with the 1st level
> for guest translations (e.g. GVA->GPA) and the 2nd level for host
> translations (e.g. GPA->HPA). This patch adds an interface for binding guest
> page tables to a PASID. This PASID must have been allocated to user space
> before the binding request.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Cc: Joerg Roedel 
> Cc: Lu Baolu 
> Signed-off-by: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Jacob Pan 
> ---
> v2 -> v3:
> *) use __iommu_sva_unbind_gpasid() for unbind call issued by VFIO
> https://lore.kernel.org/linux-iommu/1592931837-58223-6-git-send-email-jacob.jun@linux.intel.com/
> 
> v1 -> v2:
> *) rename subject from "vfio/type1: Bind guest page tables to host"
> *) remove VFIO_IOMMU_BIND, introduce VFIO_IOMMU_NESTING_OP to support bind/
>    unbind guest page table
> *) replaced vfio_iommu_for_each_dev() with a group level loop since this
>series enforces one group per container w/ nesting type as start.
> *) rename vfio_bind/unbind_gpasid_fn() to vfio_dev_bind/unbind_gpasid_fn()
> *) vfio_dev_unbind_gpasid() always successful
> *) use vfio_mm->pasid_lock to avoid race between PASID free and page table
>bind/unbind
> ---
>  drivers/vfio/vfio_iommu_type1.c | 169 
> 
>  drivers/vfio/vfio_pasid.c   |  30 +++
>  include/linux/vfio.h|  20 +
>  include/uapi/linux/vfio.h   |  30 +++
>  4 files changed, 249 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index d0891c5..5926533 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -148,6 +148,33 @@ struct vfio_regions {
>  #define DIRTY_BITMAP_PAGES_MAX((u64)INT_MAX)
>  #define DIRTY_BITMAP_SIZE_MAX 
> DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
>  
> +struct domain_capsule {
> + struct vfio_group *group;
> + struct iommu_domain *domain;
> + void *data;
> +};
> +
> +/* iommu->lock must be held */
> +static struct vfio_group *vfio_find_nesting_group(struct vfio_iommu *iommu)
> +{
> + struct vfio_domain *d;
> + struct vfio_group *g, *group = NULL;
> +
> + if (!iommu->nesting_info)
> + return NULL;
> +
> + /* only support singleton container with nesting type */
> + list_for_each_entry(d, &iommu->domain_list, next) {
> + list_for_each_entry(g, &d->group_list, next) {
> + if (!group) {
> + group = g;
> + break;
> + }


We break out of the inner loop only to pointlessly continue in the
outer loop when we could simply return g and remove the second group
pointer altogether (use "group" instead of "g" if so).
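
i.e., something along the lines of this sketch:

	/* iommu->lock must be held */
	static struct vfio_group *vfio_find_nesting_group(struct vfio_iommu *iommu)
	{
		struct vfio_domain *d;
		struct vfio_group *group;

		if (!iommu->nesting_info)
			return NULL;

		/* only support singleton container with nesting type */
		list_for_each_entry(d, &iommu->domain_list, next)
			list_for_each_entry(group, &d->group_list, next)
				return group;

		return NULL;
	}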


> + }
> + }
> + return group;
> +}
> +
>  static int put_pfn(unsigned long pfn, int prot);
>  
>  static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu 
> *iommu,
> @@ -2351,6 +2378,48 @@ static int vfio_iommu_resv_refresh(struct vfio_iommu 
> *iommu,
>   return ret;
>  }
>  
> +static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data)
> +{
> + struct domain_capsule *dc = (struct domain_capsule *)data;
> + unsigned long arg = *(unsigned long *) dc->data;
> +
> + return iommu_sva_bind_gpasid(dc->domain, dev, (void __user *) arg);
> +}
> +
> +static int vfio_dev_unbind_gpasid_fn(struct device *dev, void *data)
> +{
> + struct domain_capsule *dc = (struct domain_capsule *)data;
> + unsigned long arg = *(unsigned long *) dc->data;
> +
> + iommu_sva_unbind_gpasid(dc->domain, dev, (void __user *) arg);
> + return 0;
> +}
> +
> +static int __vfio_dev_unbind_gpasid_fn(struct device *dev, void *data)
> +{
> + struct domain_capsule *dc = (struct domain_capsule *)data;
> + struct iommu_gpasid_bind_data *unbind_data =
> + (struct iommu_gpasid_bind_data *) dc->data;
> +
> + __iommu_sva_unbind_gpasid(dc->domain, dev, unbind_data);
> + return 0;
> +}
> +
> +static void vfio_group_unbind_gpasid_fn(ioasid_t pasid, void *data)
> +{
> + struct domain_capsule *dc = (struct domain_capsule *) data;
> + struct iommu_gpasid_bind_data unbind_data;
> +
> + unbind_data.argsz = offsetof(struct iommu_gpasid_bind_data, vendor);
> + unbind_data.flags = 0;
> + unbind_data.hpasid =

Re: [PATCH v3 06/14] vfio/type1: Add VFIO_IOMMU_PASID_REQUEST (alloc/free)

2020-07-02 Thread Alex Williamson
On Wed, 24 Jun 2020 01:55:19 -0700
Liu Yi L  wrote:

> This patch allows user space to request PASID allocation/free, e.g. when
> serving the request from the guest.
> 
> PASIDs that are not freed by userspace are automatically freed when the
> IOASID set is destroyed when process exits.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Cc: Joerg Roedel 
> Cc: Lu Baolu 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Yi Sun 
> Signed-off-by: Jacob Pan 
> ---
> v1 -> v2:
> *) move the vfio_mm related code to be a separate module
> *) use a single structure for alloc/free, could support a range of PASIDs
> *) fetch vfio_mm at group_attach time instead of at iommu driver open time
> ---
>  drivers/vfio/Kconfig|  1 +
>  drivers/vfio/vfio_iommu_type1.c | 96 
> -
>  drivers/vfio/vfio_pasid.c   | 10 +
>  include/linux/vfio.h|  6 +++
>  include/uapi/linux/vfio.h   | 36 
>  5 files changed, 147 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index 3d8a108..95d90c6 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -2,6 +2,7 @@
>  config VFIO_IOMMU_TYPE1
>   tristate
>   depends on VFIO
> + select VFIO_PASID if (X86)
>   default n
>  
>  config VFIO_IOMMU_SPAPR_TCE
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 8c143d5..d0891c5 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -73,6 +73,7 @@ struct vfio_iommu {
>   boolv2;
>   boolnesting;
>   struct iommu_nesting_info *nesting_info;
> + struct vfio_mm  *vmm;

Structure alignment again.

>   booldirty_page_tracking;
>   boolpinned_page_dirty_scope;
>  };
> @@ -1933,6 +1934,17 @@ static void vfio_iommu_iova_insert_copy(struct 
> vfio_iommu *iommu,
>  
>   list_splice_tail(iova_copy, iova);
>  }
> +
> +static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu)
> +{
> + if (iommu->vmm) {
> + vfio_mm_put(iommu->vmm);
> + iommu->vmm = NULL;
> + }
> +
> + kfree(iommu->nesting_info);

iommu->nesting_info = NULL;

> +}
> +
>  static int vfio_iommu_type1_attach_group(void *iommu_data,
>struct iommu_group *iommu_group)
>  {
> @@ -2067,6 +2079,25 @@ static int vfio_iommu_type1_attach_group(void 
> *iommu_data,
>   goto out_detach;
>   }
>   iommu->nesting_info = info;
> +
> + if (info->features & IOMMU_NESTING_FEAT_SYSWIDE_PASID) {
> + struct vfio_mm *vmm;
> + int sid;
> +
> + vmm = vfio_mm_get_from_task(current);
> + if (IS_ERR(vmm)) {
> + ret = PTR_ERR(vmm);
> + goto out_detach;
> + }
> + iommu->vmm = vmm;
> +
> + sid = vfio_mm_ioasid_sid(vmm);
> + ret = iommu_domain_set_attr(domain->domain,
> + DOMAIN_ATTR_IOASID_SID,
> + &sid);

This looks pretty dicey in the case of !CONFIG_VFIO_PASID, can we get
here in that case?  If so it looks like we're doing bad things with
setting the domain->ioasid_sid.

> + if (ret)
> + goto out_detach;
> + }
>   }
>  
>   /* Get aperture info */
> @@ -2178,7 +2209,8 @@ static int vfio_iommu_type1_attach_group(void 
> *iommu_data,
>   return 0;
>  
>  out_detach:
> - kfree(iommu->nesting_info);
> + if (iommu->nesting_info)
> + vfio_iommu_release_nesting_info(iommu);

Make vfio_iommu_release_nesting_info() check iommu->nesting_info, then
call it unconditionally?
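
i.e., roughly (sketch):

	static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu)
	{
		if (!iommu->nesting_info)
			return;

		if (iommu->vmm) {
			vfio_mm_put(iommu->vmm);
			iommu->vmm = NULL;
		}

		kfree(iommu->nesting_info);
		iommu->nesting_info = NULL;
	}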

>   vfio_iommu_detach_group(domain, group);
>  out_domain:
>   iommu_domain_free(domain->domain);
> @@ -2380,7 +2412,8 @@ static void vfio_iommu_type1_detach_group(void 
> *iommu_data,
>   else
>   vfio_iommu_unmap_unpin_reaccount(iommu);
>  
> - kfree(iommu->nesting_info);
> + if (iommu->nesting_info)
> + vfio_iommu_release_nesting_info(iommu);
>

Re: [PATCH v3 04/14] vfio: Add PASID allocation/free support

2020-07-02 Thread Alex Williamson
On Wed, 24 Jun 2020 01:55:17 -0700
Liu Yi L  wrote:

> Shared Virtual Addressing (a.k.a Shared Virtual Memory) allows sharing
> multiple process virtual address spaces with the device for simplified
> programming model. PASID is used to tag a virtual address space in DMA
> requests and to identify the related translation structure in the IOMMU. When
> a PASID-capable device is assigned to a VM, we want the same capability
> of using PASID to tag guest process virtual address spaces to achieve
> virtual SVA (vSVA).
> 
> PASID management for guest is vendor specific. Some vendors (e.g. Intel
> VT-d) require system-wide managed PASIDs across all devices, regardless
> of whether a device is used by host or assigned to guest. Other vendors
> (e.g. ARM SMMU) may allow PASIDs managed per-device thus could be fully
> delegated to the guest for assigned devices.
> 
> For system-wide managed PASIDs, this patch introduces a vfio module to
> handle explicit PASID alloc/free requests from guest. Allocated PASIDs
> are associated to a process (or, mm_struct) in IOASID core. A vfio_mm
> object is introduced to track mm_struct. Multiple VFIO containers within
> a process share the same vfio_mm object.
> 
> A quota mechanism is provided to prevent malicious user from exhausting
> available PASIDs. Currently the quota is a global parameter applied to
> all VFIO devices. In the future per-device quota might be supported too.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Cc: Joerg Roedel 
> Cc: Lu Baolu 
> Suggested-by: Alex Williamson 
> Signed-off-by: Liu Yi L 
> ---
> v1 -> v2:
> *) added in v2, split from the pasid alloc/free support of v1
> ---
>  drivers/vfio/Kconfig  |   5 ++
>  drivers/vfio/Makefile |   1 +
>  drivers/vfio/vfio_pasid.c | 151 
> ++
>  include/linux/vfio.h  |  28 +
>  4 files changed, 185 insertions(+)
>  create mode 100644 drivers/vfio/vfio_pasid.c
> 
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index fd17db9..3d8a108 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -19,6 +19,11 @@ config VFIO_VIRQFD
>   depends on VFIO && EVENTFD
>   default n
>  
> +config VFIO_PASID
> + tristate
> + depends on IOASID && VFIO
> + default n
> +
>  menuconfig VFIO
>   tristate "VFIO Non-Privileged userspace driver framework"
>   depends on IOMMU_API
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index de67c47..bb836a3 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -3,6 +3,7 @@ vfio_virqfd-y := virqfd.o
>  
>  obj-$(CONFIG_VFIO) += vfio.o
>  obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
> +obj-$(CONFIG_VFIO_PASID) += vfio_pasid.o
>  obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
>  obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>  obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
> diff --git a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c
> new file mode 100644
> index 000..dd5b6d1
> --- /dev/null
> +++ b/drivers/vfio/vfio_pasid.c
> @@ -0,0 +1,151 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2020 Intel Corporation.
> + * Author: Liu Yi L 
> + *
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define DRIVER_VERSION  "0.1"
> +#define DRIVER_AUTHOR   "Liu Yi L "
> +#define DRIVER_DESC "PASID management for VFIO bus drivers"
> +
> +#define VFIO_DEFAULT_PASID_QUOTA 1000
> +static int pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> +module_param_named(pasid_quota, pasid_quota, uint, 0444);
> +MODULE_PARM_DESC(pasid_quota,
> +  " Set the quota for max number of PASIDs that an application 
> is allowed to request (default 1000)");
> +
> +struct vfio_mm_token {
> + unsigned long long val;
> +};
> +
> +struct vfio_mm {
> + struct kref kref;
> + struct vfio_mm_tokentoken;
> + int ioasid_sid;
> + int pasid_quota;
> + struct list_headnext;
> +};
> +
> +static struct vfio_pasid {
> + struct mutexvfio_mm_lock;
> + struct list_headvfio_mm_list;
> +} vfio_pasid;
> +
> +/* called with vfio.vfio_mm_lock held */
> +static void vfio_mm_release(struct kref *kref)
> +{
> + struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
> +
> + list_del(&vmm->next);
> + mutex_unlock(&vfio_pasid.vfio_mm_lock);
> + ioasid_free_set(vmm->ioasid_

Re: [PATCH v3 03/14] vfio/type1: Report iommu nesting info to userspace

2020-07-02 Thread Alex Williamson
On Wed, 24 Jun 2020 01:55:16 -0700
Liu Yi L  wrote:

> This patch exports iommu nesting capability info to user space through
> VFIO. User space is expected to check this info for supported uAPIs (e.g.
> PASID alloc/free, bind page table, and cache invalidation) and the vendor
> specific format information for first level/stage page table that will be
> bound to.
> 
> The nesting info is available only after the nesting iommu type is set
> for a container. Current implementation imposes one limitation - one
> nesting container should include at most one group. The philosophy of
> vfio container is having all groups/devices within the container share
> the same IOMMU context. When vSVA is enabled, one IOMMU context could
> include one 2nd-level address space and multiple 1st-level address spaces.
> While the 2nd-level address space is reasonably sharable by multiple groups,
> blindly sharing 1st-level address spaces across all groups within the
> container might instead break the guest expectation. In the future sub/
> super container concept might be introduced to allow partial address space
> sharing within an IOMMU context. But for now let's go with this restriction
> by requiring singleton container for using nesting iommu features. Below
> link has the related discussion about this decision.
> 
> https://lkml.org/lkml/2020/5/15/1028
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Cc: Joerg Roedel 
> Cc: Lu Baolu 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 73 
> +
>  include/uapi/linux/vfio.h   |  9 +
>  2 files changed, 82 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 7accb59..8c143d5 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -72,6 +72,7 @@ struct vfio_iommu {
>   uint64_tpgsize_bitmap;
>   boolv2;
>   boolnesting;
> + struct iommu_nesting_info *nesting_info;
>   booldirty_page_tracking;
>   boolpinned_page_dirty_scope;
>  };

Mind the structure packing and alignment, placing a pointer in the
middle of a section of bools is going to create wasteful holes in the
data structure.
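
e.g., grouping the pointer with the other pointer-sized members avoids the
holes (sketch of the relevant fields only):

	struct vfio_iommu {
		...
		uint64_t			pgsize_bitmap;
		struct iommu_nesting_info	*nesting_info;
		bool				v2;
		bool				nesting;
		bool				dirty_page_tracking;
		bool				pinned_page_dirty_scope;
	};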

> @@ -130,6 +131,9 @@ struct vfio_regions {
>  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)  \
>   (!list_empty(>domain_list))
>  
> +#define IS_DOMAIN_IN_CONTAINER(iommu)((iommu->external_domain) || \
> +  (!list_empty(>domain_list)))
> +
>  #define DIRTY_BITMAP_BYTES(n)(ALIGN(n, BITS_PER_TYPE(u64)) / 
> BITS_PER_BYTE)
>  
>  /*
> @@ -1959,6 +1963,12 @@ static int vfio_iommu_type1_attach_group(void 
> *iommu_data,
>   }
>   }
>  
> + /* Nesting type container can include only one group */
> + if (iommu->nesting && IS_DOMAIN_IN_CONTAINER(iommu)) {
> + mutex_unlock(&iommu->lock);
> + return -EINVAL;
> + }
> +
>   group = kzalloc(sizeof(*group), GFP_KERNEL);
>   domain = kzalloc(sizeof(*domain), GFP_KERNEL);
>   if (!group || !domain) {
> @@ -2029,6 +2039,36 @@ static int vfio_iommu_type1_attach_group(void 
> *iommu_data,
>   if (ret)
>   goto out_domain;
>  
> + /* Nesting cap info is available only after attaching */
> + if (iommu->nesting) {
> + struct iommu_nesting_info tmp;
> + struct iommu_nesting_info *info;
> +
> + /* First get the size of vendor specific nesting info */
> + ret = iommu_domain_get_attr(domain->domain,
> + DOMAIN_ATTR_NESTING,
> + &tmp);
> + if (ret)
> + goto out_detach;
> +
> + info = kzalloc(tmp.size, GFP_KERNEL);
> + if (!info) {
> + ret = -ENOMEM;
> + goto out_detach;
> + }
> +
> + /* Now get the nesting info */
> + info->size = tmp.size;
> + ret = iommu_domain_get_attr(domain->domain,
> + DOMAIN_ATTR_NESTING,
> + info);
> + if (ret) {
> + kfree(info);
> + goto out_detach;
> + }
> + iommu->nesting_info = info;
> + }
> +
>   /* Get aperture info */
>   iommu_domain_get_attr(do

Re: [PATCH v3 02/14] iommu: Report domain nesting info

2020-07-02 Thread Alex Williamson
On Wed, 24 Jun 2020 01:55:15 -0700
Liu Yi L  wrote:

> IOMMUs that support nesting translation need to report the capability info
> to userspace, e.g. the format of first level/stage paging structures.
> 
> This patch reports nesting info by DOMAIN_ATTR_NESTING. Caller can get
> nesting info after setting DOMAIN_ATTR_NESTING.
> 
> v2 -> v3:
> *) remove cap/ecap_mask in iommu_nesting_info.
> *) reuse DOMAIN_ATTR_NESTING to get nesting info.
> *) return an empty iommu_nesting_info for SMMU drivers per Jean's
>    suggestion.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Cc: Joerg Roedel 
> Cc: Lu Baolu 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/arm-smmu-v3.c | 29 --
>  drivers/iommu/arm-smmu.c| 29 --
>  include/uapi/linux/iommu.h  | 59 
> +
>  3 files changed, 113 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index f578677..0c45d4d 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -3019,6 +3019,32 @@ static struct iommu_group 
> *arm_smmu_device_group(struct device *dev)
>   return group;
>  }
>  
> +static int arm_smmu_domain_nesting_info(struct arm_smmu_domain *smmu_domain,
> + void *data)
> +{
> + struct iommu_nesting_info *info = (struct iommu_nesting_info *) data;
> + u32 size;
> +
> + if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
> + return -ENODEV;
> +
> + size = sizeof(struct iommu_nesting_info);
> +
> + /*
> +  * if provided buffer size is not equal to the size, should
> +  * return 0 and also the expected buffer size to caller.
> +  */
> + if (info->size != size) {
> + info->size = size;
> + return 0;
> + }
> +
> + /* report an empty iommu_nesting_info for now */
> + memset(info, 0x0, size);
> + info->size = size;
> + return 0;
> +}
> +
>  static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
>   enum iommu_attr attr, void *data)
>  {
> @@ -3028,8 +3054,7 @@ static int arm_smmu_domain_get_attr(struct iommu_domain 
> *domain,
>   case IOMMU_DOMAIN_UNMANAGED:
>   switch (attr) {
>   case DOMAIN_ATTR_NESTING:
> - *(int *)data = (smmu_domain->stage == 
> ARM_SMMU_DOMAIN_NESTED);
> - return 0;
> + return arm_smmu_domain_nesting_info(smmu_domain, data);
>   default:
>   return -ENODEV;
>   }
> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> index 243bc4c..908607d 100644
> --- a/drivers/iommu/arm-smmu.c
> +++ b/drivers/iommu/arm-smmu.c
> @@ -1506,6 +1506,32 @@ static struct iommu_group 
> *arm_smmu_device_group(struct device *dev)
>   return group;
>  }
>  
> +static int arm_smmu_domain_nesting_info(struct arm_smmu_domain *smmu_domain,
> + void *data)
> +{
> + struct iommu_nesting_info *info = (struct iommu_nesting_info *) data;
> + u32 size;
> +
> + if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
> + return -ENODEV;
> +
> + size = sizeof(struct iommu_nesting_info);
> +
> + /*
> +  * if provided buffer size is not equal to the size, should
> +  * return 0 and also the expected buffer size to caller.
> +  */
> + if (info->size != size) {
> + info->size = size;
> + return 0;
> + }
> +
> + /* report an empty iommu_nesting_info for now */
> + memset(info, 0x0, size);
> + info->size = size;
> + return 0;
> +}
> +
>  static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
>   enum iommu_attr attr, void *data)
>  {
> @@ -1515,8 +1541,7 @@ static int arm_smmu_domain_get_attr(struct iommu_domain 
> *domain,
>   case IOMMU_DOMAIN_UNMANAGED:
>   switch (attr) {
>   case DOMAIN_ATTR_NESTING:
> - *(int *)data = (smmu_domain->stage == 
> ARM_SMMU_DOMAIN_NESTED);
> - return 0;
> + return arm_smmu_domain_nesting_info(smmu_domain, data);
>   default:
>   return -ENODEV;
>   }
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 1afc661..898c99a 100644
> --- a

[PATCH] vfio/pci: Add Intel X550 to hidden INTx devices

2020-07-01 Thread Alex Williamson
Intel document 333717-008, "Intel® Ethernet Controller X550
Specification Update", version 2.7, dated June 2020, includes errata
#22, added in version 2.1, May 2016, indicating X550 NICs suffer from
the same implementation deficiency as the 700-series NICs:

"The Interrupt Status bit in the Status register of the PCIe
 configuration space is not implemented and is not set as described
 in the PCIe specification."

Without the interrupt status bit, vfio-pci cannot determine when
these devices signal INTx.  They are therefore added to the nointx
quirk.

Cc: Jesse Brandeburg 
Signed-off-by: Alex Williamson 
---
 drivers/vfio/pci/vfio_pci.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index f634c81998bb..9968dc0f87a3 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -207,6 +207,8 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
case 0x1580 ... 0x1581:
case 0x1583 ... 0x158b:
case 0x37d0 ... 0x37d2:
+   /* X550 */
+   case 0x1563:
return true;
default:
return false;



Re: [PATCH] vfio/pci: Fix SR-IOV VF handling with MMIO blocking

2020-06-27 Thread Alex Williamson
On Sun, 28 Jun 2020 03:12:12 +
"Wang, Haiyue"  wrote:

> > -Original Message-
> > From: kvm-ow...@vger.kernel.org  On Behalf Of 
> > Alex Williamson
> > Sent: Friday, June 26, 2020 00:57
> > To: alex.william...@redhat.com
> > Cc: k...@vger.kernel.org; linux-kernel@vger.kernel.org; 
> > maxime.coque...@redhat.com
> > Subject: [PATCH] vfio/pci: Fix SR-IOV VF handling with MMIO blocking
> > 
> > SR-IOV VFs do not implement the memory enable bit of the command
> > register, therefore this bit is not set in config space after
> > pci_enable_device().  This leads to an unintended difference
> > between PF and VF in hand-off state to the user.  We can correct
> > this by setting the initial value of the memory enable bit in our
> > virtualized config space.  There's really no need however to
> > ever fault a user on a VF though as this would only indicate an
> > error in the user's management of the enable bit, versus a PF
> > where the same access could trigger hardware faults.
> > 
> > Fixes: abafbc551fdd ("vfio-pci: Invalidate mmaps and block MMIO access on 
> > disabled memory")
> > Signed-off-by: Alex Williamson 
> > ---
> >  drivers/vfio/pci/vfio_pci_config.c |   17 -
> >  1 file changed, 16 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci_config.c 
> > b/drivers/vfio/pci/vfio_pci_config.c
> > index 8746c943247a..d98843feddce 100644
> > --- a/drivers/vfio/pci/vfio_pci_config.c
> > +++ b/drivers/vfio/pci/vfio_pci_config.c
> > @@ -398,9 +398,15 @@ static inline void p_setd(struct perm_bits *p, int 
> > off, u32 virt, u32 write)
> >  /* Caller should hold memory_lock semaphore */
> >  bool __vfio_pci_memory_enabled(struct vfio_pci_device *vdev)
> >  {
> > +   struct pci_dev *pdev = vdev->pdev;
> > u16 cmd = le16_to_cpu(*(__le16 *)&vdev->vconfig[PCI_COMMAND]);
> > 
> > -   return cmd & PCI_COMMAND_MEMORY;
> > +   /*
> > +* SR-IOV VF memory enable is handled by the MSE bit in the
> > +* PF SR-IOV capability, there's therefore no need to trigger
> > +* faults based on the virtual value.
> > +*/
> > +   return pdev->is_virtfn || (cmd & PCI_COMMAND_MEMORY);  
> 
> Hi Alex,
> 
> After setting up the initial copy of config space for the memory enable bit
> for the VF, is it worth triggering SIGBUS in a bad user space process which
> intentionally tries to disable the memory access command (even if it is a VF)
> and then accesses the memory, to trigger CVE-2020-12888?

We're essentially only trying to catch the user in mismanaging the
enable bit if we trigger a fault based on the virtualized enabled bit,
right?  There's no risk that the VF would trigger a UR based on the
state of our virtual enable bit.  So is it worth triggering a user
fault when, for instance, the user might be aware that the device is a
VF and know that the memory enable bit is not relative to the physical
device?  Thanks,

Alex

> >  }
> > 
> >  /*
> > @@ -1728,6 +1734,15 @@ int vfio_config_init(struct vfio_pci_device *vdev)
> >  vconfig[PCI_INTERRUPT_PIN]);
> > 
> > vconfig[PCI_INTERRUPT_PIN] = 0; /* Gratuitous for good VFs */
> > +
> > +   /*
> > +* VFs do not implement the memory enable bit of the COMMAND
> > +* register therefore we'll not have it set in our initial
> > +* copy of config space after pci_enable_device().  For
> > +* consistency with PFs, set the virtual enable bit here.
> > +*/
> > +   *(__le16 *)&vconfig[PCI_COMMAND] |=
> > +   cpu_to_le16(PCI_COMMAND_MEMORY);
> > }
> > 
> > if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) || vdev->nointx)  
> 



[GIT PULL] VFIO fixes for v5.8-rc3

2020-06-27 Thread Alex Williamson
Hi Linus,

The following changes since commit b3a9e3b9622ae10064826dccb4f7a52bd88c7407:

  Linux 5.8-rc1 (2020-06-14 12:45:04 -0700)

are available in the Git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v5.8-rc3

for you to fetch changes up to ebfa440ce38b7e2e04c3124aa89c8a9f4094cf21:

  vfio/pci: Fix SR-IOV VF handling with MMIO blocking (2020-06-25 11:04:23 
-0600)


VFIO fixes for v5.8-rc3

 - Fix double free of eventfd ctx (Alex Williamson)

 - Fix duplicate use of capability ID (Alex Williamson)

 - Fix SR-IOV VF memory enable handling (Alex Williamson)


Alex Williamson (3):
  vfio/pci: Clear error and request eventfd ctx after releasing
  vfio/type1: Fix migration info capability ID
  vfio/pci: Fix SR-IOV VF handling with MMIO blocking

 drivers/vfio/pci/vfio_pci.c|  8 ++--
 drivers/vfio/pci/vfio_pci_config.c | 17 -
 include/uapi/linux/vfio.h  |  2 +-
 3 files changed, 23 insertions(+), 4 deletions(-)



Re: [PATCH v3 1/5] docs: IOMMU user API

2020-06-26 Thread Alex Williamson
On Tue, 23 Jun 2020 10:03:53 -0700
Jacob Pan  wrote:

> IOMMU UAPI is newly introduced to support communications between guest
> virtual IOMMU and host IOMMU. There has been lots of discussions on how
> it should work with VFIO UAPI and userspace in general.
> 
> This document is intended to clarify the UAPI design and usage. The
> mechanics of how future extensions should be achieved are also covered
> in this documentation.
> 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Jacob Pan 
> ---
>  Documentation/userspace-api/iommu.rst | 244 
> ++
>  1 file changed, 244 insertions(+)
>  create mode 100644 Documentation/userspace-api/iommu.rst
> 
> diff --git a/Documentation/userspace-api/iommu.rst 
> b/Documentation/userspace-api/iommu.rst
> new file mode 100644
> index ..f9e4ed90a413
> --- /dev/null
> +++ b/Documentation/userspace-api/iommu.rst
> @@ -0,0 +1,244 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. iommu:
> +
> +=
> +IOMMU Userspace API
> +=
> +
> +IOMMU UAPI is used for virtualization cases where communications are
> +needed between physical and virtual IOMMU drivers. For native
> +usage, IOMMU is a system device which does not need to communicate
> +with user space directly.
> +
> +The primary use cases are guest Shared Virtual Address (SVA) and
> +guest IO virtual address (IOVA), wherein a virtual IOMMU (vIOMMU) is
> +required to communicate with the physical IOMMU in the host.
> +
> +.. contents:: :local:
> +
> +Functionalities
> +===
> +Communications of user and kernel involve both directions. The
> +supported user-kernel APIs are as follows:
> +
> +1. Alloc/Free PASID
> +2. Bind/unbind guest PASID (e.g. Intel VT-d)
> +3. Bind/unbind guest PASID table (e.g. ARM sMMU)
> +4. Invalidate IOMMU caches
> +5. Service page requests
> +
> +Requirements
> +
> +The IOMMU UAPIs are generic and extensible to meet the following
> +requirements:
> +
> +1. Emulated and para-virtualised vIOMMUs
> +2. Multiple vendors (Intel VT-d, ARM sMMU, etc.)
> +3. Extensions to the UAPI shall not break existing user space
> +
> +Interfaces
> +==
> +Although the data structures defined in IOMMU UAPI are self-contained,
> +there are no user API functions introduced. Instead, IOMMU UAPI is
> +designed to work with existing user driver frameworks such as VFIO.
> +
> +Extension Rules & Precautions
> +-----------------------------
> +When IOMMU UAPI gets extended, the data structures can *only* be
> +modified in two ways:
> +
> +1. Adding new fields by re-purposing the padding[] field. No size change.
> +2. Adding new union members at the end. May increase in size.
> +
> +No new fields can be added *after* the variable sized union in that it
> +will break backward compatibility when offset moves. In both cases, a
> +new flag must be accompanied with a new field such that the IOMMU
> +driver can process the data based on the new flag. Version field is
> +only reserved for the unlikely event of UAPI upgrade at its entirety.
> +
> +It's *always* the caller's responsibility to indicate the size of the
> +structure passed by setting argsz appropriately.
> +Though at the same time, argsz is user provided data which is not
> +trusted. The argsz field allows the user to indicate how much data
> +they're providing; it's still the kernel's responsibility to validate
> +whether it's correct and sufficient for the requested operation.
> +
> +Compatibility Checking
> +----------------------
> +When an IOMMU UAPI extension results in a size increase, a user such as VFIO
> +has to handle the following cases:
> +
> +1. User and kernel has exact size match
> +2. An older user with older kernel header (smaller UAPI size) running on a
> +   newer kernel (larger UAPI size)
> +3. A newer user with newer kernel header (larger UAPI size) running
> +   on an older kernel.
> +4. A malicious/misbehaving user passes an illegal/invalid size that is within
> +   range. The data may contain garbage.

What exactly does vfio need to do to handle these?
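
For illustration of the cases above, a minimal sketch of the kind of screening a
consumer could do before handing the data on, assuming only a structure that
starts with argsz; "struct iommu_foo" and foo_copy_from_user() are made-up
names, not the proposed UAPI:

#include <linux/types.h>
#include <linux/stddef.h>
#include <linux/uaccess.h>	/* get_user(), copy_from_user() */

struct iommu_foo {			/* placeholder layout, not a real UAPI struct */
	__u32	argsz;
	__u32	flags;
	__u8	padding[8];
	union {				/* stands in for the variable sized union */
		__u64	a;
		__u64	b[4];
	} u;
};

static int foo_copy_from_user(struct iommu_foo *kbuf, const void __user *ubuf)
{
	__u32 minsz = offsetofend(struct iommu_foo, padding);
	__u32 argsz;

	if (get_user(argsz, (__u32 __user *)ubuf))
		return -EFAULT;

	/* Case 4: an argsz smaller than the fixed part is always invalid */
	if (argsz < minsz)
		return -EINVAL;

	/*
	 * Cases 2/3: copy no more than this kernel understands; a newer
	 * user on an older kernel only gets the fields the kernel knows
	 * about, and any flag it set for a newer field is then rejected
	 * by the IOMMU driver.
	 */
	if (argsz > sizeof(*kbuf))
		argsz = sizeof(*kbuf);

	return copy_from_user(kbuf, ubuf, argsz) ? -EFAULT : 0;
}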

> +
> +Feature Checking
> +================
> +While launching a guest with vIOMMU, it is important to ensure that host
> +can support the UAPI data structures to be used for vIOMMU-pIOMMU
> +communications. Without upfront compatibility checking, future faults
> +are difficult to report even in normal conditions. For example, TLB
> +invalidations should always succeed. There is no architectural way to
> +report back to the vIOMMU if the UAPI data is incompatible. If that
> +happens, in order to protect the IOMMU isolation guarantee, we have to
> +resort to not giving completion status in vIOMMU. This may result in
> +VM hang.
> +
> +For this reason the following IOMMU UAPIs cannot fail:
> +
> +1. Free PASID
> +2. Unbind guest PASID
> +3. Unbind guest PASID table (SMMU)
> +4. Cache invalidate
> +
> +User applications such as QEMU are expected to import kernel UAPI
> +headers. Backward 

[PATCH] vfio/pci: Fix SR-IOV VF handling with MMIO blocking

2020-06-25 Thread Alex Williamson
SR-IOV VFs do not implement the memory enable bit of the command
register, therefore this bit is not set in config space after
pci_enable_device().  This leads to an unintended difference
between PF and VF in hand-off state to the user.  We can correct
this by setting the initial value of the memory enable bit in our
virtualized config space.  There's really no need, however, to
ever fault a user on a VF, as this would only indicate an
error in the user's management of the enable bit, versus a PF
where the same access could trigger hardware faults.

Fixes: abafbc551fdd ("vfio-pci: Invalidate mmaps and block MMIO access on 
disabled memory")
Signed-off-by: Alex Williamson 
---
 drivers/vfio/pci/vfio_pci_config.c |   17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index 8746c943247a..d98843feddce 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -398,9 +398,15 @@ static inline void p_setd(struct perm_bits *p, int off, 
u32 virt, u32 write)
 /* Caller should hold memory_lock semaphore */
 bool __vfio_pci_memory_enabled(struct vfio_pci_device *vdev)
 {
+   struct pci_dev *pdev = vdev->pdev;
u16 cmd = le16_to_cpu(*(__le16 *)&vdev->vconfig[PCI_COMMAND]);
 
-   return cmd & PCI_COMMAND_MEMORY;
+   /*
+* SR-IOV VF memory enable is handled by the MSE bit in the
+* PF SR-IOV capability, there's therefore no need to trigger
+* faults based on the virtual value.
+*/
+   return pdev->is_virtfn || (cmd & PCI_COMMAND_MEMORY);
 }
 
 /*
@@ -1728,6 +1734,15 @@ int vfio_config_init(struct vfio_pci_device *vdev)
 vconfig[PCI_INTERRUPT_PIN]);
 
vconfig[PCI_INTERRUPT_PIN] = 0; /* Gratuitous for good VFs */
+
+   /*
+   * VFs do not implement the memory enable bit of the COMMAND
+* register therefore we'll not have it set in our initial
+* copy of config space after pci_enable_device().  For
+* consistency with PFs, set the virtual enable bit here.
+*/
+   *(__le16 *)&vconfig[PCI_COMMAND] |=
+   cpu_to_le16(PCI_COMMAND_MEMORY);
}
 
if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) || vdev->nointx)
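
As a quick illustration of the user-visible effect, the virtual memory enable
bit can be read back through the vfio-pci config region.  A minimal userspace
sketch, assuming the device fd is already open and with error handling omitted:

#include <linux/pci_regs.h>
#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

static void dump_mem_enable(int device_fd)
{
	struct vfio_region_info info = {
		.argsz = sizeof(info),
		.index = VFIO_PCI_CONFIG_REGION_INDEX,
	};
	uint16_t cmd;

	ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info);

	/* PCI_COMMAND is offset 0x4 of config space */
	pread(device_fd, &cmd, sizeof(cmd), info.offset + PCI_COMMAND);

	printf("memory enable: %d\n", !!(cmd & PCI_COMMAND_MEMORY));
}

With the patch a freshly opened VF reports the bit set, matching the PF
hand-off state, while writes to the command register remain virtualized as
before.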



Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION

2020-06-19 Thread Alex Williamson
On Wed, 10 Jun 2020 01:23:14 -0400
Yan Zhao  wrote:

> On Fri, Jun 05, 2020 at 10:13:01AM -0600, Alex Williamson wrote:
> > On Thu, 4 Jun 2020 22:02:31 -0400
> > Yan Zhao  wrote:
> >   
> > > On Wed, Jun 03, 2020 at 10:10:58PM -0600, Alex Williamson wrote:  
> > > > On Wed, 3 Jun 2020 22:42:28 -0400
> > > > Yan Zhao  wrote:
> > > > 
> > > > > On Wed, Jun 03, 2020 at 05:04:52PM -0600, Alex Williamson wrote:
> > > > > > On Tue, 2 Jun 2020 21:40:58 -0400
> > > > > > Yan Zhao  wrote:
> > > > > >   
> > > > > > > On Tue, Jun 02, 2020 at 01:34:35PM -0600, Alex Williamson wrote:  
> > > > > > > 
> > > > > > > > I'm not at all happy with this.  Why do we need to hide the 
> > > > > > > > migration
> > > > > > > > sparse mmap from the user until migration time?  What if 
> > > > > > > > instead we
> > > > > > > > introduced a new VFIO_REGION_INFO_CAP_SPARSE_MMAP_SAVING 
> > > > > > > > capability
> > > > > > > > where the existing capability is the normal runtime sparse 
> > > > > > > > setup and
> > > > > > > > the user is required to use this new one prior to enabled 
> > > > > > > > device_state
> > > > > > > > with _SAVING.  The vendor driver could then simply track mmap 
> > > > > > > > vmas to
> > > > > > > > the region and refuse to change device_state if there are 
> > > > > > > > outstanding
> > > > > > > > mmaps conflicting with the _SAVING sparse mmap layout.  No new 
> > > > > > > > IRQs
> > > > > > > > required, no new irqfds, an incremental change to the protocol,
> > > > > > > > backwards compatible to the extent that a vendor driver 
> > > > > > > > requiring this
> > > > > > > > will automatically fail migration.
> > > > > > > > 
> > > > > > > right. looks we need to use this approach to solve the problem.
> > > > > > > thanks for your guide.
> > > > > > > so I'll abandon the current remap irq way for dirty tracking 
> > > > > > > during live
> > > > > > > migration.
> > > > > > > but anyway, it demos how to customize irq_types in vendor drivers.
> > > > > > > then, what do you think about patches 1-5?  
> > > > > > 
> > > > > > In broad strokes, I don't think we've found the right solution yet. 
> > > > > >  I
> > > > > > really question whether it's supportable to parcel out vfio-pci like
> > > > > > this and I don't know how I'd support unraveling whether we have a 
> > > > > > bug
> > > > > > in vfio-pci, the vendor driver, or how the vendor driver is making 
> > > > > > use
> > > > > > of vfio-pci.
> > > > > >
> > > > > > Let me also ask, why does any of this need to be in the kernel?  We
> > > > > > spend 5 patches slicing up vfio-pci so that we can register a vendor
> > > > > > driver and have that vendor driver call into vfio-pci as it sees 
> > > > > > fit.
> > > > > > We have two patches creating device specific interrupts and a BAR
> > > > > > remapping scheme that we've decided we don't need.  That brings us 
> > > > > > to
> > > > > > the actual i40e vendor driver, where the first patch is simply 
> > > > > > making
> > > > > > the vendor driver work like vfio-pci already does, the second patch 
> > > > > > is
> > > > > > handling the migration region, and the third patch is implementing 
> > > > > > the
> > > > > > BAR remapping IRQ that we decided we don't need.  It's difficult to
> > > > > > actually find the small bit of code that's required to support
> > > > > > migration outside of just dealing with the protocol we've defined to
> > > > > > expose this from the kernel.  So why are we trying to do this in the
> > > > > > kernel?  We have quirk support in QEMU, we can easily flip
> > > > > > MemoryRegions on and off, etc.  What 
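
For reference, the existing capability that the proposed _SAVING variant would
sit alongside is found by walking the region info capability chain.  A rough
userspace sketch of that lookup, using only the current uAPI and with error
handling omitted:

#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <stdlib.h>
#include <stdio.h>

static void dump_sparse_mmap(int device_fd, __u32 index)
{
	struct vfio_region_info *info;
	struct vfio_info_cap_header *hdr;
	__u32 argsz = sizeof(*info);

	info = calloc(1, argsz);
	info->argsz = argsz;
	info->index = index;
	ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, info);

	if (info->argsz > argsz) {	/* capability chain present, re-fetch */
		argsz = info->argsz;
		info = realloc(info, argsz);
		info->argsz = argsz;
		info->index = index;
		ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, info);
	}

	if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS))
		goto out;

	hdr = (void *)info + info->cap_offset;
	for (;;) {
		if (hdr->id == VFIO_REGION_INFO_CAP_SPARSE_MMAP) {
			struct vfio_region_info_cap_sparse_mmap *sparse =
							(void *)hdr;
			__u32 i;

			for (i = 0; i < sparse->nr_areas; i++)
				printf("area %u: offset 0x%llx size 0x%llx\n", i,
				       (unsigned long long)sparse->areas[i].offset,
				       (unsigned long long)sparse->areas[i].size);
		}
		if (!hdr->next)
			break;
		hdr = (void *)info + hdr->next;
	}
out:
	free(info);
}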

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-06-19 Thread Alex Williamson
On Tue, 9 Jun 2020 20:37:31 -0400
Yan Zhao  wrote:

> On Fri, Jun 05, 2020 at 03:39:50PM +0100, Dr. David Alan Gilbert wrote:
> > > > > I tried to simplify the problem a bit, but we keep going backwards.  
> > > > > If
> > > > > the requirement is that potentially any source device can migrate to 
> > > > > any
> > > > > target device and we cannot provide any means other than writing an
> > > > > opaque source string into a version attribute on the target and
> > > > > evaluating the result to determine compatibility, then we're requiring
> > > > > userspace to do an exhaustive search to find a potential match.  That
> > > > > sucks. 
> > > >  
> hi Alex and Dave,
> do you think it's good for us to put aside physical devices and mdev 
> aggregation
> for the moment, and use Alex's original idea that
> 
> +  Userspace should regard two mdev devices compatible when ALL of below
> +  conditions are met:
> +  (0) The mdev devices are of the same type
> +  (1) success when reading from migration_version attribute of one mdev 
> device.
> +  (2) success when writing migration_version string of one mdev device to
> +  migration_version attribute of the other mdev device.

I think Pandora's box is already opened, if we can't articulate how
this solution would evolve to support features that we know are coming,
why should we proceed with this approach?  We've already seen interest
in breaking rule (0) in this thread, so we can't focus the solution on
mdev devices.

Maybe the best we can do is to compare one instance of a device to
another instance of a device, without any capability to predict
compatibility prior to creating devices, in the case of mdev.  The
string would need to include not only the device and vendor driver
compatibility, but also anything that has modified the state of the
device, such as creation time or post-creation time configuration.  The
user is left on their own for creating a compatible device, or
filtering devices to determine which might be, or which might generate,
compatible devices.  It's not much of a solution, I wonder if anyone
would even use it.

> and what about adding another sysfs attribute for vendors to put
> recommended migration compatible device type. e.g.
> #cat 
> /sys/bus/pci/devices/:00:02.0/mdev_supported_types/i915-GVTg_V5_8/migration_compatible_devices
> parent id: 8086 591d
> mdev_type: i915-GVTg_V5_8
> 
> vendors are free to define the format and content of this 
> migration_compatible_devices
> and it does not even have to be a full list.
> 
> before libvirt or user to do live migration, they have to read and test
> migration_version attributes of src/target devices to check migration 
> compatibility.

AFAICT, free-form, vendor defined attributes are useless to libvirt.
Vendors could already put this information in the description attribute
and have it ignored by userspace tools due to the lack of defined
format.  It's also not clear what value this provides when it's
necessarily incomplete, a driver written today cannot know what future
drivers might be compatible with its migration data.  Thanks,

Alex
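
For completeness, the handshake quoted above reduces to a read on the source
device and a write on the destination.  A small sketch of that flow, where the
two paths point at the respective migration_version attributes as proposed
(attribute location not final) and the helper name is made up:

#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if the destination accepts the source's version string */
static int mdev_migration_compatible(const char *src_attr, const char *dst_attr)
{
	char buf[4096];
	ssize_t len;
	int fd, ret = 0;

	fd = open(src_attr, O_RDONLY);		/* (1) read must succeed */
	if (fd < 0)
		return 0;
	len = read(fd, buf, sizeof(buf));
	close(fd);
	if (len <= 0)
		return 0;

	fd = open(dst_attr, O_WRONLY);		/* (2) write must succeed */
	if (fd < 0)
		return 0;
	if (write(fd, buf, len) == len)
		ret = 1;
	close(fd);

	return ret;
}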



Re: [PATCH v2 1/3] docs: IOMMU user API

2020-06-19 Thread Alex Williamson
On Fri, 19 Jun 2020 03:30:24 +
"Liu, Yi L"  wrote:

> Hi Alex,
> 
> > From: Alex Williamson 
> > Sent: Friday, June 19, 2020 10:55 AM
> > 
> > On Fri, 19 Jun 2020 02:15:36 +
> > "Liu, Yi L"  wrote:
> >   
> > > Hi Alex,
> > >  
> > > > From: Alex Williamson 
> > > > Sent: Friday, June 19, 2020 5:48 AM
> > > >
> > > > On Wed, 17 Jun 2020 08:28:24 +
> > > > "Tian, Kevin"  wrote:
> > > >  
> > > > > > From: Liu, Yi L 
> > > > > > Sent: Wednesday, June 17, 2020 2:20 PM
> > > > > >  
> > > > > > > From: Jacob Pan 
> > > > > > > Sent: Tuesday, June 16, 2020 11:22 PM
> > > > > > >
> > > > > > > On Thu, 11 Jun 2020 17:27:27 -0700 Jacob Pan
> > > > > > >  wrote:
> > > > > > >  
> > > > > > > > >
> > > > > > > > > But then I thought it even better if VFIO leaves the
> > > > > > > > > entire
> > > > > > > > > copy_from_user() to the layer consuming it.
> > > > > > > > >  
> > > > > > > > OK. Sounds good, that was what Kevin suggested also. I just
> > > > > > > > wasn't sure how much VFIO wants to inspect, I thought VFIO
> > > > > > > > layer wanted to do a sanity check.
> > > > > > > >
> > > > > > > > Anyway, I will move copy_from_user to iommu uapi layer.  
> > > > > > >
> > > > > > > Just one more point brought up by Yi when we discuss this offline.
> > > > > > >
> > > > > > > If we move copy_from_user to iommu uapi layer, then there will
> > > > > > > be  
> > > > > > multiple  
> > > > > > > copy_from_user calls for the same data when a VFIO container
> > > > > > > has  
> > > > > > multiple domains,  
> > > > > > > devices. For bind, it might be OK. But might be additional
> > > > > > > overhead for TLB  
> > > > > > flush  
> > > > > > > request from the guest.  
> > > > > >
> > > > > > I think it is the same with bind and TLB flush path. will be
> > > > > > multiple copy_from_user.  
> > > > >
> > > > > multiple copies is possibly fine. In reality we allow only one
> > > > > group per nesting container (as described in patch [03/15]), and
> > > > > usually there is just one SVA-capable device per group.
> > > > >  
> > > > > >
> > > > > > BTW. for moving data copy to iommu layer, there is another point
> > > > > > which need to consider. VFIO needs to do unbind in bind path if
> > > > > > bind failed, so it will assemble unbind_data and pass to iommu
> > > > > > layer. If iommu layer do the copy_from_user, I think it will be 
> > > > > > failed. any  
> > idea?  
> > > >
> > > > If a call into a UAPI fails, there should be nothing to undo.
> > > > Creating a partial setup for a failed call that needs to be undone
> > > > by the caller is not good practice.  
> > >
> > > is it still a problem if it's the VFIO to undo the partial setup
> > > before returning to user space?  
> > 
> > Yes.  If a UAPI function fails there should be no residual effect.  
> 
> ok. the iommu_sva_bind_gpasid() is per device call. There is no residual
> effect if it failed. so no partial setup will happen per device.
> 
> but VFIO needs to use iommu_group_for_each_dev() to do bind, so
> if iommu_group_for_each_dev() failed, I guess VFIO needs to undo
> the partial setup for the group. right?

Yes, each individual call should make no changes if it fails, but the
caller would need to unwind separate calls.  If this introduces too
much knowledge to the caller for the group case, maybe there should be
a group-level function in the iommu code to handle that.  Thanks,

Alex

> > > > > This might be mitigated if we go back to use the same bind_data
> > > > > for both bind/unbind. Then you can reuse the user object for 
> > > > > unwinding.
> > > > >
> > > > > However there is another case where VFIO may need to assemble the
> > > > > bind_data itself. When a VM is killed, VFIO needs to walk
> > > > > allocated PASIDs and unbind them one-by-one. In such case
> > > > > copy_from_user doesn't work since the data is created by kernel.
> > > > > Alex, do you have a suggestion how this usage can be supported?
> > > > > e.g. asking IOMMU driver to provide two sets of APIs to handle 
> > > > > user/kernel  
> > generated requests?  
> > > >
> > > > Yes, it seems like vfio would need to make use of a driver API to do
> > > > this, we shouldn't be faking a user buffer in the kernel in order to
> > > > call through to a UAPI.  Thanks,  
> > >
> > > ok, so if VFIO wants to issue unbind by itself, it should use an API
> > > which passes kernel buffer to IOMMU layer. If the unbind request is
> > > from user space, then VFIO should use another API which passes user
> > > buffer pointer to IOMMU layer. makes sense. will align with jacob.  
> > 
> > Sounds right to me.  Different approaches might be used for the driver API 
> > versus
> > the UAPI, perhaps there is no buffer.  Thanks,  
> 
> thanks for your coaching. It may require Jacob to add APIs in iommu layer
> for the two purposes.
> 
> Regards,
> Yi Liu
> 
> > Alex  
> 
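
A bare-bones sketch of the unwind responsibility described above, with
vfio_bind_one()/vfio_unbind_one() as hypothetical per-device callbacks rather
than functions from the series:

#include <linux/device.h>
#include <linux/iommu.h>

/* hypothetical per-device bind/unbind callbacks */
static int vfio_bind_one(struct device *dev, void *data);
static int vfio_unbind_one(struct device *dev, void *data);

static int vfio_group_bind_gpasid(struct iommu_group *group, void *data)
{
	int ret;

	ret = iommu_group_for_each_dev(group, data, vfio_bind_one);
	if (ret)
		/*
		 * Each per-device call left no residual state on failure,
		 * so the caller only unwinds the devices already bound;
		 * unbinding a never-bound device is treated as a no-op.
		 */
		iommu_group_for_each_dev(group, data, vfio_unbind_one);

	return ret;
}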



[PATCH v2] vfio: Cleanup allowed driver naming

2020-06-19 Thread Alex Williamson
No functional change, avoid non-inclusive naming schemes.

Signed-off-by: Alex Williamson 
---

v2: Wrap vfio_dev_driver_allowed to 80 column for consistency,
checkpatch no longer warns about this.

 drivers/vfio/vfio.c |   13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 580099afeaff..262ab0efd06c 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -627,9 +627,10 @@ static struct vfio_device *vfio_group_get_device(struct 
vfio_group *group,
  * that error notification via MSI can be affected for platforms that handle
  * MSI within the same IOVA space as DMA.
  */
-static const char * const vfio_driver_whitelist[] = { "pci-stub" };
+static const char * const vfio_driver_allowed[] = { "pci-stub" };
 
-static bool vfio_dev_whitelisted(struct device *dev, struct device_driver *drv)
+static bool vfio_dev_driver_allowed(struct device *dev,
+   struct device_driver *drv)
 {
if (dev_is_pci(dev)) {
struct pci_dev *pdev = to_pci_dev(dev);
@@ -638,8 +639,8 @@ static bool vfio_dev_whitelisted(struct device *dev, struct 
device_driver *drv)
return true;
}
 
-   return match_string(vfio_driver_whitelist,
-   ARRAY_SIZE(vfio_driver_whitelist),
+   return match_string(vfio_driver_allowed,
+   ARRAY_SIZE(vfio_driver_allowed),
drv->name) >= 0;
 }
 
@@ -648,7 +649,7 @@ static bool vfio_dev_whitelisted(struct device *dev, struct 
device_driver *drv)
  * one of the following states:
  *  - driver-less
  *  - bound to a vfio driver
- *  - bound to a whitelisted driver
+ *  - bound to an otherwise allowed driver
  *  - a PCI interconnect device
  *
  * We use two methods to determine whether a device is bound to a vfio
@@ -674,7 +675,7 @@ static int vfio_dev_viable(struct device *dev, void *data)
}
mutex_unlock(&group->unbound_lock);
 
-   if (!ret || !drv || vfio_dev_whitelisted(dev, drv))
+   if (!ret || !drv || vfio_dev_driver_allowed(dev, drv))
return 0;
 
device = vfio_group_get_device(group, dev);



Re: [PATCH] vfio: Cleanup allowed driver naming

2020-06-19 Thread Alex Williamson
On Fri, 19 Jun 2020 00:18:02 -0700
Christoph Hellwig  wrote:

> On Thu, Jun 18, 2020 at 01:57:18PM -0600, Alex Williamson wrote:
> > No functional change, avoid non-inclusive naming schemes.  
> 
> Adding a bunch of overly long lines that don't change anything are
> everything but an improvement.

In fact, 3 of 5 code change lines are shorter, the other two are 3
characters longer each and arguably more descriptive.  One line does now
exceed 80 columns, though checkpatch no longer cares.  I'll send a v2
with that line wrapped.  Thanks,

Alex



Re: [PATCH v2 1/3] docs: IOMMU user API

2020-06-18 Thread Alex Williamson
On Fri, 19 Jun 2020 02:15:36 +
"Liu, Yi L"  wrote:

> Hi Alex,
> 
> > From: Alex Williamson 
> > Sent: Friday, June 19, 2020 5:48 AM
> > 
> > On Wed, 17 Jun 2020 08:28:24 +
> > "Tian, Kevin"  wrote:
> >   
> > > > From: Liu, Yi L 
> > > > Sent: Wednesday, June 17, 2020 2:20 PM
> > > >  
> > > > > From: Jacob Pan 
> > > > > Sent: Tuesday, June 16, 2020 11:22 PM
> > > > >
> > > > > On Thu, 11 Jun 2020 17:27:27 -0700
> > > > > Jacob Pan  wrote:
> > > > >  
> > > > > > >
> > > > > > > But then I thought it even better if VFIO leaves the entire
> > > > > > > copy_from_user() to the layer consuming it.
> > > > > > >  
> > > > > > OK. Sounds good, that was what Kevin suggested also. I just wasn't
> > > > > > sure how much VFIO wants to inspect, I thought VFIO layer wanted to 
> > > > > > do
> > > > > > a sanity check.
> > > > > >
> > > > > > Anyway, I will move copy_from_user to iommu uapi layer.  
> > > > >
> > > > > Just one more point brought up by Yi when we discuss this offline.
> > > > >
> > > > > If we move copy_from_user to iommu uapi layer, then there will be  
> > > > multiple  
> > > > > copy_from_user calls for the same data when a VFIO container has  
> > > > multiple domains,  
> > > > > devices. For bind, it might be OK. But might be additional overhead 
> > > > > for TLB  
> > > > flush  
> > > > > request from the guest.  
> > > >
> > > > I think it is the same with bind and TLB flush path. will be multiple
> > > > copy_from_user.  
> > >
> > > multiple copies is possibly fine. In reality we allow only one group per
> > > nesting container (as described in patch [03/15]), and usually there
> > > is just one SVA-capable device per group.
> > >  
> > > >
> > > > BTW. for moving data copy to iommu layer, there is another point which
> > > > need to consider. VFIO needs to do unbind in bind path if bind failed,
> > > > so it will assemble unbind_data and pass to iommu layer. If iommu layer
> > > > do the copy_from_user, I think it will be failed. any idea?  
> > 
> > If a call into a UAPI fails, there should be nothing to undo.  Creating
> > a partial setup for a failed call that needs to be undone by the caller
> > is not good practice.  
> 
> is it still a problem if it's the VFIO to undo the partial setup before
> returning to user space?

Yes.  If a UAPI function fails there should be no residual effect.
 
> > > This might be mitigated if we go back to use the same bind_data for both
> > > bind/unbind. Then you can reuse the user object for unwinding.
> > >
> > > However there is another case where VFIO may need to assemble the
> > > bind_data itself. When a VM is killed, VFIO needs to walk allocated PASIDs
> > > and unbind them one-by-one. In such case copy_from_user doesn't work
> > > since the data is created by kernel. Alex, do you have a suggestion how 
> > > this
> > > usage can be supported? e.g. asking IOMMU driver to provide two sets of
> > > APIs to handle user/kernel generated requests?  
> > 
> > Yes, it seems like vfio would need to make use of a driver API to do
> > this, we shouldn't be faking a user buffer in the kernel in order to
> > call through to a UAPI.  Thanks,  
> 
> ok, so if VFIO wants to issue unbind by itself, it should use an API which
> passes kernel buffer to IOMMU layer. If the unbind request is from user
> space, then VFIO should use another API which passes user buffer pointer
> to IOMMU layer. makes sense. will align with jacob.

Sounds right to me.  Different approaches might be used for the driver
API versus the UAPI, perhaps there is no buffer.  Thanks,

Alex



Re: [PATCH v2 1/3] docs: IOMMU user API

2020-06-18 Thread Alex Williamson
On Wed, 17 Jun 2020 08:28:24 +
"Tian, Kevin"  wrote:

> > From: Liu, Yi L 
> > Sent: Wednesday, June 17, 2020 2:20 PM
> >   
> > > From: Jacob Pan 
> > > Sent: Tuesday, June 16, 2020 11:22 PM
> > >
> > > On Thu, 11 Jun 2020 17:27:27 -0700
> > > Jacob Pan  wrote:
> > >  
> > > > >
> > > > > But then I thought it even better if VFIO leaves the entire
> > > > > copy_from_user() to the layer consuming it.
> > > > >  
> > > > OK. Sounds good, that was what Kevin suggested also. I just wasn't
> > > > sure how much VFIO wants to inspect, I thought VFIO layer wanted to do
> > > > a sanity check.
> > > >
> > > > Anyway, I will move copy_from_user to iommu uapi layer.  
> > >
> > > Just one more point brought up by Yi when we discuss this offline.
> > >
> > > If we move copy_from_user to iommu uapi layer, then there will be  
> > multiple  
> > > copy_from_user calls for the same data when a VFIO container has  
> > multiple domains,  
> > > devices. For bind, it might be OK. But might be additional overhead for 
> > > TLB  
> > flush  
> > > request from the guest.  
> > 
> > I think it is the same with bind and TLB flush path. will be multiple
> > copy_from_user.  
> 
> multiple copies is possibly fine. In reality we allow only one group per
> nesting container (as described in patch [03/15]), and usually there
> is just one SVA-capable device per group.
> 
> > 
> > BTW. for moving data copy to iommu layer, there is another point which
> > need to consider. VFIO needs to do unbind in bind path if bind failed,
> > so it will assemble unbind_data and pass to iommu layer. If iommu layer
> > do the copy_from_user, I think it will be failed. any idea?

If a call into a UAPI fails, there should be nothing to undo.  Creating
a partial setup for a failed call that needs to be undone by the caller
is not good practice.

> This might be mitigated if we go back to use the same bind_data for both
> bind/unbind. Then you can reuse the user object for unwinding.
> 
> However there is another case where VFIO may need to assemble the
> bind_data itself. When a VM is killed, VFIO needs to walk allocated PASIDs
> and unbind them one-by-one. In such case copy_from_user doesn't work
> since the data is created by kernel. Alex, do you have a suggestion how this
> usage can be supported? e.g. asking IOMMU driver to provide two sets of
> APIs to handle user/kernel generated requests?

Yes, it seems like vfio would need to make use of a driver API to do
this, we shouldn't be faking a user buffer in the kernel in order to
call through to a UAPI.  Thanks,

Alex



[PATCH] vfio: Cleanup allowed driver naming

2020-06-18 Thread Alex Williamson
No functional change, avoid non-inclusive naming schemes.

Signed-off-by: Alex Williamson 
---
 drivers/vfio/vfio.c |   12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 580099afeaff..833da937b7fc 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -627,9 +627,9 @@ static struct vfio_device *vfio_group_get_device(struct 
vfio_group *group,
  * that error notification via MSI can be affected for platforms that handle
  * MSI within the same IOVA space as DMA.
  */
-static const char * const vfio_driver_whitelist[] = { "pci-stub" };
+static const char * const vfio_driver_allowed[] = { "pci-stub" };
 
-static bool vfio_dev_whitelisted(struct device *dev, struct device_driver *drv)
+static bool vfio_dev_driver_allowed(struct device *dev, struct device_driver 
*drv)
 {
if (dev_is_pci(dev)) {
struct pci_dev *pdev = to_pci_dev(dev);
@@ -638,8 +638,8 @@ static bool vfio_dev_whitelisted(struct device *dev, struct 
device_driver *drv)
return true;
}
 
-   return match_string(vfio_driver_whitelist,
-   ARRAY_SIZE(vfio_driver_whitelist),
+   return match_string(vfio_driver_allowed,
+   ARRAY_SIZE(vfio_driver_allowed),
drv->name) >= 0;
 }
 
@@ -648,7 +648,7 @@ static bool vfio_dev_whitelisted(struct device *dev, struct 
device_driver *drv)
  * one of the following states:
  *  - driver-less
  *  - bound to a vfio driver
- *  - bound to a whitelisted driver
+ *  - bound to an otherwise allowed driver
  *  - a PCI interconnect device
  *
  * We use two methods to determine whether a device is bound to a vfio
@@ -674,7 +674,7 @@ static int vfio_dev_viable(struct device *dev, void *data)
}
mutex_unlock(&group->unbound_lock);
 
-   if (!ret || !drv || vfio_dev_whitelisted(dev, drv))
+   if (!ret || !drv || vfio_dev_driver_allowed(dev, drv))
return 0;
 
device = vfio_group_get_device(group, dev);



[PATCH] vfio/type1: Fix migration info capability ID

2020-06-18 Thread Alex Williamson
ID 1 is already used by the IOVA range capability, use ID 2.

Reported-by: Liu Yi L 
Cc: Kirti Wankhede 
Fixes: ad721705d09c ("vfio iommu: Add migration capability to report supported 
features")
Signed-off-by: Alex Williamson 
---
 include/uapi/linux/vfio.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index eca6692667a3..920470502329 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1030,7 +1030,7 @@ struct vfio_iommu_type1_info_cap_iova_range {
  * size in bytes that can be used by user applications when getting the dirty
  * bitmap.
  */
-#define VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION  1
+#define VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION  2
 
 struct vfio_iommu_type1_info_cap_migration {
struct  vfio_info_cap_header header;



Re: [PATCH AUTOSEL 5.7 280/388] vfio/pci: fix memory leaks of eventfd ctx

2020-06-17 Thread Alex Williamson
On Wed, 17 Jun 2020 21:06:17 -0400
Sasha Levin  wrote:

> From: Qian Cai 
> 
> [ Upstream commit 1518ac272e789cae8c555d69951b032a275b7602 ]
> 
> Finished a qemu-kvm (-device vfio-pci,host=0001:01:00.0) triggers a few
> memory leaks after a while because vfio_pci_set_ctx_trigger_single()
> calls eventfd_ctx_fdget() without the matching eventfd_ctx_put() later.
> Fix it by calling eventfd_ctx_put() for those memory in
> vfio_pci_release() before vfio_device_release().
> 
> unreferenced object 0xebff008981cc2b00 (size 128):
>   comm "qemu-kvm", pid 4043, jiffies 4294994816 (age 9796.310s)
>   hex dump (first 32 bytes):
> 01 00 00 00 6b 6b 6b 6b 00 00 00 00 ad 4e ad de  .N..
> ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff  
>   backtrace:
> [<917e8f8d>] slab_post_alloc_hook+0x74/0x9c
> [<df0f2aa2>] kmem_cache_alloc_trace+0x2b4/0x3d4
> [<5fcec025>] do_eventfd+0x54/0x1ac
> [<82791a69>] __arm64_sys_eventfd2+0x34/0x44
> [<b819758c>] do_el0_svc+0x128/0x1dc
> [<b244e810>] el0_sync_handler+0xd0/0x268
> [<d495ef94>] el0_sync+0x164/0x180
> unreferenced object 0x29ff008981cc4180 (size 128):
>   comm "qemu-kvm", pid 4043, jiffies 4294994818 (age 9796.290s)
>   hex dump (first 32 bytes):
> 01 00 00 00 6b 6b 6b 6b 00 00 00 00 ad 4e ad de  .N..
> ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff  
>   backtrace:
> [<917e8f8d>] slab_post_alloc_hook+0x74/0x9c
> [<df0f2aa2>] kmem_cache_alloc_trace+0x2b4/0x3d4
> [<5fcec025>] do_eventfd+0x54/0x1ac
> [<82791a69>] __arm64_sys_eventfd2+0x34/0x44
> [<0000b819758c>] do_el0_svc+0x128/0x1dc
> [<b244e810>] el0_sync_handler+0xd0/0x268
> [<d495ef94>] el0_sync+0x164/0x180
> 
> Signed-off-by: Qian Cai 
> Signed-off-by: Alex Williamson 
> Signed-off-by: Sasha Levin 
> ---
>  drivers/vfio/pci/vfio_pci.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 6c6b37b5c04e..080e6608f297 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -519,6 +519,10 @@ static void vfio_pci_release(void *device_data)
>   vfio_pci_vf_token_user_add(vdev, -1);
>   vfio_spapr_pci_eeh_release(vdev->pdev);
>   vfio_pci_disable(vdev);
> + if (vdev->err_trigger)
> + eventfd_ctx_put(vdev->err_trigger);
> + if (vdev->req_trigger)
> + eventfd_ctx_put(vdev->req_trigger);
>   }
>  
>   mutex_unlock(&vdev->reflck->lock);


This has a fix pending, I'd suggest not picking it on its own:

https://lore.kernel.org/kvm/20200616085052.sahrunsesjyje...@beryllium.lan/
https://lore.kernel.org/kvm/159234276956.31057.6902954364435481688.st...@gimli.home/

Thanks,
Alex



Re: [PATCH AUTOSEL 5.4 191/266] vfio/pci: fix memory leaks of eventfd ctx

2020-06-17 Thread Alex Williamson
On Wed, 17 Jun 2020 21:15:16 -0400
Sasha Levin  wrote:

> From: Qian Cai 
> 
> [ Upstream commit 1518ac272e789cae8c555d69951b032a275b7602 ]
> 
> Finished a qemu-kvm (-device vfio-pci,host=0001:01:00.0) triggers a few
> memory leaks after a while because vfio_pci_set_ctx_trigger_single()
> calls eventfd_ctx_fdget() without the matching eventfd_ctx_put() later.
> Fix it by calling eventfd_ctx_put() for those memory in
> vfio_pci_release() before vfio_device_release().
> 
> unreferenced object 0xebff008981cc2b00 (size 128):
>   comm "qemu-kvm", pid 4043, jiffies 4294994816 (age 9796.310s)
>   hex dump (first 32 bytes):
> 01 00 00 00 6b 6b 6b 6b 00 00 00 00 ad 4e ad de  .N..
> ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff  
>   backtrace:
> [<917e8f8d>] slab_post_alloc_hook+0x74/0x9c
> [<df0f2aa2>] kmem_cache_alloc_trace+0x2b4/0x3d4
> [<5fcec025>] do_eventfd+0x54/0x1ac
> [<82791a69>] __arm64_sys_eventfd2+0x34/0x44
> [<b819758c>] do_el0_svc+0x128/0x1dc
> [<b244e810>] el0_sync_handler+0xd0/0x268
> [<d495ef94>] el0_sync+0x164/0x180
> unreferenced object 0x29ff008981cc4180 (size 128):
>   comm "qemu-kvm", pid 4043, jiffies 4294994818 (age 9796.290s)
>   hex dump (first 32 bytes):
> 01 00 00 00 6b 6b 6b 6b 00 00 00 00 ad 4e ad de  .N..
> ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff  
>   backtrace:
> [<917e8f8d>] slab_post_alloc_hook+0x74/0x9c
> [<df0f2aa2>] kmem_cache_alloc_trace+0x2b4/0x3d4
> [<5fcec025>] do_eventfd+0x54/0x1ac
> [<82791a69>] __arm64_sys_eventfd2+0x34/0x44
> [<0000b819758c>] do_el0_svc+0x128/0x1dc
> [<b244e810>] el0_sync_handler+0xd0/0x268
> [<d495ef94>] el0_sync+0x164/0x180
> 
> Signed-off-by: Qian Cai 
> Signed-off-by: Alex Williamson 
> Signed-off-by: Sasha Levin 
> ---
>  drivers/vfio/pci/vfio_pci.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 02206162eaa9..d917dd2df3b3 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -472,6 +472,10 @@ static void vfio_pci_release(void *device_data)
>   if (!(--vdev->refcnt)) {
>   vfio_spapr_pci_eeh_release(vdev->pdev);
>   vfio_pci_disable(vdev);
> + if (vdev->err_trigger)
> + eventfd_ctx_put(vdev->err_trigger);
> + if (vdev->req_trigger)
> + eventfd_ctx_put(vdev->req_trigger);
>   }
>  
>   mutex_unlock(&vdev->reflck->lock);

This has a fix pending, I'd suggest not picking it on its own:

https://lore.kernel.org/kvm/20200616085052.sahrunsesjyje...@beryllium.lan/
https://lore.kernel.org/kvm/159234276956.31057.6902954364435481688.st...@gimli.home/

Thanks,
Alex



Re: [PATCH AUTOSEL 4.19 120/172] vfio/pci: fix memory leaks of eventfd ctx

2020-06-17 Thread Alex Williamson
On Wed, 17 Jun 2020 21:21:26 -0400
Sasha Levin  wrote:

> From: Qian Cai 
> 
> [ Upstream commit 1518ac272e789cae8c555d69951b032a275b7602 ]
> 
> Finished a qemu-kvm (-device vfio-pci,host=0001:01:00.0) triggers a few
> memory leaks after a while because vfio_pci_set_ctx_trigger_single()
> calls eventfd_ctx_fdget() without the matching eventfd_ctx_put() later.
> Fix it by calling eventfd_ctx_put() for those memory in
> vfio_pci_release() before vfio_device_release().
> 
> unreferenced object 0xebff008981cc2b00 (size 128):
>   comm "qemu-kvm", pid 4043, jiffies 4294994816 (age 9796.310s)
>   hex dump (first 32 bytes):
> 01 00 00 00 6b 6b 6b 6b 00 00 00 00 ad 4e ad de  .N..
> ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff  
>   backtrace:
> [<917e8f8d>] slab_post_alloc_hook+0x74/0x9c
> [<df0f2aa2>] kmem_cache_alloc_trace+0x2b4/0x3d4
> [<5fcec025>] do_eventfd+0x54/0x1ac
> [<82791a69>] __arm64_sys_eventfd2+0x34/0x44
> [<b819758c>] do_el0_svc+0x128/0x1dc
> [<b244e810>] el0_sync_handler+0xd0/0x268
> [<d495ef94>] el0_sync+0x164/0x180
> unreferenced object 0x29ff008981cc4180 (size 128):
>   comm "qemu-kvm", pid 4043, jiffies 4294994818 (age 9796.290s)
>   hex dump (first 32 bytes):
> 01 00 00 00 6b 6b 6b 6b 00 00 00 00 ad 4e ad de  .N..
> ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff  
>   backtrace:
> [<917e8f8d>] slab_post_alloc_hook+0x74/0x9c
> [<df0f2aa2>] kmem_cache_alloc_trace+0x2b4/0x3d4
> [<5fcec025>] do_eventfd+0x54/0x1ac
> [<82791a69>] __arm64_sys_eventfd2+0x34/0x44
> [<0000b819758c>] do_el0_svc+0x128/0x1dc
> [<b244e810>] el0_sync_handler+0xd0/0x268
> [<d495ef94>] el0_sync+0x164/0x180
> 
> Signed-off-by: Qian Cai 
> Signed-off-by: Alex Williamson 
> Signed-off-by: Sasha Levin 
> ---
>  drivers/vfio/pci/vfio_pci.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 66783a37f450..36b2ea920bc9 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -407,6 +407,10 @@ static void vfio_pci_release(void *device_data)
>   if (!(--vdev->refcnt)) {
>   vfio_spapr_pci_eeh_release(vdev->pdev);
>   vfio_pci_disable(vdev);
> + if (vdev->err_trigger)
> + eventfd_ctx_put(vdev->err_trigger);
> + if (vdev->req_trigger)
> + eventfd_ctx_put(vdev->req_trigger);
>   }
>  
>   mutex_unlock(&driver_lock);

This has a fix pending, I'd suggest not picking it on its own:

https://lore.kernel.org/kvm/20200616085052.sahrunsesjyje...@beryllium.lan/
https://lore.kernel.org/kvm/159234276956.31057.6902954364435481688.st...@gimli.home/

Thanks,
Alex



Re: [PATCH AUTOSEL 4.4 50/60] vfio/pci: fix memory leaks of eventfd ctx

2020-06-17 Thread Alex Williamson
On Wed, 17 Jun 2020 21:29:54 -0400
Sasha Levin  wrote:

> From: Qian Cai 
> 
> [ Upstream commit 1518ac272e789cae8c555d69951b032a275b7602 ]
> 
> Finished a qemu-kvm (-device vfio-pci,host=0001:01:00.0) triggers a few
> memory leaks after a while because vfio_pci_set_ctx_trigger_single()
> calls eventfd_ctx_fdget() without the matching eventfd_ctx_put() later.
> Fix it by calling eventfd_ctx_put() for those memory in
> vfio_pci_release() before vfio_device_release().
> 
> unreferenced object 0xebff008981cc2b00 (size 128):
>   comm "qemu-kvm", pid 4043, jiffies 4294994816 (age 9796.310s)
>   hex dump (first 32 bytes):
> 01 00 00 00 6b 6b 6b 6b 00 00 00 00 ad 4e ad de  .N..
> ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff  
>   backtrace:
> [<917e8f8d>] slab_post_alloc_hook+0x74/0x9c
> [<df0f2aa2>] kmem_cache_alloc_trace+0x2b4/0x3d4
> [<5fcec025>] do_eventfd+0x54/0x1ac
> [<82791a69>] __arm64_sys_eventfd2+0x34/0x44
> [<b819758c>] do_el0_svc+0x128/0x1dc
> [<b244e810>] el0_sync_handler+0xd0/0x268
> [<d495ef94>] el0_sync+0x164/0x180
> unreferenced object 0x29ff008981cc4180 (size 128):
>   comm "qemu-kvm", pid 4043, jiffies 4294994818 (age 9796.290s)
>   hex dump (first 32 bytes):
> 01 00 00 00 6b 6b 6b 6b 00 00 00 00 ad 4e ad de  .N..
> ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff  
>   backtrace:
> [<917e8f8d>] slab_post_alloc_hook+0x74/0x9c
> [<df0f2aa2>] kmem_cache_alloc_trace+0x2b4/0x3d4
> [<5fcec025>] do_eventfd+0x54/0x1ac
> [<82791a69>] __arm64_sys_eventfd2+0x34/0x44
> [<0000b819758c>] do_el0_svc+0x128/0x1dc
> [<b244e810>] el0_sync_handler+0xd0/0x268
> [<d495ef94>] el0_sync+0x164/0x180
> 
> Signed-off-by: Qian Cai 
> Signed-off-by: Alex Williamson 
> Signed-off-by: Sasha Levin 
> ---
>  drivers/vfio/pci/vfio_pci.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 7a82735d5308..ab765770e8dd 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -255,6 +255,10 @@ static void vfio_pci_release(void *device_data)
>   if (!(--vdev->refcnt)) {
>   vfio_spapr_pci_eeh_release(vdev->pdev);
>   vfio_pci_disable(vdev);
> + if (vdev->err_trigger)
> + eventfd_ctx_put(vdev->err_trigger);
> + if (vdev->req_trigger)
> + eventfd_ctx_put(vdev->req_trigger);
>   }
>  
>   mutex_unlock(&driver_lock);

This has a fix pending, I'd suggest not picking it on its own:

https://lore.kernel.org/kvm/20200616085052.sahrunsesjyje...@beryllium.lan/
https://lore.kernel.org/kvm/159234276956.31057.6902954364435481688.st...@gimli.home/

Thanks,
Alex



Re: [PATCH AUTOSEL 4.14 079/108] vfio/pci: fix memory leaks of eventfd ctx

2020-06-17 Thread Alex Williamson
On Wed, 17 Jun 2020 21:25:31 -0400
Sasha Levin  wrote:

> From: Qian Cai 
> 
> [ Upstream commit 1518ac272e789cae8c555d69951b032a275b7602 ]
> 
> Finished a qemu-kvm (-device vfio-pci,host=0001:01:00.0) triggers a few
> memory leaks after a while because vfio_pci_set_ctx_trigger_single()
> calls eventfd_ctx_fdget() without the matching eventfd_ctx_put() later.
> Fix it by calling eventfd_ctx_put() for those memory in
> vfio_pci_release() before vfio_device_release().
> 
> unreferenced object 0xebff008981cc2b00 (size 128):
>   comm "qemu-kvm", pid 4043, jiffies 4294994816 (age 9796.310s)
>   hex dump (first 32 bytes):
> 01 00 00 00 6b 6b 6b 6b 00 00 00 00 ad 4e ad de  .N..
> ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff  
>   backtrace:
> [<917e8f8d>] slab_post_alloc_hook+0x74/0x9c
> [<df0f2aa2>] kmem_cache_alloc_trace+0x2b4/0x3d4
> [<5fcec025>] do_eventfd+0x54/0x1ac
> [<82791a69>] __arm64_sys_eventfd2+0x34/0x44
> [<b819758c>] do_el0_svc+0x128/0x1dc
> [<b244e810>] el0_sync_handler+0xd0/0x268
> [<d495ef94>] el0_sync+0x164/0x180
> unreferenced object 0x29ff008981cc4180 (size 128):
>   comm "qemu-kvm", pid 4043, jiffies 4294994818 (age 9796.290s)
>   hex dump (first 32 bytes):
> 01 00 00 00 6b 6b 6b 6b 00 00 00 00 ad 4e ad de  .N..
> ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff  
>   backtrace:
> [<917e8f8d>] slab_post_alloc_hook+0x74/0x9c
> [<df0f2aa2>] kmem_cache_alloc_trace+0x2b4/0x3d4
> [<5fcec025>] do_eventfd+0x54/0x1ac
> [<82791a69>] __arm64_sys_eventfd2+0x34/0x44
> [<0000b819758c>] do_el0_svc+0x128/0x1dc
> [<b244e810>] el0_sync_handler+0xd0/0x268
> [<d495ef94>] el0_sync+0x164/0x180
> 
> Signed-off-by: Qian Cai 
> Signed-off-by: Alex Williamson 
> Signed-off-by: Sasha Levin 
> ---
>  drivers/vfio/pci/vfio_pci.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 550ab7707b57..b7733d3c06de 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -397,6 +397,10 @@ static void vfio_pci_release(void *device_data)
>   if (!(--vdev->refcnt)) {
>   vfio_spapr_pci_eeh_release(vdev->pdev);
>   vfio_pci_disable(vdev);
> + if (vdev->err_trigger)
> + eventfd_ctx_put(vdev->err_trigger);
> + if (vdev->req_trigger)
> + eventfd_ctx_put(vdev->req_trigger);
>   }
>  
>   mutex_unlock(&driver_lock);

This has a fix pending, I'd suggest not picking it on its own:

https://lore.kernel.org/kvm/20200616085052.sahrunsesjyje...@beryllium.lan/
https://lore.kernel.org/kvm/159234276956.31057.6902954364435481688.st...@gimli.home/

Thanks,
Alex



Re: [PATCH AUTOSEL 4.9 64/80] vfio/pci: fix memory leaks of eventfd ctx

2020-06-17 Thread Alex Williamson
On Wed, 17 Jun 2020 21:28:03 -0400
Sasha Levin  wrote:

> From: Qian Cai 
> 
> [ Upstream commit 1518ac272e789cae8c555d69951b032a275b7602 ]
> 
> Finished a qemu-kvm (-device vfio-pci,host=0001:01:00.0) triggers a few
> memory leaks after a while because vfio_pci_set_ctx_trigger_single()
> calls eventfd_ctx_fdget() without the matching eventfd_ctx_put() later.
> Fix it by calling eventfd_ctx_put() for those memory in
> vfio_pci_release() before vfio_device_release().
> 
> unreferenced object 0xebff008981cc2b00 (size 128):
>   comm "qemu-kvm", pid 4043, jiffies 4294994816 (age 9796.310s)
>   hex dump (first 32 bytes):
> 01 00 00 00 6b 6b 6b 6b 00 00 00 00 ad 4e ad de  .N..
> ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff  
>   backtrace:
> [<917e8f8d>] slab_post_alloc_hook+0x74/0x9c
> [<df0f2aa2>] kmem_cache_alloc_trace+0x2b4/0x3d4
> [<5fcec025>] do_eventfd+0x54/0x1ac
> [<82791a69>] __arm64_sys_eventfd2+0x34/0x44
> [<b819758c>] do_el0_svc+0x128/0x1dc
> [<b244e810>] el0_sync_handler+0xd0/0x268
> [<d495ef94>] el0_sync+0x164/0x180
> unreferenced object 0x29ff008981cc4180 (size 128):
>   comm "qemu-kvm", pid 4043, jiffies 4294994818 (age 9796.290s)
>   hex dump (first 32 bytes):
> 01 00 00 00 6b 6b 6b 6b 00 00 00 00 ad 4e ad de  .N..
> ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff  
>   backtrace:
> [<917e8f8d>] slab_post_alloc_hook+0x74/0x9c
> [<df0f2aa2>] kmem_cache_alloc_trace+0x2b4/0x3d4
> [<5fcec025>] do_eventfd+0x54/0x1ac
> [<82791a69>] __arm64_sys_eventfd2+0x34/0x44
> [<0000b819758c>] do_el0_svc+0x128/0x1dc
> [<b244e810>] el0_sync_handler+0xd0/0x268
> [<d495ef94>] el0_sync+0x164/0x180
> 
> Signed-off-by: Qian Cai 
> Signed-off-by: Alex Williamson 
> Signed-off-by: Sasha Levin 
> ---
>  drivers/vfio/pci/vfio_pci.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index c94167d87178..9d8715abbec1 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -390,6 +390,10 @@ static void vfio_pci_release(void *device_data)
>   if (!(--vdev->refcnt)) {
>   vfio_spapr_pci_eeh_release(vdev->pdev);
>   vfio_pci_disable(vdev);
> + if (vdev->err_trigger)
> + eventfd_ctx_put(vdev->err_trigger);
> + if (vdev->req_trigger)
> + eventfd_ctx_put(vdev->req_trigger);
>   }
>  
>   mutex_unlock(&driver_lock);

This has a fix pending, I'd suggest not picking it on its own:

https://lore.kernel.org/kvm/20200616085052.sahrunsesjyje...@beryllium.lan/
https://lore.kernel.org/kvm/159234276956.31057.6902954364435481688.st...@gimli.home/

Thanks,
Alex



Re: vfio: refcount_t: underflow; use-after-free.

2020-06-16 Thread Alex Williamson
On Tue, 16 Jun 2020 10:50:52 +0200
Daniel Wagner  wrote:

> Hi,
> 
> I'm getting the warning below when starting a KVM the second time with an
> Emulex PCI card 'passthroughed' into a KVM. I'm terminating the session
> via 'ctrl-a x', not sure if this is relevant.
> 
> This is with 5.8-rc1. IIRC, older version didn't have this problem.

Thanks for the report, it's a new regression.  I've just posted a fix
for it.  Thanks,

Alex



[PATCH] vfio/pci: Clear error and request eventfd ctx after releasing

2020-06-16 Thread Alex Williamson
The next use of the device will generate an underflow from the
stale reference.

Cc: Qian Cai 
Fixes: 1518ac272e78 ("vfio/pci: fix memory leaks of eventfd ctx")
Reported-by: Daniel Wagner 
Signed-off-by: Alex Williamson 
---
 drivers/vfio/pci/vfio_pci.c |8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 7c0779018b1b..f634c81998bb 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -521,10 +521,14 @@ static void vfio_pci_release(void *device_data)
vfio_pci_vf_token_user_add(vdev, -1);
vfio_spapr_pci_eeh_release(vdev->pdev);
vfio_pci_disable(vdev);
-   if (vdev->err_trigger)
+   if (vdev->err_trigger) {
eventfd_ctx_put(vdev->err_trigger);
-   if (vdev->req_trigger)
+   vdev->err_trigger = NULL;
+   }
+   if (vdev->req_trigger) {
eventfd_ctx_put(vdev->req_trigger);
+   vdev->req_trigger = NULL;
+   }
}
 
mutex_unlock(&vdev->reflck->lock);



Re: [PATCH v2 3/3] iommu/vt-d: Sanity check uapi argsz filled by users

2020-06-11 Thread Alex Williamson
On Thu, 11 Jun 2020 13:02:24 -0700
Jacob Pan  wrote:

> On Thu, 11 Jun 2020 11:08:16 -0600
> Alex Williamson  wrote:
> 
> > On Wed, 10 Jun 2020 21:12:15 -0700
> > Jacob Pan  wrote:
> >   
> > > IOMMU UAPI data has an argsz field which is filled by user. As the
> > > data structures expand, argsz may change. As the UAPI data are
> > > shared among different architectures, extensions of UAPI data could
> > > be a result of one architecture which has no impact on another.
> > > Therefore, these argsz sanity checks are performed in the model
> > > specific IOMMU drivers. This patch adds sanity checks in the VT-d
> > > to ensure argsz passed by userspace matches feature flags and other
> > > contents.
> > > 
> > > Signed-off-by: Jacob Pan 
> > > ---
> > >  drivers/iommu/intel-iommu.c | 16 
> > >  drivers/iommu/intel-svm.c   | 12 
> > >  2 files changed, 28 insertions(+)
> > > 
> > > diff --git a/drivers/iommu/intel-iommu.c
> > > b/drivers/iommu/intel-iommu.c index 27ebf4b9faef..c98b5109684b
> > > 100644 --- a/drivers/iommu/intel-iommu.c
> > > +++ b/drivers/iommu/intel-iommu.c
> > > @@ -5365,6 +5365,7 @@ intel_iommu_sva_invalidate(struct
> > > iommu_domain *domain, struct device *dev, struct device_domain_info
> > > *info; struct intel_iommu *iommu;
> > >   unsigned long flags;
> > > + unsigned long minsz;
> > >   int cache_type;
> > >   u8 bus, devfn;
> > >   u16 did, sid;
> > > @@ -5385,6 +5386,21 @@ intel_iommu_sva_invalidate(struct
> > > iommu_domain *domain, struct device *dev, if (!(dmar_domain->flags
> > > & DOMAIN_FLAG_NESTING_MODE)) return -EINVAL;
> > >  
> > > + minsz = offsetofend(struct iommu_cache_invalidate_info,
> > > padding);
> > 
> > Would it still be better to look for the end of the last field that's
> > actually used to avoid the code churn and oversights if/when the
> > padding field does get used and renamed?
> >   
> My thought was that if the padding gets partially re-purposed, the
> remaining padding would still be valid for minsz check. The extension
> rule ensures that there is no size change other than the variable size union
> at the end. So using padding would avoid the churn, or am I totally wrong?

No, it's trying to predict the future either way.  I figured checking
minsz against the fields we actually consume allows complete use of the
padding fields and provides a little leniency to the user.  We'd need
to be careful though that if those fields are later used by this
driver, the code would still need to accept the smaller size.  If the
union was named rather than anonymous we could just use offsetof() to
avoid directly referencing the padding fields.
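
To illustrate the named-union point, the fixed-part size can be taken without
naming the padding field at all; the layout below is only an example, with the
two member structs taken from the IOMMU uAPI header:

#include <linux/iommu.h>
#include <linux/stddef.h>

/* illustrative layout, not the exact UAPI structure */
struct example_invalidate_info {
	__u32	argsz;
	__u32	version;
	__u8	cache;
	__u8	granularity;
	__u8	padding[6];
	union {
		struct iommu_inv_pasid_info pasid_info;
		struct iommu_inv_addr_info addr_info;
	} granu;
};

static __u32 example_minsz(void)
{
	/*
	 * With an anonymous union the check must reference the padding
	 * field, e.g. offsetofend(..., padding).  With a named union the
	 * same value is obtained without touching padding at all:
	 */
	return offsetof(struct example_invalidate_info, granu);
}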
 
> > Per my comment on patch 1/, this also seems like where the device
> > specific IOMMU driver should also have the responsibility of receiving
> > a __user pointer to do the copy_from_user() here.  vfio can't know
> > which flags require which fields to make a UAPI with acceptable
> > compatibility guarantees otherwise.
> >   
> Right, VFIO cannot do compatibility guarantees, it just seems to be
> that VFIO has enough information to copy_from_user sanely & safely and
> hand it over to IOMMU. Please help define the roles/responsibilities in
> my other email. Then I will follow the guideline.

We can keep that part of the discussion in the other thread.  Thanks,

Alex

> > > + if (inv_info->argsz < minsz)
> > > + return -EINVAL;
> > > +
> > > > + /* Sanity check user filled invalidation data sizes */
> > > + if (inv_info->granularity == IOMMU_INV_GRANU_ADDR &&
> > > + inv_info->argsz != offsetofend(struct
> > > iommu_cache_invalidate_info,
> > > + addr_info))
> > > + return -EINVAL;
> > > +
> > > + if (inv_info->granularity == IOMMU_INV_GRANU_PASID &&
> > > + inv_info->argsz != offsetofend(struct
> > > iommu_cache_invalidate_info,
> > > + pasid_info))
> > > + return -EINVAL;
> > > +
> > >   spin_lock_irqsave(&device_domain_lock, flags);
> > >   spin_lock(&iommu->lock);
> > >   info = get_domain_info(dev);
> > > diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> > > index 35b43fe819ed..64dc2c66dfff 100644
> > > --- a/drivers/iommu/intel-svm.c
> > > +++ b/drivers/iommu/intel-svm.c
> > > @@ -235,15 +235,27 @@ int i

Re: [PATCH v2 1/3] docs: IOMMU user API

2020-06-11 Thread Alex Williamson
On Thu, 11 Jun 2020 12:52:05 -0700
Jacob Pan  wrote:

> Hi Alex,
> 
> On Thu, 11 Jun 2020 09:47:41 -0600
> Alex Williamson  wrote:
> 
> > On Wed, 10 Jun 2020 21:12:13 -0700
> > Jacob Pan  wrote:
> >   
> > > IOMMU UAPI is newly introduced to support communications between
> > > guest virtual IOMMU and host IOMMU. There has been lots of
> > > discussions on how it should work with VFIO UAPI and userspace in
> > > general.
> > > 
> > > This document is intended to clarify the UAPI design and usage. The
> > > mechanics of how future extensions should be achieved are also
> > > covered in this documentation.
> > > 
> > > Signed-off-by: Liu Yi L 
> > > Signed-off-by: Jacob Pan 
> > > ---
> > >  Documentation/userspace-api/iommu.rst | 210
> > > ++ 1 file changed, 210 insertions(+)
> > >  create mode 100644 Documentation/userspace-api/iommu.rst
> > > 
> > > diff --git a/Documentation/userspace-api/iommu.rst
> > > b/Documentation/userspace-api/iommu.rst new file mode 100644
> > > index ..e95dc5a04a41
> > > --- /dev/null
> > > +++ b/Documentation/userspace-api/iommu.rst
> > > @@ -0,0 +1,210 @@
> > > +.. SPDX-License-Identifier: GPL-2.0
> > > +.. iommu:
> > > +
> > > +===================
> > > +IOMMU Userspace API
> > > +===================
> > > +
> > > +IOMMU UAPI is used for virtualization cases where communications
> > > are +needed between physical and virtual IOMMU drivers. For native
> > > +usage, IOMMU is a system device which does not need to communicate
> > > +with user space directly.
> > > +
> > > +The primary use cases are guest Shared Virtual Address (SVA) and
> > > +guest IO virtual address (IOVA), wherein virtual IOMMU (vIOMMU) is
> > > +required to communicate with the physical IOMMU in the host.
> > > +
> > > +.. contents:: :local:
> > > +
> > > +Functionalities
> > > +
> > > +Communications of user and kernel involve both directions. The
> > > +supported user-kernel APIs are as follows:
> > > +
> > > +1. Alloc/Free PASID
> > > +2. Bind/unbind guest PASID (e.g. Intel VT-d)
> > > +3. Bind/unbind guest PASID table (e.g. ARM sMMU)
> > > +4. Invalidate IOMMU caches
> > > +5. Service page request
> > > +
> > > +Requirements
> > > +
> > > +The IOMMU UAPIs are generic and extensible to meet the following
> > > +requirements:
> > > +
> > > +1. Emulated and para-virtualised vIOMMUs
> > > +2. Multiple vendors (Intel VT-d, ARM sMMU, etc.)
> > > +3. Extensions to the UAPI shall not break existing user space
> > > +
> > > +Interfaces
> > > +
> > > +Although the data structures defined in IOMMU UAPI are
> > > self-contained, +there is no user API functions introduced.
> > > Instead, IOMMU UAPI is +designed to work with existing user driver
> > > frameworks such as VFIO. +
> > > +Extension Rules & Precautions
> > > +-
> > > +When IOMMU UAPI gets extended, the data structures can *only* be
> > > +modified in two ways:
> > > +
> > > +1. Adding new fields by re-purposing the padding[] field. No size
> > > change. +2. Adding new union members at the end. May increase in
> > > size. +
> > > +No new fields can be added *after* the variable size union in that
> > > it +will break backward compatibility when offset moves. In both
> > > cases, a +new flag must be accompanied with a new field such that
> > > the IOMMU +driver can process the data based on the new flag.
> > > Version field is +only reserved for the unlikely event of UAPI
> > > upgrade at its entirety. +
> > > +It's *always* the caller's responsibility to indicate the size of
> > > the +structure passed by setting argsz appropriately.
> > > +
> > > +When IOMMU UAPI extension results in size increase, user such as
> > > VFIO +has to handle the following scenarios:
> > > +
> > > +1. User and kernel has exact size match
> > > +2. An older user with older kernel header (smaller UAPI size)
> > > running on a
> > > +   newer kernel (larger UAPI size)
> > > +3. A

Re: [PATCH v2 02/15] iommu: Report domain nesting info

2020-06-11 Thread Alex Williamson
On Thu, 11 Jun 2020 05:15:21 -0700
Liu Yi L  wrote:

> IOMMUs that support nesting translation need to report the capability info
> to userspace, e.g. the format of first level/stage paging structures.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Cc: Joerg Roedel 
> Cc: Lu Baolu 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Jacob Pan 
> ---
> @Jean, Eric: as nesting was introduced for ARM, but looks like no actual
> user of it. right? So I'm wondering if we can reuse DOMAIN_ATTR_NESTING
> to retrieve nesting info? how about your opinions?
> 
>  include/linux/iommu.h  |  1 +
>  include/uapi/linux/iommu.h | 34 ++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 78a26ae..f6e4b49 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -126,6 +126,7 @@ enum iommu_attr {
>   DOMAIN_ATTR_FSL_PAMUV1,
>   DOMAIN_ATTR_NESTING,/* two stages of translation */
>   DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE,
> + DOMAIN_ATTR_NESTING_INFO,
>   DOMAIN_ATTR_MAX,
>  };
>  
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 303f148..02eac73 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -332,4 +332,38 @@ struct iommu_gpasid_bind_data {
>   };
>  };
>  
> +struct iommu_nesting_info {
> + __u32   size;
> + __u32   format;
> + __u32   features;
> +#define IOMMU_NESTING_FEAT_SYSWIDE_PASID (1 << 0)
> +#define IOMMU_NESTING_FEAT_BIND_PGTBL(1 << 1)
> +#define IOMMU_NESTING_FEAT_CACHE_INVLD   (1 << 2)
> + __u32   flags;
> + __u8data[];
> +};
> +
> +/*
> + * @flags:   VT-d specific flags. Currently reserved for future
> + *   extension.
> + * @addr_width:  The output addr width of first level/stage translation
> + * @pasid_bits:  Maximum supported PASID bits, 0 represents no PASID
> + *   support.
> + * @cap_reg: Describe basic capabilities as defined in VT-d capability
> + *   register.
> + * @cap_mask:Mark valid capability bits in @cap_reg.
> + * @ecap_reg:Describe the extended capabilities as defined in VT-d
> + *   extended capability register.
> + * @ecap_mask:   Mark the valid capability bits in @ecap_reg.

Please explain this a little further, why do we need to tell userspace
about cap/ecap register bits that aren't valid through this interface?
Thanks,

Alex


> + */
> +struct iommu_nesting_info_vtd {
> + __u32   flags;
> + __u16   addr_width;
> + __u16   pasid_bits;
> + __u64   cap_reg;
> + __u64   cap_mask;
> + __u64   ecap_reg;
> + __u64   ecap_mask;
> +};
> +
>  #endif /* _UAPI_IOMMU_H */

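For readers skimming the thread, a rough userspace-side sketch of how the
proposed layout would be consumed, assuming the VT-d specific struct sits in
data[] immediately after the generic header; the local struct mirrors and the
parse helper are illustrative only, not part of the patch:

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* Local mirrors of the layouts proposed above; a real consumer would use
 * the uapi header once merged. */
struct iommu_nesting_info {
        uint32_t size;
        uint32_t format;
        uint32_t features;
        uint32_t flags;
        uint8_t  data[];
};

struct iommu_nesting_info_vtd {
        uint32_t flags;
        uint16_t addr_width;
        uint16_t pasid_bits;
        uint64_t cap_reg;
        uint64_t cap_mask;
        uint64_t ecap_reg;
        uint64_t ecap_mask;
};

/* Hypothetical helper: interpret a buffer returned by a nesting-info query. */
static int parse_vtd_nesting_info(const void *buf, size_t len)
{
        const struct iommu_nesting_info *info = buf;
        struct iommu_nesting_info_vtd vtd;

        if (len < sizeof(*info) || info->size > len)
                return -1;                      /* short or inconsistent buffer */
        if (info->size < sizeof(*info) + sizeof(vtd))
                return -1;                      /* no vendor data we understand */

        memcpy(&vtd, info->data, sizeof(vtd));  /* vendor struct follows the header */
        printf("addr_width %u, pasid_bits %u, features 0x%x\n",
               vtd.addr_width, vtd.pasid_bits, info->features);
        return 0;
}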


Re: [PATCH v2 3/3] iommu/vt-d: Sanity check uapi argsz filled by users

2020-06-11 Thread Alex Williamson
On Wed, 10 Jun 2020 21:12:15 -0700
Jacob Pan  wrote:

> IOMMU UAPI data has an argsz field which is filled by the user. As the data
> structures expand, argsz may change. As the UAPI data are shared among
> different architectures, extensions of UAPI data could be a result of
> one architecture which has no impact on another. Therefore, these argsz
> sanity checks are performed in the model specific IOMMU drivers. This
> patch adds sanity checks in the VT-d driver to ensure that the argsz passed
> by userspace matches the feature flags and other contents.
> 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/intel-iommu.c | 16 
>  drivers/iommu/intel-svm.c   | 12 
>  2 files changed, 28 insertions(+)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 27ebf4b9faef..c98b5109684b 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5365,6 +5365,7 @@ intel_iommu_sva_invalidate(struct iommu_domain *domain, 
> struct device *dev,
>   struct device_domain_info *info;
>   struct intel_iommu *iommu;
>   unsigned long flags;
> + unsigned long minsz;
>   int cache_type;
>   u8 bus, devfn;
>   u16 did, sid;
> @@ -5385,6 +5386,21 @@ intel_iommu_sva_invalidate(struct iommu_domain 
> *domain, struct device *dev,
>   if (!(dmar_domain->flags & DOMAIN_FLAG_NESTING_MODE))
>   return -EINVAL;
>  
> + minsz = offsetofend(struct iommu_cache_invalidate_info, padding);

Would it still be better to look for the end of the last field that's
actually used to avoid the code churn and oversights if/when the padding
field does get used and renamed?

Per my comment on patch 1/, this also seems like where the device
specific IOMMU driver should also have the responsibility of receiving
a __user pointer to do the copy_from_user() here.  vfio can't know
which flags require which fields to make a UAPI with acceptable
compatibility guarantees otherwise.

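A minimal sketch of the split being suggested here, assuming VFIO copies only
the fixed header to learn argsz and then passes the __user pointer down so the
IOMMU driver can copy and validate exactly what its flags require; the function
names are placeholders, not the real call chain:

#include <linux/types.h>
#include <linux/device.h>
#include <linux/stddef.h>
#include <linux/uaccess.h>
#include <linux/iommu.h>

/* Placeholder for the vendor-specific entry point that does the full copy. */
int example_iommu_cache_invalidate_user(struct device *dev,
                                        void __user *uinfo, u32 argsz);

static int example_vfio_forward(struct device *dev, void __user *uinfo)
{
        struct iommu_cache_invalidate_info hdr;
        u32 minsz = offsetofend(struct iommu_cache_invalidate_info, padding);

        /* Copy just enough to learn how much data the user claims to pass. */
        if (copy_from_user(&hdr, uinfo, minsz))
                return -EFAULT;
        if (hdr.argsz < minsz)
                return -EINVAL;

        /*
         * The IOMMU driver knows which flags/granularities require which
         * union members, so it sizes and validates the remaining copy.
         */
        return example_iommu_cache_invalidate_user(dev, uinfo, hdr.argsz);
}
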
> + if (inv_info->argsz < minsz)
> + return -EINVAL;
> +
> + /* Sanity check user filled invalidation data sizes */
> + if (inv_info->granularity == IOMMU_INV_GRANU_ADDR &&
> +     inv_info->argsz != offsetofend(struct iommu_cache_invalidate_info,
> +                                    addr_info))
> +         return -EINVAL;
> +
> + if (inv_info->granularity == IOMMU_INV_GRANU_PASID &&
> +     inv_info->argsz != offsetofend(struct iommu_cache_invalidate_info,
> +                                    pasid_info))
> +         return -EINVAL;
> +
>   spin_lock_irqsave(&device_domain_lock, flags);
>   spin_lock(&iommu->lock);
>   info = get_domain_info(dev);
> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> index 35b43fe819ed..64dc2c66dfff 100644
> --- a/drivers/iommu/intel-svm.c
> +++ b/drivers/iommu/intel-svm.c
> @@ -235,15 +235,27 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, 
> struct device *dev,
>   struct dmar_domain *dmar_domain;
>   struct intel_svm_dev *sdev;
>   struct intel_svm *svm;
> + unsigned long minsz;
>   int ret = 0;
>  
>   if (WARN_ON(!iommu) || !data)
>   return -EINVAL;
>  
> + /*
> +  * We mandate that there is no size change in the IOMMU UAPI data
> +  * before the variable size union at the end.
> +  */
> + minsz = offsetofend(struct iommu_gpasid_bind_data, padding);

Same.  Thanks,

Alex

> + if (data->argsz < minsz)
> + return -EINVAL;
> +
>   if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
>   data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
>   return -EINVAL;
>  
> + if (data->argsz != offsetofend(struct iommu_gpasid_bind_data, vtd))
> + return -EINVAL;
> +
>   if (!dev_is_pci(dev))
>   return -ENOTSUPP;
>  



Re: [PATCH v2 2/3] iommu/uapi: Add argsz for user filled data

2020-06-11 Thread Alex Williamson
On Wed, 10 Jun 2020 21:12:14 -0700
Jacob Pan  wrote:

> As IOMMU UAPI gets extended, user data size may increase. To support
> backward compatibility, this patch introduces a size field to each UAPI
> data structure. It is *always* the user's responsibility to fill in
> the correct size.

Though at the same time, argsz is user provided data which we don't
trust.  The argsz field allows the user to indicate how much data
they're providing, it's still the kernel's responsibility to validate
whether it's correct and sufficient for the requested operation.
Thanks,

Alex

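To make the split of responsibilities concrete, a tiny userspace-side sketch:
the caller states how much data it built (argsz from its own header), while the
kernel still decides whether that size is valid for the requested operation.
This assumes a uapi header that already carries the argsz field from this
series, and the ioctl number is a placeholder; the real requests are routed
through VFIO rather than a dedicated ioctl:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/iommu.h>

#define EXAMPLE_CACHE_INVALIDATE _IO('x', 0)    /* placeholder request number */

static int example_invalidate(int device_fd)
{
        struct iommu_cache_invalidate_info info;

        memset(&info, 0, sizeof(info));
        info.argsz = sizeof(info);      /* the user states what it is passing */
        info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
        /* the kernel must still check argsz against what the flags require */
        return ioctl(device_fd, EXAMPLE_CACHE_INVALIDATE, &info);
}
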
> Specific scenarios for user data handling are documented in:
> Documentation/userspace-api/iommu.rst
> 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Jacob Pan 
> ---
>  include/uapi/linux/iommu.h | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index e907b7091a46..303f148a5cd7 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -135,6 +135,7 @@ enum iommu_page_response_code {
>  
>  /**
>   * struct iommu_page_response - Generic page response information
> + * @argsz: User filled size of this data
>   * @version: API version of this structure
>   * @flags: encodes whether the corresponding fields are valid
>   * (IOMMU_FAULT_PAGE_RESPONSE_* values)
> @@ -143,6 +144,7 @@ enum iommu_page_response_code {
>   * @code: response code from  iommu_page_response_code
>   */
>  struct iommu_page_response {
> + __u32   argsz;
>  #define IOMMU_PAGE_RESP_VERSION_1  1
>   __u32   version;
>  #define IOMMU_PAGE_RESP_PASID_VALID  (1 << 0)
> @@ -218,6 +220,7 @@ struct iommu_inv_pasid_info {
>  /**
>   * struct iommu_cache_invalidate_info - First level/stage invalidation
>   * information
> + * @argsz: User filled size of this data
>   * @version: API version of this structure
>   * @cache: bitfield that allows to select which caches to invalidate
>   * @granularity: defines the lowest granularity used for the invalidation:
> @@ -246,6 +249,7 @@ struct iommu_inv_pasid_info {
>   * must support the used granularity.
>   */
>  struct iommu_cache_invalidate_info {
> + __u32   argsz;
>  #define IOMMU_CACHE_INVALIDATE_INFO_VERSION_1 1
>   __u32   version;
>  /* IOMMU paging structure cache */
> @@ -292,6 +296,7 @@ struct iommu_gpasid_bind_data_vtd {
>  
>  /**
>   * struct iommu_gpasid_bind_data - Information about device and guest PASID 
> binding
> + * @argsz:   User filled size of this data
>   * @version: Version of this data structure
>   * @format:  PASID table entry format
>   * @flags:   Additional information on guest bind request
> @@ -309,6 +314,7 @@ struct iommu_gpasid_bind_data_vtd {
>   * PASID to host PASID based on this bind data.
>   */
>  struct iommu_gpasid_bind_data {
> + __u32 argsz;
>  #define IOMMU_GPASID_BIND_VERSION_1  1
>   __u32 version;
>  #define IOMMU_PASID_FORMAT_INTEL_VTD 1



Re: [PATCH v2 1/3] docs: IOMMU user API

2020-06-11 Thread Alex Williamson
On Wed, 10 Jun 2020 21:12:13 -0700
Jacob Pan  wrote:

> IOMMU UAPI is newly introduced to support communications between guest
> virtual IOMMU and host IOMMU. There have been lots of discussions on how
> it should work with VFIO UAPI and userspace in general.
> 
> This document is intended to clarify the UAPI design and usage. The
> mechanics of how future extensions should be achieved are also covered
> in this documentation.
> 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Jacob Pan 
> ---
>  Documentation/userspace-api/iommu.rst | 210 
> ++
>  1 file changed, 210 insertions(+)
>  create mode 100644 Documentation/userspace-api/iommu.rst
> 
> diff --git a/Documentation/userspace-api/iommu.rst 
> b/Documentation/userspace-api/iommu.rst
> new file mode 100644
> index ..e95dc5a04a41
> --- /dev/null
> +++ b/Documentation/userspace-api/iommu.rst
> @@ -0,0 +1,210 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. iommu:
> +
> +===================
> +IOMMU Userspace API
> +===================
> +
> +IOMMU UAPI is used for virtualization cases where communications are
> +needed between physical and virtual IOMMU drivers. For native
> +usage, IOMMU is a system device which does not need to communicate
> +with user space directly.
> +
> +The primary use cases are guest Shared Virtual Address (SVA) and
> +guest IO virtual address (IOVA), wherein virtual IOMMU (vIOMMU) is
> +required to communicate with the physical IOMMU in the host.
> +
> +.. contents:: :local:
> +
> +Functionalities
> +
> +Communication between user and kernel involves both directions. The
> +supported user-kernel APIs are as follows:
> +
> +1. Alloc/Free PASID
> +2. Bind/unbind guest PASID (e.g. Intel VT-d)
> +3. Bind/unbind guest PASID table (e.g. ARM sMMU)
> +4. Invalidate IOMMU caches
> +5. Service page request
> +
> +Requirements
> +
> +The IOMMU UAPIs are generic and extensible to meet the following
> +requirements:
> +
> +1. Emulated and para-virtualised vIOMMUs
> +2. Multiple vendors (Intel VT-d, ARM sMMU, etc.)
> +3. Extensions to the UAPI shall not break existing user space
> +
> +Interfaces
> +
> +Although the data structures defined in IOMMU UAPI are self-contained,
> +there are no user API functions introduced. Instead, IOMMU UAPI is
> +designed to work with existing user driver frameworks such as VFIO.
> +
> +Extension Rules & Precautions
> +-----------------------------
> +When IOMMU UAPI gets extended, the data structures can *only* be
> +modified in two ways:
> +
> +1. Adding new fields by re-purposing the padding[] field. No size change.
> +2. Adding new union members at the end. May increase in size.
> +
> +No new fields can be added *after* the variable size union in that it
> +will break backward compatibility when offset moves. In both cases, a
> +new flag must be accompanied with a new field such that the IOMMU
> +driver can process the data based on the new flag. Version field is
> +only reserved for the unlikely event of UAPI upgrade at its entirety.
> +
> +It's *always* the caller's responsibility to indicate the size of the
> +structure passed by setting argsz appropriately.
> +
> +When IOMMU UAPI extension results in size increase, a user such as VFIO
> +has to handle the following scenarios:
> +
> +1. User and kernel have an exact size match
> +2. An older user with older kernel header (smaller UAPI size) running on a
> +   newer kernel (larger UAPI size)
> +3. A newer user with newer kernel header (larger UAPI size) running
> +   on an older kernel.
> +4. A malicious/misbehaving user passes an illegal/invalid size but within
> +   range. The data may contain garbage.
> +
> +
> +Feature Checking
> +
> +While launching a guest with vIOMMU, it is important to ensure that the
> +host can support the UAPI data structures to be used for vIOMMU-pIOMMU
> +communications. Without the upfront compatibility checking, future
> +faults are difficult to report even in normal conditions. For example,
> +TLB invalidations should always succeed from vIOMMU's
> +perspective. There is no architectural way to report back to the vIOMMU
> +if the UAPI data is incompatible. For this reason the following IOMMU
> +UAPIs cannot fail:
> +
> +1. Free PASID
> +2. Unbind guest PASID
> +3. Unbind guest PASID table (SMMU)
> +4. Cache invalidate
> +5. Page response
> +
> +User applications such as QEMU are expected to import kernel UAPI
> +headers. Only backward compatibility is supported. For example, an
> +older QEMU (with older kernel header) can run on newer kernel. Newer
> +QEMU (with new kernel header) may fail on older kernel.

"Build your user application against newer kernels and it may break on
older kernels" is not a great selling point of this UAPI.  Clearly new
features may not be available on older kernels and an 

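To illustrate the extension rules spelled out in the document above (new data
re-purposes padding and is guarded by a new flag; nothing is added after the
trailing union), a hedged sketch; every name below is invented for the example
and is not part of any proposed UAPI:

#include <linux/types.h>

struct example_vendor_a { __u64 reg; };
struct example_vendor_b { __u32 val[4]; };

struct example_uapi_data {
        __u32   argsz;
        __u32   version;
        __u32   format;
#define EXAMPLE_FLAG_EXTRA_VALID        (1 << 0)        /* new flag guards the new field */
        __u64   flags;
        __u32   extra;          /* was: __u8 padding[4]; offsets below are unchanged */
        union {                 /* variable-size tail: nothing may follow it */
                struct example_vendor_a a;
                struct example_vendor_b b;
        };
};
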
Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION

2020-06-05 Thread Alex Williamson
On Fri, 5 Jun 2020 00:26:10 +
"He, Shaopeng"  wrote:

> > From: Alex Williamson 
> > Sent: Thursday, June 4, 2020 12:11 PM
> > 
> > On Wed, 3 Jun 2020 22:42:28 -0400
> > Yan Zhao  wrote:
> >   
> > > On Wed, Jun 03, 2020 at 05:04:52PM -0600, Alex Williamson wrote:  
> > > > On Tue, 2 Jun 2020 21:40:58 -0400
> > > > Yan Zhao  wrote:
> > > >  
> > > > > On Tue, Jun 02, 2020 at 01:34:35PM -0600, Alex Williamson wrote:  
> > > > > > I'm not at all happy with this.  Why do we need to hide the
> > > > > > migration sparse mmap from the user until migration time?  What
> > > > > > if instead we introduced a new
> > > > > > VFIO_REGION_INFO_CAP_SPARSE_MMAP_SAVING capability where  
> > the  
> > > > > > existing capability is the normal runtime sparse setup and the
> > > > > > user is required to use this new one prior to enabled
> > > > > > device_state with _SAVING.  The vendor driver could then simply
> > > > > > track mmap vmas to the region and refuse to change device_state
> > > > > > if there are outstanding mmaps conflicting with the _SAVING
> > > > > > sparse mmap layout.  No new IRQs required, no new irqfds, an
> > > > > > incremental change to the protocol, backwards compatible to the  
> > extent that a vendor driver requiring this will automatically fail 
> > migration.  
> > > > > >  
> > > > > right. looks we need to use this approach to solve the problem.
> > > > > thanks for your guide.
> > > > > so I'll abandon the current remap irq way for dirty tracking
> > > > > during live migration.
> > > > > but anyway, it demos how to customize irq_types in vendor drivers.
> > > > > then, what do you think about patches 1-5?  
> > > >
> > > > In broad strokes, I don't think we've found the right solution yet.
> > > > I really question whether it's supportable to parcel out vfio-pci
> > > > like this and I don't know how I'd support unraveling whether we
> > > > have a bug in vfio-pci, the vendor driver, or how the vendor driver
> > > > is making use of vfio-pci.
> > > >
> > > > Let me also ask, why does any of this need to be in the kernel?  We
> > > > spend 5 patches slicing up vfio-pci so that we can register a vendor
> > > > driver and have that vendor driver call into vfio-pci as it sees fit.
> > > > We have two patches creating device specific interrupts and a BAR
> > > > remapping scheme that we've decided we don't need.  That brings us
> > > > to the actual i40e vendor driver, where the first patch is simply
> > > > making the vendor driver work like vfio-pci already does, the second
> > > > patch is handling the migration region, and the third patch is
> > > > implementing the BAR remapping IRQ that we decided we don't need.
> > > > It's difficult to actually find the small bit of code that's
> > > > required to support migration outside of just dealing with the
> > > > protocol we've defined to expose this from the kernel.  So why are
> > > > we trying to do this in the kernel?  We have quirk support in QEMU,
> > > > we can easily flip MemoryRegions on and off, etc.  What access to
> > > > the device outside of what vfio-pci provides to the user, and
> > > > therefore QEMU, is necessary to implement this migration support for
> > > > i40e VFs?  Is this just an exercise in making use of the migration
> > > > interface?  Thanks,
> > > >  
> > > hi Alex
> > >
> > > There was a description of intention of this series in RFC v1
> > > (https://www.spinics.net/lists/kernel/msg3337337.html).
> > > sorry, I didn't include it in starting from RFC v2.
> > >
> > > "
> > > The reason why we don't choose the way of writing mdev parent driver
> > > is that  
> > 
> > I didn't mention an mdev approach, I'm asking what are we accomplishing by
> > doing this in the kernel at all versus exposing the device as normal through
> > vfio-pci and providing the migration support in QEMU.  Are you actually
> > leveraging having some sort of access to the PF in supporting migration of 
> > the
> > VF?  Is vfio-pci masking the device in a way that prevents migrating the 
> > state
> > from QEMU?
> >   
> > > (1) VFs are almost all

Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION

2020-06-05 Thread Alex Williamson
On Thu, 4 Jun 2020 22:02:31 -0400
Yan Zhao  wrote:

> On Wed, Jun 03, 2020 at 10:10:58PM -0600, Alex Williamson wrote:
> > On Wed, 3 Jun 2020 22:42:28 -0400
> > Yan Zhao  wrote:
> >   
> > > On Wed, Jun 03, 2020 at 05:04:52PM -0600, Alex Williamson wrote:  
> > > > On Tue, 2 Jun 2020 21:40:58 -0400
> > > > Yan Zhao  wrote:
> > > > 
> > > > > On Tue, Jun 02, 2020 at 01:34:35PM -0600, Alex Williamson wrote:
> > > > > > I'm not at all happy with this.  Why do we need to hide the 
> > > > > > migration
> > > > > > sparse mmap from the user until migration time?  What if instead we
> > > > > > introduced a new VFIO_REGION_INFO_CAP_SPARSE_MMAP_SAVING capability
> > > > > > where the existing capability is the normal runtime sparse setup and
> > > > > > the user is required to use this new one prior to enabled 
> > > > > > device_state
> > > > > > with _SAVING.  The vendor driver could then simply track mmap vmas 
> > > > > > to
> > > > > > the region and refuse to change device_state if there are 
> > > > > > outstanding
> > > > > > mmaps conflicting with the _SAVING sparse mmap layout.  No new IRQs
> > > > > > required, no new irqfds, an incremental change to the protocol,
> > > > > > backwards compatible to the extent that a vendor driver requiring 
> > > > > > this
> > > > > > will automatically fail migration.
> > > > > >   
> > > > > right. looks we need to use this approach to solve the problem.
> > > > > thanks for your guide.
> > > > > so I'll abandon the current remap irq way for dirty tracking during 
> > > > > live
> > > > > migration.
> > > > > but anyway, it demos how to customize irq_types in vendor drivers.
> > > > > then, what do you think about patches 1-5?
> > > > 
> > > > In broad strokes, I don't think we've found the right solution yet.  I
> > > > really question whether it's supportable to parcel out vfio-pci like
> > > > this and I don't know how I'd support unraveling whether we have a bug
> > > > in vfio-pci, the vendor driver, or how the vendor driver is making use
> > > > of vfio-pci.
> > > >
> > > > Let me also ask, why does any of this need to be in the kernel?  We
> > > > spend 5 patches slicing up vfio-pci so that we can register a vendor
> > > > driver and have that vendor driver call into vfio-pci as it sees fit.
> > > > We have two patches creating device specific interrupts and a BAR
> > > > remapping scheme that we've decided we don't need.  That brings us to
> > > > the actual i40e vendor driver, where the first patch is simply making
> > > > the vendor driver work like vfio-pci already does, the second patch is
> > > > handling the migration region, and the third patch is implementing the
> > > > BAR remapping IRQ that we decided we don't need.  It's difficult to
> > > > actually find the small bit of code that's required to support
> > > > migration outside of just dealing with the protocol we've defined to
> > > > expose this from the kernel.  So why are we trying to do this in the
> > > > kernel?  We have quirk support in QEMU, we can easily flip
> > > > MemoryRegions on and off, etc.  What access to the device outside of
> > > > what vfio-pci provides to the user, and therefore QEMU, is necessary to
> > > > implement this migration support for i40e VFs?  Is this just an
> > > > exercise in making use of the migration interface?  Thanks,
> > > > 
> > > hi Alex
> > > 
> > > There was a description of intention of this series in RFC v1
> > > (https://www.spinics.net/lists/kernel/msg3337337.html).
> > > sorry, I didn't include it in starting from RFC v2.
> > > 
> > > "
> > > The reason why we don't choose the way of writing mdev parent driver is
> > > that  
> > 
> > I didn't mention an mdev approach, I'm asking what are we accomplishing
> > by doing this in the kernel at all versus exposing the device as normal
> > through vfio-pci and providing the migration support in QEMU.  Are you
> > actually leveraging having some sort of access to the PF in supporting
> > migration of the VF?  Is vfio-pci masking the device in a way that
> > prevents migra

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-06-05 Thread Alex Williamson
On Fri, 5 Jun 2020 11:22:24 +0100
"Dr. David Alan Gilbert"  wrote:

> * Alex Williamson (alex.william...@redhat.com) wrote:
> > On Wed, 3 Jun 2020 01:24:43 -0400
> > Yan Zhao  wrote:
> >   
> > > On Tue, Jun 02, 2020 at 09:55:28PM -0600, Alex Williamson wrote:  
> > > > On Tue, 2 Jun 2020 23:19:48 -0400
> > > > Yan Zhao  wrote:
> > > > 
> > > > > On Tue, Jun 02, 2020 at 04:55:27PM -0600, Alex Williamson wrote:
> > > > > > On Wed, 29 Apr 2020 20:39:50 -0400
> > > > > > Yan Zhao  wrote:
> > > > > >   
> > > > > > > On Wed, Apr 29, 2020 at 05:48:44PM +0800, Dr. David Alan Gilbert 
> > > > > > > wrote:
> > > > > > >   
> > > > > > > > > > > > > > > > > > > > An mdev type is meant to define a 
> > > > > > > > > > > > > > > > > > > > software compatible interface, so in
> > > > > > > > > > > > > > > > > > > > the case of mdev->mdev migration, 
> > > > > > > > > > > > > > > > > > > > doesn't migrating to a different type
> > > > > > > > > > > > > > > > > > > > fail the most basic of compatibility 
> > > > > > > > > > > > > > > > > > > > tests that we expect userspace to
> > > > > > > > > > > > > > > > > > > > perform?  IOW, if two mdev types are 
> > > > > > > > > > > > > > > > > > > > migration compatible, it seems a
> > > > > > > > > > > > > > > > > > > > prerequisite to that is that they 
> > > > > > > > > > > > > > > > > > > > provide the same software interface,
> > > > > > > > > > > > > > > > > > > > which means they should be the same 
> > > > > > > > > > > > > > > > > > > > mdev type.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > In the hybrid cases of mdev->phys or 
> > > > > > > > > > > > > > > > > > > > phys->mdev, how does a
> > > > > > > > > > > > > > > > > > > management
> > > > > > > > > > > > > > > > > > > > tool begin to even guess what might be 
> > > > > > > > > > > > > > > > > > > > compatible?  Are we expecting
> > > > > > > > > > > > > > > > > > > > libvirt to probe ever device with this 
> > > > > > > > > > > > > > > > > > > > attribute in the system?  Is
> > > > > > > > > > > > > > > > > > > > there going to be a new class hierarchy 
> > > > > > > > > > > > > > > > > > > > created to enumerate all
> > > > > > > > > > > > > > > > > > > > possible migrate-able devices?
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > yes, management tool needs to guess and 
> > > > > > > > > > > > > > > > > > > test migration compatible
> > > > > > > > > > > > > > > > > > > between two devices. But I think it's not 
> > > > > > > > > > > > > > > > > > > the problem only for
> > > > > > > > > > > > > > > > > > > mdev->phys or phys->mdev. even for 
> > > > > > > > > > > > > > > > > > > mdev->mdev, management tool needs
> > > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > 

Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION

2020-06-03 Thread Alex Williamson
On Wed, 3 Jun 2020 22:42:28 -0400
Yan Zhao  wrote:

> On Wed, Jun 03, 2020 at 05:04:52PM -0600, Alex Williamson wrote:
> > On Tue, 2 Jun 2020 21:40:58 -0400
> > Yan Zhao  wrote:
> >   
> > > On Tue, Jun 02, 2020 at 01:34:35PM -0600, Alex Williamson wrote:  
> > > > I'm not at all happy with this.  Why do we need to hide the migration
> > > > sparse mmap from the user until migration time?  What if instead we
> > > > introduced a new VFIO_REGION_INFO_CAP_SPARSE_MMAP_SAVING capability
> > > > where the existing capability is the normal runtime sparse setup and
> > > > the user is required to use this new one prior to enabled device_state
> > > > with _SAVING.  The vendor driver could then simply track mmap vmas to
> > > > the region and refuse to change device_state if there are outstanding
> > > > mmaps conflicting with the _SAVING sparse mmap layout.  No new IRQs
> > > > required, no new irqfds, an incremental change to the protocol,
> > > > backwards compatible to the extent that a vendor driver requiring this
> > > > will automatically fail migration.
> > > > 
> > > right. looks we need to use this approach to solve the problem.
> > > thanks for your guide.
> > > so I'll abandon the current remap irq way for dirty tracking during live
> > > migration.
> > > but anyway, it demos how to customize irq_types in vendor drivers.
> > > then, what do you think about patches 1-5?  
> > 
> > In broad strokes, I don't think we've found the right solution yet.  I
> > really question whether it's supportable to parcel out vfio-pci like
> > this and I don't know how I'd support unraveling whether we have a bug
> > in vfio-pci, the vendor driver, or how the vendor driver is making use
> > of vfio-pci.
> >
> > Let me also ask, why does any of this need to be in the kernel?  We
> > spend 5 patches slicing up vfio-pci so that we can register a vendor
> > driver and have that vendor driver call into vfio-pci as it sees fit.
> > We have two patches creating device specific interrupts and a BAR
> > remapping scheme that we've decided we don't need.  That brings us to
> > the actual i40e vendor driver, where the first patch is simply making
> > the vendor driver work like vfio-pci already does, the second patch is
> > handling the migration region, and the third patch is implementing the
> > BAR remapping IRQ that we decided we don't need.  It's difficult to
> > actually find the small bit of code that's required to support
> > migration outside of just dealing with the protocol we've defined to
> > expose this from the kernel.  So why are we trying to do this in the
> > kernel?  We have quirk support in QEMU, we can easily flip
> > MemoryRegions on and off, etc.  What access to the device outside of
> > what vfio-pci provides to the user, and therefore QEMU, is necessary to
> > implement this migration support for i40e VFs?  Is this just an
> > exercise in making use of the migration interface?  Thanks,
> >   
> hi Alex
> 
> There was a description of intention of this series in RFC v1
> (https://www.spinics.net/lists/kernel/msg3337337.html).
> sorry, I didn't include it in starting from RFC v2.
> 
> "
> The reason why we don't choose the way of writing mdev parent driver is
> that

I didn't mention an mdev approach, I'm asking what are we accomplishing
by doing this in the kernel at all versus exposing the device as normal
through vfio-pci and providing the migration support in QEMU.  Are you
actually leveraging having some sort of access to the PF in supporting
migration of the VF?  Is vfio-pci masking the device in a way that
prevents migrating the state from QEMU?

> (1) VFs are almost all the time directly passthroughed. Directly binding
> to vfio-pci can make most of the code shared/reused. If we write a
> vendor specific mdev parent driver, most of the code (like passthrough
> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
> actually a duplicated and tedious work.
> (2) For features like dynamically trap/untrap pci bars, if they are in
> vfio-pci, they can be available to most people without repeated code
> copying and re-testing.
> (3) with a 1:1 mdev driver which passes through VFs most of the time, people
> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
> it runs into a real migration need. However, if vfio-pci is bound
> initially, they have no chance to do live migration when there's a need
> later.
> "
> particularly, there're some devices (like NVMe) they purely reply on
> vfio-pci to do devic

Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION

2020-06-03 Thread Alex Williamson
On Tue, 2 Jun 2020 21:40:58 -0400
Yan Zhao  wrote:

> On Tue, Jun 02, 2020 at 01:34:35PM -0600, Alex Williamson wrote:
> > I'm not at all happy with this.  Why do we need to hide the migration
> > sparse mmap from the user until migration time?  What if instead we
> > introduced a new VFIO_REGION_INFO_CAP_SPARSE_MMAP_SAVING capability
> > where the existing capability is the normal runtime sparse setup and
> > the user is required to use this new one prior to enabled device_state
> > with _SAVING.  The vendor driver could then simply track mmap vmas to
> > the region and refuse to change device_state if there are outstanding
> > mmaps conflicting with the _SAVING sparse mmap layout.  No new IRQs
> > required, no new irqfds, an incremental change to the protocol,
> > backwards compatible to the extent that a vendor driver requiring this
> > will automatically fail migration.
> >   
> right. looks we need to use this approach to solve the problem.
> thanks for your guide.
> so I'll abandon the current remap irq way for dirty tracking during live
> migration.
> but anyway, it demos how to customize irq_types in vendor drivers.
> then, what do you think about patches 1-5?

In broad strokes, I don't think we've found the right solution yet.  I
really question whether it's supportable to parcel out vfio-pci like
this and I don't know how I'd support unraveling whether we have a bug
in vfio-pci, the vendor driver, or how the vendor driver is making use
of vfio-pci.

Let me also ask, why does any of this need to be in the kernel?  We
spend 5 patches slicing up vfio-pci so that we can register a vendor
driver and have that vendor driver call into vfio-pci as it sees fit.
We have two patches creating device specific interrupts and a BAR
remapping scheme that we've decided we don't need.  That brings us to
the actual i40e vendor driver, where the first patch is simply making
the vendor driver work like vfio-pci already does, the second patch is
handling the migration region, and the third patch is implementing the
BAR remapping IRQ that we decided we don't need.  It's difficult to
actually find the small bit of code that's required to support
migration outside of just dealing with the protocol we've defined to
expose this from the kernel.  So why are we trying to do this in the
kernel?  We have quirk support in QEMU, we can easily flip
MemoryRegions on and off, etc.  What access to the device outside of
what vfio-pci provides to the user, and therefore QEMU, is necessary to
implement this migration support for i40e VFs?  Is this just an
exercise in making use of the migration interface?  Thanks,

Alex



[GIT PULL] VFIO updates for v5.8-rc1

2020-06-03 Thread Alex Williamson
Hi Linus,

The following changes since commit 9cb1fd0efd195590b828b9b865421ad345a4a145:

  Linux 5.7-rc7 (2020-05-24 15:32:54 -0700)

are available in the Git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v5.8-rc1

for you to fetch changes up to 4f085ca2f5a8047845ab2d6bbe97089daed28655:

  Merge branch 'v5.8/vfio/kirti-migration-fixes' into v5.8/vfio/next 
(2020-06-02 13:53:00 -0600)


VFIO updates for v5.8-rc1

 - Block accesses to disabled MMIO space (Alex Williamson)

 - VFIO device migration API (Kirti Wankhede)

 - type1 IOMMU dirty bitmap API and implementation (Kirti Wankhede)

 - PCI NULL capability masking (Alex Williamson)

 - Memory leak fixes (Qian Cai)

 - Reference leak fix (Qiushi Wu)


Alex Williamson (7):
  vfio/type1: Support faulting PFNMAP vmas
  vfio-pci: Fault mmaps to enable vma tracking
  vfio-pci: Invalidate mmaps and block MMIO access on disabled memory
  vfio-pci: Mask cap zero
  Merge branches 'v5.8/vfio/alex-block-mmio-v3', 
'v5.8/vfio/alex-zero-cap-v2' and 'v5.8/vfio/qian-leak-fixes' into v5.8/vfio/next
  Merge branch 'qiushi-wu-mdev-ref-v1' into v5.8/vfio/next
  Merge branch 'v5.8/vfio/kirti-migration-fixes' into v5.8/vfio/next

Kirti Wankhede (10):
  vfio: UAPI for migration interface for device state
  vfio iommu: Remove atomicity of ref_count of pinned pages
  vfio iommu: Cache pgsize_bitmap in struct vfio_iommu
  vfio iommu: Add ioctl definition for dirty pages tracking
  vfio iommu: Implementation of ioctl for dirty pages tracking
  vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap
  vfio iommu: Add migration capability to report supported features
  vfio: Selective dirty page tracking if IOMMU backed device pins pages
  vfio iommu: Use shift operation for 64-bit integer division
  vfio iommu: typecast corrections

Qian Cai (2):
  vfio/pci: fix memory leaks in alloc_perm_bits()
  vfio/pci: fix memory leaks of eventfd ctx

Qiushi Wu (1):
  vfio/mdev: Fix reference count leak in add_mdev_supported_type

 drivers/vfio/mdev/mdev_sysfs.c  |   2 +-
 drivers/vfio/pci/vfio_pci.c | 353 +++--
 drivers/vfio/pci/vfio_pci_config.c  |  50 ++-
 drivers/vfio/pci/vfio_pci_intrs.c   |  14 +
 drivers/vfio/pci/vfio_pci_private.h |  15 +
 drivers/vfio/pci/vfio_pci_rdwr.c|  24 +-
 drivers/vfio/vfio.c |  13 +-
 drivers/vfio/vfio_iommu_type1.c | 609 
 include/linux/vfio.h|   4 +-
 include/uapi/linux/vfio.h   | 319 +++
 10 files changed, 1301 insertions(+), 102 deletions(-)



Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-06-03 Thread Alex Williamson
On Wed, 3 Jun 2020 01:24:43 -0400
Yan Zhao  wrote:

> On Tue, Jun 02, 2020 at 09:55:28PM -0600, Alex Williamson wrote:
> > On Tue, 2 Jun 2020 23:19:48 -0400
> > Yan Zhao  wrote:
> >   
> > > On Tue, Jun 02, 2020 at 04:55:27PM -0600, Alex Williamson wrote:  
> > > > On Wed, 29 Apr 2020 20:39:50 -0400
> > > > Yan Zhao  wrote:
> > > > 
> > > > > On Wed, Apr 29, 2020 at 05:48:44PM +0800, Dr. David Alan Gilbert 
> > > > > wrote:
> > > > > 
> > > > > > > > > > > > > > > > > > An mdev type is meant to define a software 
> > > > > > > > > > > > > > > > > > compatible interface, so in
> > > > > > > > > > > > > > > > > > the case of mdev->mdev migration, doesn't 
> > > > > > > > > > > > > > > > > > migrating to a different type
> > > > > > > > > > > > > > > > > > fail the most basic of compatibility tests 
> > > > > > > > > > > > > > > > > > that we expect userspace to
> > > > > > > > > > > > > > > > > > perform?  IOW, if two mdev types are 
> > > > > > > > > > > > > > > > > > migration compatible, it seems a
> > > > > > > > > > > > > > > > > > prerequisite to that is that they provide 
> > > > > > > > > > > > > > > > > > the same software interface,
> > > > > > > > > > > > > > > > > > which means they should be the same mdev 
> > > > > > > > > > > > > > > > > > type.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > In the hybrid cases of mdev->phys or 
> > > > > > > > > > > > > > > > > > phys->mdev, how does a  
> > > > > > > > > > > > > > > > > management  
> > > > > > > > > > > > > > > > > > tool begin to even guess what might be 
> > > > > > > > > > > > > > > > > > compatible?  Are we expecting
> > > > > > > > > > > > > > > > > > libvirt to probe ever device with this 
> > > > > > > > > > > > > > > > > > attribute in the system?  Is
> > > > > > > > > > > > > > > > > > there going to be a new class hierarchy 
> > > > > > > > > > > > > > > > > > created to enumerate all
> > > > > > > > > > > > > > > > > > possible migrate-able devices?
> > > > > > > > > > > > > > > > > >  
> > > > > > > > > > > > > > > > > yes, management tool needs to guess and test 
> > > > > > > > > > > > > > > > > migration compatible
> > > > > > > > > > > > > > > > > between two devices. But I think it's not the 
> > > > > > > > > > > > > > > > > problem only for
> > > > > > > > > > > > > > > > > mdev->phys or phys->mdev. even for 
> > > > > > > > > > > > > > > > > mdev->mdev, management tool needs
> > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > first assume that the two mdevs have the same 
> > > > > > > > > > > > > > > > > type of parent devices
> > > > > > > > > > > > > > > > > (e.g.their pciids are equal). otherwise, it's 
> > > > > > > > > > > > > > > > > still enumerating
> > > > > > > > > > > > > > > > > possibilities.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-06-02 Thread Alex Williamson
On Tue, 2 Jun 2020 23:19:48 -0400
Yan Zhao  wrote:

> On Tue, Jun 02, 2020 at 04:55:27PM -0600, Alex Williamson wrote:
> > On Wed, 29 Apr 2020 20:39:50 -0400
> > Yan Zhao  wrote:
> >   
> > > On Wed, Apr 29, 2020 at 05:48:44PM +0800, Dr. David Alan Gilbert wrote:
> > >   
> > > > > > > > > > > > > > > > An mdev type is meant to define a software 
> > > > > > > > > > > > > > > > compatible interface, so in
> > > > > > > > > > > > > > > > the case of mdev->mdev migration, doesn't 
> > > > > > > > > > > > > > > > migrating to a different type
> > > > > > > > > > > > > > > > fail the most basic of compatibility tests that 
> > > > > > > > > > > > > > > > we expect userspace to
> > > > > > > > > > > > > > > > perform?  IOW, if two mdev types are migration 
> > > > > > > > > > > > > > > > compatible, it seems a
> > > > > > > > > > > > > > > > prerequisite to that is that they provide the 
> > > > > > > > > > > > > > > > same software interface,
> > > > > > > > > > > > > > > > which means they should be the same mdev type.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > In the hybrid cases of mdev->phys or 
> > > > > > > > > > > > > > > > phys->mdev, how does a
> > > > > > > > > > > > > > > management
> > > > > > > > > > > > > > > > tool begin to even guess what might be 
> > > > > > > > > > > > > > > > compatible?  Are we expecting
> > > > > > > > > > > > > > > > libvirt to probe ever device with this 
> > > > > > > > > > > > > > > > attribute in the system?  Is
> > > > > > > > > > > > > > > > there going to be a new class hierarchy created 
> > > > > > > > > > > > > > > > to enumerate all
> > > > > > > > > > > > > > > > possible migrate-able devices?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > yes, management tool needs to guess and test 
> > > > > > > > > > > > > > > migration compatible
> > > > > > > > > > > > > > > between two devices. But I think it's not the 
> > > > > > > > > > > > > > > problem only for
> > > > > > > > > > > > > > > mdev->phys or phys->mdev. even for mdev->mdev, 
> > > > > > > > > > > > > > > management tool needs
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > first assume that the two mdevs have the same 
> > > > > > > > > > > > > > > type of parent devices
> > > > > > > > > > > > > > > (e.g.their pciids are equal). otherwise, it's 
> > > > > > > > > > > > > > > still enumerating
> > > > > > > > > > > > > > > possibilities.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > on the other hand, for two mdevs,
> > > > > > > > > > > > > > > mdev1 from pdev1, its mdev_type is 1/2 of pdev1;
> > > > > > > > > > > > > > > mdev2 from pdev2, its mdev_type is 1/4 of pdev2;
> > > > > > > > > > > > > > > if pdev2 is exactly 2 times of pdev1, why not 
> > > > > > > > > > > > > > > allow migration between
> > > > > > > > > > > > > > > mdev1 <-> mdev2.
> > > > > > > > > > > > > > 
> > > > 

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-06-02 Thread Alex Williamson
On Wed, 29 Apr 2020 20:39:50 -0400
Yan Zhao  wrote:

> On Wed, Apr 29, 2020 at 05:48:44PM +0800, Dr. David Alan Gilbert wrote:
> 
> > > > > > > > > > > > > > An mdev type is meant to define a software 
> > > > > > > > > > > > > > compatible interface, so in
> > > > > > > > > > > > > > the case of mdev->mdev migration, doesn't migrating 
> > > > > > > > > > > > > > to a different type
> > > > > > > > > > > > > > fail the most basic of compatibility tests that we 
> > > > > > > > > > > > > > expect userspace to
> > > > > > > > > > > > > > perform?  IOW, if two mdev types are migration 
> > > > > > > > > > > > > > compatible, it seems a
> > > > > > > > > > > > > > prerequisite to that is that they provide the same 
> > > > > > > > > > > > > > software interface,
> > > > > > > > > > > > > > which means they should be the same mdev type.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > In the hybrid cases of mdev->phys or phys->mdev, 
> > > > > > > > > > > > > > how does a  
> > > > > > > > > > > > > management  
> > > > > > > > > > > > > > tool begin to even guess what might be compatible?  
> > > > > > > > > > > > > > Are we expecting
> > > > > > > > > > > > > > libvirt to probe ever device with this attribute in 
> > > > > > > > > > > > > > the system?  Is
> > > > > > > > > > > > > > there going to be a new class hierarchy created to 
> > > > > > > > > > > > > > enumerate all
> > > > > > > > > > > > > > possible migrate-able devices?
> > > > > > > > > > > > > >  
> > > > > > > > > > > > > yes, management tool needs to guess and test 
> > > > > > > > > > > > > migration compatible
> > > > > > > > > > > > > between two devices. But I think it's not the problem 
> > > > > > > > > > > > > only for
> > > > > > > > > > > > > mdev->phys or phys->mdev. even for mdev->mdev, 
> > > > > > > > > > > > > management tool needs
> > > > > > > > > > > > > to
> > > > > > > > > > > > > first assume that the two mdevs have the same type of 
> > > > > > > > > > > > > parent devices
> > > > > > > > > > > > > (e.g.their pciids are equal). otherwise, it's still 
> > > > > > > > > > > > > enumerating
> > > > > > > > > > > > > possibilities.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > on the other hand, for two mdevs,
> > > > > > > > > > > > > mdev1 from pdev1, its mdev_type is 1/2 of pdev1;
> > > > > > > > > > > > > mdev2 from pdev2, its mdev_type is 1/4 of pdev2;
> > > > > > > > > > > > > if pdev2 is exactly 2 times of pdev1, why not allow 
> > > > > > > > > > > > > migration between
> > > > > > > > > > > > > mdev1 <-> mdev2.  
> > > > > > > > > > > > 
> > > > > > > > > > > > How could the manage tool figure out that 1/2 of pdev1 
> > > > > > > > > > > > is equivalent 
> > > > > > > > > > > > to 1/4 of pdev2? If we really want to allow such thing 
> > > > > > > > > > > > happen, the best
> > > > > > > > > > > > choice is to report the same mdev type on both pdev1 
> > > > > > > > > > > > and pdev2.  
> > > > > > > > > > > I think that's exactly the value of this 
> > > > > > > > > > > migration_version interface.
> > > > > > > > > > > the management tool can take advantage of this interface 
> > > > > > > > > > > to know if two
> > > > > > > > > > > devices are migration compatible, no matter they are 
> > > > > > > > > > > mdevs, non-mdevs,
> > > > > > > > > > > or mix.
> > > > > > > > > > > 
> > > > > > > > > > > as I know, (please correct me if not right), current 
> > > > > > > > > > > libvirt still
> > > > > > > > > > > requires manually generating mdev devices, and it just 
> > > > > > > > > > > duplicates src vm
> > > > > > > > > > > configuration to the target vm.
> > > > > > > > > > > for libvirt, currently it's always phys->phys and 
> > > > > > > > > > > mdev->mdev (and of the
> > > > > > > > > > > same mdev type).
> > > > > > > > > > > But it does not justify that hybrid cases should not be 
> > > > > > > > > > > allowed. otherwise,
> > > > > > > > > > > why do we need to introduce this migration_version 
> > > > > > > > > > > interface and leave
> > > > > > > > > > > the judgement of migration compatibility to vendor 
> > > > > > > > > > > driver? why not simply
> > > > > > > > > > > set the criteria to something like "pciids of parent 
> > > > > > > > > > > devices are equal,
> > > > > > > > > > > and mdev types are equal" ?
> > > > > > > > > > > 
> > > > > > > > > > >   
> > > > > > > > > > > > btw mdev<->phys just brings trouble to upper stack as 
> > > > > > > > > > > > Alex pointed out.   
> > > > > > > > > > > could you help me understand why it will bring trouble to 
> > > > > > > > > > > upper stack?
> > > > > > > > > > > 
> > > > > > > > > > > I think it just needs to read src migration_version under 
> > > > > > > > > > > src dev node,
> > > > > > > > > > > and test it in target migration version under target dev 
> > > > > > > > > > > node. 
> > > > > > > > > > > 
> > > > > > > > > > > after all, through this interface we just help the upper 
> > > > > > > > > > > layer
> > > > > > > > > > > 

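For orientation, the interface being debated is a per-device sysfs attribute; a
small sketch of the check a management tool would perform under the proposed
scheme, reading the source device's migration_version and writing it to the
candidate target, where a write error means "not compatible" (paths and error
handling are simplified, and the attribute follows the patch series, not a
merged kernel ABI):

#include <stdio.h>
#include <string.h>
#include <errno.h>

static int devices_migration_compatible(const char *src_sysfs,
                                        const char *dst_sysfs)
{
        char path[256], ver[256];
        size_t n;
        FILE *f;

        /* read the version string exposed by the source device */
        snprintf(path, sizeof(path), "%s/migration_version", src_sysfs);
        f = fopen(path, "r");
        if (!f)
                return -errno;
        n = fread(ver, 1, sizeof(ver) - 1, f);
        fclose(f);
        ver[n] = '\0';

        /* write it to the target device; the vendor driver judges compatibility */
        snprintf(path, sizeof(path), "%s/migration_version", dst_sysfs);
        f = fopen(path, "w");
        if (!f)
                return -errno;
        if (fwrite(ver, 1, strlen(ver), f) != strlen(ver) || fflush(f) != 0) {
                fclose(f);
                return -EINVAL;         /* vendor driver rejected the version */
        }
        fclose(f);
        return 0;                       /* compatible as far as the drivers can tell */
}
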
Re: linux-next: Tree for Jun 2 (vfio)

2020-06-02 Thread Alex Williamson
On Tue, 2 Jun 2020 09:16:15 -0600
Alex Williamson  wrote:

> On Tue, 2 Jun 2020 07:36:45 -0700
> Randy Dunlap  wrote:
> 
> > On 6/2/20 3:37 AM, Stephen Rothwell wrote:  
> > > Hi all,
> > > 
> > > News: The merge window has opened, so please do *not* add v5.9 material
> > > to your linux-next included branches until after v5.8-rc1 has been
> > > released.
> > > 
> > > Changes since 20200529:
> > > 
> > 
> > on i386:
> > 
> > ld: drivers/vfio/vfio_iommu_type1.o: in function `vfio_dma_populate_bitmap':
> > vfio_iommu_type1.c:(.text.unlikely+0x41): undefined reference to 
> > `__udivdi3'  
> 
> I think Kirti received a 0-day report on this.  Kirti, could you please
> post the fix you identified?  Thanks,

This should be resolved in the next refresh.  Thanks,

Alex

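For anyone chasing the same i386 build failure: __udivdi3 is the compiler
helper for 64-bit division, which the kernel does not link against, so a plain
'/' on u64 operands breaks 32-bit builds. The usual cures are a shift (when the
divisor is a power of two, as IOMMU page sizes are) or the div_u64() helpers.
A generic illustration, not the exact fix Kirti posted:

#include <linux/types.h>
#include <linux/bitops.h>
#include <linux/math64.h>

/* size / pgsize would emit a __udivdi3 call on 32-bit; avoid the plain '/'. */
static inline u64 example_npages(u64 size, unsigned long pgsize)
{
        return size >> __ffs(pgsize);           /* pgsize is a power of two */
}

static inline u64 example_div(u64 dividend, u32 divisor)
{
        return div_u64(dividend, divisor);      /* arbitrary 32-bit divisor */
}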


Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION

2020-06-02 Thread Alex Williamson
On Tue, 2 Jun 2020 04:28:58 -0400
Yan Zhao  wrote:

> On Mon, Jun 01, 2020 at 10:43:07AM -0600, Alex Williamson wrote:
> > On Mon, 1 Jun 2020 02:57:26 -0400
> > Yan Zhao  wrote:
> >   
> > > On Fri, May 29, 2020 at 03:45:47PM -0600, Alex Williamson wrote:  
> > > > On Sun, 17 May 2020 22:52:45 -0400
> > > > Yan Zhao  wrote:
> > > > 
> > > > > This is a virtual irq type.
> > > > > vendor driver triggers this irq when it wants to notify userspace to
> > > > > remap PCI BARs.
> > > > > 
> > > > > 1. vendor driver triggers this irq and packs the target bar number in
> > > > >the ctx count. i.e. "1 << bar_number".
> > > > >if a bit is set, the corresponding bar is to be remapped.
> > > > > 
> > > > > 2. userspace requery the specified PCI BAR from kernel and if flags of
> > > > > the bar regions are changed, it removes the old subregions and 
> > > > > attaches
> > > > > subregions according to the new flags.
> > > > > 
> > > > > 3. userspace notifies back to kernel by writing one to the eventfd of
> > > > > this irq.
> > > > > 
> > > > > Please check the corresponding qemu implementation from the reply of 
> > > > > this
> > > > > patch, and a sample usage in vendor driver in patch [10/10].
> > > > > 
> > > > > Cc: Kevin Tian 
> > > > > Signed-off-by: Yan Zhao 
> > > > > ---
> > > > >  include/uapi/linux/vfio.h | 11 +++
> > > > >  1 file changed, 11 insertions(+)
> > > > > 
> > > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > > > index 2d0d85c7c4d4..55895f75d720 100644
> > > > > --- a/include/uapi/linux/vfio.h
> > > > > +++ b/include/uapi/linux/vfio.h
> > > > > @@ -704,6 +704,17 @@ struct vfio_irq_info_cap_type {
> > > > >   __u32 subtype;  /* type specific */
> > > > >  };
> > > > >  
> > > > > +/* Bar Region Query IRQ TYPE */
> > > > > +#define VFIO_IRQ_TYPE_REMAP_BAR_REGION   (1)
> > > > > +
> > > > > +/* sub-types for VFIO_IRQ_TYPE_REMAP_BAR_REGION */
> > > > > +/*
> > > > > + * This irq notifies userspace to re-query BAR region and remaps the
> > > > > + * subregions.
> > > > > + */
> > > > > +#define VFIO_IRQ_SUBTYPE_REMAP_BAR_REGION(0)
> > > > 
> > > > Hi Yan,
> > > > 
> > > > How do we do this in a way that's backwards compatible?  Or maybe, how
> > > > do we perform a handshake between the vendor driver and userspace to
> > > > indicate this support?
> > > hi Alex
> > > thank you for your thoughtful review!
> > > 
> > > do you think below sequence can provide enough backwards compatibility?
> > > 
> > > - on vendor driver opening, it registers an irq of type
> > >   VFIO_IRQ_TYPE_REMAP_BAR_REGION, and reports to driver vfio-pci there's
> > >   1 vendor irq.
> > > 
> > > - after userspace detects the irq of type VFIO_IRQ_TYPE_REMAP_BAR_REGION
> > >   it enables it by signaling ACTION_TRIGGER.
> > >   
> > > - on receiving this ACTION_TRIGGER, vendor driver will try to setup a
> > >   virqfd to monitor file write to the fd of this irq, enable this irq
> > >   and return its enabling status to userspace.  
> > 
> > I'm not sure I follow here, what's the purpose of the irqfd?  When and
> > what does the user signal by writing to the irqfd?  Is this an ACK
> > mechanism?  Is this a different fd from the signaling eventfd?  
> it's not the kvm irqfd.
> in the vendor driver side, once ACTION_TRIGGER is received for the remap irq,
> interface vfio_virqfd_enable() is called to monitor writes to the eventfd of
> this irq.
> 
> when vendor driver signals the eventfd, remap handler in QEMU is
> called and it writes to the eventfd after remapping is done.
> Then the virqfd->handler registered in vendor driver is called to receive
> the QEMU ack.

This seems racy to use the same fd as both an eventfd and irqfd, does
the host need to wait for the user to service the previous IRQ before
sending a new one?  Don't we have gaps where the user is either reading
or writing where we can lose an interrupt?  Does the user also write a
bitmap?  How do we avoid get

Re: linux-next: Tree for Jun 2 (vfio)

2020-06-02 Thread Alex Williamson
On Tue, 2 Jun 2020 07:36:45 -0700
Randy Dunlap  wrote:

> On 6/2/20 3:37 AM, Stephen Rothwell wrote:
> > Hi all,
> > 
> > News: The merge window has opened, so please do *not* add v5.9 material
> > to your linux-next included branches until after v5.8-rc1 has been
> > released.
> > 
> > Changes since 20200529:
> >   
> 
> on i386:
> 
> ld: drivers/vfio/vfio_iommu_type1.o: in function `vfio_dma_populate_bitmap':
> vfio_iommu_type1.c:(.text.unlikely+0x41): undefined reference to `__udivdi3'

I think Kirti received a 0-day report on this.  Kirti, could you please
post the fix you identified?  Thanks,

Alex



Re: [PATCH v2 2/9] vfio/fsl-mc: Scan DPRC objects on vfio-fsl-mc driver bind

2020-06-01 Thread Alex Williamson
On Fri,  8 May 2020 10:20:32 +0300
Diana Craciun  wrote:

> The DPRC (Data Path Resource Container) device is a bus device and has
> child devices attached to it. When the vfio-fsl-mc driver is probed
> the DPRC is scanned and the child devices discovered and initialized.
> 
> Signed-off-by: Bharat Bhushan 
> Signed-off-by: Diana Craciun 
> ---
>  drivers/vfio/fsl-mc/vfio_fsl_mc.c | 106 ++
>  drivers/vfio/fsl-mc/vfio_fsl_mc_private.h |   1 +
>  2 files changed, 107 insertions(+)
> 
> diff --git a/drivers/vfio/fsl-mc/vfio_fsl_mc.c 
> b/drivers/vfio/fsl-mc/vfio_fsl_mc.c
> index 8b53c2a25b32..ea301ba81225 100644
> --- a/drivers/vfio/fsl-mc/vfio_fsl_mc.c
> +++ b/drivers/vfio/fsl-mc/vfio_fsl_mc.c
> @@ -15,6 +15,8 @@
>  
>  #include "vfio_fsl_mc_private.h"
>  
> +static struct fsl_mc_driver vfio_fsl_mc_driver;
> +
>  static int vfio_fsl_mc_open(void *device_data)
>  {
>   if (!try_module_get(THIS_MODULE))
> @@ -84,6 +86,69 @@ static const struct vfio_device_ops vfio_fsl_mc_ops = {
>   .mmap   = vfio_fsl_mc_mmap,
>  };
>  
> +static int vfio_fsl_mc_bus_notifier(struct notifier_block *nb,
> + unsigned long action, void *data)
> +{
> + struct vfio_fsl_mc_device *vdev = container_of(nb,
> + struct vfio_fsl_mc_device, nb);
> + struct device *dev = data;
> + struct fsl_mc_device *mc_dev = to_fsl_mc_device(dev);
> + struct fsl_mc_device *mc_cont = to_fsl_mc_device(mc_dev->dev.parent);
> +
> + if (action == BUS_NOTIFY_ADD_DEVICE &&
> + vdev->mc_dev == mc_cont) {
> + mc_dev->driver_override = kasprintf(GFP_KERNEL, "%s",
> + vfio_fsl_mc_ops.name);
> + dev_info(dev, "Setting driver override for device in dprc %s\n",
> +  dev_name(&mc_cont->dev));
> + } else if (action == BUS_NOTIFY_BOUND_DRIVER &&
> + vdev->mc_dev == mc_cont) {
> + struct fsl_mc_driver *mc_drv = to_fsl_mc_driver(dev->driver);
> +
> + if (mc_drv && mc_drv != &vfio_fsl_mc_driver)
> + dev_warn(dev, "Object %s bound to driver %s while DPRC bound to vfio-fsl-mc\n",
> +  dev_name(dev), mc_drv->driver.name);
> + }
> +
> + return 0;
> +}
> +
> +static int vfio_fsl_mc_init_device(struct vfio_fsl_mc_device *vdev)
> +{
> + struct fsl_mc_device *mc_dev = vdev->mc_dev;
> + int ret = 0;
> +
> + /* Non-dprc devices share mc_io from parent */
> + if (!is_fsl_mc_bus_dprc(mc_dev)) {
> + struct fsl_mc_device *mc_cont = to_fsl_mc_device(mc_dev->dev.parent);
> +
> + mc_dev->mc_io = mc_cont->mc_io;
> + return 0;
> + }
> +
> + vdev->nb.notifier_call = vfio_fsl_mc_bus_notifier;
> + ret = bus_register_notifier(&fsl_mc_bus_type, &vdev->nb);
> + if (ret)
> + return ret;
> +
> + /* open DPRC, allocate a MC portal */
> + ret = dprc_setup(mc_dev);
> + if (ret < 0) {
> + dev_err(&mc_dev->dev, "Failed to setup DPRC (error = %d)\n", ret);
> + bus_unregister_notifier(&fsl_mc_bus_type, &vdev->nb);
> + return ret;
> + }
> +
> + ret = dprc_scan_container(mc_dev, false);
> + if (ret < 0) {
> + dev_err(&mc_dev->dev, "Container scanning failed: %d\n", ret);
> + bus_unregister_notifier(&fsl_mc_bus_type, &vdev->nb);
> + dprc_cleanup(mc_dev);
> + }
> +
> + return 0;


The last error branch falls through, did you intend to return 'ret'
here to capture that?  Also, nit, ret doesn't need to be initialized.

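A minimal sketch of the tail of vfio_fsl_mc_init_device() with that
fall-through closed, assuming dprc_cleanup() is the right unwind here, so the
scan failure is actually propagated:

        ret = dprc_scan_container(mc_dev, false);
        if (ret < 0) {
                dev_err(&mc_dev->dev, "Container scanning failed: %d\n", ret);
                bus_unregister_notifier(&fsl_mc_bus_type, &vdev->nb);
                dprc_cleanup(mc_dev);
                return ret;     /* don't fall through to the success return */
        }

        return 0;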

> +}
> +
>  static int vfio_fsl_mc_probe(struct fsl_mc_device *mc_dev)
>  {
>   struct iommu_group *group;
> @@ -112,9 +177,42 @@ static int vfio_fsl_mc_probe(struct fsl_mc_device 
> *mc_dev)
>   return ret;
>   }
>  
> + ret = vfio_fsl_mc_init_device(vdev);
> + if (ret) {
> + vfio_iommu_group_put(group, dev);
> + return ret;
> + }


The error condition value is a bit inconsistent between
vfio_fsl_mc_init_device() and here, <0 vs !0.  Thanks,

Alex


> +
>   return ret;
>  }
>  
> +static int vfio_fsl_mc_device_remove(struct device *dev, void *data)
> +{
> + struct fsl_mc_device *mc_dev;
> +
> + WARN_ON(!dev);
> + mc_dev = to_fsl_mc_device(dev);
> + if (WARN_ON(!mc_dev))
> + return -ENODEV;
> +
> + kfree(mc_dev->driver_override);
> + mc_dev->driver_override = NULL;
> +
> + /*
> +  * The device-specific remove callback will get invoked by device_del()
> +  */
> + device_del(&mc_dev->dev);
> + put_device(&mc_dev->dev);
> +
> + return 0;
> +}
> +
> +static void vfio_fsl_mc_cleanup_dprc(struct fsl_mc_device *mc_dev)
> +{
> + device_for_each_child(&mc_dev->dev, NULL, vfio_fsl_mc_device_remove);
> + dprc_cleanup(mc_dev);
> +}
> +
>  static int vfio_fsl_mc_remove(struct fsl_mc_device *mc_dev)
>  {
>   struct 

Re: [PATCH v2 4/9] vfio/fsl-mc: Implement VFIO_DEVICE_GET_REGION_INFO ioctl call

2020-06-01 Thread Alex Williamson
On Fri,  8 May 2020 10:20:34 +0300
Diana Craciun  wrote:

> Expose to userspace information about the memory regions.
> 
> Signed-off-by: Bharat Bhushan 
> Signed-off-by: Diana Craciun 
> ---
>  drivers/vfio/fsl-mc/vfio_fsl_mc.c | 77 ++-
>  drivers/vfio/fsl-mc/vfio_fsl_mc_private.h | 19 ++
>  2 files changed, 95 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/fsl-mc/vfio_fsl_mc.c 
> b/drivers/vfio/fsl-mc/vfio_fsl_mc.c
> index 8a4d3203b176..c162fa27c02c 100644
> --- a/drivers/vfio/fsl-mc/vfio_fsl_mc.c
> +++ b/drivers/vfio/fsl-mc/vfio_fsl_mc.c
> @@ -17,16 +17,72 @@
>  
>  static struct fsl_mc_driver vfio_fsl_mc_driver;
>  
> +static int vfio_fsl_mc_regions_init(struct vfio_fsl_mc_device *vdev)
> +{
> + struct fsl_mc_device *mc_dev = vdev->mc_dev;
> + int count = mc_dev->obj_desc.region_count;
> + int i;
> +
> + vdev->regions = kcalloc(count, sizeof(struct vfio_fsl_mc_region),
> + GFP_KERNEL);
> + if (!vdev->regions)
> + return -ENOMEM;
> +
> + for (i = 0; i < count; i++) {
> + struct resource *res = &mc_dev->regions[i];
> +
> + vdev->regions[i].addr = res->start;
> + vdev->regions[i].size = PAGE_ALIGN((resource_size(res)));


Why do we need this page alignment to resource_size()?  It makes me
worry that we're actually giving the user access to an extended size
that might overlap another device or to MMIO that's not backed by any
device and might trigger a fault when accessed.  In vfio-pci we make
some effort to reserve resources when we want to allow mmap of sub-page
ranges.  Thanks,

Alex


> + vdev->regions[i].flags = 0;
> + }
> +
> + vdev->num_regions = mc_dev->obj_desc.region_count;
> + return 0;
> +}
> +
> +static void vfio_fsl_mc_regions_cleanup(struct vfio_fsl_mc_device *vdev)
> +{
> + vdev->num_regions = 0;
> + kfree(vdev->regions);
> +}
> +
>  static int vfio_fsl_mc_open(void *device_data)
>  {
> + struct vfio_fsl_mc_device *vdev = device_data;
> + int ret;
> +
>   if (!try_module_get(THIS_MODULE))
>   return -ENODEV;
>  
> + mutex_lock(&vdev->driver_lock);
> + if (!vdev->refcnt) {
> + ret = vfio_fsl_mc_regions_init(vdev);
> + if (ret)
> + goto err_reg_init;
> + }
> + vdev->refcnt++;
> +
> + mutex_unlock(&vdev->driver_lock);
> +
>   return 0;
> +
> +err_reg_init:
> + mutex_unlock(&vdev->driver_lock);
> + module_put(THIS_MODULE);
> + return ret;
>  }
>  
>  static void vfio_fsl_mc_release(void *device_data)
>  {
> + struct vfio_fsl_mc_device *vdev = device_data;
> +
> + mutex_lock(&vdev->driver_lock);
> +
> + if (!(--vdev->refcnt))
> + vfio_fsl_mc_regions_cleanup(vdev);
> +
> + mutex_unlock(&vdev->driver_lock);
> +
>   module_put(THIS_MODULE);
>  }
>  
> @@ -59,7 +115,25 @@ static long vfio_fsl_mc_ioctl(void *device_data, unsigned 
> int cmd,
>   }
>   case VFIO_DEVICE_GET_REGION_INFO:
>   {
> - return -ENOTTY;
> + struct vfio_region_info info;
> +
> + minsz = offsetofend(struct vfio_region_info, offset);
> +
> + if (copy_from_user(&info, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (info.argsz < minsz)
> + return -EINVAL;
> +
> + if (info.index >= vdev->num_regions)
> + return -EINVAL;
> +
> + /* map offset to the physical address  */
> + info.offset = VFIO_FSL_MC_INDEX_TO_OFFSET(info.index);
> + info.size = vdev->regions[info.index].size;
> + info.flags = vdev->regions[info.index].flags;
> +
> + return copy_to_user((void __user *)arg, &info, minsz);
>   }
>   case VFIO_DEVICE_GET_IRQ_INFO:
>   {
> @@ -201,6 +275,7 @@ static int vfio_fsl_mc_probe(struct fsl_mc_device *mc_dev)
>   vfio_iommu_group_put(group, dev);
>   return ret;
>   }
> + mutex_init(&vdev->driver_lock);
>  
>   return ret;
>  }
> diff --git a/drivers/vfio/fsl-mc/vfio_fsl_mc_private.h 
> b/drivers/vfio/fsl-mc/vfio_fsl_mc_private.h
> index 37d61eaa58c8..818dfd3df4db 100644
> --- a/drivers/vfio/fsl-mc/vfio_fsl_mc_private.h
> +++ b/drivers/vfio/fsl-mc/vfio_fsl_mc_private.h
> @@ -7,9 +7,28 @@
>  #ifndef VFIO_FSL_MC_PRIVATE_H
>  #define VFIO_FSL_MC_PRIVATE_H
>  
> +#define VFIO_FSL_MC_OFFSET_SHIFT 40
> +#define VFIO_FSL_MC_OFFSET_MASK (((u64)(1) << VFIO_FSL_MC_OFFSET_SHIFT) - 1)
> +
> +#define VFIO_FSL_MC_OFFSET_TO_INDEX(off) ((off) >> VFIO_FSL_MC_OFFSET_SHIFT)
> +
> +#define VFIO_FSL_MC_INDEX_TO_OFFSET(index)   \
> + ((u64)(index) << VFIO_FSL_MC_OFFSET_SHIFT)
> +
> +struct vfio_fsl_mc_region {
> + u32 flags;
> + u32 type;
> + u64 addr;
> + resource_size_t size;
> +};
> +
>  struct vfio_fsl_mc_device {
>   struct 

Re: [PATCH v2 5/9] vfio/fsl-mc: Allow userspace to MMAP fsl-mc device MMIO regions

2020-06-01 Thread Alex Williamson
On Fri,  8 May 2020 10:20:35 +0300
Diana Craciun  wrote:

> Allow userspace to mmap device regions for direct access of
> fsl-mc devices.
> 
> Signed-off-by: Bharat Bhushan 
> Signed-off-by: Diana Craciun 
> ---
>  drivers/vfio/fsl-mc/vfio_fsl_mc.c | 60 ++-
>  drivers/vfio/fsl-mc/vfio_fsl_mc_private.h |  2 +
>  2 files changed, 60 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/fsl-mc/vfio_fsl_mc.c 
> b/drivers/vfio/fsl-mc/vfio_fsl_mc.c
> index c162fa27c02c..a92c6c97c29a 100644
> --- a/drivers/vfio/fsl-mc/vfio_fsl_mc.c
> +++ b/drivers/vfio/fsl-mc/vfio_fsl_mc.c
> @@ -33,7 +33,11 @@ static int vfio_fsl_mc_regions_init(struct 
> vfio_fsl_mc_device *vdev)
>  
>   vdev->regions[i].addr = res->start;
>   vdev->regions[i].size = PAGE_ALIGN((resource_size(res)));
> - vdev->regions[i].flags = 0;
> + vdev->regions[i].flags = VFIO_REGION_INFO_FLAG_MMAP;
> + vdev->regions[i].flags |= VFIO_REGION_INFO_FLAG_READ;
> + if (!(mc_dev->regions[i].flags & IORESOURCE_READONLY))
> + vdev->regions[i].flags |= VFIO_REGION_INFO_FLAG_WRITE;


I'm a little confused that we advertise read and write here, but it's
only relative to the mmap and even later in the series where we add
read and write callback support, it's only for the dprc and dpmcp
devices.  Doesn't this leave dpaa2 accelerator devices with only mmap
access?  vfio doesn't really have a way to specify that a device only
has mmap access and the read/write interfaces can be quite useful when
debugging or tracing.
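
For reference, a sketch of the kind of read path being asked about, assuming a hypothetical helper that maps the region on demand (illustrative only, not part of this series; a real implementation would cache the mapping rather than remap per access):

static ssize_t vfio_fsl_mc_region_read32(struct vfio_fsl_mc_region *region,
					 char __user *buf, loff_t off)
{
	void __iomem *map;
	u32 val;

	if (off + sizeof(val) > region->size || !IS_ALIGNED(off, sizeof(val)))
		return -EINVAL;

	map = ioremap(region->addr + off, sizeof(val));
	if (!map)
		return -ENOMEM;

	val = ioread32(map);
	iounmap(map);

	if (copy_to_user(buf, &val, sizeof(val)))
		return -EFAULT;

	return sizeof(val);
}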

> + vdev->regions[i].type = mc_dev->regions[i].flags & 
> IORESOURCE_BITS;
>   }
>  
>   vdev->num_regions = mc_dev->obj_desc.region_count;
> @@ -164,9 +168,61 @@ static ssize_t vfio_fsl_mc_write(void *device_data, 
> const char __user *buf,
>   return -EINVAL;
>  }
>  
> +static int vfio_fsl_mc_mmap_mmio(struct vfio_fsl_mc_region region,
> +  struct vm_area_struct *vma)
> +{
> + u64 size = vma->vm_end - vma->vm_start;
> + u64 pgoff, base;
> +
> + pgoff = vma->vm_pgoff &
> + ((1U << (VFIO_FSL_MC_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> + base = pgoff << PAGE_SHIFT;
> +
> + if (region.size < PAGE_SIZE || base + size > region.size)

We've already aligned region.size up to PAGE_SIZE, so that test can't
be true.  Whether it was a good idea to do that alignment, I'm not so
sure.
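
If the unaligned size were kept as suggested above, the test becomes meaningful again; roughly (a sketch, assuming region.size holds the raw resource size):

	/*
	 * With the raw size this check is no longer dead code: it refuses
	 * sub-page regions and any mapping that runs past the real end of
	 * the region.
	 */
	if (region.size < PAGE_SIZE || base + size > region.size)
		return -EINVAL;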

> + return -EINVAL;
> +
> + if (!(region.type & VFIO_DPRC_REGION_CACHEABLE))
> + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +
> + vma->vm_pgoff = (region.addr >> PAGE_SHIFT) + pgoff;
> +
> + return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> +size, vma->vm_page_prot);
> +}
> +
>  static int vfio_fsl_mc_mmap(void *device_data, struct vm_area_struct *vma)
>  {
> - return -EINVAL;
> + struct vfio_fsl_mc_device *vdev = device_data;
> + struct fsl_mc_device *mc_dev = vdev->mc_dev;
> + int index;
> +
> + index = vma->vm_pgoff >> (VFIO_FSL_MC_OFFSET_SHIFT - PAGE_SHIFT);
> +
> + if (vma->vm_end < vma->vm_start)
> + return -EINVAL;
> + if (vma->vm_start & ~PAGE_MASK)
> + return -EINVAL;
> + if (vma->vm_end & ~PAGE_MASK)
> + return -EINVAL;
> + if (!(vma->vm_flags & VM_SHARED))
> + return -EINVAL;
> + if (index >= vdev->num_regions)
> + return -EINVAL;
> +
> + if (!(vdev->regions[index].flags & VFIO_REGION_INFO_FLAG_MMAP))
> + return -EINVAL;
> +
> + if (!(vdev->regions[index].flags & VFIO_REGION_INFO_FLAG_READ)
> + && (vma->vm_flags & VM_READ))
> + return -EINVAL;
> +
> + if (!(vdev->regions[index].flags & VFIO_REGION_INFO_FLAG_WRITE)
> + && (vma->vm_flags & VM_WRITE))
> + return -EINVAL;
> +
> + vma->vm_private_data = mc_dev;
> +
> + return vfio_fsl_mc_mmap_mmio(vdev->regions[index], vma);
>  }
>  
>  static const struct vfio_device_ops vfio_fsl_mc_ops = {
> diff --git a/drivers/vfio/fsl-mc/vfio_fsl_mc_private.h 
> b/drivers/vfio/fsl-mc/vfio_fsl_mc_private.h
> index 818dfd3df4db..89d2e2a602d8 100644
> --- a/drivers/vfio/fsl-mc/vfio_fsl_mc_private.h
> +++ b/drivers/vfio/fsl-mc/vfio_fsl_mc_private.h
> @@ -15,6 +15,8 @@
>  #define VFIO_FSL_MC_INDEX_TO_OFFSET(index)   \
>   ((u64)(index) << VFIO_FSL_MC_OFFSET_SHIFT)
>  
> +#define VFIO_DPRC_REGION_CACHEABLE   0x0001


There appears to be some sort of magic mapping of this to bus specific
bits in the IORESOURCE_BITS range.  If the bus specific bits get
shifted we'll be subtly broken here.  Can't we use the bus #define so
that we can't get out of sync?  Thanks,

Alex
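
For example, something along these lines, assuming the fsl-mc bus header (<linux/fsl/mc.h>) exports the region attribute bit as FSL_MC_REGION_CACHEABLE (a sketch; the actual bus definition should be checked):

	/* derive cacheability from the bus definition, not a local copy */
	if (!(region.type & FSL_MC_REGION_CACHEABLE))
		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);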


> +
>  struct vfio_fsl_mc_region {
>   u32 flags;
>   u32 type;



Re: [PATCH] PCI: Relax ACS requirement for Intel RCiEP devices.

2020-06-01 Thread Alex Williamson
On Mon, 1 Jun 2020 14:40:23 -0700
"Raj, Ashok"  wrote:

> On Mon, Jun 01, 2020 at 04:25:19PM -0500, Bjorn Helgaas wrote:
> > On Thu, May 28, 2020 at 01:57:42PM -0700, Ashok Raj wrote:  
> > > All Intel platforms guarantee that all root complex implementations
> > > must send transactions up to IOMMU for address translations. Hence for
> > > RCiEP devices that are Vendor ID Intel, can claim exception for lack of
> > > ACS support.
> > > 
> > > 
> > > 3.16 Root-Complex Peer to Peer Considerations
> > > When DMA remapping is enabled, peer-to-peer requests through the
> > > Root-Complex must be handled
> > > as follows:
> > > • The input address in the request is translated (through first-level,
> > >   second-level or nested translation) to a host physical address (HPA).
> > >   The address decoding for peer addresses must be done only on the
> > >   translated HPA. Hardware implementations are free to further limit
> > >   peer-to-peer accesses to specific host physical address regions
> > >   (or to completely disallow peer-forwarding of translated requests).
> > > • Since address translation changes the contents (address field) of
> > >   the PCI Express Transaction Layer Packet (TLP), for PCI Express
> > >   peer-to-peer requests with ECRC, the Root-Complex hardware must use
> > >   the new ECRC (re-computed with the translated address) if it
> > >   decides to forward the TLP as a peer request.
> > > • Root-ports, and multi-function root-complex integrated endpoints, may
> > >   support additional peer-to-peer control features by supporting PCI 
> > > Express
> > >   Access Control Services (ACS) capability. Refer to ACS capability in
> > >   PCI Express specifications for details.
> > > 
> > > Since Linux didn't give special treatment to allow this exception, certain
> > > RCiEP MFD devices are getting grouped in a single iommu group. This
> > > doesn't permit a single device to be assigned to a guest for instance.
> > > 
> > > In one vendor system: Device 14.x were grouped in a single IOMMU group.
> > > 
> > > /sys/kernel/iommu_groups/5/devices/0000:00:14.0
> > > /sys/kernel/iommu_groups/5/devices/0000:00:14.2
> > > /sys/kernel/iommu_groups/5/devices/0000:00:14.3
> > > 
> > > After the patch:
> > > /sys/kernel/iommu_groups/5/devices/0000:00:14.0
> > > /sys/kernel/iommu_groups/5/devices/0000:00:14.2
> > > /sys/kernel/iommu_groups/6/devices/0000:00:14.3 <<< new group
> > > 
> > > 14.0 and 14.2 are integrated devices, but legacy end points.
> > > Whereas 14.3 was a PCIe compliant RCiEP.
> > > 
> > > 00:14.3 Network controller: Intel Corporation Device 9df0 (rev 30)
> > > Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
> > > 
> > > This permits assigning this device to a guest VM.
> > > 
> > > Fixes: f096c061f552 ("iommu: Rework iommu_group_get_for_pci_dev()")
> > > Signed-off-by: Ashok Raj 
> > > To: Joerg Roedel 
> > > To: Bjorn Helgaas 
> > > Cc: linux-kernel@vger.kernel.org
> > > Cc: io...@lists.linux-foundation.org
> > > Cc: Lu Baolu 
> > > Cc: Alex Williamson 
> > > Cc: Darrel Goeddel 
> > > Cc: Mark Scott ,
> > > Cc: Romil Sharma 
> > > Cc: Ashok Raj   
> > 
> > Tentatively applied to pci/virtualization for v5.8, thanks!
> > 
> > The spec says this handling must apply "when DMA remapping is
> > enabled".  The patch does not check whether DMA remapping is enabled.
> > 
> > Is there any case where DMA remapping is *not* enabled, and we rely on
> > this patch to tell us whether the device is isolated?  It sounds like
> > it may give the wrong answer in such a case?
> > 
> > Can you confirm that I don't need to worry about this?
> 
> I think all of this makes sense only when DMA remapping is enabled.
> Otherwise there is no enforcement for isolation. 

Yep, without an IOMMU all devices operate in the same IOVA space and we
have no isolation.  We only enable ACS when an IOMMU driver requests it
and it's only used by IOMMU code to determine IOMMU grouping of
devices.  Thanks,

Alex



Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION

2020-06-01 Thread Alex Williamson
On Mon, 1 Jun 2020 02:57:26 -0400
Yan Zhao  wrote:

> On Fri, May 29, 2020 at 03:45:47PM -0600, Alex Williamson wrote:
> > On Sun, 17 May 2020 22:52:45 -0400
> > Yan Zhao  wrote:
> >   
> > > This is a virtual irq type.
> > > vendor driver triggers this irq when it wants to notify userspace to
> > > remap PCI BARs.
> > > 
> > > 1. vendor driver triggers this irq and packs the target bar number in
> > >the ctx count. i.e. "1 << bar_number".
> > >if a bit is set, the corresponding bar is to be remapped.
> > > 
> > > 2. userspace requery the specified PCI BAR from kernel and if flags of
> > > the bar regions are changed, it removes the old subregions and attaches
> > > subregions according to the new flags.
> > > 
> > > 3. userspace notifies back to kernel by writing one to the eventfd of
> > > this irq.
> > > 
> > > Please check the corresponding qemu implementation from the reply of this
> > > patch, and a sample usage in vendor driver in patch [10/10].
> > > 
> > > Cc: Kevin Tian 
> > > Signed-off-by: Yan Zhao 
> > > ---
> > >  include/uapi/linux/vfio.h | 11 +++
> > >  1 file changed, 11 insertions(+)
> > > 
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index 2d0d85c7c4d4..55895f75d720 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -704,6 +704,17 @@ struct vfio_irq_info_cap_type {
> > >   __u32 subtype;  /* type specific */
> > >  };
> > >  
> > > +/* Bar Region Query IRQ TYPE */
> > > +#define VFIO_IRQ_TYPE_REMAP_BAR_REGION   (1)
> > > +
> > > +/* sub-types for VFIO_IRQ_TYPE_REMAP_BAR_REGION */
> > > +/*
> > > + * This irq notifies userspace to re-query BAR region and remaps the
> > > + * subregions.
> > > + */
> > > +#define VFIO_IRQ_SUBTYPE_REMAP_BAR_REGION   (0)
> > 
> > Hi Yan,
> > 
> > How do we do this in a way that's backwards compatible?  Or maybe, how
> > do we perform a handshake between the vendor driver and userspace to
> > indicate this support?  
> hi Alex
> thank you for your thoughtful review!
> 
> do you think below sequence can provide enough backwards compatibility?
> 
> - on vendor driver opening, it registers an irq of type
>   VFIO_IRQ_TYPE_REMAP_BAR_REGION, and reports to driver vfio-pci there's
>   1 vendor irq.
> 
> - after userspace detects the irq of type VFIO_IRQ_TYPE_REMAP_BAR_REGION
>   it enables it by signaling ACTION_TRIGGER.
>   
> - on receiving this ACTION_TRIGGER, vendor driver will try to setup a
>   virqfd to monitor file write to the fd of this irq, enable this irq
>   and return its enabling status to userspace.

I'm not sure I follow here, what's the purpose of the irqfd?  When and
what does the user signal by writing to the irqfd?  Is this an ACK
mechanism?  Is this a different fd from the signaling eventfd?

> > Would the vendor driver refuse to change
> > device_state in the migration region if the user has not enabled this
> > IRQ?  
> yes, vendor driver can refuse to change device_state if the irq
> VFIO_IRQ_TYPE_REMAP_BAR_REGION is not enabled.
> in my sample i40e_vf driver (patch 10/10), it implemented this logic
> like below:
> 
> i40e_vf_set_device_state
> |-> case VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING:
> |  ret = i40e_vf_prepare_dirty_track(i40e_vf_dev);
>   |->ret = i40e_vf_remap_bars(i40e_vf_dev, true);
>|->if 
> (!i40e_vf_dev->remap_irq_ctx.init)
> return -ENODEV;
> 
> 
> (i40e_vf_dev->remap_irq_ctx.init is set in below path)
> i40e_vf_ioctl(cmd==VFIO_DEVICE_SET_IRQS)
> |->i40e_vf_set_irq_remap_bars
>|->i40e_vf_enable_remap_bars_irq
>|-> vf_dev->remap_irq_ctx.init = true;

This should be a documented aspect of the uapi, not left to vendor
discretion to implement.
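
Something along these lines in the uapi header is the sort of documentation being asked for (wording illustrative only, not an agreed interface):

/*
 * VFIO_IRQ_TYPE_REMAP_BAR_REGION
 *
 * The vendor driver signals this IRQ with a bitmap in the eventfd payload
 * naming the region indexes userspace must re-query via
 * VFIO_DEVICE_GET_REGION_INFO.  Userspace acknowledges completion by
 * writing 1 to the IRQ's eventfd.  Userspace must enable the IRQ via
 * VFIO_DEVICE_SET_IRQS before the vendor driver permits device_state
 * transitions in the migration region.
 */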
 
> > 
> > Everything you've described in the commit log needs to be in this
> > header, we can't have the usage protocol buried in a commit log.  It  
> got it! I'll move all descriptions in commit logs to this header so that
> readers can understand the whole picture here.
> 
> > also seems like this is unnecessarily PCI specific.  Can't the count
> > bitmap simply indicate the region index to re-evaluate?  Maybe you were  
> yes, it is possible. but what prevented me from do

Re: [PATCH] vfio/mdev: Fix reference count leak in add_mdev_supported_type.

2020-05-29 Thread Alex Williamson
On Wed, 27 May 2020 21:01:09 -0500
wu000...@umn.edu wrote:

> From: Qiushi Wu 
> 
> kobject_init_and_add() takes reference even when it fails.
> If this function returns an error, kobject_put() must be called to
> properly clean up the memory associated with the object. Thus,
> replace kfree() by kobject_put() to fix this issue. Previous
> commit "b8eb718348b8" fixed a similar problem.
> 
> Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> Signed-off-by: Qiushi Wu 
> ---

Applied to vfio next branch for v5.8 with Connie's and Kirti's reviews.
Thanks,

Alex

>  drivers/vfio/mdev/mdev_sysfs.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> index 8ad14e5c02bf..917fd84c1c6f 100644
> --- a/drivers/vfio/mdev/mdev_sysfs.c
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -110,7 +110,7 @@ static struct mdev_type *add_mdev_supported_type(struct 
> mdev_parent *parent,
>  "%s-%s", dev_driver_string(parent->dev),
>  group->name);
>   if (ret) {
> - kfree(type);
> + kobject_put(&type->kobj);
>   return ERR_PTR(ret);
>   }
>  



Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION

2020-05-29 Thread Alex Williamson
On Sun, 17 May 2020 22:52:45 -0400
Yan Zhao  wrote:

> This is a virtual irq type.
> vendor driver triggers this irq when it wants to notify userspace to
> remap PCI BARs.
> 
> 1. vendor driver triggers this irq and packs the target bar number in
>the ctx count. i.e. "1 << bar_number".
>if a bit is set, the corresponding bar is to be remapped.
> 
> 2. userspace requery the specified PCI BAR from kernel and if flags of
> the bar regions are changed, it removes the old subregions and attaches
> subregions according to the new flags.
> 
> 3. userspace notifies back to kernel by writing one to the eventfd of
> this irq.
> 
> Please check the corresponding qemu implementation from the reply of this
> patch, and a sample usage in vendor driver in patch [10/10].
> 
> Cc: Kevin Tian 
> Signed-off-by: Yan Zhao 
> ---
>  include/uapi/linux/vfio.h | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 2d0d85c7c4d4..55895f75d720 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -704,6 +704,17 @@ struct vfio_irq_info_cap_type {
>   __u32 subtype;  /* type specific */
>  };
>  
> +/* Bar Region Query IRQ TYPE */
> +#define VFIO_IRQ_TYPE_REMAP_BAR_REGION   (1)
> +
> +/* sub-types for VFIO_IRQ_TYPE_REMAP_BAR_REGION */
> +/*
> + * This irq notifies userspace to re-query BAR region and remaps the
> + * subregions.
> + */
> +#define VFIO_IRQ_SUBTYPE_REMAP_BAR_REGION   (0)

Hi Yan,

How do we do this in a way that's backwards compatible?  Or maybe, how
do we perform a handshake between the vendor driver and userspace to
indicate this support?  Would the vendor driver refuse to change
device_state in the migration region if the user has not enabled this
IRQ?

Everything you've described in the commit log needs to be in this
header, we can't have the usage protocol buried in a commit log.  It
also seems like this is unnecessarily PCI specific.  Can't the count
bitmap simply indicate the region index to re-evaluate?  Maybe you were
worried about running out of bits in the ctx count?  An IRQ per region
could resolve that, but maybe we could also just add another IRQ for
the next bitmap of regions.  I assume that the bitmap can indicate
multiple regions to re-evaluate, but that should be documented.

Also, what sort of service requirements does this imply?  Would the
vendor driver send this IRQ when the user tries to set the device_state
to _SAVING and therefore we'd require the user to accept, implement the
mapping change, and acknowledge the IRQ all while waiting for the write
to device_state to return?  That implies quite a lot of asynchronous
support in the userspace driver.  Thanks,

Alex
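
On the userspace side, decoding the ctx count as a bitmap might look roughly like this (a sketch; remap_eventfd and requery_bar_region() are hypothetical names):

	uint64_t count, ack = 1;

	if (read(remap_eventfd, &count, sizeof(count)) == sizeof(count)) {
		for (unsigned int bar = 0; bar < 6; bar++) {
			if (count & (1ULL << bar))
				requery_bar_region(bar);
		}
		/* notify back to the kernel per step 3 above */
		if (write(remap_eventfd, &ack, sizeof(ack)) != sizeof(ack))
			perror("remap ack");
	}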

> +
> +
>  /**
>   * VFIO_DEVICE_SET_IRQS - _IOW(VFIO_TYPE, VFIO_BASE + 10, struct 
> vfio_irq_set)
>   *



Re: [PATCH] PCI: Relax ACS requirement for Intel RCiEP devices.

2020-05-28 Thread Alex Williamson
On Thu, 28 May 2020 13:57:42 -0700
Ashok Raj  wrote:

> All Intel platforms guarantee that all root complex implementations
> must send transactions up to IOMMU for address translations. Hence for
> RCiEP devices that are Vendor ID Intel, can claim exception for lack of
> ACS support.
> 
> 
> 3.16 Root-Complex Peer to Peer Considerations
> When DMA remapping is enabled, peer-to-peer requests through the
> Root-Complex must be handled
> as follows:
> • The input address in the request is translated (through first-level,
>   second-level or nested translation) to a host physical address (HPA).
>   The address decoding for peer addresses must be done only on the
>   translated HPA. Hardware implementations are free to further limit
>   peer-to-peer accesses to specific host physical address regions
>   (or to completely disallow peer-forwarding of translated requests).
> • Since address translation changes the contents (address field) of
>   the PCI Express Transaction Layer Packet (TLP), for PCI Express
>   peer-to-peer requests with ECRC, the Root-Complex hardware must use
>   the new ECRC (re-computed with the translated address) if it
>   decides to forward the TLP as a peer request.
> • Root-ports, and multi-function root-complex integrated endpoints, may
>   support additional peer-to-peer control features by supporting PCI Express
>   Access Control Services (ACS) capability. Refer to ACS capability in
>   PCI Express specifications for details.
> 
> Since Linux didn't give special treatment to allow this exception, certain
> RCiEP MFD devices are getting grouped in a single iommu group. This
> doesn't permit a single device to be assigned to a guest for instance.
> 
> In one vendor system: Device 14.x were grouped in a single IOMMU group.
> 
> /sys/kernel/iommu_groups/5/devices/0000:00:14.0
> /sys/kernel/iommu_groups/5/devices/0000:00:14.2
> /sys/kernel/iommu_groups/5/devices/0000:00:14.3
> 
> After the patch:
> /sys/kernel/iommu_groups/5/devices/0000:00:14.0
> /sys/kernel/iommu_groups/5/devices/0000:00:14.2
> /sys/kernel/iommu_groups/6/devices/0000:00:14.3 <<< new group
> 
> 14.0 and 14.2 are integrated devices, but legacy end points.
> Whereas 14.3 was a PCIe compliant RCiEP.
> 
> 00:14.3 Network controller: Intel Corporation Device 9df0 (rev 30)
> Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
> 
> This permits assigning this device to a guest VM.
> 
> Fixes: f096c061f552 ("iommu: Rework iommu_group_get_for_pci_dev()")

I don't really understand this Fixes tag.  This seems like a feature,
not a fix.  If you want it in stable releases as a feature, request it
via Cc: sta...@vger.kernel.org.  I'd drop that tag, that's my nit.
Otherwise:

Reviewed-by: Alex Williamson 

> Signed-off-by: Ashok Raj 
> To: Joerg Roedel 
> To: Bjorn Helgaas 
> Cc: linux-kernel@vger.kernel.org
> Cc: io...@lists.linux-foundation.org
> Cc: Lu Baolu 
> Cc: Alex Williamson 
> Cc: Darrel Goeddel 
> Cc: Mark Scott ,
> Cc: Romil Sharma 
> Cc: Ashok Raj 
> ---
> v2: Moved functionality from iommu to pci quirks - Alex Williamson
> 
>  drivers/pci/quirks.c | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 28c9a2409c50..63373ca0a3fe 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -4682,6 +4682,20 @@ static int pci_quirk_mf_endpoint_acs(struct pci_dev 
> *dev, u16 acs_flags)
>   PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_DT);
>  }
>  
> +static int pci_quirk_rciep_acs(struct pci_dev *dev, u16 acs_flags)
> +{
> + /*
> +  * RCiEP's are required to allow p2p only on translated addresses.
> +  * Refer to Intel VT-d specification Section 3.16 Root-Complex Peer
> +  * to Peer Considerations
> +  */
> + if (pci_pcie_type(dev) != PCI_EXP_TYPE_RC_END)
> + return -ENOTTY;
> +
> + return pci_acs_ctrl_enabled(acs_flags,
> + PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
> +}
> +
>  static int pci_quirk_brcm_acs(struct pci_dev *dev, u16 acs_flags)
>  {
>   /*
> @@ -4764,6 +4778,7 @@ static const struct pci_dev_acs_enabled {
>   /* I219 */
>   { PCI_VENDOR_ID_INTEL, 0x15b7, pci_quirk_mf_endpoint_acs },
>   { PCI_VENDOR_ID_INTEL, 0x15b8, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, PCI_ANY_ID, pci_quirk_rciep_acs },
>   /* QCOM QDF2xxx root ports */
>   { PCI_VENDOR_ID_QCOM, 0x0400, pci_quirk_qcom_rp_acs },
>   { PCI_VENDOR_ID_QCOM, 0x0401, pci_quirk_qcom_rp_acs },



Re: [PATCH] iommu: Relax ACS requirement for Intel RCiEP devices.

2020-05-26 Thread Alex Williamson
On Tue, 26 May 2020 15:17:35 -0700
Ashok Raj  wrote:

> All Intel platforms guarantee that all root complex implementations
> must send transactions up to IOMMU for address translations. Hence for
> RCiEP devices that are Vendor ID Intel, can claim exception for lack of
> ACS support.
> 
> 
> 3.16 Root-Complex Peer to Peer Considerations
> When DMA remapping is enabled, peer-to-peer requests through the
> Root-Complex must be handled
> as follows:
> • The input address in the request is translated (through first-level,
>   second-level or nested translation) to a host physical address (HPA).
>   The address decoding for peer addresses must be done only on the
>   translated HPA. Hardware implementations are free to further limit
>   peer-to-peer accesses to specific host physical address regions
>   (or to completely disallow peer-forwarding of translated requests).
> • Since address translation changes the contents (address field) of
>   the PCI Express Transaction Layer Packet (TLP), for PCI Express
>   peer-to-peer requests with ECRC, the Root-Complex hardware must use
>   the new ECRC (re-computed with the translated address) if it
>   decides to forward the TLP as a peer request.
> • Root-ports, and multi-function root-complex integrated endpoints, may
>   support additional peer-to-peer control features by supporting PCI Express
>   Access Control Services (ACS) capability. Refer to ACS capability in
>   PCI Express specifications for details.
> 
> Since Linux didn't give special treatment to allow this exception, certain
> RCiEP MFD devices are getting grouped in a single iommu group. This
> doesn't permit a single device to be assigned to a guest for instance.
> 
> In one vendor system: Device 14.x were grouped in a single IOMMU group.
> 
> /sys/kernel/iommu_groups/5/devices/0000:00:14.0
> /sys/kernel/iommu_groups/5/devices/0000:00:14.2
> /sys/kernel/iommu_groups/5/devices/0000:00:14.3
> 
> After the patch:
> /sys/kernel/iommu_groups/5/devices/0000:00:14.0
> /sys/kernel/iommu_groups/5/devices/0000:00:14.2
> /sys/kernel/iommu_groups/6/devices/0000:00:14.3 <<< new group
> 
> 14.0 and 14.2 are integrated devices, but legacy end points.
> Whereas 14.3 was a PCIe compliant RCiEP.
> 
> 00:14.3 Network controller: Intel Corporation Device 9df0 (rev 30)
> Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
> 
> This permits assigning this device to a guest VM.
> 
> Fixes: f096c061f552 ("iommu: Rework iommu_group_get_for_pci_dev()")
> Signed-off-by: Ashok Raj 
> To: Joerg Roedel 
> To: Bjorn Helgaas 
> Cc: linux-kernel@vger.kernel.org
> Cc: io...@lists.linux-foundation.org
> Cc: Lu Baolu 
> Cc: Alex Williamson 
> Cc: Darrel Goeddel 
> Cc: Mark Scott ,
> Cc: Romil Sharma 
> Cc: Ashok Raj 
> ---
>  drivers/iommu/iommu.c | 13 -
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 2b471419e26c..31b595dfedde 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1187,7 +1187,18 @@ static struct iommu_group 
> *get_pci_function_alias_group(struct pci_dev *pdev,
>   struct pci_dev *tmp = NULL;
>   struct iommu_group *group;
>  
> - if (!pdev->multifunction || pci_acs_enabled(pdev, REQ_ACS_FLAGS))
> + /*
> +  * Intel VT-d Specification Section 3.16, Root-Complex Peer to Peer
> +  * Considerations mandate that all transactions in RCiEP's and
> +  * even Integrated MFD's *must* be sent up to the IOMMU. P2P is
> +  * only possible on translated addresses. This gives enough
> +  * guarantee that such devices can be forgiven for lack of ACS
> +  * support.
> +  */
> + if (!pdev->multifunction ||
> + (pdev->vendor == PCI_VENDOR_ID_INTEL &&
> +  pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END) ||
> +  pci_acs_enabled(pdev, REQ_ACS_FLAGS))
>   return NULL;
>  
>   for_each_pci_dev(tmp) {

Hi Ashok,

As this is an Intel/VT-d standard, not a PCIe standard, why not
implement this in pci_dev_specific_acs_enabled() with all the other
quirks?  Thanks,

Alex



Re: [PATCH] iommu: Relax ACS requirement for RCiEP devices.

2020-05-26 Thread Alex Williamson
On Tue, 26 May 2020 11:06:48 -0700
"Raj, Ashok"  wrote:

> Hi Alex,
> 
> I was able to find better language in the IOMMU spec that guarantees 
> the behavior we need. See below.
> 
> 
> On Tue, May 05, 2020 at 09:34:14AM -0600, Alex Williamson wrote:
> > On Tue, 5 May 2020 07:56:06 -0700
> > "Raj, Ashok"  wrote:
> >   
> > > On Tue, May 05, 2020 at 08:05:14AM -0600, Alex Williamson wrote:  
> > > > On Mon, 4 May 2020 23:11:07 -0700
> > > > "Raj, Ashok"  wrote:
> > > > 
> > > > > Hi Alex
> > > > > 
> > > > > + Joerg, accidently missed in the Cc.
> > > > > 
> > > > > On Mon, May 04, 2020 at 11:19:36PM -0600, Alex Williamson wrote:
> > > > > > On Mon,  4 May 2020 21:42:16 -0700
> > > > > > Ashok Raj  wrote:
> > > > > >   
> > > > > > > PCIe Spec recommends we can relax ACS requirement for RCIEP 
> > > > > > > devices.
> > > > > > > 
> > > > > > > PCIe 5.0 Specification.
> > > > > > > 6.12 Access Control Services (ACS)
> > > > > > > Implementation of ACS in RCiEPs is permitted but not required. It 
> > > > > > > is
> > > > > > > explicitly permitted that, within a single Root Complex, some 
> > > > > > > RCiEPs
> > > > > > > implement ACS and some do not. It is strongly recommended that 
> > > > > > > Root Complex
> > > > > > > implementations ensure that all accesses originating from RCiEPs
> > > > > > > (PFs and VFs) without ACS capability are first subjected to 
> > > > > > > processing by
> > > > > > > the Translation Agent (TA) in the Root Complex before further 
> > > > > > > decoding and
> > > > > > > processing. The details of such Root Complex handling are outside 
> > > > > > > the scope
> > > > > > > of this specification.
> > > > > > > 
> > > > > > 
> > > > > > Is the language here really strong enough to make this change?  ACS 
> > > > > > is
> > > > > > an optional feature, so being permitted but not required is rather
> > > > > > meaningless.  The spec is also specifically avoiding the words 
> > > > > > "must"
> > > > > > or "shall" and even when emphasized with "strongly", we still only 
> > > > > > have
> > > > > > a recommendation that may or may not be honored.  This seems like a
> > > > > > weak basis for assuming that RCiEPs universally honor this
> > > > > > recommendation.  Thanks,
> > > > > >   
> > > > > 
> > > > > We are speaking about PCIe spec, where people write it about 5 years 
> > > > > ahead
> > > > > and every vendor tries to massage their product behavior with vague
> > > > > words like this..  :)
> > > > > 
> > > > > But honestly for any any RCiEP, or even integrated endpoints, there 
> > > > > is no way to send them except up north. These aren't behind a RP.
> > > > 
> > > > But they are multi-function devices and the spec doesn't define routing
> > > > within multifunction packages.  A single function RCiEP will already be
> > > > assumed isolated within its own group.
> > > 
> > > That's right. The other two devices only have legacy PCI headers. So 
> > > they can't claim to be RCiEP's but just integrated endpoints. The legacy
> > > devices don't even have a PCIe header.
> > > 
> > > I honestly don't know why these are grouped as MFD's in the first place.
> > >   
> > > >  
> > > > > I did check with couple folks who are part of the SIG, and seem to 
> > > > > agree
> > > > > that ACS treatment for RCiEP's doesn't mean much. 
> > > > > 
> > > > > I understand the language isn't strong, but it doesn't seem like ACS 
> > > > > should
> > > > > be a strong requirement for RCiEP's and reasonable to relax.
> > > > > 
> > > > > What are your thoughts? 
> > > > 
> > > > I think hardware vendors have ACS at their disposal to clarify when
> > > &g

Re: [PATCH -next] vfio/pci: fix a null-ptr-deref in vfio_config_free()

2020-05-26 Thread Alex Williamson
On Thu, 21 May 2020 21:18:29 -0400
Qian Cai  wrote:

> It is possible vfio_config_init() does not call vfio_cap_len(), and then
> vdev->msi_perm == NULL. Later, in vfio_config_free(), it could trigger a
> null-ptr-deref.
> 
>  BUG: kernel NULL pointer dereference, address: 
>  RIP: 0010:vfio_config_free+0x7a/0xe0 [vfio_pci]
>  vfio_config_free+0x7a/0xe0:
>  free_perm_bits at drivers/vfio/pci/vfio_pci_config.c:340
>  (inlined by) vfio_config_free at drivers/vfio/pci/vfio_pci_config.c:1760
>  Call Trace:
>   vfio_pci_release+0x3a4/0x9e0 [vfio_pci]
>   vfio_device_fops_release+0x50/0x80 [vfio]
>   __fput+0x200/0x460
>   fput+0xe/0x10
>   task_work_run+0x127/0x1b0
>   do_exit+0x782/0x10d0
>   do_group_exit+0xc7/0x1c0
>   __x64_sys_exit_group+0x2c/0x30
>   do_syscall_64+0x64/0x350
>   entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> Fixes: bea890bdb161 ("vfio/pci: fix memory leaks in alloc_perm_bits()")
> Signed-off-by: Qian Cai 
> ---
>  drivers/vfio/pci/vfio_pci_config.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)

I may get yelled at for it, but I need to break my next branch to fix
the lockdep issue you noted in my series, so I'm going to go ahead and
roll this into your previous patch.  Thanks,

Alex
 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c 
> b/drivers/vfio/pci/vfio_pci_config.c
> index d127a0c50940..8746c943247a 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -1757,9 +1757,11 @@ void vfio_config_free(struct vfio_pci_device *vdev)
>   vdev->vconfig = NULL;
>   kfree(vdev->pci_config_map);
>   vdev->pci_config_map = NULL;
> - free_perm_bits(vdev->msi_perm);
> - kfree(vdev->msi_perm);
> - vdev->msi_perm = NULL;
> + if (vdev->msi_perm) {
> + free_perm_bits(vdev->msi_perm);
> + kfree(vdev->msi_perm);
> + vdev->msi_perm = NULL;
> + }
>  }
>  
>  /*



Re: [PATCH v3 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory

2020-05-26 Thread Alex Williamson
On Tue, 26 May 2020 12:53:31 -0300
Jason Gunthorpe  wrote:

> On Tue, May 26, 2020 at 08:32:18AM -0600, Alex Williamson wrote:
> > > > Certainly there is no reason to optimize the fringe case of vfio
> > > > sleeping if there is an incorrect concurrent attempt to disable
> > > > a BAR.
> > > 
> > > If fixup_user_fault() (which is always with ALLOW_RETRY && !RETRY_NOWAIT) 
> > > is
> > > the only path for the new fault(), then current way seems ok.  Not sure 
> > > whether
> > > this would worth a WARN_ON_ONCE(RETRY_NOWAIT) in the fault() to be clear 
> > > of
> > > that fact.  
> > 
> > Thanks for the discussion over the weekend folks.  Peter, I take it
> > you'd be satisfied if this patch were updated as:
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index aabba6439a5b..35bd7cd4e268 100644
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -1528,6 +1528,13 @@ static vm_fault_t vfio_pci_mmap_fault(struct 
> > vm_fault *vmf)
> > struct vfio_pci_device *vdev = vma->vm_private_data;
> > vm_fault_t ret = VM_FAULT_NOPAGE;
> >  
> > +   /*
> > +* We don't expect to be called with NOWAIT and there are conflicting
> > +* opinions on whether NOWAIT suggests we shouldn't wait for locks or
> > +* just shouldn't wait for I/O.
> > +*/
> > +   WARN_ON_ONCE(vmf->flags & FAULT_FLAG_RETRY_NOWAIT);  
> 
> I don't think this is right, this implies there is some reason this
> code fails with FAULT_FLAG_RETRY_NOWAIT - but it is fine as written,
> AFAICT

Ok, Peter said he's fine either way, I'll use the patch as originally
posted and include Peter's R-b.  Thanks,

Alex



Re: [PATCH v3 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory

2020-05-26 Thread Alex Williamson
On Tue, 26 May 2020 09:49:54 -0400
Peter Xu  wrote:

> On Mon, May 25, 2020 at 09:37:05PM -0300, Jason Gunthorpe wrote:
> > On Mon, May 25, 2020 at 01:56:28PM -0700, John Hubbard wrote:  
> > > On 2020-05-25 09:56, Jason Gunthorpe wrote:  
> > > > On Mon, May 25, 2020 at 11:11:42AM -0400, Peter Xu wrote:  
> > > > > On Mon, May 25, 2020 at 11:46:51AM -0300, Jason Gunthorpe wrote:  
> > > > > > On Mon, May 25, 2020 at 10:28:06AM -0400, Peter Xu wrote:  
> > > > > > > On Mon, May 25, 2020 at 09:26:07AM -0300, Jason Gunthorpe wrote:  
> > > > > > > > On Sat, May 23, 2020 at 07:52:57PM -0400, Peter Xu wrote:
> > > > > > > >   
> > > > > > > > > For what I understand now, IMHO we should still need all 
> > > > > > > > > those handlings of
> > > > > > > > > FAULT_FLAG_RETRY_NOWAIT like in the initial version.  E.g., 
> > > > > > > > > IIUC KVM gup will
> > > > > > > > > try with FOLL_NOWAIT when async is allowed, before the 
> > > > > > > > > complete slow path.  I'm
> > > > > > > > > not sure what would be the side effect of that if fault() 
> > > > > > > > > blocked it.  E.g.,
> > > > > > > > > the caller could be in an atomic context.  
> > > > > > > > 
> > > > > > > > AFAICT FAULT_FLAG_RETRY_NOWAIT only impacts what happens when
> > > > > > > > VM_FAULT_RETRY is returned, which this doesn't do?  
> > > > > > > 
> > > > > > > Yes, that's why I think we should still properly return 
> > > > > > > VM_FAULT_RETRY if
> > > > > > > needed..  because IMHO it is still possible that the caller calls 
> > > > > > > with
> > > > > > > FAULT_FLAG_RETRY_NOWAIT.
> > > > > > > 
> > > > > > > My understanding is that FAULT_FLAG_RETRY_NOWAIT majorly means:
> > > > > > > 
> > > > > > >- We cannot release the mmap_sem, and,
> > > > > > >- We cannot sleep  
> > > > > > 
> > > > > > Sleeping looks fine, look at any FS implementation of fault, say,
> > > > > > xfs. The first thing it does is xfs_ilock() which does 
> > > > > > down_write().  
> > > > > 
> > > > > Yeah.  My wild guess is that maybe fs code will always be without
> > > > > FAULT_FLAG_RETRY_NOWAIT so it's safe to sleep unconditionally (e.g., 
> > > > > I think
> > > > > the general #PF should be fine to sleep in fault(); gup should be 
> > > > > special, but
> > > > > I didn't observe any gup code called upon file systems)?  
> > > > 
> > > > get_user_pages is called on filesystem backed pages.
> > > > 
> > > > I have no idea what FAULT_FLAG_RETRY_NOWAIT is supposed to do. Maybe
> > > > John was able to guess when he reworked that stuff?
> > > >   
> > > 
> > > Although I didn't end up touching that particular area, I'm sure it's 
> > > going
> > > to come up sometime soon, so I poked around just now, and found that
> > > FAULT_FLAG_RETRY_NOWAIT was added almost exactly 9 years ago. This flag 
> > > was
> > > intended to make KVM and similar things behave better when doing GUP on
> > > file-backed pages that might, or might not be in memory.
> > > 
> > > The idea is described in the changelog, but not in the code comments or
> > > Documentation, sigh:
> > > 
> > > commit 318b275fbca1ab9ec0862de71420e0e92c3d1aa7
> > > Author: Gleb Natapov 
> > > Date:   Tue Mar 22 16:30:51 2011 -0700
> > > 
> > > mm: allow GUP to fail instead of waiting on a page
> > > 
> > > GUP user may want to try to acquire a reference to a page if it is 
> > > already
> > > in memory, but not if IO, to bring it in, is needed.  For example KVM 
> > > may
> > > tell vcpu to schedule another guest process if current one is trying 
> > > to
> > > access swapped out page.  Meanwhile, the page will be swapped in and 
> > > the
> > > guest process, that depends on it, will be able to run again.
> > > 
> > > This patch adds FAULT_FLAG_RETRY_NOWAIT (suggested by Linus) and
> > > FOLL_NOWAIT follow_page flags.  FAULT_FLAG_RETRY_NOWAIT, when used in
> > > conjunction with VM_FAULT_ALLOW_RETRY, indicates to handle_mm_fault 
> > > that
> > > it shouldn't drop mmap_sem and wait on a page, but return 
> > > VM_FAULT_RETRY
> > > instead.  
> > 
> > So, from kvm's perspective it was to avoid excessively long blocking in
> > common paths when it could rejoin the completed IO by somehow waiting
> > on a page itself?
> > 
> > It all seems like it should not be used unless the page is going to go
> > to IO?  
> 
> I think NOWAIT is used as a common flag for kvm for its initial attempt to
> fault in a normal page, however...  I just noticed another fact that actually
> __get_user_pages() won't work with PFNMAP (check_vma_flags should fail), but
> KVM just started to support fault() for PFNMAP from commit add6a0cd1c5b (2016)
> using fixup_user_fault(), where nvidia seems to have a similar request to have
> a fault handler on some mapped BARs.
> 
> > 
> > Certainly there is no reason to optimize the fringe case of vfio
> > sleeping if there is an incorrect concurrent attempt to disable
> > a BAR.  
> 
> If fixup_user_fault() (which is always with ALLOW_RETRY && 

Re: [PATCH v3 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory

2020-05-23 Thread Alex Williamson
On Sat, 23 May 2020 15:34:17 -0400
Peter Xu  wrote:

> Hi, Alex,
> 
> On Fri, May 22, 2020 at 01:17:43PM -0600, Alex Williamson wrote:
> > @@ -1346,15 +1526,32 @@ static vm_fault_t vfio_pci_mmap_fault(struct 
> > vm_fault *vmf)
> >  {
> > struct vm_area_struct *vma = vmf->vma;
> > struct vfio_pci_device *vdev = vma->vm_private_data;
> > +   vm_fault_t ret = VM_FAULT_NOPAGE;
> > +
> > +   mutex_lock(>vma_lock);
> > +   down_read(&vdev->memory_lock);
> 
> I remembered to have seen the fault() handling FAULT_FLAG_RETRY_NOWAIT at 
> least
> in the very first version, but it's not here any more...  Could I ask what's
> the reason behind?  I probably have missed something along with the versions,
> I'm just not sure whether e.g. this would potentially block a GUP caller even
> if it's with FOLL_NOWAIT.

This is largely what v2 was about, from the cover letter:

Locking in 3/ is substantially changed to avoid the retry scenario
within the fault handler, therefore a caller who does not allow
retry will no longer receive a SIGBUS on contention.

The discussion thread starts here:

https://lore.kernel.org/kvm/20200501234849.gq26...@ziepe.ca/

Feel free to interject if there's something that doesn't make sense,
the idea is that since we've fixed the lock ordering we never need to
release one lock to wait for another, therefore we can wait for the
lock.  I'm under the impression that we can wait for the lock
regardless of the flags under these conditions.

> Side note: Another thing I thought about when reading this patch - there seems
> to have quite some possibility that the VFIO_DEVICE_PCI_HOT_RESET ioctl will
> start to return -EBUSY now.  Not a problem for this series, but maybe we 
> should
> remember to let the userspace handle -EBUSY properly as a follow-up too, since 
> I
> saw QEMU seemed to not handle -EBUSY for host reset path right now.

I think this has always been the case, whether it's the device lock or
this lock, the only way I know to avoid potential deadlock is to use
the 'try' locking semantics.  In normal scenarios I expect access to
sibling devices is quiesced at this point, so a user driver actually
wanting to achieve a reset shouldn't be affected by this.  Thanks,

Alex
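
The 'try' semantics referred to above, sketched across a set of affected devices (illustrative only; the helper name is hypothetical):

static int vfio_try_lock_memory(struct vfio_pci_device **devs, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		if (!down_write_trylock(&devs[i]->memory_lock)) {
			/* contention: unwind and let userspace retry */
			while (i--)
				up_write(&devs[i]->memory_lock);
			return -EBUSY;
		}
	}
	return 0;
}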



Re: [PATCH v3 0/3] vfio-pci: Block user access to disabled device MMIO

2020-05-22 Thread Alex Williamson
On Fri, 22 May 2020 18:08:58 -0400
Qian Cai  wrote:

> On Fri, May 22, 2020 at 01:17:09PM -0600, Alex Williamson wrote:
> > v3:
> > 
> > The memory_lock semaphore is only held in the MSI-X path for callouts
> > to functions that may access MSI-X MMIO space of the device, this
> > should resolve the circular locking dependency reported by Qian
> > (re-testing very much appreciated).  I've also incorporated the
> > pci_map_rom() and pci_unmap_rom() calls under the memory_lock.  Commit
> > 0cfd027be1d6 ("vfio_pci: Enable memory accesses before calling
> > pci_map_rom") made sure memory was enabled on the info path, but did
> > not provide locking to protect that state.  The r/w path of the BAR
> > access is expanded to include ROM mapping/unmapping.  Unless there
> > are objections, I'll plan to drop v2 from my next branch and replace
> > it with this.  Thanks,  
> 
> FYI, the lockdep warning is gone.
> 

Thank you for testing!

Alex



[PATCH v3 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory

2020-05-22 Thread Alex Williamson
Accessing the disabled memory space of a PCI device would typically
result in a master abort response on conventional PCI, or an
unsupported request on PCI express.  The user would generally see
these as a -1 response for the read return data and the write would be
silently discarded, possibly with an uncorrected, non-fatal AER error
triggered on the host.  Some systems however take it upon themselves
to bring down the entire system when they see something that might
indicate a loss of data, such as this discarded write to a disabled
memory space.

To avoid this, we want to try to block the user from accessing memory
spaces while they're disabled.  We start with a semaphore around the
memory enable bit, where writers modify the memory enable state and
must be serialized, while readers make use of the memory region and
can access in parallel.  Writers include both direct manipulation via
the command register, as well as any reset path where the internal
mechanics of the reset may both explicitly and implicitly disable
memory access, and manipulation of the MSI-X configuration, where the
MSI-X vector table resides in MMIO space of the device.  Readers
include the read and write file ops to access the vfio device fd
offsets as well as memory mapped access.  In the latter case, we make
use of our new vma list support to zap, or invalidate, those memory
mappings in order to force them to be faulted back in on access.

Our semaphore usage will stall user access to MMIO spaces across
internal operations like reset, but the user might experience new
behavior when trying to access the MMIO space while disabled via the
PCI command register.  Access via read or write while disabled will
return -EIO and access via memory maps will result in a SIGBUS.  This
is expected to be compatible with known use cases and potentially
provides better error handling capabilities than present in the
hardware, while avoiding the more readily accessible and severe
platform error responses that might otherwise occur.

Fixes: CVE-2020-12888
Signed-off-by: Alex Williamson 
---
 drivers/vfio/pci/vfio_pci.c |  291 +++
 drivers/vfio/pci/vfio_pci_config.c  |   36 
 drivers/vfio/pci/vfio_pci_intrs.c   |   14 ++
 drivers/vfio/pci/vfio_pci_private.h |8 +
 drivers/vfio/pci/vfio_pci_rdwr.c|   24 ++-
 5 files changed, 330 insertions(+), 43 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 66a545a01f8f..aabba6439a5b 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "vfio_pci_private.h"
 
@@ -184,6 +185,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device 
*vdev)
 
 static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
 static void vfio_pci_disable(struct vfio_pci_device *vdev);
+static int vfio_pci_try_zap_and_vma_lock_cb(struct pci_dev *pdev, void *data);
 
 /*
  * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
@@ -736,6 +738,12 @@ int vfio_pci_register_dev_region(struct vfio_pci_device 
*vdev,
return 0;
 }
 
+struct vfio_devices {
+   struct vfio_device **devices;
+   int cur_index;
+   int max_index;
+};
+
 static long vfio_pci_ioctl(void *device_data,
   unsigned int cmd, unsigned long arg)
 {
@@ -809,7 +817,7 @@ static long vfio_pci_ioctl(void *device_data,
{
void __iomem *io;
size_t size;
-   u16 orig_cmd;
+   u16 cmd;
 
info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
info.flags = 0;
@@ -829,10 +837,7 @@ static long vfio_pci_ioctl(void *device_data,
 * Is it really there?  Enable memory decode for
 * implicit access in pci_map_rom().
 */
-   pci_read_config_word(pdev, PCI_COMMAND, &orig_cmd);
-   pci_write_config_word(pdev, PCI_COMMAND,
- orig_cmd | PCI_COMMAND_MEMORY);
-
+   cmd = vfio_pci_memory_lock_and_enable(vdev);
io = pci_map_rom(pdev, );
if (io) {
info.flags = VFIO_REGION_INFO_FLAG_READ;
@@ -840,8 +845,8 @@ static long vfio_pci_ioctl(void *device_data,
} else {
info.size = 0;
}
+   vfio_pci_memory_unlock_and_restore(vdev, cmd);
 
-   pci_write_config_word(pdev, PCI_COMMAND, orig_cmd);
break;
}
case VFIO_PCI_VGA_REGION_INDEX:
@@ -984,8 +989,16 @@ static long vfio_pci_ioctl(void *device_data,
return ret;
 
} else if (cmd == VFIO_DE

[PATCH v3 2/3] vfio-pci: Fault mmaps to enable vma tracking

2020-05-22 Thread Alex Williamson
Rather than calling remap_pfn_range() when a region is mmap'd, setup
a vm_ops handler to support dynamic faulting of the range on access.
This allows us to manage a list of vmas actively mapping the area that
we can later use to invalidate those mappings.  The open callback
invalidates the vma range so that all tracking is inserted in the
fault handler and removed in the close handler.

Reviewed-by: Peter Xu 
Signed-off-by: Alex Williamson 
---
 drivers/vfio/pci/vfio_pci.c |   76 ++-
 drivers/vfio/pci/vfio_pci_private.h |7 +++
 2 files changed, 81 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 6c6b37b5c04e..66a545a01f8f 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1299,6 +1299,70 @@ static ssize_t vfio_pci_write(void *device_data, const 
char __user *buf,
return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
 }
 
+static int vfio_pci_add_vma(struct vfio_pci_device *vdev,
+   struct vm_area_struct *vma)
+{
+   struct vfio_pci_mmap_vma *mmap_vma;
+
+   mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
+   if (!mmap_vma)
+   return -ENOMEM;
+
+   mmap_vma->vma = vma;
+
+   mutex_lock(&vdev->vma_lock);
+   list_add(&mmap_vma->vma_next, &vdev->vma_list);
+   mutex_unlock(&vdev->vma_lock);
+
+   return 0;
+}
+
+/*
+ * Zap mmaps on open so that we can fault them in on access and therefore
+ * our vma_list only tracks mappings accessed since last zap.
+ */
+static void vfio_pci_mmap_open(struct vm_area_struct *vma)
+{
+   zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
+}
+
+static void vfio_pci_mmap_close(struct vm_area_struct *vma)
+{
+   struct vfio_pci_device *vdev = vma->vm_private_data;
+   struct vfio_pci_mmap_vma *mmap_vma;
+
+   mutex_lock(&vdev->vma_lock);
+   list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
+   if (mmap_vma->vma == vma) {
+   list_del(&mmap_vma->vma_next);
+   kfree(mmap_vma);
+   break;
+   }
+   }
+   mutex_unlock(&vdev->vma_lock);
+}
+
+static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
+{
+   struct vm_area_struct *vma = vmf->vma;
+   struct vfio_pci_device *vdev = vma->vm_private_data;
+
+   if (vfio_pci_add_vma(vdev, vma))
+   return VM_FAULT_OOM;
+
+   if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+   vma->vm_end - vma->vm_start, vma->vm_page_prot))
+   return VM_FAULT_SIGBUS;
+
+   return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct vfio_pci_mmap_ops = {
+   .open = vfio_pci_mmap_open,
+   .close = vfio_pci_mmap_close,
+   .fault = vfio_pci_mmap_fault,
+};
+
 static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 {
struct vfio_pci_device *vdev = device_data;
@@ -1357,8 +1421,14 @@ static int vfio_pci_mmap(void *device_data, struct 
vm_area_struct *vma)
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
 
-   return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-  req_len, vma->vm_page_prot);
+   /*
+* See remap_pfn_range(), called from vfio_pci_fault() but we can't
+* change vm_flags within the fault handler.  Set them now.
+*/
+   vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
+   vma->vm_ops = &vfio_pci_mmap_ops;
+
+   return 0;
 }
 
 static void vfio_pci_request(void *device_data, unsigned int count)
@@ -1608,6 +1678,8 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id)
	spin_lock_init(&vdev->irqlock);
	mutex_init(&vdev->ioeventfds_lock);
	INIT_LIST_HEAD(&vdev->ioeventfds_list);
+   mutex_init(&vdev->vma_lock);
+   INIT_LIST_HEAD(&vdev->vma_list);
 
	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
if (ret)
diff --git a/drivers/vfio/pci/vfio_pci_private.h 
b/drivers/vfio/pci/vfio_pci_private.h
index 36ec69081ecd..9b25f9f6ce1d 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -92,6 +92,11 @@ struct vfio_pci_vf_token {
int users;
 };
 
+struct vfio_pci_mmap_vma {
+   struct vm_area_struct   *vma;
+   struct list_headvma_next;
+};
+
 struct vfio_pci_device {
struct pci_dev  *pdev;
void __iomem*barmap[PCI_STD_NUM_BARS];
@@ -132,6 +137,8 @@ struct vfio_pci_device {
struct list_headioeventfds_list;
struct vfio_pci_vf_token*vf_token;
struct notifier_block   nb;
+   struct mutexvma_lock;
+   struct list_headvma_list;
 };
 
 #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)



[PATCH v3 0/3] vfio-pci: Block user access to disabled device MMIO

2020-05-22 Thread Alex Williamson
v3:

The memory_lock semaphore is only held in the MSI-X path for callouts
to functions that may access MSI-X MMIO space of the device, this
should resolve the circular locking dependency reported by Qian
(re-testing very much appreciated).  I've also incorporated the
pci_map_rom() and pci_unmap_rom() calls under the memory_lock.  Commit
0cfd027be1d6 ("vfio_pci: Enable memory accesses before calling
pci_map_rom") made sure memory was enabled on the info path, but did
not provide locking to protect that state.  The r/w path of the BAR
access is expanded to include ROM mapping/unmapping.  Unless there
are objections, I'll plan to drop v2 from my next branch and replace
it with this.  Thanks,

Alex

v2:

Locking in 3/ is substantially changed to avoid the retry scenario
within the fault handler, therefore a caller who does not allow retry
will no longer receive a SIGBUS on contention.  IOMMU invalidations
are still not included here, I expect that will be a future follow-on
change as we're not fundamentally changing that issue in this series.
The 'add to vma list only on fault' behavior is also still included
here, per the discussion I think it's still a valid approach and has
some advantages, particularly in a VM scenario where we potentially
defer the mapping until the MMIO BAR is actually DMA mapped into the
VM address space (or the guest driver actually accesses the device
if that DMA mapping is eliminated at some point).  Further discussion
and review appreciated.  Thanks,

Alex

v1:

Add tracking of the device memory enable bit and block/fault accesses
to device MMIO space while disabled.  This provides synchronous fault
handling for CPU accesses to the device and prevents the user from
triggering platform level error handling present on some systems.
Device reset and MSI-X vector table accesses are also included such
that access is blocked across reset and vector table accesses do not
depend on the user configuration of the device.

This is based on the vfio for-linus branch currently in next, making
use of follow_pfn() in vaddr_get_pfn() and therefore requiring patch
1/ to force the user fault in the case that a PFNMAP vma might be
DMA mapped before user access.  Further PFNMAP iommu invalidation
tracking is not yet included here.

As noted in the comments, I'm copying quite a bit of the logic from
rdma code for performing the zap_vma_ptes() calls and I'm also
attempting to resolve lock ordering issues in the fault handler to
lockdep's satisfaction.  I appreciate extra eyes on these sections in
particular.

I expect this to be functionally equivalent for any well behaved
userspace driver, but obviously there is a potential for the user to
get -EIO or SIGBUS on device access.  The device is provided to the
user enabled and device resets will restore the command register, so
by my evaluation a user would need to explicitly disable the memory
enable bit to trigger these faults.  We could potentially remap vmas
to a zero page rather than SIGBUS if we experience regressions, but
without known code requiring that, SIGBUS seems the appropriate
response to this condition.  Thanks,

Alex

---

Alex Williamson (3):
  vfio/type1: Support faulting PFNMAP vmas
  vfio-pci: Fault mmaps to enable vma tracking
  vfio-pci: Invalidate mmaps and block MMIO access on disabled memory


 drivers/vfio/pci/vfio_pci.c |  349 ---
 drivers/vfio/pci/vfio_pci_config.c  |   36 +++-
 drivers/vfio/pci/vfio_pci_intrs.c   |   14 +
 drivers/vfio/pci/vfio_pci_private.h |   15 ++
 drivers/vfio/pci/vfio_pci_rdwr.c|   24 ++
 drivers/vfio/vfio_iommu_type1.c |   36 +++-
 6 files changed, 435 insertions(+), 39 deletions(-)



[PATCH v3 1/3] vfio/type1: Support faulting PFNMAP vmas

2020-05-22 Thread Alex Williamson
With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
the range being faulted into the vma.  Add support to manually provide
that, in the same way as done on KVM with hva_to_pfn_remapped().

Reviewed-by: Peter Xu 
Signed-off-by: Alex Williamson 
---
 drivers/vfio/vfio_iommu_type1.c |   36 +---
 1 file changed, 33 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index cc1d64765ce7..4a4cb7cd86b2 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -317,6 +317,32 @@ static int put_pfn(unsigned long pfn, int prot)
return 0;
 }
 
+static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
+   unsigned long vaddr, unsigned long *pfn,
+   bool write_fault)
+{
+   int ret;
+
+   ret = follow_pfn(vma, vaddr, pfn);
+   if (ret) {
+   bool unlocked = false;
+
+   ret = fixup_user_fault(NULL, mm, vaddr,
+  FAULT_FLAG_REMOTE |
+  (write_fault ?  FAULT_FLAG_WRITE : 0),
+  &unlocked);
+   if (unlocked)
+   return -EAGAIN;
+
+   if (ret)
+   return ret;
+
+   ret = follow_pfn(vma, vaddr, pfn);
+   }
+
+   return ret;
+}
+
 static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 int prot, unsigned long *pfn)
 {
@@ -339,12 +365,16 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
 
vaddr = untagged_addr(vaddr);
 
+retry:
vma = find_vma_intersection(mm, vaddr, vaddr + 1);
 
if (vma && vma->vm_flags & VM_PFNMAP) {
-   if (!follow_pfn(vma, vaddr, pfn) &&
-   is_invalid_reserved_pfn(*pfn))
-   ret = 0;
+   ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
+   if (ret == -EAGAIN)
+   goto retry;
+
+   if (!ret && !is_invalid_reserved_pfn(*pfn))
+   ret = -EFAULT;
}
 done:
up_read(&mm->mmap_sem);



Re: [PATCH v2 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory

2020-05-21 Thread Alex Williamson
On Thu, 21 May 2020 22:39:06 -0400
Qian Cai  wrote:

> On Tue, May 05, 2020 at 03:55:02PM -0600, Alex Williamson wrote:
> []
> vfio_pci_mmap_fault(struct vm_fault *vmf)
> >  {
> > struct vm_area_struct *vma = vmf->vma;
> > struct vfio_pci_device *vdev = vma->vm_private_data;
> > +   vm_fault_t ret = VM_FAULT_NOPAGE;
> >  
> > -   if (vfio_pci_add_vma(vdev, vma))
> > -   return VM_FAULT_OOM;
> > +   mutex_lock(&vdev->vma_lock);
> > +   down_read(&vdev->memory_lock);  
> 
> This lock here will trigger,
> 
> [17368.321363][T3614103] 
> ==
> [17368.321375][T3614103] WARNING: possible circular locking dependency 
> detected
> [17368.321399][T3614103] 5.7.0-rc6-next-20200521+ #116 Tainted: GW
> 
> [17368.321410][T3614103] 
> --
> [17368.321433][T3614103] qemu-kvm/3614103 is trying to acquire lock:
> [17368.321443][T3614103] c000200fb2328968 (&kvm->lock){+.+.}-{3:3}, at: 
> kvmppc_irq_bypass_add_producer_hv+0xd4/0x3b0 [kvm_hv]
> [17368.321488][T3614103] 
> [17368.321488][T3614103] but task is already holding lock:
> [17368.321533][T3614103] c16f4dc8 (lock#7){+.+.}-{3:3}, at: 
> irq_bypass_register_producer+0x80/0x1d0
> [17368.321564][T3614103] 
> [17368.321564][T3614103] which lock already depends on the new lock.
> [17368.321564][T3614103] 
> [17368.321590][T3614103] 
> [17368.321590][T3614103] the existing dependency chain (in reverse order) is:
> [17368.321625][T3614103] 
> [17368.321625][T3614103] -> #4 (lock#7){+.+.}-{3:3}:
> [17368.321662][T3614103]__mutex_lock+0xdc/0xb80
> [17368.321683][T3614103]irq_bypass_register_producer+0x80/0x1d0
> [17368.321706][T3614103]vfio_msi_set_vector_signal+0x1d8/0x350 
> [vfio_pci]
> [17368.321719][T3614103]vfio_msi_set_block+0xb0/0x1e0 [vfio_pci]
> [17368.321752][T3614103]vfio_pci_set_msi_trigger+0x13c/0x3e0 
> [vfio_pci]
> [17368.321787][T3614103]vfio_pci_set_irqs_ioctl+0x134/0x2c0 [vfio_pci]
> [17368.321821][T3614103]vfio_pci_ioctl+0xe10/0x1460 [vfio_pci]
> [17368.321855][T3614103]vfio_device_fops_unl_ioctl+0x44/0x70 [vfio]
> [17368.321879][T3614103]ksys_ioctl+0xd8/0x130
> [17368.321888][T3614103]sys_ioctl+0x28/0x40
> [17368.321910][T3614103]system_call_exception+0x108/0x1d0
> [17368.321932][T3614103]system_call_common+0xf0/0x278
> [17368.321951][T3614103] 
> [17368.321951][T3614103] -> #3 (&vdev->memory_lock){++++}-{3:3}:
> [17368.321988][T3614103]lock_release+0x190/0x5e0
> [17368.322009][T3614103]__mutex_unlock_slowpath+0x68/0x410
> [17368.322042][T3614103]vfio_pci_mmap_fault+0xe8/0x1f0 [vfio_pci]
> vfio_pci_mmap_fault at drivers/vfio/pci/vfio_pci.c:1534
> [17368.322066][T3614103]__do_fault+0x64/0x220
> [17368.322086][T3614103]handle_mm_fault+0x12f0/0x19e0
> [17368.322107][T3614103]__do_page_fault+0x284/0xf70
> [17368.322116][T3614103]handle_page_fault+0x10/0x2c
> [17368.322136][T3614103] 
> [17368.322136][T3614103] -> #2 (&mm->mmap_sem){++++}-{3:3}:
> [17368.322160][T3614103]__might_fault+0x84/0xe0
> [17368.322182][T3614103]_copy_to_user+0x3c/0x120
> [17368.322206][T3614103]kvm_vcpu_ioctl+0x1ec/0xac0 [kvm]
> [17368.322239][T3614103]ksys_ioctl+0xd8/0x130
> [17368.322270][T3614103]sys_ioctl+0x28/0x40
> [17368.322301][T3614103]system_call_exception+0x108/0x1d0
> [17368.322334][T3614103]system_call_common+0xf0/0x278
> [17368.322375][T3614103] 
> [17368.322375][T3614103] -> #1 (>mutex){+.+.}-{3:3}:
> [17368.322411][T3614103]__mutex_lock+0xdc/0xb80
> [17368.322446][T3614103]kvmppc_xive_release+0xd8/0x260 [kvm]
> [17368.322484][T3614103]kvm_device_release+0xc4/0x110 [kvm]
> [17368.322518][T3614103]__fput+0x154/0x3b0
> [17368.322562][T3614103]task_work_run+0xd8/0x170
> [17368.322583][T3614103]do_exit+0x4f8/0xeb0
> [17368.322604][T3614103]do_group_exit+0x78/0x160
> [17368.322625][T3614103]get_signal+0x230/0x1440
> [17368.322657][T3614103]do_notify_resume+0x130/0x3e0
> [17368.322677][T3614103]syscall_exit_prepare+0x1a4/0x280
> [17368.322687][T3614103]system_call_common+0xf8/0x278
> [17368.322718][T3614103] 
> [17368.322718][T3614103] -> #0 (&kvm->lock){+.+.}-{3:3}:
> [17368.322753][T3614103]__lock_acquire+0x1fe4/0x3190
> [17368.322774][T3614103]lock_acquire+0x140/0x9a0
> [17368.322805][T3614103]__mutex_lock+0xdc/0xb80
> [17368.322838][T3614103]kvmppc_irq_bypass_add_producer_hv+0xd4/0x3b0 
>

Re: (a design open) RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host

2020-05-15 Thread Alex Williamson
tion (a.k.a stage 1)
> > structure to host iommu. And for VFIO_TYPE1_NESTING_IOMMU, not only
> > bind
> > guest page table is needed, it also requires to expose interface to guest
> > for iommu cache invalidation when guest modified the first-level/stage-1
> > translation structures since hardware needs to be notified to flush stale
> > iotlbs. This would be introduced in next patch.
> > 
> > In this patch, guest page table bind and unbind are done by using flags
> > VFIO_IOMMU_BIND_GUEST_PGTBL and
> > VFIO_IOMMU_UNBIND_GUEST_PGTBL under IOCTL
> > VFIO_IOMMU_BIND, the bind/unbind data are conveyed by
> > struct iommu_gpasid_bind_data. Before binding guest page table to host,
> > VM should have got a PASID allocated by host via
> > VFIO_IOMMU_PASID_REQUEST.
> > 
> > Bind guest translation structures (here is guest page table) to host
> > are the first step to setup vSVA (Virtual Shared Virtual Addressing).
> > 
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Signed-off-by: Jean-Philippe Brucker 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Jacob Pan 
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 158
> > 
> >  include/uapi/linux/vfio.h   |  46 
> >  2 files changed, 204 insertions(+)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c
> > index 82a9e0b..a877747 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -130,6 +130,33 @@ struct vfio_regions {
> >  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)\
> > (!list_empty(&iommu->domain_list))
> > 
> > +struct domain_capsule {
> > +   struct iommu_domain *domain;
> > +   void *data;
> > +};
> > +
> > +/* iommu->lock must be held */
> > +static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
> > + int (*fn)(struct device *dev, void *data),
> > + void *data)
> > +{
> > +   struct domain_capsule dc = {.data = data};
> > +   struct vfio_domain *d;
> > +   struct vfio_group *g;
> > +   int ret = 0;
> > +
> > +   list_for_each_entry(d, &iommu->domain_list, next) {
> > +   dc.domain = d->domain;
> > +   list_for_each_entry(g, &d->group_list, next) {
> > +   ret = iommu_group_for_each_dev(g->iommu_group,
> > +  &dc, fn);
> > +   if (ret)
> > +   break;
> > +   }
> > +   }
> > +   return ret;
> > +}
> > +
> >  static int put_pfn(unsigned long pfn, int prot);
> > 
> >  /*
> > @@ -2314,6 +2341,88 @@ static int
> > vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> > return 0;
> >  }
> > 
> > +static int vfio_bind_gpasid_fn(struct device *dev, void *data)
> > +{
> > +   struct domain_capsule *dc = (struct domain_capsule *)data;
> > +   struct iommu_gpasid_bind_data *gbind_data =
> > +   (struct iommu_gpasid_bind_data *) dc->data;
> > +
> > +   return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
> > +}
> > +
> > +static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> > +{
> > +   struct domain_capsule *dc = (struct domain_capsule *)data;
> > +   struct iommu_gpasid_bind_data *gbind_data =
> > +   (struct iommu_gpasid_bind_data *) dc->data;
> > +
> > +   return iommu_sva_unbind_gpasid(dc->domain, dev,
> > +   gbind_data->hpasid);
> > +}
> > +
> > +/**
> > + * Unbind specific gpasid, caller of this function requires hold
> > + * vfio_iommu->lock
> > + */
> > +static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu
> > *iommu,
> > +   struct iommu_gpasid_bind_data *gbind_data)
> > +{
> > +   return vfio_iommu_for_each_dev(iommu,
> > +   vfio_unbind_gpasid_fn, gbind_data);
> > +}
> > +
> > +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> > +   struct iommu_gpasid_bind_data *gbind_data)
> > +{
> > +   int ret = 0;
> > +
> > +   mutex_lock(&iommu->lock);
> > +   if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > +   ret = -EINVAL;
> > +   goto out_unlock;
> > + 

Re: [PATCH 0/2] vfio/type1/pci: IOMMU PFNMAP invalidation

2020-05-15 Thread Alex Williamson
On Fri, 15 May 2020 11:22:51 -0400
Peter Xu  wrote:

> On Thu, May 14, 2020 at 04:55:17PM -0600, Alex Williamson wrote:
> > > I'm not sure if this makes sense, can't we arrange to directly trap the
> > > IOMMU failure and route it into qemu if that is what is desired?  
> > 
> > Can't guarantee it, some systems wire that directly into their
> > management processor so that they can "protect their users" regardless
> > of whether they want or need it.  Yay firmware first error handling,
> > *sigh*.  Thanks,  
> 
> Sorry to be slightly out of topic - Alex, does this mean the general approach
> of fault reporting from vfio to the userspace is not gonna work too?

AFAIK these platforms only generate a fatal fault on certain classes of
access which imply a potential for data loss, for example a DMA write to
an invalid PTE.  The actual IOMMU page faulting mechanism should
not be affected by this, or at least one would hope.  Thanks,

Alex



Re: [PATCH 0/2] vfio/type1/pci: IOMMU PFNMAP invalidation

2020-05-14 Thread Alex Williamson
On Thu, 14 May 2020 19:24:15 -0300
Jason Gunthorpe  wrote:

> On Thu, May 14, 2020 at 04:17:12PM -0600, Alex Williamson wrote:
> 
> > that much.  I think this would also address Jason's primary concern.
> > It's better to get an IOMMU fault from the user trying to access those
> > mappings than it is to leave them in place.  
> 
> Yes, there are few options here - if the pages are available for use
> by the IOMMU and *asynchronously* someone else revokes them, then the
> only way to protect the kernel is to block them from the IOMMU.
> 
> For this to be sane the revocation must be under complete control of
> the VFIO user. ie if a user decides to disable MMIO traffic then of
> course the IOMMU should block P2P transfer to the MMIO bar. It is user
> error to have not disabled those transfers in the first place.
> 
> When this is all done inside a guest the whole logic applies. On bare
> metal you might get some AER or crash or MCE. In virtualization you'll
> get an IOMMU fault.
> 
> > due to the memory enable bit.  If we could remap the range to a kernel
> > page we could maybe avoid the IOMMU fault and maybe even have a crude
> > test for whether any data was written to the page while that mapping
> > was in place (ie. simulating more restricted error handling, though
> > more asynchronous than done at the platform level).
> 
> I'm not sure if this makes sense, can't we arrange to directly trap the
> IOMMU failure and route it into qemu if that is what is desired?

Can't guarantee it, some systems wire that directly into their
management processor so that they can "protect their users" regardless
of whether they want or need it.  Yay firmware first error handling,
*sigh*.  Thanks,

Alex



Re: [PATCH 0/2] vfio/type1/pci: IOMMU PFNMAP invalidation

2020-05-14 Thread Alex Williamson
On Thu, 14 May 2020 17:25:38 -0400
Peter Xu  wrote:

> On Thu, May 14, 2020 at 10:51:46AM -0600, Alex Williamson wrote:
> > This is a follow-on series to "vfio-pci: Block user access to disabled
> > device MMIO"[1], which extends user access blocking of disabled MMIO
> > ranges to include unmapping the ranges from the IOMMU.  The first patch
> > adds an invalidation callback path, allowing vfio bus drivers to signal
> > the IOMMU backend to unmap ranges with vma level granularity.  This
> > signaling is done both when the MMIO range becomes inaccessible due to
> > memory disabling, as well as when a vma is closed, making up for the
> > lack of tracking or pinning for non-page backed vmas.  The second
> > patch adds registration and testing interfaces such that the IOMMU
> > backend driver can test whether a given PFNMAP vma is provided by a
> > vfio bus driver supporting invalidation.  We can then implement more
> > restricted semantics to only allow PFNMAP DMA mappings when we have
> > such support, which becomes the new default.  
> 
> Hi, Alex,
> 
> IIUC we'll directly tearing down the IOMMU page table without telling the
> userspace for those PFNMAP pages.  I'm thinking whether there be any side
> effect on the userspace side when userspace cached these mapping information
> somehow.  E.g., is there a way for userspace to know this event?
> 
> Currently, QEMU VT-d will maintain all the IOVA mappings for each assigned
> device when used with vfio-pci.  In this case, QEMU will probably need to
> depend some invalidations sent from the guest (either userspace or kernel)
> device drivers to invalidate such IOVA mappings after they got removed from 
> the
> hardware IOMMU page table underneath.  I haven't thought deeper on what would
> happen if the vIOMMU has got an inconsistent mapping of the real device.

Full disclosure, I haven't tested vIOMMU, there might be issues.  Let's
puzzle through this.  Without a vIOMMU the vfio MemoryListener in QEMU
makes use of address_space_memory, which is essentially the vCPU view
of memory.  When the memory bit of a PCI device is disabled, QEMU
correctly removes the MMIO regions of the device from this AddressSpace.
When re-enabled, they get re-added.  In that case what we're doing here
is a little bit redundant, the IOMMU mappings get dropped in response
to the memory bit and the subsequent callback from the MemoryListener
is effectively a no-op since the range is already unmapped.  When the
memory bit is re-enabled, the AddressSpace gets updated, the
MemoryListener fires and we re-fault the mmap as we're re-adding the
IOMMU mapping.

When we have a vIOMMU, the address_space_memory behavior should be the
same as above; the vIOMMU isn't involved in vCPU to device access.  So
I think our concern is explicit mappings, much like vfio itself makes.
That feels like a gap.  When the vma is being closed, I think dropping
those mappings out from under the user is probably still the right
approach and I think this series would still be useful if we only did
that much.  I think this would also address Jason's primary concern.
It's better to get an IOMMU fault from the user trying to access those
mappings than it is to leave them in place.

OTOH, if we're dropping mappings in response to disabling the memory
bit, we're changing a potential disabled MMIO access fault into an
IOMMU fault, where the platform response might very well be fatal in
either case.  Maybe we need to look at a more temporary invalidation
due to the memory enable bit.  If we could remap the range to a kernel
page we could maybe avoid the IOMMU fault and maybe even have a crude
test for whether any data was written to the page while that mapping
was in place (ie. simulating more restricted error handling, though
more asynchronous than done at the platform level).  Let me look into
it.  Thanks,

Alex



[PATCH 2/2] vfio: Introduce strict PFNMAP mappings

2020-05-14 Thread Alex Williamson
We can't pin PFNMAP IOMMU mappings like we can standard page-backed
mappings, therefore without an invalidation mechanism we can't know
if we should have revoked a user's mapping.  Now that we have an
invalidation callback mechanism we can create an interface for vfio
bus drivers to indicate their support for invalidation by registering
supported vm_ops functions with vfio-core.  A vfio IOMMU backend
driver can then test a vma against the registered vm_ops with this
support to determine whether to allow such a mapping.  The type1
backend then adopts a new 'strict_mmio_maps' module option, enabled
by default, restricting IOMMU mapping of PFNMAP vmas to only those
supporting invalidation callbacks.  vfio-pci is updated to register
vfio_pci_mmap_ops as supporting this feature.

Signed-off-by: Alex Williamson 
---
 drivers/vfio/pci/vfio_pci.c |7 
 drivers/vfio/vfio.c |   62 +++
 drivers/vfio/vfio_iommu_type1.c |9 +-
 include/linux/vfio.h|4 +++
 4 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 100fe5f6bc22..dbfe6a11aa74 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -2281,6 +2281,7 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device 
*vdev)
 
 static void __exit vfio_pci_cleanup(void)
 {
+   vfio_unregister_vma_inv_ops(&vfio_pci_mmap_ops);
pci_unregister_driver(&vfio_pci_driver);
vfio_pci_uninit_perm_bits();
 }
@@ -2340,10 +2341,16 @@ static int __init vfio_pci_init(void)
if (ret)
goto out_driver;
 
+   ret = vfio_register_vma_inv_ops(&vfio_pci_mmap_ops);
+   if (ret)
+   goto out_inv_ops;
+
vfio_pci_fill_ids();
 
return 0;
 
+out_inv_ops:
+   pci_unregister_driver(&vfio_pci_driver);
 out_driver:
vfio_pci_uninit_perm_bits();
return ret;
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 0fff057b7cd9..0f0a9d3b38aa 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -47,6 +47,8 @@ static struct vfio {
struct cdev group_cdev;
dev_t   group_devt;
wait_queue_head_t   release_q;
+   struct list_head        vma_inv_ops_list;
+   struct mutex            vma_inv_ops_lock;
 } vfio;
 
 struct vfio_iommu_driver {
@@ -98,6 +100,11 @@ struct vfio_device {
void*device_data;
 };
 
+struct vfio_vma_inv_ops {
+   const struct vm_operations_struct   *ops;
+   struct list_head    ops_next;
+};
+
 #ifdef CONFIG_VFIO_NOIOMMU
 static bool noiommu __read_mostly;
 module_param_named(enable_unsafe_noiommu_mode,
@@ -2332,6 +2339,58 @@ int vfio_unregister_notifier(struct device *dev, enum 
vfio_notify_type type,
 }
 EXPORT_SYMBOL(vfio_unregister_notifier);
 
+int vfio_register_vma_inv_ops(const struct vm_operations_struct *ops)
+{
+   struct vfio_vma_inv_ops *inv_ops;
+
+   inv_ops = kmalloc(sizeof(*inv_ops), GFP_KERNEL);
+   if (!inv_ops)
+   return -ENOMEM;
+
+   inv_ops->ops = ops;
+
+   mutex_lock(&vfio.vma_inv_ops_lock);
+   list_add(&inv_ops->ops_next, &vfio.vma_inv_ops_list);
+   mutex_unlock(&vfio.vma_inv_ops_lock);
+
+   return 0;
+}
+EXPORT_SYMBOL(vfio_register_vma_inv_ops);
+
+void vfio_unregister_vma_inv_ops(const struct vm_operations_struct *ops)
+{
+   struct vfio_vma_inv_ops *inv_ops;
+
+   mutex_lock(&vfio.vma_inv_ops_lock);
+   list_for_each_entry(inv_ops, &vfio.vma_inv_ops_list, ops_next) {
+   if (inv_ops->ops == ops) {
+   list_del(&inv_ops->ops_next);
+   kfree(inv_ops);
+   break;
+   }
+   }
+   mutex_unlock(&vfio.vma_inv_ops_lock);
+}
+EXPORT_SYMBOL(vfio_unregister_vma_inv_ops);
+
+bool vfio_vma_has_inv_ops(struct vm_area_struct *vma)
+{
+   struct vfio_vma_inv_ops *inv_ops;
+   bool ret = false;
+
+   mutex_lock(&vfio.vma_inv_ops_lock);
+   list_for_each_entry(inv_ops, &vfio.vma_inv_ops_list, ops_next) {
+   if (inv_ops->ops == vma->vm_ops) {
+   ret = true;
+   break;
+   }
+   }
+   mutex_unlock(&vfio.vma_inv_ops_lock);
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_vma_has_inv_ops);
+
 /**
  * Module/class support
  */
@@ -2355,8 +2414,10 @@ static int __init vfio_init(void)
idr_init(&vfio.group_idr);
mutex_init(&vfio.group_lock);
mutex_init(&vfio.iommu_drivers_lock);
+   mutex_init(&vfio.vma_inv_ops_lock);
INIT_LIST_HEAD(&vfio.group_list);
INIT_LIST_HEAD(&vfio.iommu_drivers_list);
+   INIT_LIST_HEAD(&vfio.vma_inv_ops_list);
init_waitqueue_head(&vfio.release_q);
 
ret = misc_register(&vfio_dev);
@@ -2403,6 +2464,7 @@ static int __init vfio_init(void)
 static void __exit vfio_cleanup(void)
 {
WARN_ON(!list_empty(&vfio.group_list));
+   WARN_ON(!list_empty(&vfio.vma_inv_ops_list));
 
 #ifdef C

[PATCH 1/2] vfio: Introduce bus driver to IOMMU invalidation interface

2020-05-14 Thread Alex Williamson
VFIO bus drivers, like vfio-pci, can allow mmaps of non-page backed
device memory, such as MMIO regions of the device.  The user may then
map these ranges through the IOMMU, for example to enable peer-to-peer
DMA between devices.  When these ranges are zapped or removed from the
user, such as when the MMIO region is disabled or the device is
released, we should also remove the IOMMU mapping.  This provides
kernel level enforcement of the behavior we already see from userspace
drivers like QEMU, where these ranges are unmapped when they become
inaccessible.  This userspace behavior is still recommended as this
support only provides invalidation, dropping unmapped vmas.  Those
vmas are not automatically re-installed when re-mapped.

Signed-off-by: Alex Williamson 
---
 drivers/vfio/pci/vfio_pci.c |   34 +-
 drivers/vfio/pci/vfio_pci_private.h |1 
 drivers/vfio/vfio.c |   14 
 drivers/vfio/vfio_iommu_type1.c |  123 ++-
 include/linux/vfio.h|5 +
 5 files changed, 142 insertions(+), 35 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 49ae9faa6099..100fe5f6bc22 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -521,6 +521,8 @@ static void vfio_pci_release(void *device_data)
vfio_pci_vf_token_user_add(vdev, -1);
vfio_spapr_pci_eeh_release(vdev->pdev);
vfio_pci_disable(vdev);
+   vfio_group_put_external_user(vdev->group);
+   vdev->group = NULL;
}
 
mutex_unlock(&vdev->reflck->lock);
@@ -539,6 +541,15 @@ static int vfio_pci_open(void *device_data)
mutex_lock(&vdev->reflck->lock);
 
if (!vdev->refcnt) {
+   struct pci_dev *pdev = vdev->pdev;
+
+   vdev->group = vfio_group_get_external_user_from_dev(&pdev->dev);
+   if (IS_ERR_OR_NULL(vdev->group)) {
+   ret = PTR_ERR(vdev->group);
+   vdev->group = NULL;
+   goto error;
+   }
+
ret = vfio_pci_enable(vdev);
if (ret)
goto error;
@@ -549,8 +560,13 @@ static int vfio_pci_open(void *device_data)
vdev->refcnt++;
 error:
mutex_unlock(&vdev->reflck->lock);
-   if (ret)
+   if (ret) {
module_put(THIS_MODULE);
+   if (vdev->group) {
+   vfio_group_put_external_user(vdev->group);
+   vdev->group = NULL;
+   }
+   }
return ret;
 }
 
@@ -1370,7 +1386,7 @@ static ssize_t vfio_pci_write(void *device_data, const 
char __user *buf,
 /* Return 1 on zap and vma_lock acquired, 0 on contention (only with @try) */
 static int vfio_pci_zap_and_vma_lock(struct vfio_pci_device *vdev, bool try)
 {
-   struct vfio_pci_mmap_vma *mmap_vma, *tmp;
+   struct vfio_pci_mmap_vma *mmap_vma;
 
/*
 * Lock ordering:
@@ -1420,6 +1436,7 @@ static int vfio_pci_zap_and_vma_lock(struct 
vfio_pci_device *vdev, bool try)
return 1;
mutex_unlock(&vdev->vma_lock);
 
+again:
if (try) {
if (!down_read_trylock(&mm->mmap_sem)) {
mmput(mm);
@@ -1438,8 +1455,8 @@ static int vfio_pci_zap_and_vma_lock(struct 
vfio_pci_device *vdev, bool try)
} else {
mutex_lock(&vdev->vma_lock);
}
-   list_for_each_entry_safe(mmap_vma, tmp,
- &vdev->vma_list, vma_next) {
+   list_for_each_entry(mmap_vma,
+   &vdev->vma_list, vma_next) {
struct vm_area_struct *vma = mmap_vma->vma;
 
if (vma->vm_mm != mm)
@@ -1450,6 +1467,10 @@ static int vfio_pci_zap_and_vma_lock(struct 
vfio_pci_device *vdev, bool try)
 
zap_vma_ptes(vma, vma->vm_start,
 vma->vm_end - vma->vm_start);
+   mutex_unlock(&vdev->vma_lock);
+   up_read(&mm->mmap_sem);
+   vfio_invalidate_pfnmap_vma(vdev->group, vma);
+   goto again;
}
mutex_unlock(&vdev->vma_lock);
}
@@ -1494,16 +1515,21 @@ static void vfio_pci_mmap_close(struct vm_area_struct 
*vma)
 {
struct vfio_pci_device *vdev = vma->vm_private_data;
struct vfio_pci_mmap_vma *mmap_vma;
+   bool found = false;
 
mutex_lock(&vdev->vma_lock);
list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
if (mmap_vma->vma == vma) {
 

[PATCH 0/2] vfio/type1/pci: IOMMU PFNMAP invalidation

2020-05-14 Thread Alex Williamson
This is a follow-on series to "vfio-pci: Block user access to disabled
device MMIO"[1], which extends user access blocking of disabled MMIO
ranges to include unmapping the ranges from the IOMMU.  The first patch
adds an invalidation callback path, allowing vfio bus drivers to signal
the IOMMU backend to unmap ranges with vma level granularity.  This
signaling is done both when the MMIO range becomes inaccessible due to
memory disabling, as well as when a vma is closed, making up for the
lack of tracking or pinning for non-page backed vmas.  The second
patch adds registration and testing interfaces such that the IOMMU
backend driver can test whether a given PFNMAP vma is provided by a
vfio bus driver supporting invalidation.  We can then implement more
restricted semantics to only allow PFNMAP DMA mappings when we have
such support, which becomes the new default.

Jason, if you'd like Suggested-by credit for the ideas here I'd be
glad to add it.  Thanks,

Alex

[1]https://lore.kernel.org/kvm/158871401328.15589.17598154478222071285.st...@gimli.home/

---

Alex Williamson (2):
  vfio: Introduce bus driver to IOMMU invalidation interface
  vfio: Introduce strict PFNMAP mappings


 drivers/vfio/pci/vfio_pci.c |   41 ++-
 drivers/vfio/pci/vfio_pci_private.h |1 
 drivers/vfio/vfio.c |   76 
 drivers/vfio/vfio_iommu_type1.c |  130 +++
 include/linux/vfio.h|9 ++
 5 files changed, 222 insertions(+), 35 deletions(-)



Re: [PATCH 08/12] vfio: use __anon_inode_getfd

2020-05-08 Thread Alex Williamson
On Fri,  8 May 2020 17:36:30 +0200
Christoph Hellwig  wrote:

> Use __anon_inode_getfd instead of opencoding the logic using
> get_unused_fd_flags + anon_inode_getfile.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/vfio/vfio.c | 37 -
>  1 file changed, 8 insertions(+), 29 deletions(-)


Thanks!

Acked-by: Alex Williamson 

> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 765e0e5d83ed9..33a88103f857f 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1451,42 +1451,21 @@ static int vfio_group_get_device_fd(struct vfio_group 
> *group, char *buf)
>   return ret;
>   }
>  
> - /*
> -  * We can't use anon_inode_getfd() because we need to modify
> -  * the f_mode flags directly to allow more than just ioctls
> -  */
> - ret = get_unused_fd_flags(O_CLOEXEC);
> - if (ret < 0) {
> - device->ops->release(device->device_data);
> - vfio_device_put(device);
> - return ret;
> - }
> -
> - filep = anon_inode_getfile("[vfio-device]", &vfio_device_fops,
> -device, O_RDWR);
> - if (IS_ERR(filep)) {
> - put_unused_fd(ret);
> - ret = PTR_ERR(filep);
> - device->ops->release(device->device_data);
> - vfio_device_put(device);
> - return ret;
> - }
> -
> - /*
> -  * TODO: add an anon_inode interface to do this.
> -  * Appears to be missing by lack of need rather than
> -  * explicitly prevented.  Now there's need.
> -  */
> + ret = __anon_inode_getfd("[vfio-device]", &vfio_device_fops,
> +device, O_CLOEXEC | O_RDWR, &filep);
> + if (ret < 0)
> + goto release;
>   filep->f_mode |= (FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE);
> -
>   atomic_inc(&group->container_users);
> -
>   fd_install(ret, filep);
>  
>   if (group->noiommu)
>   dev_warn(device->dev, "vfio-noiommu device opened by user "
>"(%s:%d)\n", current->comm, task_pid_nr(current));
> -
> + return ret;
> +release:
> + device->ops->release(device->device_data);
> + vfio_device_put(device);
>   return ret;
>  }
>  



Re: [PATCH v2 1/3] vfio/type1: Support faulting PFNMAP vmas

2020-05-08 Thread Alex Williamson
On Fri, 8 May 2020 12:05:40 -0300
Jason Gunthorpe  wrote:

> On Fri, May 08, 2020 at 10:30:42AM -0400, Peter Xu wrote:
> > On Fri, May 08, 2020 at 09:10:13AM -0300, Jason Gunthorpe wrote:  
> > > On Thu, May 07, 2020 at 10:19:39PM -0400, Peter Xu wrote:  
> > > > On Thu, May 07, 2020 at 08:54:21PM -0300, Jason Gunthorpe wrote:  
> > > > > On Thu, May 07, 2020 at 05:24:43PM -0400, Peter Xu wrote:  
> > > > > > On Tue, May 05, 2020 at 03:54:44PM -0600, Alex Williamson wrote:  
> > > > > > > With conversion to follow_pfn(), DMA mapping a PFNMAP range 
> > > > > > > depends on
> > > > > > > the range being faulted into the vma.  Add support to manually 
> > > > > > > provide
> > > > > > > that, in the same way as done on KVM with hva_to_pfn_remapped().
> > > > > > > 
> > > > > > > Signed-off-by: Alex Williamson 
> > > > > > >  drivers/vfio/vfio_iommu_type1.c |   36 
> > > > > > > +---
> > > > > > >  1 file changed, 33 insertions(+), 3 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c 
> > > > > > > b/drivers/vfio/vfio_iommu_type1.c
> > > > > > > index cc1d64765ce7..4a4cb7cd86b2 100644
> > > > > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > > > > @@ -317,6 +317,32 @@ static int put_pfn(unsigned long pfn, int 
> > > > > > > prot)
> > > > > > >   return 0;
> > > > > > >  }
> > > > > > >  
> > > > > > > +static int follow_fault_pfn(struct vm_area_struct *vma, struct 
> > > > > > > mm_struct *mm,
> > > > > > > + unsigned long vaddr, unsigned long *pfn,
> > > > > > > + bool write_fault)
> > > > > > > +{
> > > > > > > + int ret;
> > > > > > > +
> > > > > > > + ret = follow_pfn(vma, vaddr, pfn);
> > > > > > > + if (ret) {
> > > > > > > + bool unlocked = false;
> > > > > > > +
> > > > > > > + ret = fixup_user_fault(NULL, mm, vaddr,
> > > > > > > +FAULT_FLAG_REMOTE |
> > > > > > > +(write_fault ?  FAULT_FLAG_WRITE 
> > > > > > > : 0),
> > > > > > > +&unlocked);
> > > > > > > + if (unlocked)
> > > > > > > + return -EAGAIN;  
> > > > > > 
> > > > > > Hi, Alex,
> > > > > > 
> > > > > > IIUC this retry is not needed too because fixup_user_fault() will 
> > > > > > guarantee the
> > > > > > fault-in is done correctly with the valid PTE as long as ret==0, 
> > > > > > even if
> > > > > > unlocked==true.  
> > > > > 
> > > > > It is true, and today it is fine, but be careful when reworking this
> > > > > to use notifiers as unlocked also means things like the vma pointer
> > > > > are invalidated.  
> > > > 
> > > > Oh right, thanks for noticing that.  Then we should probably still keep 
> > > > the
> > > > retry logic... because otherwise the latter follow_pfn() could be 
> > > > referencing
> > > > an invalid vma already...  
> > > 
> > > I looked briefly and thought this flow used the vma only once?  
> > 
> > ret = follow_pfn(vma, vaddr, pfn);
> > if (ret) {
> > bool unlocked = false;
> >  
> > ret = fixup_user_fault(NULL, mm, vaddr,
> >FAULT_FLAG_REMOTE |
> >(write_fault ?  FAULT_FLAG_WRITE : 
> > 0),
> >&unlocked);
> > if (unlocked)
> > return -EAGAIN;
> >  
> > if (ret)
> > return ret;
> >  
> > ret = follow_pfn(vma, vaddr, pfn);  <--- [1]
> > }
> > 
> > So imo the 2nd follow_pfn() [1] could be racy if without the unlocked 
> > check.  
> 
> Ah yes, I didn't notice that, you can't touch vma here if unlocked is true.

Thanks for the discussion.  I gather then that this patch is correct as
written, which probably also means the patch Peter linked for KVM should
not be applied since the logic is the same there.  Correct?  Thanks,

Alex



Re: [PATCH v2 0/3] vfio-pci: Block user access to disabled device MMIO

2020-05-07 Thread Alex Williamson
On Thu, 7 May 2020 17:59:08 -0400
Peter Xu  wrote:

> On Tue, May 05, 2020 at 03:54:36PM -0600, Alex Williamson wrote:
> > v2:
> > 
> > Locking in 3/ is substantially changed to avoid the retry scenario
> > within the fault handler, therefore a caller who does not allow retry
> > will no longer receive a SIGBUS on contention.  IOMMU invalidations
> > are still not included here, I expect that will be a future follow-on
> > change as we're not fundamentally changing that issue in this series.
> > The 'add to vma list only on fault' behavior is also still included
> > here, per the discussion I think it's still a valid approach and has
> > some advantages, particularly in a VM scenario where we potentially
> > defer the mapping until the MMIO BAR is actually DMA mapped into the
> > VM address space (or the guest driver actually accesses the device
> > if that DMA mapping is eliminated at some point).  Further discussion
> > and review appreciated.  Thanks,  
> 
> Hi, Alex,
> 
> I have a general question on the series.
> 
> IIUC this series tries to protect illegal vfio userspace writes to device MMIO
> regions which may cause platform-level issues.  That makes perfect sense to 
> me.
> However what if the write comes from the devices' side?  E.g.:
> 
>   - Device A maps MMIO region X
> 
>   - Device B do VFIO_IOMMU_DMA_MAP on Device A's MMIO region X
> (so X's MMIO PFNs are mapped in device B's IOMMU page table)
> 
>   - Device A clears PCI_COMMAND_MEMORY (reset, etc.)
> - this should zap all existing vmas that map region X, however device
>   B's IOMMU page table is not aware of this?
> 
>   - Device B writes to MMIO region X of device A even if PCI_COMMAND_MEMORY
> cleared on device A's PCI_COMMAND register
> 
> Could this happen?

Yes, this can happen and Jason has brought up variations on this
scenario that are important to fix as well.  I've got some ideas, but
the access in this series was the current priority.  There are also
issues in the above scenario: if a platform considers a DMA write to an
invalid IOMMU PTE, and the resulting IOMMU fault, to have the same
severity as the write to disabled MMIO space we've prevented, then our
hands are tied.  Thanks,

Alex



Re: [PATCH v2 2/3] vfio-pci: Fault mmaps to enable vma tracking

2020-05-07 Thread Alex Williamson
On Thu, 7 May 2020 17:47:44 -0400
Peter Xu  wrote:

> Hi, Alex,
> 
> On Tue, May 05, 2020 at 03:54:53PM -0600, Alex Williamson wrote:
> > +/*
> > + * Zap mmaps on open so that we can fault them in on access and therefore
> > + * our vma_list only tracks mappings accessed since last zap.
> > + */
> > +static void vfio_pci_mmap_open(struct vm_area_struct *vma)
> > +{
> > +   zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);  
> 
> A pure question: is this only a safety-belt or it is required in some known
> scenarios?

It's not required.  I originally did this so that I'm not allocating a
vma_list entry in a path where I can't return error, but as Jason
suggested I could zap here only in the case that I do encounter that
allocation fault.  However I still like consolidating the vma_list
handling to the vm_ops .fault and .close callbacks and potentially we
reduce the zap latency by keeping the vma_list to actual users, which
we'll get to eventually anyway in the VM case as memory BARs are sized
and assigned addresses.

> In all cases:
> 
> Reviewed-by: Peter Xu 

Thanks!
Alex



Re: [PATCH v2 1/3] vfio/type1: Support faulting PFNMAP vmas

2020-05-07 Thread Alex Williamson
On Thu, 7 May 2020 17:24:43 -0400
Peter Xu  wrote:

> On Tue, May 05, 2020 at 03:54:44PM -0600, Alex Williamson wrote:
> > With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
> > the range being faulted into the vma.  Add support to manually provide
> > that, in the same way as done on KVM with hva_to_pfn_remapped().
> > 
> > Signed-off-by: Alex Williamson 
> > ---
> >  drivers/vfio/vfio_iommu_type1.c |   36 +---
> >  1 file changed, 33 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c 
> > b/drivers/vfio/vfio_iommu_type1.c
> > index cc1d64765ce7..4a4cb7cd86b2 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -317,6 +317,32 @@ static int put_pfn(unsigned long pfn, int prot)
> > return 0;
> >  }
> >  
> > +static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct 
> > *mm,
> > +   unsigned long vaddr, unsigned long *pfn,
> > +   bool write_fault)
> > +{
> > +   int ret;
> > +
> > +   ret = follow_pfn(vma, vaddr, pfn);
> > +   if (ret) {
> > +   bool unlocked = false;
> > +
> > +   ret = fixup_user_fault(NULL, mm, vaddr,
> > +  FAULT_FLAG_REMOTE |
> > +  (write_fault ?  FAULT_FLAG_WRITE : 0),
> > +  &unlocked);
> > +   if (unlocked)
> > +   return -EAGAIN;  
> 
> Hi, Alex,
> 
> IIUC this retry is not needed too because fixup_user_fault() will guarantee 
> the
> fault-in is done correctly with the valid PTE as long as ret==0, even if
> unlocked==true.
> 
> Note: there's another patch just removed the similar retry in kvm:
> 
> https://lore.kernel.org/kvm/20200416155906.267462-1-pet...@redhat.com/

Great, I was basing this on that kvm code, so I can make essentially an
identical fix.  Thanks!

Alex



[PATCH v2] vfio-pci: Mask cap zero

2020-05-05 Thread Alex Williamson
The PCI Code and ID Assignment Specification changed capability ID 0
from reserved to a NULL capability in the v1.1 revision.  The NULL
capability is defined to include only the 16-bit capability header,
ie. only the ID and next pointer.  Unfortunately vfio-pci creates a
map of config space, where ID 0 is used to reserve the standard type
0 header.  Finding an actual capability with this ID therefore results
in a bogus range marked in that map and conflicts with subsequent
capabilities.  As this seems to be a dummy capability anyway and we
already support dropping capabilities, let's hide this one rather than
delving into the potentially subtle dependencies within our map.

Seen on an NVIDIA Tesla T4.

Signed-off-by: Alex Williamson 
---
 drivers/vfio/pci/vfio_pci_config.c |7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index 3dcddbd572e6..0d110e268094 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1486,7 +1486,12 @@ static int vfio_cap_init(struct vfio_pci_device *vdev)
if (ret)
return ret;
 
-   if (cap <= PCI_CAP_ID_MAX) {
+   /*
+* ID 0 is a NULL capability, conflicting with our fake
+* PCI_CAP_ID_BASIC.  As it has no content, consider it
+* hidden for now.
+*/
+   if (cap && cap <= PCI_CAP_ID_MAX) {
len = pci_cap_length[cap];
if (len == 0xFF) { /* Variable length */
len = vfio_cap_len(vdev, cap, pos);



[PATCH v2 2/3] vfio-pci: Fault mmaps to enable vma tracking

2020-05-05 Thread Alex Williamson
Rather than calling remap_pfn_range() when a region is mmap'd, setup
a vm_ops handler to support dynamic faulting of the range on access.
This allows us to manage a list of vmas actively mapping the area that
we can later use to invalidate those mappings.  The open callback
invalidates the vma range so that all tracking is inserted in the
fault handler and removed in the close handler.

Signed-off-by: Alex Williamson 
---
 drivers/vfio/pci/vfio_pci.c |   76 ++-
 drivers/vfio/pci/vfio_pci_private.h |7 +++
 2 files changed, 81 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 6c6b37b5c04e..66a545a01f8f 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1299,6 +1299,70 @@ static ssize_t vfio_pci_write(void *device_data, const 
char __user *buf,
return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
 }
 
+static int vfio_pci_add_vma(struct vfio_pci_device *vdev,
+   struct vm_area_struct *vma)
+{
+   struct vfio_pci_mmap_vma *mmap_vma;
+
+   mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
+   if (!mmap_vma)
+   return -ENOMEM;
+
+   mmap_vma->vma = vma;
+
+   mutex_lock(&vdev->vma_lock);
+   list_add(&mmap_vma->vma_next, &vdev->vma_list);
+   mutex_unlock(&vdev->vma_lock);
+
+   return 0;
+}
+
+/*
+ * Zap mmaps on open so that we can fault them in on access and therefore
+ * our vma_list only tracks mappings accessed since last zap.
+ */
+static void vfio_pci_mmap_open(struct vm_area_struct *vma)
+{
+   zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
+}
+
+static void vfio_pci_mmap_close(struct vm_area_struct *vma)
+{
+   struct vfio_pci_device *vdev = vma->vm_private_data;
+   struct vfio_pci_mmap_vma *mmap_vma;
+
+   mutex_lock(&vdev->vma_lock);
+   list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
+   if (mmap_vma->vma == vma) {
+   list_del(&mmap_vma->vma_next);
+   kfree(mmap_vma);
+   break;
+   }
+   }
+   mutex_unlock(&vdev->vma_lock);
+}
+
+static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
+{
+   struct vm_area_struct *vma = vmf->vma;
+   struct vfio_pci_device *vdev = vma->vm_private_data;
+
+   if (vfio_pci_add_vma(vdev, vma))
+   return VM_FAULT_OOM;
+
+   if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+   vma->vm_end - vma->vm_start, vma->vm_page_prot))
+   return VM_FAULT_SIGBUS;
+
+   return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct vfio_pci_mmap_ops = {
+   .open = vfio_pci_mmap_open,
+   .close = vfio_pci_mmap_close,
+   .fault = vfio_pci_mmap_fault,
+};
+
 static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 {
struct vfio_pci_device *vdev = device_data;
@@ -1357,8 +1421,14 @@ static int vfio_pci_mmap(void *device_data, struct 
vm_area_struct *vma)
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
 
-   return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-  req_len, vma->vm_page_prot);
+   /*
+* See remap_pfn_range(), called from vfio_pci_fault() but we can't
+* change vm_flags within the fault handler.  Set them now.
+*/
+   vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
+   vma->vm_ops = &vfio_pci_mmap_ops;
+
+   return 0;
 }
 
 static void vfio_pci_request(void *device_data, unsigned int count)
@@ -1608,6 +1678,8 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id)
spin_lock_init(&vdev->irqlock);
mutex_init(&vdev->ioeventfds_lock);
INIT_LIST_HEAD(&vdev->ioeventfds_list);
+   mutex_init(&vdev->vma_lock);
+   INIT_LIST_HEAD(&vdev->vma_list);
 
ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
if (ret)
diff --git a/drivers/vfio/pci/vfio_pci_private.h 
b/drivers/vfio/pci/vfio_pci_private.h
index 36ec69081ecd..9b25f9f6ce1d 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -92,6 +92,11 @@ struct vfio_pci_vf_token {
int users;
 };
 
+struct vfio_pci_mmap_vma {
+   struct vm_area_struct   *vma;
+   struct list_head        vma_next;
+};
+
 struct vfio_pci_device {
struct pci_dev  *pdev;
void __iomem*barmap[PCI_STD_NUM_BARS];
@@ -132,6 +137,8 @@ struct vfio_pci_device {
struct list_head        ioeventfds_list;
struct vfio_pci_vf_token*vf_token;
struct notifier_block   nb;
+   struct mutex            vma_lock;
+   struct list_head        vma_list;
 };
 
 #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)



[PATCH v2 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory

2020-05-05 Thread Alex Williamson
Accessing the disabled memory space of a PCI device would typically
result in a master abort response on conventional PCI, or an
unsupported request on PCI express.  The user would generally see
these as a -1 response for the read return data and the write would be
silently discarded, possibly with an uncorrected, non-fatal AER error
triggered on the host.  Some systems however take it upon themselves
to bring down the entire system when they see something that might
indicate a loss of data, such as this discarded write to a disabled
memory space.

To avoid this, we want to try to block the user from accessing memory
spaces while they're disabled.  We start with a semaphore around the
memory enable bit, where writers modify the memory enable state and
must be serialized, while readers make use of the memory region and
can access in parallel.  Writers include both direct manipulation via
the command register, as well as any reset path where the internal
mechanics of the reset may both explicitly and implicitly disable
memory access, and manipulation of the MSI-X configuration, where the
MSI-X vector table resides in MMIO space of the device.  Readers
include the read and write file ops to access the vfio device fd
offsets as well as memory mapped access.  In the latter case, we make
use of our new vma list support to zap, or invalidate, those memory
mappings in order to force them to be faulted back in on access.

Our semaphore usage will stall user access to MMIO spaces across
internal operations like reset, but the user might experience new
behavior when trying to access the MMIO space while disabled via the
PCI command register.  Access via read or write while disabled will
return -EIO and access via memory maps will result in a SIGBUS.  This
is expected to be compatible with known use cases and potentially
provides better error handling capabilities than present in the
hardware, while avoiding the more readily accessible and severe
platform error responses that might otherwise occur.

Signed-off-by: Alex Williamson 
---
 drivers/vfio/pci/vfio_pci.c |  263 +++
 drivers/vfio/pci/vfio_pci_config.c  |   36 -
 drivers/vfio/pci/vfio_pci_intrs.c   |   18 ++
 drivers/vfio/pci/vfio_pci_private.h |5 +
 drivers/vfio/pci/vfio_pci_rdwr.c|   12 ++
 5 files changed, 300 insertions(+), 34 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 66a545a01f8f..49ae9faa6099 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "vfio_pci_private.h"
 
@@ -184,6 +185,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device 
*vdev)
 
 static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
 static void vfio_pci_disable(struct vfio_pci_device *vdev);
+static int vfio_pci_try_zap_and_vma_lock_cb(struct pci_dev *pdev, void *data);
 
 /*
  * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
@@ -736,6 +738,12 @@ int vfio_pci_register_dev_region(struct vfio_pci_device 
*vdev,
return 0;
 }
 
+struct vfio_devices {
+   struct vfio_device **devices;
+   int cur_index;
+   int max_index;
+};
+
 static long vfio_pci_ioctl(void *device_data,
   unsigned int cmd, unsigned long arg)
 {
@@ -984,8 +992,16 @@ static long vfio_pci_ioctl(void *device_data,
return ret;
 
} else if (cmd == VFIO_DEVICE_RESET) {
-   return vdev->reset_works ?
-   pci_try_reset_function(vdev->pdev) : -EINVAL;
+   int ret;
+
+   if (!vdev->reset_works)
+   return -EINVAL;
+
+   vfio_pci_zap_and_down_write_memory_lock(vdev);
+   ret = pci_try_reset_function(vdev->pdev);
+   up_write(&vdev->memory_lock);
+
+   return ret;
 
} else if (cmd == VFIO_DEVICE_GET_PCI_HOT_RESET_INFO) {
struct vfio_pci_hot_reset_info hdr;
@@ -1065,8 +1081,9 @@ static long vfio_pci_ioctl(void *device_data,
int32_t *group_fds;
struct vfio_pci_group_entry *groups;
struct vfio_pci_group_info info;
+   struct vfio_devices devs = { .cur_index = 0 };
bool slot = false;
-   int i, count = 0, ret = 0;
+   int i, group_idx, mem_idx = 0, count = 0, ret = 0;
 
minsz = offsetofend(struct vfio_pci_hot_reset, count);
 
@@ -1118,9 +1135,9 @@ static long vfio_pci_ioctl(void *device_data,
 * user interface and store the group and iommu ID.  This
 * ensures the group is held across the reset.
 */
-   for (i = 0; i < hdr.count; i++) {
+   for (group_idx = 0; group_idx < hdr.count; group_idx++) {
struct vfio_group *group;
-   

[PATCH v2 0/3] vfio-pci: Block user access to disabled device MMIO

2020-05-05 Thread Alex Williamson
v2:

Locking in 3/ is substantially changed to avoid the retry scenario
within the fault handler, therefore a caller who does not allow retry
will no longer receive a SIGBUS on contention.  IOMMU invalidations
are still not included here, I expect that will be a future follow-on
change as we're not fundamentally changing that issue in this series.
The 'add to vma list only on fault' behavior is also still included
here, per the discussion I think it's still a valid approach and has
some advantages, particularly in a VM scenario where we potentially
defer the mapping until the MMIO BAR is actually DMA mapped into the
VM address space (or the guest driver actually accesses the device
if that DMA mapping is eliminated at some point).  Further discussion
and review appreciated.  Thanks,

Alex

v1:

Add tracking of the device memory enable bit and block/fault accesses
to device MMIO space while disabled.  This provides synchronous fault
handling for CPU accesses to the device and prevents the user from
triggering platform level error handling present on some systems.
Device reset and MSI-X vector table accesses are also included such
that access is blocked across reset and vector table accesses do not
depend on the user configuration of the device.

This is based on the vfio for-linus branch currently in next, making
use of follow_pfn() in vaddr_get_pfn() and therefore requiring patch
1/ to force the user fault in the case that a PFNMAP vma might be
DMA mapped before user access.  Further PFNMAP iommu invalidation
tracking is not yet included here.

As noted in the comments, I'm copying quite a bit of the logic from
rdma code for performing the zap_vma_ptes() calls and I'm also
attempting to resolve lock ordering issues in the fault handler to
lockdep's satisfaction.  I appreciate extra eyes on these sections in
particular.

I expect this to be functionally equivalent for any well behaved
userspace driver, but obviously there is a potential for the user to
get -EIO or SIGBUS on device access.  The device is provided to the
user enabled and device resets will restore the command register, so
by my evaluation a user would need to explicitly disable the memory
enable bit to trigger these faults.  We could potentially remap vmas
to a zero page rather than SIGBUS if we experience regressions, but
without known code requiring that, SIGBUS seems the appropriate
response to this condition.  Thanks,

Alex

---

Alex Williamson (3):
  vfio/type1: Support faulting PFNMAP vmas
  vfio-pci: Fault mmaps to enable vma tracking
  vfio-pci: Invalidate mmaps and block MMIO access on disabled memory


 drivers/vfio/pci/vfio_pci.c |  321 +--
 drivers/vfio/pci/vfio_pci_config.c  |   36 +++-
 drivers/vfio/pci/vfio_pci_intrs.c   |   18 ++
 drivers/vfio/pci/vfio_pci_private.h |   12 +
 drivers/vfio/pci/vfio_pci_rdwr.c|   12 +
 drivers/vfio/vfio_iommu_type1.c |   36 
 6 files changed, 405 insertions(+), 30 deletions(-)



[PATCH v2 1/3] vfio/type1: Support faulting PFNMAP vmas

2020-05-05 Thread Alex Williamson
With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
the range being faulted into the vma.  Add support to manually provide
that, in the same way as done on KVM with hva_to_pfn_remapped().

Signed-off-by: Alex Williamson 
---
 drivers/vfio/vfio_iommu_type1.c |   36 +---
 1 file changed, 33 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index cc1d64765ce7..4a4cb7cd86b2 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -317,6 +317,32 @@ static int put_pfn(unsigned long pfn, int prot)
return 0;
 }
 
+static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
+   unsigned long vaddr, unsigned long *pfn,
+   bool write_fault)
+{
+   int ret;
+
+   ret = follow_pfn(vma, vaddr, pfn);
+   if (ret) {
+   bool unlocked = false;
+
+   ret = fixup_user_fault(NULL, mm, vaddr,
+  FAULT_FLAG_REMOTE |
+  (write_fault ?  FAULT_FLAG_WRITE : 0),
+  &unlocked);
+   if (unlocked)
+   return -EAGAIN;
+
+   if (ret)
+   return ret;
+
+   ret = follow_pfn(vma, vaddr, pfn);
+   }
+
+   return ret;
+}
+
 static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 int prot, unsigned long *pfn)
 {
@@ -339,12 +365,16 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
 
vaddr = untagged_addr(vaddr);
 
+retry:
vma = find_vma_intersection(mm, vaddr, vaddr + 1);
 
if (vma && vma->vm_flags & VM_PFNMAP) {
-   if (!follow_pfn(vma, vaddr, pfn) &&
-   is_invalid_reserved_pfn(*pfn))
-   ret = 0;
+   ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
+   if (ret == -EAGAIN)
+   goto retry;
+
+   if (!ret && !is_invalid_reserved_pfn(*pfn))
+   ret = -EFAULT;
}
 done:
up_read(&mm->mmap_sem);



Re: [PATCH 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory

2020-05-05 Thread Alex Williamson
On Mon, 4 May 2020 17:01:23 -0300
Jason Gunthorpe  wrote:

> On Mon, May 04, 2020 at 01:35:52PM -0600, Alex Williamson wrote:
> 
> > Ok, this all makes a lot more sense with memory_lock still in the
> > picture.  And it looks like you're not insisting on the wait_event, we
> > can block on memory_lock so long as we don't have an ordering issue.
> > I'll see what I can do.  Thanks,  
> 
> Right, you can block on the rwsem if it is ordered properly vs
> mmap_sem.

This is what I've come up with, please see if you agree with the logic:

void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_device *vdev)
{
struct vfio_pci_mmap_vma *mmap_vma, *tmp;

/*
 * Lock ordering:
 * vma_lock is nested under mmap_sem for vm_ops callback paths.
 * The memory_lock semaphore is used by both code paths calling
 * into this function to zap vmas and the vm_ops.fault callback
 * to protect the memory enable state of the device.
 *
 * When zapping vmas we need to maintain the mmap_sem => vma_lock
 * ordering, which requires using vma_lock to walk vma_list to
 * acquire an mm, then dropping vma_lock to get the mmap_sem and
 * reacquiring vma_lock.  This logic is derived from similar
 * requirements in uverbs_user_mmap_disassociate().
 *
 * mmap_sem must always be the top-level lock when it is taken.
 * Therefore we can only hold the memory_lock write lock when
 * vma_list is empty, as we'd need to take mmap_sem to clear
 * entries.  vma_list can only be guaranteed empty when holding
 * vma_lock, thus memory_lock is nested under vma_lock.
 *
 * This enables the vm_ops.fault callback to acquire vma_lock,
 * followed by memory_lock read lock, while already holding
 * mmap_sem without risk of deadlock.
 */
while (1) {
struct mm_struct *mm = NULL;

mutex_lock(&vdev->vma_lock);
while (!list_empty(&vdev->vma_list)) {
mmap_vma = list_first_entry(&vdev->vma_list,
struct vfio_pci_mmap_vma,
vma_next);
mm = mmap_vma->vma->vm_mm;
if (mmget_not_zero(mm))
break;

list_del(&mmap_vma->vma_next);
kfree(mmap_vma);
mm = NULL;
}

if (!mm)
break;
mutex_unlock(&vdev->vma_lock);

down_read(&mm->mmap_sem);
if (mmget_still_valid(mm)) {
mutex_lock(&vdev->vma_lock);
list_for_each_entry_safe(mmap_vma, tmp,
 &vdev->vma_list, vma_next) {
struct vm_area_struct *vma = mmap_vma->vma;

if (vma->vm_mm != mm)
continue;

list_del(&mmap_vma->vma_next);
kfree(mmap_vma);

zap_vma_ptes(vma, vma->vm_start,
 vma->vm_end - vma->vm_start);
}
mutex_unlock(&vdev->vma_lock);
}
up_read(&mm->mmap_sem);
mmput(mm);
}

down_write(&vdev->memory_lock);
mutex_unlock(&vdev->vma_lock);
}

As noted in the comment, the fault handler can simply do:

mutex_lock(&vdev->vma_lock);
down_read(&vdev->memory_lock);

This should be deadlock free now, so we can drop the retry handling
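
For reference, a fault handler following that ordering might look roughly
like this (sketch only; __vfio_pci_memory_enabled() is assumed here as a
helper testing the virtualized memory enable bit, and vfio_pci_add_vma()
is the tracking helper from patch 2/3):

static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct vfio_pci_device *vdev = vma->vm_private_data;
	vm_fault_t ret = VM_FAULT_NOPAGE;

	mutex_lock(&vdev->vma_lock);
	down_read(&vdev->memory_lock);

	/* Block the access entirely while the memory space is disabled */
	if (!__vfio_pci_memory_enabled(vdev)) {
		ret = VM_FAULT_SIGBUS;
		goto out;
	}

	/* Track the vma so a later zap/invalidate can find it */
	if (vfio_pci_add_vma(vdev, vma)) {
		ret = VM_FAULT_OOM;
		goto out;
	}

	if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
			    vma->vm_end - vma->vm_start, vma->vm_page_prot))
		ret = VM_FAULT_SIGBUS;
out:
	up_read(&vdev->memory_lock);
	mutex_unlock(&vdev->vma_lock);
	return ret;
}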

Paths needing to acquire memory_lock with vmas zapped (device reset,
memory bit *->0 transition) call this function, perform their
operation, then simply release with up_write(&vdev->memory_lock).  Both
the read and write version of acquiring memory_lock can still occur
outside this function for operations that don't require flushing all
vmas or otherwise touch vma_lock or mmap_sem (ex. read/write, MSI-X
vector table access, writing *->1 to memory enable bit).
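
For example, the VFIO_DEVICE_RESET path becomes roughly (sketch of the
usage described above):

	vfio_pci_zap_and_down_write_memory_lock(vdev);
	ret = pci_try_reset_function(vdev->pdev);
	up_write(&vdev->memory_lock);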

I still need to work on the bus reset path as acquiring memory_lock
write locks across multiple devices seems like it requires try-lock
behavior, which is clearly complicated, or at least messy in the above
function.

Does this seem like it's going in a reasonable direction?  Thanks,

Alex



Re: [PATCH] iommu: Relax ACS requirement for RCiEP devices.

2020-05-05 Thread Alex Williamson
On Tue, 5 May 2020 07:56:06 -0700
"Raj, Ashok"  wrote:

> On Tue, May 05, 2020 at 08:05:14AM -0600, Alex Williamson wrote:
> > On Mon, 4 May 2020 23:11:07 -0700
> > "Raj, Ashok"  wrote:
> >   
> > > Hi Alex
> > > 
> > > + Joerg, accidently missed in the Cc.
> > > 
> > > On Mon, May 04, 2020 at 11:19:36PM -0600, Alex Williamson wrote:  
> > > > On Mon,  4 May 2020 21:42:16 -0700
> > > > Ashok Raj  wrote:
> > > > 
> > > > > PCIe Spec recommends we can relax ACS requirement for RCIEP devices.
> > > > > 
> > > > > PCIe 5.0 Specification.
> > > > > 6.12 Access Control Services (ACS)
> > > > > Implementation of ACS in RCiEPs is permitted but not required. It is
> > > > > explicitly permitted that, within a single Root Complex, some RCiEPs
> > > > > implement ACS and some do not. It is strongly recommended that Root 
> > > > > Complex
> > > > > implementations ensure that all accesses originating from RCiEPs
> > > > > (PFs and VFs) without ACS capability are first subjected to 
> > > > > processing by
> > > > > the Translation Agent (TA) in the Root Complex before further 
> > > > > decoding and
> > > > > processing. The details of such Root Complex handling are outside the 
> > > > > scope
> > > > > of this specification.
> > > > >   
> > > > 
> > > > Is the language here really strong enough to make this change?  ACS is
> > > > an optional feature, so being permitted but not required is rather
> > > > meaningless.  The spec is also specifically avoiding the words "must"
> > > > or "shall" and even when emphasized with "strongly", we still only have
> > > > a recommendation that may or may not be honored.  This seems like a
> > > > weak basis for assuming that RCiEPs universally honor this
> > > > recommendation.  Thanks,
> > > > 
> > > 
> > > We are speaking about PCIe spec, where people write it about 5 years ahead
> > > and every vendor tries to massage their product behavior with vague
> > > words like this..  :)
> > > 
> > > But honestly for any RCiEP, or even integrated endpoints, there 
> > > is no way to send them except up north. These aren't behind an RP.  
> > 
> > But they are multi-function devices and the spec doesn't define routing
> > within multifunction packages.  A single function RCiEP will already be
> > assumed isolated within its own group.  
> 
> That's right. The other two devices only have legacy PCI headers. So 
> they can't claim to be RCiEP's but just integrated endpoints. The legacy
> devices don't even have a PCIe header.
> 
> I honestly don't know why these are grouped as MFDs in the first place.
> 
> >
> > > I did check with a couple of folks who are part of the SIG, and they seem to
> > > agree that ACS treatment for RCiEPs doesn't mean much. 
> > > 
> > > I understand the language isn't strong, but it doesn't seem like ACS should
> > > be a strong requirement for RCiEPs, and it seems reasonable to relax it.
> > > 
> > > What are your thoughts?   
> > 
> > I think hardware vendors have ACS at their disposal to clarify when
> > isolation is provided, otherwise vendors can submit quirks, but I don't
> > see that the "strongly recommended" phrasing is sufficient to assume
> > isolation between multifunction RCiEPs.  Thanks,  
> 
> Your point is that for integrated MFD endpoints without ACS, there is no 
> guarantee to SW that they are isolated.
> 
> As far as a quirk, do you think:
>   - would a cmdline opt-out for integrated endpoints and RCiEPs suffice,
> along with a compile-time default of strict enforcement?
>   - typical vid/did type exception list?
> 
> A more generic way to ask for an exception would be scalable until we can stop
> those types of integrated devices. Otherwise we need to maintain these device
> lists for eternity. 

I don't think the language in the spec is anything sufficient to handle
RCiEP uniquely.  We've previously rejected kernel command line opt-outs
for ACS, and the extent to which those patches still float around the
user community and are blindly used to separate IOMMU groups are a
testament to the failure of this approach.  Users do not have a basis
for enabling this sort of opt-out.  The benefit is obvious in the IOMMU
grouping, but the risk is entirely unknown.  A kconfig opt

Re: [PATCH] iommu: Relax ACS requirement for RCiEP devices.

2020-05-05 Thread Alex Williamson
On Mon, 4 May 2020 23:11:07 -0700
"Raj, Ashok"  wrote:

> Hi Alex
> 
> + Joerg, accidently missed in the Cc.
> 
> On Mon, May 04, 2020 at 11:19:36PM -0600, Alex Williamson wrote:
> > On Mon,  4 May 2020 21:42:16 -0700
> > Ashok Raj  wrote:
> >   
> > > PCIe Spec recommends we can relax ACS requirement for RCIEP devices.
> > > 
> > > PCIe 5.0 Specification.
> > > 6.12 Access Control Services (ACS)
> > > Implementation of ACS in RCiEPs is permitted but not required. It is
> > > explicitly permitted that, within a single Root Complex, some RCiEPs
> > > implement ACS and some do not. It is strongly recommended that Root 
> > > Complex
> > > implementations ensure that all accesses originating from RCiEPs
> > > (PFs and VFs) without ACS capability are first subjected to processing by
> > > the Translation Agent (TA) in the Root Complex before further decoding and
> > > processing. The details of such Root Complex handling are outside the 
> > > scope
> > > of this specification.
> > > 
> > > Since Linux didn't give special treatment to allow this exception, certain
> > > RCiEP MFD devices are getting grouped in a single iommu group. This
> > > doesn't permit a single device to be assigned to a guest for instance.
> > > 
> > > In one vendor system: Device 14.x were grouped in a single IOMMU group.
> > > 
> > > /sys/kernel/iommu_groups/5/devices/:00:14.0
> > > /sys/kernel/iommu_groups/5/devices/:00:14.2
> > > /sys/kernel/iommu_groups/5/devices/:00:14.3
> > > 
> > > After the patch:
> > > /sys/kernel/iommu_groups/5/devices/:00:14.0
> > > /sys/kernel/iommu_groups/5/devices/:00:14.2
> > > /sys/kernel/iommu_groups/6/devices/:00:14.3 <<< new group
> > > 
> > > 14.0 and 14.2 are integrated devices, but legacy end points.
> > > Whereas 14.3 was a PCIe compliant RCiEP.
> > > 
> > > 00:14.3 Network controller: Intel Corporation Device 9df0 (rev 30)
> > > Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
> > > 
> > > This permits assigning this device to a guest VM.
> > > 
> > > Fixes: f096c061f552 ("iommu: Rework iommu_group_get_for_pci_dev()")
> > > Signed-off-by: Ashok Raj 
> > > To: Joerg Roedel 
> > > To: Bjorn Helgaas 
> > > Cc: linux-kernel@vger.kernel.org
> > > Cc: io...@lists.linux-foundation.org
> > > Cc: Lu Baolu 
> > > Cc: Alex Williamson 
> > > Cc: Darrel Goeddel 
> > > Cc: Mark Scott ,
> > > Cc: Romil Sharma 
> > > Cc: Ashok Raj 
> > > ---
> > >  drivers/iommu/iommu.c | 15 ++-
> > >  1 file changed, 14 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > > index 2b471419e26c..5744bd65f3e2 100644
> > > --- a/drivers/iommu/iommu.c
> > > +++ b/drivers/iommu/iommu.c
> > > @@ -1187,7 +1187,20 @@ static struct iommu_group 
> > > *get_pci_function_alias_group(struct pci_dev *pdev,
> > >   struct pci_dev *tmp = NULL;
> > >   struct iommu_group *group;
> > >  
> > > - if (!pdev->multifunction || pci_acs_enabled(pdev, REQ_ACS_FLAGS))
> > > + /*
> > > +  * PCI Spec 5.0, Section 6.12 Access Control Service
> > > +  * Implementation of ACS in RCiEPs is permitted but not required.
> > > +  * It is explicitly permitted that, within a single Root
> > > +  * Complex, some RCiEPs implement ACS and some do not. It is
> > > +  * strongly recommended that Root Complex implementations ensure
> > > +  * that all accesses originating from RCiEPs (PFs and VFs) without
> > > +  * ACS capability are first subjected to processing by the Translation
> > > +  * Agent (TA) in the Root Complex before further decoding and
> > > +  * processing.
> > > +  */  
> > 
> > Is the language here really strong enough to make this change?  ACS is
> > an optional feature, so being permitted but not required is rather
> > meaningless.  The spec is also specifically avoiding the words "must"
> > or "shall" and even when emphasized with "strongly", we still only have
> > a recommendation that may or may not be honored.  This seems like a
> > weak basis for assuming that RCiEPs universally honor this
> > recommendation.  Thanks,
> >   
> 
> We are speaking about PCIe spec, where people write it about 5 years ahead
> and every

Re: [PATCH] iommu: Relax ACS requirement for RCiEP devices.

2020-05-04 Thread Alex Williamson
On Mon,  4 May 2020 21:42:16 -0700
Ashok Raj  wrote:

> PCIe Spec recommends we can relax ACS requirement for RCIEP devices.
> 
> PCIe 5.0 Specification.
> 6.12 Access Control Services (ACS)
> Implementation of ACS in RCiEPs is permitted but not required. It is
> explicitly permitted that, within a single Root Complex, some RCiEPs
> implement ACS and some do not. It is strongly recommended that Root Complex
> implementations ensure that all accesses originating from RCiEPs
> (PFs and VFs) without ACS capability are first subjected to processing by
> the Translation Agent (TA) in the Root Complex before further decoding and
> processing. The details of such Root Complex handling are outside the scope
> of this specification.
> 
> Since Linux didn't give special treatment to allow this exception, certain
> RCiEP MFD devices are getting grouped in a single iommu group. This
> doesn't permit a single device to be assigned to a guest for instance.
> 
> In one vendor system: Device 14.x were grouped in a single IOMMU group.
> 
> /sys/kernel/iommu_groups/5/devices/:00:14.0
> /sys/kernel/iommu_groups/5/devices/:00:14.2
> /sys/kernel/iommu_groups/5/devices/:00:14.3
> 
> After the patch:
> /sys/kernel/iommu_groups/5/devices/:00:14.0
> /sys/kernel/iommu_groups/5/devices/:00:14.2
> /sys/kernel/iommu_groups/6/devices/:00:14.3 <<< new group
> 
> 14.0 and 14.2 are integrated devices, but legacy end points.
> Whereas 14.3 was a PCIe compliant RCiEP.
> 
> 00:14.3 Network controller: Intel Corporation Device 9df0 (rev 30)
> Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
> 
> This permits assigning this device to a guest VM.
> 
> Fixes: f096c061f552 ("iommu: Rework iommu_group_get_for_pci_dev()")
> Signed-off-by: Ashok Raj 
> To: Joerg Roedel 
> To: Bjorn Helgaas 
> Cc: linux-kernel@vger.kernel.org
> Cc: io...@lists.linux-foundation.org
> Cc: Lu Baolu 
> Cc: Alex Williamson 
> Cc: Darrel Goeddel 
> Cc: Mark Scott ,
> Cc: Romil Sharma 
> Cc: Ashok Raj 
> ---
>  drivers/iommu/iommu.c | 15 ++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 2b471419e26c..5744bd65f3e2 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1187,7 +1187,20 @@ static struct iommu_group 
> *get_pci_function_alias_group(struct pci_dev *pdev,
>   struct pci_dev *tmp = NULL;
>   struct iommu_group *group;
>  
> - if (!pdev->multifunction || pci_acs_enabled(pdev, REQ_ACS_FLAGS))
> + /*
> +  * PCI Spec 5.0, Section 6.12 Access Control Service
> +  * Implementation of ACS in RCiEPs is permitted but not required.
> +  * It is explicitly permitted that, within a single Root
> +  * Complex, some RCiEPs implement ACS and some do not. It is
> +  * strongly recommended that Root Complex implementations ensure
> +  * that all accesses originating from RCiEPs (PFs and VFs) without
> +  * ACS capability are first subjected to processing by the Translation
> +  * Agent (TA) in the Root Complex before further decoding and
> +  * processing.
> +  */

Is the language here really strong enough to make this change?  ACS is
an optional feature, so being permitted but not required is rather
meaningless.  The spec is also specifically avoiding the words "must"
or "shall" and even when emphasized with "strongly", we still only have
a recommendation that may or may not be honored.  This seems like a
weak basis for assuming that RCiEPs universally honor this
recommendation.  Thanks,

Alex

> + if (!pdev->multifunction ||
> + (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END) ||
> +  pci_acs_enabled(pdev, REQ_ACS_FLAGS))
>   return NULL;
>  
>   for_each_pci_dev(tmp) {



Re: [PATCH] vfio-pci: Mask cap zero

2020-05-04 Thread Alex Williamson
On Mon, 4 May 2020 15:08:08 -0700
Neo Jia  wrote:

> On Mon, May 04, 2020 at 12:52:53PM -0600, Alex Williamson wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Mon, 4 May 2020 18:09:16 +0200
> > Cornelia Huck  wrote:
> >   
> > > On Fri, 01 May 2020 15:41:24 -0600
> > > Alex Williamson  wrote:
> > >  
> > > > There is no PCI spec defined capability with ID 0, therefore we don't
> > > > expect to find it in a capability chain and we use this index in an
> > > > internal array for tracking the sizes of various capabilities to handle
> > > > standard config space.  Therefore if a device does present us with a
> > > > capability ID 0, we mark our capability map with nonsense that can
> > > > trigger conflicts with other capabilities in the chain.  Ignore ID 0
> > > > when walking the capability chain, handling it as a hidden capability.
> > > >
> > > > Seen on an NVIDIA Tesla T4.
> > > >
> > > > Signed-off-by: Alex Williamson 
> > > > ---
> > > >  drivers/vfio/pci/vfio_pci_config.c |2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/vfio/pci/vfio_pci_config.c 
> > > > b/drivers/vfio/pci/vfio_pci_config.c
> > > > index 87d0cc8c86ad..5935a804cb88 100644
> > > > --- a/drivers/vfio/pci/vfio_pci_config.c
> > > > +++ b/drivers/vfio/pci/vfio_pci_config.c
> > > > @@ -1487,7 +1487,7 @@ static int vfio_cap_init(struct vfio_pci_device 
> > > > *vdev)
> > > > if (ret)
> > > > return ret;
> > > >
> > > > -   if (cap <= PCI_CAP_ID_MAX) {  
> > >
> > > Maybe add a comment:
> > >
> > > /* no PCI spec defined capability with ID 0: hide it */  
> 
> Hi Alex,
> 
> I think this is the NULL Capability defined in the Codes and IDs spec, probably we
> should just add a new enum to represent that?

Yes, it looks like the 1.1 version of that specification from June 2015
changed ID 0 from reserved to a NULL capability.  So my description and
this comment are wrong, but I wonder if we should do anything
different with the handling of this capability.  It's specified to
contain only the ID and next pointer, so I'd expect it's primarily a
mechanism for hardware vendors to blow fuses in config space to
maintain a capability chain while maybe hiding a feature not supported
by the product sku.  Hiding the capability in vfio is trivial, exposing
it implies some changes to our config space map that might be more
subtle.  I'm inclined to stick with this solution for now.  Thanks,

Alex
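
Just to illustrate the shape of the walk (a generic sketch, not the
vfio_cap_init() code itself), a NULL capability only contributes its ID/next
pair and is otherwise skipped:

	u8 pos, cap, next;
	int loops = (PCI_CFG_SPACE_SIZE - PCI_STD_HEADER_SIZEOF) / PCI_CAP_SIZEOF;

	pci_read_config_byte(pdev, PCI_CAPABILITY_LIST, &pos);

	while (pos && loops--) {
		pci_read_config_byte(pdev, pos + PCI_CAP_LIST_ID, &cap);
		pci_read_config_byte(pdev, pos + PCI_CAP_LIST_NEXT, &next);

		if (cap && cap <= PCI_CAP_ID_MAX) {
			/* known capability: record its length and handlers */
		} else if (!cap) {
			/* NULL capability: just ID + next pointer, hide it */
		}

		pos = next;
	}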

> > 
> > Sure.
> >   
> > >  
> > > > +   if (cap && cap <= PCI_CAP_ID_MAX) {
> > > > len = pci_cap_length[cap];
> > > > if (len == 0xFF) { /* Variable length */
> > > > len = vfio_cap_len(vdev, cap, pos);
> > > >  
> > >
> > > Is there a requirement for caps to be strictly ordered? If not, could
> > > len hold a residual value from a previous iteration?  
> > 
> > There is no ordering requirement for capabilities, but len is declared
> > non-static with an initial value within the scope of the loop, it's
> > reset every iteration.  Thanks,
> > 
> > Alex
> >   
> 



Re: [PATCH 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory

2020-05-04 Thread Alex Williamson
On Mon, 4 May 2020 15:44:36 -0300
Jason Gunthorpe  wrote:

> On Mon, May 04, 2020 at 12:26:43PM -0600, Alex Williamson wrote:
> > On Fri, 1 May 2020 20:48:49 -0300
> > Jason Gunthorpe  wrote:
> >   
> > > On Fri, May 01, 2020 at 03:39:30PM -0600, Alex Williamson wrote:
> > >   
> > > >  static int vfio_pci_add_vma(struct vfio_pci_device *vdev,
> > > > struct vm_area_struct *vma)
> > > >  {
> > > > @@ -1346,15 +1450,49 @@ static vm_fault_t vfio_pci_mmap_fault(struct 
> > > > vm_fault *vmf)
> > > >  {
> > > > struct vm_area_struct *vma = vmf->vma;
> > > > struct vfio_pci_device *vdev = vma->vm_private_data;
> > > > +   vm_fault_t ret = VM_FAULT_NOPAGE;
> > > >  
> > > > -   if (vfio_pci_add_vma(vdev, vma))
> > > > -   return VM_FAULT_OOM;
> > > > +   /*
> > > > +* Zap callers hold memory_lock and acquire mmap_sem, we hold
> > > > +* mmap_sem and need to acquire memory_lock to avoid races with
> > > > +* memory bit settings.  Release mmap_sem, wait, and retry, or 
> > > > fail.
> > > > +*/
> > > > +	if (unlikely(!down_read_trylock(&vdev->memory_lock))) {
> > > > +		if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
> > > > +			if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
> > > > +				return VM_FAULT_RETRY;
> > > > +
> > > > +			up_read(&vma->vm_mm->mmap_sem);
> > > > +
> > > > +			if (vmf->flags & FAULT_FLAG_KILLABLE) {
> > > > +				if (!down_read_killable(&vdev->memory_lock))
> > > > +					up_read(&vdev->memory_lock);
> > > > +			} else {
> > > > +				down_read(&vdev->memory_lock);
> > > > +				up_read(&vdev->memory_lock);
> > > > +			}
> > > > +			return VM_FAULT_RETRY;
> > > > +		}
> > > > +		return VM_FAULT_SIGBUS;
> > > > +	}
> > > 
> > > So, why have the wait? It isn't reliable - if this gets faulted from a
> > > call site that can't handle retry then it will SIGBUS anyhow?  
> > 
> > Do such call sites exist?  My assumption was that half of the branch
> > was unlikely to ever occur.  
> 
> hmm_range_fault() for instance doesn't set ALLOW_RETRY, I assume there
> are enough other case to care about, but am not so sure
> 
> > > The weird use of a rwsem as a completion suggest that perhaps using
> > > wait_event might improve things:
> > > 
> > > disable:
> > >   // Clean out the vma list with zap, then:
> > > 
> > >   down_read(mm->mmap_sem)  
> > 
> > I assume this is simplifying the dance we do in zapping to first take
> > vma_lock in order to walk vma_list, to find a vma from which we can
> > acquire the mm, drop vma_lock, get mmap_sem, then re-get vma_lock
> > below.
> 
> No, that has to stay..

Sorry, I stated that unclearly, I'm assuming we keep that and it's been
omitted from this pseudo code for simplicity.
 
> > Also accounting that vma_list might be empty and we might need
> > to drop and re-acquire vma_lock to get to another mm, so we really
> > probably want to set pause_faults at the start rather than at the end.  
> 
> New vmas should not created/faulted while vma_lock is held, so the
> order shouldn't matter..

Technically that's true, but if vfio_pci_zap_mmap_vmas() drops vma_lock
to go back and get another mm, then vm_ops.fault() could get another
vma into the list while we're trying to zap and clear them all.  The
result is the same, but we might be doing unnecessary work versus
holding off the fault from the start.
 
> > >   mutex_lock(vma_lock);
> > >   list_for_each_entry_safe()
> > >  // zap and remove all vmas
> > > 
> > >   pause_faults = true;
> > >   mutex_write(vma_lock);
> > > 
> > > fault:
> > >   // Already have down_read(mmap_sem)
> > >   mutex_lock(vma_lock);
> > >   while (pause_faults) {
> > >  mutex_unlock(vma_lock)
> > >  wait_event(..., !pause_faults)
> > >  mutex_lock(vma_lock)
> > >   }  
> > 
> > Nit, we need to te

Re: [PATCH] vfio-pci: Mask cap zero

2020-05-04 Thread Alex Williamson
On Mon, 4 May 2020 18:09:16 +0200
Cornelia Huck  wrote:

> On Fri, 01 May 2020 15:41:24 -0600
> Alex Williamson  wrote:
> 
> > There is no PCI spec defined capability with ID 0, therefore we don't
> > expect to find it in a capability chain and we use this index in an
> > internal array for tracking the sizes of various capabilities to handle
> > standard config space.  Therefore if a device does present us with a
> > capability ID 0, we mark our capability map with nonsense that can
> > trigger conflicts with other capabilities in the chain.  Ignore ID 0
> > when walking the capability chain, handling it as a hidden capability.
> > 
> > Seen on an NVIDIA Tesla T4.
> > 
> > Signed-off-by: Alex Williamson 
> > ---
> >  drivers/vfio/pci/vfio_pci_config.c |2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci_config.c 
> > b/drivers/vfio/pci/vfio_pci_config.c
> > index 87d0cc8c86ad..5935a804cb88 100644
> > --- a/drivers/vfio/pci/vfio_pci_config.c
> > +++ b/drivers/vfio/pci/vfio_pci_config.c
> > @@ -1487,7 +1487,7 @@ static int vfio_cap_init(struct vfio_pci_device *vdev)
> > if (ret)
> > return ret;
> >  
> > -   if (cap <= PCI_CAP_ID_MAX) {  
> 
> Maybe add a comment:
> 
> /* no PCI spec defined capability with ID 0: hide it */
> 

Sure.

> 
> > +   if (cap && cap <= PCI_CAP_ID_MAX) {
> > len = pci_cap_length[cap];
> > if (len == 0xFF) { /* Variable length */
> > len = vfio_cap_len(vdev, cap, pos);
> >   
> 
> Is there a requirement for caps to be strictly ordered? If not, could
> len hold a residual value from a previous iteration?

There is no ordering requirement for capabilities, but len is declared
non-static with an initial value within the scope of the loop, it's
reset every iteration.  Thanks,

Alex
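
To make the scoping concrete, the loop body is shaped like this (paraphrased,
not a verbatim quote of vfio_cap_init()):

	while (pos && loops--) {
		int len = 0;	/* fresh, zero-initialized on every iteration */

		/* ... read the capability ID into cap ... */

		if (cap && cap <= PCI_CAP_ID_MAX) {
			len = pci_cap_length[cap];
			if (len == 0xFF)	/* variable length */
				len = vfio_cap_len(vdev, cap, pos);
		}

		/* len can never hold a residual value from a prior iteration */
	}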



Re: [PATCH 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory

2020-05-04 Thread Alex Williamson
On Fri, 1 May 2020 20:48:49 -0300
Jason Gunthorpe  wrote:

> On Fri, May 01, 2020 at 03:39:30PM -0600, Alex Williamson wrote:
> 
> >  static int vfio_pci_add_vma(struct vfio_pci_device *vdev,
> > struct vm_area_struct *vma)
> >  {
> > @@ -1346,15 +1450,49 @@ static vm_fault_t vfio_pci_mmap_fault(struct 
> > vm_fault *vmf)
> >  {
> > struct vm_area_struct *vma = vmf->vma;
> > struct vfio_pci_device *vdev = vma->vm_private_data;
> > +   vm_fault_t ret = VM_FAULT_NOPAGE;
> >  
> > -   if (vfio_pci_add_vma(vdev, vma))
> > -   return VM_FAULT_OOM;
> > +   /*
> > +* Zap callers hold memory_lock and acquire mmap_sem, we hold
> > +* mmap_sem and need to acquire memory_lock to avoid races with
> > +* memory bit settings.  Release mmap_sem, wait, and retry, or fail.
> > +*/
> > +	if (unlikely(!down_read_trylock(&vdev->memory_lock))) {
> > +		if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
> > +			if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
> > +				return VM_FAULT_RETRY;
> > +
> > +			up_read(&vma->vm_mm->mmap_sem);
> > +
> > +			if (vmf->flags & FAULT_FLAG_KILLABLE) {
> > +				if (!down_read_killable(&vdev->memory_lock))
> > +					up_read(&vdev->memory_lock);
> > +			} else {
> > +				down_read(&vdev->memory_lock);
> > +				up_read(&vdev->memory_lock);
> > +			}
> > +			return VM_FAULT_RETRY;
> > +		}
> > +		return VM_FAULT_SIGBUS;
> > +	}  
> 
> So, why have the wait? It isn't reliable - if this gets faulted from a
> call site that can't handle retry then it will SIGBUS anyhow?

Do such call sites exist?  My assumption was that half of the branch
was unlikely to ever occur.

> The weird use of a rwsem as a completion suggest that perhaps using
> wait_event might improve things:
> 
> disable:
>   // Clean out the vma list with zap, then:
> 
>   down_read(mm->mmap_sem)

I assume this is simplifying the dance we do in zapping to first take
vma_lock in order to walk vma_list, to find a vma from which we can
acquire the mm, drop vma_lock, get mmap_sem, then re-get vma_lock
below.  Also accounting that vma_list might be empty and we might need
to drop and re-acquire vma_lock to get to another mm, so we really
probably want to set pause_faults at the start rather than at the end.

>   mutex_lock(vma_lock);
>   list_for_each_entry_safe()
>  // zap and remove all vmas
> 
>   pause_faults = true;
>   mutex_write(vma_lock);
> 
> fault:
>   // Already have down_read(mmap_sem)
>   mutex_lock(vma_lock);
>   while (pause_faults) {
>  mutex_unlock(vma_lock)
>  wait_event(..., !pause_faults)
>  mutex_lock(vma_lock)
>   }

Nit, we need to test the memory enable bit setting somewhere under this
lock since it seems to be the only thing protecting it now.

>   list_add()
>   remap_pfn()
>   mutex_unlock(vma_lock)

The read and write file ops would need similar mechanisms.
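
With the existing memory_lock rwsem, for comparison, those paths just bracket
the access (sketch only; both helper names are placeholders):

	down_read(&vdev->memory_lock);
	if (__vfio_pci_memory_enabled(vdev))	/* assumed helper testing the command register */
		ret = vfio_pci_do_io(vdev, buf, count, ppos, iswrite);	/* placeholder for the actual access */
	else
		ret = -EIO;
	up_read(&vdev->memory_lock);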

> enable:
>   pause_faults = false
>   wake_event()

Hmm, vma_lock was dropped above and not re-acquired here.  I'm not sure
if it was an oversight that pause_faults was not tested in the disable
path, but this combination appears to lead to concurrent writers and
serialized readers??

So yeah, this might resolve a theoretical sigbus if we can't retry to
get the memory_lock ordering correct, but we also lose the concurrency
that memory_lock provided us.

> 
> The only requirement here is that while inside the write side of
> memory_lock you cannot touch user pages (ie no copy_from_user/etc)

I'm lost at this statement, I can only figure the above works if we
remove memory_lock.  Are you referring to a different lock?  Thanks,

Alex



Re: [PATCH 2/3] vfio-pci: Fault mmaps to enable vma tracking

2020-05-04 Thread Alex Williamson
On Mon, 4 May 2020 12:05:56 -0300
Jason Gunthorpe  wrote:

> On Mon, May 04, 2020 at 08:20:55AM -0600, Alex Williamson wrote:
> > On Fri, 1 May 2020 20:25:50 -0300
> > Jason Gunthorpe  wrote:
> >   
> > > On Fri, May 01, 2020 at 03:39:19PM -0600, Alex Williamson wrote:  
> > > > Rather than calling remap_pfn_range() when a region is mmap'd, setup
> > > > a vm_ops handler to support dynamic faulting of the range on access.
> > > > This allows us to manage a list of vmas actively mapping the area that
> > > > we can later use to invalidate those mappings.  The open callback
> > > > invalidates the vma range so that all tracking is inserted in the
> > > > fault handler and removed in the close handler.
> > > > 
> > > > Signed-off-by: Alex Williamson 
> > > >  drivers/vfio/pci/vfio_pci.c |   76 
> > > > ++-
> > > >  drivers/vfio/pci/vfio_pci_private.h |7 +++
> > > >  2 files changed, 81 insertions(+), 2 deletions(-)
> > >   
> > > > +static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> > > > +{
> > > > +   struct vm_area_struct *vma = vmf->vma;
> > > > +   struct vfio_pci_device *vdev = vma->vm_private_data;
> > > > +
> > > > +   if (vfio_pci_add_vma(vdev, vma))
> > > > +   return VM_FAULT_OOM;
> > > > +
> > > > +   if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> > > > +   vma->vm_end - vma->vm_start, 
> > > > vma->vm_page_prot))
> > > > +   return VM_FAULT_SIGBUS;
> > > > +
> > > > +   return VM_FAULT_NOPAGE;
> > > > +}
> > > > +
> > > > +static const struct vm_operations_struct vfio_pci_mmap_ops = {
> > > > +   .open = vfio_pci_mmap_open,
> > > > +   .close = vfio_pci_mmap_close,
> > > > +   .fault = vfio_pci_mmap_fault,
> > > > +};
> > > > +
> > > >  static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> > > >  {
> > > > struct vfio_pci_device *vdev = device_data;
> > > > @@ -1357,8 +1421,14 @@ static int vfio_pci_mmap(void *device_data, 
> > > > struct vm_area_struct *vma)
> > > > vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > > > vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) 
> > > > + pgoff;
> > > >  
> > > > -   return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> > > > -  req_len, vma->vm_page_prot);
> > > > +   /*
> > > > +* See remap_pfn_range(), called from vfio_pci_fault() but we 
> > > > can't
> > > > +* change vm_flags within the fault handler.  Set them now.
> > > > +*/
> > > > > +	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
> > > > > +	vma->vm_ops = &vfio_pci_mmap_ops;
> > > 
> > > Perhaps do the vfio_pci_add_vma & remap_pfn_range combo here if the
> > > BAR is activated ? That way a fully populated BAR is presented in the
> > > common case and avoids taking a fault path?
> > > 
> > > But it does seem OK as is  
> > 
> > Thanks for reviewing.  There's also an argument that we defer
> > remap_pfn_range() until the device is actually touched, which might
> > reduce the startup latency.  
> 
> But not startup to a functional VM as that will now have to take the
> slower fault path.

We need to take the fault path regardless because a VM will size and
(virtually) map the BARs, toggling the memory enable bit.  As provided
here, we don't trigger the fault unless the user attempts to access the
BAR or we DMA map the BAR.  That defers the fault until the VM is (to
some extent) up and running, and has a better chance for multi-threaded
faulting than does QEMU initialization. 
 
> > It's also a bit inconsistent with the vm_ops.open() path where I
> > can't return error, so I can't call vfio_pci_add_vma(), I can only
> > zap the vma so that the fault handler can return an error if
> > necessary.  
> 
> open could allocate memory so the zap isn't needed. If allocation
> fails then do the zap and take the slow path.

That's a good idea, but it also gives us one more initialization
variation.  I thought it was a rather nice feature that our vma_list
incl

Re: [PATCH 2/3] vfio-pci: Fault mmaps to enable vma tracking

2020-05-04 Thread Alex Williamson
On Fri, 1 May 2020 20:25:50 -0300
Jason Gunthorpe  wrote:

> On Fri, May 01, 2020 at 03:39:19PM -0600, Alex Williamson wrote:
> > Rather than calling remap_pfn_range() when a region is mmap'd, setup
> > a vm_ops handler to support dynamic faulting of the range on access.
> > This allows us to manage a list of vmas actively mapping the area that
> > we can later use to invalidate those mappings.  The open callback
> > invalidates the vma range so that all tracking is inserted in the
> > fault handler and removed in the close handler.
> > 
> > Signed-off-by: Alex Williamson 
> > ---
> >  drivers/vfio/pci/vfio_pci.c |   76 
> > ++-
> >  drivers/vfio/pci/vfio_pci_private.h |7 +++
> >  2 files changed, 81 insertions(+), 2 deletions(-)  
> 
> > +static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> > +{
> > +   struct vm_area_struct *vma = vmf->vma;
> > +   struct vfio_pci_device *vdev = vma->vm_private_data;
> > +
> > +   if (vfio_pci_add_vma(vdev, vma))
> > +   return VM_FAULT_OOM;
> > +
> > +   if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> > +   vma->vm_end - vma->vm_start, vma->vm_page_prot))
> > +   return VM_FAULT_SIGBUS;
> > +
> > +   return VM_FAULT_NOPAGE;
> > +}
> > +
> > +static const struct vm_operations_struct vfio_pci_mmap_ops = {
> > +   .open = vfio_pci_mmap_open,
> > +   .close = vfio_pci_mmap_close,
> > +   .fault = vfio_pci_mmap_fault,
> > +};
> > +
> >  static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> >  {
> > struct vfio_pci_device *vdev = device_data;
> > @@ -1357,8 +1421,14 @@ static int vfio_pci_mmap(void *device_data, struct 
> > vm_area_struct *vma)
> > vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
> >  
> > -   return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> > -  req_len, vma->vm_page_prot);
> > +   /*
> > +* See remap_pfn_range(), called from vfio_pci_fault() but we can't
> > +* change vm_flags within the fault handler.  Set them now.
> > +*/
> > +   vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
> > +	vma->vm_ops = &vfio_pci_mmap_ops;  
> 
> Perhaps do the vfio_pci_add_vma & remap_pfn_range combo here if the
> BAR is activated ? That way a fully populated BAR is presented in the
> common case and avoids taking a fault path?
> 
> But it does seem OK as is

Thanks for reviewing.  There's also an argument that we defer
remap_pfn_range() until the device is actually touched, which might
reduce the startup latency.  It's also a bit inconsistent with the
vm_ops.open() path where I can't return error, so I can't call
vfio_pci_add_vma(), I can only zap the vma so that the fault handler
can return an error if necessary.  Therefore it felt more consistent,
with potential startup latency improvements, to defer all mappings to
the fault handler.  If there's a good reason to do otherwise, I can
make the change, but I doubt I'd have encountered the dma mapping of an
unfaulted vma issue had I done it this way, so maybe there's a test
coverage argument as well.  Thanks,

Alex
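
For context, the tracking that the fault handler feeds boils down to a small
list element and helper along these lines (reconstructed from the hunks quoted
above; treat names and details as approximate):

	struct vfio_pci_mmap_vma {
		struct vm_area_struct	*vma;
		struct list_head	vma_next;
	};

	/* Record a vma on first fault; removed again on close or zap. */
	static int vfio_pci_add_vma(struct vfio_pci_device *vdev,
				    struct vm_area_struct *vma)
	{
		struct vfio_pci_mmap_vma *mmap_vma;

		mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
		if (!mmap_vma)
			return -ENOMEM;

		mmap_vma->vma = vma;

		mutex_lock(&vdev->vma_lock);
		list_add(&mmap_vma->vma_next, &vdev->vma_list);
		mutex_unlock(&vdev->vma_lock);

		return 0;
	}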


